MoE | Yue Shui Blog

gpt-oss & GPT-5

In August 2025, the AI field witnessed a period of intensive releases from OpenAI. Following GPT-2 (OpenAI, 2019) in 2019, OpenAI has once again contributed to the open-source community with its first open-weight large language model series, gpt-oss (OpenAI, 2025), available in 120B and 20B sizes. Shortly after, the highly anticipated next-generation flagship model, GPT-5 (OpenAI, 2025), was also officially launched. This series of releases not only marks a new high for open-source models in reasoning and agent capabilities but also reveals OpenAI’s latest advancements in model architecture, training methodologies, and safety alignment. ...

DeepSeek-V2 vs V3

DeepSeek AI successively released DeepSeek-V2 (DeepSeek-AI, 2024) and DeepSeek-V3 (DeepSeek-AI, 2024), two powerful Mixture-of-Experts (MoE) language models that significantly optimize training costs and inference efficiency while maintaining state-of-the-art performance. DeepSeek-V2 has a total of 236B parameters, activating 21B per token, while DeepSeek-V3 further expands to 671B total parameters, activating 37B per token. Both support a 128K context length. The core innovations of these two models lie in the adoption of Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture (Dai et al., 2024). MLA drastically reduces GPU memory usage during inference by compressing the Key-Value (KV) cache into low-dimensional latent vectors, improving efficiency. DeepSeekMoE achieves stronger expert specialization capabilities and more economical training costs through fine-grained expert segmentation and shared expert isolation. Building upon V2, DeepSeek-V3 further introduces an Auxiliary-Loss-Free Load Balancing strategy (Wang et al., 2024) and the Multi-Token Prediction (MTP) (Gloeckle et al., 2024) training objective, further enhancing model performance and training efficiency. ...

Parallelism and Memory Optimization Techniques for Training Large Models

Background Recently, the number of parameters in large models has been continuously increasing, from the initial billions to today’s hundreds of billions or even trillions. While large models have brought unprecedented application effects, they have also triggered a series of severe challenges in computing resources, memory management, and training stability. Therefore, this blog summarizes some commonly used distributed parallel training and memory management techniques, hoping to help everyone better train and optimize large models. ...