DeepSeek-V2 vs V3

DeepSeek AI successively released DeepSeek-V2 (DeepSeek-AI, 2024) and DeepSeek-V3 (DeepSeek-AI, 2024), two powerful Mixture-of-Experts (MoE) language models that significantly reduce training costs and improve inference efficiency while maintaining state-of-the-art performance. DeepSeek-V2 has 236B total parameters and activates 21B per token, while DeepSeek-V3 expands to 671B total parameters and activates 37B per token. Both support a 128K context length. The core innovations of these two models are Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture (Dai et al., 2024). MLA drastically reduces GPU memory usage during inference by compressing the Key-Value (KV) cache into low-dimensional latent vectors. DeepSeekMoE achieves stronger expert specialization at a lower training cost through fine-grained expert segmentation and shared expert isolation. Building upon V2, DeepSeek-V3 introduces an Auxiliary-Loss-Free Load Balancing strategy (Wang et al., 2024) and the Multi-Token Prediction (MTP) (Gloeckle et al., 2024) training objective, further enhancing model performance and training efficiency. ...

2025-04-18 · 63 min · 13242 words · Yue Shui

The LLaMA Herd

LLaMA

The LLaMA series of open-source models released by Meta AI has become one of the cornerstones of the large language model community, profoundly influencing open research and applications. From the pioneering LLaMA released in early 2023, to the significantly improved LLaMA 2 later that year, to derivative models targeting specific domains (such as code and safety), and on to the new generations LLaMA 3 and LLaMA 4 launched in 2024 and 2025 respectively, Meta has remained committed to improving the performance of open-source models, gradually bringing them closer to state-of-the-art closed-source models. Below, we introduce the key technical details of each major model in sequence. ...

2025-04-06 · 33 min · 6862 words · Yue Shui

Large Language Model Agents

Agents

Since OpenAI released ChatGPT in November 2022, and with the subsequent emergence of projects such as AutoGPT and AgentGPT, LLM-based agents have become a research hotspot and a promising direction for practical AI applications. This article introduces the basic concepts of agents, their core technologies, and the latest advances in their applications.

Large Language Model Agents

Large Language Model Agents (LLM agents) use an LLM as the system’s brain, combined with modules such as planning, memory, and external tools, to automate the execution of complex tasks. ...

2025-03-27 · 32 min · 6788 words · Yue Shui

Parallelism and Memory Optimization Techniques for Training Large Models

Background

In recent years, the number of parameters in large models has grown continuously, from the initial billions to today’s hundreds of billions or even trillions. While large models have delivered unprecedented capabilities, they have also created severe challenges in computing resources, memory management, and training stability. This post therefore summarizes some commonly used distributed parallel training and memory management techniques, in the hope of helping readers train and optimize large models more effectively. ...

2025-03-01 · 61 min · 12817 words · Yue Shui

LLMs Alignment: DPO

This blog post introduces a streamlined alternative to RLHF called DPO. Like RLHF, DPO is designed to align model outputs with human preferences, but it stands apart for its simplicity and lower resource demands. When project resources are limited, DPO is an attractive and practical solution worth exploring.

Notations

Symbol: Meaning
\( x \): User input (prompt), the question the model needs to answer
\( y \): Model-generated response (response / completion), the text output by the model
\( \pi_\theta(y \mid x) \): Actor model, the trainable policy used to generate response \(y\), parameterized by \(\theta\)
\( \pi_{\mathrm{ref}}(y \mid x) \): Reference model, the frozen SFT (Supervised Fine-Tuning) model that serves as the alignment baseline
\( r_\phi(x,y) \): Reward model, a reward function (with parameter \(\phi\)) used to evaluate the quality of response \(y\)
\( V_\psi(x) \): Critic model, a value function (with parameter \(\psi\)) used to estimate the future cumulative reward given \(x\)
\( \pi^*(y \mid x) \): Optimal policy distribution, determined by the reference model and the reward function
\( r_\theta(x,y) \): Reward derived from the Actor model, constructed from \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\)
\(\beta\): Hyperparameter that controls the weight of the KL penalty or the log-ratio difference term
\(\mathbb{D}_{\mathrm{KL}}[P \| Q]\): KL divergence, a measure of the difference between probability distributions \(P\) and \(Q\)
\(\sigma(z)\): Sigmoid function, defined as \(\sigma(z)=\frac{1}{1+e^{-z}}\)
\(\log\): Logarithm function
\(\mathbb{E}\): Expectation operator, used to compute the average value of a random variable
\( (y_w, y_l) \): A pair of preference data, where \( y_w \) is the preferred (better quality) response and \( y_l \) is the less preferred one
\( P\left(y_w \succ y_l \mid x\right) \): The probability that response \( y_w \) is preferred over \( y_l \) given input \(x\)
\( Z(x) \): Partition function, which normalizes the probability distribution over all responses \(y\)
\( \mathcal{L}_{\mathrm{DPO}} \): The loss function of DPO

From RLHF to DPO

RLHF

OpenAI primarily used Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017) to train InstructGPT (Ouyang et al., 2022), and the same approach forms the basis for aligning LLMs such as ChatGPT and Llama. The entire training process generally comprises the following three main steps: ...

2025-02-08 · 13 min · 2577 words · Yue Shui