Parallelism and Memory Optimization Techniques for Training Large Models

Background

In recent years, the parameter counts of large models have grown steadily, from billions at first to hundreds of billions or even trillions today. While large models have delivered unprecedented capabilities, they have also created severe challenges in computing resources, memory management, and training stability. This blog therefore summarizes commonly used distributed parallel training and memory management techniques, with the aim of helping readers train and optimize large models more effectively. ...

2025-03-01 · 61 min · 12817 words · Yue Shui

LLMs Alignment: DPO

This blog post introduces a streamlined alternative to RLHF called DPO. Like RLHF, DPO is designed to align model outputs with human preferences, but it stands apart with its simplicity and lower resource demands. In scenarios where project resources are limited, DPO emerges as a highly attractive and practical solution worth exploring.

Notations

\( x \): User input (prompt), the question the model needs to answer
\( y \): Model-generated response (response / completion), the text output by the model
\( \pi_\theta(y \mid x) \): Actor model, the trainable policy used to generate response \(y\), parameterized by \(\theta\)
\( \pi_{\mathrm{ref}}(y \mid x) \): Reference model, the frozen SFT (Supervised Fine-Tuning) model serving as the alignment baseline
\( r_\phi(x,y) \): Reward model, a reward function (with parameter \(\phi\)) used to evaluate the quality of response \(y\)
\( V_\psi(x) \): Critic model, a value function (with parameter \(\psi\)) used to estimate the future cumulative reward given \(x\)
\( \pi^*(y \mid x) \): Optimal policy distribution, determined via the reference model and reward function
\( r_\theta(x,y) \): Reward derived from the Actor model, constructed from \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\)
\(\beta\): Hyperparameter that controls the weight of the KL penalty or the log-ratio difference term
\(\mathbb{D}_{\mathrm{KL}}[P \| Q]\): KL divergence, a measure of the difference between probability distributions \(P\) and \(Q\)
\(\sigma(z)\): Sigmoid function, defined as \(\sigma(z)=\frac{1}{1+e^{-z}}\)
\(\log\): Logarithm function
\(\mathbb{E}\): Expectation operator, used to compute the average value of a random variable
\( (y_w, y_l) \): A pair of preference data where \( y_w \) is the preferred (better quality) response and \( y_l \) is the lesser one
\( P\left(y_w \succ y_l \mid x\right) \): The probability that response \( y_w \) is preferred over \( y_l \) given input \(x\)
\( Z(x) \): Partition function, which normalizes the probability distribution over all responses \(y\)
\( \mathcal{L}_{\mathrm{DPO}} \): The loss function of DPO

From RLHF to DPO

RLHF

OpenAI primarily leverages Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017) to train InstructGPT (Ouyang et al., 2022), which forms the basis for LLMs (such as ChatGPT, Llama, etc.). The entire training process generally comprises the following three main steps: ...

2025-02-08 · 13 min · 2577 words · Yue Shui

Normalization in Deep Learning

Introduction

In deep learning, the design of the network architecture significantly affects model performance and training efficiency. As models grow deeper, training runs into problems such as vanishing and exploding gradients. To address these problems, residual connections and various normalization methods have been introduced and are now widely used in modern deep learning models. This article first introduces residual connections and two architectures, pre-norm and post-norm. It then describes four common normalization methods (Batch Normalization, Layer Normalization, Weight Normalization, and RMS Normalization) and analyzes why current mainstream large models tend to adopt an architecture combining RMSNorm and Pre-Norm. ...

2025-02-01 · 13 min · 2576 words · Yue Shui

Deep Reinforcement Learning (Ongoing Updates)

Note: This article is still being updated. The content is a draft and may change. Please check back for the latest version.

Notations

Symbol: Meaning
\(s, s', S_t, S_{t+1}\): State, next state, state at time \(t\), state at time \(t+1\)
\(o, o_t\): Observation, observation at time \(t\)
\(a, a', A_t, A_{t+1}\): Action, next action, action at time \(t\), action at time \(t+1\)
\(r, r_t\): Immediate reward, reward at time \(t\)
\(G_t\): Return at time \(t\)
\(R(\tau)\): Return of a trajectory \(\tau\)
\(\mathcal{S}\): Set of all possible states
\(\mathcal{A}\): Set of all possible actions
\(\mathcal{R}\): Set of all possible rewards
\(\pi(a\mid s), \pi_\theta(a\mid s)\): Policy (stochastic), parameterized policy
\(\mu(s), \mu_\theta(s)\): Policy (deterministic), parameterized policy
\(\theta, \phi, w\): Policy or value function parameters
\(\gamma\): Discount factor
\(J(\pi)\): Expected return of policy \(\pi\)
\(V_\pi(s)\): State-value function for policy \(\pi\)
\(Q_\pi(s,a)\): Action-value function for policy \(\pi\)
\(V_*(s)\): Optimal state-value function
\(Q_*(s,a)\): Optimal action-value function
\(A_\pi(s,a)\): Advantage function for policy \(\pi\)
\(P(s'\mid s,a)\): Transition probability function
\(R(s,a,s')\): Reward function
\(\rho_0(s)\): Start-state distribution
\(\tau\): Trajectory
\(D\): Replay memory
\(\alpha\): Learning rate, temperature parameter (in SAC)
\(\lambda\): Eligibility trace parameter
\(\epsilon\): Exploration parameter (e.g., in \(\epsilon\)-greedy), clipping parameter (in PPO)

What is Reinforcement Learning?

Definition ...

2025-01-31 · 34 min · 7230 words · Yue Shui

OpenAI o1 Replication Progress: DeepSeek-R1

DeepSeek AI recently released DeepSeek-R1 (DeepSeek-AI, 2025), whose reasoning performance on multiple benchmarks approaches that of OpenAI’s o1 (OpenAI, 2024), marking a significant step for the open-source community toward replicating o1. Code related to R1 can be found in open-r1, Hugging Face’s open-source replication project. While previous research has often relied on massive amounts of supervised data to improve the performance of Large Language Models (LLMs), the success of DeepSeek-R1 and its earlier experiment, DeepSeek-R1-Zero, demonstrates the potential of pure large-scale reinforcement learning for improving the reasoning capabilities of LLMs. This success reinforces the insight proposed by Richard Sutton in “The Bitter Lesson”: ...

2025-01-27 · 48 min · 10166 words · Yue Shui