Deep Learning

vLLM: High-Throughput, Memory-Efficient LLM Serving

As the parameters of Large Language Models (LLMs) continue to grow, deploying and serving these models presents significant challenges. vLLM is an open-source library designed for fast, convenient, and cost-effective LLM inference and online serving. Its core lies in the PagedAttention algorithm, which efficiently manages the KV Cache in the attention mechanism. Evaluation Metrics To evaluate the performance of LLM inference and serving engines, we primarily focus on the following metrics: ...

DeepSeek-V2 vs V3

DeepSeek AI successively released DeepSeek-V2 (DeepSeek-AI, 2024) and DeepSeek-V3 (DeepSeek-AI, 2024), two powerful Mixture-of-Experts (MoE) language models that significantly optimize training costs and inference efficiency while maintaining state-of-the-art performance. DeepSeek-V2 has a total of 236B parameters, activating 21B per token, while DeepSeek-V3 further expands to 671B total parameters, activating 37B per token. Both support a 128K context length. The core innovations of these two models lie in the adoption of Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture (Dai et al., 2024). MLA drastically reduces GPU memory usage during inference by compressing the Key-Value (KV) cache into low-dimensional latent vectors, improving efficiency. DeepSeekMoE achieves stronger expert specialization capabilities and more economical training costs through fine-grained expert segmentation and shared expert isolation. Building upon V2, DeepSeek-V3 further introduces an Auxiliary-Loss-Free Load Balancing strategy (Wang et al., 2024) and the Multi-Token Prediction (MTP) (Gloeckle et al., 2024) training objective, further enhancing model performance and training efficiency. ...

Parallelism and Memory Optimization Techniques for Training Large Models

Background Recently, the number of parameters in large models has been continuously increasing, from the initial billions to today’s hundreds of billions or even trillions. While large models have brought unprecedented application effects, they have also triggered a series of severe challenges in computing resources, memory management, and training stability. Therefore, this blog summarizes some commonly used distributed parallel training and memory management techniques, hoping to help everyone better train and optimize large models. ...

Normalization in Deep Learning

Introduction In deep learning, the design of network architectures significantly impacts model performance and training efficiency. As model depth increases, training deep neural networks faces numerous challenges, such as the vanishing and exploding gradient problems. To address these challenges, residual connections and various normalization methods have been introduced and are widely used in modern deep learning models. This article will first introduce residual connections and two architectures: pre-norm and post-norm. Then, it will describe four common normalization methods: Batch Normalization, Layer Normalization, Weight Normalization, and RMS Normalization, and analyze why current mainstream large models tend to adopt an architecture combining RMSNorm and Pre-Norm. ...

Deep Reinforcement Learning (Ongoing Updates)

Note: This article is currently being updated. The content is in draft version and may change. Please check back for the latest version. Notations Symbol Meaning \(s, s', S_t, S_{t+1}\) State, next state, state at time \(t\), state at time \(t+1\) \(o, o_t\) Observation, observation at time \(t\) \(a, a', A_t, A_{t+1}\) Action, next action, action at time \(t\), action at time \(t+1\) \(r, r_t\) Immediate reward, reward at time \(t\) \(G_t\) Return at time \(t\) \(R(\tau)\) Return of a trajectory \(\tau\) \(\mathcal{S}\) Set of all possible states \(\mathcal{A}\) Set of all possible actions \(\mathcal{R}\) Set of all possible rewards \(\pi(a\mid s), \pi_\theta(a\mid s)\) Policy (stochastic), parameterized policy \(\mu(s), \mu_\theta(s)\) Policy (deterministic), parameterized policy \(\theta, \phi, w\) Policy or value function parameters \(\gamma\) Discount factor \(J(\pi)\) Expected return of policy \(\pi\) \(V_\pi(s)\) State-value function for policy \(\pi\) \(Q_\pi(s,a)\) Action-value function for policy \(\pi\) \(V_*(s)\) Optimal state-value function \(Q_*(s,a)\) Optimal action-value function \(A_\pi(s,a)\) Advantage function for policy \(\pi\) \(P(s'\mid s,a)\) Transition probability function \(R(s,a,s')\) Reward function \(\rho_0(s)\) Start-state distribution \(\tau\) Trajectory \(D\) Replay memory \(\alpha\) Learning rate, temperature parameter (in SAC) \(\lambda\) Eligibility trace parameter \(\epsilon\) Exploration parameter (e.g., in \(\epsilon\)-greedy), clipping parameter (in PPO) What is Reinforcement Learning? Definition ...