vLLM: High-Throughput, Memory-Efficient LLM Serving

As the parameter counts of Large Language Models (LLMs) continue to grow, deploying and serving these models presents significant challenges. vLLM is an open-source library designed for fast, convenient, and cost-effective LLM inference and online serving. At its core is the PagedAttention algorithm, which efficiently manages the Key-Value (KV) cache of the attention mechanism. To evaluate the performance of LLM inference and serving engines, we primarily focus on the following metrics: ...
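
To make the paging idea concrete, here is a minimal, self-contained Python sketch of the block-table bookkeeping behind a paged KV cache. The `PagedKVCacheManager` class, the 16-token `BLOCK_SIZE`, and the method names are inventions for this sketch rather than vLLM's actual API; they only illustrate how fixed-size blocks can be handed out on demand instead of reserving a contiguous memory region per sequence.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative; the real block size is configurable)


@dataclass
class BlockTable:
    """Toy bookkeeping for one sequence: which physical blocks hold its KV entries."""
    block_ids: list = field(default_factory=list)
    num_tokens: int = 0


class PagedKVCacheManager:
    """Hypothetical allocator that hands out fixed-size blocks on demand."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.tables: dict[int, BlockTable] = {}

    def append_token(self, seq_id: int) -> None:
        table = self.tables.setdefault(seq_id, BlockTable())
        if table.num_tokens % BLOCK_SIZE == 0:    # current block is full -> grab a new one
            table.block_ids.append(self.free_blocks.pop())
        table.num_tokens += 1

    def free(self, seq_id: int) -> None:
        table = self.tables.pop(seq_id)
        self.free_blocks.extend(table.block_ids)  # blocks return to the pool immediately


mgr = PagedKVCacheManager(num_blocks=1024)
for _ in range(40):                  # a 40-token sequence needs ceil(40 / 16) = 3 blocks
    mgr.append_token(seq_id=0)
print(len(mgr.tables[0].block_ids))  # -> 3
mgr.free(seq_id=0)
```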

2025-05-17 · 20 min · 4222 words · Yue Shui

DeepSeek-V2 vs V3

DeepSeek AI has successively released DeepSeek-V2 (DeepSeek-AI, 2024) and DeepSeek-V3 (DeepSeek-AI, 2024), two powerful Mixture-of-Experts (MoE) language models that significantly reduce training costs and improve inference efficiency while maintaining state-of-the-art performance. DeepSeek-V2 has 236B total parameters with 21B activated per token, while DeepSeek-V3 expands to 671B total parameters with 37B activated per token; both support a 128K context length. The core innovations of the two models are Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture (Dai et al., 2024). MLA compresses the Key-Value (KV) cache into low-dimensional latent vectors, drastically reducing GPU memory usage and improving efficiency during inference. DeepSeekMoE achieves stronger expert specialization and more economical training through fine-grained expert segmentation and shared expert isolation. Building on V2, DeepSeek-V3 additionally introduces an Auxiliary-Loss-Free Load Balancing strategy (Wang et al., 2024) and a Multi-Token Prediction (MTP) training objective (Gloeckle et al., 2024), further improving model performance and training efficiency. ...
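
As a rough illustration of the MLA idea described above, the PyTorch sketch below caches a low-dimensional latent per token and up-projects it to keys and values at attention time. The module name and every dimension (`d_latent=128`, etc.) are made up for this example and do not match DeepSeek-V2/V3's actual configuration; the real MLA also compresses queries and uses a decoupled rotary-embedding path, both omitted here.

```python
import torch
import torch.nn as nn


class LatentKVCache(nn.Module):
    """Sketch: cache a small latent c_t per token instead of the full keys and values."""

    def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # restore keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # restore values
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):                       # h: (batch, seq, d_model)
        c = self.down(h)                        # (batch, seq, d_latent) -- this is what gets cached
        b, s, _ = c.shape
        k = self.up_k(c).view(b, s, self.n_heads, self.d_head)
        v = self.up_v(c).view(b, s, self.n_heads, self.d_head)
        return c, k, v


mla = LatentKVCache()
c, k, v = mla(torch.randn(2, 16, 1024))
# Per token the cache stores d_latent = 128 values instead of 2 * n_heads * d_head = 1024.
print(c.shape, k.shape, v.shape)
```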

2025-04-18 · 63 min · 13242 words · Yue Shui

Attention Mechanisms in Transformers: Comparing MHA, MQA, and GQA

The Transformer (Vaswani et al., 2017) is a model based on the encoder-decoder architecture. It has demonstrated outstanding performance in natural language processing (NLP) and has given rise to a series of optimized models, such as the encoder-only BERT (Devlin et al., 2018), the decoder-only GPT series (Radford et al., 2018), and subsequent large language models (LLMs) like LLaMA (Touvron et al., 2023) and GPT-4 (OpenAI et al., 2024), most of which adopt a decoder-only architecture. ...
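
As a quick sketch of what the post compares, the PyTorch snippet below parameterizes the number of KV heads: setting it equal to the number of query heads gives MHA, setting it to one gives MQA, and values in between give GQA. The class and its dimensions are illustrative and not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """n_kv_heads == n_heads -> MHA, n_kv_heads == 1 -> MQA, in between -> GQA."""

    def __init__(self, d_model=512, n_heads=8, n_kv_heads=8):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.q = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.k = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q = self.q(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k(x).view(b, s, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v(x).view(b, s, self.n_kv_heads, self.d_head).transpose(1, 2)
        groups = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(groups, dim=1)              # each KV head serves a group of query heads
        v = v.repeat_interleave(groups, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o(out.transpose(1, 2).reshape(b, s, -1))


x = torch.randn(1, 10, 512)
for n_kv in (8, 2, 1):                                      # MHA, GQA, MQA
    print(n_kv, Attention(n_kv_heads=n_kv)(x).shape)
```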

2025-01-16 · 29 min · 6139 words · Yue Shui