vLLM: High-Throughput, Memory-Efficient LLM Serving

As the parameter counts of Large Language Models (LLMs) continue to grow, deploying and serving these models presents significant challenges. vLLM is an open-source library designed for fast, convenient, and cost-effective LLM inference and online serving. At its core is the PagedAttention algorithm, which efficiently manages the Key-Value (KV) cache of the attention mechanism. To evaluate the performance of LLM inference and serving engines, we primarily focus on the following metrics: ...
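
To make the paging idea concrete, here is a minimal, self-contained Python sketch of the block-table bookkeeping behind a paged KV cache. The `PagedKVCacheManager` class, the 16-token `BLOCK_SIZE`, and the method names are inventions for this sketch rather than vLLM's actual API; they only illustrate how fixed-size blocks can be handed out on demand instead of reserving a contiguous memory region per sequence.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative; the real block size is configurable)


@dataclass
class BlockTable:
    """Toy bookkeeping for one sequence: which physical blocks hold its KV entries."""
    block_ids: list = field(default_factory=list)
    num_tokens: int = 0


class PagedKVCacheManager:
    """Hypothetical allocator that hands out fixed-size blocks on demand."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.tables: dict[int, BlockTable] = {}

    def append_token(self, seq_id: int) -> None:
        table = self.tables.setdefault(seq_id, BlockTable())
        if table.num_tokens % BLOCK_SIZE == 0:    # current block is full -> grab a new one
            table.block_ids.append(self.free_blocks.pop())
        table.num_tokens += 1

    def free(self, seq_id: int) -> None:
        table = self.tables.pop(seq_id)
        self.free_blocks.extend(table.block_ids)  # blocks return to the pool immediately


mgr = PagedKVCacheManager(num_blocks=1024)
for _ in range(40):                  # a 40-token sequence needs ceil(40 / 16) = 3 blocks
    mgr.append_token(seq_id=0)
print(len(mgr.tables[0].block_ids))  # -> 3
mgr.free(seq_id=0)
```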

2025-05-17 · 20 min · 4222 words · Yue Shui

DeepSeek-V2 vs V3

DeepSeek AI has successively released DeepSeek-V2 (DeepSeek-AI, 2024) and DeepSeek-V3 (DeepSeek-AI, 2024), two powerful Mixture-of-Experts (MoE) language models that significantly reduce training costs and improve inference efficiency while maintaining state-of-the-art performance. DeepSeek-V2 has 236B total parameters with 21B activated per token, while DeepSeek-V3 expands to 671B total parameters with 37B activated per token; both support a 128K context length. The core innovations of the two models are Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture (Dai et al., 2024). MLA compresses the Key-Value (KV) cache into low-dimensional latent vectors, drastically reducing GPU memory usage and improving efficiency during inference. DeepSeekMoE achieves stronger expert specialization and more economical training through fine-grained expert segmentation and shared expert isolation. Building on V2, DeepSeek-V3 additionally introduces an Auxiliary-Loss-Free Load Balancing strategy (Wang et al., 2024) and a Multi-Token Prediction (MTP) training objective (Gloeckle et al., 2024), further improving model performance and training efficiency. ...
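
As a rough illustration of the MLA idea described above, the PyTorch sketch below caches a low-dimensional latent per token and up-projects it to keys and values at attention time. The module name and every dimension (`d_latent=128`, etc.) are made up for this example and do not match DeepSeek-V2/V3's actual configuration; the real MLA also compresses queries and uses a decoupled rotary-embedding path, both omitted here.

```python
import torch
import torch.nn as nn


class LatentKVCache(nn.Module):
    """Sketch: cache a small latent c_t per token instead of the full keys and values."""

    def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)           # compress hidden state
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # restore keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # restore values
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):                       # h: (batch, seq, d_model)
        c = self.down(h)                        # (batch, seq, d_latent) -- this is what gets cached
        b, s, _ = c.shape
        k = self.up_k(c).view(b, s, self.n_heads, self.d_head)
        v = self.up_v(c).view(b, s, self.n_heads, self.d_head)
        return c, k, v


mla = LatentKVCache()
c, k, v = mla(torch.randn(2, 16, 1024))
# Per token the cache stores d_latent = 128 values instead of 2 * n_heads * d_head = 1024.
print(c.shape, k.shape, v.shape)
```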

2025-04-18 · 63 min · 13242 words · Yue Shui

Attention Mechanisms in Transformers: Comparing MHA, MQA, and GQA

The Transformer (Vaswani et al., 2017) is a model based on the encoder-decoder architecture. It has demonstrated outstanding performance in natural language processing (NLP) and has given rise to a series of optimized models, such as the encoder-only BERT (Devlin et al., 2018), the decoder-only GPT series (Radford et al., 2018), and subsequent large language models (LLMs) like LLaMA (Touvron et al., 2023) and GPT-4 (OpenAI et al., 2024), most of which adopt a decoder-only architecture. ...
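
As a quick sketch of what the post compares, the PyTorch snippet below parameterizes the number of KV heads: setting it equal to the number of query heads gives MHA, setting it to one gives MQA, and values in between give GQA. The class and its dimensions are illustrative and not taken from any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """n_kv_heads == n_heads -> MHA, n_kv_heads == 1 -> MQA, in between -> GQA."""

    def __init__(self, d_model=512, n_heads=8, n_kv_heads=8):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.q = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.k = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.v = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.o = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        b, s, _ = x.shape
        q = self.q(x).view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k(x).view(b, s, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.v(x).view(b, s, self.n_kv_heads, self.d_head).transpose(1, 2)
        groups = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(groups, dim=1)              # each KV head serves a group of query heads
        v = v.repeat_interleave(groups, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o(out.transpose(1, 2).reshape(b, s, -1))


x = torch.randn(1, 10, 512)
for n_kv in (8, 2, 1):                                      # MHA, GQA, MQA
    print(n_kv, Attention(n_kv_heads=n_kv)(x).shape)
```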

2025-01-16 · 29 min · 6139 words · Yue Shui