LLM | Yue Shui Blog

Large Language Model Inference

In recent years, Large Language Models (LLMs) have achieved revolutionary breakthroughs in fields such as natural language processing, code generation, and even multimodal interaction. However, the powerful capabilities of these models come at the cost of enormous computational and memory overhead, especially during the inference stage. Efficiently deploying and running these models, which have billions or even trillions of parameters, has become a core challenge in scaling LLM technology for real-world applications. ...

DeepSeek-V2 vs V3

DeepSeek AI released DeepSeek-V2 (DeepSeek-AI, 2024) and then DeepSeek-V3 (DeepSeek-AI, 2024), two Mixture-of-Experts (MoE) language models that substantially reduce training cost and improve inference efficiency while maintaining state-of-the-art performance. DeepSeek-V2 has 236B total parameters and activates 21B per token; DeepSeek-V3 scales this to 671B total parameters with 37B activated per token. Both support a 128K context length. The core innovations of the two models are Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture (Dai et al., 2024). MLA compresses the Key-Value (KV) cache into low-dimensional latent vectors, which sharply reduces GPU memory usage during inference. DeepSeekMoE achieves stronger expert specialization and more economical training through fine-grained expert segmentation and shared expert isolation. Building on V2, DeepSeek-V3 adds an Auxiliary-Loss-Free Load Balancing strategy (Wang et al., 2024) and the Multi-Token Prediction (MTP) (Gloeckle et al., 2024) training objective, further improving model performance and training efficiency. ...
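To make the numbers in this summary concrete, the following minimal sketch computes the fraction of parameters activated per token from the figures quoted above, and compares how many values per token a standard multi-head attention layer would cache against a layer that stores only a single latent vector. The dimensions `N_HEADS`, `HEAD_DIM`, and `LATENT_DIM` are hypothetical values chosen for illustration, not the models' actual configuration, and the latent-cache count is a simplification that ignores any extra per-token state a real MLA implementation may keep.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Fraction of an MoE model's parameters that are activated per token."""
    return active_params_b / total_params_b


def mha_cache_per_token(n_heads: int, head_dim: int) -> int:
    """Values cached per token per layer by standard multi-head attention.

    Both the key and the value are stored for every head, hence the factor 2.
    """
    return 2 * n_heads * head_dim


def latent_cache_per_token(latent_dim: int) -> int:
    """Values cached per token per layer when only one latent vector is kept."""
    return latent_dim


# Parameter counts quoted in the text, in billions.
v2_fraction = active_fraction(236, 21)
v3_fraction = active_fraction(671, 37)

# Hypothetical dimensions, for illustration only.
N_HEADS, HEAD_DIM, LATENT_DIM = 32, 128, 512
cache_ratio = latent_cache_per_token(LATENT_DIM) / mha_cache_per_token(N_HEADS, HEAD_DIM)

print(f"DeepSeek-V2 activated fraction: {v2_fraction:.1%}")
print(f"DeepSeek-V3 activated fraction: {v3_fraction:.1%}")
print(f"Latent cache size relative to full KV cache: {cache_ratio:.2%}")
```

Under these assumed dimensions, the latent cache is a small fraction of the full KV cache, which is the intuition behind the memory savings described above.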

The LLaMA Herd

The LLaMA series of open-source models released by Meta AI has become one of the cornerstones of the large language model community, profoundly impacting the advancement of open research and applications. From the pioneering LLaMA released in early 2023, to the significantly improved LLaMA 2 later that year, to derivative models targeting specific domains (such as code and safety), and the subsequent generations LLaMA 3 and LLaMA 4 launched in 2024 and 2025 respectively, Meta has continuously worked to improve the performance of open-source models, gradually bringing them closer to state-of-the-art closed-source models. Below, we introduce the key technical details of each major model in sequence. ...

Large Language Model Agents

Since OpenAI released ChatGPT in November 2022, and with the subsequent emergence of projects such as AutoGPT and AgentGPT, LLM-based agents have become a research hotspot and a promising direction for practical AI applications. This article introduces the basic concepts of agents, their core technologies, and the latest advances in their applications. Large Language Model Agents (LLM agents) use an LLM as the system's brain, combined with modules such as planning, memory, and external tools, to automate the execution of complex tasks. ...

Normalization in Deep Learning

In deep learning, the design of network architectures significantly impacts model performance and training efficiency. As model depth increases, training deep neural networks faces numerous challenges, such as the vanishing and exploding gradient problems. To address these challenges, residual connections and various normalization methods have been introduced and are widely used in modern deep learning models. This article first introduces residual connections and two architectures: pre-norm and post-norm. It then describes four common normalization methods: Batch Normalization, Layer Normalization, Weight Normalization, and RMS Normalization, and analyzes why current mainstream large models tend to adopt an architecture combining RMSNorm and Pre-Norm. ...
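As a rough illustration of the two ideas named in this summary, the sketch below implements RMS Normalization on a plain list of floats and shows where the normalization sits relative to the residual connection in a pre-norm versus a post-norm block. The function names, the `eps` default, and the use of RMSNorm in the post-norm variant are assumptions made for this example; it is a minimal sketch, not the article's implementation.

```python
import math
from typing import Callable, List

Vector = List[float]


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """Scale x by the inverse of its root mean square, then apply a per-dimension gain.

    Unlike Layer Normalization, no mean is subtracted and no bias is added.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]


def pre_norm_block(x: Vector, sublayer: Callable[[Vector], Vector], gain: Vector) -> Vector:
    """Pre-norm: normalize the sublayer input; the residual path stays untouched."""
    y = sublayer(rms_norm(x, gain))
    return [a + b for a, b in zip(x, y)]


def post_norm_block(x: Vector, sublayer: Callable[[Vector], Vector], gain: Vector) -> Vector:
    """Post-norm: add the residual first, then normalize the sum."""
    y = sublayer(x)
    return rms_norm([a + b for a, b in zip(x, y)], gain)


# Toy usage with a hypothetical sublayer that halves its input.
x = [3.0, 4.0]
gain = [1.0, 1.0]
halve = lambda h: [0.5 * v for v in h]

print(rms_norm(x, gain))
print(pre_norm_block(x, halve, gain))
print(post_norm_block(x, halve, gain))
```

In the pre-norm block the input `x` reaches the output through an unnormalized identity path, which is one commonly cited reason deep stacks of such blocks are easier to train.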