Agentic RL

As Large Language Models (LLMs) achieve breakthroughs in natural language processing, their applications continue to expand. However, they also exhibit limitations such as knowledge cutoffs, hallucinations, and deficiencies in complex computation and logical reasoning. To address these challenges, Agentic RL, which combines agents with Reinforcement Learning (RL), is emerging as a key research direction. Agentic RL equips LLMs with capabilities such as autonomous planning, decision-making, tool use, and environmental interaction by placing them in a closed loop with the external world (e.g., search engines, code interpreters, databases, browsers) and continuously optimizing them through reward signals. In practice, such an agent not only understands requirements and plans autonomously, but also continually corrects and refines its behavior within an execution-feedback loop. ...
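As a rough illustration of this closed loop, the following is a minimal sketch only: the policy stub, tool executor, and reward function below are hypothetical stand-ins, not the post's actual implementation.

```python
# Minimal sketch of an agentic execution-feedback loop (illustrative only).
# call_llm, run_tool, and compute_reward are hypothetical stand-ins.

def call_llm(prompt: str) -> str:
    """Hypothetical policy: emit a tool call first, then a final answer."""
    return "final: done" if "[observation" in prompt else "search: relevant documents"

def run_tool(action: str) -> str:
    """Hypothetical tool executor (e.g., search engine, code interpreter)."""
    return f"[observation for action: {action}]"

def compute_reward(trajectory: list[str]) -> float:
    """Hypothetical reward signal that would later drive RL policy updates."""
    return 1.0 if trajectory and trajectory[-1].startswith("final:") else 0.0

def rollout(task: str, max_steps: int = 4) -> float:
    """One episode: the agent plans, acts, observes feedback, and is scored."""
    context, trajectory = task, []
    for _ in range(max_steps):
        action = call_llm(context)
        trajectory.append(action)
        if action.startswith("final:"):
            break
        observation = run_tool(action)            # interact with the external world
        context = context + "\n" + observation    # feed the feedback back into the context
    return compute_reward(trajectory)

if __name__ == "__main__":
    print(rollout("Summarize recent progress in LLM inference?"))
```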

Created: 2025-09-30 · Updated: 2025-09-30 · 24 min · 5072 words · Yue Shui

Large Language Model Inference

In recent years, Large Language Models (LLMs) have achieved revolutionary breakthroughs in fields such as natural language processing, code generation, and even multimodal interaction. However, the powerful capabilities of these models come at the cost of enormous computational and memory overhead, especially during the inference stage. Efficiently deploying and running these models, which have billions or even trillions of parameters, has become a core challenge in scaling LLM technology for real-world applications. ...

Created: 2025-06-29 · Updated: 2025-06-29 · 43 min · 9025 words · Yue Shui

DeepSeek-V2 vs V3

DeepSeek AI successively released DeepSeek-V2 (DeepSeek-AI, 2024) and DeepSeek-V3 (DeepSeek-AI, 2024), two powerful Mixture-of-Experts (MoE) language models that maintain state-of-the-art performance while significantly reducing training costs and improving inference efficiency. DeepSeek-V2 has 236B total parameters, activating 21B per token, while DeepSeek-V3 expands to 671B total parameters, activating 37B per token; both support a 128K context length. The core innovations of the two models are Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture (Dai et al., 2024). MLA drastically reduces GPU memory usage during inference by compressing the Key-Value (KV) cache into low-dimensional latent vectors. DeepSeekMoE achieves stronger expert specialization and lower training costs through fine-grained expert segmentation and shared expert isolation. Building upon V2, DeepSeek-V3 further introduces an Auxiliary-Loss-Free Load Balancing strategy (Wang et al., 2024) and a Multi-Token Prediction (MTP) training objective (Gloeckle et al., 2024), improving model performance and training efficiency. ...
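To convey the KV-compression idea behind MLA, here is a minimal sketch assuming illustrative dimensions and layer names; it omits details such as RoPE decoupling and is not DeepSeek's actual implementation.

```python
import torch
import torch.nn as nn

class SimplifiedLatentKV(nn.Module):
    """Sketch of the MLA idea: cache one low-dimensional latent per token
    instead of full per-head keys/values. Dimensions are assumptions."""

    def __init__(self, d_model=1024, d_latent=128, n_heads=8, d_head=64):
        super().__init__()
        self.down_kv = nn.Linear(d_model, d_latent, bias=False)         # compress to latent
        self.up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to keys
        self.up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # expand to values
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, h):                      # h: [batch, seq, d_model]
        c_kv = self.down_kv(h)                 # [batch, seq, d_latent] -- the only thing cached
        k = self.up_k(c_kv).view(*h.shape[:2], self.n_heads, self.d_head)
        v = self.up_v(c_kv).view(*h.shape[:2], self.n_heads, self.d_head)
        return c_kv, k, v

x = torch.randn(2, 16, 1024)
c_kv, k, v = SimplifiedLatentKV()(x)
print(c_kv.shape, k.shape, v.shape)  # the latent cache is far smaller than k/v combined
```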

Created: 2025-04-18 · Updated: 2025-04-18 · 63 min · 13242 words · Yue Shui

The LLaMA Herd

The LLaMA series of open-source models released by Meta AI has become one of the cornerstones of the large language model community, profoundly shaping open research and applications. From the pioneering LLaMA released in early 2023, to the significantly improved LLaMA 2 later that year, to derivative models targeting specific domains (such as code and safety), and on to the next-generation LLaMA 3 and LLaMA 4 launched in 2024 and 2025 respectively, Meta has continuously worked to enhance the performance of open-source models, gradually bringing them closer to state-of-the-art closed-source models. Below, we introduce the key technical details of each major model in turn. ...

Created: 2025-04-06 · Updated: 2025-04-06 · 33 min · 6862 words · Yue Shui

Normalization in Deep Learning

In deep learning, the design of network architectures significantly impacts model performance and training efficiency. As model depth increases, training deep neural networks faces numerous challenges, such as vanishing and exploding gradients. To address these challenges, residual connections and various normalization methods have been introduced and are widely used in modern deep learning models. This article first introduces residual connections and the Pre-Norm and Post-Norm architectures, then describes four common normalization methods: Batch Normalization, Layer Normalization, Weight Normalization, and RMS Normalization, and analyzes why current mainstream large models tend to adopt an architecture combining RMSNorm and Pre-Norm. ...
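For concreteness, a minimal sketch of the RMSNorm + Pre-Norm combination is shown below, with a plain nn.Linear standing in for the attention or MLP sublayer; it is illustrative rather than any particular model's implementation.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMS normalization: rescale by the root mean square, with no mean subtraction."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class PreNormBlock(nn.Module):
    """Pre-Norm residual block: x + sublayer(norm(x)), i.e. normalize the
    sublayer input and keep the residual path identity."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

block = PreNormBlock(64, nn.Linear(64, 64))  # Linear is a stand-in sublayer
print(block(torch.randn(2, 10, 64)).shape)
```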

Created: 2025-02-01 · Updated: 2025-02-01 · 13 min · 2576 words · Yue Shui