Normalization in Deep Learning

Introduction In deep learning, the design of network architectures significantly impacts model performance and training efficiency. As model depth increases, training deep neural networks faces numerous challenges, such as the vanishing and exploding gradient problems. To address these challenges, residual connections and various normalization methods have been introduced and are widely used in modern deep learning models. This article will first introduce residual connections and two architectures: pre-norm and post-norm. Then, it will describe four common normalization methods: Batch Normalization, Layer Normalization, Weight Normalization, and RMS Normalization, and analyze why current mainstream large models tend to adopt an architecture combining RMSNorm and Pre-Norm. ...
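The RMSNorm mentioned above can be summarized in a few lines: unlike LayerNorm, it skips mean subtraction and rescales by the root mean square alone. A minimal NumPy sketch (illustrative only, not the article's code; `gamma` is the learned per-feature gain):

```python
import numpy as np

def rms_norm(x, gamma=1.0, eps=1e-6):
    """RMSNorm: rescale x by its root mean square over the last axis.

    Unlike LayerNorm, no mean is subtracted and no bias is added,
    which saves computation while retaining the re-scaling invariance.
    """
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([1.0, 2.0, 3.0, 4.0])
y = rms_norm(x)  # the output has unit root mean square
```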

2025-02-01 · 13 min · 2576 words · Yue Shui

Deep Reinforcement Learning (Ongoing Updates)

Note: This article is currently being updated. The content is in draft version and may change. Please check back for the latest version. Notations

| Symbol | Meaning |
|---|---|
| \(s, s', S_t, S_{t+1}\) | State, next state, state at time \(t\), state at time \(t+1\) |
| \(o, o_t\) | Observation, observation at time \(t\) |
| \(a, a', A_t, A_{t+1}\) | Action, next action, action at time \(t\), action at time \(t+1\) |
| \(r, r_t\) | Immediate reward, reward at time \(t\) |
| \(G_t\) | Return at time \(t\) |
| \(R(\tau)\) | Return of a trajectory \(\tau\) |
| \(\mathcal{S}\) | Set of all possible states |
| \(\mathcal{A}\) | Set of all possible actions |
| \(\mathcal{R}\) | Set of all possible rewards |
| \(\pi(a\mid s), \pi_\theta(a\mid s)\) | Policy (stochastic), parameterized policy |
| \(\mu(s), \mu_\theta(s)\) | Policy (deterministic), parameterized policy |
| \(\theta, \phi, w\) | Policy or value function parameters |
| \(\gamma\) | Discount factor |
| \(J(\pi)\) | Expected return of policy \(\pi\) |
| \(V_\pi(s)\) | State-value function for policy \(\pi\) |
| \(Q_\pi(s,a)\) | Action-value function for policy \(\pi\) |
| \(V_*(s)\) | Optimal state-value function |
| \(Q_*(s,a)\) | Optimal action-value function |
| \(A_\pi(s,a)\) | Advantage function for policy \(\pi\) |
| \(P(s'\mid s,a)\) | Transition probability function |
| \(R(s,a,s')\) | Reward function |
| \(\rho_0(s)\) | Start-state distribution |
| \(\tau\) | Trajectory |
| \(D\) | Replay memory |
| \(\alpha\) | Learning rate, temperature parameter (in SAC) |
| \(\lambda\) | Eligibility trace parameter |
| \(\epsilon\) | Exploration parameter (e.g., in \(\epsilon\)-greedy), clipping parameter (in PPO) |

What is Reinforcement Learning? Definition ...

2025-01-31 · 34 min · 7230 words · Yue Shui

OpenAI o1 Replication Progress: DeepSeek-R1

DeepSeek AI recently released DeepSeek-R1 (DeepSeek-AI, 2025), whose reasoning performance on multiple benchmarks approaches the level of OpenAI’s o1 (OpenAI, 2024), marking a significant step for the open-source community in successfully replicating o1. Relevant code for R1 can be found in Hugging Face's open-source replication project, open-r1. While previous research has often relied on massive amounts of supervised data to enhance the performance of Large Language Models (LLMs), the success of DeepSeek-R1 and its earlier experiment, DeepSeek-R1-Zero, powerfully demonstrates the potential of pure large-scale reinforcement learning for improving the reasoning capabilities of LLMs. This success reinforces the profound insight proposed by Richard Sutton in “The Bitter Lesson”: ...

2025-01-27 · 48 min · 10166 words · Yue Shui

Attention Mechanisms in Transformers: Comparing MHA, MQA, and GQA

Background The Transformer (Vaswani et al., 2017) is a model based on the encoder-decoder architecture. It has demonstrated outstanding performance in natural language processing (NLP), inspiring a series of optimized models built on it, such as the encoder-only BERT (Devlin et al., 2018), the decoder-only GPT series (Radford et al., 2018), and subsequent large language models (LLMs) like LLaMA (Touvron et al., 2023) and GPT-4 (OpenAI et al., 2024), most of which adopt a decoder-only architecture. ...
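The difference between the three attention variants compared in this article comes down to how many key/value heads the query heads share. A hedged NumPy sketch (illustrative, not the article's code): with `n_kv_heads == n_q_heads` this is MHA, with `n_kv_heads == 1` it is MQA, and anything in between is GQA.

```python
import numpy as np

def grouped_query_attention(Q, K, V, n_q_heads, n_kv_heads):
    """Attention where groups of query heads share one K/V head.

    Q: (n_q_heads, seq, d);  K, V: (n_kv_heads, seq, d).
    n_q_heads must be divisible by n_kv_heads.
    """
    group = n_q_heads // n_kv_heads
    d = Q.shape[-1]
    outs = []
    for h in range(n_q_heads):
        kv = h // group                      # which shared K/V head this query head uses
        scores = Q[h] @ K[kv].T / np.sqrt(d)  # scaled dot-product scores (seq, seq)
        w = np.exp(scores - scores.max(-1, keepdims=True))
        w /= w.sum(-1, keepdims=True)         # softmax over keys
        outs.append(w @ V[kv])
    return np.stack(outs)                     # (n_q_heads, seq, d)
```

Shrinking `n_kv_heads` shrinks the KV cache proportionally, which is why MQA/GQA matter for LLM inference.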

2025-01-16 · 29 min · 6139 words · Yue Shui

Building Domain-Specific LLMs

Background With the widespread application of Large Language Models (LLMs) across various industries, enterprises and research teams face an urgent need to adapt general-purpose models to specific domains. Foundational LLMs often fail to meet deep domain-specific requirements when handling specialized tasks. For example, when applied to proprietary (closed-source) programming languages, existing open-source models lack sufficient understanding of their syntax and semantics, leading to poor performance in tasks such as code generation and error correction. Therefore, injecting domain knowledge and training dedicated LLMs has become a key step in enhancing development efficiency and code quality. ...

2025-01-05 · 21 min · 4340 words · Yue Shui