Reinforcement Learning

Large Language Model Agents

Agents Since OpenAI released ChatGPT in October 2022, and with the subsequent emergence of projects such as AutoGPT and AgentGPT, LLM-related agents have gradually become a research hotspot and a promising direction for practical applications in AI in recent years. This article will introduce the basic concepts of agents, their core technologies, and the latest advances in their applications. Large Language Model Agents Large Language Model Agents (LLM agents) utilize LLMs as the system’s brain, combined with modules such as planning, memory, and external tools, to achieve automated execution of complex tasks. ...

Deep Reinforcement Learning (Ongoing Updates)

Note: This article is currently being updated. The content is in draft version and may change. Please check back for the latest version. Notations Symbol Meaning \(s, s', S_t, S_{t+1}\) State, next state, state at time \(t\), state at time \(t+1\) \(o, o_t\) Observation, observation at time \(t\) \(a, a', A_t, A_{t+1}\) Action, next action, action at time \(t\), action at time \(t+1\) \(r, r_t\) Immediate reward, reward at time \(t\) \(G_t\) Return at time \(t\) \(R(\tau)\) Return of a trajectory \(\tau\) \(\mathcal{S}\) Set of all possible states \(\mathcal{A}\) Set of all possible actions \(\mathcal{R}\) Set of all possible rewards \(\pi(a\mid s), \pi_\theta(a\mid s)\) Policy (stochastic), parameterized policy \(\mu(s), \mu_\theta(s)\) Policy (deterministic), parameterized policy \(\theta, \phi, w\) Policy or value function parameters \(\gamma\) Discount factor \(J(\pi)\) Expected return of policy \(\pi\) \(V_\pi(s)\) State-value function for policy \(\pi\) \(Q_\pi(s,a)\) Action-value function for policy \(\pi\) \(V_*(s)\) Optimal state-value function \(Q_*(s,a)\) Optimal action-value function \(A_\pi(s,a)\) Advantage function for policy \(\pi\) \(P(s'\mid s,a)\) Transition probability function \(R(s,a,s')\) Reward function \(\rho_0(s)\) Start-state distribution \(\tau\) Trajectory \(D\) Replay memory \(\alpha\) Learning rate, temperature parameter (in SAC) \(\lambda\) Eligibility trace parameter \(\epsilon\) Exploration parameter (e.g., in \(\epsilon\)-greedy), clipping parameter (in PPO) What is Reinforcement Learning? Definition ...

OpenAI o1 Replication Progress: DeepSeek-R1

DeepSeek AI recently released DeepSeek-R1 (DeepSeek-AI, 2025), whose reasoning performance on multiple benchmarks approaches the level of OpenAI’s o1 (OpenAI, 2024), marking a significant step for the open-source community in successfully replicating o1. Relevant code for R1 can be found in the huggingface’s attempt to open-source replication project open-r1. While previous research has often relied on massive amounts of supervised data to enhance the performance of Large Language Models (LLMs), the success of DeepSeek-R1 and its earlier experiment, DeepSeek-R1-Zero, powerfully demonstrates the potential of purely large-scale reinforcement learning in improving the reasoning capabilities of LLMs. This success reinforces the profound insight proposed by Richard Sutton in “The Bitter Lesson”: ...