LLM Alignment: DPO
This blog post introduces a streamlined alternative to RLHF called DPO (Direct Preference Optimization). Like RLHF, DPO is designed to align model outputs with human preferences, but it stands apart through its simplicity and lower resource demands. When project resources are limited, DPO is a highly attractive and practical option worth exploring.

Notations

| Symbol | Meaning |
| --- | --- |
| \( x \) | User input (prompt): the question the model needs to answer |
| \( y \) | Model-generated response (response/completion): the text output by the model |
| \( \pi_\theta(y \mid x) \) | Actor model: the trainable policy used to generate response \(y\), parameterized by \(\theta\) |
| \( \pi_{\mathrm{ref}}(y \mid x) \) | Reference model: the frozen SFT (supervised fine-tuning) model, serving as the alignment baseline |
| \( r_\phi(x,y) \) | Reward model: a reward function (with parameters \(\phi\)) used to evaluate the quality of response \(y\) |
| \( V_\psi(x) \) | Critic model: a value function (with parameters \(\psi\)) used to estimate the future cumulative reward given \(x\) |
| \( \pi^*(y \mid x) \) | Optimal policy distribution, determined by the reference model and the reward function |
| \( r_\theta(x,y) \) | Reward derived from the actor model, constructed from \(\pi_\theta\) and \(\pi_{\mathrm{ref}}\) |
| \( \beta \) | Hyperparameter that controls the weight of the KL penalty or the log-ratio difference term |
| \( \mathbb{D}_{\mathrm{KL}}[P \,\Vert\, Q] \) | KL divergence, a measure of the difference between probability distributions \(P\) and \(Q\) |
| \( \sigma(z) \) | Sigmoid function, defined as \(\sigma(z)=\frac{1}{1+e^{-z}}\) |
| \( \log \) | Logarithm function |
| \( \mathbb{E} \) | Expectation operator, used to compute the average value of a random variable |
| \( (y_w, y_l) \) | A preference pair, where \( y_w \) is the preferred (higher-quality) response and \( y_l \) is the rejected one |
| \( P\left(y_w \succ y_l \mid x\right) \) | The probability that response \( y_w \) is preferred over \( y_l \) given input \(x\) |
| \( Z(x) \) | Partition function, which normalizes the probability distribution over all responses \(y\) |
| \( \mathcal{L}_{\mathrm{DPO}} \) | The DPO loss function (a code sketch at the end of this section shows how the pieces above combine) |

From RLHF to DPO

RLHF

OpenAI primarily relies on Reinforcement Learning from Human Feedback (RLHF) (Christiano et al., 2017) to train InstructGPT (Ouyang et al., 2022), which forms the basis for LLMs such as ChatGPT and Llama. The training process generally comprises the following three main steps: ...
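Before walking through those steps, here is a small code sketch that ties the notation above together. It is a minimal illustration of the standard DPO objective, \(\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]\), not a full training loop. It assumes PyTorch; the function and argument names (e.g., `dpo_loss`, `policy_logp_w`) are illustrative choices, and the per-sequence log-probabilities \(\log \pi(y \mid x)\) are assumed to have been computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Sketch of the standard DPO loss for a batch of preference pairs.

    Each argument is a 1-D tensor of summed per-sequence log-probabilities
    log pi(y | x): the actor (policy) and the frozen reference model, each
    evaluated on the preferred response y_w and the rejected response y_l.
    """
    # Log-ratios log(pi_theta(y|x) / pi_ref(y|x)) for y_w and y_l;
    # scaled by beta, these play the role of the reward r_theta(x, y).
    logratio_w = policy_logp_w - ref_logp_w
    logratio_l = policy_logp_l - ref_logp_l

    # L_DPO = -E[ log sigma( beta * (logratio_w - logratio_l) ) ]
    logits = beta * (logratio_w - logratio_l)
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for two preference pairs.
loss = dpo_loss(
    policy_logp_w=torch.tensor([-12.3, -20.1]),
    policy_logp_l=torch.tensor([-15.7, -19.8]),
    ref_logp_w=torch.tensor([-13.0, -20.5]),
    ref_logp_l=torch.tensor([-14.9, -20.0]),
)
print(loss)  # a single scalar; lower means the policy already favors y_w over y_l
```

Note that no reward model \(r_\phi\) or critic \(V_\psi\) appears here; this absence of separately trained reward and value networks is what gives DPO its lower resource demands compared with RLHF.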