🧠 RLHF / DPO Alignment — AI / ML Interview Guide

What it is

A base model predicts likely text, not necessarily HELPFUL text. Alignment closes that gap using human PREFERENCES: show people two model responses, let them pick the better one, and nudge the model to produce more of the preferred style. RLHF does this with a reward model + RL; DPO does it directly from the preference pairs.

Mental model

Coaching by thumbs-up / thumbs-down rather than writing the perfect answer. You cannot enumerate the one correct open-ended reply, but you CAN say which of two attempts is better. Feed the model many such "A is better than B" judgments and it shifts probability toward the style you keep preferring. That preference signal is what turns a raw text-predictor into a helpful, safe assistant.

Theory

A pretrained base model is trained to predict LIKELY text, which is not the same as HELPFUL, harmless, or honest text. Alignment closes that gap by training on human PREFERENCES — teaching the model the behavior people actually want, not just what is statistically probable.

Why preferences instead of labeled "correct" answers? For open-ended generation there is no single gold answer, but humans can reliably and cheaply say which of two responses is better. Preference comparisons are more consistent than writing ideal answers and naturally capture fuzzy qualities like tone, helpfulness, and safety.

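Classic RLHF is a three-stage pipeline. (1) Collect preference pairs (prompt, A≻B). (2) Train a REWARD MODEL r(x, y) that scores a response y to a prompt x, so that responses humans prefer score higher. (3) Use reinforcement learning (PPO) to update the policy to maximize that reward, with a KL-divergence penalty keeping it close to a frozen reference model (usually the checkpoint the RL stage started from). It is powerful but complex: several models must be held in memory at once (policy, reference, reward model, and typically a value model), and training is sensitive to hyperparameters and can be unstable.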
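Stage (2) is commonly trained with a pairwise (Bradley-Terry style) loss. The sketch below assumes the reward model has already produced one scalar score per response; the function name and arguments are illustrative, not taken from any particular library.

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Equals -log(sigmoid(r_chosen - r_rejected)), written in a numerically
    stable softplus form so large margins do not overflow.
    """
    margin = r_chosen - r_rejected
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The loss shrinks as the reward model ranks the human-preferred response higher.
loss_agrees = pairwise_reward_loss(r_chosen=2.0, r_rejected=0.0)
loss_disagrees = pairwise_reward_loss(r_chosen=0.0, r_rejected=2.0)
```

When both scores are equal the loss is log 2 (about 0.693), i.e. the reward model is guessing; training pushes the margin in the direction of the human label.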

DPO (Direct Preference Optimization) achieves the same goal without a separate reward model or RL loop. It derives a simple classification-style loss that optimizes the policy DIRECTLY from the preference pairs: it raises the likelihood of preferred responses and lowers that of dispreferred ones, relative to a frozen reference model. It is simpler and more stable than PPO-based RLHF and often reaches comparable quality, which is why it is now widely used.
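A minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed token log-probabilities of each response under the trainable policy and under the frozen reference; the argument names and the default beta=0.1 are illustrative choices, not prescribed by the text above.

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    All four inputs are sequence log-probabilities log p(y | x).
    beta scales how strongly deviations from the reference are rewarded.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = pi_chosen - ref_chosen
    rejected_logratio = pi_rejected - ref_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) in a numerically stable softplus form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: no preference learned yet.
loss_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
loss_later = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

Note that only the log-ratios against the reference matter, not the raw log-probabilities. That is how the reference model plays the role the KL penalty plays in RLHF.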

The recurring failure mode is REWARD HACKING: the policy exploits flaws in the reward proxy to score high without being genuinely better, e.g. with overly long or sycophantic answers. The KL penalty (staying near the reference model it started from) is the main guard, alongside better reward models and careful evaluation. Biased or inconsistent human labels also propagate straight into the model.
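To make the guard concrete, here is a small sketch of a KL-penalized reward. It assumes a sequence-level, single-sample KL estimate (policy log-probability minus reference log-probability); real PPO setups often apply the penalty per token. The numbers are made up for illustration.

```python
def kl_penalized_reward(reward: float, logp_policy: float,
                        logp_ref: float, beta: float = 0.1) -> float:
    """Reward-model score minus a penalty for drifting from the reference.

    logp_policy - logp_ref is a single-sample estimate of the KL term:
    it is large when the policy emits text the reference finds unlikely.
    """
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate

# A "hacked" response: high proxy reward, but far from the reference.
hacked = kl_penalized_reward(reward=3.0, logp_policy=-10.0, logp_ref=-50.0)
# An honest response: slightly lower proxy reward, close to the reference.
honest = kl_penalized_reward(reward=2.0, logp_policy=-20.0, logp_ref=-22.0)
```

With these illustrative values the honest response ends up with the higher penalized reward, so the optimizer is no longer pulled toward the degenerate output. A larger beta tightens the leash; a smaller one allows more drift.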

Concrete example

Prompt: "explain photosynthesis to a 5-year-old." Response A is simple and friendly; response B is full of jargon. A human prefers A. Train on many such pairs and the model learns to be clear and helpful — that’s the difference between a raw LLM and a chat assistant like ChatGPT/Claude.

Key equations

collect preference pairs: (prompt, response_A ≻ response_B)
RLHF: train a reward model r(x, y) on preferences, then RL (PPO) to maximize r
while staying close to the reference model (a KL penalty stops it drifting too far)
DPO: skip the reward model — optimize the policy directly from the pairs
result: ↑ probability of preferred responses, ↓ of dispreferred
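
The summary above can be written out in standard notation. The symbols are assumptions introduced here, not taken from the text: x is the prompt, y_w and y_l are the preferred and dispreferred responses, r_phi is the reward model, pi_theta the trainable policy, pi_ref the frozen reference, sigma the logistic function, and beta the strength of the KL constraint.

```latex
% Reward model: pairwise (Bradley-Terry) loss on preference pairs
\mathcal{L}_{\mathrm{RM}}(\phi)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]

% RLHF: maximize reward while staying close to the reference policy
\max_{\theta}\;
  \mathbb{E}_{x,\; y \sim \pi_\theta(\cdot \mid x)}
    \big[ r_\phi(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\!\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% DPO: the same preference signal, optimized directly on the policy
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The DPO loss has the same form as the reward-model loss, with the implicit reward beta * log(pi_theta / pi_ref) standing in for r_phi.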

Step by step

A prompt is given to the model.
The model generates two candidate responses.
A human (or reward model) marks which one is preferred.
The policy is updated to make the preferred style more likely.
Repeat over many pairs → an aligned, helpful model.
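
The loop above can be sketched as a toy simulation. It is deliberately simplified: the "policy" is two logits over two fixed candidate responses, the labeler is simulated and always prefers response 0, and the update is a hand-derived gradient step on a DPO-style loss. All names and hyperparameters are illustrative assumptions.

```python
import math

def softmax2(a: float, b: float) -> tuple[float, float]:
    """Probabilities of the two candidate responses."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_toy_policy(steps: int = 200, lr: float = 0.5,
                     beta: float = 1.0) -> tuple[float, float]:
    logits = [0.0, 0.0]               # policy starts indifferent
    ref_gap = logits[0] - logits[1]   # frozen reference: same starting point
    for _ in range(steps):
        # Steps 1-3: prompt -> two candidates -> simulated human picks 0.
        preferred, other = 0, 1
        # Step 4: gradient step on -log(sigmoid(beta * (gap - ref_gap))).
        gap = logits[preferred] - logits[other]
        u = beta * (gap - ref_gap)
        step = lr * beta * sigmoid(-u)  # shrinks as the preference is learned
        logits[preferred] += step
        logits[other] -= step
    return softmax2(logits[0], logits[1])

p_preferred, p_other = train_toy_policy()
```

After training, nearly all probability mass sits on the preferred response. The update size decays as the policy moves away from the reference, a small-scale picture of how the reference term limits drift.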

Interview questions & answers

Why preferences instead of labeled “correct” answers?

For open-ended generation there’s no single correct answer, but humans can reliably say which of two responses is better. Preference comparisons are cheaper and more consistent than writing gold answers, and they capture helpfulness/tone/safety.

RLHF vs DPO?

RLHF trains a separate reward model then optimizes the policy with RL (PPO) — powerful but complex/unstable. DPO derives a loss that optimizes the policy DIRECTLY from preference pairs, no reward model or RL loop — simpler and more stable, often comparable quality.

What is the KL penalty for?

It keeps the fine-tuned policy close to the original model so it doesn’t “reward-hack” into degenerate text that scores high on the reward model but is actually bad.

What is reward hacking?

The policy exploits flaws in the reward model to get high reward without being genuinely better (e.g., overly long or sycophantic answers). Mitigated by the KL term, better reward models, and careful evaluation.

Common pitfalls

Reward hacking — optimizing the proxy reward, not real quality.
Biased/inconsistent human labels propagate into the model.
Over-optimizing → sycophancy or bland, hedged answers.

Where it shows up

ChatGPT / Claude / Llama-chat alignment
RLHF (PPO) and DPO fine-tuning
Preference tuning for helpfulness & safety
