🧠 Decoding: Beam Search & Speculative Decoding — AI / ML Interview Guide

LLM Internals · interactive visualization + interview prep

Open the interactive Decoding: Beam Search & Speculative Decoding visualization on PrepGrind → Step through a live animation, tune the parameters, and read the full theory, math, reference code, and interview Q&A below — free, in your browser.

What it is

Decoding is how you turn the model's per-step probabilities into a sequence. Greedy takes the single best token each step. Beam search keeps the top few partial sequences to find a higher-likelihood whole. Speculative decoding uses a small draft model to guess several tokens that the big model verifies in parallel — same output, faster.

Mental model

Two different goals. SEARCH strategies (greedy, beam) ask "which sequence is most likely?" — beam just hedges across a few paths instead of committing to one token at a time. SPECULATIVE decoding is not about quality at all; it's an accelerator: let a cheap model sprint ahead and have the expensive model rubber-stamp the parts it agrees with, redoing only where it disagrees.

Theory

At each step the model outputs a probability over the vocabulary; a decoding strategy chooses tokens from these step-by-step distributions. GREEDY decoding takes the argmax every step. It is fast and deterministic but myopic — a locally best token can lead to a globally worse sequence, and it tends to be repetitive.

BEAM SEARCH keeps the B highest-scoring partial sequences ("beams") at every step, expanding each and pruning back to B. It approximately maximizes total sequence likelihood, so it finds better-scoring outputs than greedy — at B× the compute. It shines for tasks with a clear best answer (translation, summarization) but tends toward safe, generic text and is rarely used for open-ended chat (sampling with temperature/top-p is preferred there — see Softmax & Sampling).

SPECULATIVE DECODING attacks LATENCY, not quality. A small, cheap DRAFT model autoregressively proposes k tokens. The large TARGET model then scores all k proposed tokens in a SINGLE parallel forward pass and accepts the longest prefix consistent with its own distribution, correcting the first mismatch.

The payoff comes from hardware: a single forward pass that verifies k tokens costs about the same as generating one token normally (decoding is memory-bandwidth bound, not compute bound). So when the draft is often right you commit several tokens per expensive pass — typically a 2–3× speedup.

Crucially, speculative decoding is provably EXACT: a rejection-sampling correction makes the accepted tokens follow the target model's own distribution. Output quality is identical to running the target alone; you only spend extra cheap draft compute to save expensive target passes.

Concrete example

Beam search powers most machine-translation systems. Speculative decoding (and variants like Medusa and EAGLE) ships in vLLM, TensorRT-LLM, and major APIs to cut latency ~2–3× — e.g. a 1B draft proposing tokens for a 70B target.

Key equations

greedy: tokenₜ = argmax p(·| context)
beam (width B): keep top-B partials by Σ log p; expand, prune to B; return best
speculative: draft proposes x₁..x_k; target scores all k in ONE pass
accept longest prefix consistent w/ target; correct first mismatch
rejection-sampling correction ⇒ EXACT target distribution

Step by step

Greedy: pick the single most probable token, repeat — one path.
Beam: keep the top-B partial sequences, expand each, prune back to B.
Beam: at the end, return the highest-scoring complete sequence.
Speculative: a small draft model proposes several tokens cheaply.
Speculative: the big model verifies them in one pass, accepts the good prefix, corrects the rest.

Interview questions & answers

Greedy vs beam search — when is beam worth it?

Beam keeps multiple hypotheses to better maximize total likelihood, helping tasks with a clear best answer (translation, summarization) at B× cost. For open-ended chat it often yields bland text, so sampling (temperature/top-p) is preferred.

Does speculative decoding change the output?

No. A rejection-sampling correction guarantees the accepted tokens follow the TARGET model's distribution, so the output is exactly what the target alone would produce — only faster.

Why is speculative decoding faster despite doing extra (draft) work?

Decoding is memory-bandwidth bound, so one target forward pass can verify k proposed tokens for roughly the cost of generating one. When the cheap draft is often right, you commit several tokens per expensive pass — a net 2–3× speedup.

What makes a good draft model?

Cheap and well-aligned with the target so its proposals are usually accepted. A poorly aligned draft is rejected often, wasting work; alignment (and self-draft methods like Medusa/EAGLE) drives the acceptance rate that determines the speedup.

Common pitfalls

Using beam search for open-ended generation — it produces generic, repetitive text.
Thinking speculative decoding trades quality for speed — it is exact.
A misaligned/large draft model → low acceptance → little or no speedup.

Where it shows up

Beam search: machine translation, summarization
Speculative decoding: vLLM, TensorRT-LLM, Medusa/EAGLE, major LLM APIs
Latency optimization for large-model serving

More AI / ML interview concepts

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…