🧠 Decoding: Beam Search & Speculative Decoding — AI / ML Interview Guide
LLM Internals · interactive visualization + interview prep
Open the interactive Decoding: Beam Search & Speculative Decoding visualization on PrepGrind → Step through a live animation, tune the parameters, and read the full theory, math, reference code, and interview Q&A below — free, in your browser.
What it is
Decoding is how you turn the model's per-step probabilities into a sequence. Greedy takes the single best token each step. Beam search keeps the top few partial sequences to find a higher-likelihood whole. Speculative decoding uses a small draft model to guess several tokens that the big model verifies in parallel — same output, faster.
Mental model
Two different goals. SEARCH strategies (greedy, beam) ask "which sequence is most likely?" — beam just hedges across a few paths instead of committing to one token at a time. SPECULATIVE decoding is not about quality at all; it's an accelerator: let a cheap model sprint ahead and have the expensive model rubber-stamp the parts it agrees with, redoing only where it disagrees.
Theory
At each step the model outputs a probability over the vocabulary; a decoding strategy chooses tokens from these step-by-step distributions. GREEDY decoding takes the argmax every step. It is fast and deterministic but myopic — a locally best token can lead to a globally worse sequence, and it tends to be repetitive.
BEAM SEARCH keeps the B highest-scoring partial sequences ("beams") at every step, expanding each and pruning back to B. It approximately maximizes total sequence likelihood, so it finds better-scoring outputs than greedy — at B× the compute. It shines for tasks with a clear best answer (translation, summarization) but tends toward safe, generic text and is rarely used for open-ended chat (sampling with temperature/top-p is preferred there — see Softmax & Sampling).
SPECULATIVE DECODING attacks LATENCY, not quality. A small, cheap DRAFT model autoregressively proposes k tokens. The large TARGET model then scores all k proposed tokens in a SINGLE parallel forward pass and accepts the longest prefix consistent with its own distribution, correcting the first mismatch.
The payoff comes from hardware: a single forward pass that verifies k tokens costs about the same as generating one token normally (decoding is memory-bandwidth bound, not compute bound). So when the draft is often right you commit several tokens per expensive pass — typically a 2–3× speedup.
Crucially, speculative decoding is provably EXACT: a rejection-sampling correction makes the accepted tokens follow the target model's own distribution. Output quality is identical to running the target alone; you only spend extra cheap draft compute to save expensive target passes.
Concrete example
Beam search powers most machine-translation systems. Speculative decoding (and variants like Medusa and EAGLE) ships in vLLM, TensorRT-LLM, and major APIs to cut latency ~2–3× — e.g. a 1B draft proposing tokens for a 70B target.
Key equations
greedy: tokenₜ = argmax p(·| context)beam (width B): keep top-B partials by Σ log p; expand, prune to B; return bestspeculative: draft proposes x₁..x_k; target scores all k in ONE passaccept longest prefix consistent w/ target; correct first mismatchrejection-sampling correction ⇒ EXACT target distribution
Step by step
- Greedy: pick the single most probable token, repeat — one path.
- Beam: keep the top-B partial sequences, expand each, prune back to B.
- Beam: at the end, return the highest-scoring complete sequence.
- Speculative: a small draft model proposes several tokens cheaply.
- Speculative: the big model verifies them in one pass, accepts the good prefix, corrects the rest.
Interview questions & answers
Greedy vs beam search — when is beam worth it?
Beam keeps multiple hypotheses to better maximize total likelihood, helping tasks with a clear best answer (translation, summarization) at B× cost. For open-ended chat it often yields bland text, so sampling (temperature/top-p) is preferred.
Does speculative decoding change the output?
No. A rejection-sampling correction guarantees the accepted tokens follow the TARGET model's distribution, so the output is exactly what the target alone would produce — only faster.
Why is speculative decoding faster despite doing extra (draft) work?
Decoding is memory-bandwidth bound, so one target forward pass can verify k proposed tokens for roughly the cost of generating one. When the cheap draft is often right, you commit several tokens per expensive pass — a net 2–3× speedup.
What makes a good draft model?
Cheap and well-aligned with the target so its proposals are usually accepted. A poorly aligned draft is rejected often, wasting work; alignment (and self-draft methods like Medusa/EAGLE) drives the acceptance rate that determines the speedup.
Common pitfalls
- Using beam search for open-ended generation — it produces generic, repetitive text.
- Thinking speculative decoding trades quality for speed — it is exact.
- A misaligned/large draft model → low acceptance → little or no speedup.
Where it shows up
- Beam search: machine translation, summarization
- Speculative decoding: vLLM, TensorRT-LLM, Medusa/EAGLE, major LLM APIs
- Latency optimization for large-model serving
More AI / ML interview concepts
- Neural Networks & Backpropagation
- Gradient Descent & Optimizers
- Activation Functions
- K-Means Clustering
- Self-Attention
- Multi-Head Attention
- Softmax, Temperature & Sampling
- Tokenization (Byte-Pair Encoding)
- Positional Encoding
- KV Cache
- Rotary Position Embedding (RoPE)
- The Transformer Block
- Normalization (LayerNorm / RMSNorm)
- Multi-Query & Grouped-Query Attention
- Flash Attention
- Embeddings & Cosine Similarity
- RAG (Retrieval-Augmented Generation) Pipeline
- Vector Search (HNSW)
- Chunking & Reranking
- ReAct Agent Loop
- Tool / Function Calling
- Multi-Agent Orchestration
- Planning & Task Decomposition
- Agent Memory
- Model Context Protocol (MCP)
- Quantization
- LoRA / PEFT Fine-Tuning
- Mixture of Experts (MoE)
- RLHF / DPO Alignment
- Evals & LLM-as-Judge
- Prompt Injection & Guardrails
- Knowledge Distillation
PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…