🧠 Softmax, Temperature & Sampling — AI / ML Interview Guide
LLM Internals · interactive visualization + interview prep
What it is
A language model does not emit a word — it emits a score (logit) for every token in its vocabulary. Softmax turns those raw scores into a probability distribution, temperature controls how "sharp" or "flat" that distribution is, and top-k / top-p trim the long tail before we draw one token at random.
Mental model
The model never hands you a word — it hands you a SCOREBOARD (one logit per vocabulary token). Decoding is how you turn that scoreboard into a single chosen token. Softmax turns scores into a probability distribution; temperature sets how spiky-vs-flat that distribution is; top-k / top-p chop off the unlikely tail so you never pick nonsense; then you roll a weighted die. Greedy is just "always take the tallest bar" — a special case with the die removed.
Theory
A language model's final layer outputs a vector of logits — one raw score per token in the vocabulary — for the next position. These are unbounded real numbers, not probabilities. Softmax, pᵢ = exp(zᵢ)/Σexp(zⱼ), maps them to a proper distribution (non-negative, sums to 1) while preserving their order. Decoding is the policy that selects one token from this distribution.
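As a minimal sketch of the formula above, the following pure-Python function turns a list of logits into probabilities. The four logit values are made up for illustration, and the max shift inside the function is a numerical-safety measure that does not change the result.

```python
import math

def softmax(logits):
    """Map raw scores to probabilities: p_i = exp(z_i) / sum_j exp(z_j)."""
    m = max(logits)                            # shift by the max for numerical safety
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.5]
probs = softmax(logits)

print([round(p, 3) for p in probs])            # non-negative values
print(sum(probs))                              # approximately 1.0

# Sorting by logit and sorting by probability give the same order.
by_logit = sorted(range(len(logits)), key=lambda i: -logits[i])
by_prob = sorted(range(len(probs)), key=lambda i: -probs[i])
print(by_logit == by_prob)
```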
Temperature T rescales the logits BEFORE softmax: pᵢ ∝ exp(zᵢ/T). T<1 amplifies differences (sharper, more confident, more repetitive); T>1 shrinks them (flatter, more diverse, riskier). The limits are instructive: T→0 collapses to argmax (greedy, deterministic) and T→∞ approaches a uniform random choice. For any T>0, temperature changes neither the weights nor the ranking of tokens, only how often the non-top tokens get chosen.
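The effect of T can be checked numerically. The sketch below assumes T > 0 and uses made-up logits; `softmax_t` is an illustrative helper name, not a library function. It prints the distribution and the token ranking at three temperatures.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Softmax of logits / temperature; assumes temperature > 0."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1, -1.5]   # hypothetical scores

for t in (0.2, 1.0, 5.0):
    probs = softmax_t(logits, t)
    ranking = sorted(range(len(probs)), key=lambda i: -probs[i])
    print(t, [round(p, 3) for p in probs], ranking)
```

The ranking column stays the same in every row; only the share of probability held by the top token shrinks as T grows.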
Truncation trims the long tail of low-probability tokens that, summed, can still be sampled and produce nonsense. Top-k keeps a FIXED number k of the tallest tokens. Top-p (nucleus) keeps a VARIABLE set — the smallest group whose cumulative probability reaches p — so it adapts: few tokens when the model is confident, many when it is unsure. After trimming you renormalize the survivors so they sum to 1, then sample.
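To make the fixed-versus-variable contrast concrete, here is a small sketch that applies both filters to two made-up distributions. The helper names and the values k = 3 and p = 0.8 are assumptions chosen for illustration.

```python
def renormalize(probs):
    total = sum(probs)
    return [p / total for p in probs]

def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: -probs[i])[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize([q if i in keep else 0.0 for i, q in enumerate(probs)])

confident = [0.90, 0.05, 0.03, 0.01, 0.01]   # hypothetical peaked distribution
unsure = [0.30, 0.25, 0.20, 0.15, 0.10]      # hypothetical flat distribution

for name, dist in (("confident", confident), ("unsure", unsure)):
    kept_k = sum(q > 0 for q in top_k_filter(dist, 3))
    kept_p = sum(q > 0 for q in top_p_filter(dist, 0.8))
    print(name, "| top-k keeps", kept_k, "| top-p keeps", kept_p)
```

Top-k keeps three tokens in both cases, while top-p keeps one token for the peaked distribution and four for the flat one.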
A practical detail interviewers probe: softmax must be numerically stabilized by subtracting max(z) before exponentiating, because exp() of a large logit overflows to infinity. The subtraction cancels in numerator and denominator, so the result is identical but safe.
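The sketch below contrasts a naive softmax with the max-shifted version on made-up logits. One Python-specific detail: `math.exp` raises `OverflowError` on large inputs instead of returning infinity, so the naive version fails outright here rather than silently producing NaN.

```python
import math

def softmax_naive(logits):
    exps = [math.exp(z) for z in logits]       # may overflow for large z
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]   # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

small = [2.0, 1.0, 0.1]                        # hypothetical logits
large = [1000.0, 999.0, 998.1]                 # same gaps, shifted up by 998

# On moderate logits the two versions agree.
print(all(math.isclose(a, b)
          for a, b in zip(softmax_naive(small), softmax_stable(small))))

try:
    softmax_naive(large)
except OverflowError as err:
    print("naive softmax failed:", err)

# The stable version handles the large logits and matches the small case,
# because only the gaps between logits matter.
print(softmax_stable(large))
```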
Why not always greedy? Greedy decoding is locally optimal but produces bland, repetitive text and can loop. Beam search keeps several high-likelihood candidate sequences and is used where one "best" answer matters (translation), but it also tends toward generic text. Sampling with temperature + top-p is the default for open-ended generation; repetition penalties are often layered on top to discourage loops.
Concrete example
Prompt: "The weather today is ___". The model scores sunny > cloudy > rainy > … > purple. At temperature 0.2 it almost always says "sunny" (sharp); at temperature 1.5 it might pick "rainy" or "windy" (creative); top-p 0.9 drops "purple" entirely so you never sample nonsense.
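The weather example can be simulated by drawing many samples and counting outcomes. The logits below are invented for illustration (a real model's scores would differ), and `sample_counts` and `nucleus` are hypothetical helper names.

```python
import math
import random
from collections import Counter

# Hypothetical logits for the weather prompt; real model scores would differ.
vocab = ["sunny", "cloudy", "rainy", "windy", "purple"]
logits = [3.0, 2.2, 1.8, 1.5, -3.0]

def softmax_t(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus(probs, top_p):
    """Indices of the smallest high-probability set with cumulative mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept

def sample_counts(temperature, top_p=1.0, draws=10_000, seed=0):
    rng = random.Random(seed)
    probs = softmax_t(logits, temperature)
    kept = nucleus(probs, top_p)
    # random.choices accepts unnormalized weights, so no explicit renormalize is needed.
    picks = rng.choices([vocab[i] for i in kept],
                        weights=[probs[i] for i in kept], k=draws)
    return Counter(picks)

print("T=0.2:", sample_counts(0.2).most_common())
print("T=1.5:", sample_counts(1.5).most_common())
print("T=1.5, top-p 0.9:", sample_counts(1.5, top_p=0.9).most_common())
```

With these particular logits, "sunny" dominates at T = 0.2, the other weather words appear regularly at T = 1.5, and "purple" falls outside the 0.9 nucleus so it is never drawn in the last run.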
Key equations
- logits: z = model output, one score per vocabulary token
- softmax: pᵢ = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)
- T → 0 ⇒ distribution → argmax (greedy, deterministic)
- T → ∞ ⇒ distribution → uniform (maximally random)
- top-k: keep the k highest-probability tokens, renormalize
- top-p (nucleus): keep the smallest set whose cumulative probability ≥ p, renormalize
- sample: draw one token from the trimmed, renormalized distribution
Step by step
- Start from raw logits (the model’s scores for each candidate token).
- Divide every logit by the temperature T, then softmax → a probability bar per token.
- Apply top-k (keep the k tallest bars) and/or top-p (keep the smallest set of tallest bars whose probabilities sum to at least p).
- Renormalize the survivors so they sum to 1.
- Draw one token at random from that distribution — that’s the next token.
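The five steps above can be sketched as one function. This is an illustrative implementation under stated assumptions, not the code of any particular library: the name `sample_next_token` is hypothetical, `top_k` is assumed to be at least 1 when given, top-k is applied before top-p, and T ≤ 0 is treated as greedy decoding to avoid dividing by zero.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Return the index of one sampled token."""
    if temperature <= 0:
        # Treat T = 0 as greedy decoding instead of dividing by zero.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Steps 1-2: temperature-scaled, max-shifted softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 3a: top-k keeps the k most probable indices.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    if top_k is not None:
        order = order[:top_k]

    # Step 3b: top-p keeps the shortest prefix whose renormalized mass reaches top_p.
    if top_p is not None:
        mass = sum(probs[i] for i in order)
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        order = kept

    # Steps 4-5: random.choices normalizes the weights, which performs the
    # renormalization, then draws one index.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]

# Usage with hypothetical logits and a seeded generator for repeatability.
logits = [2.0, 1.0, 0.1, -1.5]
rng = random.Random(42)
print(sample_next_token(logits, temperature=0.8, top_k=3, top_p=0.9, rng=rng))
```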
Interview questions & answers
What does temperature actually do to the logits?
It scales them before softmax: pᵢ ∝ exp(zᵢ / T). T < 1 amplifies differences (sharper, more confident), T > 1 shrinks them (flatter, more diverse). T→0 is greedy/argmax; T→∞ is uniform random.
Difference between top-k and top-p (nucleus) sampling?
Top-k always keeps a FIXED number k of tokens. Top-p keeps a VARIABLE number — the smallest set whose probabilities sum to ≥ p — so it adapts: few tokens when the model is confident, many when it is unsure. Top-p generally handles peaked vs flat distributions more gracefully.
Why not always use greedy decoding (argmax)?
Greedy is deterministic and locally optimal but produces repetitive, bland text and can get stuck in loops. Sampling adds diversity; top-k/top-p keep that diversity from including absurd low-probability tokens.
Does temperature change the model’s weights or the logits ranking?
Neither. The logits (and their order) are fixed by the model for a given context. Temperature only reshapes the softmax over those fixed logits — the argmax token never changes, only how often the others get chosen.
How do top-k and top-p interact when both are set?
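They compose: a token must survive BOTH filters, then you renormalize. In practice many stacks apply top-k first, then top-p on the renormalized survivors, so the order matters and the result can be narrower than the plain intersection of the two filters computed separately.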
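A small sketch shows why the order of the two filters matters. The distribution and the values k = 3 and p = 0.75 are made up, and the helper names are hypothetical. It compares intersecting two independently computed sets with running top-p on the renormalized top-k survivors.

```python
def top_k_set(probs, k):
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    return order[:k]

def top_p_set(probs, p, candidates=None):
    """Nucleus over `candidates` (default: all tokens), after renormalizing them."""
    if candidates is None:
        candidates = range(len(probs))
    order = sorted(candidates, key=lambda i: -probs[i])
    mass = sum(probs[i] for i in order)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= p:
            break
    return kept

probs = [0.4, 0.3, 0.2, 0.1]   # hypothetical distribution
k, p = 3, 0.75

independent = sorted(set(top_k_set(probs, k)) & set(top_p_set(probs, p)))
sequential = sorted(top_p_set(probs, p, candidates=top_k_set(probs, k)))

print("intersection of separate filters:", independent)   # [0, 1, 2]
print("top-k, renormalize, then top-p:  ", sequential)    # [0, 1]
```

After top-k removes the last token, the survivors renormalize to roughly 0.44, 0.33, and 0.22, so the first two already reach 0.75 and the third is dropped.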
Why is softmax numerically stabilized by subtracting the max?
exp() of a large logit overflows to Infinity. Subtracting max(z/T) before exponentiating shifts everything to ≤ 0 so exp() stays in (0,1]; the ratio is unchanged because the shift cancels in numerator and denominator.
Common pitfalls
- Confusing temperature with top-p — temperature reshapes probabilities; top-p trims the tail.
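- Setting temperature = 0 while expecting sampling: it collapses to greedy (no diversity), and since dividing by zero is undefined, implementations have to special-case it (typically as argmax).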
- Forgetting to renormalize after top-k/top-p — the kept probabilities no longer sum to 1.
- Very high temperature + no top-k/top-p → the model can emit genuinely random tokens.
Where it shows up
- Every LLM decoder (GPT, Claude, Llama) at inference time
- OpenAI/Anthropic API params: temperature and top_p (both), top_k (Anthropic)
- The final softmax of any classifier