🧠 Softmax, Temperature & Sampling — AI / ML Interview Guide

LLM Internals · interactive visualization + interview prep

What it is

A language model does not emit a word — it emits a score (logit) for every token in its vocabulary. Softmax turns those raw scores into a probability distribution, temperature controls how "sharp" or "flat" that distribution is, and top-k / top-p trim the long tail before we draw one token at random.

Mental model

The model never hands you a word; it hands you a SCOREBOARD (one logit per vocabulary token). Decoding is how you turn that scoreboard into a single chosen token. Softmax turns scores into a probability distribution; temperature sets how spiky or flat that distribution is; top-k / top-p chop off the unlikely tail so that implausible tokens cannot be drawn; then you roll a weighted die. Greedy is just "always take the tallest bar": a special case with the die removed.

Theory

A language model's final layer outputs a vector of logits — one raw score per token in the vocabulary — for the next position. These are unbounded real numbers, not probabilities. Softmax, pᵢ = exp(zᵢ)/Σexp(zⱼ), maps them to a proper distribution (non-negative, sums to 1) while preserving their order. Decoding is the policy that selects one token from this distribution.

Temperature T rescales the logits BEFORE softmax: pᵢ ∝ exp(zᵢ/T). T<1 amplifies differences (sharper, more confident, more repetitive); T>1 shrinks them (flatter, more diverse, more risky). The limits are instructive: T→0 collapses to argmax (greedy, deterministic) and T→∞ approaches a uniform random choice. Temperature changes neither the weights nor the ranking of tokens — only how often the non-top tokens get chosen.
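
The effect of T can be checked numerically. The following is a minimal sketch, assuming a hypothetical four-token vocabulary with made-up logits; the function name softmax_with_temperature is illustrative and not taken from any library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for T = 0")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four candidate tokens (assumed values).
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.1, 1.0, 10.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At T = 0.1 nearly all the mass sits on the first token; at T = 10 the four probabilities are close to equal. In every case the first token stays the most likely one.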

Truncation trims the long tail of low-probability tokens. Each tail token is individually unlikely, but together they can hold enough probability mass that one of them is occasionally sampled and derails the text. Top-k keeps a FIXED number k of the highest-probability tokens. Top-p (nucleus) keeps a VARIABLE set, the smallest group of top-ranked tokens whose cumulative probability reaches p, so it adapts: few tokens when the model is confident, many when it is unsure. After trimming you renormalize the survivors so they sum to 1, then sample.
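
A sketch of the two truncation rules, assuming the input is already a probability list. The helper names (renormalize, top_k_filter, top_p_filter) and the two example distributions are hypothetical.

```python
def renormalize(probs):
    """Rescale the entries so they sum to 1."""
    total = sum(probs)
    return [p / total for p in probs]

def top_k_filter(probs, k):
    """Keep the k highest-probability entries, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    return renormalize([p if i in keep else 0.0 for i, p in enumerate(probs)])

def top_p_filter(probs, p):
    """Keep the smallest top-ranked set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize([q if i in keep else 0.0 for i, q in enumerate(probs)])

# Assumed example distributions: one peaked, one nearly flat.
confident = [0.90, 0.05, 0.03, 0.02]
unsure = [0.30, 0.28, 0.22, 0.20]

for name, dist in (("confident", confident), ("unsure", unsure)):
    kept_p = sum(1 for q in top_p_filter(dist, 0.85) if q > 0)
    kept_k = sum(1 for q in top_k_filter(dist, 2) if q > 0)
    print(name, "top-p keeps", kept_p, "| top-k keeps", kept_k)
```

With p = 0.85 the nucleus holds one token for the peaked distribution and all four for the flat one, while k = 2 keeps two tokens in both cases.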

A practical detail interviewers probe: softmax must be numerically stabilized by subtracting the largest logit, max(z), before exponentiating (or max(z/T) when temperature is applied), because exp() of a large logit overflows to infinity in floating point. The subtraction cancels in numerator and denominator, so the result is mathematically identical but safe to compute.
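
A small sketch of the failure and the fix, using only the standard library. Note one Python-specific assumption: math.exp raises OverflowError on overflow, whereas array libraries typically return inf and then produce NaN when dividing. The function names are illustrative.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(z) for z in logits]  # overflows for large z
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

small = [2.0, 1.0, 0.0]
large = [1000.0, 999.0, 998.0]  # same gaps as `small`, shifted up by 998

try:
    naive_softmax(large)
    naive_failed = False
except OverflowError:
    naive_failed = True

print("naive version failed on large logits:", naive_failed)
print("stable, small:", [round(p, 4) for p in stable_softmax(small)])
print("stable, large:", [round(p, 4) for p in stable_softmax(large)])
```

Because the two logit vectors differ only by a constant shift, the stable version returns the same distribution for both.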

Why not always greedy? Greedy decoding is locally optimal but produces bland, repetitive text and can loop. Beam search keeps several high-likelihood candidate sequences and is used where one "best" answer matters (translation), but it also tends toward generic text. Sampling with temperature + top-p is the default for open-ended generation; repetition penalties are often layered on top to discourage loops.
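
The looping behavior of greedy decoding can be illustrated without a real model. The sketch below assumes a toy lookup table of next-token logits (entirely made up) in place of a language model; generate and TOY_LOGITS are hypothetical names.

```python
import math
import random

# Hypothetical next-token logits for a tiny vocabulary (not from a real model).
TOY_LOGITS = {
    "I":     {"think": 2.0, "know": 1.5, "I": 0.1},
    "think": {"I": 2.0, "that": 1.8, "so": 1.0},
    "know":  {"that": 2.0, "I": 1.0},
    "that":  {"I": 2.0, "so": 1.5},
    "so":    {"I": 1.0, "that": 1.0},
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(start, steps, greedy, rng):
    """Extend `start` by `steps` tokens, greedily or by sampling."""
    tokens = [start]
    for _ in range(steps):
        candidates = TOY_LOGITS[tokens[-1]]
        words = list(candidates)
        if greedy:
            next_word = max(words, key=candidates.get)  # tallest bar
        else:
            probs = softmax([candidates[w] for w in words])
            next_word = rng.choices(words, weights=probs, k=1)[0]
        tokens.append(next_word)
    return tokens

rng = random.Random(0)
greedy_out = generate("I", 6, greedy=True, rng=rng)
sampled_out = generate("I", 6, greedy=False, rng=rng)
print("greedy :", " ".join(greedy_out))
print("sampled:", " ".join(sampled_out))
```

Greedy decoding on this table cycles between "I" and "think" forever, because each is the other's top continuation. Sampling can leave the cycle, at the cost of being non-deterministic.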

Concrete example

Prompt: "The weather today is ___". The model scores sunny > cloudy > rainy > … > purple. At temperature 0.2 it almost always says "sunny" (sharp); at temperature 1.5 it might pick "rainy" or "windy" (creative); top-p 0.9 drops "purple" entirely so you never sample nonsense.
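
The example can be made numeric. The logits below are invented for illustration (a real model's scores would differ), and softmax_t is a hypothetical helper.

```python
import math

def softmax_t(logits, temperature):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the five candidate words.
words = ["sunny", "cloudy", "rainy", "windy", "purple"]
logits = [3.0, 2.0, 1.0, 0.5, -4.0]

sharp = dict(zip(words, softmax_t(logits, 0.2)))
flat = dict(zip(words, softmax_t(logits, 1.5)))

# Nucleus at p = 0.9 on the T = 1.5 distribution: add words from most to
# least likely until the running total reaches 0.9.
nucleus, cumulative = [], 0.0
for word in sorted(flat, key=flat.get, reverse=True):
    nucleus.append(word)
    cumulative += flat[word]
    if cumulative >= 0.9:
        break

print(round(sharp["sunny"], 3), round(flat["sunny"], 3), nucleus)
```

With these assumed logits, "sunny" holds over 99% of the mass at T = 0.2 but only about half at T = 1.5, and "purple" falls outside the nucleus.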

Key equations
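
Written out from the definitions in the Theory section. The set notation V, V^(k), and V^(p) is introduced here for convenience and is an assumption of this sketch.

```latex
\begin{align*}
p_i &= \frac{\exp(z_i)}{\sum_j \exp(z_j)}
  && \text{(softmax)} \\
p_i(T) &= \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \quad T > 0
  && \text{(temperature-scaled softmax)} \\
V^{(k)} &= \text{the } k \text{ tokens with the largest } p_i
  && \text{(top-}k\text{)} \\
V^{(p)} &= \text{the smallest top-ranked set with } \sum_{i \in V^{(p)}} p_i \ge p
  && \text{(top-}p\text{)} \\
\tilde{p}_i &= \begin{cases} p_i \Big/ \sum_{j \in V} p_j & i \in V \\ 0 & i \notin V \end{cases}
  && \text{(renormalization over the kept set } V\text{)}
\end{align*}
```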

Step by step

  1. Start from raw logits (the model’s scores for each candidate token).
  2. Divide every logit by the temperature T, then apply softmax to get one probability bar per token.
  3. Apply top-k (keep k tallest bars) and/or top-p (keep the nucleus that sums to p).
  4. Renormalize the survivors so they sum to 1.
  5. Draw one token at random from that distribution — that’s the next token.
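
The five steps above can be sketched as a single function. This is a minimal illustration under stated assumptions, not a reference implementation: sample_next_token and its parameter names are hypothetical, top-k is applied before top-p, and the nucleus is computed over the mass left after top-k.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Scale, softmax, truncate, renormalize, draw. Returns a token index."""
    # Steps 1-2: temperature-scaled, max-shifted softmax.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 3: rank indices from most to least likely, then truncate.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        mass = sum(probs[i] for i in order)  # mass left after top-k
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i] / mass
            if cumulative >= top_p:
                break
        order = kept

    # Step 4: renormalize the survivors so they sum to 1.
    mass = sum(probs[i] for i in order)
    weights = [probs[i] / mass for i in order]

    # Step 5: weighted random draw.
    return rng.choices(order, weights=weights, k=1)[0]

# Hypothetical logits; a seeded generator keeps the run reproducible.
logits = [2.0, 1.0, 0.5, -1.0]
token = sample_next_token(logits, temperature=0.8, top_k=3, top_p=0.9,
                          rng=random.Random(0))
print("sampled token index:", token)
```

Setting top_k=1 reduces this to greedy decoding, since only the tallest bar survives truncation.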

Interview questions & answers

What does temperature actually do to the logits?

It scales them before softmax: pᵢ ∝ exp(zᵢ / T). T < 1 amplifies differences (sharper, more confident), T > 1 shrinks them (flatter, more diverse). T→0 is greedy/argmax; T→∞ is uniform random.

Difference between top-k and top-p (nucleus) sampling?

Top-k always keeps a FIXED number k of tokens. Top-p keeps a VARIABLE number — the smallest set whose probabilities sum to ≥ p — so it adapts: few tokens when the model is confident, many when it is unsure. Top-p generally handles peaked vs flat distributions more gracefully.

Why not always use greedy decoding (argmax)?

Greedy is deterministic and locally optimal but produces repetitive, bland text and can get stuck in loops. Sampling adds diversity; top-k/top-p keep that diversity from including absurd low-probability tokens.

Does temperature change the model’s weights or the ranking of the logits?

Neither. The logits (and their order) are fixed by the model for a given context. For any T > 0, temperature only reshapes the softmax over those fixed logits: the argmax token never changes, only how often the others get chosen.
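
A short derivation of why the order is preserved, written with p_i(T) for the temperature-scaled probability of token i (notation assumed here):

```latex
\begin{align*}
\text{For } T > 0:\quad
z_i > z_j
  &\iff \frac{z_i}{T} > \frac{z_j}{T}
  \iff \exp\!\left(\frac{z_i}{T}\right) > \exp\!\left(\frac{z_j}{T}\right)
  \iff p_i(T) > p_j(T), \\[4pt]
\frac{p_i(T)}{p_j(T)} &= \exp\!\left(\frac{z_i - z_j}{T}\right).
\end{align*}
```

The ratio shows what T does control: the gap between two tokens grows as T shrinks and tends to 1 as T grows, while its direction never flips.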

How do top-k and top-p interact when both are set?

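They compose: a token must survive BOTH filters, then you renormalize. In practice many stacks apply top-k first, then top-p on what remains. When applied in that order, the nucleus is computed over the distribution left after top-k, which can keep slightly fewer tokens than intersecting two filters computed independently on the full distribution.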
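
Whether "both filters" means a literal intersection or a sequential pipeline can matter at the margin. The sketch below compares the two on an assumed four-token distribution; the helper names and the values of k and p are hypothetical.

```python
def top_k_indices(probs, k):
    """Indices of the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:k])

def nucleus(probs, p):
    """Indices of the smallest top-ranked set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return set(kept)

probs = [0.4, 0.3, 0.2, 0.1]  # assumed distribution
k, p = 3, 0.75

# Variant A: compute each filter on the full distribution, then intersect.
intersection = top_k_indices(probs, k) & nucleus(probs, p)

# Variant B: top-k first, renormalize, then top-p on what remains.
survivors = sorted(top_k_indices(probs, k))
mass = sum(probs[i] for i in survivors)
renormalized = [probs[i] / mass for i in survivors]
sequential = {survivors[j] for j in nucleus(renormalized, p)}

print("intersection:", sorted(intersection))  # [0, 1, 2]
print("sequential:  ", sorted(sequential))    # [0, 1]
```

Renormalizing after top-k inflates the surviving probabilities, so the cumulative sum reaches p one token earlier here. The sequential result is always a subset of the intersection, and the two coincide for many inputs.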

Why is softmax numerically stabilized by subtracting the max?

exp() of a large logit overflows to Infinity. Subtracting max(z/T) before exponentiating shifts everything to ≤ 0 so exp() stays in (0,1]; the ratio is unchanged because the shift cancels in numerator and denominator.
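
The cancellation can be written out explicitly. Here m denotes the largest scaled logit (a symbol introduced for this sketch):

```latex
\begin{align*}
m &= \max_j \frac{z_j}{T}, \\[4pt]
\frac{\exp\!\left(\frac{z_i}{T} - m\right)}{\sum_j \exp\!\left(\frac{z_j}{T} - m\right)}
  &= \frac{e^{-m}\,\exp\!\left(\frac{z_i}{T}\right)}{e^{-m}\sum_j \exp\!\left(\frac{z_j}{T}\right)}
   = \frac{\exp\!\left(\frac{z_i}{T}\right)}{\sum_j \exp\!\left(\frac{z_j}{T}\right)}
   = p_i(T), \\[4pt]
\frac{z_i}{T} - m \le 0 \;&\Longrightarrow\; 0 < \exp\!\left(\frac{z_i}{T} - m\right) \le 1 .
\end{align*}
```

The term for the largest logit is exactly exp(0) = 1, so the denominator is at least 1 and the division is always well defined.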

Common pitfalls

Where it shows up
