🧠 KV Cache — AI / ML Interview Guide

LLM Internals · interactive visualization + interview prep

Open the interactive KV Cache visualization on PrepGrind → Step through a live animation, tune the parameters, and read the full theory, math, reference code, and interview Q&A below — free, in your browser.

What it is

LLMs generate one token at a time, each attending to ALL previous tokens. Naively you’d recompute every past token’s Key and Value vectors at every step — hugely wasteful. The KV cache stores past K/V vectors so each new token only computes its OWN K/V and reuses the rest. It’s the single biggest speedup in autoregressive inference.

Mental model

Autoregressive generation re-reads the whole story to write each next word. But the Key and Value vectors of past tokens NEVER change once computed — they depend only on those past tokens. So cache them. Each new step then computes only the new token's K,V and attends over the stored history. You've turned "recompute everything every step" (quadratic) into "compute one new thing per step" (linear) — paying memory to save compute.

Theory

LLMs decode autoregressively: one token at a time, each new token attending over ALL previous tokens. The naive implementation re-runs attention on the entire growing sequence every step, recomputing every past token's Key and Value vectors — work that is identical to the previous step. The KV cache removes this redundancy.

The key invariant is that a token's K and V depend only on that token (and the frozen weights), so once computed they are fixed for the rest of generation. The cache stores them per layer and per head; each step appends just the new token's K,V. Note Q is NOT cached — only the current token's query is needed to attend over the cached keys; past queries are never reused.

This changes the asymptotics. Without a cache, step t does O(t) K/V work, so n steps cost 1+2+…+n = O(n²). With the cache, each step does O(1) new K/V work, so n steps cost O(n) — orders of magnitude faster for long outputs. The visualization's growing compute gap is exactly this quadratic-vs-linear divergence.

This is why inference splits into two phases with different bottlenecks. PREFILL processes the whole prompt in parallel to populate the cache — compute-bound, high GPU utilization. DECODE then generates tokens one by one reusing the cache — memory-bandwidth-bound, since each step moves the large cache but does little math. They are tuned and even scheduled separately in serving stacks.

The cost is memory: the cache grows as ≈ 2 · layers · heads · d_head · seq_len · batch. For long contexts and large batches it can exceed the model weights and becomes the limit on batch size. Mitigations: Multi-Query / Grouped-Query Attention (share K/V across heads), quantizing the cache to INT8/INT4, paged attention (vLLM) for non-fragmented allocation, and sliding-window or eviction for very long sequences.

Concrete example

Generating a 1000-token answer: without a cache, step 1000 recomputes K/V for all 999 prior tokens (≈500k token-computations total, quadratic). With the cache it’s ≈1000 (linear) — orders of magnitude faster. The trade-off is memory: the cache grows with sequence length.

Key equations

each new token computes Q, K, V; attends to all cached K/V
cache appends the new K, V — past ones are never recomputed
without cache: Σ steps ≈ n(n+1)/2 K/V computes (quadratic)
with cache: n K/V computes (linear) — only the new token each step
cache memory ≈ 2 · layers · heads · d_head · seq_len (grows with length)

Step by step

Generate the first token → compute & cache its K, V.
For each next token: compute only its Q, K, V.
Attend over the cached K/V of all previous tokens (no recompute).
Append the new K, V to the cache and repeat.
Compare: with-cache compute grows linearly; without-cache grows quadratically.

Interview questions & answers

What exactly does the KV cache store, and why not Q?

The Key and Value vectors of all past tokens, per layer and head. Q isn’t cached because each step only needs the CURRENT token’s query to attend over the cached K/V — past queries are never reused.

What’s the trade-off?

Compute for memory: you avoid recomputation (linear instead of quadratic) but the cache grows with sequence length × layers × heads — for long contexts it becomes the dominant memory cost and limits batch size.

How is KV-cache memory reduced in practice?

Multi-Query / Grouped-Query Attention (share K/V across heads), quantizing the cache, paged attention (vLLM) for efficient allocation, and sliding-window/eviction for very long contexts.

Why does prefill differ from decode?

Prefill processes the whole prompt in parallel (compute-bound, fills the cache once). Decode generates one token at a time reusing the cache (memory-bandwidth-bound). They have very different performance characteristics.

Common pitfalls

Ignoring cache memory — it can exceed the model weights for long contexts.
Assuming it changes outputs — it’s purely an optimization (same results).
Forgetting it grows per layer AND per head.

Where it shows up

Every autoregressive LLM at inference (GPT, Llama, Claude)
vLLM paged attention, MQA/GQA, KV-cache quantization

More AI / ML interview concepts

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…