🧠 Self-Attention — AI / ML Interview Guide

LLM Internals · interactive visualization + interview prep

Open the interactive Self-Attention visualization on PrepGrind → Step through a live animation, tune the parameters, and read the full theory, math, reference code, and interview Q&A below — free, in your browser.

What it is

Self-attention lets every token in a sequence look at every other token and pull in the information it needs, weighted by relevance. Instead of processing words in a fixed left-to-right window, each position asks a question (its Query), every position advertises what it offers (its Key), and the match between question and offer decides how much of each position's content (its Value) gets mixed into the result. The output for a token is a relevance-weighted blend of all tokens' values — a context-aware representation built in one parallel step, with no recurrence and a direct path between any two positions.

Mental model

Every token asks ONE question (its Query) and the whole sequence answers at once. Each token also holds up a label (its Key) describing what it offers and a payload (its Value). A token compares its question against every label, softmaxes the match scores into "how much should I listen to each token", and walks away carrying a weighted blend of everyone's payloads. Run that for all tokens in parallel — no loops, no recurrence — and you have one attention layer. The weights are computed, not stored: change the sentence and the routing changes.

Theory

Attention is a content-based, differentiable lookup — a "soft dictionary". Each token x is linearly projected into three roles: a Query (what I'm looking for), a Key (what I contain), and a Value (what I'll contribute), via learned matrices W_Q, W_K, W_V. Unlike a hard hash lookup, every key matches the query a little, and the result is a weighted average of all values — which makes the whole operation differentiable and learnable.

The mechanism is three steps. (1) Score every query against every key: S = QKᵀ, an n×n matrix whose entry Sᵢⱼ is the dot-product similarity of token i's query with token j's key. (2) Scale by √dₖ and softmax each row into a probability distribution. (3) Mix: output = (softmax(S/√dₖ))·V, so each token's result is the relevance-weighted sum of all Values.

Two design choices matter. The √dₖ scaling keeps the dot products from growing with dimension and saturating the softmax (which would kill gradients). The softmax (vs plain normalization) guarantees non-negative weights that sum to 1, so the output is a true convex combination of Values, sharpened toward the best matches but still soft and differentiable.

Attention is permutation-EQUIVARIANT: it treats the input as a set, so it has no inherent notion of order — which is exactly why a positional encoding must be added first. In a decoder we also apply a causal mask (set future scores to −∞ before softmax) so a token can't attend to tokens it hasn't generated yet. When Q comes from one sequence and K,V from another, it's cross-attention (e.g. a translation decoder attending to the encoded source).

The cost is the catch: the n×n score matrix makes attention O(n²·d) time and O(n²) memory — quadratic in sequence length. That single fact drives much of modern LLM engineering: FlashAttention (IO-aware exact attention), sparse/local attention, linear-attention approximations, and the KV cache for efficient autoregressive decoding.

Concrete example

Take "The animal didn't cross the street because IT was too tired." What does "it" refer to? A human instantly resolves "it" -> "animal". Self-attention does the same mechanically: the Query vector for "it" lines up strongly with the Key vector for "animal" (and weakly with "street"), so the softmax puts most of the weight on "animal" and pulls its Value into "it"'s new representation. Change the sentence to "...because it was too wide" and the same mechanism shifts the weight toward "street". Attention IS learned coreference/relevance routing, computed for every token at once.

Key equations

Project each token x into three vectors: Q = xWQ, K = xWK, V = xWV
Raw scores (all pairs): S = Q·Kᵀ (shape n×n for n tokens)
Scale by √dₖ: S/√dₖ (dₖ = key/query dimension)
Normalize per row: A = softmax(S/√dₖ) → rows sum to 1
Mix values: Attention(Q,K,V) = A·V = softmax(QKᵀ/√dₖ)·V
Causal (decoder) mask: set Sᵢⱼ = −∞ for j>i before softmax (no peeking ahead)
Multi-head: run h attentions in parallel on dₘ/h-dim slices, concat, then ·Wᴼ
Cost: O(n²·d) time and O(n²) memory — quadratic in sequence length n

Step by step

Embed each token and (since attention is order-agnostic) add a positional encoding so position information is available.
For every token, linearly project its vector into a Query, a Key, and a Value using learned matrices WQ, WK, WV.
Compute the score matrix S = Q·Kᵀ: entry Sᵢⱼ is the dot-product similarity between token i's query and token j's key.
Divide by √dₖ to keep the scores in a sane range, then apply a causal mask if this is a decoder (a token may only attend to itself and earlier tokens).
Softmax each row so it becomes a probability distribution: "how much should token i attend to each token j?"
Multiply those weights by V: each token's output is the weighted sum of all Values — a context-mixed representation.
In practice do this with multiple heads in parallel (each can specialize, e.g. one tracks syntax, another long-range coreference), concatenate, and project back with Wᴼ.

Interview questions & answers

Why divide the scores by √dₖ?

For high-dimensional Q and K, the dot product QKᵀ has variance that grows with dₖ, so raw scores get large in magnitude. Large logits push softmax into a near one-hot regime where gradients vanish. Dividing by √dₖ rescales the variance back to ~1, keeping the softmax in a well-behaved, trainable range.

Why softmax instead of just normalizing the scores?

Softmax turns arbitrary real scores into a non-negative distribution that sums to 1, so the output is a proper convex combination of Values. It is differentiable, sharpens the contrast between high and low scores (winner-leaning but still soft), and handles negative scores gracefully — plain divide-by-sum does none of these reliably.

Self-attention vs cross-attention — what is the difference?

In self-attention, Q, K, and V all come from the SAME sequence — tokens attend to each other within one input. In cross-attention, Q comes from one sequence and K, V from another — e.g. a translation decoder's tokens (Q) attending to the encoded source sentence (K, V), letting the output condition on the input.

What is the computational complexity, and why does it matter?

The score matrix is n×n for sequence length n, so attention is O(n²·d) time and O(n²) memory. That quadratic blowup is the main cost for long contexts and motivates FlashAttention (IO-aware exact attention), sparse/local attention, linear-attention approximations, and the KV cache for efficient autoregressive decoding.

Why use multiple heads instead of one big attention?

Each head attends in a different learned subspace, so different heads can specialize — one tracks subject-verb agreement, another long-range references, another positional patterns. A single head with one softmax must average all of these into one distribution; multiple heads let the model attend to several relationships at once, then combine them.

Why does a Transformer need positional encodings at all?

Self-attention is permutation-equivariant: it treats the input as a set, so "dog bites man" and "man bites dog" would be indistinguishable without position info. Positional encodings (sinusoidal, learned, or relative/RoPE) inject order so the model can reason about sequence structure.

What does the causal mask do and when do you use it?

In a decoder (autoregressive generation) a token must not attend to future tokens, or it would "cheat" by seeing the answer it is predicting. The causal mask sets future scores to −∞ before softmax so their weight becomes 0. Encoders (e.g. BERT) skip the mask and attend bidirectionally.

Common pitfalls

Forgetting positional encodings: the model becomes order-blind and treats the sequence as a bag of tokens.
Omitting the causal mask in a decoder leaks future tokens and gives unrealistically low training loss that collapses at inference.
Skipping the √dₖ scaling makes softmax saturate, killing gradients and stalling training.
Ignoring the O(n²) memory cost: long sequences OOM quickly; you need FlashAttention, chunking, or sparse attention.
Confusing the three projections — Q, K, V have distinct learned matrices; tying them or swapping them silently breaks the mechanism.
Assuming attention weights are faithful explanations: high attention is correlation, not a guaranteed causal "reason" for the output.

Where it shows up

GPT family / Claude / Llama — decoder-only stacks of masked self-attention.
BERT and encoders — bidirectional self-attention for understanding tasks.
The original "Attention Is All You Need" Transformer (encoder–decoder translation).
Vision Transformers (ViT) — attention over image patches.
Multimodal and speech models (Whisper, CLIP-style encoders) that fuse modalities via attention.

More AI / ML interview concepts

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…