🧠 The Transformer Block — AI / ML Interview Guide

LLM Internals · interactive visualization + interview prep

What it is

A Transformer is a stack of structurally identical BLOCKS, each with its own weights. One block does two things in turn: mix information across tokens (self-attention), then transform each token on its own (a feed-forward MLP). Each of the two is wrapped in a LayerNorm and a residual connection. Stack N of these between an embedding layer and an output head and you have a GPT- or Llama-style model.

Mental model

Think of one block as "communicate, then think". Attention is the COMMUNICATION phase — every token gathers what it needs from the others. The MLP is the per-token THINKING phase — each token processes what it gathered, alone. Residual connections mean each phase only ADDS a correction to a running representation, so the signal (and the gradient) flows straight through the whole deep stack.

Theory

The block is the repeated unit of the architecture. Most modern LLMs use PRE-NORM: x = x + Attention(LayerNorm(x)), then x = x + FFN(LayerNorm(x)). Normalizing BEFORE each sublayer (rather than after the residual add, as in the original Transformer paper) keeps the residual path clean and makes very deep stacks train stably.
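
The difference between the two orderings can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the names pre_norm, post_norm and zero_sublayer are hypothetical, and LayerNorm's learned scale and shift are omitted. With a sublayer that contributes nothing, pre-norm passes x through untouched, while post-norm still rewrites it.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token (row) to zero mean and unit variance over features.
    # Learned scale and shift are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm(x, sublayer):
    # Only the branch input is normalized; the residual stream x is added as-is.
    return x + sublayer(layer_norm(x))

def post_norm(x, sublayer):
    # Original ordering: normalize after the residual add, so the residual
    # stream itself passes through LayerNorm at every sublayer.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8)) * 3.0 + 1.0       # (seq_len, d_model)
zero_sublayer = lambda h: np.zeros_like(h)    # hypothetical sublayer that adds nothing

pre_out = pre_norm(x, zero_sublayer)
post_out = post_norm(x, zero_sublayer)

print("pre-norm leaves x unchanged: ", np.allclose(pre_out, x))
print("post-norm leaves x unchanged:", np.allclose(post_out, x))
```

In the pre-norm form the identity path from input to output never goes through a normalization, which is what "keeps the residual path clean" refers to.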

Sublayer one is multi-head self-attention, the only place tokens exchange information. Sublayer two is a position-wise feed-forward network: a 2-layer MLP whose hidden layer is typically ~4× the model width, with a GELU nonlinearity (or a gated variant such as SwiGLU), applied independently to each token. Attention moves information between positions; the MLP transforms it within a position.
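
One way to check the "attention mixes, MLP does not" claim is to perturb a single token and see which output rows move. The sketch below makes simplifying assumptions: single-head unmasked attention, a ReLU MLP without biases, and small random weights. All function and variable names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head, unmasked attention: every token attends to every token.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def mlp(x, W1, W2):
    # Position-wise MLP: each row of x is transformed with the same weights.
    return np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) / np.sqrt(d)
W2 = rng.normal(size=(4 * d, d)) / np.sqrt(4 * d)

x_perturbed = x.copy()
x_perturbed[0] += 1.0   # change token 0 only

# For each output row, did it change when token 0 changed?
attn_diff = self_attention(x_perturbed, Wq, Wk, Wv) - self_attention(x, Wq, Wk, Wv)
mlp_diff = mlp(x_perturbed, W1, W2) - mlp(x, W1, W2)
attn_changed = np.abs(attn_diff).max(axis=-1) > 1e-9
mlp_changed = np.abs(mlp_diff).max(axis=-1) > 1e-9

print("rows changed by attention:", attn_changed)
print("rows changed by MLP:      ", mlp_changed)
```

With unmasked attention every row depends on token 0 through its key and value, so all rows move; the MLP only changes the row that was perturbed.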

Residual connections are load-bearing. Writing each sublayer as x + f(x) means f only has to learn a CORRECTION to the identity, and gradients get an unobstructed path back through the addition — this is what made training networks dozens to hundreds of layers deep feasible (the same idea as ResNets).
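
The gradient argument can be made concrete with a toy linear model. The Jacobian of x + f(x) is I + J_f, so the end-to-end Jacobian of a residual stack is a product of near-identity matrices rather than a product of small ones. The sketch below assumes linear sublayers f_i(x) = W_i·x with small random weights, chosen only so the Jacobians can be written down exactly; real sublayers are nonlinear.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 16, 50

# Toy linear "sublayers" f_i(x) = W_i x with small random weights (an assumption
# made for illustration, not how real blocks are parameterized).
weights = [0.1 * rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(depth)]

# The end-to-end Jacobian of a stack is the product of per-layer Jacobians.
jac_plain = np.eye(d)      # layers of the form x -> f(x)
jac_residual = np.eye(d)   # layers of the form x -> x + f(x)
for W in weights:
    jac_plain = W @ jac_plain
    jac_residual = (np.eye(d) + W) @ jac_residual

plain_norm = np.linalg.norm(jac_plain)
residual_norm = np.linalg.norm(jac_residual)
print(f"plain stack Jacobian norm:    {plain_norm:.3e}")
print(f"residual stack Jacobian norm: {residual_norm:.3e}")
```

In this setup the plain stack's Jacobian collapses toward zero after 50 layers, while the residual stack's stays at a usable scale, which is the "unobstructed path" in numerical form.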

Everything is shape-preserving: a block maps a (seq_len × d_model) tensor to the same shape, which is exactly why you can stack N identical blocks. Depth (more blocks) and width (larger d_model / more heads) are the two axes you scale.

A full decoder LM is: token embedding + positional info → N blocks → a final LayerNorm → an unembedding (often tied to the input embedding) → softmax over the vocabulary. The block is the part that repeats; the rest is the input/output plumbing.
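
The pipeline above can be sketched end to end in NumPy. Everything here is an assumption made for illustration: the function names, the tiny sizes, learned positional embeddings as the source of positional info, single-head causal attention, and a ReLU MLP. The unembedding reuses the token embedding matrix to show weight tying.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def toy_block(x, p):
    # Pre-norm block with single-head causal attention and a ReLU MLP.
    h = layer_norm(x)
    q, k, v = h @ p["Wq"], h @ p["Wk"], h @ p["Wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # no attending to later tokens
    x = x + softmax(scores) @ v @ p["Wo"]
    h = layer_norm(x)
    return x + np.maximum(h @ p["W1"], 0.0) @ p["W2"]

def decoder_lm(token_ids, params):
    seq_len = len(token_ids)
    # Token embedding + positional info (learned positions assumed here).
    x = params["tok_emb"][token_ids] + params["pos_emb"][:seq_len]
    for block_params in params["blocks"]:         # N blocks
        x = toy_block(x, block_params)
    x = layer_norm(x)                             # final LayerNorm
    logits = x @ params["tok_emb"].T              # unembedding tied to the embedding
    return softmax(logits)                        # distribution over the vocabulary

def init_params(vocab=50, d=16, n_blocks=3, max_len=32, seed=0):
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(size=shape) * 0.1
    blocks = [
        {"Wq": w(d, d), "Wk": w(d, d), "Wv": w(d, d), "Wo": w(d, d),
         "W1": w(d, 4 * d), "W2": w(4 * d, d)}
        for _ in range(n_blocks)
    ]
    return {"tok_emb": w(vocab, d), "pos_emb": w(max_len, d), "blocks": blocks}

params = init_params()
probs = decoder_lm(np.array([3, 14, 15, 9, 2]), params)
print(probs.shape)   # one next-token distribution per input position
```

Only toy_block repeats; the embedding lookup, the final LayerNorm and the tied unembedding each appear once, which is the "plumbing" the paragraph describes.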

Concrete example

GPT-3 is 96 of these blocks with d_model = 12288 and 96 heads; Llama-3-8B is 32 blocks. Same block, just stacked more / wider. In the standard design with a 4× hidden layer, the MLP holds roughly two-thirds of each block's parameters; the attention projections are the rest.
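
The two-thirds figure follows from counting weight matrices. A rough tally, assuming a plain block with four d_model × d_model attention projections and a 4× MLP, and ignoring biases, LayerNorm parameters and embeddings (the helper name block_params is hypothetical):

```python
def block_params(d_model, ffn_mult=4):
    # Weight matrices only; biases and LayerNorm parameters are ignored
    # because they are tiny next to the d_model^2 terms.
    attention = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O
    mlp = 2 * ffn_mult * d_model * d_model       # W_1 (d -> 4d) and W_2 (4d -> d)
    return attention, mlp

# GPT-3 sizes from the example above: d_model = 12288, 96 blocks.
attn, mlp = block_params(12288)
mlp_share = mlp / (attn + mlp)
total_in_blocks = 96 * (attn + mlp)

print(f"MLP share of one block: {mlp_share:.3f}")
print(f"weights in 96 blocks:   {total_in_blocks / 1e9:.1f}B")
```

Per block this is 4·d² for attention and 8·d² for the MLP, so the MLP share is 8/12 = 2/3. The 96 blocks come to about 174 billion weights, in line with GPT-3's commonly quoted 175B total once embeddings are added.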

Key equations

pre-norm block:
x = x + MultiHeadAttention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
FFN(x) = W₂ · σ(W₁·x + b₁) + b₂ (σ = GELU, hidden ≈ 4·d_model; SwiGLU variants replace σ(W₁·x) with a gated product of two projections)
shape in = shape out = (seq_len × d_model) → stack N blocks
decoder LM = embed → N blocks → final LN → unembed → softmax
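
The FFN equation can be written out directly. The sketch below shows the GELU form from the equation and, for comparison, a SwiGLU-style gated form of the kind used in Llama-family models. Names such as ffn_gelu, ffn_swiglu, W_gate, W_up and W_down are illustrative, the tanh approximation of GELU is used, and tokens are stored as rows, so W₁·x becomes x @ W1.

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def silu(z):
    return z / (1.0 + np.exp(-z))

def ffn_gelu(x, W1, b1, W2, b2):
    # FFN(x) = W2 · GELU(W1·x + b1) + b2, written for row-vector tokens.
    return gelu(x @ W1 + b1) @ W2 + b2

def ffn_swiglu(x, W_gate, W_up, W_down):
    # Gated variant: one projection goes through SiLU and multiplies a second
    # projection elementwise, so there are three weight matrices instead of two.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
seq_len, d, hidden = 6, 8, 32                     # hidden = 4 * d
x = rng.normal(size=(seq_len, d))
W1, b1 = rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, d)) * 0.1, np.zeros(d)
W_gate = rng.normal(size=(d, hidden)) * 0.1
W_up = rng.normal(size=(d, hidden)) * 0.1
W_down = rng.normal(size=(hidden, d)) * 0.1

y_gelu = ffn_gelu(x, W1, b1, W2, b2)
y_swiglu = ffn_swiglu(x, W_gate, W_up, W_down)

# Position-wise: running one token alone gives the same row as running the batch.
row_alone = ffn_gelu(x[2:3], W1, b1, W2, b2)
print(y_gelu.shape, y_swiglu.shape, np.allclose(row_alone[0], y_gelu[2]))
```

Both variants map (seq_len × d_model) to the same shape, so either can sit in the FFN slot of the block.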

Step by step

Input tokens (embeddings + positional info) enter the block.
LayerNorm, then multi-head self-attention mixes information across tokens.
Add the attention output back to the input (residual).
LayerNorm again, then a per-token feed-forward MLP transforms each token.
Add the MLP output back (residual) → block output, fed to the next block.
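
The steps above can be sketched as one NumPy function. This is a minimal sketch under stated assumptions, not a reference implementation: no learned LayerNorm parameters, no biases, no dropout, a causal mask as in decoder models, and illustrative names throughout.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(z):
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split the feature axis into heads: (n_heads, seq_len, d_head).
    split = lambda t: t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask (decoder-style): a token may not attend to later positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v                                # (n_heads, seq_len, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)   # merge heads
    return out @ Wo

def transformer_block(x, p, n_heads):
    # LayerNorm, multi-head self-attention, then the residual add.
    x = x + multi_head_attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"], n_heads)
    # LayerNorm again, per-token MLP, then the second residual add.
    x = x + gelu(layer_norm(x) @ p["W1"]) @ p["W2"]
    return x

rng = np.random.default_rng(0)
seq_len, d_model, n_heads, n_blocks = 6, 16, 4, 3

def init_block():
    w = lambda *shape: rng.normal(size=shape) * 0.1
    return {"Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
            "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
            "W1": w(d_model, 4 * d_model), "W2": w(4 * d_model, d_model)}

x = rng.normal(size=(seq_len, d_model))
out = x
for p in [init_block() for _ in range(n_blocks)]:
    out = transformer_block(out, p, n_heads)   # output of one block feeds the next

print(x.shape, out.shape)
```

Because each block returns the same (seq_len × d_model) shape it receives, the loop can chain any number of blocks without reshaping.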

Interview questions & answers

What are the two sublayers of a Transformer block and what does each do?

Multi-head self-attention (tokens exchange information across the sequence) and a position-wise feed-forward MLP (each token is transformed independently). "Communicate, then think."

Why residual connections?

Writing x + f(x) lets each sublayer learn only a correction and gives gradients a direct path backward through the addition. Because that identity path bypasses every sublayer, gradients do not have to vanish across dozens of layers, and very deep stacks remain trainable.

Pre-norm vs post-norm — what's the difference and which is used?

Post-norm (original) applies LayerNorm AFTER the sublayer+residual; pre-norm applies it BEFORE the sublayer. Pre-norm keeps the residual path clean and trains deep models more stably, so nearly all modern LLMs use it.

Where are most of the parameters in a block?

In the feed-forward MLP. With hidden width ~4× d_model it has about 8·d_model² weights, against 4·d_model² for the attention projections (Q, K, V, O), so it holds roughly two-thirds of the block's parameters.

Common pitfalls

Forgetting the residual adds — without them deep stacks won't train.
Confusing depth (more blocks) with width (bigger d_model / more heads).
Thinking the MLP mixes tokens — only attention does; the MLP is per-token.

Where it shows up

Every Transformer LLM (GPT, Llama, Mistral, Claude) — stacked decoder blocks
BERT/ViT — stacked encoder blocks
The repeating unit you scale in depth and width
