🧠 The Transformer Block — AI / ML Interview Guide
LLM Internals · interactive visualization + interview prep
What it is
A Transformer is a stack of identical BLOCKS. One block does two things in turn: mix information across tokens (self-attention), then transform each token on its own (a feed-forward MLP) — each wrapped in a LayerNorm and a residual connection. Stack N of these and you have a GPT/Llama.
Mental model
Think of one block as "communicate, then think". Attention is the COMMUNICATION phase — every token gathers what it needs from the others. The MLP is the per-token THINKING phase — each token processes what it gathered, alone. Residual connections mean each phase only ADDS a correction to a running representation, so the signal (and the gradient) flows straight through the whole deep stack.
Theory
The block is the repeated unit of the architecture. Modern LLMs use PRE-NORM: x = x + Attention(LayerNorm(x)), then x = x + FFN(LayerNorm(x)). Normalizing BEFORE each sublayer (rather than after, as in the original paper) keeps the residual path clean and makes very deep stacks train stably.
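The two pre-norm updates can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the causal mask, the tanh GELU approximation, the 0.02 initialization scale, the omission of LayerNorm's learned scale and shift, and all names (pre_norm_block, init_params, the parameter keys) are choices made here for illustration. Tokens are rows, so weights multiply on the right.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token (row) to zero mean and unit variance.
    # Learned scale/shift are omitted to keep the sketch short (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_self_attention(x, p, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["Wq"]), split(x @ p["Wk"]), split(x @ p["Wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    # Causal mask (decoder-style assumption): no token sees later positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v                             # (n_heads, seq, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ p["Wo"]

def ffn(x, p):
    # Per-token 2-layer MLP with a 4x hidden width.
    return gelu(x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]

def pre_norm_block(x, p, n_heads):
    x = x + causal_self_attention(layer_norm(x), p, n_heads)  # communicate
    x = x + ffn(layer_norm(x), p)                             # think
    return x

def init_params(d_model, rng, scale=0.02):
    d_ff = 4 * d_model
    return {
        "Wq": rng.normal(0, scale, (d_model, d_model)),
        "Wk": rng.normal(0, scale, (d_model, d_model)),
        "Wv": rng.normal(0, scale, (d_model, d_model)),
        "Wo": rng.normal(0, scale, (d_model, d_model)),
        "W1": rng.normal(0, scale, (d_model, d_ff)), "b1": np.zeros(d_ff),
        "W2": rng.normal(0, scale, (d_ff, d_model)), "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
params = init_params(d_model, rng)
y = pre_norm_block(x, params, n_heads)
print(y.shape)  # (5, 16): same shape as the input
```

Because the output has the same shape as the input, the result can be passed straight into another call of pre_norm_block with a fresh parameter set.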
Sublayer one is multi-head self-attention, the only place tokens exchange information. Sublayer two is a position-wise feed-forward network: a 2-layer MLP whose hidden layer is typically ~4× the model width, with a GELU nonlinearity (or a gated variant such as SwiGLU), applied independently to each token. Attention moves information between positions; the MLP transforms it within a position.
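A small check of the "applied independently to each token" claim, as a sketch: running the MLP on the whole sequence should match running it one row at a time, and perturbing one token should change only that token's output. ReLU stands in for GELU/SwiGLU here, and the sizes, weights, and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
d_ff = 4 * d_model  # ~4x hidden width, as in the text

W1 = rng.normal(0, 0.1, (d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.1, (d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    # ReLU is a stand-in nonlinearity (assumption) to keep the sketch short.
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

x = rng.normal(size=(seq_len, d_model))

# 1) Whole sequence at once vs. one token at a time.
whole = ffn(x)
row_by_row = np.stack([ffn(x[i]) for i in range(seq_len)])
same = np.allclose(whole, row_by_row)

# 2) Perturb token 2 only and see which output rows move.
x_perturbed = x.copy()
x_perturbed[2] += 1.0
changed_rows = np.any(~np.isclose(ffn(x_perturbed), whole), axis=1)

print(same)          # True
print(changed_rows)  # [False False  True False]
```

Running the same perturbation through an attention sublayer would change other rows as well, which is the difference between the two sublayers.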
Residual connections are load-bearing. Writing each sublayer as x + f(x) means f only has to learn a CORRECTION to the identity, and gradients get an unobstructed path back through the addition — this is what made training networks dozens to hundreds of layers deep feasible (the same idea as ResNets).
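The "unobstructed path" can be made explicit with a short derivation. The notation is introduced here for illustration: y is the output of one residual sublayer, L is the loss, and f_l is the l-th sublayer in a stack (with its normalization folded in).

```latex
y = x + f(x)
\quad\Longrightarrow\quad
\frac{\partial y}{\partial x} = I + \frac{\partial f}{\partial x}

\frac{\partial \mathcal{L}}{\partial x}
= \left(I + \frac{\partial f}{\partial x}\right)^{\top}
  \frac{\partial \mathcal{L}}{\partial y}
= \underbrace{\frac{\partial \mathcal{L}}{\partial y}}_{\text{identity path}}
+ \underbrace{\left(\frac{\partial f}{\partial x}\right)^{\top}
  \frac{\partial \mathcal{L}}{\partial y}}_{\text{path through } f}

x_N = x_0 + \sum_{l=1}^{N} f_l(x_{l-1})
```

The first term reaches x unchanged however small the Jacobian of f is, and the last line shows the output of N stacked sublayers as the input plus a sum of corrections.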
Everything is shape-preserving: a block maps a (seq_len × d_model) tensor to the same shape, which is exactly why you can stack N identical blocks. Depth (more blocks) and width (larger d_model / more heads) are the two axes you scale.
A full decoder LM is: token embedding + positional info → N blocks → a final LayerNorm → an unembedding (often tied to the input embedding) → softmax over the vocabulary. The block is the part that repeats; the rest is the input/output plumbing.
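The pipeline in the paragraph above can be sketched end to end. This is a toy under loud assumptions: the block is a compact single-head stand-in with a ReLU MLP and no biases, positions use a learned absolute embedding table (only one way to supply positional info), and every name and size (decoder_lm, init_model, vocab=50, and so on) is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def block(x, p):
    # Stand-in pre-norm block: single-head causal attention + ReLU MLP.
    seq_len, d_model = x.shape
    h = layer_norm(x)
    scores = (h @ p["Wq"]) @ (h @ p["Wk"]).T / np.sqrt(d_model)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    x = x + softmax(scores) @ (h @ p["Wv"]) @ p["Wo"]
    h = layer_norm(x)
    x = x + np.maximum(0.0, h @ p["W1"]) @ p["W2"]
    return x

def decoder_lm(token_ids, model):
    seq_len = len(token_ids)
    # Token embedding + positional info (learned absolute table: assumption).
    x = model["tok_emb"][token_ids] + model["pos_emb"][:seq_len]
    for p in model["blocks"]:          # N blocks: the part that repeats
        x = block(x, p)
    x = layer_norm(x)                  # final LayerNorm
    logits = x @ model["tok_emb"].T    # unembedding tied to the input embedding
    return softmax(logits)             # next-token distribution per position

def init_model(vocab, d_model, n_blocks, max_len, rng):
    def w(*shape):
        return rng.normal(0, 0.02, shape)
    blocks = [
        {"Wq": w(d_model, d_model), "Wk": w(d_model, d_model),
         "Wv": w(d_model, d_model), "Wo": w(d_model, d_model),
         "W1": w(d_model, 4 * d_model), "W2": w(4 * d_model, d_model)}
        for _ in range(n_blocks)
    ]
    return {"tok_emb": w(vocab, d_model), "pos_emb": w(max_len, d_model),
            "blocks": blocks}

rng = np.random.default_rng(0)
model = init_model(vocab=50, d_model=16, n_blocks=3, max_len=32, rng=rng)
probs = decoder_lm(np.array([3, 14, 15, 9]), model)
print(probs.shape)  # (4, 50): one distribution over the vocabulary per position
```

Changing n_blocks changes depth and changing d_model changes width; neither requires touching decoder_lm, because every block preserves the (seq_len × d_model) shape.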
Concrete example
GPT-3 is 96 of these blocks with d_model = 12288 and 96 heads; Llama-3-8B is 32 blocks. Same block, just stacked more / wider. With a 4× hidden width, the MLP holds roughly two-thirds of each block's weights; the attention projections are the rest.
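The two-thirds figure follows from simple arithmetic, sketched here for the GPT-3 shape quoted above. The count assumes standard multi-head attention with four d_model × d_model projections and a non-gated MLP with a 4× hidden width, and it ignores biases, LayerNorm parameters, and embeddings. The helper name block_params is made up for illustration, and the formula does not transfer directly to models that use grouped-query attention or a gated MLP.

```python
def block_params(d_model, ff_mult=4):
    # Attention: W_Q, W_K, W_V, W_O, each d_model x d_model.
    attn = 4 * d_model * d_model
    # MLP: W1 (d_model x ff_mult*d_model) and W2 (ff_mult*d_model x d_model).
    mlp = 2 * d_model * (ff_mult * d_model)
    return attn, mlp

d_model, n_blocks = 12288, 96          # GPT-3 shape from the text
attn, mlp = block_params(d_model)
total = n_blocks * (attn + mlp)        # 12 * d_model^2 per block
mlp_share = mlp / (attn + mlp)

print(f"{total / 1e9:.1f}B weights in the blocks, MLP share {mlp_share:.2f}")
# 173.9B weights in the blocks, MLP share 0.67
```

The block weights alone land close to GPT-3's nominal 175B; the remainder is mostly embeddings and the small terms ignored above.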
Key equations
pre-norm block:
  x = x + MultiHeadAttention(LayerNorm(x))
  x = x + FFN(LayerNorm(x))
FFN(x) = W₂ · σ(W₁·x + b₁) + b₂   (σ = GELU/SwiGLU, hidden ≈ 4·d)
shape in = shape out = (seq_len × d_model) → stack N blocks
decoder LM = embed → N blocks → final LN → unembed → softmax
Step by step
- Input tokens (embeddings + positional info) enter the block.
- LayerNorm, then multi-head self-attention mixes information across tokens.
- Add the attention output back to the input (residual).
- LayerNorm again, then a per-token feed-forward MLP transforms each token.
- Add the MLP output back (residual) → block output, fed to the next block.
Interview questions & answers
What are the two sublayers of a Transformer block and what does each do?
Multi-head self-attention (tokens exchange information across the sequence) and a position-wise feed-forward MLP (each token is transformed independently). "Communicate, then think."
Why residual connections?
Writing x + f(x) lets each sublayer learn only a correction and gives gradients a direct path backward through the addition, so very deep stacks remain trainable (the identity path keeps the gradient from vanishing as it passes through dozens of layers).
Pre-norm vs post-norm — what's the difference and which is used?
Post-norm (original) applies LayerNorm AFTER the sublayer+residual; pre-norm applies it BEFORE the sublayer. Pre-norm keeps the residual path clean and trains deep models more stably, so nearly all modern LLMs use it.
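One way to see what "keeps the residual path clean" means, as a sketch: write both orderings around an arbitrary sublayer and plug in a sublayer that outputs zeros. Pre-norm then passes x through untouched, while post-norm still rewrites it. The function names and the zero sublayer are illustrative assumptions, and LayerNorm's learned scale and shift are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm(x, sublayer):
    # Normalize only the sublayer's input; the residual path is plain x.
    return x + sublayer(layer_norm(x))

def post_norm(x, sublayer):
    # Original ordering: normalize after the residual add,
    # so LayerNorm sits on the residual path itself.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(4, 8))

def zero_sublayer(h):
    # A sublayer that contributes nothing, to isolate the residual path.
    return np.zeros_like(h)

pre_is_identity = np.array_equal(pre_norm(x, zero_sublayer), x)
post_is_identity = np.allclose(post_norm(x, zero_sublayer), x)

print(pre_is_identity)   # True
print(post_is_identity)  # False
```

In a post-norm stack every block renormalizes the running representation, so there is no pure identity route from the input to the output.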
Where are most of the parameters in a block?
In the feed-forward MLP: with hidden width ~4× d_model it holds roughly two-thirds of the block's parameters (about 8·d_model² weights, versus 4·d_model² for the attention projections Q, K, V, O).
Common pitfalls
- Forgetting the residual adds — without them deep stacks won't train.
- Confusing depth (more blocks) with width (bigger d_model / more heads).
- Thinking the MLP mixes tokens — only attention does; the MLP is per-token.
Where it shows up
- Every Transformer LLM (GPT, Llama, Mistral, Claude) — stacked decoder blocks
- BERT/ViT — stacked encoder blocks
- The repeating unit you scale in depth and width