🧠 Normalization (LayerNorm / RMSNorm) — AI / ML Interview Guide

LLM Internals · interactive visualization + interview prep

What it is

Normalization keeps activations in a stable range so deep networks train well. LayerNorm rescales each token's feature vector to zero mean and unit variance, then applies a learned scale and shift. RMSNorm is a cheaper variant that only divides by the root-mean-square and applies a learned scale. In practice it matches LayerNorm quality, so many modern LLMs prefer it.

Mental model

A volume normalizer on each token's activations. Whatever the incoming "loudness", it's rescaled to a consistent level before the next sublayer, so no feature blows up or vanishes as signals pass through dozens of layers. LayerNorm re-centers AND rescales; RMSNorm skips the re-centering and only rescales. That is one fewer step, and the results are comparable in practice, although the outputs are not identical: RMSNorm output is not forced to zero mean.

Theory

Deep stacks suffer when the scale of activations drifts layer to layer — gradients explode or vanish and training destabilizes. Normalization fixes the scale at each sublayer so every layer sees inputs in a predictable range, which is what makes very deep Transformers trainable.
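
The drift described above can be seen in a toy stack. This is a minimal sketch, not a real Transformer: the layers are random linear maps with no nonlinearity, and the width, depth, and weight gain of 1.5 are assumptions chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 30                       # assumed toy width and depth

def rms(v):
    return np.sqrt(np.mean(v ** 2))

x = rng.standard_normal(d)
plain, normed = x.copy(), x.copy()
for _ in range(depth):
    # Weight std 1.5/sqrt(d): each layer multiplies the activation scale by about 1.5.
    W = rng.standard_normal((d, d)) * (1.5 / np.sqrt(d))
    plain = W @ plain                       # scale compounds layer after layer
    normed = W @ (normed / rms(normed))     # rescale to RMS 1 before each layer

print(f"without norm: {rms(plain):.3g}   with norm: {rms(normed):.3g}")
```

Without the rescaling step the scale grows geometrically with depth; with it, each layer sees an input of RMS 1, so the output scale stays near the per-layer gain.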

LayerNorm normalizes ACROSS THE FEATURES of a single token: subtract that token's mean, divide by its standard deviation, then apply a learned per-feature scale γ and shift β. Crucially it is independent of other tokens and of batch size, so it behaves identically at training and inference and for any sequence length.
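
A minimal NumPy sketch of this, assuming activations are laid out as (tokens, features). The function name `layer_norm` and the ε default are illustrative choices, not taken from any particular library.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last (feature) axis, one token at a time."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)          # population variance
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d = 8
gamma, beta = np.ones(d), np.zeros(d)
tokens = rng.standard_normal((5, d)) * 10.0 + 3.0   # 5 tokens, d features each

out_all = layer_norm(tokens, gamma, beta)           # normalized together
out_one = layer_norm(tokens[2:3], gamma, beta)      # token 2 normalized alone

# Token 2 gets the same output either way: no dependence on other tokens.
print(np.allclose(out_all[2], out_one[0]))
```

Because the statistics come from one token's own features, the result does not change with batch size or sequence length.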

This is the key contrast with BatchNorm, which normalizes each feature across the BATCH dimension. BatchNorm depends on batch statistics (awkward for variable-length sequences and small batches at inference), which is why Transformers use LayerNorm, not BatchNorm.
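
The difference comes down to which axis the statistics are taken over. The sketch below is a simplified illustration: `batch_norm_train` uses training-mode batch statistics only (no running averages, no learned parameters), and both function names are hypothetical.

```python
import numpy as np

def batch_norm_train(x, eps=1e-5):
    # Per-feature statistics, computed across the batch axis (axis 0).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # Per-row statistics, computed across the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
batch_a = rng.standard_normal((4, 6))
batch_b = batch_a.copy()
batch_b[1:] = rng.standard_normal((3, 6)) * 5.0   # same row 0, different neighbors

# Row 0 is identical in both batches; only its neighbors changed.
bn_changed = not np.allclose(batch_norm_train(batch_a)[0], batch_norm_train(batch_b)[0])
ln_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
print(bn_changed, ln_same)
```

The BatchNorm output for row 0 shifts when the rest of the batch changes, while the LayerNorm output does not.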

RMSNorm drops the mean-centering entirely: it divides by the root-mean-square of the features (√(mean(x²) + ε)) and scales by γ, with no β. It is cheaper (no mean, no subtraction, no bias) and empirically matches LayerNorm quality, so Llama, Mistral, and T5 adopt it. When a token's features already have zero mean, the RMS equals the standard deviation and the two norms coincide.
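
A minimal sketch of RMSNorm next to LayerNorm, again assuming a (tokens, features) layout. The function names and the ε value are illustrative assumptions.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Divide by the root-mean-square of the features; no mean, no bias.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d = 16
gamma = np.ones(d)
x = rng.standard_normal((3, d)) + 2.0               # tokens with a non-zero mean
x_centered = x - x.mean(axis=-1, keepdims=True)     # same tokens, mean removed

y = rms_norm(x, gamma)
# On zero-mean input, mean(x^2) equals the variance, so the two norms agree.
same_when_centered = np.allclose(
    rms_norm(x_centered, gamma),
    layer_norm(x_centered, gamma, np.zeros(d)),
)
print(same_when_centered, y.mean(axis=-1))
```

On the uncentered input the RMSNorm output has unit RMS but keeps a non-zero mean, which is the one place its behavior differs from LayerNorm.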

Placement matters too (covered in the Transformer Block): modern models put the norm BEFORE each sublayer (pre-norm) so the residual path stays unnormalized and clean, which trains more stably than the original post-norm design.
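
The two placements can be sketched as follows. This is an illustration under stated assumptions: `sublayer` is a hypothetical stand-in for attention or the MLP, the norm is RMSNorm without a learned γ, and the block functions are not any library's API.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def pre_norm_block(x, sublayer):
    # Norm sits inside the branch; the residual path carries x unchanged.
    return x + sublayer(rms_norm(x))

def post_norm_block(x, sublayer):
    # Norm is applied after the residual add, so it also rescales x itself.
    return rms_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) / np.sqrt(d)
sublayer = lambda h: np.tanh(h @ W)      # hypothetical stand-in sublayer
zero = lambda h: np.zeros_like(h)        # a sublayer that contributes nothing

x = rng.standard_normal((4, d)) * 3.0
print(np.allclose(pre_norm_block(x, zero), x))    # residual path is an exact identity
print(np.allclose(post_norm_block(x, zero), x))   # post-norm still rescales x
```

With a sublayer that outputs zeros, the pre-norm block passes x through untouched, while the post-norm block still normalizes it. That untouched identity path is what "clean residual" refers to.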

Concrete example

GPT and BERT use LayerNorm; Llama, Mistral, and T5 use RMSNorm. Switching LayerNorm → RMSNorm removes the mean and bias computations from every norm in the network: a small but real inference speedup, typically with no measurable loss in quality.

Key equations

LayerNorm: μ = mean(x), σ² = var(x) (over features of one token)
y = γ · (x − μ)/√(σ² + ε) + β
RMSNorm: rms = √(mean(x²) + ε)
y = γ · x / rms (no mean subtraction, no bias)
BatchNorm: normalize each feature across the BATCH (rarely used in Transformers)

Step by step

Take one token's vector of activations.
LayerNorm: compute its mean and std across features.
Subtract the mean and divide by the std → zero mean, unit variance.
RMSNorm instead just divides by the root-mean-square (no mean step).
Apply the learned scale γ (and shift β for LayerNorm) → the sublayer's input.
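
The steps above, worked through on a single four-feature token. The input values and ε are arbitrary choices for illustration, and γ and β are shown at their usual initial values (ones and zeros) rather than trained values.

```python
import numpy as np

eps = 1e-5
x = np.array([1.0, 2.0, 3.0, 4.0])       # one token's activations
gamma = np.ones_like(x)                   # learned scale (initial value)
beta = np.zeros_like(x)                   # learned shift, LayerNorm only

# LayerNorm path: mean and variance across features, then normalize.
mu = x.mean()                             # 2.5
var = x.var()                             # 1.25
x_hat = (x - mu) / np.sqrt(var + eps)     # approx [-1.3416, -0.4472, 0.4472, 1.3416]
y_ln = gamma * x_hat + beta

# RMSNorm path: divide by the root-mean-square, no mean step.
rms = np.sqrt(np.mean(x ** 2) + eps)      # approx 2.7386
y_rms = gamma * x / rms                   # approx [0.3651, 0.7303, 1.0954, 1.4606]

print(y_ln)
print(y_rms)
```

The LayerNorm output is symmetric around zero, while the RMSNorm output keeps the sign and relative ordering of the raw inputs and only changes their scale.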

Interview questions & answers

LayerNorm vs BatchNorm — why do Transformers use LayerNorm?

LayerNorm normalizes across a single token's features (independent of batch and sequence length); BatchNorm normalizes across the batch, whose statistics are unstable for variable-length sequences and small inference batches. LayerNorm behaves the same in train and inference.

What does RMSNorm change vs LayerNorm?

It removes the mean-centering and the bias β — it only divides by the root-mean-square and scales by γ. Cheaper to compute, comparable quality; standard in Llama/Mistral/T5.

Why is normalization needed at all?

It keeps activation scale stable across many layers, preventing the exploding/vanishing gradients that otherwise make deep stacks fail to train, and it lets you use higher learning rates.

What are γ and β?

Learned per-feature scale (γ) and shift (β) applied after normalizing, so the layer can recover any representation it needs rather than being locked to zero mean / unit variance. RMSNorm keeps γ but drops β.
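
A small sketch of what γ and β do after normalization. The constant values 2 and 3 are arbitrary stand-ins for learned parameters, and `layer_norm` is an illustrative helper rather than a library function.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((3, d))

# gamma = 1, beta = 0 (the usual initialization): plain zero mean / unit variance.
unit = layer_norm(x, np.ones(d), np.zeros(d))
# Other values move the output to whatever scale and offset they encode.
shifted = layer_norm(x, np.full(d, 2.0), np.full(d, 3.0))

print(unit.mean(axis=-1), unit.std(axis=-1))         # close to 0 and 1
print(shifted.mean(axis=-1), shifted.std(axis=-1))   # close to 3 and 2
```

In a trained model γ and β differ per feature, so the network can amplify, mute, or offset individual features instead of being held at a fixed scale.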

Common pitfalls

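Using BatchNorm in a Transformer: batch statistics are unreliable for variable-length, padded text and for small inference batches.
Forgetting the ε term → division by zero (or by a tiny number) when a token's activations are near-constant (LayerNorm) or near-zero (RMSNorm).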
Confusing what each norm reduces over: LayerNorm = features, BatchNorm = batch.
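
The ε pitfall is easy to reproduce. A minimal sketch, assuming a token whose features are all identical so that its variance is exactly zero; the helper name and values are illustrative.

```python
import numpy as np

def layer_norm(x, eps):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

constant = np.full((1, 4), 7.0)      # every feature identical, so variance is 0

# Silence NumPy's warning so the nan result can be inspected directly.
with np.errstate(divide="ignore", invalid="ignore"):
    unsafe = layer_norm(constant, eps=0.0)    # 0 / 0 gives nan

safe = layer_norm(constant, eps=1e-5)         # 0 / sqrt(eps) gives 0

print(unsafe)
print(safe)
```

Once a nan appears it propagates through every later layer, which is why ε stays in the denominator even though it barely changes the result on ordinary inputs.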

Where it shows up

LayerNorm: original Transformer, GPT, BERT
RMSNorm: Llama, Mistral, T5, most modern open LLMs
Pre-norm placement inside every Transformer block
