🧠 Positional Encoding — AI / ML Interview Guide

LLM Internals · interactive visualization + interview prep


What it is

Self-attention has no built-in sense of ORDER — it sees a set of tokens, so "dog bites man" and "man bites dog" would look the same. Positional encoding fixes this by adding a unique, position-dependent vector to each token embedding, so the model knows where each token sits.

Mental model

Attention is order-blind: shuffle the tokens and it computes the same thing. So before attention we STAMP each position with a unique fingerprint vector and add it to that token's embedding. The token now carries both "what I am" (the embedding) and "where I am" (the fingerprint). Sinusoidal encoding makes that fingerprint a stack of clock hands ticking at different speeds — fast hands distinguish neighbors, slow hands distinguish far-apart positions.

Theory

Self-attention is permutation-equivariant — it treats its input as a SET, so "dog bites man" and "man bites dog" are indistinguishable to it. Since order is essential to language, we must inject position information explicitly. The original Transformer does this by ADDING a position-dependent vector to each token embedding (not concatenating — it keeps the dimension unchanged and lets attention's dot products pick up positional structure).
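The order-blindness described above can be checked with a small sketch. It assumes a single attention head with identity Q/K/V projections and made-up two-dimensional embeddings; the helper names (self_attention, close) are illustrative, not from any library. Without positional information, each token receives the same output vector whichever sentence order is used.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Weighted average of the value vectors (here the inputs themselves).
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

def close(u, v, tol=1e-9):
    return all(abs(a - b) <= tol for a, b in zip(u, v))

# Toy embeddings with made-up values, no positional encoding added.
dog, bites, man = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

out_a = self_attention([dog, bites, man])  # "dog bites man"
out_b = self_attention([man, bites, dog])  # "man bites dog"

# The outputs are merely reordered: each token's vector is unchanged.
dog_same = close(out_a[0], out_b[2])
bites_same = close(out_a[1], out_b[1])
man_same = close(out_a[2], out_b[0])
```

All three flags come out True, which is what permutation-equivariant means in practice: the layer has no way to tell the two sentences apart beyond the order of its outputs.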

The sinusoidal scheme groups the encoding dimensions into pairs and gives pair i a sine and a cosine of the position at a wavelength that grows geometrically with i: low dimensions oscillate fast (fine, local position), high dimensions oscillate slowly (coarse, long-range position). Together they act like a smooth, continuous binary clock, with the low dimensions playing the role of the fast-flipping low-order bits, giving every position a unique bounded fingerprint.

A key property is that the encoding makes RELATIVE position linearly accessible: for a fixed offset k, PE(pos+k) is a fixed linear function of PE(pos) (a rotation of each sine/cosine pair by an angle that depends only on k), so attention can learn to reason about offsets like "three tokens back" with a linear map. Because the functions are bounded and defined for any position, sinusoidal encodings can also be computed for sequence lengths never seen in training, although how well a trained model actually performs at those lengths is not guaranteed.
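The rotation property can be verified numerically. The sketch below is a plain-Python check under the standard sinusoidal definition; the function names pe and shift are illustrative. It applies a rotation that depends only on the offset k to PE(pos) and compares the result with PE(pos+k).

```python
import math

def pe(pos, d, base=10000.0):
    """Sinusoidal encoding of one position as [sin, cos, sin, cos, ...]."""
    vec = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        vec += [math.sin(angle), math.cos(angle)]
    return vec

def shift(vec, k, d, base=10000.0):
    """Rotate each (sin, cos) pair by its frequency times k. Uses k only, never pos."""
    out = []
    for i in range(d // 2):
        w = 1.0 / base ** (2 * i / d)  # angular frequency of pair i
        s, c = vec[2 * i], vec[2 * i + 1]
        out += [s * math.cos(w * k) + c * math.sin(w * k),    # sin(w*(pos+k))
                -s * math.sin(w * k) + c * math.cos(w * k)]   # cos(w*(pos+k))
    return out

d, pos, k = 8, 5, 3
predicted = shift(pe(pos, d), k, d)
actual = pe(pos + k, d)
max_err = max(abs(a - b) for a, b in zip(predicted, actual))
```

The error max_err is at floating-point noise level, and the same shift works for any starting pos, which is why a single learned linear map can represent "k tokens away".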

There is a design spectrum. ABSOLUTE encodings add a per-position vector, either fixed sinusoidal or learned parameters (BERT/GPT-2); learned ones are flexible but capped at the trained length. RELATIVE encodings inject the distance between tokens directly into attention scores. RoPE (Rotary Position Embedding) rotates the Q and K vectors by an angle proportional to position, so their dot product depends on position only through the relative offset between the two tokens; it is now standard in modern LLMs (Llama), partly for its length generalization. ALiBi instead adds a bias to the scores that grows more negative in proportion to distance.
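To make the RoPE idea concrete, here is a minimal sketch that rotates consecutive pairs of a vector by position-dependent angles. It assumes the same base of 10000 as the sinusoidal scheme and uses made-up q and k values; rope and dot are illustrative helpers, not a library API, and production implementations differ in details such as how dimensions are paired.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair (x, y) of vec by the angle pos / base**(2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = pos / base ** (2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Made-up query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_near = dot(rope(q, 7), rope(k, 4))       # positions 7 and 4, offset 3
score_far = dot(rope(q, 107), rope(k, 104))    # positions 107 and 104, offset 3
score_other = dot(rope(q, 7), rope(k, 2))      # positions 7 and 2, offset 5
```

score_near and score_far agree up to floating-point error because both pairs are three positions apart, while score_other differs because the offset changed. The content of q and k still matters; only the positional dependence is reduced to the offset.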

Keep the two roles distinct: the token embedding encodes WHAT a token is; the positional encoding encodes WHERE it is. They are summed into a single input vector, and the model learns to disentangle and use both.

Concrete example

The original Transformer adds SINUSOIDAL encodings: each dimension is a sine/cosine of the position at a different frequency. Low dimensions oscillate fast (fine position), high dimensions slowly (coarse position), forming a smooth “binary clock” the model reads to recover order.

Key equations

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
pair i has wavelength 2π · 10000^(2i/d), for i = 0 … d/2 − 1 (d = embedding dimension)
input = token embedding + PE(position)
PE(pos + k) = R_k · PE(pos), where the linear map R_k depends only on the offset k (helps the model use relative position)
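As a worked check of the equations above, the sine/cosine pair with index i completes one full cycle every 2π · 10000^(2i/d) positions. The short sketch below lists those wavelengths for an assumed d = 8; the function name wavelengths is illustrative.

```python
import math

def wavelengths(d, base=10000.0):
    """Wavelength, in positions, of each sin/cos pair i = 0 .. d/2 - 1."""
    return [2 * math.pi * base ** (2 * i / d) for i in range(d // 2)]

lam = wavelengths(8)

# Ratio between consecutive wavelengths: constant, so the growth is geometric.
ratios = [lam[i + 1] / lam[i] for i in range(len(lam) - 1)]
```

The first pair has wavelength 2π (about 6.28 positions), so it changes noticeably between neighboring tokens. Each later pair is longer by the same factor 10000^(2/d), which is 10 for d = 8, so the last pairs barely move across a short sequence.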

Step by step

Build a matrix: rows = positions, columns = embedding dimensions.
Each column is a sine/cosine wave at its own frequency (fast → slow as the column index grows).
Each position gets a unique row — its fingerprint vector.
Add that row to the token’s embedding so attention can use order.
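The steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: real code would normally use NumPy or a deep-learning framework, and the toy token embeddings and the names positional_matrix and add_positions are assumptions made for the example.

```python
import math

def positional_matrix(num_positions, d, base=10000.0):
    """Rows = positions, columns = embedding dimensions (d must be even)."""
    assert d % 2 == 0
    pe = [[0.0] * d for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(d // 2):
            angle = pos / base ** (2 * i / d)
            pe[pos][2 * i] = math.sin(angle)      # even column: sine
            pe[pos][2 * i + 1] = math.cos(angle)  # odd column: cosine
    return pe

def add_positions(embeddings, pe):
    """Element-wise sum of each token embedding with its position's row."""
    return [[e + p for e, p in zip(emb_row, pe_row)]
            for emb_row, pe_row in zip(embeddings, pe)]

# Made-up embeddings; rows 0 and 2 are the same token at different positions.
tokens = [[0.5, -0.1, 0.2, 0.0],
          [0.1, 0.3, -0.4, 0.7],
          [0.5, -0.1, 0.2, 0.0]]

pe = positional_matrix(len(tokens), d=4)
inputs = add_positions(tokens, pe)
```

Position 0 gets the row [0.0, 1.0, 0.0, 1.0], since sin(0) = 0 and cos(0) = 1. Rows 0 and 2 of inputs differ even though the token is identical, which is exactly the information attention needs to recover order.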

Interview questions & answers

Why does self-attention need positional encoding?

Self-attention on its own is permutation-equivariant: reordering the input tokens just reorders the outputs, so the model effectively sees an unordered set. Without position info, it can’t distinguish word order, which is essential for language.

Why sinusoidal functions specifically?

They give every position a unique, bounded encoding, are defined for sequence lengths unseen in training, and make relative positions expressible as linear transforms, which helps attention reason about offsets.

Absolute vs relative vs rotary (RoPE) positional encoding?

Absolute adds a per-position vector (sinusoidal/learned). Relative encodes distances between tokens. RoPE rotates Q/K by position so attention depends on relative position — now common in modern LLMs (Llama) for better length generalization.

Learned vs fixed positional embeddings?

Learned ones are trained parameters (flexible, but limited to trained lengths). Fixed sinusoidal ones need no training and extrapolate better to longer sequences.
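The length limit of learned embeddings can be illustrated with a toy lookup table. In this sketch the randomly initialized learned_table stands in for trained parameters, and MAX_LEN, learned_pe, and sinusoidal_pe are hypothetical names chosen for the example.

```python
import math
import random

MAX_LEN, D = 4, 4  # assumed training length and embedding size

random.seed(0)
# Stand-in for a trained table: one row per position seen in training.
learned_table = [[random.gauss(0.0, 0.02) for _ in range(D)] for _ in range(MAX_LEN)]

def learned_pe(pos):
    return learned_table[pos]  # raises IndexError when pos >= MAX_LEN

def sinusoidal_pe(pos, d=D, base=10000.0):
    vec = []
    for i in range(d // 2):
        angle = pos / base ** (2 * i / d)
        vec += [math.sin(angle), math.cos(angle)]
    return vec

beyond = MAX_LEN + 10  # a position never seen in training

fixed_vec = sinusoidal_pe(beyond)  # always computable, values stay in [-1, 1]
try:
    learned_pe(beyond)
    learned_ok = True
except IndexError:
    learned_ok = False  # the table simply has no row for this position
```

The fixed encoding yields a valid vector at any position, whereas the table has nothing to return past MAX_LEN. Being computable does not by itself guarantee that a trained model behaves well at those positions.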

Common pitfalls

Forgetting positional info entirely → the model ignores word order.
Using learned absolute encodings, which cannot represent positions beyond the trained length.
Confusing positional encoding (where) with the token embedding (what).

Where it shows up

Original Transformer (sinusoidal), BERT/GPT-2 (learned)
RoPE in Llama/modern LLMs; ALiBi for long context
