🧠 Positional Encoding — AI / ML Interview Guide
LLM Internals · interactive visualization + interview prep
What it is
Self-attention has no built-in sense of ORDER — it sees a set of tokens, so "dog bites man" and "man bites dog" would look the same. Positional encoding fixes this by adding a unique, position-dependent vector to each token embedding, so the model knows where each token sits.
Mental model
Attention is order-blind: shuffle the tokens and it computes the same thing. So before attention we STAMP each position with a unique fingerprint vector and add it to that token's embedding. The token now carries both "what I am" (the embedding) and "where I am" (the fingerprint). Sinusoidal encoding makes that fingerprint a stack of clock hands ticking at different speeds — fast hands distinguish neighbors, slow hands distinguish far-apart positions.
Theory
Self-attention is permutation-equivariant — it treats its input as a SET, so "dog bites man" and "man bites dog" are indistinguishable to it. Since order is essential to language, we must inject position information explicitly. The original Transformer does this by ADDING a position-dependent vector to each token embedding (not concatenating — it keeps the dimension unchanged and lets attention's dot products pick up positional structure).
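The equivariance claim can be checked numerically. The sketch below is a deliberately stripped-down attention layer, assuming a single head with Q = K = V = X and no learned weights (a simplification for illustration, not the full Transformer layer); the function names are hypothetical.

```python
import math
import random

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head attention with Q = K = V = X (no weights, no positions)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

random.seed(0)
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]  # 3 tokens, d = 4
perm = [2, 0, 1]
X_perm = [X[p] for p in perm]  # same tokens, shuffled order

Y = self_attention(X)
Y_perm = self_attention(X_perm)

# Equivariance: row i of the shuffled output should match row perm[i] of the original.
max_diff = max(
    abs(a - b)
    for i, p in enumerate(perm)
    for a, b in zip(Y_perm[i], Y[p])
)
```

Here max_diff should be zero up to floating-point rounding: each token gets the same output vector regardless of where it sits, which is exactly why position has to be injected separately.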
The sinusoidal scheme assigns each pair of dimensions (2i, 2i+1) a sine and a cosine of the position at a wavelength that grows geometrically with i, from 2π up to 10000·2π: low dimensions oscillate fast (fine, local position), high dimensions oscillate slowly (coarse, long-range position). Together they act like a smooth, continuous binary clock, giving every position a unique bounded fingerprint.
A key property is that the encoding makes RELATIVE position linearly accessible: for any fixed offset k, PE(pos+k) is a fixed linear function (a rotation that depends only on k) of PE(pos), so attention can learn to reason about offsets like "three tokens back" with a linear map. Because the functions are bounded and defined for any position, sinusoidal encodings can also be computed for sequence lengths never seen in training, although how well a model actually uses them beyond its training length varies in practice.
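The rotation property follows from the angle-addition identities. A short derivation, writing \omega_i = 10000^{-2i/d} for the frequency of the i-th sine/cosine pair (the symbol \omega_i is introduced here for convenience and does not appear in the original formulas):

```latex
\begin{aligned}
\sin\big(\omega_i(pos + k)\big) &= \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \\
\cos\big(\omega_i(pos + k)\big) &= \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k)
\end{aligned}
\quad\Longrightarrow\quad
\begin{pmatrix} PE(pos + k,\, 2i) \\ PE(pos + k,\, 2i + 1) \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} PE(pos,\, 2i) \\ PE(pos,\, 2i + 1) \end{pmatrix}
```

The 2×2 matrix depends on the offset k but not on pos, so stacking one such block per pair gives a single linear map that shifts any position by k.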
There is a design spectrum. ABSOLUTE encodings add a per-position vector, either fixed sinusoidal or learned parameters (BERT/GPT-2); learned ones are flexible but capped at the trained length. RELATIVE encodings inject the distance between tokens directly into attention scores. RoPE (Rotary Position Embedding) rotates the Q and K vectors by an angle proportional to position, so their dot product depends only on relative position; it is now standard in modern LLMs (Llama) and is valued for length generalization. ALiBi instead adds a bias to the scores that grows more negative linearly with the distance between query and key.
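To make the RoPE claim concrete, here is a minimal sketch, assuming adjacent dimensions are paired and the same base of 10000 as the sinusoidal scheme (real implementations differ in how they pair dimensions). The helper names rope and dot and the example vectors are hypothetical.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by angle pos * base^(-i/d), i even."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # standard 2-D rotation
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Illustrative query and key vectors (made-up values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

score_near_start = dot(rope(q, 5), rope(k, 2))       # query at 5, key at 2: offset 3
score_far_along = dot(rope(q, 105), rope(k, 102))    # offset 3 again, 100 tokens later
score_other_offset = dot(rope(q, 5), rope(k, 4))     # offset 1
```

The first two scores agree up to rounding because only the offset between query and key survives the dot product, while the third differs because the offset changed.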
Keep the two roles distinct: the token embedding encodes WHAT a token is; the positional encoding encodes WHERE it is. They are summed into a single input vector, and the model learns to disentangle and use both.
Concrete example
The original Transformer adds SINUSOIDAL encodings: each dimension is a sine/cosine of the position at a different frequency. Low dimensions oscillate fast (fine position), high dimensions slowly (coarse position), forming a smooth “binary clock” the model reads to recover order.
Key equations
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
each dimension i = a sinusoid of a different wavelength
input = token embedding + PE(position)
relative offsets become linear functions of position (helps generalization)
Step by step
- Build a matrix: rows = positions, columns = embedding dimensions.
- Each column is a sine/cosine wave at its own frequency (fast → slow as the column index grows).
- Each position gets a unique row — its fingerprint vector.
- Add that row to the token’s embedding so attention can use order.
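The steps above can be sketched in a few lines of plain Python. This is an illustrative implementation of the two PE formulas, assuming an even embedding dimension; the function name sinusoidal_pe and the toy token embeddings are hypothetical.

```python
import math

def sinusoidal_pe(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table as a list of rows (d_model assumed even)."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model // 2):
            angle = pos / base ** (2 * i / d_model)
            row.append(math.sin(angle))  # column 2i
            row.append(math.cos(angle))  # column 2i + 1
        table.append(row)
    return table

# Rows = positions, columns = embedding dimensions.
pe = sinusoidal_pe(num_positions=50, d_model=8)

# Made-up token embeddings: one 8-dim vector per token of a 3-token input.
token_embeddings = [[0.1 * (t + 1)] * 8 for t in range(3)]

# Add each position's row to the matching token embedding.
model_input = [
    [e + p for e, p in zip(emb, pe[pos])]
    for pos, emb in enumerate(token_embeddings)
]
```

Position 0 always maps to the row [0, 1, 0, 1, ...], since sin(0) = 0 and cos(0) = 1. Precomputing the table once and slicing it per batch is a common design choice, since the values never change during training.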
Interview questions & answers
Why does self-attention need positional encoding?
Attention is permutation-equivariant: permuting the input tokens just permutes the outputs, so it effectively treats its input as an unordered set. Without position info, the model can’t distinguish word order, which is essential for language.
Why sinusoidal functions specifically?
They give every position a unique, bounded encoding, are defined for sequence lengths unseen in training, and make relative positions expressible as linear transforms, which helps attention reason about offsets.
Absolute vs relative vs rotary (RoPE) positional encoding?
Absolute adds a per-position vector (sinusoidal/learned). Relative encodes distances between tokens. RoPE rotates Q/K by position so attention depends on relative position — now common in modern LLMs (Llama) for better length generalization.
Learned vs fixed positional embeddings?
Learned ones are trained parameters (flexible, but limited to trained lengths). Fixed sinusoidal ones need no training and are defined at any length, though quality beyond the trained length is not guaranteed.
Common pitfalls
- Forgetting positional info entirely → the model ignores word order.
- Using absolute encodings that don’t extrapolate beyond trained length.
- Confusing positional encoding (where) with the token embedding (what).
Where it shows up
- Original Transformer (sinusoidal), BERT/GPT (learned)
- RoPE in Llama/modern LLMs; ALiBi for long context
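As a rough illustration of the ALiBi entry above, the bias for one attention head can be sketched as follows. The slope value 0.5 is only an example (ALiBi uses a different fixed slope per head), folding the causal mask into the same matrix is a choice made here for brevity, and the function name alibi_bias is hypothetical.

```python
def alibi_bias(seq_len, slope):
    """Causal ALiBi-style bias for one head: 0 on the diagonal, more negative further back."""
    bias = []
    for i in range(seq_len):              # query position
        row = []
        for j in range(seq_len):          # key position
            if j <= i:
                row.append(-slope * (i - j))   # linear penalty in the distance
            else:
                row.append(float("-inf"))      # causal mask: no attending to the future
        bias.append(row)
    return bias

bias = alibi_bias(seq_len=4, slope=0.5)
# bias[3] is [-1.5, -1.0, -0.5, -0.0]: the last query penalizes older keys more.
```

The bias is added to the raw attention scores before the softmax, so no positional vector is ever added to the token embeddings.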