🧠 Tokenization (Byte-Pair Encoding) — AI / ML Interview Guide
LLM Internals · interactive visualization + interview prep
Open the interactive Tokenization (Byte-Pair Encoding) visualization on PrepGrind → Step through a live animation, tune the parameters, and read the full theory, math, reference code, and interview Q&A below — free, in your browser.
What it is
Models don’t read characters or whole words — they read "tokens", chunks learned from data. Byte-Pair Encoding (BPE) starts from single characters and repeatedly merges the most frequent adjacent pair, so common pieces ("ing", "un", "er") become single tokens while rare words stay split.
Mental model
A tokenizer is a learned compression scheme for text. It watches a corpus and whatever short character-sequences co-occur most often earn their own symbol. Frequent words and suffixes collapse to a single token; rare words stay broken into pieces. The "vocabulary" is just the ordered list of merges it learned — at encode time you replay those same merges on new text. So the model never sees letters or words, it sees these learned chunks.
Theory
Tokenization is the bridge between raw text and the model: it maps a string to a sequence of integer IDs. The granularity is a real trade-off. Word-level vocabularies explode in size and can't represent unseen words (the out-of-vocabulary problem). Character-level handles anything but makes sequences very long and forces the model to relearn spelling. Subword tokenization (BPE, WordPiece, Unigram) is the middle ground: a fixed-size vocabulary that can still encode ANY string by falling back to smaller pieces.
Byte-Pair Encoding learns its vocabulary greedily. Start with every character as its own token. Repeatedly count all adjacent token pairs across the corpus, merge the single most frequent pair into a new token, and record that merge. Each merge adds one vocabulary entry and can build on previous merges, so longer subwords emerge over time. You stop after a target number of merges (≈ the vocab size).
The ORDERED merge list IS the tokenizer. At encode time you apply the same merges in the same order to new text — this determinism is what makes "token" a precise, reproducible unit. Byte-level BPE (used by GPT) runs this over raw bytes rather than Unicode characters, so it can encode any input — emoji, code, other scripts — with no true OOV.
The main families differ in the merge criterion. BPE merges by raw frequency. WordPiece (BERT) merges the pair that most increases the corpus likelihood. Unigram / SentencePiece starts from a large candidate set and PRUNES tokens by a probabilistic model. All yield subword vocabularies; the differences mostly affect edge cases and morphology.
The practical consequences matter for any LLM engineer. A token is ~0.75 English words on average but varies wildly: rare words, code, and non-English text fragment into many more tokens — which directly raises API cost and consumes context window. Whitespace and capitalization create distinct tokens (" the", "the", "The" all differ), a common source of subtle prompt bugs.
Concrete example
"tokenizer" might split into "token" + "izer"; "unhappiness" into "un" + "happiness". Frequent suffixes like "ing" across "reading/singing/running" get merged early because the pair (i, n)→(in)→(ing) recurs often. This is why token counts (and API cost) don’t equal word or character counts.
Key equations
start: every character is its own tokenrepeat: count all adjacent token pairs → merge the most frequent pair into oneeach merge adds one entry to the vocabularystop after a fixed number of merges (≈ vocab size)fewer/longer tokens as merges increase; rare words stay fragmented
Step by step
- Split the text into individual characters (the starting tokens).
- Count every adjacent character pair across the corpus.
- Merge the most frequent pair everywhere into a single new token.
- Repeat — each merge can build longer subwords from the previous ones.
- After N merges you have the final segmentation; drag the slider to watch it form.
Interview questions & answers
Why subword tokenization instead of words or characters?
Word vocabularies explode and can’t handle unseen words (OOV); character models make sequences very long. Subwords (BPE/WordPiece/Unigram) balance both: a fixed vocab that still represents any string by falling back to smaller pieces.
How does BPE actually build its vocabulary?
Greedily: start from characters, repeatedly merge the most frequent adjacent pair into a new token, recording each merge. The ordered merge list IS the tokenizer; at encode time you apply the same merges in order.
Why don’t token counts equal word counts?
A common word may be one token; a rare or long word may be several; whitespace and capitalization create distinct tokens. That’s why API pricing and context limits are measured in tokens, not words.
BPE vs WordPiece vs Unigram?
BPE merges by frequency; WordPiece merges by likelihood gain; Unigram (SentencePiece) starts big and prunes by a probabilistic model. All produce subword vocabularies.
Common pitfalls
- Assuming 1 token ≈ 1 word — it’s closer to ~0.75 words in English, and varies by language.
- Forgetting whitespace/case create separate tokens (" the" ≠ "the" ≠ "The").
- Non-English / code text often tokenizes into far more tokens (more cost).
Where it shows up
- GPT (BPE), BERT (WordPiece), Llama/T5 (SentencePiece)
- Every LLM API’s token counting & pricing
- tiktoken / Hugging Face tokenizers
More AI / ML interview concepts
- Neural Networks & Backpropagation
- Gradient Descent & Optimizers
- Activation Functions
- K-Means Clustering
- Self-Attention
- Multi-Head Attention
- Softmax, Temperature & Sampling
- Positional Encoding
- KV Cache
- Rotary Position Embedding (RoPE)
- The Transformer Block
- Normalization (LayerNorm / RMSNorm)
- Multi-Query & Grouped-Query Attention
- Flash Attention
- Decoding: Beam Search & Speculative Decoding
- Embeddings & Cosine Similarity
- RAG (Retrieval-Augmented Generation) Pipeline
- Vector Search (HNSW)
- Chunking & Reranking
- ReAct Agent Loop
- Tool / Function Calling
- Multi-Agent Orchestration
- Planning & Task Decomposition
- Agent Memory
- Model Context Protocol (MCP)
- Quantization
- LoRA / PEFT Fine-Tuning
- Mixture of Experts (MoE)
- RLHF / DPO Alignment
- Evals & LLM-as-Judge
- Prompt Injection & Guardrails
- Knowledge Distillation
PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…