🧠 Tokenization (Byte-Pair Encoding) — AI / ML Interview Guide

LLM Internals · interactive visualization + interview prep

Open the interactive Tokenization (Byte-Pair Encoding) visualization on PrepGrind → Step through a live animation, tune the parameters, and read the full theory, math, reference code, and interview Q&A below — free, in your browser.

What it is

Models don’t read characters or whole words — they read "tokens", chunks learned from data. Byte-Pair Encoding (BPE) starts from single characters and repeatedly merges the most frequent adjacent pair, so common pieces ("ing", "un", "er") become single tokens while rare words stay split.

Mental model

A tokenizer is a learned compression scheme for text. It watches a corpus and whatever short character-sequences co-occur most often earn their own symbol. Frequent words and suffixes collapse to a single token; rare words stay broken into pieces. The "vocabulary" is just the ordered list of merges it learned — at encode time you replay those same merges on new text. So the model never sees letters or words, it sees these learned chunks.

Theory

Tokenization is the bridge between raw text and the model: it maps a string to a sequence of integer IDs. The granularity is a real trade-off. Word-level vocabularies explode in size and can't represent unseen words (the out-of-vocabulary problem). Character-level handles anything but makes sequences very long and forces the model to relearn spelling. Subword tokenization (BPE, WordPiece, Unigram) is the middle ground: a fixed-size vocabulary that can still encode ANY string by falling back to smaller pieces.

Byte-Pair Encoding learns its vocabulary greedily. Start with every character as its own token. Repeatedly count all adjacent token pairs across the corpus, merge the single most frequent pair into a new token, and record that merge. Each merge adds one vocabulary entry and can build on previous merges, so longer subwords emerge over time. You stop after a target number of merges (≈ the vocab size).

The ORDERED merge list IS the tokenizer. At encode time you apply the same merges in the same order to new text — this determinism is what makes "token" a precise, reproducible unit. Byte-level BPE (used by GPT) runs this over raw bytes rather than Unicode characters, so it can encode any input — emoji, code, other scripts — with no true OOV.

The main families differ in the merge criterion. BPE merges by raw frequency. WordPiece (BERT) merges the pair that most increases the corpus likelihood. Unigram / SentencePiece starts from a large candidate set and PRUNES tokens by a probabilistic model. All yield subword vocabularies; the differences mostly affect edge cases and morphology.

The practical consequences matter for any LLM engineer. A token is ~0.75 English words on average but varies wildly: rare words, code, and non-English text fragment into many more tokens — which directly raises API cost and consumes context window. Whitespace and capitalization create distinct tokens (" the", "the", "The" all differ), a common source of subtle prompt bugs.

Concrete example

"tokenizer" might split into "token" + "izer"; "unhappiness" into "un" + "happiness". Frequent suffixes like "ing" across "reading/singing/running" get merged early because the pair (i, n)→(in)→(ing) recurs often. This is why token counts (and API cost) don’t equal word or character counts.

Key equations

start: every character is its own token
repeat: count all adjacent token pairs → merge the most frequent pair into one
each merge adds one entry to the vocabulary
stop after a fixed number of merges (≈ vocab size)
fewer/longer tokens as merges increase; rare words stay fragmented

Step by step

Split the text into individual characters (the starting tokens).
Count every adjacent character pair across the corpus.
Merge the most frequent pair everywhere into a single new token.
Repeat — each merge can build longer subwords from the previous ones.
After N merges you have the final segmentation; drag the slider to watch it form.

Interview questions & answers

Why subword tokenization instead of words or characters?

Word vocabularies explode and can’t handle unseen words (OOV); character models make sequences very long. Subwords (BPE/WordPiece/Unigram) balance both: a fixed vocab that still represents any string by falling back to smaller pieces.

How does BPE actually build its vocabulary?

Greedily: start from characters, repeatedly merge the most frequent adjacent pair into a new token, recording each merge. The ordered merge list IS the tokenizer; at encode time you apply the same merges in order.

Why don’t token counts equal word counts?

A common word may be one token; a rare or long word may be several; whitespace and capitalization create distinct tokens. That’s why API pricing and context limits are measured in tokens, not words.

BPE vs WordPiece vs Unigram?

BPE merges by frequency; WordPiece merges by likelihood gain; Unigram (SentencePiece) starts big and prunes by a probabilistic model. All produce subword vocabularies.

Common pitfalls

Assuming 1 token ≈ 1 word — it’s closer to ~0.75 words in English, and varies by language.
Forgetting whitespace/case create separate tokens (" the" ≠ "the" ≠ "The").
Non-English / code text often tokenizes into far more tokens (more cost).

Where it shows up

GPT (BPE), BERT (WordPiece), Llama/T5 (SentencePiece)
Every LLM API’s token counting & pricing
tiktoken / Hugging Face tokenizers

More AI / ML interview concepts

PrepGrind runs entirely in your browser, free, no installation required. Loading the interactive playground…