🧠 Quantization — AI / ML Interview Guide
Production & Training · interactive visualization + interview prep
What it is
A model’s weights are stored as high-precision numbers (FP16/FP32). Quantization rounds them to a smaller set of discrete levels (INT8, INT4…), so each weight takes fewer bits. The model gets much smaller and faster, at the cost of a little rounding error.
Mental model
Saving a photo with a smaller color palette. The original weights are a smooth continuum; quantization snaps each one to the nearest rung of a COARSE grid and stores the rung number (a few bits) instead of the full number. Fewer rungs means a smaller file that is faster to move — at the price of rounding every weight to its nearest rung. For most weights that rounding is invisible; the art is handling the few that are not.
Theory
A trained model's weights are stored as high-precision floats (FP32/FP16). Quantization maps them to a small set of discrete levels represented in fewer bits (INT8, INT4), shrinking the model and speeding inference. It is a lossy compression of the weights: you trade a controlled amount of rounding error for large memory and bandwidth savings.
The mechanics are an affine map. Pick a range [min, max] for a group of weights and split it into 2^bits evenly spaced levels with step size scale = (max−min)/(2^bits−1). Each weight is rounded to the nearest level and stored as a small integer; at compute time it is dequantized back (value ≈ q·scale + min). Going 16→8→4 bits roughly halves memory at each step while coarsening the grid.
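The affine map described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production kernel: the function names `quantize` and `dequantize` and the sample weights are assumptions made for this example, and real libraries operate on tensors and pack the integer codes into bytes.

```python
def quantize(weights, bits):
    """Map floats to integer codes in [0, 2**bits - 1] over the group's [min, max]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1)   # step between adjacent levels
    if scale == 0.0:                      # all weights identical: one level is enough
        return [0] * len(weights), 1.0, lo
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate weights: value ~= q * scale + min."""
    return [q * scale + lo for q in codes]


# Hypothetical group of weights, chosen only for illustration.
weights = [-0.8, -0.31, 0.0, 0.12, 0.47, 1.0]
codes, scale, lo = quantize(weights, bits=4)
restored = dequantize(codes, scale, lo)

# Rounding to the nearest level means no weight moves by more than half a step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"scale={scale:.4f}", f"max_err={max_err:.4f}")
```

Only `codes` (4 bits each) plus one `scale` and one `min` per group need to be stored; the half-step bound on `max_err` is what makes the rounding error "controlled".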
Why does this speed up inference? LLM decoding is largely MEMORY-BANDWIDTH bound — the bottleneck is moving weights from memory to the compute units, not the math itself. Fewer bits per weight means fewer bytes to move (and integer kernels where supported), so throughput rises — provided the hardware supports the format.
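A back-of-the-envelope sketch of the bandwidth argument: if every weight must be read once per generated token, the memory bandwidth divided by the model size gives a ceiling on single-stream decode speed. The parameter count and the 100 GB/s bandwidth figure below are hypothetical values chosen for illustration, and the estimate ignores the KV cache, activations, batching, and compute time.

```python
def decode_tokens_per_second_ceiling(n_params, bits_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if all weights are streamed once per token."""
    model_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / model_bytes


N_PARAMS = 7e9           # assumed model size (7 billion weights)
BANDWIDTH_GB_S = 100.0   # assumed memory bandwidth, purely illustrative

for bits in (16, 8, 4):
    ceiling = decode_tokens_per_second_ceiling(N_PARAMS, bits, BANDWIDTH_GB_S)
    print(f"{bits:>2}-bit weights: at most ~{ceiling:.1f} tokens/s")
```

Under these assumptions, halving the bits per weight doubles the ceiling. Real speedups are usually smaller, because dequantization costs compute and not every kernel supports every format.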
There are two regimes. Post-Training Quantization (PTQ) quantizes an already-trained model — fast, no retraining, the common case. Quantization-Aware Training (QAT) simulates quantization DURING training so the model learns weights robust to it — better low-bit accuracy, but more expensive.
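To make "simulates quantization during training" concrete, here is a minimal sketch of one QAT update for a single scalar weight. It uses the straight-through estimator, a common way to get gradients past the rounding step, though not necessarily what any particular framework does. The names `fake_quant` and `qat_step`, the fixed range [-1, 1], and the toy squared-error loss are all assumptions made for this example.

```python
def fake_quant(w, lo, hi, bits):
    """Quantize then immediately dequantize, so the forward pass sees a rounded weight."""
    scale = (hi - lo) / (2 ** bits - 1)
    w = min(max(w, lo), hi)                      # clamp into the representable range
    return round((w - lo) / scale) * scale + lo


def qat_step(w_float, x, y_true, lr, lo=-1.0, hi=1.0, bits=2):
    """One SGD step on loss = (w_q * x - y_true)**2 with a straight-through estimator."""
    w_q = fake_quant(w_float, lo, hi, bits)      # forward pass uses the quantized weight
    y_pred = w_q * x
    grad_wq = 2.0 * (y_pred - y_true) * x        # dLoss / dw_q
    # Straight-through estimator: round() has zero gradient almost everywhere,
    # so the gradient w.r.t. w_q is applied to the float weight unchanged.
    return w_float - lr * grad_wq


w = 0.1
w_after = qat_step(w, x=1.0, y_true=1.0, lr=0.1)
print(fake_quant(w, -1.0, 1.0, 2), w_after)
```

PTQ, by contrast, would train `w` in full precision and call `fake_quant` only once at the end; QAT keeps the float weight as the trainable copy while the loss is always computed on its quantized version.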
The hard part is outliers: a few large-magnitude weights stretch the range and make every level coarser, amplifying error for all the others. This is why naive uniform quantization degrades, and why methods like per-channel/per-group scales, GPTQ, and AWQ exist to protect sensitive weights. Rule of thumb: INT8 is usually near-lossless, INT4 is noticeable but often acceptable — always MEASURE on your task.
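The effect of a single outlier, and the benefit of giving each group its own range, can be seen in a small experiment. This is an illustrative sketch with made-up weights; it shows plain per-group min/max scaling only, not GPTQ or AWQ, and the helper names are assumptions for this example.

```python
def quant_dequant(group, bits):
    """Round each weight to the nearest of 2**bits levels spanning the group's range."""
    lo, hi = min(group), max(group)
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0.0:
        return list(group)
    return [round((w - lo) / scale) * scale + lo for w in group]


def per_group(weights, bits, group_size):
    """Quantize consecutive chunks independently, each with its own min and scale."""
    out = []
    for i in range(0, len(weights), group_size):
        out.extend(quant_dequant(weights[i:i + group_size], bits))
    return out


def mean_abs_error(weights, restored):
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


# Hypothetical weights: seven small values plus one large-magnitude outlier.
weights = [0.02, -0.05, 0.07, -0.01, 0.04, -0.03, 0.06, 8.0]

err_single = mean_abs_error(weights, quant_dequant(weights, bits=4))
err_grouped = mean_abs_error(weights, per_group(weights, bits=4, group_size=4))
print(f"one range: {err_single:.4f}   groups of 4: {err_grouped:.4f}")
```

With one shared range, the outlier stretches the step so far that all the small weights collapse onto the same level. Splitting into groups confines that damage to the outlier's own group, at the cost of storing a scale and a min per group.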
Concrete example
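A 7B model in FP16 needs about 14 GB for its weights alone (7 billion × 2 bytes), which is more memory than many consumer GPUs have. Quantized to INT4, the weights take about 3.5 GB (7 billion × 0.5 bytes, plus a small overhead for scales), small enough to run on a laptop, typically with only a small quality drop. That’s how llama.cpp / GGUF / GPTQ models run locally.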
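The arithmetic behind these figures is just parameters × bits ÷ 8. The helper below is a hypothetical convenience function for this example; it counts weight storage only and ignores per-group scales, activations, and the KV cache.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```

Real quantized files tend to be slightly larger than this estimate, because each group of weights also stores its scale (and sometimes a minimum or zero point).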
Key equations
- Pick a range [min, max] and split it into 2^bits levels.
- scale = (max − min) / (2^bits − 1)
- q(w) = round((w − min) / scale) · scale + min
- Memory per weight: 16-bit → 8-bit → 4-bit halves at each step.
- Error ↑ and size ↓ as bits ↓: the core trade-off.
Step by step
- Start from full-precision weights (≈ continuous).
- Choose a bit-width b → 2^b evenly spaced quantization levels across the range.
- Snap each weight to the nearest level (rounding).
- Fewer bits → coarser grid → smaller memory but larger rounding error.
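The steps above can be run end to end to watch the error grow as the bit-width shrinks. This is a sketch on synthetic data: the Gaussian weights, the seed, and the helper names are assumptions for this example, and real weight distributions will give different numbers.

```python
import random


def quant_dequant(weights, bits):
    """Snap each weight to the nearest of 2**bits evenly spaced levels over [min, max]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0.0:
        return list(weights)
    return [round((w - lo) / scale) * scale + lo for w in weights]


def rms_error(weights, restored):
    """Root-mean-square rounding error across all weights."""
    return (sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)) ** 0.5


random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(4096)]  # synthetic stand-in for real weights

errors = {bits: rms_error(weights, quant_dequant(weights, bits)) for bits in (8, 4, 2)}
for bits, err in errors.items():
    print(f"{bits}-bit: {2 ** bits:>3} levels, RMS error {err:.5f}")
```

Each halving of the bit-width shrinks storage by 2× but cuts the number of levels far more sharply (256 → 16 → 4), which is why the error rises steeply at very low bit-widths.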
Interview questions & answers
What does quantization trade off?
Memory/speed vs accuracy. Lower bit-width shrinks the model and speeds inference (less memory bandwidth) but adds rounding error that can degrade quality — INT8 is usually near-lossless, INT4 noticeable but often acceptable.
Post-training quantization (PTQ) vs quantization-aware training (QAT)?
PTQ quantizes an already-trained model (fast, no retraining). QAT simulates quantization during training so the model adapts to it (better low-bit accuracy, more expensive).
Why do outlier weights matter?
A few large-magnitude weights stretch the range, which widens the step between levels and increases the rounding error for every other weight. Methods like per-channel scales, GPTQ, and AWQ handle outliers to preserve accuracy at low bits.
Does quantization speed up inference, and why?
Often yes: LLM inference is memory-bandwidth bound, so moving fewer bytes per weight (and using int math) speeds it up — provided the hardware/kernels support the format.
Common pitfalls
- Quantizing everything uniformly — sensitive layers may need higher precision.
- Ignoring outliers → a few weights blow up the range and the error.
- Assuming INT4 is free — quality drop varies by model and task; measure it.
Where it shows up
- llama.cpp / GGUF, GPTQ, AWQ, bitsandbytes
- On-device & edge LLMs
- Cheaper, faster cloud inference
More AI / ML interview concepts
- Neural Networks & Backpropagation
- Gradient Descent & Optimizers
- Activation Functions
- K-Means Clustering
- Self-Attention
- Multi-Head Attention
- Softmax, Temperature & Sampling
- Tokenization (Byte-Pair Encoding)
- Positional Encoding
- KV Cache
- Rotary Position Embedding (RoPE)
- The Transformer Block
- Normalization (LayerNorm / RMSNorm)
- Multi-Query & Grouped-Query Attention
- Flash Attention
- Decoding: Beam Search & Speculative Decoding
- Embeddings & Cosine Similarity
- RAG (Retrieval-Augmented Generation) Pipeline
- Vector Search (HNSW)
- Chunking & Reranking
- ReAct Agent Loop
- Tool / Function Calling
- Multi-Agent Orchestration
- Planning & Task Decomposition
- Agent Memory
- Model Context Protocol (MCP)
- LoRA / PEFT Fine-Tuning
- Mixture of Experts (MoE)
- RLHF / DPO Alignment
- Evals & LLM-as-Judge
- Prompt Injection & Guardrails
- Knowledge Distillation
PrepGrind runs entirely in your browser, free, no installation required.