🧠 Quantization — AI / ML Interview Guide

What it is

A model’s weights are stored as high-precision numbers (FP16/FP32). Quantization rounds them to a smaller set of discrete levels (INT8, INT4…), so each weight takes fewer bits. The model gets much smaller and faster, at the cost of a little rounding error.

Mental model

Think of saving a photo with a smaller color palette. The original weights are a smooth continuum; quantization snaps each one to the nearest rung of a coarse grid and stores the rung number (a few bits) instead of the full number. Fewer rungs means a smaller file that is faster to move, at the price of rounding every weight to its nearest rung. For most weights that rounding is invisible; the art is handling the few for which it is not.

Theory

A trained model's weights are stored as high-precision floats (FP32/FP16). Quantization maps them to a small set of discrete levels represented in fewer bits (INT8, INT4), shrinking the model and speeding inference. It is a lossy compression of the weights: you trade a controlled amount of rounding error for large memory and bandwidth savings.

The mechanics are an affine map. Pick a range [min, max] for a group of weights and split it into 2^bits evenly spaced levels with step size scale = (max−min)/(2^bits−1). Each weight w is rounded to the nearest level and stored as a small integer q = round((w−min)/scale); at compute time it is dequantized back (value ≈ q·scale + min), so the rounding error per weight is at most scale/2. Going 16→8→4 bits roughly halves memory at each step while coarsening the grid.
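
The affine map above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production kernel: the function names `quantize` and `dequantize` and the sample weights are made up for this example, and real libraries operate on tensors rather than lists.

```python
def quantize(weights, bits):
    """Affine-quantize a list of floats to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits
    # Step size between adjacent levels; fall back to 1.0 if all weights are equal.
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [q * scale + lo for q in codes]


# Made-up weights for illustration.
weights = [-0.9, -0.31, 0.02, 0.4, 1.1]
codes, scale, lo = quantize(weights, bits=4)
restored = dequantize(codes, scale, lo)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("codes:   ", codes)
print("restored:", [round(r, 3) for r in restored])
print("max error:", round(max_error, 4), "| scale / 2:", round(scale / 2, 4))
```

The smallest weight maps to code 0, the largest to the top code, and no weight moves by more than half a step.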

Why does this speed up inference? LLM decoding, especially at small batch sizes, is largely memory-bandwidth bound: the bottleneck is moving weights from memory to the compute units, not the math itself. Fewer bits per weight means fewer bytes to move (and integer kernels where supported), so throughput rises, provided the hardware and kernels support the format.
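
A back-of-the-envelope estimate makes the bandwidth argument concrete. The sketch below assumes every weight is read from memory once per generated token and ignores everything else (KV cache, activations, compute, dequantization overhead), so it gives a rough upper bound only. The 100 GB/s bandwidth is a hypothetical figure chosen for illustration, not a measurement of any real device.

```python
def decode_tokens_per_second(n_params, bits_per_weight, bandwidth_gb_per_s):
    """Rough upper bound on decode speed if all weights are read once per token."""
    weight_bytes = n_params * bits_per_weight / 8
    return bandwidth_gb_per_s * 1e9 / weight_bytes


N_PARAMS = 7e9     # a 7B-parameter model
BANDWIDTH = 100.0  # GB/s, hypothetical device used only for illustration

for bits in (16, 8, 4):
    tps = decode_tokens_per_second(N_PARAMS, bits, BANDWIDTH)
    print(f"{bits:>2}-bit weights: at most about {tps:.1f} tokens/s")
```

Under these assumptions, halving the bits per weight doubles the bound, which is the effect the paragraph above describes.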

There are two regimes. Post-Training Quantization (PTQ) quantizes an already-trained model: fast, no retraining, the common case. Quantization-Aware Training (QAT) simulates quantization during training so the model learns weights that are robust to it: better low-bit accuracy, but more expensive.
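
One common way to "simulate quantization during training" is fake quantization with a straight-through estimator (STE): the forward pass sees the rounded weight, while the gradient is passed to the full-precision weight as if rounding were the identity. The toy below fits a single scalar weight with a fixed scale. It is a sketch under those assumptions; `fake_quant` and `train_qat` are hypothetical names, and real QAT is done inside a training framework on whole layers.

```python
def fake_quant(w, scale):
    """Quantize then immediately dequantize: the value the forward pass sees."""
    return round(w / scale) * scale


def train_qat(xs, ys, scale, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass uses the quantized weight."""
    w = 0.0  # latent full-precision weight kept by the optimizer
    for _ in range(steps):
        wq = fake_quant(w, scale)
        # Gradient of the mean squared error with respect to wq.
        grad = sum(2 * (wq * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Straight-through estimator: treat d(wq)/d(w) as 1.
        w -= lr * grad
    return w, fake_quant(w, scale)


# Toy data generated by a true weight of 0.6, which is not on the 0.25 grid.
xs = [1.0, 2.0, 3.0]
ys = [0.6 * x for x in xs]
latent_w, quantized_w = train_qat(xs, ys, scale=0.25)
print("latent weight:", round(latent_w, 4), "| quantized weight:", quantized_w)
```

The quantized weight ends on one of the two grid points next to 0.6, and the latent weight hovers near the rounding boundary between them. PTQ, by contrast, would train `w` without `fake_quant` and round it once at the end.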

The hard part is outliers: a few large-magnitude weights stretch the range and make every level coarser, amplifying error for all the others. This is why naive uniform quantization degrades, and why methods like per-channel/per-group scales, GPTQ, and AWQ exist to protect sensitive weights. Rule of thumb: INT8 is usually near-lossless, while INT4 causes a noticeable but often acceptable drop. Always measure on your task.
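
The effect of a single outlier, and the benefit of per-group scales, can be seen on a handful of numbers. This sketch only illustrates per-group ranges; it does not implement GPTQ or AWQ, and the weights and helper names are made up for the example.

```python
def quant_dequant(values, bits):
    """Round-trip a list of floats through an affine grid with its own range."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** bits - 1) if hi > lo else 1.0
    return [round((v - lo) / scale) * scale + lo for v in values]


def per_group_quant_dequant(values, bits, group_size):
    """Give each consecutive group of weights its own range and scale."""
    restored = []
    for start in range(0, len(values), group_size):
        restored.extend(quant_dequant(values[start:start + group_size], bits))
    return restored


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Seven small weights plus one outlier (made-up values for illustration).
weights = [0.02, -0.05, 0.11, -0.08, 0.04, -0.01, 0.07, 8.0]

per_tensor_err = mean_abs_error(weights, quant_dequant(weights, bits=4))
per_group_err = mean_abs_error(
    weights, per_group_quant_dequant(weights, bits=4, group_size=4)
)
print("one shared range:  mean abs error =", round(per_tensor_err, 4))
print("groups of 4:       mean abs error =", round(per_group_err, 4))
```

With one shared range, the outlier makes the step so large that all the small weights collapse onto the same level. With groups of four, only the group that contains the outlier pays that price.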

Concrete example

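A 7B model in FP16 needs about 14 GB for its weights (7 billion × 2 bytes), which is too much for many consumer GPUs. Quantized to INT4 it needs about 3.5 GB (7 billion × 0.5 bytes, plus a small overhead for scales) and runs on a laptop, typically with only a small quality drop. That’s how quantized models in llama.cpp (GGUF) or GPTQ format run locally.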
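
The arithmetic behind those figures is short enough to check directly. The helper below counts weight storage only, in decimal gigabytes; it ignores scales, zero-points, activations, and the KV cache. The name `weight_memory_gb` is chosen for this example.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")
# FP16: 14.0 GB
# INT8: 7.0 GB
# INT4: 3.5 GB
```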

Key equations
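
Written out, the affine map from the Theory section looks as follows. The symbols are chosen here for illustration: b is the bit-width, w a weight in the group, s the step size (the "scale"), q the stored integer, and ŵ the dequantized value.

```latex
s = \frac{w_{\max} - w_{\min}}{2^{b} - 1}
\qquad
q = \operatorname{round}\!\left(\frac{w - w_{\min}}{s}\right),
\quad q \in \{0, 1, \dots, 2^{b} - 1\}

\hat{w} = q \, s + w_{\min}
\qquad
\lvert w - \hat{w} \rvert \le \frac{s}{2}
```

The last inequality holds for every weight inside [w_min, w_max], because rounding to the nearest level moves a value by at most half a step.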

Step by step

  1. Start from full-precision weights (≈ continuous).
  2. Choose a bit-width b → 2^b quantization levels across the range.
  3. Snap each weight to the nearest level (rounding).
  4. Fewer bits → coarser grid → smaller memory but larger rounding error.
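
The four steps above can be run end to end for several bit-widths to see the grid coarsen. This is a sketch on synthetic data: the weights are random Gaussian samples, not weights from a real model, and `round_trip_error` is a name made up for the example.

```python
import random


def round_trip_error(weights, bits):
    """Quantize and dequantize, returning (max rounding error, step size)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2 ** bits - 1)
    restored = [round((w - lo) / scale) * scale + lo for w in weights]
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    return max_err, scale


random.seed(0)
# Step 1: synthetic "full-precision" weights.
weights = [random.gauss(0.0, 1.0) for _ in range(1000)]

results = {}
for bits in (8, 4, 2):
    # Steps 2 and 3: choose the bit-width, then snap every weight to the grid.
    results[bits] = round_trip_error(weights, bits)
    max_err, scale = results[bits]
    # Step 4: fewer bits, coarser grid, larger rounding error.
    print(f"{bits}-bit: {2 ** bits:>3} levels, step {scale:.4f}, max error {max_err:.4f}")
```

In every case the maximum error stays at or below half the step size, and the step size grows as the bit-width shrinks.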

Interview questions & answers

What does quantization trade off?

Memory/speed vs accuracy. Lower bit-width shrinks the model and speeds inference (less memory bandwidth) but adds rounding error that can degrade quality — INT8 is usually near-lossless, INT4 noticeable but often acceptable.

Post-training quantization (PTQ) vs quantization-aware training (QAT)?

PTQ quantizes an already-trained model (fast, no retraining). QAT simulates quantization during training so the model adapts to it (better low-bit accuracy, more expensive).

Why do outlier weights matter?

A few large-magnitude weights stretch the range, making every level coarser. Methods like per-channel scales, GPTQ, and AWQ handle outliers to preserve accuracy at low bits.

Does quantization speed up inference, and why?

Often yes: LLM decoding is typically memory-bandwidth bound, so moving fewer bytes per weight (and using integer math) speeds it up, provided the hardware and kernels support the format.

Common pitfalls

Where it shows up
