🧠 Knowledge Distillation — AI / ML Interview Guide

What it is

Knowledge distillation trains a small, cheap STUDENT model to mimic a large TEACHER. Instead of (or alongside) the hard "correct answer" labels, the student learns from the teacher's full SOFT probability distribution — which carries much richer signal — so a far smaller model keeps most of the teacher's quality.

Mental model

An apprentice learning from a master. The hard label only says "this is a cat". The master's soft labels say "cat — but also a bit dog, definitely not car", revealing how the master THINKS about the example. Copying that whole judgement teaches the apprentice far more than the one-word answer, so a much smaller apprentice gets surprisingly close to the master.

Theory

Big models are accurate but expensive to serve. Distillation compresses their capability into a smaller architecture by training the student to match the teacher's OUTPUTS rather than (only) the ground-truth labels — a cheap student that approximates an expensive teacher.

The key idea is "dark knowledge" in the soft labels. A hard one-hot label throws away everything except the winner; the teacher's full distribution encodes relative similarities — that a cat image is judged closer to a dog than to a car. Those relative probabilities are a much denser training signal than a single 1.

A TEMPERATURE T softens the teacher (and student) logits before the softmax: dividing logits by T > 1 spreads probability mass onto the runner-up classes, exposing the dark knowledge that a peaked distribution would hide. Higher T → more of the relative structure is visible to the student, up to a point: a very large T flattens every distribution toward uniform and washes the structure out again, so T is tuned as a hyperparameter.
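To make the softening effect concrete, here is a minimal sketch in plain Python. The function name `softmax_with_temperature` and the three teacher logits (standing in for cat, dog, car) are illustrative assumptions, not values from any real model.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (cat, dog, car).
teacher_logits = [6.0, 3.0, -2.0]

for T in (1.0, 4.0):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

With these made-up logits, the cat class holds roughly 95% of the mass at T = 1 and roughly 62% at T = 4, while the dog and car probabilities grow and keep their ordering. That surviving ordering is the relative structure the student is meant to see.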

The loss is typically a blend: a distillation term (KL divergence between student and softened-teacher distributions, scaled by T²) plus a standard cross-entropy term on the true labels. The first transfers the teacher's behavior; the second keeps the student grounded in correct answers.

Distillation comes in flavors: response/logit-based (match outputs, as above), feature-based (match intermediate hidden states), and for LLMs sequence-level / on-policy (train on the teacher's generated text or token distributions). DistilBERT, TinyBERT, and many production "small but strong" models are distilled this way. It pairs naturally with quantization for cheap deployment, and with LoRA when the student needs inexpensive fine-tuning.
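The feature-based flavor can be sketched as a mean-squared error between a teacher hidden state and a projected student hidden state. This is one common way to set it up, not the only one; the helper names `project` and `feature_distillation_loss`, the projection matrix, and the toy vectors are all hypothetical. The projection is there because the student's hidden size usually differs from the teacher's.

```python
def project(hidden, weight):
    """Linear map; `weight` is a list of rows, one per output dimension."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def feature_distillation_loss(student_hidden, teacher_hidden, projection):
    """MSE between the teacher's hidden state and the projected student state."""
    projected = project(student_hidden, projection)
    if len(projected) != len(teacher_hidden):
        raise ValueError("projection must map student size to teacher size")
    squared = [(a - b) ** 2 for a, b in zip(projected, teacher_hidden)]
    return sum(squared) / len(squared)

# Toy example: a 2-dimensional student state mapped into a 3-dimensional teacher space.
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical learned weights
student_hidden = [1.0, 2.0]
teacher_hidden = [1.0, 2.0, 5.0]
print(feature_distillation_loss(student_hidden, teacher_hidden, projection))
```

In a real setup the projection would be trained jointly with the student, and this term would typically be added to the logit-based loss rather than replace it.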

Concrete example

DistilBERT keeps ~97% of BERT's language-understanding performance while being ~40% smaller (about 66M parameters versus 110M for BERT-base) and ~60% faster at inference, by distilling from BERT. Many small chat models are distilled from a larger teacher's outputs to get big-model behavior at small-model cost.

Key equations

soften with temperature T: pᵢ = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)
distillation loss: L_KD = T² · KL( teacher_T || student_T )
total: L = α · L_KD + (1 − α) · CrossEntropy(student, hard label)
T > 1 exposes "dark knowledge" (relative probs of non-top classes)
student ≪ teacher in size, ≈ teacher in quality
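The equations above can be sketched for a single example in plain Python. The function names, the default values T = 4 and α = 0.5, and the toy logits are assumptions for illustration. The KL term is taken with the teacher distribution first, KL(teacher || student), which is the direction used in the common formulation; a framework implementation would work on batches of tensors instead of lists.

```python
import math

def softmax(logits, T=1.0):
    """softmax(logits / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """Return (total, kd_term, ce_term) for one example."""
    teacher_soft = softmax(teacher_logits, T)
    student_soft = softmax(student_logits, T)
    # Soft-target term, scaled by T^2 as in the key equations.
    kd = (T ** 2) * kl_divergence(teacher_soft, student_soft)
    # Hard-label term uses the unsoftened student distribution (T = 1).
    ce = -math.log(softmax(student_logits, 1.0)[label])
    return alpha * kd + (1 - alpha) * ce, kd, ce

total, kd, ce = distillation_loss(
    student_logits=[2.0, 1.0, 0.1],   # hypothetical student output
    teacher_logits=[6.0, 3.0, -2.0],  # hypothetical teacher output
    label=0,
)
print(f"total={total:.4f} kd={kd:.4f} ce={ce:.4f}")
```

Setting `alpha=1.0` trains purely on the teacher's soft targets, and `alpha=0.0` reduces to ordinary supervised training on the hard label.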

Step by step

Run the teacher to get its soft probability distribution for each input.
Soften both teacher and student outputs with temperature T.
Train the student to match the teacher's distribution (KL divergence).
Add a cross-entropy term on the true labels to stay grounded.
Result: a small student that reproduces most of the teacher's behavior.
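The steps above can be sketched as a toy training loop. To stay self-contained, the "student" here is just a vector of free logits for one input, updated by plain gradient descent with hand-derived gradients; a real student would be a network trained with an autograd framework. The function name `train_student`, the hyperparameters, and the teacher logits are all hypothetical.

```python
import math

def softmax(logits, T=1.0):
    """softmax(logits / T), computed stably."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def train_student(teacher_logits, label, T=4.0, alpha=0.7, lr=0.5, steps=300):
    n = len(teacher_logits)
    student_logits = [0.0] * n                  # untrained student: uniform output
    teacher_soft = softmax(teacher_logits, T)   # steps 1-2: run and soften the teacher
    one_hot = [1.0 if i == label else 0.0 for i in range(n)]
    for _ in range(steps):
        student_soft = softmax(student_logits, T)    # step 2: soften the student
        student_hard = softmax(student_logits, 1.0)
        for j in range(n):
            # Step 3: gradient of T^2 * KL(teacher || student) w.r.t. student logit j.
            grad_kd = T * (student_soft[j] - teacher_soft[j])
            # Step 4: gradient of cross-entropy on the true label.
            grad_ce = student_hard[j] - one_hot[j]
            student_logits[j] -= lr * (alpha * grad_kd + (1 - alpha) * grad_ce)
    return student_logits

teacher_logits = [6.0, 3.0, -2.0]  # hypothetical (cat, dog, car) logits
learned = train_student(teacher_logits, label=0)
print("teacher (T=4):", [round(p, 3) for p in softmax(teacher_logits, 4.0)])
print("student (T=4):", [round(p, 3) for p in softmax(learned, 4.0)])
```

After training, the student's softened distribution sits close to the teacher's and preserves the cat > dog > car ordering. It does not match exactly, because the cross-entropy term keeps pulling some extra mass toward the true label.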

Interview questions & answers

Why train on soft labels instead of just the hard labels?

Soft labels carry "dark knowledge" — the teacher's relative probabilities across classes (cat is closer to dog than to car). That's a far richer, lower-variance signal than a single one-hot label, so the student learns more from each example.

What does the temperature do in distillation?

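Dividing logits by T > 1 softens the distributions, spreading mass onto runner-up classes so their relative structure is visible. Gradients from the soft-target term shrink roughly as 1/T², so the KD loss is multiplied by T² to keep its contribution comparable to the hard-label term when T changes.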
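A short derivation shows where the T² factor comes from. The symbols are assumptions introduced for this sketch: zᵢ are student logits, vᵢ are teacher logits, N is the number of classes, and pᵀ, qᵀ are the teacher and student distributions softened at temperature T. The approximation in the second line holds for large T and assumes the logits are zero-mean for each example.

```latex
\frac{\partial}{\partial z_i}\,\mathrm{KL}\!\left(p^{T}\,\|\,q^{T}\right)
  = \frac{1}{T}\left(q_i^{T} - p_i^{T}\right)
  \approx \frac{1}{N\,T^{2}}\left(z_i - v_i\right)
\qquad\Longrightarrow\qquad
\frac{\partial}{\partial z_i}\left[\,T^{2}\,\mathrm{KL}\!\left(p^{T}\,\|\,q^{T}\right)\right]
  \approx \frac{1}{N}\left(z_i - v_i\right)
```

Without the factor, raising T would silently shrink the soft-target gradient relative to the cross-entropy gradient. With it, the balance set by α stays roughly stable as T is tuned.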

What's in the distillation loss?

Usually a weighted sum: KL divergence between the softened student and teacher distributions (transfers behavior) plus standard cross-entropy on the ground-truth labels (keeps the student correct).

How does distillation relate to quantization and pruning?

All three make models cheaper but differently: distillation trains a smaller model to mimic a bigger one; quantization lowers weight precision; pruning removes weights. They're complementary and often combined for deployment.

Common pitfalls

Using only hard labels — you throw away the teacher's dark knowledge.
Forgetting to scale the KD term by T² → mismatched gradient magnitudes.
Expecting a tiny student to fully match a much larger teacher on hard tasks.

Where it shows up

DistilBERT, TinyBERT, many "small but strong" production models
Distilling large LLM outputs into smaller deployable chat models
Edge / on-device deployment (with quantization + LoRA)
