🧠 Knowledge Distillation — AI / ML Interview Guide
Production & Training · interactive visualization + interview prep
What it is
Knowledge distillation trains a small, cheap STUDENT model to mimic a large TEACHER. Instead of (or alongside) the hard "correct answer" labels, the student learns from the teacher's full SOFT probability distribution — which carries much richer signal — so a far smaller model keeps most of the teacher's quality.
Mental model
An apprentice learning from a master. The hard label only says "this is a cat". The master's soft labels say "cat — but also a bit dog, definitely not car", revealing how the master THINKS about the example. Copying that whole judgement teaches the apprentice far more than the one-word answer, so a much smaller apprentice gets surprisingly close to the master.
Theory
Big models are accurate but expensive to serve. Distillation compresses their capability into a smaller architecture by training the student to match the teacher's OUTPUTS rather than (only) the ground-truth labels — a cheap student that approximates an expensive teacher.
The key idea is "dark knowledge" in the soft labels. A hard one-hot label throws away everything except the winner; the teacher's full distribution encodes relative similarities — that a cat image is judged closer to a dog than to a car. Those relative probabilities are a much denser training signal than a single 1.
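A TEMPERATURE T softens the teacher (and student) logits before the softmax: dividing logits by T > 1 spreads probability mass onto the runner-up classes, exposing the dark knowledge that a peaked distribution would hide. Higher T makes more of the relative structure visible to the student, but only up to a point: as T grows very large the distribution approaches uniform and the structure washes out, so T is tuned as a hyperparameter. At inference the student runs at T = 1.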
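To make the effect of T concrete, here is a minimal sketch in plain Python. The function name `softmax_with_temperature` and the three teacher logits (standing in for cat, dog, car) are illustrative assumptions, not values from any real model.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (cat, dog, car).
teacher_logits = [6.0, 3.0, -2.0]

for T in (1.0, 4.0, 20.0):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

At T = 1 almost all mass sits on the top class; raising T moves mass onto the runner-up classes while keeping their ranking, and a very large T pushes everything toward uniform.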
The loss is typically a blend: a distillation term (KL divergence between student and softened-teacher distributions, scaled by T²) plus a standard cross-entropy term on the true labels. The first transfers the teacher's behavior; the second keeps the student grounded in correct answers.
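The blended loss can be sketched for a single example as follows. This is an illustration under stated assumptions: the helper names `softmax_t` and `kd_loss`, the default `T` and `alpha`, and the sample logits are all hypothetical. The KL term is written with the teacher distribution as the target, KL(teacher_T || student_T), which is the common formulation; the hard-label term uses the student at T = 1.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax of logits / T, with max-subtraction for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax_t(teacher_logits, T)
    q_student = softmax_t(student_logits, T)
    # Distillation term: KL divergence between the softened distributions.
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, q_student) if p > 0)
    # Hard-label term: ordinary cross-entropy at T = 1.
    ce = -math.log(softmax_t(student_logits, 1.0)[label])
    return alpha * (T ** 2) * kl + (1 - alpha) * ce

# Hypothetical logits for one example whose true class is index 0.
print(kd_loss([2.0, 1.0, 0.0], [6.0, 3.0, -2.0], label=0))
```

Setting `alpha=1.0` gives pure distillation and `alpha=0.0` gives ordinary supervised training, so `alpha` controls how much the student trusts the teacher versus the labels.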
Distillation comes in flavors: response/logit-based (match outputs, as above), feature-based (match intermediate hidden states), and for LLMs sequence-level / on-policy (train on the teacher's generated text or token distributions). DistilBERT, TinyBERT, and many production "small but strong" models are distilled this way. It pairs naturally with quantization for cheap deployment and with LoRA for cheap task adaptation.
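For the feature-based flavor, a common pattern is to project the student's hidden state up to the teacher's width and penalize the squared difference. The sketch below assumes that pattern; the function name, the `projection` matrix (which would normally be learned together with the student), and the vectors are hypothetical.

```python
def feature_distillation_loss(student_hidden, teacher_hidden, projection):
    """Mean squared error between projected student features and teacher features.

    projection has one row per teacher dimension and one column per student dimension.
    """
    projected = [sum(w * h for w, h in zip(row, student_hidden)) for row in projection]
    squared_errors = [(a - b) ** 2 for a, b in zip(projected, teacher_hidden)]
    return sum(squared_errors) / len(teacher_hidden)

# Hypothetical 2-d student state mapped into a 3-d teacher space.
student_hidden = [1.0, 2.0]
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_hidden = [1.0, 2.0, 5.0]
print(feature_distillation_loss(student_hidden, teacher_hidden, projection))
```

This term is usually added to the output-matching loss rather than used instead of it.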
Concrete example
DistilBERT keeps ~97% of BERT's language-understanding performance while being ~40% smaller and ~60% faster, by distilling from BERT. Many small chat models are distilled from a larger teacher's outputs to get big-model behavior at small-model cost.
Key equations
soften with temperature T: pᵢ = softmax(z / T)ᵢ = exp(zᵢ / T) / Σⱼ exp(zⱼ / T)
distillation loss: L_KD = T² · KL( teacher_T || student_T )
total: L = α · L_KD + (1 − α) · CrossEntropy(student, hard label)
T > 1 exposes "dark knowledge" (relative probs of non-top classes)
student ≪ teacher in size, ≈ teacher in quality
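Where the T² factor comes from can be seen from the gradient of the softened term with respect to a student logit. The derivation below is a sketch in the spirit of the original distillation argument; the symbols z_i (student logits), v_i (teacher logits), and N (number of classes) are introduced here, and the last step assumes T is large relative to the logits and that both sets of logits are zero-mean.

```latex
\begin{align*}
q_i &= \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad
p_i = \frac{\exp(v_i / T)}{\sum_j \exp(v_j / T)} \\[4pt]
\frac{\partial}{\partial z_i}\,\mathrm{KL}(p \,\|\, q)
  &= \frac{1}{T}\,(q_i - p_i) \\[4pt]
  &\approx \frac{1}{N T^2}\,(z_i - v_i)
  \quad \text{(large } T \text{, zero-mean logits)} \\[4pt]
\frac{\partial}{\partial z_i}\,\bigl[T^2 \cdot \mathrm{KL}(p \,\|\, q)\bigr]
  &\approx \frac{1}{N}\,(z_i - v_i)
\end{align*}
```

Without the factor the soft-term gradient shrinks as 1/T²; with it, the gradient is roughly independent of T, so α keeps the same meaning when T is retuned.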
Step by step
- Run the teacher to get its soft probability distribution for each input.
- Soften both teacher and student outputs with temperature T.
- Train the student to match the teacher's distribution (KL divergence).
- Add a cross-entropy term on the true labels to stay grounded.
- Result: a small student that reproduces most of the teacher's behavior.
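The steps above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the "teacher" is a fixed linear scorer (`TEACHER_W`) standing in for a large model, the student is a linear classifier of the same shape rather than a genuinely smaller network, the hard labels are taken from the teacher's top class instead of a real dataset, and the hyperparameters are arbitrary. The gradient with respect to the student logits is written out by hand: α · T · (q_T − p_T) for the T²-scaled KL term plus (1 − α) · (q − y) for cross-entropy.

```python
import math
import random

def softmax_t(logits, T=1.0):
    """Softmax of logits / T, with max-subtraction for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def linear_logits(W, x):
    """Logits of a bias-free linear classifier: one row of W per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical frozen teacher weights (3 classes, 2 input features).
TEACHER_W = [[4.0, -1.0], [2.0, 1.0], [-3.0, 2.0]]

def distill(inputs, labels, T=4.0, alpha=0.7, lr=0.1, epochs=300, seed=0):
    """Train student weights with SGD on alpha*T^2*KL + (1-alpha)*CE."""
    rng = random.Random(seed)
    num_classes, dim = len(TEACHER_W), len(inputs[0])
    W = [[rng.uniform(-0.01, 0.01) for _ in range(dim)] for _ in range(num_classes)]
    for _ in range(epochs):
        for x, label in zip(inputs, labels):
            # Steps 1-2: run the teacher and soften both outputs with T.
            p_teacher = softmax_t(linear_logits(TEACHER_W, x), T)
            z = linear_logits(W, x)
            q_soft = softmax_t(z, T)    # softened student, for the KL term
            q_hard = softmax_t(z, 1.0)  # unsoftened student, for cross-entropy
            # Steps 3-4: gradient of the blended loss w.r.t. each student logit.
            for k in range(num_classes):
                y_k = 1.0 if k == label else 0.0
                grad_k = (alpha * T * (q_soft[k] - p_teacher[k])
                          + (1 - alpha) * (q_hard[k] - y_k))
                for d in range(dim):
                    W[k][d] -= lr * grad_k * x[d]
    return W

inputs = [(1.0, 0.0), (1.0, 0.5), (1.0, 2.0), (0.5, 1.5), (0.0, 1.0), (-1.0, 0.5)]
labels = [argmax(linear_logits(TEACHER_W, x)) for x in inputs]  # toy hard labels
student_W = distill(inputs, labels)
agreement = sum(argmax(linear_logits(student_W, x)) == y
                for x, y in zip(inputs, labels))
print(f"student matches teacher on {agreement}/{len(inputs)} inputs")
```

In a real setup the teacher's soft outputs are often precomputed once and cached, since the teacher stays frozen throughout student training.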
Interview questions & answers
Why train on soft labels instead of just the hard labels?
Soft labels carry "dark knowledge" — the teacher's relative probabilities across classes (cat is closer to dog than to car). That's a far richer, lower-variance signal than a single one-hot label, so the student learns more from each example.
What does the temperature do in distillation?
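Dividing logits by T > 1 softens the distributions, spreading mass onto runner-up classes so their relative structure is visible. Gradients of the softened KL term shrink roughly as 1/T², so the KD loss is multiplied by T² to keep its contribution comparable to the hard-label term when T changes.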
What's in the distillation loss?
Usually a weighted sum: KL divergence between the softened student and teacher distributions (transfers behavior) plus standard cross-entropy on the ground-truth labels (keeps the student correct).
How does distillation relate to quantization and pruning?
All three make models cheaper but differently: distillation trains a smaller model to mimic a bigger one; quantization lowers weight precision; pruning removes weights. They're complementary and often combined for deployment.
Common pitfalls
- Using only hard labels — you throw away the teacher's dark knowledge.
- Forgetting to scale the KD term by T²: its gradients shrink roughly as 1/T², so the hard-label term dominates as T grows.
- Expecting a tiny student to fully match a much larger teacher on hard tasks.
Where it shows up
- DistilBERT, TinyBERT, many "small but strong" production models
- Distilling large LLM outputs into smaller deployable chat models
- Edge / on-device deployment (with quantization + LoRA)