🧠 LoRA / PEFT Fine-Tuning — AI / ML Interview Guide
Production & Training · interactive visualization + interview prep
What it is
Fine-tuning a full model updates billions of weights — expensive and storage-heavy. LoRA (a Parameter-Efficient Fine-Tuning method) FREEZES the base weights and learns two small low-rank matrices A and B added alongside: ΔW = B·A. You train only A and B — often <1% of the parameters — yet adapt the model’s behavior.
Mental model
The base model is a printed textbook you cannot rewrite. LoRA adds thin sticky notes in the margins — two small matrices whose product B·A is the change — that are cheap to write, removable, and swappable per task. You train ONLY the notes; the textbook stays frozen and shared across every task. Want a different behavior? Peel off one set of notes and stick on another, same book underneath.
Theory
Full fine-tuning updates all of a model's billions of weights, which is expensive to train and forces you to store an entire new copy of the model PER task. Parameter-Efficient Fine-Tuning (PEFT) avoids this; LoRA (Low-Rank Adaptation) is the dominant method.
LoRA freezes the pretrained weight W and learns a low-rank UPDATE alongside it: the effective weight is W + ΔW where ΔW = B·A, with A of shape (r×d) and B of shape (d×r) for a small rank r ≪ d. Only A and B are trained — 2·d·r parameters versus d² for full fine-tuning, often well under 1% of the model.
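The shapes and the parameter count above can be checked with a minimal NumPy sketch. The sizes d=512 and r=8, the random weights, and the helper name `lora_forward` are illustrative assumptions, not part of any library. B starts at zero so that ΔW = B·A is zero before training, which is the initialization described in the original LoRA work; the α/r scaling factor that implementations usually apply is omitted here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 512, 8                              # illustrative sizes, r << d

W = rng.standard_normal((d, d))            # frozen pretrained weight (never updated)
A = rng.standard_normal((r, d)) * 0.01     # trainable, shape (r, d)
B = np.zeros((d, r))                       # trainable, shape (d, r); zero init so B @ A starts at 0

def lora_forward(x, W, A, B):
    """Compute (W + B @ A) @ x without materializing the d x d update."""
    return W @ x + B @ (A @ x)

x = rng.standard_normal(d)
y = lora_forward(x, W, A, B)

trainable = A.size + B.size                # 2 * d * r
full = W.size                              # d * d
print(trainable, full, trainable / full)   # 8192 262144 0.03125
```

Computing B @ (A @ x) instead of (B @ A) @ x keeps the extra cost at O(d·r) per input rather than O(d²).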
Why can a low-rank update suffice? Empirically the weight change needed to adapt a pretrained model to a new task has low "intrinsic rank" — it lives in a small subspace — so a rank-r B·A can capture most of it without touching the vast majority of parameters. The rank r is the capacity dial: too low underfits, too high loses the efficiency and can overfit (common r = 8–64).
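To make the "capacity dial" concrete, here is a small sketch on synthetic data. It assumes a hypothetical ideal update that is exactly rank 4 plus a little noise, then measures how well the best rank-r factorization B·A (obtained from a truncated SVD) reproduces it. This illustrates the mechanics only; it does not demonstrate the empirical claim about real pretrained models.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank = 64, 4

# Synthetic "ideal" update: rank 4 plus small noise (an assumption for illustration).
delta = rng.standard_normal((d, true_rank)) @ rng.standard_normal((true_rank, d))
delta += 0.01 * rng.standard_normal((d, d))

U, S, Vt = np.linalg.svd(delta, full_matrices=False)

def best_rank_r(r):
    """Best rank-r approximation of delta (truncated SVD), factored as B @ A."""
    B = U[:, :r] * S[:r]     # shape (d, r)
    A = Vt[:r, :]            # shape (r, d)
    return B @ A

errors = {}
for r in (1, 2, 4, 8):
    approx = best_rank_r(r)
    errors[r] = np.linalg.norm(delta - approx) / np.linalg.norm(delta)
    print(r, round(float(errors[r]), 4))
```

The relative error drops sharply once r reaches the true rank of the update, and raising r further buys very little, which is the trade-off the rank setting controls.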
A key practical property: at inference you can MERGE the adapter, W ← W + B·A, folding it into the base weights so there is zero added latency. Alternatively, keep it separate and HOT-SWAP different adapters on one shared frozen base, serving many per-customer or per-task variants from a single model in memory. Each adapter is small, typically megabytes to tens of megabytes rather than the gigabytes of a full model copy.
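Both serving modes can be sketched in a few lines of NumPy. The adapter names "support" and "legal", the helper functions, and the random adapter values are hypothetical stand-ins for adapters that would come from training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2
W = rng.standard_normal((d, d))            # shared frozen base weight

def make_adapter():
    # Random stand-ins for trained adapters (hypothetical values).
    return rng.standard_normal((d, r)), rng.standard_normal((r, d))   # (B, A)

adapters = {"support": make_adapter(), "legal": make_adapter()}

def forward_with_adapter(x, name):
    """Hot-swap path: the base stays untouched, the adapter is chosen per request."""
    B, A = adapters[name]
    return W @ x + B @ (A @ x)

def merge(name):
    """Merge path: fold B @ A into a copy of W, leaving one plain dense weight."""
    B, A = adapters[name]
    return W + B @ A

x = rng.standard_normal(d)
W_support = merge("support")
same = np.allclose(W_support @ x, forward_with_adapter(x, "support"))
differs = not np.allclose(forward_with_adapter(x, "support"),
                          forward_with_adapter(x, "legal"))
print(same, differs)
```

The merged weight gives the same output as the separate adapter path, so merging changes cost and not behavior. The trade-off is that a merged model serves one adapter, while the unmerged path can serve many from one base.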
LoRA composes with quantization: QLoRA fine-tunes LoRA adapters on top of a 4-bit quantized frozen base, combining quantization's memory savings with LoRA's cheap training so you can fine-tune large models on a single GPU. For good results, adapt enough layers — typically the attention and MLP projections, not just one.
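The idea of a low-precision frozen base with full-precision adapters can be sketched with a toy quantizer. Note the assumptions: this uses simple per-row absmax rounding to 4-bit integer levels, which is not the NF4 data type or the block-wise scheme that QLoRA actually uses, and the values are stored in int8 for readability rather than packed two per byte. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.standard_normal((d, d)).astype(np.float32)

def quantize_4bit(W):
    """Toy per-row absmax quantization to integer levels in [-7, 7]."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

q, scale = quantize_4bit(W)                # frozen, low-precision base
A = (rng.standard_normal((r, d)) * 0.01).astype(np.float32)   # trainable, full precision
B = np.zeros((d, r), dtype=np.float32)                         # trainable, full precision

def qlora_forward(x):
    # Dequantize the base on the fly, then add the low-rank path.
    return dequantize(q, scale) @ x + B @ (A @ x)

x = rng.standard_normal(d).astype(np.float32)
y = qlora_forward(x)
quant_error = float(np.abs(dequantize(q, scale) - W).max())
print(y.shape, quant_error)
```

Only A and B would receive gradients; the quantized base is read-only, which is where the memory saving comes from.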
Concrete example
To specialize a 7B model on your support tickets, full fine-tuning means storing a whole new 7B model. With LoRA you train ~10M adapter params and ship a small adapter file (about 20 MB at 16-bit precision), and you can hot-swap different adapters on the same frozen base.
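A back-of-the-envelope sizing for this example, as a sketch. The model shape is an assumption (32 layers, hidden size 4096, four square attention projections adapted per layer, rank 8, 2 bytes per parameter); other shapes and ranks give different totals, and `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(num_layers, hidden, rank, matrices_per_layer):
    """Adapter parameters when every adapted matrix is hidden x hidden."""
    per_matrix = 2 * hidden * rank          # A is (r x d), B is (d x r)
    return num_layers * matrices_per_layer * per_matrix

# Assumed shape of a 7B-class model; not taken from any specific checkpoint.
adapter_params = lora_param_count(num_layers=32, hidden=4096, rank=8, matrices_per_layer=4)
base_params = 7_000_000_000
bytes_per_param = 2                         # 16-bit storage

adapter_mb = adapter_params * bytes_per_param / 1e6
base_gb = base_params * bytes_per_param / 1e9
percent = 100 * adapter_params / base_params
print(adapter_params, round(adapter_mb, 1), base_gb, round(percent, 2))
# 8388608 16.8 14.0 0.12
```

Under these assumptions the adapter is about 8.4M parameters and roughly 17 MB, against about 14 GB for a full copy of the base.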
Key equations
frozen base weight W (d×d); output uses W + ΔW
ΔW = B·A with A (r×d), B (d×r), rank r ≪ d
trainable params = 2·d·r vs d·d for full fine-tuning
e.g. d=512, r=8 → 8,192 vs 262,144 params (~3%)
at inference you can MERGE: W ← W + B·A (no extra latency)
Step by step
- Freeze the pretrained base weights W (no gradients).
- Add low-rank adapters A and B beside them (rank r).
- Train ONLY A and B — a tiny fraction of the parameters.
- Optionally merge B·A back into W for zero inference overhead.
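The four steps above can be sketched end to end on a toy regression problem with hand-written gradients. Everything here is an illustrative assumption: the sizes, the synthetic rank-r target shift `D_true`, the learning rate, and the step count. A real setup would use an autograd framework and an optimizer such as Adam.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 8, 2, 64

W = rng.standard_normal((d, d))                  # step 1: frozen base, never updated
W_before = W.copy()

# Hypothetical task: the target behaves like the base plus a rank-r shift.
D_true = (0.3 * rng.standard_normal((d, r))) @ (0.3 * rng.standard_normal((r, d)))
X = rng.standard_normal((d, n))
Y = (W + D_true) @ X

A = 0.01 * rng.standard_normal((r, d))           # step 2: add adapters beside W
B = np.zeros((d, r))

def loss_and_grads(A, B):
    E = W @ X + B @ (A @ X) - Y                  # residual, shape (d, n)
    loss = 0.5 * np.sum(E ** 2) / n
    grad_B = E @ (A @ X).T / n                   # dL/dB
    grad_A = B.T @ E @ X.T / n                   # dL/dA
    return loss, grad_A, grad_B

initial_loss, _, _ = loss_and_grads(A, B)
lr = 0.05
for _ in range(3000):                            # step 3: update ONLY A and B
    _, grad_A, grad_B = loss_and_grads(A, B)
    A -= lr * grad_A
    B -= lr * grad_B
final_loss, _, _ = loss_and_grads(A, B)

W_merged = W + B @ A                             # step 4: optional merge
print(final_loss < initial_loss, np.array_equal(W, W_before))
```

No gradient is ever computed for W, so the base is bit-for-bit unchanged after training; only the 2·d·r adapter values move.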
Interview questions & answers
Why does low rank work for fine-tuning?
The weight UPDATE needed to adapt a model tends to have low “intrinsic rank”: it lies in a small subspace, so a rank-r B·A can capture most of it without touching most parameters.
LoRA vs full fine-tuning?
Full FT updates all weights (max flexibility, huge cost/storage). LoRA trains a tiny adapter (cheap, swappable, small files) with comparable quality on many tasks; the base stays shared and frozen.
What does the rank r control?
Capacity of the adapter: higher r = more trainable params = more expressive but larger and more prone to overfit; lower r = cheaper but may underfit. Common r = 8–64.
What is QLoRA?
LoRA on top of a 4-bit QUANTIZED frozen base — combines quantization’s memory savings with LoRA’s cheap training, so you can fine-tune large models on a single GPU.
Common pitfalls
- Setting rank too high — loses the efficiency point and can overfit.
- Adapting too few layers: the model under-adapts; the attention and MLP projections usually both need adapters.
- Forgetting you can merge adapters to remove inference overhead.
Where it shows up
- PEFT / LoRA / QLoRA fine-tuning
- Per-customer / per-task adapters on a shared base
- Stable Diffusion LoRAs for styles