🧠 LoRA / PEFT Fine-Tuning — AI / ML Interview Guide

What it is

Fine-tuning a full model updates billions of weights — expensive and storage-heavy. LoRA (a Parameter-Efficient Fine-Tuning method) FREEZES the base weights and learns two small low-rank matrices A and B added alongside: ΔW = B·A. You train only A and B — often <1% of the parameters — yet adapt the model’s behavior.

Mental model

The base model is a printed textbook you cannot rewrite. LoRA adds thin sticky notes in the margins — two small matrices whose product B·A is the change — that are cheap to write, removable, and swappable per task. You train ONLY the notes; the textbook stays frozen and shared across every task. Want a different behavior? Peel off one set of notes and stick on another, same book underneath.

Theory

Full fine-tuning updates all of a model's billions of weights, which is expensive to train and forces you to store an entire new copy of the model PER task. Parameter-Efficient Fine-Tuning (PEFT) avoids this by training only a small set of extra parameters; LoRA (Low-Rank Adaptation) is the most widely used PEFT method.

LoRA freezes the pretrained weight W and learns a low-rank UPDATE alongside it: the effective weight is W + ΔW where ΔW = B·A, with A of shape (r×d) and B of shape (d×r) for a small rank r ≪ d. (W is taken as d×d here for simplicity; for a d_out×d_in weight, A is r×d_in and B is d_out×r.) Only A and B are trained: 2·d·r parameters versus d² for fully fine-tuning that matrix, often well under 1% of the model.
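A minimal NumPy sketch of that forward pass may help fix the shapes. The sizes, the random data, and the zero initialization of B are illustrative assumptions rather than anything this guide prescribes; the point is only that the low-rank path can be applied as two thin matrix products without ever forming the d×d matrix B·A.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                              # hidden size and rank (illustrative values)

W = rng.normal(size=(d, d))               # frozen pretrained weight, never updated
A = 0.01 * rng.normal(size=(r, d))        # trainable, shape (r x d)
B = np.zeros((d, r))                      # trainable, shape (d x r); zero init is an assumption

def lora_forward(x, W, A, B):
    """Apply (W + B @ A) to x without materializing the d x d product B @ A."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
y = lora_forward(x, W, A, B)

# With B = 0 the adapter contributes nothing, so the output equals the base model's.
base_matches = np.allclose(y, W @ x)
```

Starting B at zero means the adapted model begins exactly at the pretrained behavior and only drifts away from it as training updates A and B.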

Why can a low-rank update suffice? Empirically the weight change needed to adapt a pretrained model to a new task has low "intrinsic rank" — it lives in a small subspace — so a rank-r B·A can capture most of it without touching the vast majority of parameters. The rank r is the capacity dial: too low underfits, too high loses the efficiency and can overfit (common r = 8–64).
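To make "lives in a small subspace" concrete, here is a small sketch on SYNTHETIC data: an update matrix built to be nearly rank 4 is approximated by its best rank-8 factorization via truncated SVD. This is an illustration of low-rank structure only. The matrix is made up, not measured from a real model, and LoRA itself learns B and A by gradient descent rather than by SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
d, true_rank, r = 64, 4, 8

# Synthetic "ideal" update: exactly rank 4 plus small dense noise (an assumption, not model data).
delta_W = rng.normal(size=(d, true_rank)) @ rng.normal(size=(true_rank, d))
delta_W += 0.01 * rng.normal(size=(d, d))

# Best rank-r approximation in Frobenius norm comes from the top r singular triplets.
U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)
B = U[:, :r] * S[:r]      # (d x r), singular values folded into B
A = Vt[:r, :]             # (r x d)

# Fraction of the update's norm that the rank-r factors fail to capture.
rel_error = np.linalg.norm(delta_W - B @ A) / np.linalg.norm(delta_W)
```

When the update really is close to low rank, rel_error is tiny even though B and A hold far fewer numbers than delta_W; if the true update were full rank, the same r would leave a large residual, which is the underfitting side of the capacity dial.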

A key practical property: at inference you can MERGE the adapter, W ← W + B·A, folding it into the base weights so there is zero added latency. Alternatively, keep it separate and HOT-SWAP different adapters on one shared frozen base, serving many per-customer or per-task variants from a single model in memory at the cost of two small extra matrix products per adapted layer. Each adapter is small, typically a few MB to a few tens of MB depending on the rank and how many layers are adapted.
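Both serving modes can be sketched in a few lines of NumPy. The adapter names and random weights below are hypothetical; the sketch only checks that merging gives the same output as the separate low-rank path, and that swapping adapters changes behavior while W stays shared.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4
W = rng.normal(size=(d, d))               # shared frozen base weight

# Two hypothetical adapters stored as (B, A) pairs, e.g. one per customer.
adapters = {
    "support": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "legal":   (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def forward_unmerged(x, W, adapter):
    """Hot-swap mode: base path plus a separate low-rank path."""
    B, A = adapter
    return W @ x + B @ (A @ x)

def merge(W, adapter):
    """Merge mode: fold the adapter into a single d x d weight."""
    B, A = adapter
    return W + B @ A

x = rng.normal(size=d)

W_support = merge(W, adapters["support"])
same_output = np.allclose(W_support @ x, forward_unmerged(x, W, adapters["support"]))

# Same base, different adapter: only the small (B, A) pair is switched.
outputs = {name: forward_unmerged(x, W, ad) for name, ad in adapters.items()}
```

The trade-off is that a merged weight serves exactly one adapter with no overhead, while the unmerged form keeps one copy of W and pays for the extra low-rank products on every request.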

LoRA composes with quantization: QLoRA fine-tunes LoRA adapters on top of a 4-bit quantized frozen base, combining quantization's memory savings with LoRA's cheap training so you can fine-tune large models on a single GPU. For good results, adapt enough layers — typically the attention and MLP projections, not just one.
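The structure of that idea can be sketched as follows. Note the substitution: this uses plain per-row absmax rounding to 4-bit integer levels, which is a simplification and NOT the NF4 data type QLoRA actually uses, and it stores the codes in int8 rather than packing two per byte. All names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.normal(size=(d, d)).astype(np.float32)    # pretrained weight before quantization

def quantize_4bit(W):
    """Per-row symmetric absmax quantization to integer levels in [-7, 7]."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(W / scale), -7, 7).astype(np.int8)  # int8 container for 4-bit codes
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

q, scale = quantize_4bit(W)                       # frozen, low-precision base

# Adapters stay in higher precision and are the only trainable tensors.
A = (0.01 * rng.normal(size=(r, d))).astype(np.float32)
B = np.zeros((d, r), dtype=np.float32)

def qlora_forward(x, q, scale, A, B):
    """Dequantize the frozen base on the fly, then add the low-rank path."""
    return dequantize(q, scale) @ x + B @ (A @ x)

x = rng.normal(size=d).astype(np.float32)
y = qlora_forward(x, q, scale, A, B)
quant_error = np.abs(dequantize(q, scale) - W).max()
```

The base weights carry a small rounding error (at most half a quantization step per entry in this scheme), and the adapters are trained on top of that approximate base.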

Concrete example

To specialize a 7B model on your support tickets, full fine-tuning means storing a whole new 7B model (about 14 GB at 16-bit precision). With LoRA you train ~10M adapter params and ship an adapter file of roughly 20 MB at the same precision, and you can hot-swap different adapters on the same frozen base.
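The storage arithmetic behind this example is easy to check. The sketch below assumes 2 bytes per parameter (16-bit weights) and a hypothetical fleet of 20 tasks; both numbers are assumptions chosen for illustration.

```python
BYTES_PER_PARAM = 2                       # assumption: 16-bit weights

def size_mb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Storage in megabytes (10^6 bytes) for a given parameter count."""
    return num_params * bytes_per_param / 1e6

base_params = 7_000_000_000               # the 7B base model
adapter_params = 10_000_000               # the ~10M-parameter adapter

full_copy_mb = size_mb(base_params)       # one full fine-tuned copy
adapter_mb = size_mb(adapter_params)      # one LoRA adapter

num_tasks = 20                            # hypothetical number of task variants
full_ft_total_mb = num_tasks * full_copy_mb
lora_total_mb = full_copy_mb + num_tasks * adapter_mb   # one shared base plus adapters
```

Under these assumptions a full copy is 14,000 MB and an adapter is 20 MB, so 20 task variants cost 280,000 MB with full fine-tuning versus 14,400 MB with one shared base plus adapters.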

Key equations

frozen base weight W (d×d); output uses W + ΔW
ΔW = B·A with A (r×d), B (d×r), rank r ≪ d
trainable params = 2·d·r vs d·d for full fine-tuning
e.g. d=512, r=8 → 8,192 vs 262,144 params (~3%)
at inference you can MERGE: W ← W + B·A (no extra latency)
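The parameter-count line can be reproduced with a tiny helper (the function name is made up for this sketch). Because 2·d·r / d² simplifies to 2r/d, the trainable fraction grows linearly with the rank.

```python
def lora_param_counts(d, r):
    """Return (LoRA params, full fine-tuning params, ratio) for one d x d weight."""
    lora = 2 * d * r          # A is (r x d), B is (d x r)
    full = d * d
    return lora, full, lora / full

lora, full, ratio = lora_param_counts(d=512, r=8)
# lora == 8192, full == 262144, ratio == 0.03125

# Ratio for the common rank range at d = 512; each entry equals 2 * r / 512.
ratios_by_rank = {r: lora_param_counts(512, r)[2] for r in (8, 16, 32, 64)}
```

At d = 512 the fraction is about 3% for r = 8 and reaches 25% for r = 64; in real models d is in the thousands, which is why the trainable share usually lands well under 1%.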

Step by step

Freeze the pretrained base weights W (no gradients).
Add low-rank adapters A and B beside them (rank r).
Train ONLY A and B — a tiny fraction of the parameters.
Optionally merge B·A back into W for zero inference overhead.
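The four steps above can be sketched end to end in NumPy on a toy regression problem. Everything here is an assumption made for illustration: the synthetic rank-2 target, the squared-error loss, the hand-written gradients, and the learning rate. Real training would use an autodiff framework on a real task.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 16, 2, 256

# Step 1: frozen base weight. It is read but never written during training.
W = rng.normal(size=(d, d))
W_before = W.copy()

# Synthetic task: targets come from W plus a hidden rank-r shift (hypothetical).
delta_true = (0.3 * rng.normal(size=(d, r))) @ (0.3 * rng.normal(size=(r, d)))
X = rng.normal(size=(n, d))
Y = X @ (W + delta_true).T

# Step 2: add the adapters. B starts at zero so training begins at the base model.
A = rng.normal(size=(r, d)) / np.sqrt(d)
B = np.zeros((d, r))

def loss_and_grads(A, B):
    E = X @ W.T + (X @ A.T) @ B.T - Y      # residual, shape (n x d)
    loss = 0.5 * np.sum(E ** 2) / n
    G = E.T @ X / n                        # gradient w.r.t. the product B @ A
    return loss, G @ A.T, B.T @ G          # gradients for B and for A

initial_loss, _, _ = loss_and_grads(A, B)

# Step 3: gradient descent on A and B only.
lr = 0.05
for _ in range(2000):
    _, grad_B, grad_A = loss_and_grads(A, B)
    B -= lr * grad_B
    A -= lr * grad_A

final_loss, _, _ = loss_and_grads(A, B)

# Step 4 (optional): merge for inference.
W_merged = W + B @ A
```

Comparing final_loss with initial_loss shows how much of the hidden shift the adapters recovered, while W_before confirms the base weights were left untouched.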

Interview questions & answers

Why does low rank work for fine-tuning?

The weight UPDATE needed to adapt a model tends to have low “intrinsic rank”: it lives in a small subspace. So a rank-r B·A can capture most of it without touching most parameters.

LoRA vs full fine-tuning?

Full FT updates all weights (max flexibility, huge cost/storage). LoRA trains a tiny adapter (cheap, swappable, small files) and reaches comparable quality on many tasks; the base stays shared and frozen.

What does the rank r control?

Capacity of the adapter: higher r = more trainable params = more expressive, but larger and more prone to overfitting; lower r = cheaper but may underfit. Common r = 8–64.

What is QLoRA?

LoRA on top of a 4-bit QUANTIZED frozen base — combines quantization’s memory savings with LoRA’s cheap training, so you can fine-tune large models on a single GPU.

Common pitfalls

Setting rank too high: gives up the efficiency advantage and can overfit.
Adapting too few layers: the model under-adapts; the attention and MLP projections usually both need adapters.
Forgetting you can merge adapters to remove inference overhead.

Where it shows up

PEFT / LoRA / QLoRA fine-tuning
Per-customer / per-task adapters on a shared base
Stable Diffusion LoRAs for styles
