🧠 Evals & LLM-as-Judge — AI / ML Interview Guide
What it is
You can’t improve what you don’t measure. Evals run a model over a fixed test set and SCORE each output. For open-ended tasks where exact-match fails, an LLM-as-JUDGE grades each answer against a rubric (or compares two answers), giving a repeatable quality score you can track across model versions.
Mental model
Unit tests for prose. Each test case is a prompt plus an expectation; the "assertion" is a JUDGE, often an LLM grading against a rubric, that returns pass/fail or a score. You get green/red per case plus an aggregate you watch across versions, so a prompt tweak or model upgrade is shown to help (or caught as a regression) BEFORE it ships, instead of being discovered by users afterwards.
Theory
An eval runs a fixed test set through the model to produce a repeatable quality metric that you can track across prompt changes and model versions. It is the difference between "this feels better" and "pass rate went from 71% to 84%".
The scoring method depends on the task. Exact-match works for closed answers but is too brittle for open-ended generation, where many wordings are valid. There you score against a rubric, and to apply that rubric at scale you use LLM-AS-JUDGE: another model grades each answer against the criteria (pointwise, as pass/fail or a 1–5 score) or compares two answers (pairwise, aggregated into a win-rate). It is far cheaper than human review and correlates reasonably with human raters, but only to the degree you have verified on your own task.
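To make the contrast concrete, here is a minimal Python sketch of two of the scoring styles above: exact-match for closed answers and a win-rate aggregate for pairwise verdicts. The function names and the "A"/"B"/"tie" verdict labels are illustrative assumptions, and counting a tie as half a win is one common convention rather than the only one.

```python
from typing import Literal

# Hypothetical verdict labels for a pairwise comparison of answers A and B.
Verdict = Literal["A", "B", "tie"]


def exact_match(answer: str, expected: str) -> bool:
    """Closed-answer scoring: compare after trimming whitespace and case-folding."""
    return answer.strip().casefold() == expected.strip().casefold()


def win_rate(verdicts: list[Verdict], candidate: Verdict = "A") -> float:
    """Pairwise aggregate: share of comparisons won, with a tie counted as half a win."""
    if not verdicts:
        raise ValueError("no verdicts to aggregate")
    wins = sum(1.0 for v in verdicts if v == candidate)
    ties = sum(0.5 for v in verdicts if v == "tie")
    return (wins + ties) / len(verdicts)


# Exact match handles a closed answer...
print(exact_match(" Paris ", "paris"))  # True
# ...but rejects a valid paraphrase of an open-ended one.
print(exact_match("You can return it within 30 days.",
                  "Returns are accepted for 30 days."))  # False
print(win_rate(["A", "B", "A", "tie"]))  # 0.625
```

The second call is the brittleness in action: both sentences state the same policy, yet string comparison marks the answer wrong, which is the gap a rubric-based judge is meant to close.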
But the judge is itself a model with biases you must control: POSITION bias (favoring an answer because of where it is shown, typically the first slot), LENGTH/verbosity bias (favoring longer answers), and SELF-PREFERENCE (rating its own outputs and style higher), plus run-to-run inconsistency. Mitigations: explicit rubrics, randomizing answer order, sampling multiple times and aggregating the verdicts, and CALIBRATING the judge against human labels before trusting it.
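One way to control position bias is a both-orders variant of the order randomization mentioned above: ask the judge twice with the answers swapped and accept a winner only when both orderings agree. The sketch below assumes a hypothetical judge callable that returns "first", "second", or "tie"; the two toy judges are stand-ins for a real LLM call and exist only to show the behavior.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> verdict,
# where the verdict is "first", "second", or "tie".
PairJudge = Callable[[str, str, str], str]


def debiased_compare(judge: PairJudge, prompt: str, a: str, b: str) -> str:
    """Query the judge in both orders and keep only order-consistent verdicts."""
    forward = judge(prompt, a, b)   # answer a is shown first
    backward = judge(prompt, b, a)  # answer b is shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"  # disagreement across orderings is treated as no preference


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias, for demonstration only."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge that ignores order (but is length-biased)."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(debiased_compare(always_first, "q", "short", "a longer answer"))  # tie
print(debiased_compare(longer_wins, "q", "short", "a longer answer"))   # B
```

The purely position-biased judge is neutralized to a tie, while the second judge still picks the longer answer in both orders. Swapping removes position effects only; length bias needs its own control, such as a rubric that penalizes padding.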
Offline and online evaluation are complementary. Offline evals run on a fixed set pre-ship — fast, controlled, reproducible. Online metrics (A/B tests, user thumbs, task-success rates) measure real impact in production. You need both: offline to gate releases, online to confirm they actually helped.
Evals decay if untended. Version the eval set, guard against train/test leakage, refresh cases as the product evolves, and periodically re-validate the judge. And beware Goodhart's law: once a metric becomes the target, teams optimize the EVAL rather than real quality — so treat the score as a proxy, not the goal.
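Versioning and leakage checks can start very small. The sketch below fingerprints an eval set so every reported score can be tied to an exact set of cases, and flags eval prompts that appear verbatim in training data. The case layout (dicts with a "prompt" key), the function names, and the 12-character hash length are assumptions for illustration; exact-string matching will miss paraphrased leaks.

```python
import hashlib
import json


def eval_set_fingerprint(cases: list[dict]) -> str:
    """Stable short hash of the eval set; it changes whenever any case changes."""
    # Serialize each case with sorted keys, then sort the cases themselves,
    # so that neither key order nor case order alters the hash.
    canonical = sorted(json.dumps(c, sort_keys=True, ensure_ascii=False) for c in cases)
    digest = hashlib.sha256("\n".join(canonical).encode("utf-8"))
    return digest.hexdigest()[:12]


def leaked_prompts(cases: list[dict], training_prompts: set[str]) -> list[str]:
    """Naive leakage check: eval prompts that appear verbatim in training data."""
    return [c["prompt"] for c in cases if c["prompt"] in training_prompts]


v1 = [
    {"prompt": "What is the refund window?", "criteria": "states 30 days"},
    {"prompt": "How do I reset my password?", "criteria": "mentions reset link"},
]
v2 = v1 + [{"prompt": "Can I change my plan?", "criteria": "explains upgrade path"}]

print(eval_set_fingerprint(v1) == eval_set_fingerprint(list(reversed(v1))))  # True
print(eval_set_fingerprint(v1) == eval_set_fingerprint(v2))                  # False
print(leaked_prompts(v1, {"How do I reset my password?"}))
```

Logging the fingerprint next to each score keeps a metric from being compared silently across two different versions of the set.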
Concrete example
Testing a support bot: 200 real questions with expected behaviors. A judge LLM marks each answer pass/fail with a reason ("missed the refund window"). The aggregate pass rate tells you if a prompt change or model upgrade actually helped — before shipping.
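The aggregate alone can hide movement underneath it, so it helps to compare runs case by case. The sketch below assumes each run is stored as a mapping from a case id to pass/fail; the ids, the function name, and the returned keys are hypothetical.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Compare per-case pass/fail results for two runs over the same eval set."""
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("the two runs share no case ids")
    return {
        "baseline_pass_rate": sum(baseline[k] for k in shared) / len(shared),
        "candidate_pass_rate": sum(candidate[k] for k in shared) / len(shared),
        # Cases that passed before and fail now: the ones to inspect before shipping.
        "regressions": [k for k in shared if baseline[k] and not candidate[k]],
        # Cases that failed before and pass now.
        "fixes": [k for k in shared if not baseline[k] and candidate[k]],
    }


old_prompt = {"q1": True, "q2": False, "q3": True, "q4": False}
new_prompt = {"q1": True, "q2": True, "q3": False, "q4": True}

report = compare_runs(old_prompt, new_prompt)
print(report["baseline_pass_rate"], report["candidate_pass_rate"])  # 0.5 0.75
print(report["regressions"])  # ['q3']
print(report["fixes"])        # ['q2', 'q4']
```

Here the pass rate rises from 50% to 75%, yet q3 regressed. Whether that trade is acceptable is a product decision the aggregate cannot make for you.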
Key equations
eval set: [(prompt, expected/criteria), …]
run the model → answer per prompt
score: exact-match / rubric / LLM-as-judge verdict (pass · fail · 1–5)
aggregate: pass rate, mean score, win-rate vs a baseline
track the metric across versions (regression detection)
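The aggregates named above can be written out explicitly. The notation is illustrative rather than taken from the source: N is the number of cases, s_i the judge's score for case i, W, L and T the pairwise wins, losses and ties against the baseline, and m any one of the aggregate metrics. Counting a tie as half a win is an assumed convention.

```latex
% Illustrative notation: N cases, s_i = score of case i,
% W / L / T = pairwise wins / losses / ties against the baseline.
\[
\text{pass rate} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\left[\, s_i = \text{pass} \,\right],
\qquad
\text{mean score} = \frac{1}{N}\sum_{i=1}^{N} s_i \quad \left(s_i \in \{1,\dots,5\}\right)
\]
\[
\text{win rate} = \frac{W + \tfrac{1}{2}\,T}{W + L + T},
\qquad
\Delta = m(v_{\text{new}}) - m(v_{\text{old}})
\]
```

For a higher-is-better metric, a negative Δ between two versions scored on the same eval set flags a regression.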
Step by step
- Take a fixed set of test prompts with expectations.
- Run the model to get an answer for each.
- A judge scores each answer (for example, pass/fail with a reason).
- Accumulate a running score across the set.
- Compare the aggregate across prompt versions and models to decide what ships.
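The steps above can be sketched as a short loop. Everything named here is an assumption for illustration: the Case and Verdict types, the run_eval signature, and especially toy_model and keyword_judge, which are simple stand-ins for a real model call and a real LLM judge.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    criteria: str  # what a passing answer must contain or do


@dataclass
class Verdict:
    passed: bool
    reason: str


def run_eval(
    cases: list[Case],
    model: Callable[[str], str],
    judge: Callable[[Case, str], Verdict],
) -> tuple[float, list[Verdict]]:
    """Run each case through the model, judge the answer, and return the pass rate."""
    if not cases:
        raise ValueError("eval set is empty")
    verdicts = [judge(case, model(case.prompt)) for case in cases]
    pass_rate = sum(v.passed for v in verdicts) / len(verdicts)
    return pass_rate, verdicts


def toy_model(prompt: str) -> str:
    """Stand-in for the system under test."""
    if "refund" in prompt:
        return "Refunds are available within 30 days."
    return "Please contact support."


def keyword_judge(case: Case, answer: str) -> Verdict:
    """Stand-in for an LLM judge: passes if the criteria text appears in the answer."""
    ok = case.criteria.lower() in answer.lower()
    return Verdict(ok, "meets criteria" if ok else f"missing '{case.criteria}'")


cases = [
    Case("What is the refund window?", "30 days"),
    Case("How do I reset my password?", "reset link"),
]
pass_rate, verdicts = run_eval(cases, toy_model, keyword_judge)
for case, verdict in zip(cases, verdicts):
    print("PASS" if verdict.passed else "FAIL", "|", case.prompt, "|", verdict.reason)
print(f"pass rate: {pass_rate:.0%}")  # pass rate: 50%
```

Because the model and judge are passed in as callables, the same loop can score a new prompt or a new model unchanged, which is what makes the aggregate comparable across versions.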
Interview questions & answers
Why use an LLM as a judge instead of exact match?
Open-ended answers have many valid forms, so string-match is too brittle. An LLM judge evaluates meaning against a rubric, scales cheaply, and correlates reasonably with humans — though you must validate it against human labels.
What are the risks of LLM-as-judge?
Bias (position bias toward the first answer, length/verbosity bias, self-preference for its own outputs) and inconsistency. Mitigate with rubrics, randomized order, multiple samples, and calibration against human judgments.
Offline evals vs online metrics?
Offline evals run on a fixed set before shipping (fast, controlled). Online metrics (A/B tests, user thumbs, task success) measure real impact in production. You need both.
How do you keep evals trustworthy over time?
Version the eval set, guard against train/test leakage, refresh cases as the product evolves, and periodically re-validate the judge against human raters.
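Re-validating the judge can be as simple as having humans label a sample of answers and measuring how often the judge agrees. The sketch below reports raw agreement plus Cohen's kappa, a standard chance-corrected agreement statistic that the guide itself does not name; the function name and the sample labels are made up for illustration.

```python
def agreement_and_kappa(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for two sets of pass/fail labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's own pass rate.
    p_human = sum(human) / n
    p_judge = sum(judge) / n
    expected = p_human * p_judge + (1 - p_human) * (1 - p_judge)
    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)


# Hypothetical labels for ten sampled answers (True = pass).
human_labels = [True, True, True, False, False, True, False, True, True, False]
judge_labels = [True, True, False, False, False, True, True, True, True, False]

observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")  # agreement=0.80 kappa=0.58
```

Raw agreement looks flattering when most answers pass, because two raters who both say "pass" most of the time will often agree by chance. Kappa discounts that, so tracking both on each re-validation gives a fairer picture of whether the judge can still be trusted.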
Common pitfalls
- Trusting the judge without validating it against humans.
- A stale or leaked eval set that no longer reflects real usage.
- Optimizing to the eval (Goodhart’s law) instead of real quality.
Where it shows up
- LLM-as-judge benchmarks and frameworks (MT-Bench, AlpacaEval, Ragas)
- CI evals / regression suites for prompts & models
- Model release decisions