🧠 Evals & LLM-as-Judge — AI / ML Interview Guide

What it is

You can’t improve what you don’t measure. Evals run a model over a fixed test set and SCORE each output. For open-ended tasks where exact-match fails, an LLM-as-JUDGE grades each answer against a rubric (or compares two answers), giving a repeatable quality score you can track across model versions.

Mental model

Unit tests for prose. Each test case is a prompt plus an expectation; the "assertion" is a JUDGE, often an LLM grading against a rubric, that returns pass/fail or a score. You get green/red per case plus an aggregate you watch across versions, so a prompt tweak or model upgrade is shown to help (or caught as a regression) BEFORE it ships, rather than discovered by users afterwards.

Theory

An eval is a fixed test set run through the model to produce a repeatable quality metric that you can track across prompt changes and model versions. It is the difference between "this feels better" and "pass rate went from 71% to 84%".

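The scoring method depends on the task. Exact-match works for closed answers but is too brittle for open-ended generation, where many wordings are valid. There you score against a rubric, applied either by human raters or by an LLM-AS-JUDGE: another model grades each answer against the criteria (pointwise: pass/fail or a 1–5 score) or compares two answers (pairwise: win-rate). A judge is far cheaper and faster than human rating, and on many tasks its verdicts correlate reasonably well with human raters, though that agreement has to be checked rather than assumed.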
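A minimal sketch of why exact-match breaks down on open-ended answers. The `normalize` and `exact_match` helpers and the sample strings are illustrative assumptions, not part of any particular eval library.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(answer: str, expected: str) -> bool:
    """True when the two strings are identical after normalization."""
    return normalize(answer) == normalize(expected)


# Closed answer: normalization is enough to absorb trivial differences.
print(exact_match("Paris.", "paris"))  # True

# Open-ended answer: same meaning, different wording, so the match fails.
print(exact_match(
    "You can get a refund within 30 days.",
    "Refunds are available for 30 days after purchase.",
))  # False
```

The second case is the one that calls for a rubric: both sentences state the same policy, but no amount of string normalization makes them equal.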

But the judge is itself a model with biases you must control: POSITION bias (favoring an answer because of where it appears, often the first one shown), LENGTH/verbosity bias (favoring longer answers), and SELF-PREFERENCE (rating its own style higher), plus run-to-run inconsistency. Mitigations: explicit rubrics, randomizing answer order, sampling the judge several times and aggregating the verdicts, and CALIBRATING the judge against human labels before trusting it.
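One way to control position bias in pairwise comparisons can be sketched as follows. Instead of randomizing the order, this variant judges BOTH orders and only counts a win when the verdict survives the swap. The `Judge` signature and the two toy judges are hypothetical stand-ins for a real LLM call.

```python
from typing import Callable

# Hypothetical judge signature (an assumption for this sketch):
# (prompt, first_answer, second_answer) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def compare_both_orders(judge: Judge, prompt: str, a: str, b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    a_wins_when_first = judge(prompt, a, b) == "first"
    a_wins_when_second = judge(prompt, b, a) == "second"
    if a_wins_when_first and a_wins_when_second:
        return "a"
    if not a_wins_when_first and not a_wins_when_second:
        return "b"
    return "tie"  # the verdict flipped with the order, so it is not trusted


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy stand-in for a fully position-biased judge."""
    return "first"


def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Toy stand-in for a judge with an order-independent length preference."""
    return "first" if len(first) >= len(second) else "second"


print(compare_both_orders(always_first, "q", "short", "a longer answer"))    # tie
print(compare_both_orders(prefers_longer, "q", "short", "a longer answer"))  # b
```

The swap neutralizes the position-biased judge, but the length-biased judge still produces a confident verdict. Order swapping addresses position bias only; verbosity bias and self-preference need the rubric and the human calibration.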

Offline and online evaluation are complementary. Offline evals run on a fixed set pre-ship — fast, controlled, reproducible. Online metrics (A/B tests, user thumbs, task-success rates) measure real impact in production. You need both: offline to gate releases, online to confirm they actually helped.

Evals decay if untended. Version the eval set, guard against train/test leakage, refresh cases as the product evolves, and periodically re-validate the judge. And beware Goodhart's law: once a metric becomes the target, teams optimize the EVAL rather than real quality — so treat the score as a proxy, not the goal.

Concrete example

Testing a support bot: 200 real questions with expected behaviors. A judge LLM marks each answer pass/fail with a reason ("missed the refund window"). The aggregate pass rate tells you if a prompt change or model upgrade actually helped — before shipping.

Key equations

eval set: [(prompt, expected/criteria), …]
run the model → answer per prompt
score: exact-match / rubric / LLM-as-judge verdict (pass/fail or 1–5)
aggregate: pass rate, mean score, win-rate vs a baseline
track the metric across versions (regression detection)
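The aggregates above can be written out explicitly. The symbols are assumptions introduced here: N is the number of cases, v_i the verdict and s_i the 1–5 score for case i, and W and T the number of wins and ties against the baseline. Counting a tie as half a win is one common convention, not the only one, and the standard error assumes the cases are independent.

```latex
\[
\text{pass rate} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}[v_i = \text{pass}],
\qquad
\text{mean score} = \frac{1}{N}\sum_{i=1}^{N} s_i, \quad s_i \in \{1,\dots,5\}
\]
\[
\text{win rate} = \frac{W + \tfrac{1}{2}T}{N}
\]
\[
\Delta = \text{pass rate}_{\text{new}} - \text{pass rate}_{\text{old}},
\qquad
\mathrm{SE}(p) \approx \sqrt{\frac{p(1-p)}{N}}
\]
```

For a sense of scale: with N = 200 cases and a pass rate of p = 0.84, the standard error is roughly 0.026, so a change of one or two points between versions is within noise.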

Step by step

Take a fixed set of test prompts with expectations.
Run the model to get an answer for each.
A judge scores each answer (here: pass/fail with a reason).
Accumulate a running score across the set.
Compare the aggregate across prompt or model versions to decide what ships.
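The steps above can be sketched as a small harness. `Case`, `Verdict`, `run_eval`, and the toy model and judge are hypothetical names for illustration; in practice `model` and `judge` would wrap real LLM calls, and the judge would apply a rubric rather than a substring check.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Case:
    prompt: str
    criteria: str  # what a passing answer must contain or do


@dataclass
class Verdict:
    passed: bool
    reason: str


def run_eval(
    cases: List[Case],
    model: Callable[[str], str],
    judge: Callable[[Case, str], Verdict],
) -> Tuple[float, List[Tuple[Case, str, Verdict]]]:
    """Run every case through the model, judge it, and return the pass rate."""
    if not cases:
        raise ValueError("eval set is empty")
    results = []
    for case in cases:
        answer = model(case.prompt)
        results.append((case, answer, judge(case, answer)))
    passed = sum(verdict.passed for _, _, verdict in results)
    return passed / len(results), results


def toy_model(prompt: str) -> str:
    """Stand-in for the model under test."""
    if "refund" in prompt.lower():
        return "Refunds are accepted within 30 days."
    return "I am not sure."


def toy_judge(case: Case, answer: str) -> Verdict:
    """Stand-in for an LLM judge: passes if the criteria text appears."""
    ok = case.criteria.lower() in answer.lower()
    return Verdict(ok, "ok" if ok else f"missing '{case.criteria}'")


cases = [
    Case("What is the refund window?", "30 days"),
    Case("How do I reset my password?", "reset link"),
]
rate, results = run_eval(cases, toy_model, toy_judge)
print(f"pass rate: {rate:.0%}")  # pass rate: 50%
for case, answer, verdict in results:
    print("PASS" if verdict.passed else "FAIL", "-", case.prompt, "-", verdict.reason)
```

Keeping the per-case reason alongside the aggregate is what makes a failing run actionable: the pass rate says whether something regressed, the reasons say what.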

Interview questions & answers

Why use an LLM as a judge instead of exact match?

Open-ended answers have many valid forms, so string-match is too brittle. An LLM judge evaluates meaning against a rubric, scales cheaply, and correlates reasonably with humans — though you must validate it against human labels.

What are the risks of LLM-as-judge?

Bias (position bias toward the first answer, length/verbosity bias, self-preference for its own outputs) and inconsistency. Mitigate with rubrics, randomized order, multiple samples, and calibration against human judgments.
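Calibration against human judgments can be quantified in several ways. One common choice, sketched here as an assumption rather than a prescribed method, is Cohen's kappa: agreement between judge and human labels on the same cases, corrected for the agreement expected by chance. The label lists are made-up illustration data.

```python
from typing import List


def cohens_kappa(judge_labels: List[str], human_labels: List[str]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    labels = set(judge_labels) | set(human_labels)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every case
    return (observed - expected) / (1 - expected)


human = ["pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail"]
judge = ["pass", "pass", "pass", "fail", "fail", "fail", "fail", "pass"]

print(cohens_kappa(judge, human))  # 0.5
```

Here the judge agrees with the human on 6 of 8 cases (75%), but half of that agreement is expected by chance, so kappa is only 0.5. What threshold counts as good enough to trust the judge is a project decision.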

Offline evals vs online metrics?

Offline evals run on a fixed set before shipping (fast, controlled). Online metrics (A/B tests, user thumbs, task success) measure real impact in production. You need both.
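A minimal sketch of using an offline eval as a release gate. The function names, the `max_drop` tolerance, and the per-case result dictionaries (case id mapped to pass/fail) are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def find_regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Case ids that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        case_id for case_id, ok in baseline.items()
        if ok and not candidate.get(case_id, False)
    )


def gate_release(
    baseline: Dict[str, bool],
    candidate: Dict[str, bool],
    max_drop: float = 0.02,  # tolerated drop in pass rate; an arbitrary example value
) -> Tuple[bool, List[str]]:
    """Return (may_ship, regressed_case_ids)."""
    may_ship = pass_rate(candidate) >= pass_rate(baseline) - max_drop
    return may_ship, find_regressions(baseline, candidate)


baseline = {"q1": True, "q2": True, "q3": False, "q4": True}
candidate = {"q1": True, "q2": False, "q3": True, "q4": True}

may_ship, regressions = gate_release(baseline, candidate)
print(may_ship, regressions)  # True ['q2']
```

The aggregate is unchanged at 75%, so the gate passes, yet q2 regressed. Reporting per-case regressions next to the aggregate keeps a fix in one place from hiding a break in another; the online metrics then confirm whether the shipped change actually helped.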

How do you keep evals trustworthy over time?

Version the eval set, guard against train/test leakage, refresh cases as the product evolves, and periodically re-validate the judge against human raters.
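Versioning the eval set can be as simple as recording a content fingerprint with every score, so a number is never compared across two different sets by accident. This is a sketch under assumed conventions: each case is a dict with a hypothetical `id` field, and `eval_set_fingerprint` is an illustrative helper, not a library function.

```python
import hashlib
import json
from typing import Dict, List


def eval_set_fingerprint(cases: List[Dict[str, str]]) -> str:
    """Short, order-independent hash of the eval set contents."""
    canonical = json.dumps(
        sorted(cases, key=lambda case: case["id"]),
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = [
    {"id": "q1", "prompt": "What is the refund window?", "criteria": "30 days"},
    {"id": "q2", "prompt": "How do I reset my password?", "criteria": "reset link"},
]
# Same cases in a different order: the fingerprint is unchanged.
reordered = list(reversed(v1))
# One criterion edited: the fingerprint changes.
v2 = [dict(v1[0], criteria="14 days"), v1[1]]

print(eval_set_fingerprint(v1) == eval_set_fingerprint(reordered))  # True
print(eval_set_fingerprint(v1) == eval_set_fingerprint(v2))         # False
```

Storing the fingerprint next to each reported pass rate makes it obvious when a score change comes from an edited eval set rather than from the model or prompt.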

Common pitfalls

Trusting the judge without validating it against humans.
A stale or leaked eval set that no longer reflects real usage.
Optimizing to the eval (Goodhart’s law) instead of real quality.

Where it shows up

LLM-as-judge benchmarks and frameworks (MT-Bench, AlpacaEval, Ragas)
CI evals / regression suites for prompts & models
Model release decisions
