LLM Evals & Testing

LLM-as-Judge (Rubric Build + Verdict Parse)

Exact-match, F1, and ROUGE all need a gold answer to compare against. For open-ended questions such as "was this explanation clear?" or "was the tone professional?", there is no single right string. The scalable answer the industry settled on is LLM-as-judge: you give a second model the answer plus a rubric and ask it to score each criterion. On a well-specified rubric it often lands in the 80-90% agreement range with human raters, but this varies a lot by task and can drop sharply on nuanced or adversarial cases, so treat any single number as task-specific rather than a law. Even so, it is usually consistent enough to run on thousands of outputs nightly, as long as you periodically spot-check it against human labels.
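To make "agreement with human raters" concrete, here is a minimal sketch of a spot-check. The function name agreement_rate, the tolerance parameter, and the sample scores are all hypothetical and made up for illustration; they are not part of the graded exercise, and real evaluations may use a different agreement measure.

```python
def agreement_rate(judge_scores, human_scores, tolerance=0):
    """Fraction of items where judge and human scores differ by at most `tolerance`.

    Hypothetical helper: one simple way to define agreement, not the only one.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be the same length")
    if not judge_scores:
        return 0.0
    matches = sum(
        1 for j, h in zip(judge_scores, human_scores) if abs(j - h) <= tolerance
    )
    return matches / len(judge_scores)


# Made-up spot-check sample: five answers scored by the judge and by a human.
judge = [4, 5, 2, 3, 4]
human = [4, 4, 2, 3, 5]

print(agreement_rate(judge, human))               # 0.6 (exact match on 3 of 5)
print(agreement_rate(judge, human, tolerance=1))  # 1.0 (all within one point)
```

Exact agreement and within-one-point agreement can differ a lot on the same data, which is one reason a single headline number should be read with care.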

Here is the honest part, and the reason this lesson is gradable. The judge's actual scoring is a model call, and model calls are non-deterministic, so we never grade them. What you engineer, and what you are graded on, is the deterministic scaffolding around the call: building the judge prompt and parsing the verdict back out. In production the middle step is a real call:

# the ONE non-deterministic step (shown, not graded):
verdict = call_model(build_judge_prompt(answer, rubric))
scores = parse_judge_scores(verdict)

The on-device WebLLM in this course can play the judge live in the browser, but tests never depend on it. They check the prompt you build and the parser you write.
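One way to keep tests independent of a live model is to pass the model call in as a parameter and swap in a canned verdict. The sketch below assumes that approach; run_judge and fake_call_model are hypothetical names, and the builder and parser passed in here are throwaway placeholders, not the graded functions.

```python
import re


def run_judge(answer, rubric, build_prompt, parse_scores, call_model):
    """Wire the three steps together; the model call is injected so tests can replace it."""
    prompt = build_prompt(answer, rubric)
    verdict = call_model(prompt)  # the only non-deterministic step in production
    return parse_scores(verdict)


def fake_call_model(prompt):
    """Hypothetical stand-in for a real model: always returns the same verdict."""
    return "score: 4/5\nscore: 2/5"


# Placeholder builder and parser, just enough to exercise the wiring.
scores = run_judge(
    "Paris is the capital of France.",
    ["Is it factually correct?", "Is it concise?"],
    build_prompt=lambda answer, rubric: f"{rubric}\n{answer}",
    parse_scores=lambda text: [int(n) for n in re.findall(r"score:\s*(\d+)/5", text)],
    call_model=fake_call_model,
)
print(scores)  # [4, 2]
```

Because the fake always returns the same text, any test failure points at the prompt builder or the parser rather than at model randomness.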

The prompt has to pin the judge down: state the role, list the rubric criteria as numbered items, show the answer, and demand a machine-readable verdict so you can parse it. The standard trick is to force one line per criterion in a fixed shape, here 'score: N/5':

You are a strict grader. Score the answer against the rubric.

RUBRIC:
1. Is it factually correct?
2. Is it concise?

ANSWER:
<the answer text>

For each rubric item, reply on its own line as 'score: N/5'.
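A prompt of this shape could be assembled along the following lines. This is a minimal sketch, assuming the rubric arrives as a list of strings and that the wording of the example prompt is acceptable as-is; the exact whitespace and phrasing the lesson's tests expect are not specified here.

```python
def build_judge_prompt(answer, rubric):
    """Build a judge prompt with numbered rubric items, the answer, and a score directive."""
    # Number the rubric items starting at 1: "1. ...", "2. ...", and so on.
    numbered = "\n".join(f"{i}. {item}" for i, item in enumerate(rubric, start=1))
    return (
        "You are a strict grader. Score the answer against the rubric.\n\n"
        f"RUBRIC:\n{numbered}\n\n"
        f"ANSWER:\n{answer}\n\n"
        "For each rubric item, reply on its own line as 'score: N/5'."
    )


prompt = build_judge_prompt(
    "Paris is the capital of France.",
    ["Is it factually correct?", "Is it concise?"],
)
print(prompt)
```

Keeping the directive as the last line puts the output format closest to where the judge starts writing, which is a common prompt-layout choice rather than a requirement.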

Parsing is then a regex sweep for every 'score: N/5' line (case-insensitive), pulling out the integers. Aggregation turns a list of scores into a mean and a pass-rate, the fraction of scores at or above a threshold. The pass-rate is what you alert on: a single bad answer in a batch is noise, but a sliding pass-rate is a regression.
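The parse and aggregate steps can be sketched as below. The regex details are assumptions: it tolerates optional whitespace around the colon and slash, and it does not reject out-of-range values such as 7/5. Returning 0.0 for both fields on an empty list is one reading of "returns zeros".

```python
import re

# Case-insensitive match for "score: N/5"; the whitespace tolerance is an assumption.
SCORE_RE = re.compile(r"score\s*:\s*(\d+)\s*/\s*5", re.IGNORECASE)


def parse_judge_scores(text):
    """Return the integer N from every 'score: N/5' occurrence, in order."""
    return [int(n) for n in SCORE_RE.findall(text)]


def aggregate(scores, threshold=3):
    """Return the mean and the fraction of scores >= threshold, rounded to 4 places."""
    if not scores:
        return {"mean": 0.0, "pass_rate": 0.0}
    mean = sum(scores) / len(scores)
    pass_rate = sum(1 for s in scores if s >= threshold) / len(scores)
    return {"mean": round(mean, 4), "pass_rate": round(pass_rate, 4)}


# Made-up verdict with mixed casing, for illustration only.
verdict = "Score: 4/5\nscore: 2/5\nSCORE: 5/5"
scores = parse_judge_scores(verdict)
print(scores)             # [4, 2, 5]
print(aggregate(scores))  # {'mean': 3.6667, 'pass_rate': 0.6667}
```

Here two of the three scores meet the default threshold of 3, so the pass-rate is 2/3 even though the mean sits above the threshold.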

What to build.

build_judge_prompt(answer, rubric) -> a prompt string containing each numbered rubric item, the answer, and a score: N/5 directive.
parse_judge_scores(text) -> the list of integers from every score: N/5 line, in order, case-insensitive.
aggregate(scores, threshold=3) -> {"mean", "pass_rate"} (both rounded to 4 places); an empty list returns zeros.

Press Run to build a judge prompt, parse a sample verdict, and aggregate.

Your turn

Write build_judge_prompt(answer, rubric) returning a prompt string that contains every numbered rubric item, the answer text, and a directive to reply score: N/5 per item. Write parse_judge_scores(text) returning the list of integers from every score: N/5 line in order (case-insensitive). Write aggregate(scores, threshold=3) returning {'mean', 'pass_rate'} rounded to 4 places, where pass_rate is the fraction of scores >= threshold and an empty list returns zeros.
