Project: Prompt Evaluation CI

LLM-as-Judge: Rubric + Verdict Parse

Exact-match and token-F1 break down the moment answers get open-ended. "Summarize this ticket" has a thousand good answers and no single gold string. The industry workaround is LLM-as-judge: you ask a second model to read the answer against a rubric and return a verdict. This is a common way for teams to grade chatbots, RAG answers, and agent transcripts at scale.

Here is the engineering reality, and the thing a hiring manager probes for: the judge is a language model, so its reply is messy free text. It might say VERDICT: PASS on one run and Final verdict - fail on the next, sometimes with a paragraph of reasoning first. The fragile part of any judge harness is not the prompt; it is the parser that pulls a clean, machine-usable result out of that text. So in this lesson the judge is mocked (no live model, no network) and you build the two deterministic pieces around it: the rubric prompt and the verdict parser.
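To make the mocked judge concrete, here is a minimal sketch of a stand-in that replays canned replies instead of calling a model. The names make_mock_judge and CANNED_REPLIES, and the reply texts themselves, are hypothetical illustrations and are not part of the lesson's test harness.

```python
from itertools import cycle

# Canned replies imitating the format drift a real judge shows (all invented).
CANNED_REPLIES = [
    "VERDICT: PASS\nSCORE: 8",
    "The summary misses the root cause.\nFinal verdict - fail\nscore = 3",
    "I am not sure what to say about this answer.",
]


def make_mock_judge(replies):
    """Return a judge(prompt, answer) callable that replays canned text in order."""
    stream = cycle(replies)

    def judge(prompt: str, answer: str) -> str:
        # Inputs are ignored on purpose: the mock is deterministic and offline.
        return next(stream)

    return judge


judge = make_mock_judge(CANNED_REPLIES)
first = judge("Rubric ...", "some answer")
second = judge("Rubric ...", "some answer")
print(first)
print(second)
```

Because the mock never touches the network, every test run sees the same replies in the same order, which is what makes the parser testable in isolation.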

Build the rubric prompt. build_rubric(criteria) takes a list of criteria strings and returns a single prompt string for the judge. It must:

  • contain the word Rubric so the judge knows the task,
  • list every criterion on its own line, numbered from 1 (1. accuracy, 2. clarity, ...),
  • and end by telling the judge the exact output contract: a line VERDICT: PASS or VERDICT: FAIL and a line SCORE: <0-10>. (You are defining the format your own parser will read.)
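One possible shape for build_rubric, as a sketch only: the header sentence and the "Reply with exactly two final lines" wording are assumptions, and the lesson's tests may accept other phrasings as long as the three requirements above hold.

```python
def build_rubric(criteria):
    """Build the judge prompt: header, numbered criteria, output contract."""
    # The header contains the word "Rubric" so the judge knows the task.
    lines = ["Rubric: grade the answer against each criterion below."]

    # One criterion per line, numbered from 1.
    for number, criterion in enumerate(criteria, start=1):
        lines.append(f"{number}. {criterion}")

    # The output contract goes last; the parser will rely on this format.
    lines.append("Reply with exactly two final lines:")
    lines.append("VERDICT: PASS or VERDICT: FAIL")
    lines.append("SCORE: <0-10>")
    return "\n".join(lines)


prompt = build_rubric(["accuracy", "clarity"])
print(prompt)
```

Building a list of lines and joining once keeps each criterion on its own line without stray trailing newlines.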

Build the parser. parse_verdict(judge_text) scans the judge's raw reply and returns {"verdict": "PASS" or "FAIL", "score": int}. Robustness rules that real judge output forces on you:

  • Be case-insensitive: pass, PASS, and Pass all mean PASS.
  • Take the last verdict and the last score mentioned. Judges often "think out loud" ("this could be a fail, but...") before committing, so the final word is the real answer. re.findall returns every match in order, so indexing the result with [-1] gives the last one.
  • Return a safe default when nothing is found: verdict "FAIL" and score 0. An unparseable judge reply must never silently count as a pass.

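Find verdicts with something like re.findall(r"\b(pass|fail)\b", text, re.I) and scores with re.findall(r"score\s*[:=]?\s*(\d+)", text, re.I), then take the last of each. The re.I flag only affects matching: the captured verdict keeps the judge's original casing, so normalize it with .upper(), and convert the captured score with int(). Only the rubric and the parser are graded; the live judge call is described in prose in this lesson and never runs in the tests. Press Run to parse a few judge replies.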
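Putting the two patterns and the three robustness rules together, a parser could look like the sketch below. It uses the regexes quoted above; the constant names VERDICT_RE and SCORE_RE and the sample replies are assumptions made for illustration.

```python
import re

# Patterns from the lesson text, compiled once; IGNORECASE is the long form of re.I.
VERDICT_RE = re.compile(r"\b(pass|fail)\b", re.IGNORECASE)
SCORE_RE = re.compile(r"score\s*[:=]?\s*(\d+)", re.IGNORECASE)


def parse_verdict(judge_text):
    """Extract the last verdict and last score; default to FAIL / 0."""
    verdicts = VERDICT_RE.findall(judge_text)  # every match, in order
    scores = SCORE_RE.findall(judge_text)

    # findall keeps the original casing, so normalize with upper().
    verdict = verdicts[-1].upper() if verdicts else "FAIL"
    score = int(scores[-1]) if scores else 0
    return {"verdict": verdict, "score": score}


print(parse_verdict("VERDICT: PASS\nSCORE: 8"))
# {'verdict': 'PASS', 'score': 8}
print(parse_verdict("This could be a fail, but the facts hold.\nFinal verdict - Pass\nscore = 7"))
# {'verdict': 'PASS', 'score': 7}
print(parse_verdict("I cannot grade this answer."))
# {'verdict': 'FAIL', 'score': 0}
```

One limitation of this sketch: the verdict pattern matches the bare words pass and fail anywhere in the reply, so a closing remark such as "it did not pass the clarity check" written after the verdict line would override it. Anchoring the pattern on the word verdict is one way to tighten it, at the cost of rejecting replies that omit that word.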

Your turn

Build the deterministic half of an LLM-as-judge harness (the judge is mocked). Write build_rubric(criteria) returning a prompt string that contains Rubric, numbers each criterion from 1 on its own line, and states the VERDICT: PASS/FAIL + SCORE: <0-10> output contract. Write parse_verdict(judge_text) returning {"verdict": "PASS"|"FAIL", "score": int}: case-insensitive, taking the LAST verdict and LAST score found, and defaulting to {"verdict": "FAIL", "score": 0} when none is present.
