LLM-as-Judge: Rubric + Verdict Parse
Exact-match and token-F1 break down the moment answers get open-ended. "Summarize this ticket" has a thousand good answers and no single gold string. The industry workaround is LLM-as-judge: you ask a second model to read the answer against a rubric and return a verdict. That is exactly how teams grade chatbots, RAG answers, and agent transcripts at scale.
Here is the engineering reality, and the thing a hiring manager probes for: the judge is a language model, so its reply is messy free text. It might say VERDICT: PASS on one run and Final verdict - fail on the next, sometimes with a paragraph of reasoning first. The fragile part of any judge harness is not the prompt; it is the parser that pulls a clean, machine-usable result out of that text. So in this lesson the judge is mocked (no live model, no network) and you build the two deterministic pieces around it: the rubric prompt and the verdict parser.
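To make "mocked" concrete, here is a minimal sketch of a stand-in judge. The names mock_judge and CANNED_REPLIES are hypothetical, not part of the lesson's API; the point is only that the "judge" returns fixed text with the same messiness a real model would produce, without any network call.

```python
# Hypothetical canned replies standing in for a live judge model.
CANNED_REPLIES = [
    "VERDICT: PASS\nSCORE: 9",
    "The answer misses the root cause.\nFinal verdict - fail\nscore: 3",
]


def mock_judge(prompt, reply_index=0):
    """Return a fixed reply. The prompt is ignored and nothing is sent anywhere."""
    return CANNED_REPLIES[reply_index % len(CANNED_REPLIES)]


print(mock_judge("any rubric prompt"))
print(mock_judge("any rubric prompt", reply_index=1))
```

Because the replies are fixed, any test built on this stand-in is deterministic, which is what lets the rubric builder and the parser be graded on their own.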
Build the rubric prompt. build_rubric(criteria) takes a list of criteria strings and returns a single prompt string for the judge. It must:
- contain the word Rubric so the judge knows the task,
- list every criterion on its own line, numbered from 1 (1. accuracy, 2. clarity, ...),
- and end by telling the judge the exact output contract: a line VERDICT: PASS or VERDICT: FAIL and a line SCORE: <0-10>. (You are defining the format your own parser will read.)
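The three rubric requirements above can be sketched as follows. This is one possible shape, not the reference solution: the wording of the header and contract sentences is an assumption, and only the word Rubric, the numbered criteria, and the VERDICT / SCORE lines come from the requirements.

```python
def build_rubric(criteria):
    """Return a judge prompt: header, numbered criteria, then the output contract."""
    # Header wording is an assumption; it only needs to contain "Rubric".
    lines = ["Rubric: grade the answer against each criterion below."]
    # Number the criteria from 1, one per line.
    for i, criterion in enumerate(criteria, start=1):
        lines.append(f"{i}. {criterion}")
    # End with the exact reply format the parser will look for.
    lines.append("Reply with a line VERDICT: PASS or VERDICT: FAIL")
    lines.append("and a line SCORE: <0-10>.")
    return "\n".join(lines)


print(build_rubric(["accuracy", "clarity"]))
```

Building the prompt as a list of lines and joining once keeps each criterion on its own line, which is what the numbering requirement depends on.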
Build the parser. parse_verdict(judge_text) scans the judge's raw reply and returns {"verdict": "PASS" or "FAIL", "score": int}. Robustness rules that real judge output forces on you:
- Be case-insensitive: pass, PASS, and Pass all mean PASS.
- Take the last verdict and the last score mentioned. Judges often "think out loud" ("this could be a fail, but...") before committing, so the final word is the real answer. A regex with re.findall gives you every match in order -> take [-1].
- Return a safe default when nothing is found: verdict "FAIL" and score 0. An unparseable judge reply must never silently count as a pass.
Find verdicts with something like re.findall(r"\b(pass|fail)\b", text, re.I) and scores with re.findall(r"score\s*[:=]?\s*(\d+)", text, re.I), then take the last of each. Only the rubric and the parser are graded; the live judge call is described in prose in this lesson and never run in the tests. Press Run to parse a few judge replies.
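Putting the two regexes and the robustness rules together, a parser might look like the sketch below. It uses the patterns quoted above as-is; the sample replies are made up for illustration, and the sketch assumes scores are plain non-negative integers (it does not clamp values above 10).

```python
import re


def parse_verdict(judge_text):
    """Extract the last verdict and last score from a judge's free-text reply."""
    # re.findall returns every match in order of appearance.
    verdicts = re.findall(r"\b(pass|fail)\b", judge_text, re.I)
    scores = re.findall(r"score\s*[:=]?\s*(\d+)", judge_text, re.I)
    # Take the last match of each; fall back to the safe default when none is found.
    verdict = verdicts[-1].upper() if verdicts else "FAIL"
    score = int(scores[-1]) if scores else 0
    return {"verdict": verdict, "score": score}


# Illustrative replies: a clean one, a "thinking out loud" one, an unparseable one.
print(parse_verdict("VERDICT: PASS\nSCORE: 8"))
print(parse_verdict("This could be a fail, but it holds up.\nFinal verdict - pass\nscore = 7"))
print(parse_verdict("I am not sure what to say."))
```

The word boundaries in the verdict pattern keep words such as "passes" or "failed" from counting as verdicts, and the default branch means an empty or off-format reply is recorded as FAIL with score 0 rather than slipping through as a pass.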
Build the deterministic half of an LLM-as-judge harness (the judge is mocked). Write build_rubric(criteria) returning a prompt string that contains Rubric, numbers each criterion from 1 on its own line, and states the VERDICT: PASS/FAIL + SCORE: <0-10> output contract. Write parse_verdict(judge_text) returning {"verdict": "PASS"|"FAIL", "score": int}: case-insensitive, taking the LAST verdict and LAST score found, and defaulting to {"verdict": "FAIL", "score": 0} when none is present.