Project: Prompt Evaluation CI

Gold Set + Scorers

You shipped a new prompt and it feels better. Your team lead asks the only question that matters: better by how much, and how do you know? "Vibes" do not survive code review. The job is to turn a fuzzy claim into a number you can defend, and the foundation of every prompt-eval harness is a gold set (a list of inputs paired with their known-correct answers) plus a scorer (a function that says how close a model's answer is to the gold answer).

This is lesson 1 of a three-part project. By the end of the module you will have a real prompt-evaluation harness with a CI regression gate, the kind of thing a hiring manager sees on a resume as "built a prompt-eval harness + regression gate" and immediately knows you can ship LLM features responsibly. Today you build the measuring stick: three scorers and the runner that applies one across a whole gold set.

One honest note on scope: the harness and regression-gate logic you build here is real and transferable, but the judge model is mocked and "accuracy" here is exact-match. On a real project you would wire these same pieces to a live judge model and richer scorers.

Different tasks need different scorers, so you build three:

  • score_exact(pred, gold) -> 1.0 if the prediction equals the gold answer after a .strip() on both, else 0.0. This is the right call for a classifier ("positive" vs "negative") where only an exact match counts.
  • score_contains(pred, gold) -> 1.0 if the gold string appears anywhere inside the prediction (case-insensitive, both stripped), else 0.0. Good for "did the answer mention the key fact?" checks where extra words are fine.
  • token_f1(pred, gold) -> a partial-credit score in [0.0, 1.0] based on word overlap, the workhorse for free-text answers where a near-miss should beat a total miss.
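The two all-or-nothing scorers described above could be sketched as follows. This is one minimal reading of the spec, not a reference solution: the type hints and docstrings are additions, and the behavior for an empty gold string is an assumption the lesson does not address.

```python
def score_exact(pred: str, gold: str) -> float:
    """Return 1.0 when the two strings are identical after stripping, else 0.0."""
    return 1.0 if pred.strip() == gold.strip() else 0.0


def score_contains(pred: str, gold: str) -> float:
    """Return 1.0 when gold occurs inside pred, ignoring case and outer whitespace."""
    needle = gold.strip().lower()
    haystack = pred.strip().lower()
    # Assumption: an empty gold is a substring of everything, so it scores 1.0 here.
    return 1.0 if needle in haystack else 0.0


print(score_exact(" positive\n", "positive"))            # 1.0
print(score_exact("Positive", "positive"))               # 0.0 (exact match is case-sensitive)
print(score_contains("The capital is Paris.", "paris"))  # 1.0
```

Note that score_exact only strips whitespace, so "Positive" and "positive" do not match, while score_contains lowercases both sides before the substring check.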

How token F1 works. Lowercase and split both strings into word tokens. Count how many tokens they share, respecting duplicates (treat the token lists as multisets, so collections.Counter and its & intersection are your friend). Then:

overlap   = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
precision = overlap / len(pred_tokens)   # of the words I said, how many were right?
recall    = overlap / len(gold_tokens)   # of the words I should have said, how many did I?
f1        = 2 * precision * recall / (precision + recall)

Worked example: pred="the quick brown fox", gold="the lazy brown dog". They share "the" and "brown", so the overlap is 2. Precision is 2/4 and recall is 2/4, so F1 = 2 * 0.5 * 0.5 / (0.5 + 0.5) = 0.5. Mind the edges: if either side has zero tokens, or the overlap is zero, return 0.0 before computing the ratio (never divide by zero).
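Putting the formula and the edge cases together, a token F1 scorer might look like the sketch below. It assumes "word tokens" means whitespace-separated tokens from str.split(), so punctuation stays attached to words; the lesson does not specify a tokenizer, and a real harness may want a stricter one.

```python
from collections import Counter


def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 over lowercased, whitespace-split tokens."""
    # Assumption: plain whitespace tokenization; "fox." and "fox" are different tokens.
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()

    # Guard the edges first so no division by zero can happen below.
    if not pred_tokens or not gold_tokens:
        return 0.0

    # Counter & Counter keeps the minimum count per token (multiset intersection).
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the quick brown fox", "the lazy brown dog"))  # 0.5
print(token_f1("", "the lazy brown dog"))                     # 0.0
```

Because the intersection respects duplicates, a prediction that repeats a correct word does not earn extra overlap; the repeats only lower precision.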

Finally the runner that turns a scorer into an evaluation:

  • run_suite(predictions, golds, scorer) -> score each (pred, gold) pair with the given scorer function and return {"score": mean_score, "per_case": [list of each case's score]}. The mean over an empty suite is 0.0.

Passing the scorer in (rather than hardcoding it) is the design that lets one harness grade exact-match classifiers and free-text answers alike. Press Run to score a tiny gold set three different ways.
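A runner along these lines could be sketched as below. The Scorer type alias and the stand-in exact_match scorer are illustrative names, not part of the lesson, and pairing with zip (which silently stops at the shorter list) is an assumption; a stricter harness might raise when the lengths differ.

```python
from typing import Callable, Dict, List, Sequence, Union

# Hypothetical alias: any function taking (pred, gold) and returning a float score.
Scorer = Callable[[str, str], float]


def run_suite(
    predictions: Sequence[str], golds: Sequence[str], scorer: Scorer
) -> Dict[str, Union[float, List[float]]]:
    """Score each (pred, gold) pair and report the mean plus per-case scores."""
    # Assumption: zip pairs items positionally and ignores any unmatched extras.
    per_case = [scorer(pred, gold) for pred, gold in zip(predictions, golds)]
    mean = sum(per_case) / len(per_case) if per_case else 0.0
    return {"score": mean, "per_case": per_case}


def exact_match(pred: str, gold: str) -> float:
    """Stand-in scorer for the demo: stripped exact match."""
    return 1.0 if pred.strip() == gold.strip() else 0.0


print(run_suite(["positive", "negative "], ["positive", "positive"], exact_match))
# {'score': 0.5, 'per_case': [1.0, 0.0]}
print(run_suite([], [], exact_match))
# {'score': 0.0, 'per_case': []}
```

Swapping exact_match for any other (pred, gold) -> float function changes what is measured without touching the runner.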

Your turn

Build the scoring core of a prompt-eval harness. Write score_exact(pred, gold) (1.0 on a stripped exact match, else 0.0), score_contains(pred, gold) (1.0 if the stripped, lowercased gold is a substring of the stripped, lowercased prediction, else 0.0), and token_f1(pred, gold) (token-overlap F1 over lowercased word tokens, 0.0 when either side is empty or there is no overlap). Then write run_suite(predictions, golds, scorer) returning {"score": mean, "per_case": [scores]} (mean 0.0 on an empty suite).
