Project: Prompt Evaluation CI

The Regression Gate

This is the finale, and the line on your resume: built a prompt-eval harness + regression gate. You have scorers (lesson 1) and a judge parser (lesson 2). Now you wire the piece that makes it CI: a gate that runs on every prompt change and blocks the merge if quality dropped. This is what stops a teammate's "small tweak" from quietly tanking accuracy in production: an automated reviewer that never gets tired.

The model: you have a gold set and two versions of a prompt, each producing a list of predictions over that set. Accuracy here is simply the fraction of predictions that exactly match the gold answer once surrounding whitespace is stripped. First, compare the two:

compare(v1_preds, v2_preds, golds) -> {"v1": acc1, "v2": acc2, "delta": acc2 - acc1}, where each accuracy is the exact-match rate against golds and delta is how much v2 moved relative to v1. A positive delta means the candidate improved; negative means it regressed.
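A minimal sketch of compare, under a few assumptions that the lesson does not spell out: the accuracy helper is a hypothetical name, both the prediction and the gold are stripped before comparison, mismatched lengths raise an error, and an empty gold set scores 0.0.

```python
from typing import Dict, List


def accuracy(preds: List[str], golds: List[str]) -> float:
    """Exact-match rate of preds against golds (hypothetical helper)."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0  # assumption: an empty gold set scores 0.0
    # Assumption: strip both sides so stray whitespace does not count as a miss.
    hits = sum(p.strip() == g.strip() for p, g in zip(preds, golds))
    return hits / len(golds)


def compare(v1_preds: List[str], v2_preds: List[str], golds: List[str]) -> Dict[str, float]:
    """Score both prompt versions and report how far v2 moved relative to v1."""
    acc1 = accuracy(v1_preds, golds)
    acc2 = accuracy(v2_preds, golds)
    return {"v1": acc1, "v2": acc2, "delta": acc2 - acc1}


golds = ["4", "Paris", "blue", "7"]
v1_preds = ["4", "London", "blue", "8"]   # 2 of 4 correct
v2_preds = ["4 ", "Paris", "blue", "8"]   # 3 of 4 correct (trailing space is stripped)

result = compare(v1_preds, v2_preds, golds)
print(result)  # {'v1': 0.5, 'v2': 0.75, 'delta': 0.25}
```

The delta is positive here, so v2 improved on v1 by a quarter of the gold set.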

Then the gate itself. In CI you do not block on noise, so a min_delta threshold sets the smallest delta that still passes. A slightly negative min_delta lets a candidate dip by a tiny, tolerated amount before it is rejected; 0.0, the common choice, means "must not get worse":

gate(baseline, candidate, min_delta): baseline and candidate are each a (preds, golds) pair. Score both as exact-match accuracy, compute delta = candidate_acc - baseline_acc, and return {"pass": bool, "delta": delta, "report": str}. The gate passes when delta >= min_delta and fails otherwise.
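One way the gate could look, as a sketch rather than the reference solution. The _accuracy helper and the Run type alias are hypothetical names, both sides are stripped before comparison, and the report wording is only one possible choice.

```python
from typing import List, Tuple

Run = Tuple[List[str], List[str]]  # hypothetical alias for a (preds, golds) pair


def _accuracy(preds: List[str], golds: List[str]) -> float:
    """Exact-match rate; assumes preds and golds are aligned and equally long."""
    if not golds:
        return 0.0  # assumption: an empty gold set scores 0.0
    return sum(p.strip() == g.strip() for p, g in zip(preds, golds)) / len(golds)


def gate(baseline: Run, candidate: Run, min_delta: float) -> dict:
    """Pass when the candidate's accuracy change is at least min_delta."""
    baseline_acc = _accuracy(*baseline)
    candidate_acc = _accuracy(*candidate)
    delta = candidate_acc - baseline_acc
    passed = delta >= min_delta
    status = "PASS" if passed else "FAIL"
    report = (
        f"{status}: baseline {baseline_acc:.3f}, candidate {candidate_acc:.3f}, "
        f"delta {delta:+.3f}, threshold {min_delta:.3f}"
    )
    return {"pass": passed, "delta": delta, "report": report}


golds = ["a", "b", "c", "d"]
baseline = (["a", "b", "x", "x"], golds)   # accuracy 0.5
candidate = (["a", "b", "c", "x"], golds)  # accuracy 0.75

improved = gate(baseline, candidate, 0.0)
regressed = gate(candidate, baseline, 0.0)  # roles swapped: now a regression
print(improved["pass"], regressed["pass"])  # True False
```

With min_delta set to -0.25 the swapped run would pass as well, because a drop of exactly 0.25 is still within the tolerated dip.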

The report is a human-readable one-liner a developer sees in the CI log. Make it state the outcome and the numbers, for example "PASS: delta +0.100 (baseline 0.800 -> candidate 0.900), threshold 0.000" or a FAIL: ... variant. The tests check that the report starts with PASS or FAIL matching the boolean and mentions both accuracies, so the exact wording is yours but the facts must be in it.
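A report line in the shape of the example above can be produced with Python format specifiers. The helper name format_report and its parameter list are assumptions for illustration; the `+.3f` spec forces a sign on the delta so a regression and an improvement are easy to tell apart in a log.

```python
def format_report(passed: bool, baseline_acc: float, candidate_acc: float,
                  min_delta: float) -> str:
    """Build a one-line CI report (hypothetical helper, wording is a free choice)."""
    status = "PASS" if passed else "FAIL"
    delta = candidate_acc - baseline_acc
    return (
        f"{status}: delta {delta:+.3f} "
        f"(baseline {baseline_acc:.3f} -> candidate {candidate_acc:.3f}), "
        f"threshold {min_delta:.3f}"
    )


pass_line = format_report(True, 0.8, 0.9, 0.0)
fail_line = format_report(False, 0.8, 0.6, 0.0)
print(pass_line)  # PASS: delta +0.100 (baseline 0.800 -> candidate 0.900), threshold 0.000
print(fail_line)  # FAIL: delta -0.200 (baseline 0.800 -> candidate 0.600), threshold 0.000
```

Rounding to three decimals also hides floating-point residue: 0.9 - 0.8 is not exactly 0.1 in binary floating point, but it prints as +0.100.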

Concretely: a baseline at 0.8 and a candidate at 0.6 with min_delta=0.0 gives a delta of -0.2, so the gate fails and the merge is blocked. Swap them and the candidate improves, so the gate passes. That single boolean is what you wire into CI to turn "the prompt seems better" into "the prompt is provably not worse, or the build is red." Press Run to gate an improvement and a regression.
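Wiring the boolean into CI usually comes down to a process exit status, since most CI systems mark a step as failed when its command exits non-zero. The sketch below assumes a gate result dict in the shape described above; ci_exit_code is a hypothetical helper, and the two result dicts are written by hand for illustration.

```python
def ci_exit_code(result: dict) -> int:
    """Print the gate report and map pass/fail to a process exit status."""
    print(result["report"])
    return 0 if result["pass"] else 1  # 0 = green build, 1 = red build


# Hand-written gate results for the two cases in the text (illustrative values).
regression = {
    "pass": False,
    "delta": -0.2,
    "report": "FAIL: delta -0.200 (baseline 0.800 -> candidate 0.600), threshold 0.000",
}
improvement = {
    "pass": True,
    "delta": 0.2,
    "report": "PASS: delta +0.200 (baseline 0.600 -> candidate 0.800), threshold 0.000",
}

regression_code = ci_exit_code(regression)    # 1: the merge is blocked
improvement_code = ci_exit_code(improvement)  # 0: the merge can proceed
```

In a real CI script the return value would be handed to sys.exit, for example sys.exit(ci_exit_code(result)), so the pipeline step fails whenever the gate does.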

Your turn

Build the CI regression gate. Write compare(v1_preds, v2_preds, golds) returning a dict with keys "v1", "v2", and "delta", where v1 and v2 are the exact-match accuracies (fraction of predictions equal to the stripped gold) and delta = v2 - v1. Write gate(baseline, candidate, min_delta) taking two (preds, golds) pairs and returning {"pass": bool, "delta": float, "report": str}, passing iff delta >= min_delta. The report must start with PASS/FAIL (matching the boolean) and mention both accuracies. A regression must fail the gate; an improvement must pass.
