LLM Evals & Testing

A Prompt Regression Gate

You have scorers (exact match, F1, an LLM judge). The last piece turns them into a gate: a yes/no decision your CI runs on every prompt change so a "small tweak" can never quietly make things worse. This is the difference between measuring quality and defending it.

("Regression" here means a change for the worse, the everyday software sense, not the modelling technique from the ML module.)

You compare two runs over the SAME eval set: the baseline (the prompt in production) and the candidate (your change), as two lists of per-example scores in [0, 1]. The gate blocks the change unless it clears three bars:

No overall drop: the candidate's mean score is at least the baseline's mean minus a small tolerance (tol), which absorbs run-to-run noise.
Few new failures: the number of examples where the candidate scored worse than the baseline stays within a budget (one big win must not hide several quiet regressions).
A floor: an absolute minimum mean, if you set one.
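
To see why the second bar matters, here is a small illustration. The scores are made up for this example, not taken from a real eval run. The candidate's mean goes up, so a mean-only check would pass, yet three of the four examples got worse.

```python
from statistics import fmean

# Hypothetical per-example scores in [0, 1]; illustrative only.
baseline = [0.5, 0.5, 0.5, 0.5]
candidate = [1.0, 0.4, 0.4, 0.4]

baseline_mean = fmean(baseline)    # about 0.5
candidate_mean = fmean(candidate)  # about 0.55

# Count examples where the candidate scored worse than the baseline.
# The 1e-9 slack keeps floating-point noise from counting as a regression.
regressions = sum(1 for b, c in zip(baseline, candidate) if c < b - 1e-9)

mean_check_passes = candidate_mean >= baseline_mean
print(mean_check_passes)  # the mean alone looks fine
print(regressions)        # but several examples regressed
```

One large win on the first example outweighs three small losses in the average, which is exactly the case the regression budget is meant to catch.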

Build regression_gate(baseline, candidate, tol=0.01, max_regressions=0, min_mean=None) returning a report dict with baseline_mean, candidate_mean, delta (candidate minus baseline), regressions (count of indices where candidate[i] < baseline[i] - 1e-9), and passed (a bool, True only when ALL the bars are cleared). The two lists are the same length.

Your turn

Write regression_gate(baseline, candidate, tol=0.01, max_regressions=0, min_mean=None) over two equal-length lists of per-example scores. Return {"baseline_mean", "candidate_mean", "delta", "regressions", "passed"} where delta = candidate_mean - baseline_mean, regressions counts indices with candidate[i] < baseline[i] - 1e-9, and passed is True only when candidate_mean >= baseline_mean - tol AND regressions <= max_regressions AND (min_mean is None or candidate_mean >= min_mean).
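
One possible way to write it is sketched below. This is a minimal version that follows the specification above, not the only valid solution. The two ValueError guards (length mismatch, empty input) are assumptions added for safety; the lesson only states that the lists are the same length.

```python
from statistics import fmean


def regression_gate(baseline, candidate, tol=0.01, max_regressions=0, min_mean=None):
    """Compare two equal-length lists of per-example scores; return a report dict."""
    # Assumption: fail loudly on bad input rather than return a misleading report.
    if len(baseline) != len(candidate):
        raise ValueError("baseline and candidate must be the same length")
    if not baseline:
        raise ValueError("score lists must not be empty")

    baseline_mean = fmean(baseline)
    candidate_mean = fmean(candidate)

    # An example regresses when the candidate is worse by more than float noise.
    regressions = sum(1 for b, c in zip(baseline, candidate) if c < b - 1e-9)

    # All three bars must be cleared for the gate to pass.
    passed = (
        candidate_mean >= baseline_mean - tol
        and regressions <= max_regressions
        and (min_mean is None or candidate_mean >= min_mean)
    )

    return {
        "baseline_mean": baseline_mean,
        "candidate_mean": candidate_mean,
        "delta": candidate_mean - baseline_mean,
        "regressions": regressions,
        "passed": passed,
    }


# Hypothetical scores, for illustration only.
clean = regression_gate([0.8, 0.6, 1.0], [0.9, 0.6, 1.0])
hidden = regression_gate([0.5, 0.5, 0.5, 0.5], [1.0, 0.4, 0.4, 0.4])
floored = regression_gate([0.5, 0.5], [0.5, 0.5], min_mean=0.6)

print(clean["passed"])    # improvement with no per-example drops
print(hidden["passed"])   # mean rose, but three examples regressed
print(floored["passed"])  # no drop, but below the absolute floor
```

In a CI job you would typically exit with a non-zero status when passed is False, so the pipeline blocks the prompt change.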
