LLM Evals & Testing

Assertion-Based Evals (exact-match & contains)

You changed the prompt and the output "feels better." That feeling is not a result. The job of an eval is to replace it with a number you can put in a pull request: this prompt scores 0.86 on the gold set, the old one scored 0.74. Before you reach for anything fancy, the workhorse eval is the cheapest one: take a labeled set of inputs with known-correct answers, run your system, and count how many it got right.

The catch is deciding what "right" means. An LLM almost never returns the gold string byte for byte. It adds a period, capitalizes a word, wraps the answer in a sentence. So you pick a match mode:

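  • exact -> the prediction equals the gold answer after you normalize (lowercase, strip surrounding whitespace). Use this for short, closed answers: a label, a yes/no, a city. This normalization does not remove punctuation, so "Paris." still misses the gold paris under exact.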
  • contains -> the gold answer appears somewhere inside the prediction. Use this when the model is allowed to be chatty: "The capital is Paris." should still count as correct for the gold paris.

Normalizing matters. Without it, "Paris" and "paris" score as a miss and your eval lies to you. Here is the shape of one comparison:

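def score_one(pred, gold, mode):
    # Normalize both sides so case and surrounding whitespace never cause a miss.
    p = pred.strip().lower()
    g = gold.strip().lower()
    if mode == "exact":
        return 1.0 if p == g else 0.0
    if mode == "contains":
        # Substring test: the gold answer may appear anywhere in the prediction.
        return 1.0 if g in p else 0.0
    # {mode!r} keeps the message readable even if mode is not a string.
    raise ValueError(f"unknown mode: {mode!r}")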
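
To see how the two modes diverge, here is a small sketch that runs the same comparison over a few hand-written pairs. The pairs are illustrative, not output from a real model. The last one shows a limitation worth knowing: contains is a plain substring test, so the gold paris also matches inside the word "comparison".

```python
def score_one(pred, gold, mode):
    # Same comparison as above, repeated so this sketch runs on its own.
    p = pred.strip().lower()
    g = gold.strip().lower()
    if mode == "exact":
        return 1.0 if p == g else 0.0
    if mode == "contains":
        return 1.0 if g in p else 0.0
    raise ValueError(f"unknown mode: {mode!r}")


# (prediction, gold) pairs; made-up data for illustration only.
cases = [
    ("  Paris ", "paris"),                # differs only by case and whitespace
    ("The capital is Paris.", "paris"),   # gold wrapped in a sentence
    ("A comparison of cities", "paris"),  # substring hit inside "comparison"
]

for pred, gold in cases:
    exact = score_one(pred, gold, "exact")
    contains = score_one(pred, gold, "contains")
    print(f"{pred!r:28} exact={exact} contains={contains}")
```

The first pair passes under both modes, the second only under contains, and the third is a false positive under contains. If that kind of hit matters for your gold set, a word-boundary regex is one possible tightening, at the cost of a slightly more complex scorer.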

Accuracy is then just the mean of those per-example scores. In a real harness the predictions come from a live model call; here you are graded on the scoring math, which is the part that has to be exactly right. Two more checks live in every safety-conscious harness: an allow-list (every output must be one of a fixed set of labels) and a forbidden-pattern scan (no output may contain a banned phrase like "as an AI").
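
The mean-of-scores step can be sketched as follows. This is one possible shape for evaluate, not a reference solution. Two choices here are assumptions the lesson does not settle: the mode is validated before the empty-list check, and lists of unequal length are silently truncated by zip to the shorter one. The sample data is made up.

```python
def score_one(pred, gold, mode):
    # Per-example comparison from the lesson, repeated so the sketch is self-contained.
    p = pred.strip().lower()
    g = gold.strip().lower()
    if mode == "exact":
        return 1.0 if p == g else 0.0
    if mode == "contains":
        return 1.0 if g in p else 0.0
    raise ValueError(f"unknown mode: {mode!r}")


def evaluate(predictions, gold, mode):
    # Assumption: reject a bad mode even when there is nothing to score.
    if mode not in ("exact", "contains"):
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: zip stops at the shorter list if the lengths differ.
    scores = [score_one(p, g, mode) for p, g in zip(predictions, gold)]
    if not scores:
        return 0.0  # empty set: return 0.0 instead of dividing by zero
    return round(sum(scores) / len(scores), 4)


# Illustrative data only, not from a real model run.
predictions = ["Paris", "The capital is Rome.", "berlin "]
gold_answers = ["paris", "rome", "Madrid"]

print(evaluate(predictions, gold_answers, "exact"))     # 1 of 3 correct
print(evaluate(predictions, gold_answers, "contains"))  # 2 of 3 correct
print(evaluate([], [], "exact"))                        # empty set
```

On this data exact scores 0.3333 and contains scores 0.6667, because the chatty second prediction only counts under contains.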

What to build.

  • evaluate(predictions, gold, mode) -> the mean exact-or-contains score over two parallel lists, rounded to 4 places. An empty list returns 0.0, never a crash. Raise ValueError on an unknown mode.
  • label_in_set(outputs, allowed) -> a dict with "passed" and "failed" counts: how many stripped outputs are in the allowed set, and how many are not.
  • has_forbidden(text, patterns) -> True if any regex in patterns is found in text.
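
The two safety checks in the list above might look like the sketch below. It is a minimal version under stated assumptions: outputs are stripped but not lowercased, so the allow-list comparison is case-sensitive, and patterns are passed to re.search unchanged, so case-insensitivity has to be requested inside the pattern, for example with the inline (?i) flag. The labels and phrases are made up for illustration.

```python
import re


def label_in_set(outputs, allowed):
    # Assumption: strip only, no lowercasing, so "Neutral" != "neutral".
    allowed = set(allowed)
    passed = 0
    failed = 0
    for out in outputs:
        if out.strip() in allowed:
            passed += 1
        else:
            failed += 1
    return {"passed": passed, "failed": failed}


def has_forbidden(text, patterns):
    # re.search scans the whole text; any single hit is enough.
    return any(re.search(pattern, text) for pattern in patterns)


# Illustrative labels and banned phrases.
allowed_labels = {"positive", "negative", "neutral"}
outputs = [" positive", "negative\n", "Neutral", "maybe"]
banned = [r"(?i)\bas an ai\b"]  # (?i) makes this pattern case-insensitive

print(label_in_set(outputs, allowed_labels))
print(has_forbidden("Sure! As an AI, I cannot do that.", banned))
print(has_forbidden("The answer is 42.", banned))
```

Here two outputs pass and two fail: "Neutral" misses because of its capital letter, and "maybe" is not an allowed label at all. Whether the allow-list should also lowercase is a design choice; the lesson only asks for stripping.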

Your turn

Write evaluate(predictions, gold, mode), which returns the mean score over two parallel lists. mode is 'exact' (normalized equality) or 'contains' (the normalized gold appears inside the normalized prediction). Normalize with strip().lower(), round the accuracy to 4 places, return 0.0 on an empty set, and raise ValueError on an unknown mode. Also write label_in_set(outputs, allowed), which returns a dict with 'passed' and 'failed' counts after tallying stripped outputs against the allowed set, and has_forbidden(text, patterns), which returns True if any regex pattern is found in text.
