Lesson 127 of 239 · Prompt Engineering for AI Engineers

Evaluating Prompts (Accuracy on a Gold Set)

"The new prompt feels better" is not an answer. To know whether it actually is better, you keep a gold set: inputs paired with correct outputs you trust. You run the prompt over the inputs, compare each prediction to its gold answer, and report a number. That number is how you compare prompt v1 to v2 without fooling yourself.

You will build the scorer. score_prediction(pred, gold) returns 1.0 for a match and 0.0 otherwise, comparing case-insensitively and ignoring surrounding whitespace, because "Yes", "yes", and " yes " are all the same answer in practice.
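
One way the scorer described above could look, as a minimal sketch rather than the lesson's reference solution. It assumes both arguments are strings and uses str.lower() for the case-insensitive comparison; that choice is an assumption, and str.casefold() would be a stricter alternative for non-English text.

```python
def score_prediction(pred: str, gold: str) -> float:
    """Return 1.0 when pred and gold match after normalization, else 0.0."""
    # strip() drops surrounding whitespace; lower() makes the comparison
    # case-insensitive. Assumption: both inputs are already strings.
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


print(score_prediction(" Yes ", "yes"))  # 1.0
print(score_prediction("no", "yes"))     # 0.0
```

Normalizing both sides the same way keeps the comparison symmetric, so it does not matter which argument carries the stray whitespace or capitalization.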

evaluate(predictions, golds) walks the two lists in lockstep, scores each pair, and returns a summary:

evaluate(["yes", "NO", "maybe"], ["yes", "no", "yes"])
# {"accuracy": 0.6667, "correct": 2, "total": 3}

Round accuracy to 4 decimal places. Make the empty set safe: zero golds returns {"accuracy": 0.0, "correct": 0, "total": 0} rather than dividing by zero, because eval sets get filtered down and you do not want the harness to crash on an empty slice.
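
The requirements above can be sketched as follows. This is one possible shape, not the lesson's reference solution. It bundles its own copy of the scorer so it runs on its own, and it assumes the two lists have equal length: the lesson does not say what to do otherwise, and zip() here silently stops at the shorter list.

```python
def score_prediction(pred: str, gold: str) -> float:
    """1.0 if the answers match ignoring case and surrounding whitespace."""
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def evaluate(predictions: list, golds: list) -> dict:
    """Score paired predictions and golds and return a summary dict."""
    total = len(golds)
    if total == 0:
        # Empty slice: return zeros instead of dividing by zero.
        return {"accuracy": 0.0, "correct": 0, "total": 0}
    # Assumption: equal-length lists. zip() truncates to the shorter one,
    # so any missing predictions would simply count as incorrect.
    correct = int(sum(score_prediction(p, g) for p, g in zip(predictions, golds)))
    return {"accuracy": round(correct / total, 4), "correct": correct, "total": total}


print(evaluate(["yes", "NO", "maybe"], ["yes", "no", "yes"]))
# {'accuracy': 0.6667, 'correct': 2, 'total': 3}
print(evaluate([], []))
# {'accuracy': 0.0, 'correct': 0, 'total': 0}
```

Checking for the empty set before dividing keeps the guard in one obvious place, and rounding only at the end avoids accumulating rounding error in the count.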

One of the hidden tests runs your evaluate on 50 randomly generated prediction/gold sets and compares it to a reference computed in the test. A lookup table of the visible cases will not survive that, which is the point: you are building a real scorer, not memorizing answers. Press Run to grade.
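
To make the idea of a randomized check concrete, here is a rough sketch of how such a test could work. The hidden test itself is not shown in the lesson, so everything here is hypothetical: the names reference_evaluate, noisy, and agrees_with_reference, the label set, the list sizes, and the seed are all illustrative assumptions.

```python
import random

LABELS = ["yes", "no", "maybe"]  # hypothetical label set for illustration


def reference_evaluate(predictions, golds):
    """Independent reference: count normalized matches directly."""
    total = len(golds)
    correct = sum(
        1 for p, g in zip(predictions, golds)
        if p.strip().lower() == g.strip().lower()
    )
    accuracy = round(correct / total, 4) if total else 0.0
    return {"accuracy": accuracy, "correct": correct, "total": total}


def noisy(label, rng):
    """Randomly vary case and padding without changing the answer."""
    text = label.upper() if rng.random() < 0.5 else label
    return " " * rng.randint(0, 2) + text + " " * rng.randint(0, 2)


def agrees_with_reference(candidate, trials=50, seed=0):
    """Return True if candidate matches the reference on every random set."""
    rng = random.Random(seed)  # seeded so a failure is reproducible
    for _ in range(trials):
        n = rng.randint(0, 20)  # size 0 exercises the empty-set path
        golds = [rng.choice(LABELS) for _ in range(n)]
        predictions = [noisy(rng.choice(LABELS), rng) for _ in range(n)]
        if candidate(predictions, golds) != reference_evaluate(predictions, golds):
            return False
    return True
```

You would call agrees_with_reference(evaluate) with your own evaluate. Because the inputs are generated fresh each trial, only an implementation that follows the rules in general can pass.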

Your turn

Write score_prediction(pred, gold) returning 1.0 if the two match ignoring case and surrounding whitespace, else 0.0. Write evaluate(predictions, golds) returning a dict with the keys "accuracy", "correct", and "total" computed over the paired lists, with accuracy rounded to 4 decimal places and an empty set returning all zeros (no crash).
