Evaluating Prompts (Accuracy on a Gold Set)
"The new prompt feels better" is not an answer. To know, you keep a gold set: inputs paired with the correct outputs you trust. You run the prompt over the inputs, compare each prediction to its gold answer, and report a number. That number is how you compare prompt v1 to v2 without fooling yourself.
You will build the scorer. score_prediction(pred, gold) returns 1.0 for a match and 0.0 otherwise, comparing case-insensitively and ignoring surrounding whitespace, because "Yes", "yes", and " yes " are all the same answer in practice.
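One way to sketch the matching rule described above, assuming both arguments are plain strings. The use of `strip()` and `lower()` is a choice made for this illustration, not necessarily what the grader's reference does:

```python
def score_prediction(pred: str, gold: str) -> float:
    """Return 1.0 when pred and gold match after normalization, else 0.0."""
    # Assumption: both inputs are strings. strip() removes leading and
    # trailing whitespace; lower() makes the comparison case-insensitive.
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


print(score_prediction(" Yes ", "yes"))  # 1.0
print(score_prediction("maybe", "yes"))  # 0.0
```

Only surrounding whitespace is ignored here, so "y es" and "yes" would still count as different answers.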
evaluate(predictions, golds) walks the two lists in lockstep, scores each pair, and returns a summary:
evaluate(["yes", "NO", "maybe"], ["yes", "no", "yes"])
# {"accuracy": 0.6667, "correct": 2, "total": 3}
Round accuracy to 4 places. Make the empty set safe: zero golds returns {"accuracy": 0.0, "correct": 0, "total": 0} rather than dividing by zero, because eval sets get filtered down and you do not want the harness to crash on an empty slice.
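A minimal sketch of how the summary could be computed, under two assumptions the lesson does not state: the two lists have equal length, and `total` is taken from the number of golds. The scorer is repeated in compact form so the block runs on its own:

```python
def score_prediction(pred, gold):
    # Case-insensitive match that ignores surrounding whitespace.
    return 1.0 if pred.strip().lower() == gold.strip().lower() else 0.0


def evaluate(predictions, golds):
    total = len(golds)
    # Guard the empty slice first so the division below is never reached.
    if total == 0:
        return {"accuracy": 0.0, "correct": 0, "total": 0}
    # Assumption: lists are the same length; zip pairs them in lockstep.
    correct = int(sum(score_prediction(p, g) for p, g in zip(predictions, golds)))
    return {"accuracy": round(correct / total, 4), "correct": correct, "total": total}


print(evaluate(["yes", "NO", "maybe"], ["yes", "no", "yes"]))
# {'accuracy': 0.6667, 'correct': 2, 'total': 3}
print(evaluate([], []))
# {'accuracy': 0.0, 'correct': 0, 'total': 0}
```

Checking for the empty case before dividing keeps the happy path to a single expression and makes the zero result explicit rather than a side effect of exception handling.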
One of the hidden tests runs your evaluate on 50 randomly generated prediction/gold sets and compares it to a reference computed in the test. A lookup table of the visible cases will not survive that, which is the point: you are building a real scorer, not memorizing answers. Press Run to grade.
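The hidden test itself is not shown, so the following is only a guess at its shape: a randomized comparison against an independently written reference. The names `reference_evaluate` and `check_against_reference`, the label pool, and the list sizes are all hypothetical choices for illustration.

```python
import random


def reference_evaluate(predictions, golds):
    # Hypothetical reference: normalize, count matches, guard the empty case.
    correct = sum(
        1 for p, g in zip(predictions, golds)
        if p.strip().lower() == g.strip().lower()
    )
    total = len(golds)
    accuracy = round(correct / total, 4) if total else 0.0
    return {"accuracy": accuracy, "correct": correct, "total": total}


def check_against_reference(candidate, trials=50, seed=0):
    """Return True if candidate agrees with the reference on every random trial."""
    rng = random.Random(seed)  # seeded so a failure can be reproduced
    # Assumed label pool, including case and whitespace variants.
    labels = ["yes", "no", "maybe", " Yes ", "NO"]
    for _ in range(trials):
        n = rng.randint(0, 20)  # includes 0 to exercise the empty set
        predictions = [rng.choice(labels) for _ in range(n)]
        golds = [rng.choice(labels) for _ in range(n)]
        if candidate(predictions, golds) != reference_evaluate(predictions, golds):
            return False
    return True
```

Calling `check_against_reference(evaluate)` on your own `evaluate` before pressing Run gives a rough local signal; a function that returns a fixed dictionary for the visible example will disagree with the reference as soon as a random list has a different length or match count.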
Write score_prediction(pred, gold) returning 1.0 if the two match ignoring case and surrounding whitespace, else 0.0. Write evaluate(predictions, golds) returning {"accuracy", "correct", "total"} over the paired lists, with accuracy rounded to 4 places and an empty set returning all zeros (no crash).