Project: Customer Support Copilot

Evaluate the Assistant (P@k + Answer Accuracy + Gate)

This is the lesson that earns the second half of your resume line: "and evaluated." Anyone can wire up a RAG demo. What separates a junior engineer from someone you would hire to own a production assistant is the ability to measure it and to stop a bad change from shipping. In this lesson you will build the evaluation harness and the regression gate.

You evaluate against a gold set: a small labeled list where each case has a query, the ids of the relevant articles, and the expected answer text. You will compute two families of metrics:

Retrieval quality -- precision_at_k: of the top-k ids your retriever returned, what fraction are relevant. This tells you whether the bot is even looking at the right articles.
Answer quality -- accuracy of the produced answer against the expected text, in two modes: contains (the expected snippet appears in the answer, case-insensitive) and exact (a trimmed exact match). contains is the forgiving, realistic one for free-text answers.

Build these:

precision_at_k(retrieved, relevant, k)
answer_accuracy(predicted, expected, mode)
evaluate(config)            # -> a metrics dict over the whole gold set
regression_gate(old, new)   # -> True if safe to ship, False if it regressed

precision_at_k: the fraction of the first k retrieved ids that are in the relevant set; 0.0 for an empty retrieval.
answer_accuracy: 1.0 or 0.0 for a single prediction, honoring the mode.
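
The two per-case metrics described above could be sketched roughly as follows. This is one possible reading, not the lesson's reference solution. It assumes the denominator is the number of ids actually inspected (so a retriever that returns fewer than k ids is not penalized for the shortfall), that exact mode is case-sensitive after trimming, and that an unknown mode should raise an error. None of these three choices is specified in the lesson text.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the first k retrieved ids that appear in the relevant set."""
    top_k = list(retrieved)[:max(k, 0)]
    if not top_k:
        return 0.0  # empty retrieval (or k <= 0) scores zero
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    # Assumption: divide by the number of ids inspected, not by k itself.
    return hits / len(top_k)


def answer_accuracy(predicted, expected, mode="contains"):
    """Return 1.0 if the prediction matches the expected text under `mode`, else 0.0."""
    if mode == "exact":
        # Assumption: trimmed but case-sensitive comparison.
        return 1.0 if predicted.strip() == expected.strip() else 0.0
    if mode == "contains":
        # Case-insensitive substring check, as the lesson describes.
        return 1.0 if expected.strip().lower() in predicted.lower() else 0.0
    raise ValueError(f"unknown mode: {mode!r}")  # assumption: fail loudly


# Hypothetical ids and answers, for illustration only.
p = precision_at_k(["a1", "a7", "a3"], {"a1", "a3"}, 2)       # 0.5
c = answer_accuracy("Go to Settings > Security.", "settings")  # 1.0
e = answer_accuracy("  Yes ", "Yes", mode="exact")             # 1.0
```

If your grader expects the classic definition that always divides by k, change the last line of precision_at_k to `hits / k`.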

evaluate(config) ties it together. The config dict carries the gold list, a retriever function (query -> ranked ids), an answerer function (query -> text), and the settings k and mode. Run every gold case and return a dict with at least n (cases scored), precision_at_k (mean over cases, rounded to 4 places), and answer_accuracy (mean, rounded to 4). An empty gold set must not crash -- return zeros.
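
A minimal sketch of evaluate(config) under stated assumptions is shown here. The lesson does not name the dict keys, so "gold", "retriever", "answerer", "k", "mode", and the per-case keys "query", "relevant_ids", and "expected" are hypothetical, as are the defaults k=3 and mode="contains". Compact versions of the two per-case metrics are included only so the block runs on its own.

```python
def precision_at_k(retrieved, relevant, k):
    top_k = list(retrieved)[:max(k, 0)]
    if not top_k:
        return 0.0
    relevant_set = set(relevant)
    return sum(1 for i in top_k if i in relevant_set) / len(top_k)


def answer_accuracy(predicted, expected, mode="contains"):
    if mode == "exact":
        return float(predicted.strip() == expected.strip())
    return float(expected.strip().lower() in predicted.lower())


def evaluate(config):
    """Run every gold case and return mean metrics; key names are assumed."""
    gold = config.get("gold", [])
    k = config.get("k", 3)                 # assumed default
    mode = config.get("mode", "contains")  # assumed default
    if not gold:
        # Empty gold set: return zeros instead of dividing by zero.
        return {"n": 0, "precision_at_k": 0.0, "answer_accuracy": 0.0}
    p_total = 0.0
    a_total = 0.0
    for case in gold:
        retrieved = config["retriever"](case["query"])  # query -> ranked ids
        answer = config["answerer"](case["query"])      # query -> text
        p_total += precision_at_k(retrieved, case["relevant_ids"], k)
        a_total += answer_accuracy(answer, case["expected"], mode)
    n = len(gold)
    return {
        "n": n,
        "precision_at_k": round(p_total / n, 4),
        "answer_accuracy": round(a_total / n, 4),
    }


# Toy gold set and stub functions, invented for illustration.
gold = [
    {"query": "reset password", "relevant_ids": ["kb-1"], "expected": "Settings"},
    {"query": "refund policy", "relevant_ids": ["kb-2", "kb-3"], "expected": "30 days"},
]
config = {
    "gold": gold,
    "retriever": lambda q: ["kb-1", "kb-9"] if "password" in q else ["kb-2", "kb-3"],
    "answerer": lambda q: "Open Settings to reset it." if "password" in q
    else "Refunds take a while.",
    "k": 2,
    "mode": "contains",
}
metrics = evaluate(config)
# metrics == {"n": 2, "precision_at_k": 0.75, "answer_accuracy": 0.5}
```

In the toy run, the first case retrieves one relevant id out of two (0.5) and the second retrieves two out of two (1.0), giving a mean of 0.75. Only the first answer contains its expected snippet, so answer accuracy is 0.5.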

regression_gate(old, new) is the CI guard you would wire into a pull request: return True only if the new run did not get worse on the gated metric (answer_accuracy by default) -- equal or better is fine, a drop returns False. This is the thing that, in real life, blocks a prompt tweak from quietly tanking quality. Press Run to score a perfect bot and a broken one and watch the gate stop the regression.
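
The gate itself can be very small. The sketch below assumes old and new are metrics dicts shaped like the one evaluate returns. The optional `metric` keyword is an illustrative addition that the lesson's two-argument signature does not show, and treating a missing metric as 0.0 is likewise an assumption.

```python
def regression_gate(old, new, metric="answer_accuracy"):
    """Return True if `new` is at least as good as `old` on the gated metric."""
    # Assumption: a metric missing from either run is treated as 0.0.
    old_score = old.get(metric, 0.0)
    new_score = new.get(metric, 0.0)
    return new_score >= old_score  # equal or better passes; any drop fails


# Hypothetical metrics from a baseline run and two candidate runs.
baseline = {"n": 20, "precision_at_k": 0.80, "answer_accuracy": 0.90}
same = {"n": 20, "precision_at_k": 0.75, "answer_accuracy": 0.90}
worse = {"n": 20, "precision_at_k": 0.85, "answer_accuracy": 0.60}

ok_same = regression_gate(baseline, same)    # True: accuracy unchanged
ok_worse = regression_gate(baseline, worse)  # False: accuracy dropped
ok_p = regression_gate(baseline, same, metric="precision_at_k")  # False
```

In a CI job you would typically run the gate after evaluating both the baseline and the candidate, and exit with a non-zero status when it returns False so the pull request check fails.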

Your turn

Build the evaluation harness over a labeled gold set. Implement precision_at_k(retrieved, relevant, k) for retrieval and answer_accuracy(predicted, expected, mode) for answers, where mode is contains or exact. evaluate(config) runs the gold set using the config's retriever and answerer functions, k, and mode, and returns a metrics dict of means (n, precision_at_k, answer_accuracy); it must be safe on an empty gold set. regression_gate(old, new) returns True only if answer accuracy did not drop.
