Evaluate the Assistant (P@k + Answer Accuracy + Gate)
This is the lesson that earns the second half of your resume line: "and evaluated." Anyone can wire up a RAG demo. What separates a junior from someone you would hire to own a production assistant is the ability to measure it and to stop a bad change from shipping. You will build the evaluation harness and the regression gate.
You evaluate against a gold set: a small labeled list where each case has a query, the ids of the relevant articles, and the expected answer text. Two families of metric:
- Retrieval quality -- precision_at_k: of the top-k ids your retriever returned, what fraction are relevant. This tells you whether the bot is even looking at the right articles.
- Answer quality -- accuracy of the produced answer against the expected text, in two modes: contains (the expected snippet appears in the answer, case-insensitive) and exact (a trimmed exact match). contains is the forgiving, realistic one for free-text answers.
Build these:
precision_at_k(retrieved, relevant, k)
answer_accuracy(predicted, expected, mode)
evaluate(config) # -> a metrics dict over the whole gold set
regression_gate(old, new) # -> True if safe to ship, False if it regressed
precision_at_k: fraction of the first k retrieved ids that are in the relevant set; 0.0 for an empty retrieval.
answer_accuracy: 1.0 or 0.0 for one prediction, honoring the mode.
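One possible sketch of the two per-case metrics follows. It rests on a few assumptions the lesson does not spell out: the denominator is the number of ids actually inspected (at most k), a non-positive k counts as an empty retrieval, exact stays case-sensitive after trimming, and an unknown mode raises an error. The grader may expect different choices.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the first k retrieved ids that are in the relevant set."""
    # Assumption: k <= 0 is treated like an empty retrieval.
    top_k = list(retrieved)[:k] if k > 0 else []
    if not top_k:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    # Assumption: divide by the ids actually inspected, not by k itself.
    return hits / len(top_k)


def answer_accuracy(predicted, expected, mode):
    """Return 1.0 or 0.0 for a single prediction, honoring the mode."""
    if mode == "contains":
        # Case-insensitive substring check on the trimmed expected snippet.
        return 1.0 if expected.strip().lower() in predicted.lower() else 0.0
    if mode == "exact":
        # Assumption: exact is case-sensitive once both sides are trimmed.
        return 1.0 if predicted.strip() == expected.strip() else 0.0
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, precision_at_k(["a", "b", "c"], ["a", "c"], 2) inspects "a" and "b", finds one relevant id, and returns 0.5.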
evaluate(config) ties it together. The config dict carries the gold list, a retriever function (query -> ranked ids), an answerer function (query -> text), and the settings k and mode. Run every gold case and return a dict with at least n (cases scored), precision_at_k (mean over cases, rounded to 4 places), and answer_accuracy (mean, rounded to 4). An empty gold set must not crash -- return zeros.
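The loop described above could be sketched as follows. The key names are assumptions, since the lesson names the pieces but not their keys: the config is read through "gold", "retriever", "answerer", "k", and "mode", and each gold case through "query", "relevant", and "expected". Compact versions of the two metrics are restated so the block runs on its own, and the lookup tables RANKED and ANSWERS are hypothetical stand-ins for a real retriever and answerer.

```python
def precision_at_k(retrieved, relevant, k):
    top_k = list(retrieved)[:k] if k > 0 else []
    if not top_k:
        return 0.0
    relevant_set = set(relevant)
    return sum(1 for doc_id in top_k if doc_id in relevant_set) / len(top_k)


def answer_accuracy(predicted, expected, mode):
    if mode == "contains":
        return 1.0 if expected.strip().lower() in predicted.lower() else 0.0
    if mode == "exact":
        return 1.0 if predicted.strip() == expected.strip() else 0.0
    raise ValueError(f"unknown mode: {mode!r}")


def evaluate(config):
    """Score every gold case and return the mean metrics."""
    gold = config["gold"]
    if not gold:
        # Empty-safe: no cases means zeros, not a ZeroDivisionError.
        return {"n": 0, "precision_at_k": 0.0, "answer_accuracy": 0.0}

    precision_total = 0.0
    accuracy_total = 0.0
    for case in gold:
        ranked_ids = config["retriever"](case["query"])
        answer = config["answerer"](case["query"])
        precision_total += precision_at_k(ranked_ids, case["relevant"], config["k"])
        accuracy_total += answer_accuracy(answer, case["expected"], config["mode"])

    n = len(gold)
    return {
        "n": n,
        "precision_at_k": round(precision_total / n, 4),
        "answer_accuracy": round(accuracy_total / n, 4),
    }


# Hypothetical two-case gold set with canned retriever and answerer outputs.
GOLD = [
    {"query": "reset password", "relevant": ["kb-1"], "expected": "Settings"},
    {"query": "refund policy", "relevant": ["kb-2", "kb-3"], "expected": "30 days"},
]
RANKED = {"reset password": ["kb-1", "kb-9"], "refund policy": ["kb-2", "kb-3"]}
ANSWERS = {
    "reset password": "Open Settings and choose Reset.",
    "refund policy": "Refunds are accepted within 30 days.",
}

metrics = evaluate({
    "gold": GOLD,
    "retriever": lambda query: RANKED[query],
    "answerer": lambda query: ANSWERS[query],
    "k": 2,
    "mode": "contains",
})
print(metrics)
```

Here the first case scores 0.5 on retrieval (one of its two inspected ids is relevant) and the second scores 1.0, so the mean precision_at_k is 0.75, while both answers contain their expected snippet.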
regression_gate(old, new) is the CI guard you would wire into a pull request: return True only if the new run did not get worse on the gated metric (answer_accuracy by default) -- equal or better is fine, a drop returns False. This is the thing that, in real life, blocks a prompt tweak from quietly tanking quality. Press Run to score a perfect bot and a broken one and watch the gate stop the regression.
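A minimal sketch of the gate, assuming old and new are metrics dicts shaped like the ones evaluate returns. The optional metric parameter is an assumption added here to express "answer_accuracy by default"; the lesson only fixes the two-argument call. The sample numbers are made up for illustration.

```python
def regression_gate(old, new, metric="answer_accuracy"):
    """Return True if the new run is safe to ship on the gated metric."""
    # Equal or better passes; any drop fails.
    return new[metric] >= old[metric]


# Hypothetical metrics from a baseline run and two candidate changes.
baseline = {"n": 2, "precision_at_k": 1.0, "answer_accuracy": 1.0}
same_quality = {"n": 2, "precision_at_k": 1.0, "answer_accuracy": 1.0}
broken = {"n": 2, "precision_at_k": 1.0, "answer_accuracy": 0.5}

print(regression_gate(baseline, same_quality))  # equal is fine
print(regression_gate(baseline, broken))        # accuracy dropped
```

In a CI job, a False result would typically be turned into a non-zero exit code so the pull request check fails. Note that this sketch compares raw values with no tolerance, so even a tiny drop on the gated metric blocks the change.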
Build the evaluation harness over a labeled gold set: precision_at_k(retrieved, relevant, k) for retrieval and answer_accuracy(predicted, expected, mode) (contains / exact) for answers. evaluate(config) runs the gold set (with retriever and answerer functions, k, mode) and returns an empty-safe metrics dict of means (n, precision_at_k, answer_accuracy). regression_gate(old, new) returns True only if answer accuracy did not drop.