Evaluating RAG & Retrieval

Capstone: A Retrieval Eval Report

You have the four metrics. Now wire them into the thing an engineer actually ships: a retrieval eval report that runs a labeled query set through a retriever and hands back a scorecard. This is the artifact behind the resume line "built and evaluated a RAG pipeline," and it is what lets you say "the new retriever lifted MRR from 0.62 to 1.0" instead of "it feels better."

The harness has three moving parts. An eval set is a list of cases, each a dict like {"query": "refund policy", "relevant": {"d_refund"}}. A retriever is any callable retriever(query, k) that returns a ranked list of doc IDs (you pass it in, so the same report can grade a BM25 baseline, a dense retriever, or a hybrid). And the report scores every query on all four metrics, then aggregates.
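To make the first two parts concrete, here is a minimal sketch of an eval set and a retriever that fits the retriever(query, k) contract. Everything in it is hypothetical: the doc IDs, the four-document corpus, and the keyword_retriever function are invented for illustration and stand in for whatever BM25, dense, or hybrid retriever you actually pass to the report.

```python
# Hypothetical eval set: each case pairs a query with the set of relevant doc IDs.
EVAL_SET = [
    {"query": "refund policy", "relevant": {"d_refund"}},
    {"query": "shipping times", "relevant": {"d_ship", "d_ship_intl"}},
    {"query": "gift cards", "relevant": {"d_gift"}},
]

# Hypothetical toy corpus: doc ID -> text. Note that no document covers gift cards.
CORPUS = {
    "d_refund": "our refund policy allows a full refund within 30 days",
    "d_ship": "standard shipping times are three to five days",
    "d_ship_intl": "international shipping times vary by country",
    "d_privacy": "our privacy policy explains how data is used",
}


def keyword_retriever(query, k):
    """Rank docs by how often the query terms appear; ties break by doc ID."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in CORPUS.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:  # drop docs that share no term with the query
            scored.append((score, doc_id))
    # Highest score first, then alphabetical doc ID so the order is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


for case in EVAL_SET:
    print(case["query"], "->", keyword_retriever(case["query"], 2))
```

Because the retriever is just a callable, swapping in a different one means passing a different function; the eval set and the report stay untouched.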

def evaluate_retrieval(eval_set, retriever, k):
    rows = []
    failures = []
    for case in eval_set:
        retrieved = retriever(case["query"], k)
        # score this query: P@k, R@k, RR, NDCG@k
        # if the reciprocal rank is 0.0, the retriever missed -> record a failure
        ...
    # roll rows into a pandas DataFrame, then return the means + the frame + failures
    ...

For NDCG over retrieved IDs, turn the ranking into graded relevance the simple way: a doc scores 1 if it is in the relevant set, else 0. Reuse the functions you built: precision_at_k, recall_at_k, reciprocal_rank, dcg, and ndcg_at_k.
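The binary-relevance step can be sketched as follows. This is an illustration under assumptions, not the lesson's reference code: the signatures of dcg and ndcg_at_k below are guesses at what the earlier lessons defined, binary_relevance is a hypothetical helper, and the ideal ranking is taken to be "every relevant doc first, up to k", which is one common convention among several.

```python
import math


def binary_relevance(retrieved, relevant):
    """Map a ranked list of doc IDs to gains: 1 if relevant, else 0."""
    return [1 if doc_id in relevant else 0 for doc_id in retrieved]


def dcg(relevances):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(retrieved, relevant, k):
    """NDCG@k over doc IDs (assumed signature)."""
    gains = binary_relevance(retrieved[:k], relevant)
    # Assumption: the ideal ranking puts all relevant docs first, capped at k.
    ideal_dcg = dcg([1] * min(len(relevant), k))
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


print(ndcg_at_k(["d_refund", "d_privacy"], {"d_refund"}, 2))  # relevant doc at rank 1
print(ndcg_at_k(["d_privacy", "d_refund"], {"d_refund"}, 2))  # relevant doc at rank 2
```

If your earlier ndcg_at_k builds the ideal ranking by sorting the retrieved gains instead, keep that definition so the capstone numbers agree with the ones you computed before.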

What to build. evaluate_retrieval(eval_set, retriever, k) returning a dict with:

  • mean_precision_at_k, mean_recall_at_k, mrr, mean_ndcg_at_k: the mean of each metric across the queries, each rounded to 4 places.
  • per_query: a pandas.DataFrame, one row per query, carrying that query's metrics (so you can eyeball which query tanked).
  • failures: a list of the queries whose reciprocal rank was 0.0 (the retriever surfaced nothing relevant). This is your debugging worklist.
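The shape of those outputs can be sketched on hand-written rankings before any real retriever is involved. The sketch below is an assumption-laden illustration: the metric signatures, the row keys, the RANKINGS table, and the doc IDs are all hypothetical, precision divides by k, and NDCG is left out to keep it short. The rows are plain dicts of the kind pandas.DataFrame accepts as a list of records.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant docs (divides by k)."""
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k if k > 0 else 0.0


def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant docs that show up in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical cases and hand-written rankings standing in for retriever output.
CASES = [
    {"query": "refund policy", "relevant": {"d_refund"}},
    {"query": "shipping times", "relevant": {"d_ship", "d_ship_intl"}},
    {"query": "gift cards", "relevant": {"d_gift"}},
]
RANKINGS = {
    "refund policy": ["d_refund", "d_privacy"],
    "shipping times": ["d_faq", "d_ship"],
    "gift cards": ["d_faq", "d_privacy"],
}
K = 2

rows = []
for case in CASES:
    retrieved = RANKINGS[case["query"]]
    rows.append({
        "query": case["query"],
        "precision_at_k": precision_at_k(retrieved, case["relevant"], K),
        "recall_at_k": recall_at_k(retrieved, case["relevant"], K),
        "rr": reciprocal_rank(retrieved, case["relevant"]),
    })

# A reciprocal rank of 0.0 means nothing relevant surfaced: that query is a failure.
failures = [row["query"] for row in rows if row["rr"] == 0.0]
mean_precision = round(sum(r["precision_at_k"] for r in rows) / len(rows), 4)
mean_recall = round(sum(r["recall_at_k"] for r in rows) / len(rows), 4)
mrr = round(sum(r["rr"] for r in rows) / len(rows), 4)

print(mean_precision, mean_recall, mrr, failures)
```

Keeping the failure rule tied to reciprocal rank, rather than to precision or recall, means a query lands on the worklist only when the retriever returned nothing relevant at all.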

The grader checks the aggregate numbers against hand-computed values, confirms the failure list names the uncovered query, and proves that a better retriever scores strictly higher, a relational check that no hard-coded constant can fake. In a full system a model would now answer from the chunks the winning retriever returned, but that generation step is the other half of the pipeline; here you are grading retrieval on its own. Press Run to print a scorecard.
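A relational check of that kind might look like the sketch below. It is not the grader's actual code: the two lookup-table retrievers, the doc IDs, and the helper names make_retriever and mean_reciprocal_rank are hypothetical, and only MRR is compared to keep the example short.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant doc, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def make_retriever(table):
    """Wrap a query -> ranked-IDs table in the retriever(query, k) contract."""
    def retriever(query, k):
        return table.get(query, [])[:k]
    return retriever


def mean_reciprocal_rank(eval_set, retriever, k):
    """Average reciprocal rank over the eval set, rounded to 4 places."""
    total = sum(reciprocal_rank(retriever(case["query"], k), case["relevant"])
                for case in eval_set)
    return round(total / len(eval_set), 4)


# Hypothetical eval set and two hypothetical retrievers of differing quality.
EVAL_SET = [
    {"query": "refund policy", "relevant": {"d_refund"}},
    {"query": "shipping times", "relevant": {"d_ship", "d_ship_intl"}},
    {"query": "gift cards", "relevant": {"d_gift"}},
]
baseline = make_retriever({
    "refund policy": ["d_privacy", "d_refund"],  # relevant doc at rank 2
    "shipping times": ["d_faq", "d_ship"],       # relevant doc at rank 2
    "gift cards": ["d_faq"],                     # nothing relevant
})
improved = make_retriever({
    "refund policy": ["d_refund"],
    "shipping times": ["d_ship", "d_ship_intl"],
    "gift cards": ["d_gift"],
})

baseline_mrr = mean_reciprocal_rank(EVAL_SET, baseline, 2)
improved_mrr = mean_reciprocal_rank(EVAL_SET, improved, 2)
print(baseline_mrr, improved_mrr)

# The relational check: the score must move with retriever quality.
assert improved_mrr > baseline_mrr
```

A function that returned a memorized number would pass a single hand-computed check but would report the same score for both retrievers, so the strict inequality catches it.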

Your turn

Write evaluate_retrieval(eval_set, retriever, k). The eval_set is a list of {"query", "relevant"} dicts; retriever(query, k) returns a ranked list of doc IDs. For each query, score precision_at_k, recall_at_k, reciprocal_rank, and ndcg_at_k (treat a retrieved doc as relevance 1 if it is in the relevant set, else 0). Return a dict with mean_precision_at_k, mean_recall_at_k, mrr, mean_ndcg_at_k (each rounded to 4 places), a per_query pandas DataFrame (one row per query), and a failures list of every query whose reciprocal rank was 0.0.
