Syllabus Lesson 157 of 239 · Evaluating RAG & Retrieval
Evaluating RAG & Retrieval

Precision@k and Recall@k

Before you tune a single prompt, answer a harder question: is the retriever even finding the right documents? A RAG system has two halves, retrieval and generation, and they fail in different ways. If the wrong chunks come back, no amount of prompt polishing saves you. So we measure retrieval on its own, against a labeled set of relevant documents per query, with two numbers every search engineer knows by heart.

Precision@k asks: of the top k documents I returned, what fraction were actually relevant? It is the "how much junk did I show the user" metric.

precision@k = (relevant docs in the top k) / (number of docs in the top k)

Recall@k asks the opposite: of all the documents that were relevant, what fraction did I manage to surface in the top k? It is the "how much did I miss" metric.

recall@k = (relevant docs in the top k) / (total number of relevant docs)

Work a concrete example. You retrieve ["a", "b", "c"] and the relevant set is {"a", "c"}. At k=3, two of the three returned docs are relevant, so precision is 2/3. Both relevant docs were found, so recall is 1.0. The two move in tension: widen k and recall climbs while precision usually drops.

What to build. Two pure functions:

  • precision_at_k(retrieved, relevant, k) -> the fraction of the top k retrieved IDs that are in relevant. Divide by how many docs you actually have in the top k (so a short list or a large k never divides by the wrong number).
  • recall_at_k(retrieved, relevant, k) -> the fraction of relevant captured in the top k, dividing by len(relevant).

Watch the edges the tests probe: k larger than the retrieved list, and an empty relevant set (return 0.0, never divide by zero). Treat relevant as a set for fast membership. In a real harness you would feed these the output of your vector store; here they are graded directly on labeled lists.

One note on scope: this module covers Precision@k, Recall@k, MRR, and NDCG. Search teams often also report MAP (mean average precision), which averages the precision measured at each relevant hit across a query set; the four metrics here are the most common starting point.

Press Run to score a ranking.

Your turn

Write precision_at_k(retrieved, relevant, k) (the fraction of the top k retrieved IDs that are in the relevant set, divided by the number of docs actually in the top k) and recall_at_k(retrieved, relevant, k) (the fraction of the relevant set captured in the top k, divided by len(relevant)). Both must return 0.0 on the empty edges (empty retrieval, empty relevant set) instead of crashing.

Spotted a problem in this lesson? Report it

Code · runs in your browser
Output