Syllabus Lesson 145 of 239 · Embeddings & Semantic Search from Scratch
Embeddings & Semantic Search from Scratch

BM25: Sparse Lexical Ranking

TF-IDF is a great baseline, but the ranking function that actually powers search engines and the lexical half of modern hybrid retrieval is BM25. It keeps the TF-IDF instinct (rare terms matter, repeated terms matter) but fixes two of its flaws with tunable knobs.

For a query term in a document, BM25 contributes:

idf(term) * ( tf * (k1 + 1) ) / ( tf + k1 * (1 - b + b * dl / avgdl) )

where tf is how many times the term appears in the document, dl is the document length, and avgdl is the average document length. The document score is the sum over all query terms. Three pieces do the work:

  • IDF with the BM25 smoothing: idf = log(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents and n is how many documents contain the term. The + 1 keeps it from going negative.
  • Term-frequency saturation via k1 (default 1.5): the 10th occurrence of a word adds far less than the 2nd. Raw TF-IDF grows linearly forever; BM25 flattens out.
  • Length normalization via b (default 0.75): a long document is expected to contain a term more often just by being long, so its score is discounted. Two documents with the same tf rank the shorter one higher.

A term that appears in no document contributes nothing, and a document missing every query term scores exactly 0.0. That zero is load-bearing: it is how you know a document is irrelevant.

You will work over tokenized documents (lists of words) so the math is the focus, not tokenization. Use math.log (natural log); the base only scales all scores by a constant and does not change the ranking.

Build bm25_scores(query, corpus_tokens, k1=1.5, b=0.75) where query is a list of query terms and corpus_tokens is a list of tokenized documents. Return a list of BM25 scores, one per document, in corpus order.

Your turn

Write bm25_scores(query, corpus_tokens, k1=1.5, b=0.75) implementing the BM25 ranking function. query is a list of terms; corpus_tokens is a list of tokenized documents (each a list of words). For each document, sum over the unique query terms: idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl)), with idf = log(1 + (N - n + 0.5) / (n + 0.5)). Return a list of scores in corpus order. A document with more occurrences of a query term scores higher; a document with none scores 0.0; with equal term frequency the longer document scores lower; two identical documents score identically.

Spotted a problem in this lesson? Report it

Code · runs in your browser
Output