BM25: Sparse Lexical Ranking
TF-IDF is a great baseline, but the ranking function that actually powers search engines and the lexical half of modern hybrid retrieval is BM25. It keeps the TF-IDF instinct (rare terms matter, repeated terms matter) but fixes two of its flaws with tunable knobs.
For a query term in a document, BM25 contributes:
idf(term) * ( tf * (k1 + 1) ) / ( tf + k1 * (1 - b + b * dl / avgdl) )where tf is how many times the term appears in the document, dl is the document length, and avgdl is the average document length. The document score is the sum over all query terms. Three pieces do the work:
- IDF with the BM25 smoothing:
idf = log(1 + (N - n + 0.5) / (n + 0.5)), whereNis the number of documents andnis how many documents contain the term. The+ 1keeps it from going negative. - Term-frequency saturation via
k1(default 1.5): the 10th occurrence of a word adds far less than the 2nd. Raw TF-IDF grows linearly forever; BM25 flattens out. - Length normalization via
b(default 0.75): a long document is expected to contain a term more often just by being long, so its score is discounted. Two documents with the sametfrank the shorter one higher.
A term that appears in no document contributes nothing, and a document missing every query term scores exactly 0.0. That zero is load-bearing: it is how you know a document is irrelevant.
You will work over tokenized documents (lists of words) so the math is the focus, not tokenization. Use math.log (natural log); the base only scales all scores by a constant and does not change the ranking.
Build bm25_scores(query, corpus_tokens, k1=1.5, b=0.75) where query is a list of query terms and corpus_tokens is a list of tokenized documents. Return a list of BM25 scores, one per document, in corpus order.
Write bm25_scores(query, corpus_tokens, k1=1.5, b=0.75) implementing the BM25 ranking function. query is a list of terms; corpus_tokens is a list of tokenized documents (each a list of words). For each document, sum over the unique query terms: idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl)), with idf = log(1 + (N - n + 0.5) / (n + 0.5)). Return a list of scores in corpus order. A document with more occurrences of a query term scores higher; a document with none scores 0.0; with equal term frequency the longer document scores lower; two identical documents score identically.
This lesson is locked
Lessons open one at a time. Finish the previous lesson to unlock this one.