Syllabus Lesson 144 of 239 · Embeddings & Semantic Search from Scratch

TF-IDF Retriever Over a Doc Set

Raw counts overweight common words: "the" appears everywhere and tells you nothing about a document. TF-IDF fixes that. It multiplies how often a term appears in a document (term frequency) by how rare that term is across the whole collection (inverse document frequency). Words that are frequent in one document but rare overall, the words that actually distinguish it, get the highest weight.
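To make the weighting concrete, here is a minimal from-scratch sketch using the unsmoothed textbook form, count × log(N / df). The toy corpus and the helper name tfidf are invented for illustration. scikit-learn's TfidfVectorizer defaults to a smoothed idf and L2-normalizes each row, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the refund policy covers the order",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Raw term count times log(N / df), the unsmoothed textbook form."""
    counts = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}

weights = tfidf(tokenized[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {w:.3f}")
```

In the first document, "the" occurs twice but appears in every document, so its idf is log(3/3) = 0 and its weight vanishes. "sat", which appears in only one document, outranks "cat", which appears in two.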

Stack the TF-IDF vectors of your documents into a matrix, turn the query into a TF-IDF vector the same way, and rank documents by cosine similarity to the query. That is a complete retriever: the lexical half of nearly every RAG system. scikit-learn gives you both pieces:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Illustrative inputs; substitute your own documents and query.
docs = ["Python is a programming language.", "Refunds are issued within 30 days."]
query = "python language"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs)        # fit vocab + idf on the docs
q_vec = vec.transform([query])              # SAME vocab for the query
sims = cosine_similarity(q_vec, doc_matrix)[0]   # one score per doc
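For intuition about what the cosine score measures, the following is a rough NumPy sketch of the same computation on small dense vectors. The helper name cosine_scores and the numbers are made up for illustration; this is not scikit-learn's implementation.

```python
import numpy as np

def cosine_scores(q, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    q = np.asarray(q, dtype=float)
    doc_matrix = np.asarray(doc_matrix, dtype=float)
    dots = doc_matrix @ q
    norms = np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(q)
    # Guard against all-zero vectors (e.g. a query with no known terms).
    return np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)

# Made-up vectors: row 2 is row 0 scaled by two, row 1 shares no terms with q.
doc_matrix = np.array([[1.0, 0.0, 2.0],
                       [0.0, 3.0, 0.0],
                       [2.0, 0.0, 4.0]])
q = np.array([1.0, 0.0, 2.0])
scores = cosine_scores(q, doc_matrix)
```

Rows 0 and 2 point in the same direction as the query and score the same despite their different lengths, while row 1 shares no non-zero dimension with the query and scores zero. Cosine similarity compares direction, not magnitude, which is why a long document is not favored just for being long.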

Two details that separate a working retriever from a broken one:

  • Call fit_transform on the documents but only transform on the query, so the query is projected into the same vocabulary and idf weights. Fitting the query separately would put it in a different space and the scores would be meaningless.
  • Return the top-k document indices sorted by similarity descending. np.argsort sorts ascending, so sort the negated scores with np.argsort(-sims), then keep the first k entries.
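The ranking step from the second point can be sketched in isolation. The scores below are made up, standing in for the output of cosine_similarity.

```python
import numpy as np

sims = np.array([0.10, 0.72, 0.00, 0.35])  # made-up scores, one per document
k = 2

order = np.argsort(-sims)            # indices from highest score to lowest
top_k = [int(i) for i in order[:k]]  # plain Python ints, not numpy integers
print(top_k)
```

Converting with int(...) matters when the caller expects a list of ints, because np.argsort returns NumPy integer types. Slicing with [:k] is also safe when k exceeds the number of documents: it simply returns every index.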

Build retrieve(query, docs, k) that fits a TfidfVectorizer on docs, scores the query against every document with cosine similarity, and returns the indices of the top k documents as a list of ints, most similar first.

Your turn

Write retrieve(query, docs, k) that returns the indices of the k documents most similar to the query. Fit a TfidfVectorizer on docs, transform the query with the same vectorizer, score every document with cosine_similarity, and return the top k document indices as a list of ints in descending similarity order. A query about Python should return the Python document first; a query about refunds should return the refund document first.
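One way to check your attempt before submitting is a small set of sanity checks like the sketch below. The sample documents, the queries, and the helper name check_retrieve are hypothetical; they are not the lesson's official tests, only an illustration of the two behaviors described above.

```python
# Hypothetical sample data and checks, not the lesson's official tests.
SAMPLE_DOCS = [
    "Python is a programming language used for data science.",
    "Refunds are issued within 30 days of purchase.",
    "The cafeteria serves lunch from noon until two.",
]

def check_retrieve(retrieve):
    """Run basic sanity checks against a candidate retrieve(query, docs, k)."""
    top = retrieve("python programming", SAMPLE_DOCS, 2)
    assert isinstance(top, list) and len(top) == 2
    assert all(type(i) is int for i in top), "indices should be plain ints"
    assert top[0] == 0, "Python query should rank the Python document first"
    assert retrieve("refunds purchase", SAMPLE_DOCS, 1) == [1]
    return True
```

Call check_retrieve(retrieve) once your function is defined; an AssertionError points at the property that failed.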

