TF-IDF Retriever Over a Doc Set
Raw counts overweight common words: "the" appears everywhere and tells you nothing about a document. TF-IDF fixes that. It multiplies how often a term appears in a document (term frequency) by how rare that term is across the whole collection (inverse document frequency). Words that are frequent in one document but rare overall, the words that actually distinguish it, get the highest weight.
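To make the weighting concrete, here is a minimal hand-rolled sketch using the textbook formula, raw count times log(N / df). The three documents and the helper name tfidf are made up for illustration. scikit-learn's TfidfVectorizer uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ, but the ranking of terms follows the same idea.

```python
import math
from collections import Counter

# Hypothetical three-document collection, chosen only for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the refund policy covers the order",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Textbook weights: raw term count times log(N / df)."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / df[term])
            for term, count in counts.items()}

weights = tfidf(tokenized[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {weight:.3f}")
```

In the first document, "the" occurs twice but appears in every document, so its idf is log(1) and its weight is zero. "mat" occurs once but only in this document, so it outweighs "cat", which is shared with the second document.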
Stack the TF-IDF vectors of your documents into a matrix, turn the query into a TF-IDF vector the same way, and rank documents by cosine similarity to the query. That is a complete retriever: the lexical half of nearly every RAG system. scikit-learn gives you both pieces:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs) # fit vocab + idf on the docs
q_vec = vec.transform([query]) # SAME vocab for the query
sims = cosine_similarity(q_vec, doc_matrix)[0] # one score per doc
Two details that separate a working retriever from a broken one:
- Call fit_transform on the documents but only transform on the query, so the query is projected into the same vocabulary and idf weights. Fitting the query separately would put it in a different space and the scores would be meaningless.
- Return the top-k document indices sorted by similarity descending. np.argsort sorts ascending, so sort the negated scores: np.argsort(-sims).
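Both details can be checked in a few lines. The sketch below uses made-up documents and hypothetical similarity scores. It compares the column count of a query transformed with the shared vectorizer against one fitted separately, then shows the ascending and descending orderings from np.argsort.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents and query, for illustration only.
docs = ["python lists and dicts", "refund policy for orders", "python decorators"]
query = "python decorators"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs)
q_shared = vec.transform([query])  # same columns as doc_matrix

# A second vectorizer fitted on the query alone learns only the query's words,
# so its columns do not line up with the document matrix.
q_separate = TfidfVectorizer().fit_transform([query])
print(doc_matrix.shape[1], q_shared.shape[1], q_separate.shape[1])

# Hypothetical scores: argsort is ascending, so negate to get best-first.
sims = np.array([0.2, 0.9, 0.5])
ascending = np.argsort(sims).tolist()
descending = np.argsort(-sims).tolist()
top2 = descending[:2]
print(ascending, descending, top2)
```

The separately fitted query vector has only two columns, one per query word, so it cannot be compared column by column with the document matrix.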
Write retrieve(query, docs, k) that returns the indices of the k documents most similar to the query. Fit a TfidfVectorizer on docs, transform the query with the same vectorizer, score every document with cosine_similarity, and return the top k document indices as a list of ints in descending similarity order. A query about Python should return the Python document first; a query about refunds should return the refund document first.
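One possible shape for the function is sketched below, offered as a starting point and not as the reference solution. The sample documents and queries are invented for the check at the bottom, and the stable sort is an extra choice made here so that tied scores resolve to the lower index.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(query, docs, k):
    """Return indices of the k docs most similar to query, best first."""
    vec = TfidfVectorizer()
    doc_matrix = vec.fit_transform(docs)   # fit vocab + idf on the docs
    q_vec = vec.transform([query])         # reuse the same vocab and idf
    sims = cosine_similarity(q_vec, doc_matrix)[0]
    # Negate for descending order; kind="stable" keeps ties in index order
    # (an assumption of this sketch, not something the task requires).
    top = np.argsort(-sims, kind="stable")[:k]
    return [int(i) for i in top]           # plain ints, not numpy integers

# Hypothetical documents for a quick sanity check.
docs = [
    "Python is a programming language with dynamic typing",
    "Refunds are issued within 14 days of a returned order",
    "The office kitchen is cleaned every Friday",
]
print(retrieve("python typing", docs, 1))
print(retrieve("refunds for my order", docs, 2))
```

TfidfVectorizer lowercases text by default but does no stemming, so the query here says "refunds" to match the document's wording; "refund" alone would not match that token.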