Index the Knowledge Base
Welcome to your flagship portfolio build: a Customer Support Copilot. Over four lessons you will build and evaluate a retrieval-augmented generation (RAG) support assistant end to end -- the kind of system behind the chat widget that answers "how do I reset my password?" from a company's help articles instead of guessing. When you finish, the resume line writes itself: built and evaluated a RAG pipeline and eval harness.
A RAG system has one golden rule: the model only answers from documents you actually retrieved. So before any language model is involved, you need fast, accurate retrieval. That is this lesson. A hiring manager reading your code here is checking one thing: can you turn a pile of text into something you can search by weighted term relevance, not just exact string matches?
You will build a small TF-IDF retriever. The idea in one breath: represent each article as a vector of word weights, where a word counts for more if it is frequent in this article (term frequency) but rare across all articles (inverse document frequency). Compare a query to every article with cosine similarity and return the closest ones.
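To make the weighting concrete, here is a minimal sketch on a made-up three-article corpus (not the lesson's DOCS). It assumes raw counts for term frequency and a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is one common variant among several; the variable names are illustrative only.

```python
import math
from collections import Counter

# Hypothetical corpus, already tokenized (an assumption for illustration).
corpus = [
    ["reset", "password", "account"],
    ["delete", "account"],
    ["update", "billing", "account"],
]

n_docs = len(corpus)

# Document frequency: in how many articles does each word appear?
df = Counter(word for doc in corpus for word in set(doc))

# Smoothed IDF (one common variant): words found everywhere get the minimum weight.
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in df}

# TF-IDF weights for the first article, using raw counts as term frequency.
tf = Counter(corpus[0])
weights = {word: tf[word] * idf[word] for word in tf}

for word, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{word:10s} {weight:.3f}")
```

Here "account" appears in every article, so it ends up with the lowest weight, while "reset" and "password" appear in only one article and count for more.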
Three moves make it work:
- Tokenize -- lowercase, split into words, and drop tiny stopwords like "the" and "is" that match everything and mean nothing. Use re.findall(r"[a-z0-9]+", text.lower()), then filter against the stopword set.
- Weight -- for each article, term frequency times IDF, then normalize the vector to unit length so long and short articles compare fairly.
- Score -- cosine similarity between the query vector and each article vector. With unit vectors that is just the dot product. Rank descending.
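The tokenize and score moves above can be sketched in plain Python. This is an illustration under assumptions: the stopword set is a tiny made-up one, vectors are stored as sparse {term: weight} dicts, and the helper names normalize and dot are hypothetical rather than part of the lesson's contract.

```python
import math
import re

# A tiny illustrative stopword set (an assumption); a real one would be longer.
STOPWORDS = {"the", "is", "a", "an", "to", "do", "i", "my", "how", "of", "and"}

def tokenize(text):
    """Lowercase, split on runs of letters/digits, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def normalize(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def dot(a, b):
    """Dot product of two sparse vectors; only shared terms contribute."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

tokens = tokenize("How do I reset my password?")
print(tokens)

# Made-up weights: same direction, different magnitudes.
short_doc = normalize({"reset": 1.0, "password": 1.0})
long_doc = normalize({"reset": 5.0, "password": 5.0})
print(dot(short_doc, long_doc))
```

After normalization the two vectors are identical, so their dot product is 1 (up to floating-point rounding). That is the sense in which unit length lets long and short articles compare fairly.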
Build two functions over the article list:
index = build_index(docs)    # precompute weights once
retrieve(query, index, k)    # -> the ids of the top-k closest docs
build_index(docs) returns whatever structure you need (a dict is fine). retrieve returns a list of integer indices into docs, best match first, at most k of them. You may use plain Python and math, or numpy, or scikit-learn's TfidfVectorizer -- whatever you reach for, the contract is the same. Keep DOCS defined at module level so the grader can rebuild your index. Press Run to watch a few support questions resolve to the right article.
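One possible shape for the two functions, in plain Python, is sketched below. It is not the reference solution. The three DOCS strings are placeholders standing in for the real 8-article knowledge base, the stopword set and the smoothed IDF formula are assumptions, and the dict layout with "idf" and "vectors" keys is just one structure that satisfies the contract.

```python
import math
import re
from collections import Counter

# Placeholder articles for illustration only; the lesson's real DOCS has 8 articles.
DOCS = [
    "To reset your password, open Settings and choose Reset Password.",
    "Refunds are issued to the original payment method within five days.",
    "To cancel your subscription, go to Billing and select Cancel Plan.",
]

# Illustrative stopword set (an assumption).
STOPWORDS = {"the", "is", "a", "an", "to", "and", "your", "are", "how",
             "do", "i", "my", "can", "get", "go", "within"}

def tokenize(text):
    """Lowercase, split on runs of letters/digits, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def _unit(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def build_index(docs):
    """Precompute IDF and one unit-length TF-IDF vector per article."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [_unit({t: tf * idf[t] for t, tf in Counter(toks).items()})
               for toks in tokenized]
    return {"idf": idf, "vectors": vectors}

def retrieve(query, index, k):
    """Return up to k article indices, best cosine match first."""
    idf = index["idf"]
    # Words never seen in the corpus have no IDF, so they are skipped.
    counts = Counter(t for t in tokenize(query) if t in idf)
    q = _unit({t: tf * idf[t] for t, tf in counts.items()})
    # Both sides are unit vectors, so the dot product is the cosine similarity.
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items())
              for vec in index["vectors"]]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

index = build_index(DOCS)
top = retrieve("How do I reset my password?", index, k=2)
print(top)
```

Note that this sketch still returns articles whose score is zero when fewer than k articles match. Whether to filter those out is a design choice that the "at most k" wording leaves open.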
Build a TF-IDF retriever over the 8-article support knowledge base. Keep DOCS at module level. Write build_index(docs) (tokenize with stopword removal, weight terms by tf*idf, normalize to unit vectors) and retrieve(query, index, k) returning the indices of the top-k closest docs by cosine similarity, best first, at most k. Plain Python, numpy, or scikit-learn all work.