Project: Customer Support Copilot

Index the Knowledge Base

Welcome to your flagship portfolio build: a Customer Support Copilot. Over four lessons you will build and evaluate a retrieval-augmented generation (RAG) support assistant end to end -- the exact system behind the chat widget that answers "how do I reset my password?" from a company's help articles instead of guessing. When you finish, the resume line writes itself: built and evaluated a RAG pipeline and eval harness.

A RAG system has one golden rule: the model only answers from documents you actually retrieved. So before any language model is involved, you need fast, accurate retrieval. That is this lesson. A hiring manager reading your code here is checking one thing: can you turn a pile of text into something you can search by relevance, ranked by how strongly the distinctive words match, not just by exact string lookup?

You will build a small TF-IDF retriever. The idea in one breath: represent each article as a vector of word weights, where a word counts for more if it is frequent in this article (term frequency) but rare across all articles (inverse document frequency). Compare a query to every article with cosine similarity and return the closest ones.
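
To make the weighting concrete, here is a minimal sketch on a hypothetical three-article corpus (the toy articles and variable names are illustrative, not part of the lesson's DOCS). It assumes one common IDF variant, log(N / df), and raw counts for term frequency; other variants exist, and libraries often smooth the formula so that no weight drops to exactly zero.

```python
import math
from collections import Counter

# Hypothetical three-article corpus, already tokenized for brevity.
toy_docs = [
    ["reset", "password", "account"],
    ["delete", "account"],
    ["update", "billing", "account"],
]

n_docs = len(toy_docs)

# Document frequency: in how many articles does each word appear?
df = Counter(word for doc in toy_docs for word in set(doc))

# Assumed IDF variant: log(N / df). A word in every article gets log(1) = 0.
idf = {word: math.log(n_docs / count) for word, count in df.items()}

# TF-IDF weights for the first article, using raw counts as term frequency.
tf = Counter(toy_docs[0])
weights = {word: tf[word] * idf[word] for word in tf}

print(weights)
```

In this toy corpus "account" appears in all three articles, so its weight is zero, while "reset" and "password" each appear in only one article and carry all of the first article's weight.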

Three moves make it work:

  • Tokenize -- lowercase, split into words, and drop common stopwords like "the" and "is" that appear in nearly every article and carry little meaning. In Python: re.findall(r"[a-z0-9]+", text.lower()), then filter out anything in your stopword set.
  • Weight -- for each article, term frequency times IDF, then normalize the vector to unit length so long and short articles compare fairly.
  • Score -- cosine similarity between the query vector and each article vector. With unit vectors that is just the dot product. Rank descending.
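
The tokenizing and scoring moves can be sketched in a few lines. This is an illustration under stated assumptions, not the lesson's reference solution: the stopword set is a tiny hypothetical one, the helper names are made up, and the IDF factor is deliberately left out (raw counts stand in for weights) so the effect of unit normalization is easy to see.

```python
import math
import re
from collections import Counter

# Hypothetical stopword set; a real one would be longer.
STOPWORDS = {"the", "is", "a", "to", "how", "do", "i", "my"}

def tokenize(text):
    """Lowercase, keep runs of letters and digits, drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def unit_vector(weights):
    """Scale a {term: weight} mapping to length 1; an all-zero vector becomes empty."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def dot(a, b):
    """Dot product of two sparse vectors; for unit vectors this is the cosine."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

query = unit_vector(Counter(tokenize("How do I reset my password?")))
short = unit_vector(Counter(tokenize("Reset the password.")))
long_ = unit_vector(Counter(tokenize("Reset the password. " * 10)))

print(tokenize("How do I reset my password?"))  # ['reset', 'password']
print(dot(query, short), dot(query, long_))     # both approximately 1.0
```

The ten-times-longer article scores the same as the short one because normalization removes length from the comparison; only the direction of the vector matters.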

Build two functions over the article list:

index = build_index(docs)          # precompute weights once
retrieve(query, index, k)          # -> indices of the top-k closest docs, best first

build_index(docs) returns whatever structure you need (a dict is fine). retrieve returns a list of integer indices into docs, best match first, at most k of them. You may use plain Python and math, or numpy, or scikit-learn's TfidfVectorizer -- whatever you reach for, the contract is the same. Keep DOCS defined at module level so the grader can rebuild your index. Press Run to watch a few support questions resolve to the right article.
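
Before running against the grader, it can help to check your own output against the contract. The helper below is a hypothetical sketch (check_contract is not part of the lesson or the grader). It also makes two assumptions the lesson does not state: that results should contain no duplicate indices, and that only built-in int counts as an integer, so numpy integers would need converting with int() first. It cannot verify the "best match first" ordering, since that requires the scores.

```python
def check_contract(result, docs, k):
    """Raise AssertionError if `result` breaks the retrieve() contract."""
    assert isinstance(result, list), "must be a list"
    assert len(result) <= k, "at most k results"
    assert all(isinstance(i, int) for i in result), "indices must be ints"
    assert all(0 <= i < len(docs) for i in result), "indices must point into docs"
    # Assumption: the same article should not be returned twice.
    assert len(set(result)) == len(result), "no duplicate docs"
    return True

# Usage with a hypothetical three-article list and k=2.
print(check_contract([2, 0], ["a", "b", "c"], 2))  # True
```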

Your turn

Build a TF-IDF retriever over the 8-article support knowledge base. Keep DOCS at module level. Write build_index(docs) (tokenize with stopword removal, weight terms by tf*idf, normalize to unit vectors) and retrieve(query, index, k) returning the indices of the top-k closest docs by cosine similarity, best first, at most k. Plain Python, numpy, or scikit-learn all work.
