Syllabus Lesson 152 of 239 · Build a RAG Pipeline
Build a RAG Pipeline

Embed the Chunks

You have chunks of text. The retriever, though, compares vectors, not words, so the missing step is embedding: turn each chunk into a fixed-length list of numbers so that text with similar content lands at similar coordinates. This is the "embed" promised in the pipeline (chunk -> embed -> store -> retrieve -> prompt), and it is what feeds VectorStore.add(id, text, vector) in the next lesson.

Production systems call a neural embedding model (OpenAI, Cohere, a sentence-transformer) over the network, but the contract is identical and you can build a real, deterministic embedder offline to learn it: a bag-of-words vector. Fix a vocab (the words you care about), then represent a piece of text by how many times each vocab word appears in it.

vocab = ["cat", "dog", "fish"]
# "cat cat dog" -> [2, 1, 0]   (two cats, one dog, no fish)

One more step makes these usable for cosine search: L2-normalize the vector (divide it by its own length, np.linalg.norm). After that every vector has length 1, so cosine similarity becomes a plain dot product and long chunks stop dominating short ones just for having more words. A chunk that contains none of the vocab words has nothing to normalize, so leave that all-zero vector as it is (never divide by zero).

Build two functions.

  • build_vocab(chunks) -> the sorted list of unique words across all chunks. Tokenize each chunk with text.lower().split() (lowercase, split on whitespace).
  • embed(text, vocab) -> a numpy float array of length len(vocab): the count of each vocab word in text (same tokenizer), then L2-normalized. If the counts are all zero, return the zero vector unchanged.

Press Run to embed a few chunks and confirm related text points the same way.

Your turn

Write build_vocab(chunks) returning the sorted unique words across all chunks (tokenize with text.lower().split()), and embed(text, vocab) returning a numpy float array of length len(vocab) holding the count of each vocab word in text (same tokenizer), L2-normalized with np.linalg.norm. If every count is zero, return the all-zero vector unchanged (do not divide by zero).

Spotted a problem in this lesson? Report it

Code · runs in your browser
Output