Embed the Chunks
You have chunks of text. The retriever, though, compares vectors, not words, so the missing step is embedding: turn each chunk into a fixed-length list of numbers so that text with similar content lands at similar coordinates. This is the "embed" promised in the pipeline (chunk -> embed -> store -> retrieve -> prompt), and it is what feeds VectorStore.add(id, text, vector) in the next lesson.
Production systems call a neural embedding model (OpenAI, Cohere, a sentence-transformer) over the network, but the contract is identical and you can build a real, deterministic embedder offline to learn it: a bag-of-words vector. Fix a vocab (the words you care about), then represent a piece of text by how many times each vocab word appears in it.
vocab = ["cat", "dog", "fish"]
# "cat cat dog" -> [2, 1, 0] (two cats, one dog, no fish)
One more step makes these usable for cosine search: L2-normalize the vector (divide it by its own length, np.linalg.norm). After that every vector has length 1, so cosine similarity becomes a plain dot product and long chunks stop dominating short ones just for having more words. A chunk that contains none of the vocab words has nothing to normalize, so leave that all-zero vector as it is (never divide by zero).
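The normalization step can be sketched on the "cat cat dog" counts. This is a minimal illustration, not the lesson's required code: the helper name l2_normalize and the doubled chunk are assumptions made up for the example.

```python
import numpy as np

def l2_normalize(vec):
    """Scale vec to length 1; return an all-zero vector unchanged.

    Hypothetical helper name, used only for this illustration.
    """
    norm = np.linalg.norm(vec)
    if norm == 0:
        return vec  # nothing to normalize, and dividing would be 0/0
    return vec / norm

# Counts over vocab ["cat", "dog", "fish"].
counts = np.array([2.0, 1.0, 0.0])   # "cat cat dog"
longer = np.array([4.0, 2.0, 0.0])   # same word mix, repeated twice (assumed example)

unit = l2_normalize(counts)
unit_longer = l2_normalize(longer)

# Both vectors now have length 1, so the dot product is the cosine similarity.
similarity = float(unit @ unit_longer)

# The all-zero case passes through untouched.
zero = l2_normalize(np.zeros(3))
```

The two chunks differ only in length, so after normalization they map to the same unit vector and the extra words in the longer chunk carry no extra weight.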
Build two functions.
build_vocab(chunks) -> the sorted list of unique words across all chunks. Tokenize each chunk with text.lower().split() (lowercase, split on whitespace).
embed(text, vocab) -> a numpy float array of length len(vocab): the count of each vocab word in text (same tokenizer), then L2-normalized. If the counts are all zero, return the zero vector unchanged.
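One possible shape for the two functions is sketched below. It follows the description above but is only one way to write it: the tokenize helper and the sample chunks are assumptions added for illustration.

```python
import numpy as np

def tokenize(text):
    # Hypothetical helper: the tokenizer described above (lowercase, split on whitespace).
    return text.lower().split()

def build_vocab(chunks):
    """Return the sorted list of unique words across all chunks."""
    words = set()
    for chunk in chunks:
        words.update(tokenize(chunk))
    return sorted(words)

def embed(text, vocab):
    """Count each vocab word in text, then L2-normalize the counts."""
    index = {word: i for i, word in enumerate(vocab)}
    vec = np.zeros(len(vocab), dtype=float)
    for token in tokenize(text):
        i = index.get(token)
        if i is not None:      # words outside the vocab are ignored
            vec[i] += 1.0
    norm = np.linalg.norm(vec)
    if norm == 0:
        return vec             # all-zero vector: return unchanged
    return vec / norm

# Sample chunks invented for this sketch.
chunks = ["the cat sat", "the dog sat", "fish swim"]
vocab = build_vocab(chunks)

a = embed("the cat sat", vocab)
b = embed("the dog sat", vocab)
c = embed("fish swim", vocab)

related = float(a @ b)      # the two chunks share "the" and "sat"
unrelated = float(a @ c)    # the two chunks share no words
```

Because every non-zero vector has length 1, the dot products above are cosine similarities: chunks that share words score above zero, and chunks with no vocab words in common score exactly zero.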
Press Run to embed a few chunks and confirm related text points the same way.
Write build_vocab(chunks) returning the sorted unique words across all chunks (tokenize with text.lower().split()), and embed(text, vocab) returning a numpy float array of length len(vocab) holding the count of each vocab word in text (same tokenizer), L2-normalized with np.linalg.norm. If every count is zero, return the all-zero vector unchanged (do not divide by zero).