Embeddings & Semantic Search from Scratch

Bag-of-Words Vectors

Before a model can compare two pieces of text, the text has to become numbers. The oldest, simplest way is the bag of words: pick a fixed list of words (the vocabulary), then describe any document by how many times each vocabulary word appears in it. Word order is thrown away, hence the word "bag". It is crude, but TF-IDF and BM25 build directly on these counts, and it is the baseline that makes the motivation for dense embeddings easier to see.

Say the vocabulary is ["the", "cat", "sat", "dog", "ran"]. The document "the cat sat" becomes the count vector [1, 1, 1, 0, 0]: one the, one cat, one sat, zero dog, zero ran. The document "the dog ran the dog" becomes [2, 0, 0, 2, 1].
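The two vectors above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name count_vector is made up for illustration, and it assumes the text is already lowercase and split on whitespace.

```python
from collections import Counter

# Vocabulary and documents taken from the worked example above.
vocab = ["the", "cat", "sat", "dog", "ran"]

def count_vector(doc, vocab):
    # Hypothetical helper: count whitespace-separated tokens.
    counts = Counter(doc.split())
    # A Counter returns 0 for keys it has never seen,
    # so vocabulary words missing from the document become 0.
    return [counts[word] for word in vocab]

print(count_vector("the cat sat", vocab))          # [1, 1, 1, 0, 0]
print(count_vector("the dog ran the dog", vocab))  # [2, 0, 0, 2, 1]
```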

Two rules that matter in production:

The vector length is always the vocabulary size, no matter how long the document is. That fixed shape is what lets you stack documents into a matrix and do linear algebra.
Words not in the vocabulary (out of vocabulary, or OOV) are simply ignored. Real systems cap the vocabulary size, so rare or unseen words contribute nothing to the vector.
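Both rules can be seen in a small sketch. The names index, to_row, and dot are illustrative assumptions, and the third document is invented to show OOV words dropping out. Because every row has the same length, the rows stack into a matrix and a dot product between any two rows is well defined.

```python
vocab = ["the", "cat", "sat", "dog", "ran"]
index = {word: i for i, word in enumerate(vocab)}  # word -> column number

def to_row(doc):
    # Every row starts as vocab-size zeros, whatever the document length.
    row = [0] * len(vocab)
    for token in doc.lower().split():
        col = index.get(token)
        if col is not None:  # OOV tokens have no column, so they are skipped
            row[col] += 1
    return row

docs = [
    "the cat sat",
    "the dog ran the dog",
    "a zebra sat quietly on the mat",  # mostly OOV words
]
matrix = [to_row(d) for d in docs]  # 3 rows, each of length len(vocab)

def dot(u, v):
    # Only possible because both vectors have the same fixed length.
    return sum(a * b for a, b in zip(u, v))

print(matrix[2])                  # [1, 0, 1, 0, 0]
print(dot(matrix[0], matrix[1]))  # 2
```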

In real code you would reach for scikit-learn, which does essentially this (with its own tokenization rules) and hands you a sparse matrix:

from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(vocabulary=["the", "cat", "sat", "dog", "ran"])
X = vec.transform(["the cat sat"])   # one row, vocab-size columns
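One detail worth knowing: with default settings, CountVectorizer lowercases the text and tokenizes with the regular expression (?u)\b\w\w+\b, so punctuation is stripped and single-character tokens are dropped. A plain whitespace split behaves differently. The sketch below mimics that default pattern with the standard re module rather than calling scikit-learn, and the sample sentence is made up for illustration.

```python
import re

# The default token_pattern documented for CountVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "The cat sat. A cat!"  # made-up sample sentence

whitespace_tokens = text.lower().split()
regex_tokens = TOKEN_PATTERN.findall(text.lower())

print(whitespace_tokens)  # ['the', 'cat', 'sat.', 'a', 'cat!']
print(regex_tokens)       # ['the', 'cat', 'sat', 'cat']
```

With a whitespace split, "sat." and "cat!" keep their punctuation and would not match the vocabulary entries "sat" and "cat".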

To understand what that tool does under the hood, you will build the counter yourself.

Build vectorize(doc, vocab): lowercase the document, split it on whitespace, and return a list of counts, one per vocabulary word, in vocabulary order. Ignore any token not in the vocabulary. An empty document returns all zeros.

Your turn

Write vectorize(doc, vocab) that turns a document string into a bag-of-words count list over a fixed vocabulary. Lowercase and split the document on whitespace, count how many times each vocab word appears, and return the counts as a list in vocab order. Tokens outside the vocabulary are ignored; an empty document returns all zeros. The returned list length always equals len(vocab).
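To check your own attempt, a small set of assertions covering each requirement can help. The sketch below is an assumption about how you might test it, not the lesson's official grader: check_vectorize is a hypothetical name, and it takes your function as an argument so it does not give the solution away.

```python
def check_vectorize(vectorize):
    # Hypothetical checker: pass in your own vectorize(doc, vocab).
    vocab = ["the", "cat", "sat", "dog", "ran"]

    # The two worked examples from the lesson.
    assert vectorize("the cat sat", vocab) == [1, 1, 1, 0, 0]
    assert vectorize("the dog ran the dog", vocab) == [2, 0, 0, 2, 1]

    # The document is lowercased before counting.
    assert vectorize("The CAT sat", vocab) == [1, 1, 1, 0, 0]

    # Tokens outside the vocabulary are ignored.
    assert vectorize("a zebra sat", vocab) == [0, 0, 1, 0, 0]

    # An empty document returns all zeros.
    assert vectorize("", vocab) == [0, 0, 0, 0, 0]

    # The result length always equals len(vocab).
    assert len(vectorize("the the the", ["the"])) == 1

    return True

# Usage, once you have written vectorize:
# check_vectorize(vectorize)
```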
