Embeddings & Semantic Search from Scratch

Embeddings Intuition (Deterministic Mock Embedder)

Bag-of-words has a fatal blind spot: "dog" and "puppy" are different words, so their count vectors share nothing and their cosine is 0, even though they mean almost the same thing. Real embeddings fix this by placing words (or whole sentences) into a vector space where things used in similar ways land near each other. The guiding idea is the distributional hypothesis: a word is known by the company it keeps. Words that appear in the same contexts get similar vectors.
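To make the blind spot concrete, here is a minimal sketch that computes the cosine of two bag-of-words count vectors. The three-word vocabulary and the helper names cosine and bag_of_words are assumptions made for this illustration; they are not part of the lesson's starter code.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical three-word vocabulary, used only for this illustration.
vocab = ["dog", "puppy", "invoice"]

def bag_of_words(text, vocab):
    """Count how often each vocabulary word appears in whitespace-split text."""
    tokens = text.split()
    return [tokens.count(w) for w in vocab]

dog = bag_of_words("dog", vocab)      # [1, 0, 0]
puppy = bag_of_words("puppy", vocab)  # [0, 1, 0]

print(cosine(dog, puppy))  # 0.0: the two vectors share no nonzero slot
print(cosine(dog, dog))    # 1.0: a vector always matches itself
```

Because each word owns exactly one slot, two different words can never overlap, no matter how close their meanings are.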

Production embeddings come from a trained neural network (for example, a model loaded through the sentence-transformers library) or a hosted API. We cannot run those here: this sandbox has no GPU and no network access. So you will build a deterministic mock embedder that reproduces the geometry of the real thing, with related words pointing in similar directions, without any trained weights. It will not match the quality of a learned model, but it is enough for learning the mechanism.

The trick is a co-occurrence vector. Given a small curated corpus, the embedding of a word is a count of which other words appear in the same documents as it. If "dog" and "puppy" both keep company with "park", "runs", and "nap", their context vectors line up and their cosine is high. The word "invoice" keeps company with "pay" and "due" instead, so its vector points elsewhere.
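The counting idea can be sketched on a tiny example. The corpus below is hypothetical (it is NOT the corpus shipped with this lesson), documents are assumed to split on whitespace, and context_counts and sparse_cosine are illustrative helper names. The sketch stores counts in a Counter rather than the fixed-length list the exercise asks for.

```python
import math
from collections import Counter

# Hypothetical toy corpus, NOT the one shipped with the lesson.
corpus = [
    "the dog runs in the park",
    "the puppy runs in the park",
    "the dog likes a nap",
    "the puppy needs a nap",
    "pay the invoice when due",
]

def context_counts(word, corpus):
    """Count the words that share a document with `word`, skipping `word` itself."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def sparse_cosine(a, b):
    """Cosine similarity between two Counters treated as sparse vectors."""
    dot = sum(a[k] * b[k] for k in a)  # a missing key in a Counter reads as 0
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

dog = context_counts("dog", corpus)
puppy = context_counts("puppy", corpus)
invoice = context_counts("invoice", corpus)

print(round(sparse_cosine(dog, puppy), 2))    # 0.93
print(round(sparse_cosine(dog, invoice), 2))  # 0.39
```

In this toy corpus the dog and invoice score is not zero only because both contexts contain "the"; every other context word differs.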

On the corpus shipped with this lesson, the payoff is concrete (the corpus and vocab arguments are omitted here for brevity):

cosine(embed("dog"), embed("puppy"))    # high, around 0.9
cosine(embed("dog"), embed("invoice"))  # low,  around 0.37

That inequality, cosine(dog, puppy) > cosine(dog, invoice), is the whole point of embeddings expressed in one line.

For contrast, this is what a real call looks like (it will NOT run here; it is shown for context only):

# real embeddings, requires a model download and is NOT used in grading
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
vec = model.encode("a happy dog")   # 384-dim learned vector

Build embed(word, corpus, vocab), a function that returns the co-occurrence vector for a word: a fixed-length list (one slot per vocabulary word) where slot j counts how many times vocabulary word j co-occurs with the target word across all corpus documents that contain the target word. Do not count the target word against itself. The function must be deterministic: the same word, corpus, and vocab always produce the same vector.

Your turn

Write embed(word, corpus, vocab), a deterministic mock embedder. For the given word, scan every document in corpus (each a lowercase string). In documents that contain the word, count the other words (skip the target word itself) into a vocabulary-length vector, returned as a list in vocab order. The result is a co-occurrence context vector: words used in similar contexts get similar vectors, so on the lesson corpus cosine(embed("dog"), embed("puppy")) > cosine(embed("dog"), embed("invoice")). The same word always yields the same vector, and the length always equals len(vocab).
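Before submitting, a few sanity checks like the ones below may help confirm that your function matches the description. This is only a sketch: check_embed is a hypothetical helper, its corpus and vocabulary are invented for illustration, and it assumes documents are split on whitespace. It is not the lesson's grader.

```python
def check_embed(embed):
    """Run a few sanity checks against an embed(word, corpus, vocab) function.

    The corpus and vocabulary here are hypothetical, not the lesson's data.
    """
    corpus = ["the dog runs", "the puppy runs", "pay the invoice"]
    vocab = ["the", "dog", "puppy", "runs", "pay", "invoice"]

    vec = embed("dog", corpus, vocab)
    assert isinstance(vec, list), "result should be a list"
    assert len(vec) == len(vocab), "one slot per vocabulary word"
    assert vec[vocab.index("dog")] == 0, "target word is not counted against itself"
    assert vec == embed("dog", corpus, vocab), "same input, same vector"
    # "dog" appears only in "the dog runs", so only "the" and "runs" are counted.
    assert vec == [1, 0, 0, 1, 0, 0], "counts follow vocab order"
    return True
```

Calling check_embed(embed) with your own function returns True if every check passes and raises an AssertionError naming the first property that failed.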
