Embeddings & Semantic Search from Scratch

Cosine Similarity in numpy

Now that documents are vectors, how do you measure whether two of them are "alike"? The instinct is Euclidean distance (straight-line distance), but for text that is a trap. A long document repeats words, so its count vector is simply bigger, and Euclidean distance punishes it for length even when the topic is identical. What you actually care about is direction: which words, in what proportion. That is what cosine similarity measures.
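Here is a small numeric sketch of that trap. The count vectors and variable names are made up for illustration (a hypothetical three-word vocabulary), not data from the lesson:

```python
import numpy as np

# Hypothetical counts over a three-word vocabulary.
short_doc = np.array([2, 1, 0], dtype=float)   # short document on some topic
long_doc = 5 * short_doc                       # same proportions, five times longer
other_doc = np.array([0, 1, 2], dtype=float)   # different word mix, similar length

# Euclidean distance is the L2 norm of the difference.
d_same_topic = np.linalg.norm(short_doc - long_doc)
d_other_topic = np.linalg.norm(short_doc - other_doc)

print(d_same_topic)   # about 8.94
print(d_other_topic)  # about 2.83
```

Euclidean distance ranks the off-topic document as closer than the longer document with identical word proportions, which is the opposite of what a search should return.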

Cosine similarity is the cosine of the angle between two vectors:

cosine(a, b) = (a . b) / (|a| * |b|)

where a . b is the dot product and |a| is the length (the L2 norm) of a. In general the value lies between -1 and 1; for count vectors, whose entries are never negative, it lands in the range 0 to 1:

  • 1.0 -> same direction (a document and a copy of itself scaled up).
  • 0.0 -> orthogonal, no shared direction (no words in common).
  • Values in between -> partial overlap.
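The three cases can be checked with a few toy vectors. This is a minimal sketch: `cos_sim` is a hypothetical helper name, the vectors are invented, and it assumes nonzero inputs.

```python
import numpy as np

def cos_sim(a, b):
    # Hypothetical helper: dot product over the product of the L2 norms.
    # No zero-vector handling here; nonzero inputs are assumed.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

same_direction = cos_sim([1, 2, 0], [3, 6, 0])   # second vector is 3x the first
no_overlap = cos_sim([1, 0, 0], [0, 2, 5])       # no shared nonzero positions
partial = cos_sim([1, 1, 0], [1, 0, 0])          # one of two words shared

print(round(same_direction, 6))  # 1.0
print(round(no_overlap, 6))      # 0.0
print(round(partial, 6))         # 0.707107
```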

Because it divides out the lengths, cosine treats [1, 1] and [10, 10] as identical. That length-invariance is a large part of why embedding and RAG systems commonly rank by cosine rather than by Euclidean distance.
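One way to see the length-invariance is to scale each vector to unit length first; the dot product of the two unit vectors is then the cosine similarity. A short sketch with the two vectors from the text (the variable names are illustrative):

```python
import numpy as np

small = np.array([1, 1], dtype=float)
big = np.array([10, 10], dtype=float)

# Scale each vector to unit length (divide by its L2 norm).
small_unit = small / np.linalg.norm(small)
big_unit = big / np.linalg.norm(big)

print(np.allclose(small_unit, big_unit))  # True: same direction once length is removed
print(np.linalg.norm(small - big))        # about 12.73: Euclidean distance still sees the length gap

# The dot product of the unit vectors is the cosine similarity.
cos_via_units = float(np.dot(small_unit, big_unit))
print(round(cos_via_units, 6))            # 1.0
```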

In numpy the whole thing is three calls:

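import numpy as np
a = np.array([1, 2, 3], dtype=float)
b = np.array([2, 4, 6], dtype=float)   # b is 2 * a, so it points the same way
# dot product divided by the product of the two L2 norms
np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))   # -> 1.0 (up to floating-point rounding)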

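One edge case will bite you in production: the zero vector (a document with no vocabulary words) has length 0, so the formula computes 0 / 0. numpy returns NaN for that (with only a RuntimeWarning), and because NaN compares False against every value, it can silently scramble any downstream sort or ranking. A robust similarity returns 0.0 when either vector is all zeros.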
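The failure is easy to reproduce. The sketch below uses made-up vectors and suppresses numpy's warning only so the output stays readable:

```python
import math
import numpy as np

zero = np.zeros(3)                  # a document with no vocabulary words
doc = np.array([1.0, 2.0, 3.0])

# Suppress numpy's RuntimeWarning so the demo prints cleanly.
with np.errstate(invalid="ignore", divide="ignore"):
    score = np.dot(zero, doc) / (np.linalg.norm(zero) * np.linalg.norm(doc))

print(math.isnan(score))            # True: 0 / 0 is NaN

# NaN compares False against everything, so results depend on input order.
print(max([score, 0.9]))            # nan
print(max([0.9, score]))            # 0.9
```

The same list of scores gives a different "best match" depending on where the NaN happens to sit, which is why it is safer to never produce it.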

Build cosine(a, b) that returns the cosine similarity as a float, and returns 0.0 (never NaN) when either input is the zero vector. Inputs may be Python lists or numpy arrays.

Your turn

Write cosine(a, b) that returns the cosine similarity of two vectors: the dot product divided by the product of their L2 norms. Accept Python lists or numpy arrays and return a float. Parallel vectors return 1.0, orthogonal vectors return 0.0, and the function is symmetric. If either vector is all zeros, return 0.0 rather than dividing by zero and producing NaN.
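If you want something to compare your attempt against, here is one possible sketch rather than the lesson's reference solution. It assumes an exact comparison of the norm product with 0.0 is an adequate zero-vector check, which holds for ordinary count vectors:

```python
import numpy as np

def cosine(a, b):
    # Accept lists or arrays; work in floats.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A zero vector has norm 0; return 0.0 instead of computing 0 / 0.
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)

print(cosine([0, 0, 0], [1, 2, 3]))                      # 0.0
print(cosine([1, 0], [0, 1]))                            # 0.0
print(round(cosine([1, 2, 3], [2, 4, 6]), 6))            # 1.0
print(cosine([1, 2], [3, 1]) == cosine([3, 1], [1, 2]))  # True
```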
