Embeddings & Semantic Search from Scratch

Clustering for Routing & Dedup (KMeans)

Once text is vectors, clustering unlocks two jobs you hit constantly in LLM systems. Semantic routing: group example utterances into clusters (billing, tech support, sales), then send a new message to the handler whose cluster centroid it sits nearest. Deduplication: near-identical documents tend to land in the same cluster, so you only need to compare documents within a cluster, keep one copy of each near-duplicate, and drop the rest before filling a context window.
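To make the routing idea concrete before any library is involved, here is a minimal sketch of nearest-centroid routing in plain Python. The centroid values, the handler names, and the helper names euclidean and route are all made up for illustration; real centroids would come from clustering actual embeddings.

```python
import math

# Hypothetical centroids: in a real system each would be the mean embedding
# of one cluster of example utterances. 2-D toy vectors keep it readable.
CENTROIDS = {
    "billing": [0.9, 0.1],
    "tech_support": [0.1, 0.9],
    "sales": [0.7, 0.7],
}


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def route(vector, centroids=CENTROIDS):
    """Return the name of the handler whose centroid is closest to `vector`."""
    return min(centroids, key=lambda name: euclidean(vector, centroids[name]))


print(route([0.85, 0.2]))  # billing
print(route([0.2, 0.8]))   # tech_support
```

Routing is just an argmin over distances to the centroids, which is the same rule KMeans applies when it assigns a new point to a cluster.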

KMeans is the workhorse. You tell it how many clusters k you want; it finds k centroids and assigns every vector to its nearest one. scikit-learn does the heavy lifting:

from sklearn.cluster import KMeans
model = KMeans(n_clusters=k, random_state=0, n_init=10)
labels = model.fit_predict(vectors)    # cluster index per input vector
model.cluster_centers_                 # the k centroids, shape (k, n_features)
model.predict([new_vector])[0]         # route a NEW vector; predict expects a 2D array
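For intuition about what "finds k centroids and assigns every vector to its nearest one" means, the loop below is a rough from-scratch sketch of the classic assign-then-update iteration (Lloyd's algorithm). It is not scikit-learn's implementation: the function name simple_kmeans and the choice to seed with the first k points are assumptions made to keep the example deterministic and short.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; enough for comparing which centroid is nearest."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def simple_kmeans(points, k, max_iter=100):
    # Assumption: seed with the first k points. scikit-learn instead uses
    # randomized seeding, which is why it takes a random_state.
    centroids = [list(p) for p in points[:k]]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [nearest(p, centroids) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[i])  # empty cluster: leave centroid in place
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


points = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [1.0, 0.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = simple_kmeans(points, k=2)
print(labels)  # [0, 1, 0, 0, 1, 1]
```

With a poor seed this loop can settle on a bad local optimum, which is the reason library implementations restart from several seeds (the n_init parameter) and keep the best run.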

Two things make or break this in practice:

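Determinism. KMeans chooses its starting centroids at random (scikit-learn's default k-means++ seeding is still randomized), so fix the seed by passing an integer random_state; this lesson uses random_state=0. Without a fixed seed, the same data can produce different labels on different runs, and your routing becomes unreproducible. (The integer label values themselves are arbitrary names; what matters is which points share a label.)
No refitting. Routing a new point uses predict on the already-fitted model, which assigns it to the nearest centroid. predict expects a 2D array, so wrap a single vector as [vector] and take the first element of the result.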
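Both points can be checked in a few lines. The sketch below assumes NumPy and scikit-learn are installed; the toy vectors and the variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (an assumption for illustration): two tight groups in 2-D.
vectors = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Determinism: same data and same seed, fitted twice, gives the same labels.
first = KMeans(n_clusters=2, random_state=0, n_init=10).fit(vectors)
second = KMeans(n_clusters=2, random_state=0, n_init=10).fit(vectors)
same_labels = np.array_equal(first.labels_, second.labels_)
print(same_labels)

# Routing: call predict on the fitted model. predict expects shape
# (n_samples, n_features), so the single vector is wrapped in a list.
new_vector = [4.9, 5.0]
centers_before = first.cluster_centers_.copy()
cluster = int(first.predict([new_vector])[0])
centers_unchanged = np.array_equal(centers_before, first.cluster_centers_)

print(cluster == int(first.labels_[3]))  # same cluster as the nearby training point
print(centers_unchanged)                 # predict did not move the centroids
```

Comparing against first.labels_[3] rather than a hard-coded 0 or 1 sidesteps the fact that the integer names of the clusters are arbitrary.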

Two well-separated groups of points should fall into two clean clusters, and a new point sitting in one group should route to that group's cluster.
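As a quick sanity check of that claim, the sketch below builds two synthetic groups and verifies the expected behavior. The 8-dimensional random vectors are stand-ins for real embeddings, and the group sizes, centers, and spread are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)  # seeded so the toy data is reproducible
# Assumption: 5 points scattered tightly around each of two far-apart centers.
group_a = rng.normal(loc=0.0, scale=0.1, size=(5, 8))
group_b = rng.normal(loc=3.0, scale=0.1, size=(5, 8))
vectors = np.vstack([group_a, group_b])

model = KMeans(n_clusters=2, random_state=0, n_init=10)
labels = model.fit_predict(vectors).tolist()  # ndarray -> plain list of ints

# Label values are arbitrary, so compare membership rather than the integers.
a_labels, b_labels = set(labels[:5]), set(labels[5:])
clean_split = len(a_labels) == 1 and len(b_labels) == 1 and a_labels != b_labels
print(clean_split)

# A new point at group_b's center should route to group_b's cluster.
new_point = np.full(8, 3.0)
routed = int(model.predict(new_point.reshape(1, -1))[0])
print(routed in b_labels)
```

Real embeddings are rarely this cleanly separated, so treat this as a test of the plumbing rather than evidence about clustering quality on production data.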

Build two functions. cluster_route(vectors, k) fits KMeans(n_clusters=k, random_state=0, n_init=10) on vectors and returns a tuple (labels, model), where labels is a list of cluster indices (one per input vector). route_new(model, vector) returns the cluster index (an int) that a new vector routes to via the fitted model.

Your turn

Write two functions. cluster_route(vectors, k) fits KMeans(n_clusters=k, random_state=0, n_init=10) on the vectors and returns a tuple (labels, model): labels is a list with one cluster index per vector, and model is the fitted estimator. route_new(model, vector) returns the cluster index (int) a new vector routes to using the fitted model. Two clearly separated groups must land in two distinct clusters, a new vector must route to the nearest centroid, the number of distinct clusters must equal k, and results must be identical across runs thanks to random_state=0.
