Productionizing LLMs: Cost, Caching & Guardrails

Semantic & Hash Caching

The cheapest model call is the one you never make. Production LLM systems put a cache in front of the model so repeated or near-repeated questions are answered without another model call. Two flavours cover most cases.

Hash caching keys on the exact request. The trick is to normalize first so trivially different prompts collide: trim leading and trailing whitespace, lowercase the text, collapse runs of whitespace to a single space, and fold in any options (temperature, model) in a stable, sorted order before hashing. Then " Hello World " and "hello world" share one key.

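import hashlib
import re

prompt = " Hello World "  # example input
# Trim, lowercase, and collapse whitespace runs; options are not folded in here.
norm = re.sub(r"\s+", " ", prompt.strip().lower())
key = hashlib.sha256(norm.encode("utf-8")).hexdigest()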
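
The snippet above hashes only the prompt text. One way to fold the options in as well is to serialize them with sorted keys and hash them together with the normalized prompt. This is a minimal sketch, not the lesson's reference solution: the use of json.dumps, the newline separator, and the model name "m1" are all assumptions made for illustration.

```python
import hashlib
import json
import re


def cache_key(prompt: str, **opts) -> str:
    """Return a hex SHA-256 digest of the normalized prompt plus sorted options."""
    # Trim, lowercase, and collapse whitespace runs to a single space.
    norm = re.sub(r"\s+", " ", prompt.strip().lower())
    # sort_keys=True makes the serialization independent of keyword order.
    # Assumption: option values are JSON-serializable (str, int, float, bool).
    opts_blob = json.dumps(opts, sort_keys=True)
    # The normalized prompt contains no newline, so "\n" is a safe separator.
    payload = norm + "\n" + opts_blob
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# "m1" is a hypothetical model name used only for this example.
a = cache_key("  Hello   World ", model="m1", temperature=0.0)
b = cache_key("hello world", temperature=0.0, model="m1")
c = cache_key("hello world", temperature=0.7, model="m1")
print(a == b)  # True: same prompt after normalization, same options
print(a != c)  # True: a different temperature yields a different key
```

Keeping the options in the key matters because the same prompt at a different temperature or on a different model is a different request and should not reuse the cached answer.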

Semantic caching goes further: it returns a cached answer when a new query means the same thing as an old one, measured by cosine similarity between embedding vectors. If the closest stored vector scores at or above a threshold (say 0.9), it is a hit; otherwise it is a miss and you call the model. In a real app the vectors come from an embedding model; here we pass them in directly so the logic stays deterministic.

cosine(a, b) = dot(a, b) / (norm(a) * norm(b))
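
The formula and the threshold rule can be sketched in plain Python as follows. This is one possible shape under stated assumptions: it scans every stored entry linearly, treats a zero-length vector as matching nothing, and uses tiny hand-written vectors in place of real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


class SemanticCache:
    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (vector, answer) pairs

    def put(self, query_vec, answer):
        self.entries.append((list(query_vec), answer))

    def get(self, query_vec):
        # Linear scan for the most similar stored vector.
        best_score, best_answer = -1.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        # Hit only if the closest entry clears the threshold.
        return best_answer if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.9)
cache.put([1.0, 0.0], "Paris")
print(cache.get([0.95, 0.05]))  # Paris (similarity is about 0.999)
print(cache.get([0.0, 1.0]))    # None (orthogonal vector, similarity 0.0)
```

A linear scan is fine for a handful of entries; a larger cache would typically hand the nearest-neighbour search to a vector index instead.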

Build three things. (1) A class SemanticCache with put(query_vec, answer) and get(query_vec), where get returns the stored answer of the closest entry when its cosine similarity is >= self.threshold, else None. (2) A function cache_key(prompt, **opts) that normalizes (lowercase, collapse whitespace, sort the option keys) and returns a hex hashlib.sha256 digest, so two superficially different prompts map to the same key. (3) A function make_memoized(fn) that returns a callable which uses cache_key to cache results and exposes .hits and .misses counters.
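
For item (3), the counters can live as attributes on the returned function. The sketch below is one way to do it, not a prescribed solution: it repeats a compact cache_key so the block runs on its own, stores results in a plain dict with no eviction, and uses fake_model as a hypothetical stand-in for a real model call.

```python
import hashlib
import json
import re


def cache_key(prompt, **opts):
    # Normalization as described in the lesson; json.dumps is an assumption.
    norm = re.sub(r"\s+", " ", prompt.strip().lower())
    payload = norm + "\n" + json.dumps(opts, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def make_memoized(fn):
    """Wrap fn(prompt, **opts) with a hash cache and hit/miss counters."""
    store = {}  # key -> cached result; unbounded in this sketch

    def wrapper(prompt, **opts):
        key = cache_key(prompt, **opts)
        if key in store:
            wrapper.hits += 1
            return store[key]
        wrapper.misses += 1
        result = fn(prompt, **opts)
        store[key] = result
        return result

    wrapper.hits = 0
    wrapper.misses = 0
    return wrapper


def fake_model(prompt, **opts):
    # Hypothetical stand-in for a real model call.
    return "answer for: " + prompt.strip().lower()


ask = make_memoized(fake_model)
ask("Hello   World")     # first time this key is seen: miss
ask("  hello world ")    # normalizes to the same key: hit
print(ask.hits, ask.misses)  # 1 1
```

Note that a hit returns whatever the first caller's prompt produced, so normalization should only merge prompts whose answers are genuinely interchangeable.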

Your turn

Build (1) SemanticCache (a threshold in __init__, plus put(query_vec, answer) and get(query_vec) returning the closest answer when cosine similarity is >= threshold else None); (2) cache_key(prompt, **opts) hashing the lowercased, whitespace-collapsed prompt plus sorted options with hashlib.sha256; (3) make_memoized(fn) returning a callable with .hits and .misses. A near-duplicate vector must hit, a distant one must miss, and two differently-spaced prompts must share a key.
