Syllabus Lesson 222 of 239 · Project: Production AI Gateway
Project: Production AI Gateway

Token/Cost Accounting + Caching

Your gateway routes requests now, but every call to a real model costs money and time. The two biggest levers you control are cost accounting (so nobody is shocked by the bill) and caching (so you stop paying for the same answer twice). This lesson builds both, and the caching has a twist that makes it genuinely useful: it catches near-duplicate questions, not just byte-for-byte repeats.

Start with the meter. A cheap, deterministic token estimate is "about one token per four characters," and cost is just tokens times a per-token rate:

def estimate_cost(text, rate):
    tokens = len(text) // 4
    return tokens * rate

Now the cache. A naive cache keys on the exact string, so "Reset my password" and "reset my password" look like different questions and both pay full price. You will build a two-tier Cache class:

  • Hash tier: normalize the text (lowercase, collapse whitespace), hash it, and store the answer under that key. An exact repeat (after normalizing) is an instant hit.
  • Semantic tier: when the hash misses, compare the query's embedding to every stored embedding by cosine similarity. If the closest one is at or above a threshold (say 0.9), that counts as a hit too. This is how "password reset steps" hits a cached "how do I reset my password".

Both put and get take the text and a precomputed embedding (a small vector). get returns a (status, value) tuple where status is "hit" or "miss":

def get(self, text, embedding):
    key = self._key(text)
    if key in self.by_hash:            # tier 1: exact
        return ("hit", self.by_hash[key])
    best, best_sim = None, -1.0        # tier 2: semantic
    for vec, answer in self.entries:
        sim = cosine(embedding, vec)
        if sim > best_sim:
            best_sim, best = sim, answer
    if best is not None and best_sim >= self.threshold:
        return ("hit", best)
    return ("miss", None)

Cosine similarity is the dot product divided by the product of the norms; guard against a zero vector so you never divide by zero. In a real gateway those embeddings come from the model, but here you pass them in directly so the logic is testable offline. Press Run to price a prompt, hit the cache exactly, hit it again on a paraphrase, and watch an unrelated query miss.

Your turn

Write estimate_cost(text, rate) returning a token estimate (~1 token per 4 chars) times the per-token rate. Then build a two-tier Cache class with put(text, embedding, answer) and get(text, embedding) that returns a (status, value) tuple. get is a "hit" on an exact (normalized: lowercased, whitespace-collapsed) text match via a hash, OR when the nearest stored embedding has cosine similarity >= the threshold (a semantic hit); otherwise ("miss", None). A distinct query with low similarity must miss.

Spotted a problem in this lesson? Report it

Code · runs in your browser
Output