Lesson 175 of 239 · Productionizing LLMs: Cost, Caching & Guardrails

Context-Window Budgeter

Every model has a finite context window: the most tokens it can read in one call. A long chat history eventually overflows it, and the call either errors or silently truncates. The production fix is a budgeter that trims the conversation before you send it.

The standard policy is simple and battle-tested: always keep the system message (it carries the instructions and persona that must never be dropped), then drop the oldest ordinary turns one at a time until the running token estimate fits the budget. Recent turns matter most for coherence, so they are the last to go.

Messages look like the familiar chat format:

[
  {"role": "system",    "content": "You are a careful assistant."},
  {"role": "user",      "content": "What is the capital of France?"},
  {"role": "assistant", "content": "The capital of France is Paris."},
  {"role": "user",      "content": "And of Spain?"}
]

You already have a token estimator from the previous lesson (roughly one token per four characters); it is provided for you here as estimate_tokens. Apply it to the content of each kept message and sum the results to get the running total.
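To make the running total concrete, here is a minimal sketch applied to the sample conversation above. The body of estimate_tokens is an assumption (floor division of the character count by 4); the helper supplied by the course may round differently, which would shift the numbers slightly.

```python
def estimate_tokens(text):
    # Assumption: about 4 characters per token, rounded down.
    # The helper provided by the course may round differently.
    return len(text) // 4


messages = [
    {"role": "system",    "content": "You are a careful assistant."},
    {"role": "user",      "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user",      "content": "And of Spain?"},
]

# Estimate each message's content, then sum for the running total.
per_message = [estimate_tokens(m["content"]) for m in messages]
total = sum(per_message)

print(per_message, total)  # [7, 7, 7, 3] 24
```

Under this assumption the whole conversation costs 24 estimated tokens, so any budget of 24 or more keeps every message.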

Build fit_to_budget(messages, max_tokens). It returns a new list of messages, in the original order, such that: every system message is always kept; the total estimated tokens of the kept messages is <= max_tokens whenever that can be achieved by dropping non-system messages; and when you must drop, you drop the oldest non-system messages first. A generous budget keeps everything. Do not mutate the input list.
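The requirements above can be turned into checks that you run against your own result. The sketch below is a hypothetical helper: check_fit and total_tokens are names invented here, not part of the lesson, and estimate_tokens is an assumed stand-in for the provided helper. The last check (no more turns dropped than necessary) reflects one reading of "drop one at a time until it fits".

```python
def estimate_tokens(text):
    # Assumed stand-in for the lesson's helper: about 4 characters per token.
    return len(text) // 4


def total_tokens(msgs):
    return sum(estimate_tokens(m["content"]) for m in msgs)


def check_fit(original, kept, max_tokens):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    sys_orig = [m for m in original if m["role"] == "system"]
    sys_kept = [m for m in kept if m["role"] == "system"]
    turns_orig = [m for m in original if m["role"] != "system"]
    turns_kept = [m for m in kept if m["role"] != "system"]

    if sys_kept != sys_orig:
        problems.append("a system message was dropped or changed")

    # Original order: kept must be a subsequence of original.
    remaining = iter(original)
    if not all(m in remaining for m in kept):
        problems.append("kept messages are not in the original order")

    # Oldest-first dropping leaves exactly the most recent turns.
    if turns_kept != turns_orig[len(turns_orig) - len(turns_kept):]:
        problems.append("an older turn survived while a newer one was dropped")

    if total_tokens(kept) > max_tokens and turns_kept:
        problems.append("over budget although droppable turns remain")

    # Assumed reading of the spec: stop dropping as soon as the total fits.
    dropped = len(turns_orig) - len(turns_kept)
    if dropped > 0:
        last_dropped = turns_orig[dropped - 1]
        if total_tokens(kept) + estimate_tokens(last_dropped["content"]) <= max_tokens:
            problems.append("more turns were dropped than the budget required")

    return problems


messages = [
    {"role": "system",    "content": "You are a careful assistant."},
    {"role": "user",      "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user",      "content": "And of Spain?"},
]

# A hand-built result that satisfies the rules for a budget of 10.
print(check_fit(messages, [messages[0], messages[3]], 10))  # prints []
# Dropping the system message is reported as a problem.
print(check_fit(messages, messages[1:], 100))
```

Pass your own function's output as kept to see which requirement, if any, it violates.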

Your turn

Write fit_to_budget(messages, max_tokens) that returns the kept messages in original order. Keep every system message. Drop the oldest non-system messages one by one until the total estimated token count (summing estimate_tokens over each content) is <= max_tokens. A generous budget drops nothing; a tighter budget never keeps more messages than a looser one, and the most recent turns survive over older ones.
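Once you have attempted it yourself, here is one possible shape for a solution to compare against. It is a sketch under stated assumptions, not the lesson's reference answer: estimate_tokens is an assumed stand-in (floor of characters divided by 4), and when the system messages alone exceed the budget this version simply returns them.

```python
def estimate_tokens(text):
    # Assumed stand-in for the lesson's helper: about 4 characters per token.
    return len(text) // 4


def fit_to_budget(messages, max_tokens):
    kept = list(messages)  # shallow copy, so the caller's list is not mutated
    total = sum(estimate_tokens(m["content"]) for m in kept)

    while total > max_tokens:
        # Position of the oldest remaining non-system message, if any.
        idx = next(
            (i for i, m in enumerate(kept) if m["role"] != "system"),
            None,
        )
        if idx is None:
            break  # only system messages remain; they are never dropped
        total -= estimate_tokens(kept[idx]["content"])
        del kept[idx]

    return kept


messages = [
    {"role": "system",    "content": "You are a careful assistant."},
    {"role": "user",      "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user",      "content": "And of Spain?"},
]

for budget in (100, 17, 10, 5):
    kept = fit_to_budget(messages, budget)
    print(budget, [m["content"] for m in kept])
```

With this estimator the four messages cost 7, 7, 7 and 3 tokens. A budget of 100 keeps everything, 17 drops only the first user turn, 10 keeps the system message plus "And of Spain?", and 5 leaves just the system message because nothing else can be dropped.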
