Production Patterns: Retries, Async, Streaming & Memory

Multi-Turn Conversation Memory

A chatbot has no memory of its own: the model sees only what you send it on each call. To hold a conversation you replay the history, meaning the system message plus the back-and-forth turns so far. But the history grows without bound and the context window does not, so you need a memory object that stores the conversation and trims it to fit before each call.

The trimming policy is the one production systems tend to converge on. Always keep the system message, because it carries the instructions and persona that must never be lost. Then keep as many of the most-recent turns as fit the budget, dropping the oldest ordinary turns first, because recent context matters most for a coherent reply. Token counts use the familiar rough estimate of one token per four characters (chars / 4).
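To make the chars / 4 estimate concrete, here is a minimal sketch applied to the two example turns used in this lesson. The helper name estimate_tokens is hypothetical, and rounding down with integer division is an assumption: the lesson only says chars / 4, so check which rounding the grader expects.

```python
def estimate_tokens(text: str) -> int:
    """Rough token count: one token per four characters.

    Assumption: the result is rounded down (integer division).
    """
    return len(text) // 4


turns = ["Tell me about Rome.", "Rome is the capital of Italy."]

# 19 characters -> 4 tokens, 29 characters -> 7 tokens
costs = [estimate_tokens(t) for t in turns]
total = sum(costs)

print(costs, total)  # [4, 7] 11
```

Under this estimate both turns together cost 11 tokens, so they fit easily in a 64-token budget but not in a budget of 8.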

convo = Conversation(system="You are a careful assistant.")
convo.add("user", "Tell me about Rome.")
convo.add("assistant", "Rome is the capital of Italy.")
context = convo.trimmed_context(max_tokens=64)   # what you actually send

This is exactly the object that sits in front of any chat model, including an on-device WebLLM engine: you accumulate turns, then pass trimmed_context(...) as the messages for the next call. No model runs in this lesson. The trimming logic is deterministic, so it can be graded directly.

Build a Conversation class. The constructor takes an optional system message (pinned, never dropped). add(role, content) appends a turn. trimmed_context(max_tokens) returns the kept turns in chronological order: every system message, plus the most-recent ordinary turns whose estimated tokens (summed with the chars / 4 rule) fit within max_tokens, dropping oldest first. A short history under a generous budget is kept whole.
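One possible shape for this class is sketched below. It is not the reference solution, and it rests on several assumptions the lesson does not settle: turns are stored as {"role", "content"} dictionaries, chars / 4 rounds down, the budget is charged only for ordinary turns (pinned system messages are free), and trimming stops at the first recent turn that does not fit, even if that is the newest one.

```python
class Conversation:
    """Ordered chat history with a pinned system message (illustrative sketch)."""

    def __init__(self, system=None):
        self.turns = []
        if system is not None:
            self.add("system", system)

    def add(self, role, content):
        # Assumption: each turn is a {"role", "content"} dict.
        self.turns.append({"role": role, "content": content})

    @staticmethod
    def estimate_tokens(text):
        # Assumption: chars / 4, rounded down.
        return len(text) // 4

    def trimmed_context(self, max_tokens):
        kept = set()
        remaining = max_tokens
        budget_open = True
        # Walk newest to oldest so recent turns claim the budget first.
        for i in range(len(self.turns) - 1, -1, -1):
            turn = self.turns[i]
            if turn["role"] == "system":
                kept.add(i)  # pinned: always kept, not charged to the budget
                continue
            cost = self.estimate_tokens(turn["content"])
            if budget_open and cost <= remaining:
                kept.add(i)
                remaining -= cost
            else:
                # Once one turn fails to fit, every older ordinary turn is dropped.
                budget_open = False
        # Rebuild in the original, chronological order.
        return [t for i, t in enumerate(self.turns) if i in kept]


convo = Conversation(system="You are a careful assistant.")
convo.add("user", "Tell me about Rome.")
convo.add("assistant", "Rome is the capital of Italy.")

print([t["role"] for t in convo.trimmed_context(64)])  # ['system', 'user', 'assistant']
print([t["role"] for t in convo.trimmed_context(8)])   # ['system', 'assistant']
```

With a budget of 8, the newest turn costs 7 tokens and leaves 1, so the 4-token user turn is dropped while the system message survives. If your grader charges the system message against the budget, or always keeps the newest turn, adjust the loop accordingly.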

Your turn

Build a Conversation class holding turns in order: add(role, content) appends a turn, and trimmed_context(max_tokens) keeps the system message(s) plus the most-recent turns that fit a chars/4 token budget (dropping oldest first), returned in chronological order. A short history is kept whole; a tight budget drops the oldest but keeps the system message and the newest turn; the order stays chronological; the token estimate is correct.
