Production Patterns: Retries, Async, Streaming & Memory

Streaming Token Accumulation

Modern language models stream their reply: instead of one big response, the runtime hands you a sequence of small token chunks as they are generated. That is why chat UIs show text appearing word by word. On the receiving end, your job is to accumulate those chunks into the full answer, act on each one as it arrives (update the UI, count tokens, watch for a stop sequence), and know when to stop.

Two things make streaming more than a plain join. First, you usually pass a callback that fires once per chunk; that is how the live typing effect works. Second, you often have a stop sequence: a marker meaning "the real answer ends here, drop the rest". When you see it, you stop immediately and leave the stop text out of the output.

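chunks = ["The", " answer", " is", " 42", "<END>", " ignored"]   # example stream
stop = "<END>"                                                   # example stop sequence

def on_token(chunk):
    print(chunk, end="", flush=True)   # live "typing" output

text = ""
for chunk in chunks:
    if chunk == stop:        # marker reached -> stop, do NOT append it
        break
    text += chunk
    on_token(chunk)          # fire the per-chunk callback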
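The callback does not have to be a bare function. Here is a minimal sketch of a stateful callback that both renders each chunk as it arrives (the typing effect) and counts tokens. `LiveRenderer` is a hypothetical name, and the chunk list and `<END>` marker are made up for illustration.

```python
import sys


class LiveRenderer:
    """Hypothetical per-chunk callback: echoes each chunk and counts them."""

    def __init__(self, out=None):
        # Resolve stdout at construction time, not at class-definition time.
        self.out = out if out is not None else sys.stdout
        self.count = 0

    def __call__(self, chunk):
        self.count += 1
        self.out.write(chunk)   # no newline, so the text "types" out in place
        self.out.flush()        # push the chunk to the screen immediately


chunks = ["The", " answer", " is", " 42", "<END>", " ignored"]
stop = "<END>"
on_token = LiveRenderer()

text = ""
for chunk in chunks:
    if chunk == stop:        # marker reached -> stop, do NOT append it
        break
    text += chunk
    on_token(chunk)          # fire the per-chunk callback

print()                      # finish the live line
print(on_token.count)        # 4 chunks were rendered
```

Because the callback is an object, whatever it tracks (here a count) is still available after the stream ends.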

A real engine yields these chunks from a network stream; an on-device WebLLM engine yields them from engine.chat.completions.create({ stream: true }), where each chunk is an OpenAI-style object whose text sits in choices[0].delta.content. Here the chunks are simply a list of strings you are handed, so every run is deterministic and the exact accumulated result can be graded. The accumulation, the per-chunk callback, and the stop handling are the same as in production code.
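Because the loop only iterates, it works unchanged over a lazy source such as a generator, which behaves more like a network stream than a list does. One consequence is worth seeing: `break` also stops pulling chunks from the source, so nothing after the stop marker is ever produced. This is a minimal sketch in which `fake_stream` is a hypothetical stand-in for a real engine.

```python
import time


def fake_stream(pieces, delay=0.0, log=None):
    """Hypothetical stand-in for a network stream: yields chunks one at a time."""
    for piece in pieces:
        if log is not None:
            log.append(piece)    # record every chunk the source actually produced
        time.sleep(delay)        # simulate generation latency
        yield piece


produced = []
stop = "<END>"
text = ""
for chunk in fake_stream(["Hi", " there", "<END>", " never", " sent"], log=produced):
    if chunk == stop:
        break                    # leaving the loop stops pulling from the generator
    text += chunk

print(repr(text))   # 'Hi there'
print(produced)     # ['Hi', ' there', '<END>']
```

The source produced the stop marker but never the two chunks after it, which is why stopping early also saves work in a real stream.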

Build consume_stream(chunks, on_token, stop=None). Walk the chunks, appending each to a running string and calling on_token(chunk) for it. If stop is given and a chunk equals it, stop before appending it or passing it to on_token (the stop text is excluded), and record that you stopped. Return {"text": full_text, "n_tokens": chunks_consumed, "stopped": bool}. If stop is None or never appears, every chunk is consumed and stopped is False.
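One possible sketch of consume_stream, following the spec above; it is not the only correct solution. It collects pieces in a list and joins once at the end, which is the usual Python idiom for building a string from many parts and yields the same text as the `+=` loop.

```python
def consume_stream(chunks, on_token, stop=None):
    """Accumulate streamed chunks, firing on_token once per consumed chunk.

    Stops at the first chunk equal to `stop` (if given); that chunk is
    neither appended to the text nor passed to on_token.
    """
    parts = []          # consumed chunks, joined once at the end
    stopped = False
    for chunk in chunks:
        if stop is not None and chunk == stop:
            stopped = True
            break
        parts.append(chunk)
        on_token(chunk)
    return {"text": "".join(parts), "n_tokens": len(parts), "stopped": stopped}


seen = []
result = consume_stream(["a", "b", "STOP", "c"], seen.append, stop="STOP")
print(result)  # {'text': 'ab', 'n_tokens': 2, 'stopped': True}
print(seen)    # ['a', 'b']
```

Passing `seen.append` as the callback is a quick way to confirm that on_token fired exactly once per consumed chunk and never for the stop text.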

Your turn

Write consume_stream(chunks, on_token, stop=None) that accumulates streamed token chunks into the full text, calls on_token(chunk) per consumed chunk, stops at a stop sequence if present (excluding the stop text itself), and returns {"text", "n_tokens", "stopped"}. The accumulated text equals "".join of the chunks up to the stop; the stop is excluded; n_tokens counts consumed chunks; with no stop, all chunks are consumed.
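The properties listed above can also be stated without a loop, which makes a handy independent check for a solution. In this sketch, `expected_result` is a hypothetical helper name. It finds the first occurrence of the stop chunk with `list.index` and slices everything before it.

```python
def expected_result(chunks, stop=None):
    """Hypothetical oracle: what consume_stream should return, computed by slicing."""
    chunks = list(chunks)
    if stop is not None and stop in chunks:
        consumed = chunks[: chunks.index(stop)]   # everything before the first stop
        stopped = True
    else:
        consumed = chunks                          # no stop seen: take every chunk
        stopped = False
    return {"text": "".join(consumed), "n_tokens": len(consumed), "stopped": stopped}


print(expected_result(["a", "b", "STOP", "c"], stop="STOP"))
# {'text': 'ab', 'n_tokens': 2, 'stopped': True}
print(expected_result(["a", "b"], stop="STOP"))
# {'text': 'ab', 'n_tokens': 2, 'stopped': False}
```

Comparing a loop-based solution against this oracle on a handful of chunk lists (with and without the stop, and with the stop first or last) exercises each property in the task.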
