Production Patterns: Retries, Async, Streaming & Memory

Retry With Exponential Backoff

The first thing the real world teaches an LLM app is that calls fail. A rate limit, a dropped connection, a 503 from the provider: none of these means your code is wrong. They mean you should wait a moment and try again. The standard pattern is retry with exponential backoff: after each failure you wait longer before the next attempt, so a brief hiccup recovers quickly while a struggling service is not hammered.

The delay between retries doubles each time. If the base delay is 0.5 seconds, you wait 0.5, then 1.0, then 2.0, and so on. The formula is base_delay * 2 ** attempt, where attempt is the zero-based retry number (attempt 0 gives the wait before the first retry). After a fixed number of attempts you give up and let the error propagate, because retrying forever just hides a real outage.
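As a quick check of the arithmetic, here is a minimal sketch that lists the delays the formula produces. The helper name backoff_delays is hypothetical, introduced only for illustration.

```python
def backoff_delays(base_delay, retries):
    """Return the delay before each retry: base_delay * 2 ** attempt."""
    return [base_delay * 2 ** attempt for attempt in range(retries)]


print(backoff_delays(0.5, 4))  # [0.5, 1.0, 2.0, 4.0]
```

Each entry is double the previous one, so the total wait grows quickly. That is one reason to cap the number of attempts.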

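import time

def call_with_retry(call_the_model, max_attempts=3, base_delay=0.5):  # illustrative wrapper
    for attempt in range(max_attempts):
        try:
            return call_the_model()
        except Exception:
            # Last attempt? Re-raise. Otherwise wait base_delay * 2 ** attempt.
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)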

In production the wait is a real time.sleep(delay), and with an on-device WebLLM engine you would wrap the same retry loop around engine.chat.completions.create(...). But sleeping makes tests slow and flaky, so here you do not sleep at all: you record the delays you would have waited and return them. That keeps the exercise deterministic and instantly testable, while the logic (count attempts, compute the backoff, give up at the limit) is exactly what ships.

One real-world refinement to know: with doubling alone, every client that failed at the same moment retries in lockstep, all waking up together and slamming the service again (a "thundering herd"). Production code adds jitter, a little randomness that spreads the retries out, commonly by sleeping a random amount up to the computed delay ("full jitter", e.g. time.sleep(random.uniform(0, delay))). We keep the graded function deterministic and record the exact base_delay * 2 ** attempt values; jitter is the production refinement you would layer on top, not part of the exercise.
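To make the jitter idea concrete, here is a minimal sketch of full jitter. The function name full_jitter_delay and the rng parameter are assumptions made for illustration; the random generator is seeded only so the demo repeats, which production code would not do.

```python
import random


def full_jitter_delay(base_delay, attempt, rng=random):
    """Pick a random wait between 0 and the exponential cap for this attempt."""
    cap = base_delay * 2 ** attempt
    return rng.uniform(0, cap)


rng = random.Random(42)  # seeded only so the demo is repeatable
for attempt in range(4):
    cap = 0.5 * 2 ** attempt
    wait = full_jitter_delay(0.5, attempt, rng)
    print(f"attempt {attempt}: cap={cap:.1f}s, would sleep {wait:.3f}s")
```

The cap still doubles on every attempt, but two clients that failed together will almost never pick the same wait, so their retries spread out.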

Build retry_call(fn, max_attempts, base_delay). Call fn(). If it returns, hand back a dict {"value": result, "attempts": how_many_calls, "delays": [...]}. If it raises, record the backoff delay base_delay * 2 ** attempt (do not sleep) and try again, up to max_attempts total calls. If the final attempt still raises, let that exception propagate. A callable that never fails is called exactly once and comes back with an empty delays list.
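To check an implementation against this contract, a callable that fails a set number of times and then succeeds is handy. The sketch below is one way to build it. The names make_flaky and expected_result are hypothetical helpers, not part of the lesson, and retry_call itself is deliberately left for you to write.

```python
def make_flaky(failures, value):
    """Build a callable that raises `failures` times, then returns `value`."""
    state = {"calls": 0}

    def fn():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError(f"simulated failure {state['calls']}")
        return value

    fn.state = state  # expose the call counter for assertions
    return fn


def expected_result(failures, value, base_delay):
    """The dict the contract implies for a fn that fails `failures` times."""
    return {
        "value": value,
        "attempts": failures + 1,
        "delays": [base_delay * 2 ** attempt for attempt in range(failures)],
    }


print(expected_result(2, "ok", 0.5))
# {'value': 'ok', 'attempts': 3, 'delays': [0.5, 1.0]}
```

With your own retry_call in scope, retry_call(make_flaky(2, "ok"), 3, 0.5) should equal expected_result(2, "ok", 0.5), as long as max_attempts is greater than the number of failures.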

Your turn

Write retry_call(fn, max_attempts, base_delay) that calls fn(), retries on any exception, and records the exponential backoff delay base_delay * 2 ** attempt before each retry WITHOUT sleeping. Return a dict with the keys "value", "attempts", and "delays" on success; re-raise once max_attempts calls have all failed. The checks: a fn that fails N-1 times and then succeeds returns its value; the recorded delays equal the reference sequence base_delay * 2 ** attempt; a never-failing fn is called exactly once.
