Production Patterns: Retries, Async, Streaming & Memory

Concurrent Batch Calls (bounded fan-out)

When you need answers for a thousand prompts, calling the model one at a time and waiting for each reply is painfully slow. Real applications send many requests concurrently, but not without limit: providers typically cap how many calls may be in flight at once, so you process the work in batches of a fixed size. Two facts make concurrency tricky, and both show up here.

Results come back out of order. A short prompt may finish before a long one you sent first. Your code must reassemble the answers by their original input position, not by whoever finished first.
Throughput is bounded by the cap. With n prompts and a cap of c, you run ceil(n / c) batches, the number every capacity plan is built around. For example, 10 prompts with a cap of 4 take ceil(10 / 4) = 3 batches.
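The two facts above can be sketched in a few lines. This is a minimal illustration, not part of the graded exercise: batch_count and the arrived list are hypothetical names and data invented here to show the arithmetic and the reassembly step.

```python
import math

def batch_count(n, cap):
    """Number of fixed-size batches needed for n prompts with a cap of `cap`."""
    return math.ceil(n / cap)

# Hypothetical completions, arriving as (original_index, answer) pairs
# in an order that does not match the order they were sent in.
arrived = [(2, "c"), (0, "a"), (1, "b")]

# Reassemble by original input position, not by arrival order.
ordered = [None] * len(arrived)
for i, answer in arrived:
    ordered[i] = answer

print(batch_count(10, 4))  # 3
print(ordered)             # ['a', 'b', 'c']
```

Carrying the original index alongside each answer is what makes the reassembly independent of arrival order.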

A production version would use asyncio and a real network client. Its timing is non-deterministic, which makes it hard to test, so here you model the same shape with plain functions and a scheduling loop: no event loop, no network, no real model. For reference, here is the same logic with a live event loop:

import asyncio

async def run_batch(prompts, worker, cap=4):
    sem = asyncio.Semaphore(cap)            # the concurrency cap
    async def one(i):
        async with sem:                     # at most cap in flight
            return i, await worker(prompts[i])
    pairs = await asyncio.gather(*(one(i) for i in range(len(prompts))))
    results = [None] * len(prompts)
    for i, r in pairs:                       # reassemble by ORIGINAL index
        results[i] = r["result"]
    return results

A Semaphore enforces the cap, gather runs the calls together, and you still write each answer back into its original slot. That needs a running event loop and a real async client, which this in-browser grader does not have, so below you build the same logic with a deterministic scheduler you can actually test. A worker(prompt) stands in for the model call and returns {"result": ..., "cost": ...}, where cost represents how long that item would take. You deliberately work through each batch in an order driven by cost rather than by input position, to prove that your result assembly does not secretly depend on completion order.
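To make the worker contract concrete, here is a minimal sketch of a mocked worker. fake_worker is a hypothetical name, and modelling cost as prompt length is an assumption made purely for illustration; the lesson's grader may define cost differently.

```python
def fake_worker(prompt):
    # Hypothetical stand-in for a model call: no network, fully deterministic.
    # Assumption: cost is modelled as the prompt length, for illustration only.
    return {"result": prompt.upper(), "cost": len(prompt)}

prompts = ["a long prompt", "hi", "medium one"]
outputs = [fake_worker(p) for p in prompts]

# Simulated completion order: the cheapest item "finishes" first.
finish_order = sorted(range(len(prompts)), key=lambda i: outputs[i]["cost"])
print(finish_order)  # [1, 2, 0]

# Writing each answer into its original slot restores input order.
results = [None] * len(prompts)
for i in finish_order:
    results[i] = outputs[i]["result"]
print(results)  # ['A LONG PROMPT', 'HI', 'MEDIUM ONE']
```

The completion order here differs from the input order, yet results still lines up with prompts because each write is keyed by the original index.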

for start in range(0, len(prompts), cap):
    window = range(start, min(start + cap, len(prompts)))
    # work the window in ANY order, but write results[i] = worker(prompts[i])["result"]
    ...

Build run_batch(prompts, worker, cap=4). Map worker over every prompt in batches of size cap, and return {"results": [...], "batches": n_batches} where results[i] is the worker result for prompts[i] (input order preserved) and batches is ceil(len(prompts) / cap). An empty prompt list runs zero batches.
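One possible shape for this function is sketched below, under stated assumptions: echo_worker is a hypothetical mock invented here, and processing each window cheapest-first is just one way to exercise the ordering requirement. Treat it as a reference sketch rather than the grader's expected solution.

```python
import math

def echo_worker(prompt):
    # Hypothetical mock worker; cost is assumed to be the prompt length.
    return {"result": prompt.upper(), "cost": len(prompt)}

def run_batch(prompts, worker, cap=4):
    results = [None] * len(prompts)
    n_batches = 0
    for start in range(0, len(prompts), cap):
        window = range(start, min(start + cap, len(prompts)))
        # Call the worker once per item in this window.
        outputs = {i: worker(prompts[i]) for i in window}
        # Simulate completion order (cheapest first), but always write
        # each answer into the slot of its ORIGINAL index.
        for i in sorted(window, key=lambda j: outputs[j]["cost"]):
            results[i] = outputs[i]["result"]
        n_batches += 1
    return {"results": results, "batches": n_batches}

print(run_batch(["bbb", "a", "cc"], echo_worker, cap=2))
# {'results': ['BBB', 'A', 'CC'], 'batches': 2}
```

Counting loop iterations gives the same value as ceil(len(prompts) / cap), and an empty list never enters the loop, so it reports zero batches.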

Your turn

Write run_batch(prompts, worker, cap=4) that maps a (mocked) worker over every prompt in batches of cap and returns a dict with the keys "results" and "batches". results must be in INPUT order, even though each item carries a different cost and you may process a batch in cost order; batches is ceil(len(prompts) / cap). Use plain functions and a scheduling loop: no asyncio, no network. A second, distinct input set must give different results; an empty list runs zero batches.
