Lesson 194 of 239 · The ML Around the LLM

Cost-Aware Cascade

This is the punchline of the whole module. A production LLM call is slow and costs real money; a fitted classifier costs almost nothing. So you cascade: let the cheap model handle the confident majority of traffic, and escalate only the uncertain tail to the expensive LLM. Done right, you serve the same quality for a fraction of the bill.

The decision is the thresholding you already built. For each request you have the cheap classifier's probability row: if its top probability clears a confidence threshold, answer cheaply; otherwise escalate. The engineering question your boss actually asks is "how much does this save?", so you also count the escalations and the cost.

row = [0.9, 0.05, 0.05]   # confident  -> handle cheaply
row = [0.4, 0.35, 0.25]   # uncertain  -> escalate to the LLM
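
The per-request decision can be sketched as a one-line rule. The helper name route and the threshold of 0.8 are assumptions chosen for illustration; the lesson does not fix either.

```python
def route(row, threshold):
    """Return "cheap" if the top probability clears the threshold, else "escalate"."""
    # A row whose top probability equals the threshold counts as confident (>=).
    return "cheap" if max(row) >= threshold else "escalate"


# threshold=0.8 is an illustrative assumption, not a value from the lesson.
print(route([0.9, 0.05, 0.05], 0.8))   # cheap
print(route([0.4, 0.35, 0.25], 0.8))   # escalate
```

Raising the threshold escalates more traffic (higher cost, fewer cheap mistakes); lowering it does the opposite.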

Build cascade(prob_rows, threshold, cheap_cost=1, llm_cost=25). For each probability row, if max(row) >= threshold it is handled cheaply (it costs cheap_cost); otherwise it is escalated (it costs llm_cost). Return {"escalated", "escalation_rate", "total_cost", "all_llm_cost", "saved"} where all_llm_cost is what sending every request to the LLM would have cost, and saved = all_llm_cost - total_cost. An empty input escalates nothing and costs nothing. Press Run to price a batch of traffic.

Your turn

Write cascade(prob_rows, threshold, cheap_cost=1, llm_cost=25). For each row, if max(row) >= threshold it is handled cheaply (cost cheap_cost); otherwise it is escalated (cost llm_cost). Return a dict with the keys {"escalated", "escalation_rate", "total_cost", "all_llm_cost", "saved"}: escalated is the number of rows escalated, escalation_rate is escalated / len(prob_rows) (0.0 for an empty input), total_cost is the summed per-request cost, all_llm_cost = len(prob_rows) * llm_cost, and saved = all_llm_cost - total_cost.
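
If you want something to check your attempt against, here is one possible sketch of the function as specified above. It is not the only valid solution; the sample batch and the threshold of 0.8 are made-up values for illustration.

```python
def cascade(prob_rows, threshold, cheap_cost=1, llm_cost=25):
    """Price a batch of requests under a confidence-threshold cascade."""
    n = len(prob_rows)
    # A row is escalated when its top probability falls below the threshold.
    escalated = sum(1 for row in prob_rows if max(row) < threshold)
    handled_cheaply = n - escalated
    total_cost = handled_cheaply * cheap_cost + escalated * llm_cost
    all_llm_cost = n * llm_cost  # baseline: every request goes to the LLM
    return {
        "escalated": escalated,
        "escalation_rate": escalated / n if n else 0.0,  # avoid dividing by zero
        "total_cost": total_cost,
        "all_llm_cost": all_llm_cost,
        "saved": all_llm_cost - total_cost,
    }


# Hypothetical batch: two confident rows, two uncertain rows.
batch = [
    [0.9, 0.05, 0.05],
    [0.4, 0.35, 0.25],
    [0.85, 0.1, 0.05],
    [0.5, 0.3, 0.2],
]
print(cascade(batch, threshold=0.8))
# {'escalated': 2, 'escalation_rate': 0.5, 'total_cost': 52, 'all_llm_cost': 100, 'saved': 48}
print(cascade([], threshold=0.8))
# {'escalated': 0, 'escalation_rate': 0.0, 'total_cost': 0, 'all_llm_cost': 0, 'saved': 0}
```

Note that under this lesson's accounting an escalated request is billed llm_cost only; the classifier call that preceded it is not added on top.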

