Cost-Aware Cascade
This is the punchline of the whole module. A production LLM call is slow and costs real money; a fitted classifier costs almost nothing. So you cascade: let the cheap model handle the confident majority of traffic, and escalate only the uncertain tail to the expensive LLM. Done right, you serve the same quality for a fraction of the bill.
The decision is the thresholding you already built: for each request you have the cheap classifier's probability row. If its top probability clears a confidence threshold, answer cheaply; otherwise escalate. The engineering question your boss actually asks is "how much does this save?", so you also count the escalations and the cost.
row = [0.9, 0.05, 0.05] # confident -> handle cheaply
row = [0.4, 0.35, 0.25] # uncertain -> escalate to the LLM
Build cascade(prob_rows, threshold, cheap_cost=1, llm_cost=25). For each probability row, if max(row) >= threshold it is handled cheaply (it costs cheap_cost); otherwise it is escalated (it costs llm_cost). Return {"escalated", "escalation_rate", "total_cost", "all_llm_cost", "saved"} where all_llm_cost is what sending every request to the LLM would have cost, and saved = all_llm_cost - total_cost. An empty input escalates nothing and costs nothing. Press Run to price a batch of traffic.
Write cascade(prob_rows, threshold, cheap_cost=1, llm_cost=25). For each row, if max(row) >= threshold it is handled cheaply (cost cheap_cost), else it is escalated (cost llm_cost). Return {"escalated", "escalation_rate", "total_cost", "all_llm_cost", "saved"}: escalated is the number of rows escalated, escalation_rate is escalated / len(prob_rows) (0.0 for empty input), total_cost is the summed per-request cost, all_llm_cost = len(prob_rows) * llm_cost, and saved = all_llm_cost - total_cost.
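Here is a minimal sketch of one way to meet that spec, not the only valid solution. It assumes every row is a non-empty list of probabilities. The four-row batch and the 0.6 threshold are made up for illustration.

```python
def cascade(prob_rows, threshold, cheap_cost=1, llm_cost=25):
    """Price a batch: confident rows stay cheap, uncertain rows go to the LLM."""
    n = len(prob_rows)

    # A row is escalated when its top probability falls below the threshold.
    escalated = sum(1 for row in prob_rows if max(row) < threshold)
    handled_cheaply = n - escalated

    # Per the spec, an escalated request costs llm_cost (not cheap_cost + llm_cost).
    total_cost = handled_cheaply * cheap_cost + escalated * llm_cost
    all_llm_cost = n * llm_cost

    return {
        "escalated": escalated,
        # Guard the division so an empty batch reports 0.0 instead of raising.
        "escalation_rate": escalated / n if n else 0.0,
        "total_cost": total_cost,
        "all_llm_cost": all_llm_cost,
        "saved": all_llm_cost - total_cost,
    }


# Illustrative batch (hypothetical values): two confident rows, two uncertain.
batch = [
    [0.9, 0.05, 0.05],
    [0.4, 0.35, 0.25],
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
]
result = cascade(batch, threshold=0.6)
print(result)
```

With these numbers, two of the four rows clear 0.6, so two are escalated: the cascade costs 2 * 1 + 2 * 25 = 52 against 4 * 25 = 100 for sending everything to the LLM, a saving of 48. Raising the threshold escalates more rows and shrinks the saving; lowering it does the opposite, at the risk of answering uncertain requests cheaply.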