Project: Tool-Using Research Agent

Planning + Loop Guards

An agent that can act can also act forever. The number one way a real agent burns money and pages an on-call engineer is a loop with no brakes: it keeps calling the same tool, getting the same observation, and never deciding it is done. This lesson adds the two things that make an agent safe to actually run: a tiny planner that breaks a goal into ordered subtasks, and a loop guard that halts the agent before it spins out.

First, planning. decompose(goal) turns a one-line goal into a numbered list of subtasks. We keep it deterministic and rule-based (in production an LLM would do this step, but the shape of the output is what matters here): split the goal on the separator " then " (the word "then" with a space on each side), strip each piece, and number the pieces from 1.

decompose("find the team size then multiply by 2 then report")
# ["1. find the team size", "2. multiply by 2", "3. report"]
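The rule above can be sketched in a few lines. This is one minimal version, not the reference solution. It assumes the separator is exactly " then " with single spaces, so a word that merely contains "then" is left alone.

```python
def decompose(goal):
    # Split on " then " (with spaces) so words containing "then" stay intact.
    parts = [part.strip() for part in goal.split(" then ")]
    # Number the subtasks starting from 1.
    return [f"{i}. {part}" for i, part in enumerate(parts, start=1)]


print(decompose("find the team size then multiply by 2 then report"))
# ['1. find the team size', '2. multiply by 2', '3. report']
```

A goal with no " then " in it comes back as a single subtask numbered 1.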

Now the guard. run_with_guard(policy, state, max_steps) drives a policy step by step, but it refuses to run away. A policy here is a function policy(state) -> action (a string naming the action it wants to take); applying an action just appends it to a history list. The guard stops for one of three reasons and reports which:

  • finished: the policy returns the action "done" -> halt with reason "done".
  • no progress: the policy returns the same action twice in a row -> halt with reason "repeat". This is the classic stuck-in-a-loop signal: the agent is not learning from its observations.
  • budget: it reaches max_steps without finishing -> halt with reason "max_steps".

def run_with_guard(policy, state, max_steps):
    history = []  # actions accepted so far
    for _ in range(max_steps):
        action = policy(state)
        if action == "done":
            # finished: the policy decided it is done
            return {"history": history, "reason": "done"}
        if history and action == history[-1]:
            # no progress: same action as the previous step
            return {"history": history, "reason": "repeat"}
        history.append(action)
        # let the policy advance its own state here
        ...
    # budget: the loop ran max_steps times without finishing
    return {"history": history, "reason": "max_steps"}
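To see the three halt reasons side by side, here is a small sketch with three hypothetical policies. The names good_policy, looping_policy, and runaway_policy, and the use of a mutable dict with a "step" counter as state, are assumptions made for illustration; the lesson does not prescribe how state is represented. Each policy advances its own state.

```python
def run_with_guard(policy, state, max_steps):
    history = []
    for _ in range(max_steps):
        action = policy(state)
        if action == "done":
            return {"history": history, "reason": "done"}
        if history and action == history[-1]:
            return {"history": history, "reason": "repeat"}
        history.append(action)
    return {"history": history, "reason": "max_steps"}


def good_policy(state):
    # Hypothetical: walks a fixed three-step plan, then says "done".
    plan = ["search", "read", "summarize"]
    step = state["step"]
    state["step"] += 1  # assumption: the policy advances its own state
    return plan[step] if step < len(plan) else "done"


def looping_policy(state):
    # Hypothetical: stuck, always asks for the same action.
    return "search"


def runaway_policy(state):
    # Hypothetical: alternates two actions, so it never repeats back to back
    # and never says "done".
    state["step"] += 1
    return "search" if state["step"] % 2 else "read"


print(run_with_guard(good_policy, {"step": 0}, 10))
# {'history': ['search', 'read', 'summarize'], 'reason': 'done'}
print(run_with_guard(looping_policy, {"step": 0}, 10))
# {'history': ['search'], 'reason': 'repeat'}
print(run_with_guard(runaway_policy, {"step": 0}, 4))
# {'history': ['search', 'read', 'search', 'read'], 'reason': 'max_steps'}
```

The runaway case shows why both guards are needed: a policy that alternates between two actions never trips the "repeat" check, so only the step budget stops it.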

Return a dict {"history": [...], "reason": ...}. The "repeat" check is what saves you: a buggy or confused policy that keeps emitting "search" forever gets stopped on the second identical action, with a reason you can log and alert on. A well-behaved policy that makes progress and then says "done" finishes cleanly inside its budget. The grader checks that a looping policy halts with "repeat", that a runaway policy hits "max_steps", that a good policy finishes with "done", and that decompose numbers its subtasks correctly. Press Run to see all three outcomes.
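Because the guard returns its halt reason as plain data, acting on it takes very little code. The sketch below uses Python's standard logging module; the helper name report_halt, the logger name "agent.guard", and the choice to treat every reason other than "done" as alert-worthy are all assumptions for illustration, not part of the lesson's grader.

```python
import logging

logger = logging.getLogger("agent.guard")  # hypothetical logger name


def report_halt(result):
    """Log a guard result; return True when the run should raise an alert."""
    reason = result["reason"]
    steps = len(result["history"])
    if reason == "done":
        logger.info("agent finished after %d steps", steps)
        return False
    # "repeat" and "max_steps" both mean the guard, not the policy, ended the run.
    logger.warning("agent halted by guard: reason=%s steps=%d", reason, steps)
    return True
```

A caller could pass the dict from run_with_guard straight into report_halt and page someone only when it returns True.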

Your turn

Write decompose(goal): split goal on " then ", strip each part, and return them as "1. part", "2. part", ... Then write run_with_guard(policy, state, max_steps) where policy(state) returns an action string. Drive the policy up to max_steps times and return {"history": [...], "reason": ...}: stop with reason "done" when the action is "done"; stop with reason "repeat" when an action equals the immediately previous one (no-progress); stop with reason "max_steps" if the budget runs out. Append each accepted action to history.
