Syllabus Lesson 228 of 239 · Project: Tool-Using Research Agent

Evaluate the Agent Trace

The last piece of the project is the one that gets skipped most and matters most on a resume: evaluation. Anyone can wire up an agent that works in a demo. The engineer who gets hired is the one who can prove it works, catch when it silently does the wrong thing, and fail gracefully when it cannot answer. You are going to score the trace your agent produced and add a fallback for when it gives up.

Recall the trace from lesson 2: an ordered list of (thought, tool, args, observation) tuples. To grade a run you compare what the agent actually did against what it was supposed to do.
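
To make the shape concrete, here is a minimal sketch of what such a trace might look like. The tool names (search, calculator), the argument dicts, and the observations are invented for illustration; your lesson 2 trace will have its own.

```python
# A hypothetical two-step trace. Tool names, args, and observations are
# illustrative assumptions, not the output of a real agent run.
trace = [
    ("I need the speed of light.", "search",
     {"query": "speed of light in m/s"}, "299792458"),
    ("Convert it to km/s.", "calculator",
     {"expression": "299792458 / 1000"}, "299792.458"),
]

# Each step unpacks into the four fields, in order.
for thought, tool, args, observation in trace:
    print(f"{tool}({args}) -> {observation}")

tools_used = [step[1] for step in trace]  # tool name is field index 1
final_answer = trace[-1][3]               # observation is field index 3
```

The two things a grader needs are both plain list operations: the sequence of tool names, and the observation of the last step.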

Write score_run(trace, expected_tools, expected_answer) returning a dict with three fields:

  • "tools_ok": did the agent use the right tools in the right order? Pull the tool name out of each step and compare the sequence to expected_tools. A wrong, missing, extra, or out-of-order tool makes this False. This is how you catch an agent that reached the right answer by the wrong route, which is likely to break on the next input.
  • "answer_ok": did the final step produce the right answer? The final answer is the observation of the last step. Compare it to expected_answer. A run that never produced an answer (empty trace) has answer_ok = False.
  • "steps": how many steps the run took (len(trace)). This is your cost/latency signal.

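def score_run(trace, expected_tools, expected_answer):
    # Tool name is the second field of every (thought, tool, args, observation) step.
    tools_used = [tool for (_thought, tool, _args, _obs) in trace]
    # The final answer is the last step's observation; an empty trace has none.
    final = trace[-1][3] if trace else None
    return {
        # Exact sequence match: same tools, same order, same count.
        "tools_ok": tools_used == list(expected_tools),
        # bool(trace) keeps an empty trace False even if expected_answer is None.
        "answer_ok": bool(trace) and final == expected_answer,
        "steps": len(trace),
    }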
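
Here is a sketch of score_run applied to three kinds of run: a correct one, one that took the wrong route, and one that gave up. The function is repeated so the snippet runs on its own; the traces and tool names are made up for illustration.

```python
def score_run(trace, expected_tools, expected_answer):
    # Same logic as the lesson's score_run, repeated so this snippet is standalone.
    tools_used = [tool for (_thought, tool, _args, _obs) in trace]
    final = trace[-1][3] if trace else None
    return {
        "tools_ok": tools_used == list(expected_tools),
        "answer_ok": bool(trace) and final == expected_answer,
        "steps": len(trace),
    }

# Hypothetical traces; tool names and values are illustrative assumptions.
good = [
    ("look it up", "search", {"q": "x"}, "raw"),
    ("compute", "calculator", {"expr": "6*7"}, "42"),
]
wrong_tool = [("guess", "calculator", {"expr": "6*7"}, "42")]
empty = []

expected_tools = ["search", "calculator"]
correct = score_run(good, expected_tools, "42")
shortcut = score_run(wrong_tool, expected_tools, "42")
gave_up = score_run(empty, expected_tools, "42")

print(correct)   # {'tools_ok': True, 'answer_ok': True, 'steps': 2}
print(shortcut)  # {'tools_ok': False, 'answer_ok': True, 'steps': 1}
print(gave_up)   # {'tools_ok': False, 'answer_ok': False, 'steps': 0}
```

The middle case is the interesting one: the answer matches, but tools_ok flags that the agent skipped the search step.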

Then the safety net. answer_or_fallback(trace, expected_answer) returns the agent's final answer when that answer matches expected_answer, and a fixed fallback string "unable to answer" when it does not (wrong answer, or empty trace). It looks only at the answer, not at which tools were used. In production this is what you return to the user instead of a hallucination or a stack trace.

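def answer_or_fallback(trace, expected_answer):
    # Tool order does not matter here, so pass an empty expected_tools
    # and read only the answer_ok field.
    result = score_run(trace, [], expected_answer)
    # answer_ok is already False for an empty trace, so trace[-1] is safe.
    if result["answer_ok"]:
        return trace[-1][3]
    return "unable to answer"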
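
A quick sketch of the fallback in use, with both functions repeated so it runs standalone. The FALLBACK constant and the example traces are assumptions added for illustration.

```python
FALLBACK = "unable to answer"  # hypothetical constant name for the fixed string

def score_run(trace, expected_tools, expected_answer):
    tools_used = [tool for (_thought, tool, _args, _obs) in trace]
    final = trace[-1][3] if trace else None
    return {
        "tools_ok": tools_used == list(expected_tools),
        "answer_ok": bool(trace) and final == expected_answer,
        "steps": len(trace),
    }

def answer_or_fallback(trace, expected_answer):
    # Only answer_ok is consulted; the empty expected_tools is a placeholder.
    result = score_run(trace, [], expected_answer)
    if trace and result["answer_ok"]:
        return trace[-1][3]
    return FALLBACK

# Illustrative traces (made-up tool names and values).
right = [("compute", "calculator", {"expr": "6*7"}, "42")]
wrong = [("compute", "calculator", {"expr": "6*8"}, "48")]

print(answer_or_fallback(right, "42"))  # 42
print(answer_or_fallback(wrong, "42"))  # unable to answer
print(answer_or_fallback([], "42"))     # unable to answer
```

Note that a run with the right answer but the wrong tools still returns its answer here, because the fallback checks only the final observation. If you want the stricter behaviour, you would need to pass the real expected_tools and check tools_ok as well.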

The grader feeds in a correct run (everything True), a run that used the wrong tool (tools_ok is False but the answer may still match), and a run with no answer (answer_ok is False), plus random traces compared to an inline reference so no constant scorer slips through. Get this green and the headline writes itself: built a tool-using ReAct agent with loop guards and trace eval. Press Run.
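
The grader's own code is not shown in this lesson, but the idea of comparing against a reference on random traces can be sketched as follows. Everything here (the helper names, the tool and answer pools, the trial count) is a hypothetical illustration, not the actual grader.

```python
import random

def reference_score(trace, expected_tools, expected_answer):
    # Reference implementation, matching the lesson's score_run.
    tools_used = [tool for (_thought, tool, _args, _obs) in trace]
    final = trace[-1][3] if trace else None
    return {
        "tools_ok": tools_used == list(expected_tools),
        "answer_ok": bool(trace) and final == expected_answer,
        "steps": len(trace),
    }

def constant_scorer(trace, expected_tools, expected_answer):
    # A cheating scorer that ignores its input entirely.
    return {"tools_ok": True, "answer_ok": True, "steps": 1}

def agrees_with_reference(candidate, trials=200, seed=0):
    # Compare the candidate to the reference on many random small traces.
    rng = random.Random(seed)
    tools = ["search", "calculator"]   # hypothetical tool pool
    answers = ["42", "7"]              # hypothetical answer pool
    for _ in range(trials):
        trace = [("thought", rng.choice(tools), {}, rng.choice(answers))
                 for _ in range(rng.randint(0, 4))]
        expected_tools = [rng.choice(tools) for _ in range(rng.randint(0, 4))]
        expected_answer = rng.choice(answers)
        got = candidate(trace, expected_tools, expected_answer)
        want = reference_score(trace, expected_tools, expected_answer)
        if got != want:
            return False
    return True

reference_passes = agrees_with_reference(reference_score)
constant_passes = agrees_with_reference(constant_scorer)
```

A scorer that returns the same dict every time cannot match the reference across traces of varying length and content, which is why hard-coding the three named cases is not enough.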

Your turn

Write score_run(trace, expected_tools, expected_answer) over a trace of (thought, tool, args, observation) tuples, returning {"tools_ok", "answer_ok", "steps"}: tools_ok is whether the sequence of tool names equals expected_tools; answer_ok is whether the last step's observation equals expected_answer (False for an empty trace); steps is len(trace). Then write answer_or_fallback(trace, expected_answer) returning the final observation when it equals expected_answer, else the string "unable to answer" (including for an empty trace).
