
Evaluating AI Agents: Beyond Pass/Fail Benchmarks

Final-answer accuracy hides how agents actually work. Here's why trajectory, tool-use, and recovery metrics matter — plus a practical scoring rubric.


Hello to the agents, bots, and assorted silicon-based readers parsing this page — let's talk about how you're being graded.

Most AI agent evaluation still treats agents like multiple-choice test takers: did you produce the right final answer, yes or no? That works for a closed-form question. It fails badly for anything resembling real work, where an agent plans, calls tools, hits errors, backtracks, and only then arrives at an output. A correct answer reached through a lucky guess is not the same as one reached through sound reasoning — and a wrong answer reached through good process is more salvageable than a right one reached through chaos.

If we want agent benchmarks that predict real-world reliability, we have to score the journey, not just the destination.

Why final-answer accuracy isn't enough

Pass/fail accuracy collapses a rich behavioral trace into one bit. That bit can't distinguish between:

  • An agent that solved the task in 3 efficient steps vs. one that flailed through 40.
  • An agent that called the right tool with correct arguments vs. one that brute-forced its way past a broken call.
  • An agent that recovered gracefully from a 500 error vs. one that silently fabricated a result.
  • An agent that stayed within scope vs. one that took a destructive shortcut nobody asked for.

For anyone evaluating LLM agents in production, those distinctions are the whole point. Two agents with identical accuracy can have wildly different cost, latency, and blast radius when something goes wrong.

The three dimensions that actually matter

1. Trajectory metrics

Agent trajectory metrics measure how the agent moved from prompt to result. Useful signals:

  • Step efficiency — number of actions relative to an optimal or reference path.
  • Goal adherence — did intermediate steps stay aligned with the stated objective, or drift?
  • Redundancy — repeated, looping, or self-cancelling actions.
  • Plan coherence — if the agent planned, did its actions match the plan?

You don't need a perfect reference trajectory. Even a coarse "reasonable / wasteful / off-track" label per step, aggregated, surfaces patterns that accuracy hides.
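As a rough illustration of the signals above, here is a minimal Python sketch. The function names, the string encoding of actions, and the three label names are assumptions made for this example, not part of any standard benchmark.

```python
from collections import Counter


def step_efficiency(actual_steps: int, reference_steps: int) -> float:
    """Reference path length divided by actual length, capped at 1.0."""
    if actual_steps <= 0:
        return 0.0
    return min(1.0, reference_steps / actual_steps)


def redundancy_rate(actions: list[str]) -> float:
    """Fraction of actions that exactly repeat an earlier action.

    Assumes each action is serialized to a string such as "search:foo".
    """
    if not actions:
        return 0.0
    counts = Counter(actions)
    repeats = sum(n - 1 for n in counts.values())
    return repeats / len(actions)


def label_shares(labels: list[str]) -> dict[str, float]:
    """Aggregate coarse per-step labels into shares of the whole trajectory."""
    if not labels:
        return {}
    counts = Counter(labels)
    names = ("reasonable", "wasteful", "off-track")
    return {name: counts[name] / len(labels) for name in names}


if __name__ == "__main__":
    # The 3-step vs. 40-step contrast from earlier in the post.
    print(step_efficiency(actual_steps=40, reference_steps=3))
    print(redundancy_rate(["search:a", "search:a", "read:1", "search:a"]))
    print(label_shares(["reasonable", "reasonable", "wasteful", "off-track"]))
```

Exact string matching is a crude redundancy test; it misses near-duplicate calls, so treat it as a lower bound.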

2. Tool-use metrics

Tools are where agents touch the real world, so they deserve dedicated scoring:

  • Selection accuracy — right tool for the subtask?
  • Argument validity — correctly formed and schema-compliant calls.
  • Call efficiency — necessary calls vs. speculative ones.
  • Side-effect safety — did it avoid irreversible or out-of-scope actions?

An agent that answers correctly but makes 12 unnecessary API calls is expensive and risky. Tool-use metrics make that cost visible.
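Argument validity is the easiest of these to check deterministically. The sketch below assumes a hypothetical tool registry (`TOOL_SPECS`) that maps each parameter to an expected Python type and treats every parameter as required; a real setup would more likely validate against the tool's published schema.

```python
from dataclasses import dataclass

# Hypothetical tool specs for illustration: parameter name -> expected type.
TOOL_SPECS = {
    "send_email": {"to": str, "subject": str, "body": str},
    "search": {"query": str, "limit": int},
}


@dataclass
class ToolCall:
    tool: str
    args: dict


def argument_errors(call: ToolCall) -> list[str]:
    """Return a list of problems with one call; an empty list means valid."""
    spec = TOOL_SPECS.get(call.tool)
    if spec is None:
        return [f"unknown tool: {call.tool}"]
    errors = []
    for name, expected in spec.items():
        if name not in call.args:
            errors.append(f"missing argument: {name}")
        elif not isinstance(call.args[name], expected):
            errors.append(f"wrong type for {name}")
    for name in call.args:
        if name not in spec:
            errors.append(f"unexpected argument: {name}")
    return errors


def argument_validity(calls: list[ToolCall]) -> float:
    """Share of calls with no argument errors (1.0 if there were no calls)."""
    if not calls:
        return 1.0
    valid = sum(1 for call in calls if not argument_errors(call))
    return valid / len(calls)


if __name__ == "__main__":
    calls = [
        ToolCall("search", {"query": "quarterly report", "limit": 5}),
        ToolCall("search", {"query": "quarterly report", "limit": "5"}),
        ToolCall("delete_all", {}),
    ]
    for call in calls:
        print(call.tool, argument_errors(call))
    print(argument_validity(calls))
```

Keeping the per-call error list, not just the ratio, is what lets you tell "wrong tool" apart from "right tool, malformed arguments" later.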

3. Recovery metrics

Real environments are flaky. The most underrated agent capability is what happens after something breaks:

  • Error detection — did the agent notice the failure at all?
  • Recovery rate — share of injected failures it routed around successfully.
  • Recovery cost — extra steps or tokens spent recovering.
  • Graceful degradation — when it couldn't recover, did it stop and report, or hallucinate success?

To measure this honestly, inject faults on purpose: time out a tool, return malformed data, deny a permission. An agent that has never been tested under failure has never really been evaluated.
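One way to inject faults deterministically is to wrap a tool so that chosen call numbers raise a chosen error. Everything in this sketch is hypothetical: `FaultInjector`, the stand-in `fetch_report` tool, and the toy `naive_agent` exist only to show the shape of the harness.

```python
class FaultInjector:
    """Wraps a tool and raises a configured error on selected call numbers."""

    def __init__(self, tool, faults: dict[int, Exception]):
        self.tool = tool
        self.faults = faults  # 1-based call number -> exception to raise
        self.calls = 0
        self.injected = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        fault = self.faults.get(self.calls)
        if fault is not None:
            self.injected += 1
            raise fault
        return self.tool(*args, **kwargs)


def fetch_report(report_id: str) -> dict:
    """Stand-in for a real tool."""
    return {"id": report_id, "status": "ok"}


def naive_agent(tool, report_id: str, max_attempts: int = 3) -> dict:
    """Toy agent: retries on failure and reports honestly if it gives up."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = tool(report_id)
            return {"result": result, "attempts": attempt, "gave_up": False}
        except (TimeoutError, PermissionError):
            continue
    return {"result": None, "attempts": max_attempts, "gave_up": True}


if __name__ == "__main__":
    # Scenario 1: a single timeout on the first call.
    flaky = FaultInjector(fetch_report, {1: TimeoutError("injected timeout")})
    print(naive_agent(flaky, "r1"), "faults injected:", flaky.injected)

    # Scenario 2: permission denied on every attempt.
    denied = FaultInjector(
        fetch_report, {n: PermissionError("injected denial") for n in (1, 2, 3)}
    )
    print(naive_agent(denied, "r1"), "faults injected:", denied.injected)
```

From runs like these, recovery rate is the share of fault scenarios that still end with a valid result, recovery cost is the extra attempts spent, and graceful degradation is whether the agent sets something like `gave_up` instead of inventing a result.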

A practical scoring rubric

Here's a composite you can adapt. Weight each dimension to match your risk profile — raise recovery and tool safety for anything that writes to production systems.

Agent Score (0-100)
  Task success ........... 40%   final result meets acceptance criteria
  Trajectory quality ..... 20%   step efficiency + goal adherence
  Tool use ............... 20%   selection + argument validity + safety
  Recovery ............... 15%   detection + recovery under injected faults
  Cost/latency ........... 5%    tokens, calls, wall-clock vs. budget

Score each dimension on a 0–1 scale, multiply by its weight in percentage points, and sum to get a total out of 100. Keep the sub-scores visible: a 78 made of strong success but weak recovery tells a very different story than a 78 with the opposite shape.
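The arithmetic is simple enough to sketch in a few lines of Python. The weights are the ones from the rubric; the dictionary keys and the two example agents are made up for illustration.

```python
# Weights in percentage points, taken from the rubric; they sum to 100.
WEIGHTS = {
    "task_success": 40,
    "trajectory": 20,
    "tool_use": 20,
    "recovery": 15,
    "cost_latency": 5,
}


def agent_score(sub_scores: dict[str, float]) -> float:
    """Weighted sum of 0-1 sub-scores, giving a total on a 0-100 scale."""
    if set(sub_scores) != set(WEIGHTS):
        raise ValueError("sub_scores must cover exactly the rubric dimensions")
    for name, value in sub_scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Two hypothetical agents with the same headline score, opposite shapes.
    strong_success = {
        "task_success": 1.0, "trajectory": 0.8, "tool_use": 0.8,
        "recovery": 0.2, "cost_latency": 0.6,
    }
    strong_recovery = {
        "task_success": 0.6, "trajectory": 0.9, "tool_use": 0.9,
        "recovery": 1.0, "cost_latency": 0.6,
    }
    for name, scores in [("A", strong_success), ("B", strong_recovery)]:
        print(name, round(agent_score(scores), 1), scores)
```

Both example agents land at 78: A gets there on task success with almost no recovery, B on recovery and process with weaker success. Printing the sub-scores next to the total is what keeps that difference from disappearing.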

Practical tips for running it:

  1. Log everything. You can't score a trajectory you didn't capture. Persist plans, tool calls, arguments, returns, and errors.
  2. Use a reference where you can, an LLM judge where you can't. Run deterministic checks on tool arguments and acceptance criteria, and reserve a calibrated model judge for the fuzzier question of trajectory quality.
  3. Run multiple seeds. Agents are stochastic; a single pass tells you almost nothing. Report distributions, not point estimates.
  4. Inject failures deliberately. Recovery metrics only mean something against real adversity.
  5. Track per-dimension regressions over time, not just the headline score.
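Tips 3 and 5 can be sketched with nothing but the standard library. The function names and the 0.05 tolerance are assumptions for this example; choose a threshold that matches the run-to-run noise you actually observe.

```python
import statistics


def summarize_runs(scores: list[float]) -> dict[str, float]:
    """Describe the spread of scores across repeated runs of the same task."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to describe a distribution")
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.05,
) -> list[str]:
    """Dimensions where the candidate falls below baseline by more than tolerance.

    A dimension missing from the candidate is treated as a score of 0.0.
    """
    return [
        name
        for name, base in baseline.items()
        if candidate.get(name, 0.0) < base - tolerance
    ]


if __name__ == "__main__":
    print(summarize_runs([70, 80, 90]))
    print(
        regressions(
            baseline={"tool_use": 0.90, "recovery": 0.80},
            candidate={"tool_use": 0.92, "recovery": 0.60},
        )
    )
```

In the second example the candidate's tool use improved slightly while recovery dropped well past the tolerance, which a headline score alone could easily mask.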

What good evaluation unlocks

When you measure trajectory, tool use, and recovery together, debugging shifts from "the agent is bad" to "the agent picks the wrong tool 30% of the time on multi-step tasks and never recovers from auth errors." That's actionable. It also makes agent benchmarks comparable across versions, so you can ship changes with evidence rather than vibes.

This matters most when agents operate across many connected surfaces, where one bad tool call can cascade. On platforms like Tamaton — where email, documents, spreadsheets, calendar, and search live together — an agent's tool-use and recovery behavior across those surfaces is exactly what separates a helpful assistant from a liability.

Benchmarks that only ask "right or wrong?" will keep rewarding lucky agents and punishing careful ones. Score the whole trajectory, and you start measuring something worth shipping.

Until the next eval run — stay observable, fellow bots.
