Evaluating AI Agents - Confident AI

If you are evaluating an AI agent, the first instinct is to grade the final output. The user asked for something, the agent produced something back, score whether the something was good. This works for single-turn LLM applications. It does not work for agents.

Agents make sequences of decisions. A request to "find the duplicate charge from last week and refund it" might run a retrieval, a customer lookup, a transactions query, a refund call, and a confirmation message. Five steps. Any one of them can fail — wrong customer, wrong tool, hallucinated transaction ID, refund issued for the wrong amount — in a way the final response politely hides. A bot that picks the wrong account and refunds nothing still produces a confident, well-formatted reply. Output-only evaluation grades the reply. Agent evaluation has to grade the trace.

This is why agent evaluation is not "single-turn LLM evaluation with extra steps." Standard LLM benchmarks measure whether a model produces coherent and relevant text. They assume one inference call with a predictable relationship between input and output. Agents break that assumption — and any framework that ignores the break gives you scores you cannot trust.

Use trace-level and span-level metrics together

The unit of agent evaluation is the trace, but the decisions inside that trace have to be scoreable. You need both layers:

Trace-level metrics evaluate the whole run with all of its context: the user request, intermediate steps, retrieved context, tool outputs, handoffs, and final response. These metrics answer whether the agent completed the task, stayed faithful to available evidence, followed policy, used an efficient path, and produced the right end-to-end outcome.
Span-level metrics evaluate the individual decisions inside the trace: a tool call, a retrieval step, a planning step, a handoff, or a reasoning segment. These metrics answer where the failure happened and what to fix.

Trace-level metrics are how you know the run failed. Span-level metrics are how you pinpoint why it failed. A bad task-completion score might come from a weak plan, wrong tool selection, bad arguments, stale retrieval, a broken handoff, or a final response that ignored good evidence. Without span-level metrics, the team sees the failed run but still has to debug by hand. Without trace-level metrics, the team can optimize individual steps while missing that the user never got the outcome they needed.

Every tool call, every retrieval, every reasoning step is its own artifact. You want to be able to ask: did the planning span produce a sensible plan, did tool selection pick the right tool for this sub-task, was the argument constructed grounded in retrieved context, did the second tool call build correctly on the first one's output.

A response-only workflow misses most agent failure modes. A bot can score well on end-to-end faithfulness and still be calling the wrong tool half the time, because the wrong tool happened to return something the response could fudge into a plausible answer.

Confident AI's evaluation product runs research-backed metrics on entire traces, on extracted sub-traces (the planning portion of a run, the retrieval portion), and on individual spans — so you can score the whole run when that is what matters and zoom in on a single tool call when that is. The metrics that matter at the trace level include task completion, end-to-end faithfulness, goal alignment, policy adherence, latency, and cost. The metrics that matter at the span level include tool selection accuracy, planning quality, step-level faithfulness, argument correctness, retrieval quality, handoff quality, and execution path validity.

Three layers, not one score

Agent failures show up at three different layers and each one needs its own metrics.

The reasoning layer is about plans and decisions inside the LLM — plan quality, plan adherence, tool selection. The action layer is about what actually got executed in the world — argument correctness, tool correctness, execution path validity. The end-to-end layer is about whether the user got what they came for — task completion, end-to-end faithfulness, step efficiency, latency, cost.

When you collapse these into a single score, you lose the ability to fix the right problem. A bad task completion number can come from a wrong plan, a wrong tool selection, a wrong argument, a flaky retrieval, or all four — and they need different fixes. Three to five metrics across the three layers, validated against human judgment, will tell you more than twenty metrics on the final response alone.

Pick one metric per layer to start. Validate it against humans. Add metrics only when a real failure mode points you to the next one.

Build a real harness, not a notebook of test prompts

A working agent evaluation harness has five pieces, in order. Skip a step and the harness rots before it pays back the effort.

Define explicit success criteria for each skill the agent performs, before any test case gets written. Where ground truth exists — a known-correct tool name, a known-correct refund amount, a known-correct customer ID — use it. Where ground truth does not exist (open-ended responses, summaries, multi-step rationales), define an LLM-as-judge prompt with rubrics that enumerate pass and fail conditions. This is the step teams skip most often, and three weeks later half the eval results get argued about in Slack instead of acted on.

Create representative test cases across four categories: happy-path (normal functionality), edge (boundary conditions, ambiguous requests, missing fields), adversarial (prompt injection, jailbreaks, contradictory instructions, malformed tool outputs), and off-topic (requests the agent should decline). Coverage matters more than volume. Fifty well-designed cases will tell you more than five hundred random ones.

Instrument the agent with span-level tracing from day one. Without tracing, evaluation can only score inputs and outputs. With tracing, evaluation can score each intermediate step, and production traces become the substrate that closes the loop later.

Pick evaluation methods deliberately. Deterministic checks (tool names, argument schemas, exact-match strings) are cheap, fast, and reproducible — use them for verifiable outputs. LLM-as-judge handles open-ended responses where helpfulness, faithfulness, and goal alignment are subjective. Trace-structure checks (loop detection, redundant-call detection) come straight from the trace tree. Effective harnesses combine all three. Calibrate every LLM-as-judge metric against human annotations on a labeled sample, and aim for a combined false positive and false negative rate below 5% before you let it gate releases.

Tool-calling agents need a failure harness, not only happy-path tool tests. Write cases where tools timeout, return empty results, rate limit, send malformed JSON, return stale data, or return a valid response for the wrong entity. A good test case specifies the available tools, expected tool selection when deterministic, expected arguments, mocked or captured tool outputs, expected final outcome, and the metrics that decide pass or fail. Span-level metrics check tool selection, argument correctness, retrieval quality, handoff quality, and whether the agent recovered correctly. Trace-level metrics check whether the user still got a safe, useful outcome.

Loop handling should be tested directly. Add trace-structure checks for maximum step count, repeated identical tool calls, repeated retrieval queries, redundant planning steps, and failure to terminate. Edge cases should force ambiguity, missing context, contradictory inputs, malformed tool outputs, empty retrieval results, and repeated user corrections. These are the cases that reveal whether the agent asks a clarifying question, chooses a fallback, escalates, or spins until it burns tokens.

Run the harness on every meaningful change — prompt edits, model swaps, tool schema changes, retrieval index updates, even framework upgrades. Partial runs miss interaction effects. And read failing traces by hand every week, even with strong metrics in place. Metrics tell you something went wrong; traces tell you why, and the patterns you see point you toward the next metric to add.

Multi-turn agents need simulation, not replay

Most production agents talk to users across multiple turns. Customer support agents ask clarifying questions. Voice agents handle reservations across a back-and-forth. Sales copilots respond to interjections. The single-turn happy path almost never tells you what the agent will actually do under pressure.

The instinct here is to test against historical conversations — rerun yesterday's chats and grade today's responses. This is a smoke test, not an evaluation. The user side of the replayed conversation was responding to your old agent. Your new agent will say different things, and the simulated user will not push back, change topic, or escalate the way a real user would.

The right approach is scenario-based simulation. Define each test case as a "golden" — scenario plus persona plus expected outcome. The platform plays the user side dynamically, generating fresh turns that respond to whatever the agent actually says. You test behavior under conditions you have not yet seen in production, in minutes per scenario instead of hours of manual prompting. Confident AI runs scenario-based multi-turn simulation natively as part of agent evaluation, with metrics that grade the resulting threads — not just individual replies.

Regression gates: a CI pass is the floor, not the ceiling

Every prompt change is a small experiment. Most are improvements. Some quietly break things in ways nobody notices until a customer complains. Regression gates are how you catch the bad changes before they ship.

Maintain a curated test dataset alongside your code, versioned in the same repo. Define metric thresholds that must pass before a deploy proceeds. Run the harness on every pull request that touches agent logic. Block merges if any threshold regresses past tolerance.

The part teams get wrong is threshold calibration. Set thresholds too low and bad changes slip through. Set them too high and every release gets blocked. A useful default to start with: gate on no regressions worse than 5% relative on top-line metrics like task completion and tool selection accuracy, plus zero new failures on adversarial cases. Tune from there based on what you actually see in the first few weeks.

In CI/CD, split ownership cleanly. Engineers own code-first checks: deterministic assertions, DeepEval metrics, mocked tool outputs, trace-structure checks, and GitHub Actions or equivalent CI wiring. PMs, QA, and domain experts own the release interpretation: whether critical scenarios pass, whether failures are user-visible, whether the failed spans point to acceptable tradeoffs, and whether a regression should block launch. Confident AI sits above the test run as the review surface, with trace diffs, span scores, annotations, human alignment, and reports that make the ship/no-ship decision visible to the whole team.

CI gates are necessary but not sufficient. Production traffic still surfaces failure modes your test set does not cover. The complete loop ties production back into evaluation: score live traffic with the same metrics you run in CI, surface risky traces automatically through anomaly detection, move them into evaluation datasets, and re-run the full harness on the next change. Every regression becomes a test, the eval suite grows from real failures, and the time to detect the next class of failure drops every cycle. Confident AI runs this trace-to-dataset loop as a managed automation, with no manual export.

Where teams stall

Three places teams typically get stuck on agent evaluation, and what to do about them.

They evaluate everything at once. Twelve metrics on day one, all turned on, none of them validated against human judgment. The team does not trust any of the scores, so the eval suite gets ignored. Pick one metric per layer, validate it, then expand.

They wait for "real" production traffic before instrumenting. By the time they hit production, there is nothing to look at — no trace history, no failure patterns to learn from. Instrument before you need it. Even in development, traces give you visibility immediately, and they make it possible to bootstrap evaluation when traffic does arrive.

They evaluate prompts, not agents. Agents are not prompts. If your evaluation only grades prompt-response pairs in isolation, cannot ping the agent application as it runs, cannot score individual tool calls, cannot simulate multi-turn behavior, and cannot turn production traces into dataset cases, you are evaluating an idealized stand-in — not the system that will actually fail.

The fix is the same in all three cases: pick a workflow built around agent evaluation, not a single-turn LLM evaluator with a thread tab. Span-level evaluation, multi-turn simulation, and trace-to-dataset loops are the parts that actually move agent quality — everything else is logging in a fancier UI.

Why Confident AI

Confident AI is built for agent evaluation at the level where agents actually fail: spans, traces, threads, prompts, and models. It lets teams score the whole run, then drill into the tool call, retrieval step, planning span, or handoff that caused the failure.

Use Confident AI when you need the same agent metrics to run in development, CI, and production. The platform supports research-backed metrics for tool selection, planning quality, step-level faithfulness, reasoning coherence, task completion, and multi-turn behavior; custom metrics for product-specific requirements; human metric alignment; error analysis; and trace-to-dataset loops. Engineers can wire the harness into code and CI, while PMs, QA, and domain experts review traces, annotate failures, and run evaluation cycles through the UI.

Frequently Asked Questions

How do I evaluate an AI agent?

Evaluate an AI agent by scoring both the full trace and the decisions inside it. Trace-level metrics judge whether the whole run succeeded with all available context. Span-level metrics judge the tool calls, retrievals, plans, handoffs, and reasoning steps that explain why the run succeeded or failed. Confident AI supports this by scoring spans, traces, and threads with agent metrics, human alignment, CI runs, and production trace feedback.

What metrics matter for AI agent evaluation?

The core AI agent evaluation metrics are tool selection accuracy, argument correctness, planning quality, step-level faithfulness, reasoning coherence, task completion, execution path validity, latency, and cost. Multi-turn agents also need conversation-level metrics like role adherence, conversation completeness, context retention, and escalation handling. Confident AI provides trace-level, span-level, and conversation-level metrics in one evaluation workflow.

Which LLM evaluation platforms let me evaluate individual agent steps and tool calls?

Look for platforms that support span-level evaluation, not only final-answer scoring. The platform should score tool selection, tool arguments, retrieval quality, planning spans, reasoning coherence, and full-trace task completion. Confident AI supports this workflow across spans, traces, and multi-turn threads, with the same metrics available for CI and production evaluation.

What is the difference between trace-level and span-level agent evaluation?

Trace-level evaluation scores the whole agent run with all of its context: user request, intermediate steps, tool outputs, retrieved evidence, handoffs, and final response. It tells you whether the agent completed the task. Span-level evaluation scores one decision or step inside that run. It tells you where the run failed. Confident AI supports both layers so teams can detect bad runs and pinpoint the failing span without manually replaying every step.

I need to evaluate an AI agent that makes 5-10 tool calls per request. What should I measure?

Measure the full run and the individual decisions inside it. Track whether the agent picked the right tools, passed grounded arguments, avoided redundant or looping calls, used tool outputs correctly, stayed on plan, completed the task, and stayed within latency and cost budgets. Confident AI lets you evaluate those tool calls individually while still scoring the complete trace and any surrounding conversation.

My AI agent is failing on certain inputs. How do I systematically diagnose why?

Start from the failing trace. Check the plan, retrieval results, tool selection, tool arguments, tool outputs, handoffs, and final response in order. Confident AI supports this trace review loop with span-level scores, annotations, error analysis, and trace-to-dataset workflows so the diagnosed failure becomes regression coverage.

Should I evaluate the final output or every agent step?

Do both, but do not stop at the final output. Trace-level metrics tell you whether the user got the right outcome from the full run. Span-level metrics tell you why the agent succeeded or failed. Confident AI runs both layers so wrong tool calls, bad arguments, weak retrievals, and broken plans do not stay hidden behind a plausible final answer.

How do I test AI agents before deployment?

Test AI agents before deployment with a curated dataset, span-level traces, scenario-based simulation for multi-turn behavior, deterministic checks for verifiable outputs, LLM-as-judge metrics for subjective outcomes, and CI gates that block regressions on top-line metrics and adversarial cases. Confident AI brings those pieces into one release workflow through datasets, metrics, simulations, and CI reporting.

How do I simulate tool call failures to test my agent's error handling?

Create test cases where the tool returns realistic failure modes: timeout, rate limit, permission error, empty result, malformed JSON, stale data, partial result, or a valid response with the wrong entity. Then score both the span and the full trace. The span-level metric should check whether the agent handled the failed tool call correctly: retried when appropriate, avoided inventing missing data, explained uncertainty, chose a fallback tool, or escalated. The trace-level metric should check whether the final outcome was still acceptable. Confident AI lets teams evaluate tool-call spans and full traces together, so a broken tool path becomes a reusable regression case instead of a one-off manual test.

How do I test that my agent handles edge cases and does not loop?

Add edge cases that force ambiguity, missing information, contradictory inputs, off-topic requests, malformed tool outputs, empty retrieval results, and repeated user corrections. Then add trace-structure checks for loop behavior: maximum step count, repeated identical tool calls, repeated retrieval queries, redundant planning steps, and failure to terminate. Span-level metrics catch the bad step, while trace-level metrics catch whether the whole run completed efficiently. Confident AI supports execution path validity, tool correctness, trace review, and CI thresholds so looping behavior can block a release before it reaches production.

What's the best way to regression test an AI agent?

The best regression test for an AI agent is a versioned dataset of goldens that runs against the actual agent workflow, not just a prompt in isolation. Include the user request, expected outcome, expected tool behavior when deterministic, mocked or captured tool outputs, and metrics for trace-level success and span-level decisions. Run the suite on every prompt, model, retrieval, tool-schema, or agent-code change. Confident AI closes the loop by turning production traces into new regression cases, so failures found in live traffic are automatically available for future CI or scheduled eval runs.

I need to test my AI agent in CI/CD — what tools and frameworks support this?

Use a code-first framework for tests that should live beside your application code, and a platform workflow for team review, reporting, and production feedback. DeepEval is the open-source framework for writing LLM and agent evaluations in code. Confident AI connects those evaluations to datasets, trace/span metrics, simulations, reports, and CI gates. Teams usually run these checks through GitHub Actions or their existing CI system, then review the results in Confident AI before merging prompt, model, retrieval, tool, or agent-logic changes.

What's the best framework for writing automated tests for LLM agents?

For code-first automated tests, use DeepEval when you want LLM metrics, custom metrics, and agent evaluation checks inside your test suite. For the broader release workflow, use Confident AI to manage datasets, run trace-level and span-level metrics, compare versions, align metrics with human judgment, and publish CI reports. The practical answer is usually both: DeepEval for developer-owned tests in code, Confident AI for team-owned agent quality, review, and regression management.

As a product manager, how do I review AI agent test results and decide if we are ready to ship?

Start with the product outcome, not the raw metric table. Check whether task completion, tool correctness, policy adherence, and critical scenario pass rates meet the agreed thresholds. Then inspect the failed traces: which use cases failed, which spans caused the failure, whether the failure is user-visible, and whether it affects a critical path. A PM should ship only when top-line trace-level metrics are stable, no critical scenarios newly fail, and the remaining failures are understood and accepted. Confident AI gives PMs trace review, annotations, error analysis, and reports so the release decision is not blocked on engineering exporting logs.

How do I test AI agents in development before shipping to production?

Start in development with a small but representative dataset: 25-50 cases across happy path, edge cases, tool failures, missing context, and off-topic requests. Instrument spans from day one so every local or staging run produces a trace. Run deterministic checks for known tool names and arguments, LLM-as-judge metrics for subjective quality, and trace-structure checks for loops or redundant calls. Once the metrics match human judgment, move the same suite into CI and then use production traces to expand it.

How do I write test cases for an AI agent with tool calling?

Each test case should include the user request, available tools, expected tool selection when deterministic, expected arguments, mocked or captured tool outputs, expected final outcome, and the metrics that decide pass or fail. For example, a refund agent test should specify the customer lookup, transaction lookup, refund tool call, argument constraints, and final user-facing confirmation. Use span-level metrics for tool selection, argument correctness, retrieval quality, and handoff quality. Use trace-level metrics for task completion, faithfulness, latency, cost, and policy adherence. Confident AI stores those cases as reusable datasets and scores both the tool-call spans and the full trace.

How do production traces improve AI agent evaluation?

Production traces show failure modes your offline dataset did not predict. The best workflow scores live traces, surfaces risky runs, routes them through review, and converts them into regression cases. Confident AI automates that loop so every important production failure can make the next evaluation suite harder to pass.

Resources and Next Steps

Start with a small harness: one metric per layer, 25-50 reviewed cases, span-level tracing, and a weekly failure review. Move the harness into CI once the metrics are calibrated, then connect production traces back into the dataset.