Setting Up AI Agent Observability

AI agents are deterministic in code only. Every other layer is fuzzy. Models get swapped, prompts evolve, retrieval indexes drift, tool APIs change shape, and a single conversation can branch into a tool call tree that no developer planned. The agent that worked in last week's demo can quietly start picking the wrong tool in production tomorrow, and your monitoring dashboard will keep reporting a 200 status the entire time.

Agent observability is what closes that gap. Done well, it tells you which tool the agent picked, what arguments it constructed, what came back, how the next reasoning step changed, and whether the whole run reached the user's goal. Done badly, it is a trace dump — useful for forensic debugging, useless for catching the next regression before users do.

What is AI agent observability?

AI agent observability is the practice of capturing every step an AI agent takes during execution — tool selections, tool arguments, retrievals, memory reads and writes, intermediate reasoning steps, sub-agent handoffs, model calls — and storing them as a structured trace that engineers, PMs, and QA can replay later.

It sits one layer above traditional application monitoring and generic AI/LLM observability. Infrastructure dashboards tell you a service is up and a request finished in 380ms. Basic LLM monitoring tells you the prompt, response, token count, cost, and latency. Both observe outputs; agent observability explains the chain of decisions that produced the outcome. It tells you the agent looked up the wrong customer record, hallucinated a refund policy, called a tool with an invented order ID, retried twice, and still returned a 200 to the user.

A useful agent trace has four properties:

  • Structured spans, not free-form logs. Every step is a typed record with inputs, outputs, timing, and errors — not a string blob the LLM happened to print.
  • Tree shape, not flat lists. Tool calls nest under reasoning steps, sub-agent runs nest under the parent agent, parallel branches show up as siblings.
  • Stable identifiers across services. A trace ID propagates from the user-facing API through queues, async workers, and downstream services. Distributed agent traces stay distributed.
  • Tied to a prompt, model, and version. A failing trace should answer "which prompt version did this run use?" in one click — not require a five-tab archeology dig through Git history.

The goal is to make any production run inspectable as a single artifact. If you cannot open one trace and see the full decision path that produced an output, you do not have agent observability yet — you have logging.
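
The four properties above can be sketched as a small data structure. This is a minimal illustration under stated assumptions, not any vendor's schema: `Span`, `child`, and `render` are hypothetical names invented for this example, and the span types and prompt version are made-up values.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Span:
    """One typed step in an agent run (hypothetical shape for illustration)."""
    name: str
    span_type: str                     # e.g. "agent", "reasoning", "tool_call"
    trace_id: str                      # shared by every span in the run
    prompt_version: str | None = None  # ties the run to a deployable artifact
    inputs: dict[str, Any] = field(default_factory=dict)
    outputs: dict[str, Any] = field(default_factory=dict)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    children: list[Span] = field(default_factory=list)

    def child(self, name: str, span_type: str, **kwargs: Any) -> Span:
        """Create a nested span that inherits the trace ID and prompt version."""
        kwargs.setdefault("prompt_version", self.prompt_version)
        span = Span(name=name, span_type=span_type, trace_id=self.trace_id, **kwargs)
        self.children.append(span)
        return span


def render(span: Span, depth: int = 0) -> list[str]:
    """Flatten the tree into indented lines, one per span."""
    lines = ["  " * depth + f"{span.span_type}:{span.name}"]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


root = Span("support_agent", "agent", trace_id=uuid.uuid4().hex, prompt_version="v12")
plan = root.child("plan", "reasoning")
plan.child("lookup_order", "tool_call", inputs={"order_id": "A-1001"})
plan.child("search_policy", "retrieval", inputs={"query": "refund window"})
print("\n".join(render(root)))
```

The two sibling spans under `plan` are the tree shape in miniature: the tool call and the retrieval both hang off the reasoning step that triggered them, and every span carries the same trace ID and prompt version.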

Why traditional monitoring falls short for agents

Traditional application monitoring and logging stacks were built for deterministic services. The same input produces the same output, every request walks a known code path, and a 200 response means the request did what it was supposed to. Agents break each of those assumptions.

The same prompt can produce different tool calls on different runs. Execution branches based on model output, retrieval results, and tool responses. A 200 can wrap a confidently wrong answer that nobody flagged. Latency stays fine while the agent loops three times because the tool returned a slightly different schema this morning.

| Concern | Traditional monitoring | Application logging | Agent observability |
| --- | --- | --- | --- |
| What it records | Request rate, latency, error rate, span counts | Developer-defined events and messages | Tool calls, reasoning, state, memory, handoffs |
| Failure signal | HTTP error or timeout | Exception or warning string | Wrong tool, wrong arguments, drifted plan, hallucinated output |
| Silent failures | Looping agents look healthy | Logs lack reasoning context | Trace shows the loop, retry, or wrong branch |
| Debugging unit | Service or endpoint | Log line | End-to-end agent run |
| Quality answer | "Service is up" | "Function ran" | "Did the agent do the right thing?" |

A traditional dashboard confirms the agent service is running. It cannot confirm the agent picked the right tool, passed the right arguments, retrieved the right memory entry, or stayed on the original plan. Agent observability fills that gap by treating every step in the agent run as a typed, inspectable, evaluable span.

The four pillars of agent observability

Production agents fail in patterns that map directly to four span types. A minimum viable trace schema captures one span type per failure mode, so the surface where the failure shows up is also the surface where you fix it.

1. Tool-call spans

Agents act on the world through tools. Each tool span should capture the tool name, the arguments, the raw return value, retry count, duration, and any error. Without that data, hallucinated arguments and silent retry loops blend into normal traffic. With it, you can ask questions like "which tool errored most this week?" or "how often did the agent pass an invented order ID to lookup_order?" — and get an answer in one query.
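
One way to capture those fields is a decorator around each tool function. The sketch below is a rough illustration, not a specific SDK's API: `traced_tool`, `TOOL_SPANS`, and the `lookup_order` body are hypothetical, and the in-memory list stands in for whatever exporter you actually use.

```python
from __future__ import annotations

import functools
import time
from typing import Any, Callable

TOOL_SPANS: list[dict[str, Any]] = []  # stand-in for a real span exporter


def traced_tool(max_retries: int = 2) -> Callable:
    """Record one tool-call span per invocation, including retries and errors."""
    def decorator(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(**kwargs: Any) -> Any:  # keyword-only so argument names are captured
            span = {"span_type": "tool_call", "tool_name": fn.__name__,
                    "arguments": kwargs, "output": None, "error": None, "retry_count": 0}
            start = time.perf_counter()
            try:
                for attempt in range(max_retries + 1):
                    span["retry_count"] = attempt
                    try:
                        span["output"] = fn(**kwargs)
                        span["error"] = None  # a later attempt succeeded
                        return span["output"]
                    except Exception as exc:
                        span["error"] = f"{type(exc).__name__}: {exc}"
                        if attempt == max_retries:
                            raise
            finally:
                # Runs on success and on final failure, so no call goes unrecorded.
                span["duration_ms"] = (time.perf_counter() - start) * 1000
                TOOL_SPANS.append(span)
        return wrapper
    return decorator


_attempts = {"count": 0}


@traced_tool(max_retries=2)
def lookup_order(order_id: str) -> dict:
    """Hypothetical tool that times out on its first attempt."""
    _attempts["count"] += 1
    if _attempts["count"] == 1:
        raise TimeoutError("upstream timed out")
    return {"order_id": order_id, "status": "shipped"}


lookup_order(order_id="A-1001")
print(TOOL_SPANS[-1]["tool_name"], TOOL_SPANS[-1]["retry_count"])
```

The caller sees a normal return value, but the span records that one retry happened. That is exactly the kind of silent retry a status code hides.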

Tool-call spans also enable tool-selection metrics. Once you can see which tool fired and which one should have, scoring tool selection accuracy stops being a research project.
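
As a rough sketch, tool selection accuracy can be computed as the share of labelled steps where the selected tool matches the expected one. This is one simple definition chosen for illustration; the function name, the pair format, and the tool names are assumptions, and the expected-tool labels would have to come from human annotation or a reference dataset.

```python
from collections import Counter


def tool_selection_accuracy(steps):
    """steps: iterable of (expected_tool, selected_tool) pairs from labelled spans."""
    steps = list(steps)
    if not steps:
        return 0.0, Counter()
    # Count each (expected, selected) mismatch so the worst confusion is easy to find.
    confusions = Counter((expected, selected) for expected, selected in steps
                         if expected != selected)
    correct = len(steps) - sum(confusions.values())
    return correct / len(steps), confusions


labelled = [
    ("lookup_order", "lookup_order"),
    ("issue_refund", "lookup_order"),   # agent looked up the order instead of refunding
    ("search_policy", "search_policy"),
    ("issue_refund", "issue_refund"),
]
accuracy, confusions = tool_selection_accuracy(labelled)
print(accuracy, confusions.most_common(1))
```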

2. Reasoning spans

Reasoning spans capture the model's plan, the action it picked, the observation it made, and what it decided next. These are the spans that surface plan drift — the agent generates a sensible plan and then deviates from it — and wrong-branch selection that no single LLM span can show.

If your agent uses a planning step, a reflection step, or an explicit chain-of-thought, those should be their own spans. Burying them in a generic "LLM call" span loses the structure that makes them debuggable.

3. State and handoff spans

Agents carry working memory across steps. State transition spans record the state before and after each step, including context edits and handoff payloads to sub-agents. They catch context loss and summarization drift that quietly degrade longer runs.

In multi-agent systems, handoff spans are the most valuable record you have. They make the boundary between Agent A and Agent B inspectable — which is where the most expensive failures live, because Agent B keeps running on whatever Agent A handed over, even if it is wrong.
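
To make context loss visible, a state or handoff span can store a diff of the working state alongside the before and after snapshots. The following is a small sketch; `state_diff` and the state keys are invented for the example.

```python
def state_diff(before, after):
    """Summarize what a step added, dropped, or changed in the agent's state."""
    return {
        "added": {k: after[k] for k in after.keys() - before.keys()},
        "removed": {k: before[k] for k in before.keys() - after.keys()},
        "changed": {k: (before[k], after[k])
                    for k in before.keys() & after.keys() if before[k] != after[k]},
    }


before = {"order_id": "A-1001", "customer_tier": "gold", "issue": "late delivery"}
# State after a summarization step that prepares the handoff payload.
after = {"order_id": "A-1001", "issue": "delivery problem",
         "summary": "Customer reports a late order"}

handoff_span = {
    "span_type": "handoff",
    "state_before": before,
    "state_after": after,
    "diff": state_diff(before, after),
}
print(handoff_span["diff"]["removed"])
```

Here the summarization step silently dropped `customer_tier`. The receiving agent would never know the field existed, but the diff on the handoff span shows it.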

4. Memory and retrieval spans

Memory spans cover reads and writes to vector stores, semantic caches, and long-term memory backends. Each span should record the query, the returned entries, relevance scores, and the freshness of the retrieved data. This is where stale reads, wrong-entity retrieval, and memory leakage between users show up — none of which any aggregate metric will surface.

For RAG-heavy agents, the retrieval spans are the difference between "the agent was wrong" and "the retriever returned the wrong document, the agent worked correctly off bad input."
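
A retrieval span can flag stale and cross-user entries at write time, so those failures are queryable later. This sketch assumes each stored entry carries `updated_at` and `owner_id` fields; those field names, the 30-day freshness window, and `retrieval_span` itself are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def retrieval_span(query, entries, user_id, now, max_age=timedelta(days=30)):
    """Build a retrieval span that records scores, freshness, and ownership issues."""
    return {
        "span_type": "retrieval",
        "query": query,
        "entries": [{"id": e["id"], "score": e["score"],
                     "age_days": (now - e["updated_at"]).days} for e in entries],
        # Entries older than the freshness window.
        "stale_ids": [e["id"] for e in entries if now - e["updated_at"] > max_age],
        # Entries that belong to a different user (possible memory leakage).
        "foreign_ids": [e["id"] for e in entries if e["owner_id"] != user_id],
    }


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
entries = [
    {"id": "doc-1", "score": 0.91, "owner_id": "u1",
     "updated_at": datetime(2025, 1, 20, tzinfo=timezone.utc)},
    {"id": "doc-2", "score": 0.88, "owner_id": "u1",
     "updated_at": datetime(2024, 10, 1, tzinfo=timezone.utc)},
    {"id": "doc-3", "score": 0.80, "owner_id": "u2",
     "updated_at": datetime(2025, 1, 30, tzinfo=timezone.utc)},
]
span = retrieval_span("refund window", entries, user_id="u1", now=now)
print(span["stale_ids"], span["foreign_ids"])
```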

Together, these four span types form a structured record of the agent's behavior. Each span is typed, timestamped, and parent-child linked, so a single trace shows the full execution graph for one user request — and so an evaluation layer can score the right thing at the right level.

Designing a minimum viable agent trace schema

The trace schema is the contract between your agent and every downstream consumer: the debugging UI, the evaluation layer, the alerting system, the dataset curator, the analytics dashboard. A small, opinionated schema is far easier to enforce — and far easier to evaluate against — than a free-form log.

A workable schema records, for each span:

  • Span type. Tool call, reasoning step, state transition, memory operation, handoff, or LLM call.
  • Inputs. Structured arguments, queries, prior state, retrieved context.
  • Outputs. Raw return value, generated text, retrieved entries, new state.
  • Timing. Start time, end time, duration, time-to-first-token if applicable.
  • Errors and retries. Typed error state, retry count, parent retry context.
  • Costs. Tokens by category (prompt, completion, cached, reasoning), estimated dollar cost.
  • Identifiers. Trace ID, parent span ID, session or thread ID, user ID, tenant ID.
  • Prompt and model version. Which prompt template, which version, which model, which temperature — so the trace ties back to a specific deployable artifact.

Storing each step as a typed record with these fields keeps traces queryable. Engineers can filter by tool name, span type, error class, prompt version, or session — instead of grepping through unstructured logs at 3am while a production incident burns down.
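
The field list above could be pinned down as a validated record type. The sketch below is one possible shape, not a standard: `SpanRecord`, the allowed span-type strings, and every field name are hypothetical choices for this example.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from typing import Any

SPAN_TYPES = {"tool_call", "reasoning", "state_transition", "memory", "handoff", "llm_call"}


@dataclass
class SpanRecord:
    # Span type and identifiers
    span_type: str
    name: str
    trace_id: str
    span_id: str
    # Timing (seconds since epoch)
    start_time: float
    end_time: float
    parent_span_id: str | None = None
    session_id: str | None = None
    user_id: str | None = None
    tenant_id: str | None = None
    # Inputs and outputs
    inputs: dict[str, Any] = field(default_factory=dict)
    outputs: dict[str, Any] = field(default_factory=dict)
    # Errors and retries
    error_type: str | None = None
    retry_count: int = 0
    # Costs
    tokens: dict[str, int] = field(default_factory=dict)  # prompt, completion, cached, reasoning
    cost_usd: float | None = None
    # Prompt and model version
    prompt_version: str | None = None
    model: str | None = None
    temperature: float | None = None

    def __post_init__(self) -> None:
        """Reject malformed spans at write time instead of at query time."""
        if self.span_type not in SPAN_TYPES:
            raise ValueError(f"unknown span_type: {self.span_type!r}")
        if self.end_time < self.start_time:
            raise ValueError("end_time is before start_time")

    @property
    def duration_ms(self) -> float:
        return (self.end_time - self.start_time) * 1000

    def to_json(self) -> str:
        return json.dumps({**asdict(self), "duration_ms": self.duration_ms}, sort_keys=True)


record = SpanRecord(span_type="tool_call", name="lookup_order", trace_id="t-1", span_id="s-1",
                    start_time=100.0, end_time=100.25, inputs={"order_id": "A-1001"},
                    prompt_version="v12", model="example-model")
print(record.to_json())
```

Validating in `__post_init__` is a deliberate choice: a span that fails the contract raises where it is created, which is far cheaper than discovering missing fields in the evaluation layer.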

This is also where most homegrown observability efforts collapse. Teams instrument their agent, but they do it inconsistently — one tool call gets a span, the next gets a print statement, the third gets a dictionary appended to a list in memory. Two months later, half the production traces are missing fields and the evaluation team is reverse-engineering JSON shapes by hand. The schema is worth getting right on day one, even if it feels like over-engineering.

Implementing agent observability across frameworks

Many production teams run more than one framework: LangGraph for graph-based control flow, OpenAI Agents SDK for newer model features, CrewAI for multi-agent crews, Pydantic AI for type-safe Python agents, Vercel AI SDK for TypeScript stacks, LlamaIndex for RAG-heavy pipelines, plus a custom orchestrator that predates all of them. A workable instrumentation strategy spans all of them without forcing a rewrite.

The pattern that works:

  1. Use native framework adapters where they exist. A first-class SDK wrapper produces a span tree that matches the framework's mental model — LangGraph nodes nest correctly, CrewAI sub-agents land under the parent crew, OpenAI Agents tool calls show up with arguments and outputs already parsed.
  2. Fall back to OpenTelemetry for the rest. For unsupported frameworks or custom orchestrators, OTEL is the universal substrate. As long as you produce well-formed spans with inputs and outputs, your collector or observability platform has a standard shape to ingest.
  3. Standardize attribute names. Whether you use OTEL semantic conventions or OpenInference, pick one and stick with it. Mixed conventions are how you end up with tool_name on half your spans and function_name on the other half.
  4. Instrument once, export to many. A single span pipeline that exports to your observability platform plus an internal warehouse is far easier to reason about than three half-instrumented stacks.
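
Step 3 can be enforced in code with a small normalization pass before export. The sketch below is illustrative only: the canonical names and the alias table are made up for this example and are not taken from the OTEL semantic conventions or OpenInference, so substitute whichever convention you adopt.

```python
# Hypothetical alias table: left side is what instrumentation emits,
# right side is the single name the pipeline exports.
ALIASES = {
    "function_name": "tool_name",
    "tool": "tool_name",
    "args": "tool_arguments",
    "arguments": "tool_arguments",
}


def normalize_attributes(attrs):
    """Map known aliases onto one canonical name and reject conflicting values."""
    normalized = {}
    for key, value in attrs.items():
        canonical = ALIASES.get(key, key)
        if canonical in normalized and normalized[canonical] != value:
            raise ValueError(f"conflicting values for {canonical!r}")
        normalized[canonical] = value
    return normalized


span_a = normalize_attributes({"function_name": "lookup_order", "args": {"order_id": "A-1001"}})
span_b = normalize_attributes({"tool_name": "lookup_order", "arguments": {"order_id": "A-1001"}})
print(span_a == span_b)
```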

Confident AI's agent observability supports Python and TypeScript instrumentation across LangGraph, CrewAI, Pydantic AI, OpenAI Agents SDK, Vercel AI SDK, LlamaIndex, custom agents, OpenTelemetry, and OpenInference. Distributed context propagation works across services, queues, and async workers — so a trace that crosses three microservices stays a single trace, not three disconnected fragments.
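
For a sense of what context propagation involves underneath, the sketch below carries a trace across a queue using the W3C Trace Context `traceparent` header format (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). It is a hand-rolled illustration, not how any particular SDK implements it; `inject`, `extract`, and the message shape are hypothetical.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(trace_id, span_id, flags="01"):
    """Serialize the current span's identity for the next hop."""
    return f"00-{trace_id}-{span_id}-{flags}"


def extract(header):
    """Parse a traceparent value into (trace_id, parent_span_id, flags)."""
    match = _TRACEPARENT.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, parent_span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        raise ValueError("all-zero IDs are invalid")
    return trace_id, parent_span_id, flags


# User-facing API: starts the trace and enqueues work with the header attached.
api_span = {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}
message = {"task": "refund_check",
           "traceparent": inject(api_span["trace_id"], api_span["span_id"])}

# Async worker: continues the same trace instead of starting a new one.
trace_id, parent_span_id, flags = extract(message["traceparent"])
worker_span = {"trace_id": trace_id, "span_id": secrets.token_hex(8),
               "parent_span_id": parent_span_id}
print(worker_span["trace_id"] == api_span["trace_id"])
```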

From observability to evaluation: closing the loop

Tracing without evaluation tells you what the agent did. It does not tell you whether what the agent did was correct.

This is the single biggest mistake teams make in agent observability. They instrument exhaustively, build dashboards for token cost and latency, set up alerts on error rates — and then have no way to answer "is this agent actually getting better, or worse?" Latency can drop 30% while quality silently degrades because the new prompt is more confident, faster, and wrong.

A complete agent observability stack has four layers, in order:

  1. Tracing. Every step lands as a structured span with inputs, outputs, timing, cost, and the prompt version that produced it.
  2. Online / live evaluations on traces. Online metrics score the trace as production traffic flows through it — end-to-end on the full run, on extracted sub-traces (the planning portion, the retrieval portion), and on individual spans (a specific tool call). Tool selection accuracy, planning quality, step-level faithfulness, reasoning coherence, and task completion belong here.
  3. Signals, issues, and anomalies. Failing runs, frustrated users, prompt-injection patterns, timeout spikes, new emerging use cases, and quality drift get surfaced automatically — your team works on real problems instead of querying for them.
  4. Dataset curation and feedback. Failing or interesting traces convert into evaluation test cases on autopilot. The next deployment is regression-tested against the failures you just saw.

Teams that stop at layer 1 end up with expensive logging. Teams that stop at layer 2 evaluate everything but never act on the results. Teams that close all four layers get compounding returns: every production cycle makes their evaluation better, which makes their agent better, which surfaces new, harder problems to solve.
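
Layer 4 can be sketched as a filter that turns low-scoring traces into test cases. This is a simplified illustration: the trace shape, the `task_completion` score name, the 0.5 threshold, and `curate_test_cases` are all assumptions, and the expected output is left empty for a human reviewer to fill in.

```python
def curate_test_cases(traces, threshold=0.5):
    """Convert failing traces into deduplicated regression test cases."""
    cases, seen_inputs = [], set()
    for trace in traces:
        # Unscored traces are treated as passing here; adjust if you prefer to review them.
        if trace["scores"].get("task_completion", 1.0) >= threshold:
            continue
        if trace["input"] in seen_inputs:
            continue  # keep one case per distinct input
        seen_inputs.add(trace["input"])
        cases.append({
            "input": trace["input"],
            "actual_output": trace["output"],
            "expected_output": None,            # to be supplied by a reviewer
            "source_trace_id": trace["trace_id"],
            "prompt_version": trace["prompt_version"],
        })
    return cases


traces = [
    {"trace_id": "t1", "input": "Where is my order?", "output": "Refund issued.",
     "prompt_version": "v12", "scores": {"task_completion": 0.2}},
    {"trace_id": "t2", "input": "Cancel my plan", "output": "Plan cancelled.",
     "prompt_version": "v12", "scores": {"task_completion": 0.9}},
    {"trace_id": "t3", "input": "Where is my order?", "output": "I cannot help.",
     "prompt_version": "v12", "scores": {"task_completion": 0.1}},
]
cases = curate_test_cases(traces)
print([case["source_trace_id"] for case in cases])
```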

Online / live evaluations on agent traces

Online evaluations, live evaluations, and online metrics are the same operational idea in this guide: run quality metrics on real production traces while the agent is being used. The trace tells you what happened. The online metric tells you whether that trace was good enough.

For agents, the useful trigger moments are usually:

  • After the final response. Score task completion, answer relevancy, faithfulness, policy adherence, latency, and cost for the whole trace.
  • After tool execution. Score whether the agent selected the right tool, passed grounded arguments, handled the tool result correctly, and avoided redundant calls.
  • After retrieval. Score retrieval quality, context relevance, freshness, and whether the final response stayed faithful to the retrieved context.
  • On meaningful spans. Score planning quality, step-level faithfulness, reasoning coherence, handoff quality, or execution path validity at the exact step where failures tend to start.
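
The trigger moments above can be modelled as a registry that maps span types to the metrics that should run after them. The sketch below is deliberately naive: the two metrics use string matching rather than an LLM judge, and `ONLINE_METRICS`, `grounded_arguments`, `returned_context`, and the trace shape are hypothetical.

```python
def grounded_arguments(span, evidence):
    """1.0 if every string argument appears verbatim in earlier evidence, else 0.0."""
    values = [v for v in span.get("arguments", {}).values() if isinstance(v, str)]
    return 1.0 if all(any(v in text for text in evidence) for v in values) else 0.0


def returned_context(span, evidence):
    """1.0 if the retrieval produced at least one entry, else 0.0."""
    return 1.0 if span.get("output") else 0.0


# Trigger moments: which metrics run after which kind of span (hypothetical registry).
ONLINE_METRICS = {"tool_call": [grounded_arguments], "retrieval": [returned_context]}


def score_trace(trace):
    """Score spans in order, giving each metric only the evidence seen so far."""
    evidence = [trace["input"]]
    results = []
    for span in trace["spans"]:
        for metric in ONLINE_METRICS.get(span["span_type"], []):
            results.append((span["name"], metric.__name__, metric(span, evidence)))
        evidence.append(str(span.get("output", "")))
    return results


trace = {
    "input": "Where is my order A-1001?",
    "spans": [
        {"span_type": "tool_call", "name": "lookup_order",
         "arguments": {"order_id": "A-1001"}, "output": {"status": "shipped"}},
        {"span_type": "retrieval", "name": "search_policy",
         "output": ["Refunds are allowed within 30 days"]},
        {"span_type": "tool_call", "name": "issue_refund",
         "arguments": {"order_id": "A-9999"}, "output": {"refunded": True}},
    ],
}
for row in score_trace(trace):
    print(row)
```

The last tool call scores 0.0 because `A-9999` never appeared in the user input or any earlier output, which is the invented-order-ID failure caught at the span where it starts.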

Do not treat online metrics as a separate system from observability. They are the evaluation layer on top of the trace store. Pick the metrics using the Evaluating AI Agents playbook, choose where they should run using Setting Up Trigger Moments (Online Evals), then configure the equivalent online metrics in Confident AI once your traces are reliable.

Multi-agent observability: handoffs are the failure surface

Multi-agent systems compound the single-agent failure modes with one new class: handoff failures. Agent A passes incomplete or incorrect context to Agent B, and Agent B continues based on wrong assumptions. Without cross-agent tracing, the team debugging Agent B's output cannot see that the root cause was upstream.

The fix is parent-child span propagation across agents. The handoff payload becomes a span on the parent trace. The receiving agent's run nests under that handoff span. The same trace ID flows through both. The result is a single trace that shows the planning agent's output, the handoff payload it produced, the sub-agent's tool calls, and the final response — all in one view.

Treat each agent boundary the same way you would treat a remote procedure call. Log the inputs, outputs, and timing of every handoff. Score the handoff itself if it is a meaningful decision point. The agents that fail in production are almost never failing because of one bad LLM call — they are failing because Agent A handed Agent B a bad summary three steps earlier, and nobody saw it.
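
Treating a handoff like a remote procedure call might look like the wrapper below. It is a sketch under stated assumptions: `handoff`, `refund_agent`, and the dictionary span format are invented for the example, and the list stands in for a real exporter.

```python
import time
import uuid

SPANS = []  # stand-in for a real span exporter


def new_id():
    return uuid.uuid4().hex[:16]


def handoff(parent_span, to_agent, payload, run_agent):
    """Record the handoff payload as a span, then run the sub-agent nested under it."""
    span = {"span_type": "handoff", "name": f"handoff:{to_agent}", "span_id": new_id(),
            "trace_id": parent_span["trace_id"], "parent_span_id": parent_span["span_id"],
            "input": payload, "output": None, "error": None}
    start = time.perf_counter()
    try:
        span["output"] = run_agent(payload, span)  # sub-agent gets the handoff span as parent
        return span["output"]
    except Exception as exc:
        span["error"] = f"{type(exc).__name__}: {exc}"
        raise
    finally:
        span["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(span)


def refund_agent(payload, parent_span):
    """Hypothetical sub-agent that makes one tool call under the handoff span."""
    SPANS.append({"span_type": "tool_call", "name": "issue_refund", "span_id": new_id(),
                  "trace_id": parent_span["trace_id"],
                  "parent_span_id": parent_span["span_id"],
                  "input": {"order_id": payload["order_id"]}, "output": {"refunded": True}})
    return {"status": "refund issued"}


planner_span = {"span_type": "agent", "name": "planner",
                "span_id": new_id(), "trace_id": uuid.uuid4().hex}
result = handoff(planner_span, "refund_agent",
                 {"order_id": "A-1001", "summary": "late delivery"}, refund_agent)
print(result, len(SPANS))
```

Whoever debugs the refund agent can walk up from its tool call to the handoff span and read the exact payload the planner sent.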

How to roll out agent observability without a six-month project

Most teams do not need full multi-agent maturity on day one. They need basic visibility on day one, and a clear path to add evaluation, alerting, and release enforcement as production traffic grows.

A practical adoption path:

  • Day 1: Capture the basics. Instrument LLM calls and tool invocations, record errors, make every run inspectable. Use the framework's native adapter where it exists, OTEL otherwise. Do not try to evaluate anything yet — just get the spans flowing.
  • Week 1: Add structure. Add reasoning spans, retrieval spans, prompt-version metadata, and user/session identifiers. Standardize attribute names. Verify a representative sample of traces by hand to catch instrumentation bugs early.
  • Month 1: Score live behavior. Turn on production signals and anomaly detection first, then add live evaluations with online metrics to priority workflows and expand them across production traces as the metrics prove reliable. Watch for the first real signals: which tools fail most, which prompts produce the worst scores, which user segments see degraded quality.
  • Quarter 1: Close the loop. Convert failing traces into evaluation test cases. Run the same scorers in CI. Gate deploys on quality thresholds. Set up quality-aware alerts on score drops, not just latency. Now production failures automatically become the next deploy's regression tests.
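
A deploy gate on quality thresholds can be as small as the check below. It is a sketch only: `quality_gate`, the 0.8 floor, and the 5% relative-drop limit are illustrative values, and the scores would come from running the same scorers on the curated test cases.

```python
from statistics import mean


def quality_gate(baseline_scores, candidate_scores, min_score=0.8, max_relative_drop=0.05):
    """Fail the deploy if quality is below a floor or has regressed against baseline."""
    baseline, candidate = mean(baseline_scores), mean(candidate_scores)
    relative_drop = (baseline - candidate) / baseline if baseline else 0.0
    reasons = []
    if candidate < min_score:
        reasons.append(f"mean score {candidate:.2f} is below the floor {min_score:.2f}")
    if relative_drop > max_relative_drop:
        reasons.append(f"mean score dropped {relative_drop:.1%} against baseline")
    return {"passed": not reasons, "baseline": baseline,
            "candidate": candidate, "reasons": reasons}


regressed = quality_gate([0.90, 0.92, 0.88], [0.84, 0.82, 0.83])
steady = quality_gate([0.90, 0.92, 0.88], [0.90, 0.90, 0.90])
print(regressed["passed"], regressed["reasons"])
print(steady["passed"])
```

The regressed candidate stays above the absolute floor and still fails, because the relative rule catches a drop that a fixed threshold alone would let through. The same rule can drive a quality-aware alert on production scores.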

The order matters. Tracing is the foundation every later stage depends on. But evaluation, automatic issue surfacing, and dataset curation are where the cost of agent failures actually drops — not at the trace viewer.

Why Confident AI

Confident AI is built for the full agent observability loop: trace the agent run, score the decisions inside it, surface the failures that matter, and turn production evidence into the next evaluation cycle. That is the part most trace viewers do not finish.

Use Confident AI when you need AI agent observability across spans, traces, and threads — not just request logs. It captures tool calls, retrievals, handoffs, model calls, prompt versions, costs, latency, and metadata; runs online metrics as live evaluations on production traces; surfaces signals and anomalies; sends quality-aware alerts; and turns risky traces into datasets for regression testing. Engineers set up the instrumentation, then PMs, QA, and domain experts can review traces and annotate failures without waiting on engineering for every quality decision.

Frequently Asked Questions

What is AI agent observability?

AI agent observability is the practice of tracing, analyzing, and evaluating the full chain of decisions an AI agent makes in production. Confident AI captures tool calls, retrievals, reasoning steps, memory operations, handoffs, model calls, inputs, outputs, costs, latency, and version metadata so the team can see why an agent produced a result.

How is AI agent observability different from LLM observability?

Basic LLM observability usually captures prompts, responses, tokens, cost, and latency. AI agent observability captures the workflow between the user request and the final response: which tools fired, what arguments were passed, what context was retrieved, which sub-agent received the handoff, and where the decision path started to fail. Confident AI is built for that deeper agent trace, then adds evaluation, signals, alerts, and dataset curation on top.

What should an AI agent trace include?

An AI agent trace should include typed spans for tool calls, reasoning steps, retrievals, memory reads and writes, handoffs, LLM calls, errors, retries, timing, cost, user or session identifiers, prompt versions, model versions, and parent-child relationships between spans. Confident AI stores that context as structured traces so teams can evaluate spans, full runs, and multi-turn threads from the same evidence.

I need to trace multi-step agent workflows in production. What should I set up?

Set up structured tracing around the full agent run, not only the final model call. Capture tool-call spans, retrieval spans, reasoning spans, handoff spans, prompt and model versions, cost, latency, errors, retries, and stable trace IDs across services. Confident AI supports this through framework instrumentation, OpenTelemetry, OpenInference, production signals, online metrics on traces and spans, and quality-aware alerts.

How do I automatically categorize AI production issues like wrong tool calls, bad responses, and hallucinations?

Use signals and evaluation results on production traces. Tool-call metrics catch wrong tool selection or bad arguments. Faithfulness and hallucination metrics catch unsupported responses. Trace-level task completion catches failed outcomes. Confident AI combines those signals with human annotations so confirmed failure categories feed better datasets and metrics for the next cycle.

How do I send LLM quality alerts to Slack or PagerDuty when an AI agent's scores drop?

Run evaluations or production signals on the traces you care about, define thresholds or relative-regression rules, and route alerts to Slack, PagerDuty, or Teams. Confident AI sends quality-aware alerts with links back to the failing traces, including the relevant score, prompt version, use case, and trace context.

When should I set up AI agent observability?

Set up AI agent observability before production, even if you are not ready to run evals yet. Traces are the substrate for debugging, production signals, online evaluations, quality alerts, dataset curation, and CI regression tests. Confident AI lets you start with tracing, then layer on signals, evaluation, alerts, and trace-to-dataset loops as traffic grows.

Are live evaluations the same as online metrics?

In Confident AI, online metrics are how you run live evaluations on production traces, spans, and threads. For AI agents, that means scoring tool selection, argument correctness, retrieval quality, planning quality, task completion, and other trace-level metrics while real traffic flows through the system. Use Evaluating AI Agents to choose the metrics, then configure them as online metrics once the trace data is reliable.

Do I need OpenTelemetry for AI agent observability?

You do not need OpenTelemetry for every agent stack, but you should use it when native framework instrumentation does not cover your custom orchestrator or distributed services. Confident AI supports native instrumentation, OpenTelemetry, and OpenInference; the important part is consistent, structured spans with stable trace IDs and standard attributes.

Resources and Next Steps

Start by instrumenting LLM calls and tool invocations, then add reasoning spans, retrieval spans, handoff spans, and prompt-version metadata. Once the traces are reliable, add production signals, online metrics for live evaluation, quality alerts, and trace-to-dataset automation.