At 2:14am the loan-eligibility agent returned HTTP 200. No exceptions. No alerts. The income-verification tool logged status: ok. Over the next six hours the agent rejected 847 valid applications because a third-party API had quietly renamed a JSON field overnight. The model received a key it had never seen, hallucinated null for income, and declined every pending case. [8]
This is the failure class agentic systems were built to expose. Infrastructure healthy. Dashboard green. The fault is semantic — distributed across a reasoning chain where one wrong interpretation at step 3 shapes every decision 40 calls deep. There is no exception to read. There is no line number to fix.
This runbook gives you four things: a failure-mode taxonomy you classify against before reading the trace, the minimum observability stack that catches silent failures before users do, a five-step triage sequence that works when the error log is empty, and a postmortem template that does not pretend non-determinism is a typo.
Traditional SRE asks what crashed. Agentic SRE asks something else: which step produced the wrong premise, and how far did it travel before anyone noticed?
73%: hallucination probability for agent tasks involving 5+ tool calls, per Stanford HAI Q4 2025 evaluation analysis reported by Markaicode. [1]
71%: share of errors front-loaded into query reformulation and initial retrieval. Errors are not distributed evenly across the chain. [1]
89% of teams running agents in production have observability; only 52% have evals that validate behavior. [2]
Most agent failures complete cleanly at the infrastructure layer. The wrong output is semantic — invisible to latency dashboards.
The Green Dashboard Is the Alibi
Traditional observability was built for systems that fail by exiting nonzero. Agents fail by completing successfully.
The most dangerous agent failures are the ones your monitoring stack calls successes.
Latency histograms, error rates, CPU graphs — every one of those signals was designed for deterministic systems where failure means the process exited nonzero. Agents break that assumption. A clean HTTP 200 from an LLM API is not evidence of correct behavior. Valid JSON from a tool is not evidence the agent interpreted it correctly. A formatted final response is not evidence the response is true. [2]
The canonical pattern: a third-party tool changes its response schema. No exception fires. The JSON is still valid. The model receives a field name absent from training and, instead of escalating, generates a plausible value from prior context. That value enters the session state, gets cited in the next three calls, and by the time the user sees the output the entire reasoning chain has been built on a hallucinated foundation. None of it shows up in your Datadog dashboard. [8]
In early 2026, 89% of teams running agents in production had implemented some form of agent observability. Only 52% had eval pipelines that actually validated whether the agent's behavior was correct, not just syntactically complete. [2] The 37-point gap between "we have traces" and "we catch behavioral failures" is where production incidents live.
Here is the part most teams miss: the model is rarely the root cause. The dominant failure is a boundary failure — a tool returned partial JSON, retrieval pulled the wrong chunk, a planner looped, an API contract changed without notice. [4] If the LLM's reasoning is the actual root cause, your team got unlucky. If the tool interface failed, that is just production.
| Step | Traditional SRE | Agentic SRE |
|---|---|---|
| 1 | Alert fires on 5xx error or p99 latency spike | Alert fires on semantic anomaly, cost spike, or user report, often hours after the fault |
| 2 | Read the exception message and the stack trace | Find the session trace; no exception exists to read |
| 3 | Identify the failing line of code | Identify the fault-origin step inside the reasoning chain |
| 4 | Reproduce deterministically in staging | Accept that exact reproduction is probabilistically unlikely |
| 5 | Fix the bug, write the unit test, deploy | Fix prompt, tool schema contract, or memory architecture |
| 6 | Close when error rate returns to baseline | Close after a regression eval catches this pattern in future sessions |
Errors Are Front-Loaded. Read Steps 1–3 First.
Failures in a 40-call session are not distributed evenly. The data points the other way — and so does where you should start the trace.
The intuitive assumption is that a 40-call session distributes failures proportionally across all steps. The evidence contradicts it.
Across 1,200 logged agent runs with verified hallucinations, analysis reported from multiple benchmark evaluations found 71% of errors were introduced in the first two steps — typically during query reformulation or initial retrieval. [1] The final output looks like a complex multi-step failure. Trace it backward and the root cause is almost always an error in how the task was understood or how the first retrieval was executed.
This changes the triage path. When you are staring at a 40-step session, start at steps 1–3. Read what the agent understood the user to be asking. Read what came back from the first retrieval or tool call. Read whether the initial plan was sound. If those steps are clean, the problem is unlikely to be a fundamental reasoning failure — it is more likely a specific tool boundary issue later in the chain.
Hallucination probability scales nonlinearly with tool call count. Across agentic benchmark evaluations, the probability of at least one hallucination in a run climbs from roughly 12% at 2 tool calls, to 67% at 10, to above 85% at 15 or more. [1] This is not an argument against complex agents. It is the structural reason any agentic workflow that requires 10+ tool calls is production-critical infrastructure and needs dedicated semantic observability — not standard LLM monitoring with extra fields.
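One way to read these figures is against a naive baseline in which every tool call fails independently with the same probability. The sketch below is an illustration under that assumption, not a model taken from the cited benchmarks, and the function names are hypothetical. It back-solves the per-call rate that each quoted run-level figure would imply.

```python
def p_at_least_one(per_call_rate: float, calls: int) -> float:
    """Probability of at least one hallucination if calls fail independently."""
    return 1.0 - (1.0 - per_call_rate) ** calls


def implied_per_call_rate(run_rate: float, calls: int) -> float:
    """Invert the formula above: the per-call rate that would yield run_rate."""
    return 1.0 - (1.0 - run_rate) ** (1.0 / calls)


# Run-level figures quoted in the text: (tool calls, P(at least one hallucination)).
reported = [(2, 0.12), (10, 0.67), (15, 0.85)]

for calls, run_rate in reported:
    rate = implied_per_call_rate(run_rate, calls)
    print(f"{calls:>2} calls: implied per-call rate ~ {rate:.3f}")
```

Under this baseline the implied per-call rate is not constant: it roughly doubles between the 2-call and 15-call figures, so a fixed, independent per-call error rate does not reproduce the quoted numbers.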
Benchmark numbers come from controlled environments. Real production data is dirtier — noisier inputs, inconsistent schemas, retrieval systems that drift over time. Real-world hallucination rates in enterprise deployments are likely higher than the benchmarks suggest. Treat the directional insight as the load-bearing claim: errors are front-loaded, complexity compounds risk nonlinearly, and the precise numbers depend on how clean your inputs are.
Four Instrumentation Capabilities. Without Them, You Are Guessing.
Standard LLM logging is necessary and not sufficient. Here is what generic monitoring misses entirely.
Latency, token count, API errors — necessary for any production LLM workload, insufficient for agentic incident response. The minimum stack captures four things generic LLM monitoring does not.
Session-level correlation. Every LLM call, every tool invocation, every state mutation must carry the same session_id. Without it, a 40-step session decomposes into 40 disconnected events in your log store. The session ID is the single most important field you can add — and the one most commonly missing from first-generation agent deployments.
Tool call payloads at every step. Not "tool was called with status 200" — the actual input arguments and the actual response payload, including the field structure. Schema drift detection — spotting when a tool response has different field names than expected — is one of the highest-signal automated checks available. It caught the loan agent incident above in under two minutes on replay. The TOOL_SCHEMA_DRIFT flag on step 3 pointed directly to the renamed field. [5]
Step-level anomaly flags. Structured fields that mark specific failure patterns the moment they occur: ENTITY_HALLUCINATION when the model references an entity ID not in prior context, TOOL_SCHEMA_DRIFT when a tool response does not match its documented schema, LOOP_DETECTED when the same tool fires with near-identical arguments three or more times in a session. [6] These flags do not catch everything. They reduce the backward trace from "read 40 events manually" to "check the flagged steps first."
Session outcome classification. Every session ends in one labeled outcome from a small set: Completed, Tool Error, Bad Output, Timeout, Budget Exceeded. The label enables cohort queries — "show me every Bad Output session from the last week" — that single-trace inspection cannot answer. [4]
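A minimal sketch of outcome labeling and the cohort query it enables. The five labels come from the text above; `SessionRecord`, `sessions_with_outcome`, and `outcome_counts` are hypothetical names, and a real deployment would query a trace store rather than an in-memory list.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class SessionOutcome(Enum):
    COMPLETED = "Completed"
    TOOL_ERROR = "Tool Error"
    BAD_OUTPUT = "Bad Output"
    TIMEOUT = "Timeout"
    BUDGET_EXCEEDED = "Budget Exceeded"


@dataclass
class SessionRecord:
    session_id: str
    outcome: SessionOutcome


def sessions_with_outcome(records, outcome):
    """Cohort query: IDs of every session that ended in the given outcome."""
    return [r.session_id for r in records if r.outcome is outcome]


def outcome_counts(records):
    """Outcome distribution, useful for spotting a sudden rise in one label."""
    return Counter(r.outcome for r in records)


# Illustrative records only.
records = [
    SessionRecord("s-001", SessionOutcome.COMPLETED),
    SessionRecord("s-002", SessionOutcome.BAD_OUTPUT),
    SessionRecord("s-003", SessionOutcome.TOOL_ERROR),
    SessionRecord("s-004", SessionOutcome.BAD_OUTPUT),
]
bad = sessions_with_outcome(records, SessionOutcome.BAD_OUTPUT)
```

Using a closed enum rather than free-text labels keeps the cohort query exact: a typo in a label fails at write time instead of silently dropping sessions from the result.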
agent_tracing.py

    import json

    import anthropic
    from opentelemetry import trace

    # entity_ids_not_in_prior_context, execute_tool, validate_tool_schema, and
    # expected_fields (returning a set of field names) are application helpers
    # that must be defined elsewhere; they are not shown here.

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    tracer = trace.get_tracer("agent.session")


    # Wrap every LLM call. gen_ai.* attributes are not optional.
    def traced_llm_call(step: int, session_id: str, model: str, messages: list):
        with tracer.start_as_current_span(f"agent.llm.step.{step}") as span:
            span.set_attribute("gen_ai.system", "anthropic")
            span.set_attribute("gen_ai.request.model", model)
            span.set_attribute("session.id", session_id)
            span.set_attribute("agent.step", step)
            # max_tokens is required by the Messages API; 1024 is a placeholder.
            response = client.messages.create(
                model=model, max_tokens=1024, messages=messages
            )
            span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
            span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
            # Anomaly flag: entity ID in response not present in any prior session message.
            if entity_ids_not_in_prior_context(response.content, messages):
                span.set_attribute("agent.anomaly", "ENTITY_HALLUCINATION")
            return response


    # Tool call args, response, and schema drift. First-class span data, not log lines.
    def traced_tool_call(step: int, session_id: str, tool_name: str, args: dict) -> dict:
        with tracer.start_as_current_span(f"agent.tool.{tool_name}") as span:
            span.set_attribute("tool.name", tool_name)
            span.set_attribute("session.id", session_id)
            span.set_attribute("agent.step", step)
            span.set_attribute("tool.input", json.dumps(args))
            result = execute_tool(tool_name, args)
            # Schema drift: response field names diverge from the documented contract.
            if not validate_tool_schema(tool_name, result):
                span.set_attribute("agent.anomaly", "TOOL_SCHEMA_DRIFT")
                unexpected = set(result.keys()) - expected_fields(tool_name)
                # Sorted so the attribute value is stable across runs.
                span.set_attribute("tool.unexpected_fields", ",".join(sorted(unexpected)))
            span.set_attribute("tool.output_fields", ",".join(result.keys()))
            return result

Required trace fields per LLM call
- ✓ session_id (the session.id span attribute): identical value for every event in the session
- ✓ agent.step: sequential integer starting at 1
- ✓ gen_ai.request.model: exact model version string, not the family name
- ✓ gen_ai.usage.input_tokens and gen_ai.usage.output_tokens
- ✓ agent.anomaly: structured flag field, null when clean, typed string when flagged
- ✓ step_summary: one sentence in plain language describing what the model concluded

Required trace fields per tool call
- ✓ session_id and agent.step: matching the parent LLM call
- ✓ tool.name and tool.input: full args payload, redacted when sensitive
- ✓ tool.status: one of ok, error, empty, schema_drift
- ✓ tool.output_fields: comma-separated field names from the response
- ✓ tool.latency_ms: high latency is a leading indicator for retry loops
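The two field lists above can be enforced mechanically before events reach the trace store. The following is a sketch under the assumption that each trace event is available as a flat dictionary keyed by the field names listed; `missing_fields` and `validate_event` are hypothetical helpers, not part of OpenTelemetry.

```python
REQUIRED_LLM_FIELDS = {
    "session_id", "agent.step", "gen_ai.request.model",
    "gen_ai.usage.input_tokens", "gen_ai.usage.output_tokens",
    "agent.anomaly", "step_summary",
}
REQUIRED_TOOL_FIELDS = {
    "session_id", "agent.step", "tool.name", "tool.input",
    "tool.status", "tool.output_fields", "tool.latency_ms",
}
TOOL_STATUSES = {"ok", "error", "empty", "schema_drift"}


def missing_fields(event: dict, kind: str) -> list:
    """Return the required fields absent from one trace event, sorted."""
    required = REQUIRED_LLM_FIELDS if kind == "llm" else REQUIRED_TOOL_FIELDS
    return sorted(required - set(event))


def validate_event(event: dict, kind: str) -> list:
    """Return a list of problems for one event; an empty list means valid."""
    problems = [f"missing {name}" for name in missing_fields(event, kind)]
    step = event.get("agent.step")
    if step is not None and (not isinstance(step, int) or step < 1):
        problems.append("agent.step must be an integer >= 1")
    status = event.get("tool.status")
    if kind == "tool" and status is not None and status not in TOOL_STATUSES:
        problems.append(f"invalid tool.status: {status!r}")
    return problems


# Illustrative event: latency is missing and the status is not an allowed value.
tool_event = {
    "session_id": "s-001",
    "agent.step": 3,
    "tool.name": "income_verification",
    "tool.input": "{}",
    "tool.status": "drifted",
    "tool.output_fields": "applicant_id,gross_income",
}
problems = validate_event(tool_event, "tool")
```

A key that is present with a null value, such as `agent.anomaly` on a clean step, counts as present, which matches the "null when clean" rule in the list.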
Classify Before You Read. Forensics Without Classification Is Wasted.
From alert to classified failure mode in under five minutes. Every minute beyond that is the agent still affecting production.
Before reading any trace in detail, classify the failure. Classification determines which 20% of the trace you actually need to examine. Getting this wrong costs you the first 30 minutes of every agent incident.
The taxonomy used in this runbook encodes four operationally distinct failure modes. Not because the taxonomy is academically satisfying. Because each one has a different blast radius and a different first action. A cost runaway needs a session kill order before forensics begins. A tool misfire needs a downstream system audit before forensics begins. Starting forensics without classification means an hour spent debugging the wrong thing while the agent continues to affect production.
- [01] Contain the session before you read anything
Kill or pause the active agent process before investigating. A running agent continues taking external actions — sending messages, processing transactions, deleting records — while you read the trace. Containment is not investigation. Do it first, even if it interrupts a session that might have self-corrected. Self-correction is a hope, not a control.
- [02] Capture the complete session trace before logs rotate
Export the full trace before logs rotate out. Every LLM call with prompt and completion. Every tool invocation with arguments and response. Session state at each step. The session trace is your only forensic artifact — unlike deterministic incidents, you cannot reliably reproduce an agent failure in a test environment. Capture now. Reconstruct later.
- [03] Classify on signals, not on instinct
Before reading any individual trace step, look at four high-level signals to classify. Two minutes of work. It determines whether you start with forensics, a downstream audit, or a cost containment call. Skipping it is the most common reason an agent incident takes three hours instead of one — the cost runaway path and the hallucination path are nothing alike, and one of them keeps spending money while you read the other.
- [04] Find the fault origin by working backward, not forward
Do not read the trace sequentially from step 1. Start at the final wrong output and work backward. Ask the inverse question: what information would have to be true for this output to make sense? Then find the earliest step where that information first appeared. That step is the fault origin — not the step that surfaced the wrong output, but the step that introduced the wrong premise everything downstream was built on.
- [05] Map the blast radius and start remediation in parallel
Enumerate every external action the agent took after the fault origin step. For each, decide reversible or irreversible. Sent communications, processed payments, deleted records, modified state — every irreversible action needs an explicit reversal plan, and every plan needs to be documented before the postmortem starts. This is also where you check whether the agent spawned subagents or chained workflows. They need their own containment assessment.
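Steps 04 and 05 can be partly mechanized once the trace is exported. The sketch below assumes a simplified trace of `TraceStep` records (a hypothetical structure) and uses a plain substring match as a stand-in for the human judgment of "where did this premise first appear"; it is an illustration, not a replacement for reading the flagged steps.

```python
from dataclasses import dataclass


@dataclass
class TraceStep:
    step: int
    kind: str             # "llm" or "tool"
    text: str             # completion text or serialized tool response
    action: str = ""      # external side effect; empty when the step had none
    reversible: bool = True


def find_fault_origin(trace, wrong_premise: str):
    """Earliest step whose text contains the premise behind the wrong output."""
    for item in sorted(trace, key=lambda s: s.step):
        if wrong_premise in item.text:
            return item.step
    return None


def blast_radius(trace, origin_step: int):
    """External actions at or after the fault origin, split by reversibility.

    The origin step is included on purpose: a misfiring tool call can be both
    the fault origin and the first external action.
    """
    actions = [s for s in trace if s.step >= origin_step and s.action]
    return {
        "reversible": [s.action for s in actions if s.reversible],
        "irreversible": [s.action for s in actions if not s.reversible],
    }


# Illustrative trace; field names and actions are invented for the example.
trace = [
    TraceStep(1, "llm", "Plan: verify income, then decide"),
    TraceStep(2, "tool", '{"applicant_id": "A-17"}'),
    TraceStep(3, "tool", '{"gross_income_usd": 84000}'),
    TraceStep(4, "llm", "income is null, applicant is not eligible"),
    TraceStep(5, "tool", "decline recorded", action="update_application_status"),
    TraceStep(6, "tool", "email sent", action="send_decline_email", reversible=False),
]
origin = find_fault_origin(trace, "income is null")
radius = blast_radius(trace, origin)
```

Here `origin` is step 4 and the irreversible list contains only the decline email, which is the action that needs an explicit reversal plan.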
Four Failure Modes. Different Blast Radii. Different First Actions.
A reference card for naming what is in front of you so the triage path is correct from minute one.
| Failure Mode | Primary Signal | Best Detection Method | Blast Radius |
|---|---|---|---|
| Hallucination Propagation | Confident, coherent output that is factually wrong; no external action errors | ENTITY_HALLUCINATION anomaly flag or backward trace from final wrong output | Low to medium — informational unless paired with downstream tool calls |
| Tool Misfire | Wrong API called, malformed arguments, or correct API called on wrong entity | Tool call audit log shows unexpected argument pattern or TOOL_SCHEMA_DRIFT flag | High — real-world effects are immediate and irreversible |
| Context Poisoning | Agent's stated goal drifts across turns without user instruction; inconsistent world model | CONTEXT_DRIFT anomaly flag; compare agent's stated objective at step 1 vs. step 20 | Variable — depends on how many actions fired after goal state was corrupted |
| Cost Runaway | Token counter spikes well above baseline; same tool called repeatedly; no final output | LOOP_DETECTED flag or token budget alarm before hard kill; same-tool call frequency spike | Financial — no user-visible wrong output, significant cost exposure per session |
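The table can be turned into a first-pass classifier over session-level signals. This is a rough sketch: the `SessionSignals` fields are drawn from the Primary Signal and Best Detection Method columns, while the check order and the 5x token threshold are assumptions chosen for illustration and should be tuned per deployment.

```python
from dataclasses import dataclass, field


@dataclass
class SessionSignals:
    anomaly_flags: set = field(default_factory=set)
    token_ratio_to_baseline: float = 1.0
    has_final_output: bool = True
    unexpected_tool_args: bool = False


def classify_failure_mode(s: SessionSignals, cost_ratio_threshold: float = 5.0) -> str:
    """Map session-level signals to one of the four failure modes."""
    # Cost runaway is checked first: the session may still be spending money.
    runaway_tokens = (
        s.token_ratio_to_baseline >= cost_ratio_threshold and not s.has_final_output
    )
    if "LOOP_DETECTED" in s.anomaly_flags or runaway_tokens:
        return "cost runaway"
    # Tool misfire next: its effects are external and may be irreversible.
    if "TOOL_SCHEMA_DRIFT" in s.anomaly_flags or s.unexpected_tool_args:
        return "tool misfire"
    if "CONTEXT_DRIFT" in s.anomaly_flags:
        return "context poisoning"
    if "ENTITY_HALLUCINATION" in s.anomaly_flags:
        return "hallucination propagation"
    return "unclassified"


mode = classify_failure_mode(SessionSignals(anomaly_flags={"TOOL_SCHEMA_DRIFT"}))
```

The ordering puts the modes with an urgent first action ahead of the informational ones, so a session that shows several flags is routed to the path where delay costs the most.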
Why Standard Postmortems Fail Agents
Single-root-cause framing produces narrow fixes that prevent the exact failure without addressing the class of failure.
The first time we ran an agent postmortem on a standard SRE template, the conclusion read: "root cause: LLM hallucination." That is about as useful as writing "root cause: gravity" for a structural failure.
Standard templates ask for a single root cause — one line, one changeset, one decision that went wrong. Agent failures do not cooperate with that framing. Every significant agent incident we investigated had at least three contributing factors, each insufficient on its own: a tool response schema changed without notice, a prompt that did not constrain entity references, an eval suite with no coverage for this failure class. Fixing any one of them in isolation would not have prevented the incident. The gravitational pull toward a singular root cause mislabels the problem and produces narrow fixes that block the exact failure without addressing the class.
The agent postmortem template replaces "root cause" with "contributing factors" and refuses to close the incident until all of them are named.
1. Session context — Session ID. Time range. Total LLM calls. Total tokens. A plain-language description of intended session behavior. This grounds every other finding in concrete evidence rather than generalized claims about model behavior.
2. Failure classification — Which of the four failure modes applies. This is not bureaucracy. It determines which prevention layers were absent and which architectural change is actually needed.
3. Fault origin and cascade path — The step number that introduced the wrong premise. The full context window at that step. The cascade path from fault origin to final output, with specific step numbers — not "around step 10" but "step 7."
4. Impact assessment — Every external action taken after the fault origin, classified reversible or irreversible, with a concrete remediation plan for each irreversible one.
5. Contributing factors — Name all three or four. Model limitation? Prompt design gap? Tool interface contract change? Retrieval contamination? List each one explicitly. Refuse to collapse them into a single cause.
6. Detection gap — Why didn't observability catch this before users were affected? Missing anomaly flag? No eval coverage for this failure pattern? No guard rail on the tool call? The detection gap is the most important finding in the entire postmortem. It is the only one that drives an infrastructure change rather than a one-off prompt edit.
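The six sections above can be captured as a structured record that refuses to close while sections are empty. The sketch below is one possible encoding; `AgentPostmortem`, `blockers`, and the minimum of two contributing factors are assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Optional

FAILURE_MODES = {
    "hallucination propagation", "tool misfire",
    "context poisoning", "cost runaway",
}


@dataclass
class AgentPostmortem:
    session_id: str
    failure_mode: str
    fault_origin_step: Optional[int] = None
    contributing_factors: list = field(default_factory=list)
    detection_gap: str = ""
    open_irreversible_actions: list = field(default_factory=list)

    def blockers(self, min_factors: int = 2) -> list:
        """Reasons this postmortem cannot be closed yet; empty means closable."""
        found = []
        if self.failure_mode not in FAILURE_MODES:
            found.append("failure mode not classified")
        if self.fault_origin_step is None:
            found.append("fault origin step not identified")
        if len(self.contributing_factors) < min_factors:
            found.append("contributing factors collapsed into a single cause")
        if not self.detection_gap.strip():
            found.append("detection gap not described")
        if self.open_irreversible_actions:
            found.append("irreversible actions without a remediation plan")
        return found


# A draft that names only one cause and no detection gap stays open.
draft = AgentPostmortem(
    session_id="s-001",
    failure_mode="tool misfire",
    fault_origin_step=3,
    contributing_factors=["LLM hallucination"],
)
open_items = draft.blockers()
```

Returning the list of blockers, rather than a bare boolean, tells the author which section is still missing.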
Agent Postmortem Checklist
Session trace exported and stored before log rotation
Fault origin identified by specific step number
Failure mode classified: hallucination propagation / tool misfire / context poisoning / cost runaway
Full cascade path documented from fault origin to final wrong output
Every external action after fault origin enumerated and classified reversible vs. irreversible
Rollback or remediation initiated and logged for every irreversible action
All contributing factors listed explicitly — no single root cause framing
Detection gap described: what monitoring or eval would have caught this earlier?
New eval regression test added covering this session failure pattern
Alert or anomaly flag configured for this failure signature
Postmortem distributed to platform and on-call teams within 48 hours
Five Anomaly Flags That Pay for Themselves on the First Incident
Structured checks that turn agent debugging from 40-event manual review into targeted forensics.
Anomaly Flag Specifications
ENTITY_HALLUCINATION
Flag when a model response references an entity ID — customer ID, order ID, account number, document reference — not present in any prior message or tool result in the session. Set at LLM call boundaries. High precision, low false-positive rate when entity ID formats are consistent. The single most actionable flag in production agent forensics.
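A minimal sketch of this check, assuming entity IDs follow a consistent prefixed format. The pattern and the `CUST`/`ORD`/`ACCT`/`DOC` prefixes are invented for illustration; the real pattern depends on the ID formats in your own system.

```python
import re

# Assumed ID formats for illustration, e.g. "CUST-10482" or "ORD-77310".
ENTITY_ID_PATTERN = re.compile(r"\b(?:CUST|ORD|ACCT|DOC)-\d+\b")


def extract_entity_ids(text: str) -> set:
    """All entity IDs that appear in a piece of text."""
    return set(ENTITY_ID_PATTERN.findall(text))


def unseen_entity_ids(response_text: str, prior_texts: list) -> set:
    """Entity IDs in the response that no prior message or tool result contains."""
    seen = set()
    for text in prior_texts:
        seen |= extract_entity_ids(text)
    return extract_entity_ids(response_text) - seen


prior = [
    "Look up order ORD-77310 for customer CUST-10482",
    '{"order_id": "ORD-77310", "status": "shipped"}',
]
response = "Order ORD-77310 shipped; refund issued on ORD-77319."
flagged = unseen_entity_ids(response, prior)
```

A non-empty result is the condition for setting ENTITY_HALLUCINATION on the step. Precision depends on the pattern being strict enough that ordinary text does not match it.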
TOOL_SCHEMA_DRIFT
Flag when a tool response contains field names not present in the tool's documented schema from the last validated session. Hash the response key set and compare against the expected key set per tool. High signal for API contract changes. Fires before the hallucination has time to propagate — caught at the tool call boundary, not in the model output downstream.
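The key-set comparison can be sketched as follows. This version looks at top-level field names only and treats missing fields as drift too, which goes slightly beyond the definition above; the field names in the example are hypothetical, since the renamed field in the loan incident is not named here.

```python
import hashlib


def key_set_fingerprint(keys) -> str:
    """Stable hash of a set of field names, independent of key order."""
    joined = "\n".join(sorted(keys))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def detect_schema_drift(expected_keys: set, response: dict) -> dict:
    """Compare a tool response's top-level keys against the expected key set."""
    actual = set(response)
    # Cheap equality check first; the set differences explain any mismatch.
    drifted = key_set_fingerprint(actual) != key_set_fingerprint(expected_keys)
    return {
        "drift": drifted,
        "unexpected_fields": sorted(actual - expected_keys),
        "missing_fields": sorted(expected_keys - actual),
    }


# Hypothetical contract and a response where one field was renamed.
expected = {"applicant_id", "annual_income"}
report = detect_schema_drift(
    expected, {"applicant_id": "A-17", "gross_annual_income": 84000}
)
```

Storing the fingerprint per tool makes the steady-state check a single string comparison; the field-level diff is only needed when the fingerprints disagree. Nested objects would need a recursive walk, which this sketch omits.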
LOOP_DETECTED
Flag when the same tool fires more than twice in a single session with arguments that are at least 90% similar, for example by comparing canonicalized argument strings or similarity-preserving hashes. Precursor flag for cost runaway and context poisoning. Fires early enough to pause-and-escalate before token spend compounds. The threshold is intentionally low: the flag fires on the third near-identical call, because legitimate agents rarely need to run the same query three times.
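One way to implement the near-identical check is sketched below. It uses `difflib.SequenceMatcher` on canonical JSON as the similarity measure, which is an assumption standing in for whatever similarity hashing a production system uses; `LoopDetector` is a hypothetical name.

```python
import json
from collections import defaultdict
from difflib import SequenceMatcher


class LoopDetector:
    """Tracks tool calls in one session and flags near-identical repeats."""

    def __init__(self, similarity: float = 0.90, max_similar_calls: int = 2):
        self.similarity = similarity
        self.max_similar_calls = max_similar_calls
        self._history = defaultdict(list)  # tool name -> canonical arg strings

    def record(self, tool_name: str, args: dict) -> bool:
        """Record a call; return True when LOOP_DETECTED should be set."""
        canonical = json.dumps(args, sort_keys=True)
        previous = self._history[tool_name]
        similar = sum(
            1 for earlier in previous
            if SequenceMatcher(None, earlier, canonical).ratio() >= self.similarity
        )
        previous.append(canonical)
        # 'similar' counts earlier near-identical calls; +1 includes this one.
        return similar + 1 > self.max_similar_calls


detector = LoopDetector()
first = detector.record("search", {"query": "income for applicant A-17"})
second = detector.record("search", {"query": "income for applicant A-17?"})
third = detector.record("search", {"query": "income for applicant A-17"})
```

With the defaults, the first two calls pass and the third near-identical call sets the flag. Each call is compared against every earlier call to the same tool, so the cost grows with session length; that is acceptable for sessions of tens of calls.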
CONTEXT_DRIFT
Flag when the model's stated objective in chain-of-thought output diverges from the original system prompt task by more than a threshold semantic distance. Requires embedding comparison — more expensive than the other flags. Run it at session mid-point checkpoints rather than every step to keep overhead manageable. Misses some context poisoning cases. Catches the severe ones early.
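A sketch of the comparison, assuming embeddings for the original task and the current stated objective are already available from an embedding model of your choice. The 0.35 distance threshold and the 10-step checkpoint interval are placeholder assumptions, and the toy vectors below are not real embeddings.

```python
import math


def cosine_distance(a, b) -> float:
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / norm


def context_drift(task_embedding, objective_embedding, threshold: float = 0.35) -> bool:
    """True when the stated objective has moved too far from the original task."""
    return cosine_distance(task_embedding, objective_embedding) > threshold


def checkpoint_steps(total_steps: int, every: int = 10) -> list:
    """Steps at which to run the (relatively expensive) drift check."""
    return list(range(every, total_steps + 1, every))


task = [1.0, 0.0, 0.0]
on_task = [0.9, 0.1, 0.0]
drifted = [0.1, 0.9, 0.2]
```

Running the check only at `checkpoint_steps(40)` means four embedding comparisons for a 40-step session instead of forty, at the price of detecting drift up to one interval late.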
BUDGET_WARNING
Soft warning at 60% of the session token budget, separate from the hard kill at 100%. The point is intervention time: the remaining 40% of the budget gives the platform team room to investigate and pause before the session terminates abruptly. An agent killed at the hard limit produces an incomplete trace that is harder to debug than one paused on BUDGET_WARNING.
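A minimal sketch of the two-threshold budget. `TokenBudget` and the `BUDGET_EXCEEDED` return value are hypothetical names; only the 60% soft warning and the 100% hard limit come from the description above.

```python
class TokenBudget:
    """Per-session token budget with a soft warning and a hard limit."""

    def __init__(self, limit: int, warn_fraction: float = 0.60):
        self.limit = limit
        self.warn_at = round(limit * warn_fraction)
        self.used = 0
        self.warned = False

    def add(self, tokens: int) -> str:
        """Add usage; return "ok", "BUDGET_WARNING" (once), or "BUDGET_EXCEEDED"."""
        self.used += tokens
        if self.used >= self.limit:
            return "BUDGET_EXCEEDED"
        if self.used >= self.warn_at and not self.warned:
            self.warned = True
            return "BUDGET_WARNING"
        return "ok"


budget = TokenBudget(limit=100_000)
states = [budget.add(n) for n in (50_000, 15_000, 10_000, 30_000)]
```

The warning is emitted once, so the caller can page or pause on the first crossing without being flooded on every later call.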
When does an agent incident warrant a full postmortem versus a bug fix?
Escalate to a full postmortem when the agent took an irreversible external action, when per-session cost exceeded 5x baseline, or when the same failure pattern appeared in more than one session. A single hallucination that produced wrong text without triggering tool calls — and was contained to one session — is a bug: add an eval test, update the prompt, move on. The postmortem process exists for failures that expose architectural gaps. The practical test: if the detection gap finding requires changing something in the infrastructure (anomaly flag, schema validation, guard rail) rather than just a prompt edit, it needs a postmortem.
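The escalation rule above is simple enough to encode directly, which keeps the decision consistent across on-call engineers. The function name and parameters below are hypothetical; the three conditions are the ones stated in the answer.

```python
def needs_full_postmortem(
    irreversible_action: bool,
    cost_ratio_to_baseline: float,
    sessions_affected: int,
) -> bool:
    """True when any one of the three escalation conditions holds."""
    return (
        irreversible_action
        or cost_ratio_to_baseline > 5.0
        or sessions_affected > 1
    )


# A single contained hallucination with no tool calls is handled as a bug.
contained_bug = needs_full_postmortem(False, 1.2, 1)
# The same pattern showing up in three sessions is escalated.
repeated_pattern = needs_full_postmortem(False, 1.2, 3)
```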
Can my existing APM tool (Datadog, Grafana, New Relic) handle agent observability?
It handles infrastructure-level metrics — latency, error rates, cost per session — well. It does not handle behavioral correctness: whether the agent chose the right tool, whether its output was factually accurate, whether its goal stayed consistent across turns. LLM observability is fundamentally semantic, not syntactic. A clean HTTP 200 from an LLM API can carry a hallucinated fact, and no latency graph detects it. For behavioral correctness, you need OpenTelemetry spans with gen_ai.* attributes plus a semantic eval layer — Langfuse, Arize Phoenix, or a custom eval harness that runs assertions against session traces. The APM stack stays for infrastructure. It is not a substitute for session-level behavioral monitoring.
How do I instrument an agent built on a third-party framework like LangGraph, LlamaIndex, or AG2?
Most major frameworks now ship native OpenTelemetry support. AG2 has built-in OTel tracing that captures agent turns, LLM calls, tool executions, and speaker selections as structured spans connected by a shared trace ID, exportable to any OTel-compatible backend. LlamaIndex and LangChain support OpenInference instrumentation, which follows the OpenTelemetry GenAI semantic conventions. The minimum is one trace per session with step numbers, gen_ai.* attributes on LLM calls, and tool call input/output payloads. If your framework does not emit these natively, add a thin wrapper at the LLM call boundary — two functions, under 50 lines.
What is the minimum instrumentation before shipping an agent to production for the first time?
Three things, in priority order. A session correlation ID on every LLM call and every tool call — without this you cannot reconstruct what happened, full stop. Tool call input and output logging at every step — most production agent failures trace to a tool interface problem, and you cannot debug it without the actual payloads. A session outcome label that resolves at end-of-session: Completed, Tool Error, Bad Output, Timeout, Budget Exceeded. With these three in place, you can debug most production incidents. Anomaly flags, semantic evals, and replay harnesses are force multipliers on top. Without correlation IDs, tool payloads, and outcome labels, your first production incident will be undebuggable regardless of what else is in the stack.
On the benchmark statistics cited in this article
The hallucination probability curves (73% at 5+ tool calls, 71% front-loaded errors) come from analysis across controlled benchmark evaluations — AgentBench 2025, HELM Agentic Evaluation, and Stanford HAI Q4 2025 data — as reported by Markaicode (Feb 2026) [1]. The 89%/52% observability gap comes from Tianpan.co's systematic debugging article (Feb 2026) [2], citing early 2026 survey data. Real-world production rates depend on input noise, schema consistency, and retrieval quality. Treat them as directional benchmarks, not engineering thresholds you cite in an SLO.
- [1] Debugging Hallucinations: New Tools for Tracing Agent Logic. Markaicode (Feb 2026). markaicode.com
- [2] Systematic Debugging for AI Agents: From Guesswork to Root Cause. Tianpan.co (Feb 2026). tianpan.co
- [3] The Complete Guide to Debugging AI Agents in Production. Latitude (Mar 2026). latitude.so
- [4] Debugging AI Agent Failures in Production. Warpmetrics (Feb 2026). warpmetrics.com
- [5] Distributed Tracing for Agentic Workflows with OpenTelemetry. Red Hat Developer (Apr 2026). developers.redhat.com
- [6] AI Agent Observability - Evolving Standards and Best Practices. OpenTelemetry (2025). opentelemetry.io
- [7] AG2 OpenTelemetry Tracing: Full Observability for Multi-Agent Systems (Feb 2026). docs.ag2.ai
- [8] AI Agent Observability: Tracing & Debugging LLM Agents in Production. Md Sanwar Hossain (Mar 2026). mdsanwarhossain.me