Most agent failures return HTTP 200. The dashboard stays green while the reasoning chain quietly compounds the wrong premise. This is a triage runbook, failure-mode field guide, observability instrumentation spec, and postmortem template for non-deterministic systems.
At 2:14am the loan-eligibility agent returned HTTP 200. No exceptions. No alerts. The income-verification tool logged status: ok. Over the next six hours the agent rejected 847 valid applications because a third-party API had quietly renamed a JSON field overnight. The model received a key it had never seen, hallucinated null for income, and declined every pending case. [8]
This is the failure class that agentic systems expose. Infrastructure healthy. Dashboard green. The fault is semantic: it is distributed across a reasoning chain where one wrong interpretation at step 3 shapes every decision 40 calls deep. No exception to read. No line number to fix.
This runbook gives you five things: a failure-mode taxonomy to classify against before reading any trace, the minimum observability stack that catches silent failures before users do, the gen_ai semantic conventions that are now the industry standard for agent instrumentation, a five-step triage sequence that works when the error log is empty, and a postmortem template that doesn't pretend non-determinism is a typo.
Traditional SRE asks what crashed. Agentic SRE asks something harder: which step produced the wrong premise, and how far did it travel before anyone noticed?
Four distinct failure modes with different blast radii and first actions
Minimum instrumentation spec: session IDs, tool payloads, anomaly flags, outcome labels
OpenTelemetry gen_ai semantic conventions — exact attribute names for agent and tool spans
Five-step triage sequence from alert to fault origin
Postmortem template that replaces single root-cause framing with contributing-factor analysis
Decision table: when a fix is a prompt edit vs. when it requires a postmortem
73% hallucination probability for agent tasks involving 5+ tool calls, per Stanford HAI Q4 2025 evaluation analysis reported by Markaicode. [1]
71% of errors are front-loaded into query reformulation and initial retrieval. Errors are not distributed evenly across the chain. [1]
89% of teams running agents in production have observability; only 52% have evals that validate behavior. [2]
Gemini-2.5-Pro scored 11% on 148 human-annotated agent traces with 841 total errors. Long context alone doesn't solve this. [9]
Traditional observability was built for systems that fail by exiting nonzero. Agents fail by completing successfully.
The most dangerous agent failures are the ones your monitoring stack calls successes.
Latency histograms, error rates, CPU graphs — every one of those signals was designed for deterministic systems where failure means the process exited nonzero. Agents break that assumption. A clean HTTP 200 from an LLM API is not evidence of correct behavior. Valid JSON from a tool is not evidence the agent interpreted it correctly. A formatted final response is not evidence the response is true. [2]
The canonical pattern: a third-party tool changes its response schema. No exception fires. The JSON is still valid. The model receives a field name absent from training and, instead of escalating, generates a plausible value from prior context. That value enters the session state, gets cited in the next three calls, and by the time the user sees the output the entire reasoning chain has been built on a hallucinated foundation. None of it shows up in your Datadog dashboard. [8]
In early 2026, 89% of teams running agents in production had implemented some form of agent observability. Only 52% had eval pipelines that actually validated whether the agent's behavior was correct, not just syntactically complete. [2] The 37-point gap between "we have traces" and "we catch behavioral failures" is where production incidents live.
The model is rarely the root cause. The dominant failure is a boundary failure: a tool returned partial JSON, retrieval pulled the wrong chunk, a planner looped, an API contract changed without notice. [4] By category, tool-use and agentic prompt-injection incidents were negligible in H2 2025 and now account for meaningful shares of the H1 2026 incident dataset. [12] If the LLM's reasoning is the actual root cause, your team got unlucky. If the tool interface failed, that's just production.
Alert fires on 5xx error or p99 latency spike
Read the exception message and the stack trace
Identify the failing line of code
Reproduce deterministically in staging
Fix the bug, write the unit test, deploy
Close when error rate returns to baseline
Alert fires on semantic anomaly, cost spike, or user report — often hours after the fault
Find the session trace; no exception exists to read
Identify the fault-origin step inside the reasoning chain
Accept that exact reproduction is probabilistically unlikely
Fix prompt, tool schema contract, or memory architecture
Close after a regression eval catches this pattern in future sessions
Failures in a 40-call session are not distributed evenly. The data points the other way — and so does where you should start the trace.
The intuitive assumption is that a 40-call session distributes failures proportionally across all steps. The evidence contradicts it.
Across 1,200 logged agent runs with verified hallucinations, analysis reported from multiple benchmark evaluations found 71% of errors were introduced in the first two steps — typically during query reformulation or initial retrieval. [1] The final output looks like a complex multi-step failure. Trace it backward and the root cause is almost always an error in how the task was understood or how the first retrieval was executed.
This changes the triage path. When staring at a 40-step session, start at steps 1–3. Read what the agent understood the user to be asking. Read what came back from the first retrieval or tool call. Read whether the initial plan was sound. If those steps are clean, the problem is unlikely to be a fundamental reasoning failure — it's more likely a specific tool boundary issue later in the chain.
The TRAIL benchmark from Patronus AI puts a sharper number on the debugging difficulty: 148 human-annotated agent execution traces, 841 total errors, and the best available model (Gemini-2.5-Pro) scored 11% joint accuracy at identifying fault origins. [9] Modern long-context LLMs are not yet reliable debuggers of their own trace output. They'll miss most of what you need them to find. That's not an argument against automated triage — it's a specific reason automated triage needs structured anomaly flags rather than asking the model to self-diagnose from raw text.
Hallucination probability scales nonlinearly with tool call count. Across agentic benchmark evaluations, the probability of at least one hallucination in a run climbs from roughly 12% at 2 tool calls, to 67% at 10, to above 85% at 15 or more. [1] This is not an argument against complex agents. It's the structural reason any agentic workflow requiring 10+ tool calls is production-critical infrastructure and needs dedicated semantic observability — not standard LLM monitoring with extra fields.
Standard LLM logging is necessary and not sufficient. Here is what generic monitoring misses entirely.
Latency, token count, API errors — necessary for any production LLM workload, insufficient for agentic incident response. The minimum stack captures four things generic LLM monitoring doesn't.
Session-level correlation. Every LLM call, every tool invocation, every state mutation must carry the same session_id. Without it, a 40-step session decomposes into 40 disconnected events in your log store. The session ID is the single most important field you can add — and the one most commonly missing from first-generation agent deployments.
Tool call payloads at every step. Not "tool was called with status 200" — the actual input arguments and the actual response payload, including the field structure. Schema drift detection — spotting when a tool response has different field names than expected — is one of the highest-signal automated checks available. The AgentTelemetry benchmark found that vanilla OpenTelemetry misses 57% of agent faults (FDR 0.429) while a comprehensive span taxonomy covering agent-specific spans achieves a fault detection rate of 1.000 across 14 fault types. [10] The gap comes from exactly the missing tool payload and planning-layer spans.
Step-level anomaly flags. Structured fields that mark specific failure patterns the moment they occur: ENTITY_HALLUCINATION when the model references an entity ID not in prior context, TOOL_SCHEMA_DRIFT when a tool response doesn't match its documented schema, LOOP_DETECTED when the same tool fires with near-identical arguments three or more times. [6] These flags don't catch everything. They reduce the backward trace from "read 40 events manually" to "check the flagged steps first."
Session outcome classification. Every session ends in one labeled outcome from a small set: Completed, Tool Error, Bad Output, Timeout, Budget Exceeded. The label enables cohort queries — "show me every Bad Output session from the last week" — that single-trace inspection can't answer. [4]
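A minimal sketch of outcome labeling and the cohort query it enables, assuming sessions are stored as plain in-memory records. `SessionRecord` and `cohort` are hypothetical names for illustration, not part of any platform's API; the five labels are the ones listed above.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    COMPLETED = "Completed"
    TOOL_ERROR = "Tool Error"
    BAD_OUTPUT = "Bad Output"
    TIMEOUT = "Timeout"
    BUDGET_EXCEEDED = "Budget Exceeded"


@dataclass
class SessionRecord:
    session_id: str      # same ID carried by every call in the session
    outcome: Outcome     # exactly one label, resolved at end of session
    total_tokens: int


def cohort(sessions, outcome):
    """Return every session that ended with the given outcome label."""
    return [s for s in sessions if s.outcome is outcome]


# Hypothetical sample data.
sessions = [
    SessionRecord("s-001", Outcome.COMPLETED, 4200),
    SessionRecord("s-002", Outcome.BAD_OUTPUT, 9100),
    SessionRecord("s-003", Outcome.BAD_OUTPUT, 8700),
]
print([s.session_id for s in cohort(sessions, Outcome.BAD_OUTPUT)])
```

In a real deployment the same filter would run in your log store or observability backend; the point is that the label has to exist on every session before the query is possible.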
For platform selection: the six platforms anchoring the 2026 space are LangSmith (deepest LangChain integration), Langfuse (open-source leader, self-hostable, acquired by ClickHouse in January 2026), Arize Phoenix (ML-grade drift detection and embeddings analysis, OpenTelemetry-native via OpenInference), Helicone (drop-in proxy, simplest install), Datadog LLM Observability (enterprise default for Datadog shops), and Honeycomb LLM Observability (event-based deep tracing). [3] All support the OTel gen_ai semantic conventions. Pick the one that fits your data residency and cost model, not the one with the best marketing.
As of 2026, major frameworks and vendors converge on OpenTelemetry gen_ai attributes. Custom attribute naming is now a compatibility tax.
OpenTelemetry's GenAI semantic conventions reached broad adoption in 2025-2026, with LangChain, CrewAI, AutoGen, AG2, and LlamaIndex all emitting OTel-compliant spans either natively or via thin instrumentation packages. [11] The conventions define four span categories: LLM client spans, agent spans, tool/function execution spans, and events (for capturing prompt and completion content).
The practical benefit for incident response: if every agent, tool, and LLM call in your system emits standardized attributes, any OTel-compatible backend — Jaeger, Datadog, Honeycomb, Arize Phoenix — can reconstruct the full session trace without custom parsers. Agent spans carry gen_ai.agent.id, gen_ai.agent.name, and gen_ai.conversation.id. LLM call spans carry gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons. Tool execution spans wrap each tool call in a child span with its input arguments and response payload. [11]
The gap the conventions don't yet cover: five critical agent orchestration phases — planning, reasoning, safety monitoring, inter-agent delegation, and memory management — still lack standardized span-level representation. [10] Until that gap closes, you need custom attributes for those phases. The AgentTelemetry benchmark suggests those custom spans are the difference between catching 43% and 100% of fault types. They're not optional for production workflows.
gen_ai.conversation.id — identical across every span in the session (OTel standard)
gen_ai.request.model — exact model version string, not the family name (OTel standard)
gen_ai.usage.input_tokens and gen_ai.usage.output_tokens (OTel standard)
gen_ai.response.finish_reasons — why the model stopped: stop, tool_calls, max_tokens (OTel standard)
agent.step — sequential integer starting at 1 (custom until OTel covers orchestration)
agent.anomaly — null when clean, typed flag string when triggered (custom)
gen_ai.tool.name — exact tool identifier (OTel standard)
gen_ai.conversation.id and agent.step — matching parent LLM call (OTel standard + custom)
tool.input — full args payload, redacted when sensitive (custom, not yet OTel standard)
tool.output_fields — comma-separated field names from response (custom schema drift detection)
tool.latency_ms — high latency is a leading indicator for retry loops (custom)
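The attribute lists above can be sketched as two helpers that build the attribute set for an LLM-call span and a tool span. This deliberately uses plain dictionaries rather than the OpenTelemetry SDK, so it only shows which keys go where; the function names, the tool name, and the sample values are assumptions for illustration, and redaction of sensitive inputs is left out.

```python
import json


def llm_span_attributes(conversation_id, step, model, input_tokens,
                        output_tokens, finish_reasons, anomaly=None):
    """Attributes for one LLM call: gen_ai.* keys plus custom agent.* keys."""
    return {
        "gen_ai.conversation.id": conversation_id,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
        "agent.step": step,        # custom: sequential integer starting at 1
        "agent.anomaly": anomaly,  # custom: None when clean, flag string otherwise
    }


def tool_span_attributes(conversation_id, step, tool_name, tool_input,
                         tool_output, latency_ms):
    """Attributes for one tool call, keyed to the parent LLM call's step."""
    return {
        "gen_ai.tool.name": tool_name,
        "gen_ai.conversation.id": conversation_id,
        "agent.step": step,
        "tool.input": json.dumps(tool_input, sort_keys=True),
        "tool.output_fields": ",".join(sorted(tool_output)),
        "tool.latency_ms": latency_ms,
    }


# Hypothetical tool call at step 3 of a session.
attrs = tool_span_attributes(
    "conv-42", 3, "verify_income",
    {"applicant_id": "A-17"},
    {"status": "ok", "income_usd": 82000},
    latency_ms=12.5,
)
print(attrs["tool.output_fields"])  # income_usd,status
```

With a real SDK these dictionaries would be set as span attributes on the corresponding spans; sorting the field names keeps `tool.output_fields` stable so it can be compared across sessions.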
From alert to classified failure mode in under five minutes. Every minute beyond that is the agent still affecting production.
Before reading any trace in detail, classify the failure. Classification determines which 20% of the trace you actually need to examine. Getting this wrong costs you the first 30 minutes of every agent incident.
The taxonomy used in this runbook encodes the four operationally distinct failure modes. Not because the taxonomy is academically satisfying, but because each one has a different blast radius and a different first action. A cost runaway needs a session kill order before forensics begins. A tool misfire needs a downstream system audit before forensics begins. Starting forensics without classification means an hour spent debugging the wrong thing while the agent continues to affect production.
Kill or pause the active agent process before investigating. A running agent continues taking external actions — sending messages, processing transactions, deleting records — while you read the trace. Containment is not investigation. Do it first, even if it interrupts a session that might have self-corrected. Self-correction is a hope, not a control.
Export the full trace before logs rotate out. Every LLM call with prompt and completion. Every tool invocation with arguments and response. Session state at each step. The session trace is your only forensic artifact — unlike deterministic incidents, you cannot reliably reproduce an agent failure in a test environment. Capture now. Reconstruct later.
Before reading any individual trace step, look at four high-level signals to classify. Two minutes of work. It determines whether you start with forensics, a downstream audit, or a cost containment call. Skipping this is the most common reason an agent incident takes three hours instead of one.
Don't read the trace sequentially from step 1. Start at the final wrong output and work backward. Ask the inverse question: what information would have to be true for this output to make sense? Then find the earliest step where that information first appeared. That step is the fault origin — not the step that surfaced the wrong output, but the step that introduced the wrong premise everything downstream was built on.
Enumerate every external action the agent took after the fault origin step. For each, decide reversible or irreversible. Sent communications, processed payments, deleted records, modified state — every irreversible action needs an explicit reversal plan, and every plan needs to be documented before the postmortem starts. Check whether the agent spawned subagents or chained workflows — they need their own containment assessment.
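The last two steps can be sketched together: find the earliest step where the wrong premise appears, then list the irreversible actions that fired after it. This assumes an exported trace is a list of dictionaries with a `step` number and a JSON-serializable `payload`; the trace shape, the tool names, and the substring match are simplifications for illustration, not a prescribed format.

```python
import json

# Hypothetical tool names treated as irreversible for this example.
IRREVERSIBLE_TOOLS = {"send_email", "process_payment", "delete_record"}


def find_fault_origin(trace, wrong_premise):
    """Return the earliest step whose payload contains the wrong premise."""
    for event in sorted(trace, key=lambda e: e["step"]):
        if wrong_premise in json.dumps(event["payload"], sort_keys=True):
            return event["step"]
    return None


def irreversible_actions_after(trace, origin_step):
    """List tool actions after the fault origin that need a reversal plan."""
    return [
        e for e in trace
        if e["step"] > origin_step and e.get("tool") in IRREVERSIBLE_TOOLS
    ]


# Hypothetical four-step session: the premise "income is null" first
# appears in the model state at step 3 and drives the action at step 4.
trace = [
    {"step": 1, "payload": {"user_request": "check eligibility for A-17"}},
    {"step": 2, "tool": "verify_income", "payload": {"gross_income": 82000}},
    {"step": 3, "payload": {"model_state": {"income": None}}},
    {"step": 4, "tool": "send_email", "payload": {"decision": "declined"}},
]
origin = find_fault_origin(trace, '"income": null')
print(origin)  # 3
print([e["step"] for e in irreversible_actions_after(trace, origin)])  # [4]
```

A substring match is crude. In practice you would match on structured fields, but the shape of the question stays the same: where did this premise first enter the session, and what happened afterward?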
A reference card for naming what is in front of you so the triage path is correct from minute one.
| Failure Mode | Primary Signal | Best Detection Method | Blast Radius | First Action |
|---|---|---|---|---|
| Hallucination Propagation | Confident, coherent output that is factually wrong; no external action errors | ENTITY_HALLUCINATION anomaly flag or backward trace from final wrong output | Low to medium: informational unless paired with downstream tool calls | Backward trace from output; add eval test; no downstream audit needed unless tool calls fired |
| Tool Misfire | Wrong API called, malformed arguments, or correct API called on wrong entity | Tool call audit log shows unexpected argument pattern or TOOL_SCHEMA_DRIFT flag | High: real-world effects are immediate and often irreversible | Downstream system audit before forensics; map every write/send/delete since fault origin |
| Context Poisoning | Agent's stated goal drifts across turns without user instruction; inconsistent world model | CONTEXT_DRIFT anomaly flag; compare agent's stated objective at step 1 vs. step 20 | Variable: depends on how many actions fired after goal state was corrupted | Find the drift inflection point; check for injected instructions in tool responses or retrieval |
| Cost Runaway | Token counter spikes well above baseline; same tool called repeatedly; no final output | LOOP_DETECTED flag or token budget alarm before hard kill; same-tool call frequency spike | Financial: no user-visible wrong output, significant cost exposure per session | Kill session immediately; set project-level token cap; audit for loop trigger before restarting |
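The reference card can be encoded as a lookup from anomaly flag to failure mode and first action. The mapping below is taken from the table; the urgency ordering used when several flags fire in one session is an assumption of this sketch (containment-type actions first), and `classify` is a hypothetical helper name.

```python
# Flag -> (failure mode, first action), abbreviated from the reference card.
TRIAGE = {
    "ENTITY_HALLUCINATION": ("Hallucination Propagation",
                             "Backward trace from output; add eval test"),
    "TOOL_SCHEMA_DRIFT": ("Tool Misfire",
                          "Downstream system audit before forensics"),
    "CONTEXT_DRIFT": ("Context Poisoning",
                      "Find the drift inflection point"),
    "LOOP_DETECTED": ("Cost Runaway",
                      "Kill session immediately"),
}

# Assumed ordering when more than one flag is present in a session.
URGENCY = ["LOOP_DETECTED", "TOOL_SCHEMA_DRIFT",
           "CONTEXT_DRIFT", "ENTITY_HALLUCINATION"]


def classify(flags):
    """Return the most urgent classification for a set of flags, or None."""
    for flag in URGENCY:
        if flag in flags:
            mode, action = TRIAGE[flag]
            return {"flag": flag, "failure_mode": mode, "first_action": action}
    return None


print(classify({"ENTITY_HALLUCINATION", "LOOP_DETECTED"}))
```

A session with no flags returns `None`, which is itself a finding: the failure was detected some other way, and that belongs in the detection-gap section of the postmortem.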
Single-root-cause framing produces narrow fixes that prevent the exact failure without addressing the class of failure.
The first time we ran an agent postmortem on a standard SRE template, the conclusion read: "root cause: LLM hallucination." That's about as useful as writing "root cause: gravity" for a structural failure.
Standard templates ask for a single root cause — one line, one changeset, one decision that went wrong. Agent failures don't cooperate with that framing. Every significant agent incident had at least three contributing factors, each insufficient on its own: a tool response schema changed without notice, a prompt that didn't constrain entity references, an eval suite with no coverage for this failure class. Fixing any one of them in isolation wouldn't have prevented the incident. The gravitational pull toward a singular root cause mislabels the problem and produces narrow fixes that block the exact failure without addressing the class.
The agent postmortem template replaces "root cause" with "contributing factors" and refuses to close the incident until all of them are named.
1. Session context — Session ID. Time range. Total LLM calls. Total tokens. A plain-language description of intended session behavior. This grounds every finding in concrete evidence rather than generalized claims about model behavior.
2. Failure classification — Which of the four failure modes applies. This determines which prevention layers were absent and which architectural change is actually needed.
3. Fault origin and cascade path — The specific step number that introduced the wrong premise. The full context window at that step. The cascade path from fault origin to final output, with specific step numbers — not "around step 10" but "step 7."
4. Impact assessment — Every external action taken after the fault origin, classified reversible or irreversible, with a concrete remediation plan for each irreversible one.
5. Contributing factors — Name all three or four. Model limitation? Prompt design gap? Tool interface contract change? Retrieval contamination? List each explicitly. Refuse to collapse them into a single cause.
6. Detection gap — Why didn't observability catch this before users were affected? Missing anomaly flag? No eval coverage for this failure pattern? No guard rail on the tool call? The detection gap is the most important finding in the entire postmortem. It's the only one that drives an infrastructure change rather than a one-off prompt edit.
Structured checks that turn agent debugging from 40-event manual review into targeted forensics.
Flag when a model response references an entity ID — customer ID, order ID, account number, document reference — not present in any prior message or tool result in the session. Set at LLM call boundaries. High precision, low false-positive rate when entity ID formats are consistent. The single most actionable flag in production agent forensics. If this fires at step 3 of a 40-step session, you don't need to read steps 4–40.
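A minimal sketch of this check, assuming entity IDs follow a small set of consistent formats. The `CUST-`, `ORD-`, and `ACCT-` prefixes and the function name are made up for the example; substitute your own ID patterns.

```python
import re

# Hypothetical ID formats: prefix, hyphen, four or more digits.
ENTITY_ID = re.compile(r"\b(?:CUST|ORD|ACCT)-\d{4,}\b")


def unseen_entity_ids(response_text, prior_context):
    """Return entity IDs in the response that appear nowhere in prior context.

    prior_context is every earlier message and tool result in the session,
    as strings. A non-empty result is what would set ENTITY_HALLUCINATION.
    """
    seen = set()
    for text in prior_context:
        seen.update(ENTITY_ID.findall(text))
    return sorted(set(ENTITY_ID.findall(response_text)) - seen)


prior = [
    "Look up order ORD-10423 for customer CUST-0091",
    '{"order_id": "ORD-10423", "status": "shipped"}',
]
response = "Order ORD-10423 shipped; refund issued to ACCT-55512."
print(unseen_entity_ids(response, prior))  # ['ACCT-55512']
```

The check is only as good as the ID patterns. It catches fabricated identifiers, not fabricated facts about real identifiers.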
Flag when a tool response contains field names not present in the tool's documented schema from the last validated session. Hash the response key set and compare against the expected key set per tool. High signal for API contract changes. Fires before the hallucination has time to propagate — caught at the tool call boundary, not in the model output downstream. Would have caught the loan-agent income-field rename in under two minutes.
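The key-set comparison can be sketched in a few lines. This version only looks at top-level keys of a JSON object response; the field names are hypothetical, since the renamed field in the loan-agent incident is not specified here.

```python
import hashlib


def key_set_hash(keys):
    """Stable hash of a set of field names, independent of key order."""
    canonical = ",".join(sorted(keys))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def schema_drift(expected_keys, response):
    """Return a drift report when the response keys differ, else None."""
    if key_set_hash(expected_keys) == key_set_hash(response.keys()):
        return None
    return {
        "flag": "TOOL_SCHEMA_DRIFT",
        "unexpected": sorted(set(response) - set(expected_keys)),
        "missing": sorted(set(expected_keys) - set(response)),
    }


# Hypothetical expected schema and a response after an upstream rename.
expected = {"status", "income"}
drifted = {"status": "ok", "gross_income": 82000}
print(schema_drift(expected, drifted))
```

The hash is a cheap equality check you can store per tool; the set difference is what goes into the alert so the on-call engineer sees which field changed.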
Flag when the same tool fires with arguments that are at least 90% similar more than twice in a single session, that is, on the third near-identical call. Precursor flag for cost runaway and context poisoning. Fires early enough to pause-and-escalate before token spend compounds. The threshold is intentionally low: legitimate agents rarely need to run the same query three times. Adjust per tool if you have high-frequency idempotent tools like health checks.
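A minimal sketch of loop detection. It substitutes a plain string-similarity ratio from Python's `difflib` for whatever similarity measure your stack uses, so treat the 0.90 threshold as illustrative under that measure; the function name and call format are assumptions.

```python
import json
from difflib import SequenceMatcher


def loop_detected(calls, threshold=0.90, max_repeats=2):
    """Check whether the latest call repeats earlier calls too often.

    calls is the ordered list of (tool_name, args_dict) for one session.
    Returns True when the latest call, counted together with earlier
    near-identical calls to the same tool, exceeds max_repeats.
    """
    if not calls:
        return False
    tool, args = calls[-1]
    latest = json.dumps(args, sort_keys=True)
    similar = 0
    for prev_tool, prev_args in calls:
        if prev_tool != tool:
            continue
        candidate = json.dumps(prev_args, sort_keys=True)
        if SequenceMatcher(None, latest, candidate).ratio() >= threshold:
            similar += 1  # the latest call matches itself, so it counts once
    return similar > max_repeats


# Hypothetical session: the same search issued three times.
calls = [
    ("search", {"q": "income records for applicant A-17"}),
    ("search", {"q": "income records for applicant A-17 "}),
    ("search", {"q": "income records for applicant A-17"}),
]
print(loop_detected(calls[:2]), loop_detected(calls))  # False True
```

Run the check each time a tool call is appended, so the flag fires on the call that crosses the threshold rather than at end of session.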
Flag when the model's stated objective in chain-of-thought output diverges from the original system prompt task by more than a threshold semantic distance. Requires embedding comparison — more expensive than the other flags. Run at session mid-point checkpoints rather than every step to keep overhead manageable. Misses some context poisoning cases. Catches the severe ones early. Research suggests context drift contributes to approximately 65% of enterprise AI agent failures, making this flag high-value despite the compute cost.
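The comparison itself is a cosine distance between two embedding vectors. The sketch below assumes the embeddings are already computed by whatever model you use; the 0.35 threshold and the toy two-dimensional vectors are placeholders, not recommended values.

```python
import math


def cosine_distance(a, b):
    """1 minus cosine similarity: 0.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embedding vectors must be non-zero")
    return 1.0 - dot / norm


def context_drift(task_embedding, objective_embedding, max_distance=0.35):
    """True when the stated objective has drifted past the threshold."""
    return cosine_distance(task_embedding, objective_embedding) > max_distance


# Toy vectors standing in for real embeddings.
original_task = [1.0, 0.0]
still_on_task = [1.0, 0.0]
drifted_goal = [0.0, 1.0]
print(context_drift(original_task, still_on_task))  # False
print(context_drift(original_task, drifted_goal))   # True
```

The threshold needs calibration against sessions you have already labeled; a value that works for one embedding model will not transfer to another.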
Soft warning at 60% of the session token budget, separate from the hard kill at 100%. The point is intervention time: the remaining 40% of the budget gives the platform team room to investigate and pause before the session terminates abruptly. An agent killed at the hard limit produces an incomplete trace that is harder to debug than one paused on BUDGET_WARNING. Wire this to PagerDuty or your on-call rotation, not just a log entry.
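A sketch of the two thresholds, using integer arithmetic so the 60% boundary is exact. The function name and the hard-limit label `BUDGET_EXCEEDED` (borrowed from the session outcome labels) are assumptions of this example.

```python
def budget_status(tokens_used, session_budget, soft_percent=60):
    """Return None, "BUDGET_WARNING", or "BUDGET_EXCEEDED" for a session."""
    if tokens_used >= session_budget:
        return "BUDGET_EXCEEDED"   # hard kill at 100% of budget
    if tokens_used * 100 >= soft_percent * session_budget:
        return "BUDGET_WARNING"    # soft warning: page on-call, do not kill
    return None


for used in (59_999, 60_000, 100_000):
    print(used, budget_status(used, session_budget=100_000))
```

Call it after every LLM response with the running token total, and emit the warning only on the transition into `BUDGET_WARNING` so on-call gets one page per session rather than one per step.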
A concrete priority sequence for teams that don't have all of this in place yet.
Most teams running agents in production are somewhere in the middle — they have some logging, they don't have session correlation, they have no anomaly flags, and they've debugged one or two incidents by staring at logs for three hours. This is the priority sequence.
Week 1: Session correlation and tool payloads. Add gen_ai.conversation.id (or your own session_id) to every LLM call and every tool call. Log the actual tool input arguments and response field names at each step. These two changes alone cut mean debug time by more than half on the next incident. Nothing else matters until these are in place.
Week 2: Outcome classification and TOOL_SCHEMA_DRIFT. Add the five outcome labels to your session lifecycle: Completed, Tool Error, Bad Output, Timeout, Budget Exceeded. Wire TOOL_SCHEMA_DRIFT detection to your tool wrappers: hash the expected field set per tool, compare on every call. This flag catches the most common production failure class (API contract change) at the boundary rather than hours later in user reports.
Week 3: ENTITY_HALLUCINATION, BUDGET_WARNING, and LOOP_DETECTED. Add entity-ID format detection to your LLM call wrappers. Set a soft budget warning at 60% of session budget. Add loop detection that fires on the third near-identical call. These three flags cover hallucination propagation, cost runaway, and the most obvious form of context poisoning.
Month 2: Eval harness and CONTEXT_DRIFT. Build a regression eval that replays historical session traces and asserts against expected outcomes. Add CONTEXT_DRIFT detection at session mid-points using embedding comparison. File the first postmortem using the six-section template rather than the standard RCA format.
You don't need all of this before shipping your first agent to production. You need session correlation and tool payloads. Everything else is a force multiplier on that foundation.
| Priority | Capability | Effort | Failure Modes Covered | Monday Morning Heuristic |
|---|---|---|---|---|
| 1: Ship nothing without this | Session correlation ID on every span | < 1 day | All four (without this you can't reconstruct any session) | If you don't have it, you can't debug your first production incident, full stop |
| 2: Ship nothing without this | Tool input/output payload logging | < 1 day | Tool Misfire, Hallucination Propagation | Most production agent failures trace to a tool interface problem; you need the payloads |
| 3: Add in week 1 | Session outcome classification | Half day | All four (enables cohort queries) | Without outcome labels you can't answer 'how many Bad Output sessions this week?' |
| 4: Add in week 2 | TOOL_SCHEMA_DRIFT flag | 1 day | Tool Misfire, Hallucination Propagation | API contract changes are the most common silent failure trigger |
| 5: Add in week 2–3 | ENTITY_HALLUCINATION + LOOP_DETECTED + BUDGET_WARNING | 2–3 days | All four failure modes | These three flags cover the next-most-common incident classes |
| 6: Month 2 | Eval harness + CONTEXT_DRIFT | 1–2 weeks | Context Poisoning, regression detection | High-value but requires embedding comparison; do this after the simpler flags are stable |
When does an agent incident warrant a full postmortem versus a bug fix?
Escalate to a full postmortem when the agent took an irreversible external action, when per-session cost exceeded 5x baseline, or when the same failure pattern appeared in more than one session. A single hallucination that produced wrong text without triggering tool calls — and was contained to one session — is a bug: add an eval test, update the prompt, move on. The postmortem process exists for failures that expose architectural gaps. The practical test: if the detection gap finding requires changing something in the infrastructure (anomaly flag, schema validation, guard rail) rather than just a prompt edit, it needs a postmortem.
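The escalation rule reduces to three conditions joined by "or". A sketch, with hypothetical parameter names:

```python
def needs_postmortem(took_irreversible_action, session_cost,
                     baseline_cost, sessions_with_pattern):
    """Apply the three escalation criteria; any one is sufficient."""
    return (
        took_irreversible_action
        or session_cost > 5 * baseline_cost
        or sessions_with_pattern > 1
    )


# One contained hallucination, no tool calls, normal cost: a bug fix.
print(needs_postmortem(False, 0.40, 0.35, 1))  # False
# Same pattern seen in three sessions: full postmortem.
print(needs_postmortem(False, 0.40, 0.35, 3))  # True
```

The infrastructure-change test from the paragraph above stays a human judgment; this function only encodes the mechanical triggers.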
Can my existing APM tool handle agent observability?
It handles infrastructure-level metrics — latency, error rates, cost per session — well. It doesn't handle behavioral correctness: whether the agent chose the right tool, whether its output was factually accurate, whether its goal stayed consistent across turns. LLM observability is fundamentally semantic, not syntactic. A clean HTTP 200 from an LLM API can carry a hallucinated fact, and no latency graph detects it. For behavioral correctness, you need OpenTelemetry spans with gen_ai.* attributes plus a semantic eval layer — Langfuse, Arize Phoenix, or a custom eval harness that runs assertions against session traces. The APM stack stays for infrastructure. It doesn't substitute for session-level behavioral monitoring.
How do I instrument an agent built on a third-party framework like LangGraph, LlamaIndex, or AG2?
Most major frameworks now ship native OpenTelemetry support. AG2 has built-in OTel tracing that captures agent turns, LLM calls, tool executions, and speaker selections as structured spans connected by a shared trace ID, exportable to any OTel-compatible backend. [7] LlamaIndex and LangChain support OpenInference instrumentation, which follows the OpenTelemetry GenAI semantic conventions. The minimum is one trace per session with step numbers, gen_ai.* attributes on LLM calls, and tool call input/output payloads. If your framework doesn't emit these natively, add a thin wrapper at the LLM call boundary (two functions, under 50 lines).
Does vanilla OpenTelemetry cover agent faults adequately?
No. The AgentTelemetry benchmark found vanilla OTel achieves a Fault Detection Rate of 0.429 (43%) across 14 agent-specific fault types. [10] The five phases it doesn't cover with standard spans are planning, reasoning, safety monitoring, inter-agent delegation, and memory management. A comprehensive span taxonomy that adds custom spans for these phases achieves FDR 1.000 on the same benchmark. The practical implication: use OTel gen_ai semantic conventions as the base, then add custom attributes for orchestration-layer spans until the standard catches up. You're not inventing a proprietary format — you're extending a good foundation.
What is the minimum instrumentation before shipping an agent to production for the first time?
Three things, in priority order. A session correlation ID on every LLM call and every tool call — without this you can't reconstruct what happened. Tool call input and output logging at every step — most production agent failures trace to a tool interface problem, and you can't debug it without the actual payloads. A session outcome label that resolves at end-of-session: Completed, Tool Error, Bad Output, Timeout, Budget Exceeded. With these three, you can debug most production incidents. Anomaly flags, semantic evals, and replay harnesses are force multipliers on top. Without correlation IDs, tool payloads, and outcome labels, your first production incident will be undebuggable regardless of what else is in the stack.
Can an LLM help me debug agent traces automatically?
Not reliably yet. The TRAIL benchmark tested exactly this: 148 human-annotated agent traces with 841 errors, and Gemini-2.5-Pro, the best model on the benchmark, scored 11% joint accuracy at identifying all fault origins in a trace. [9] Performance drops sharply as the number of errors per trace increases. The practical conclusion: use LLMs to assist with trace summarization and hypothesis generation, but don't rely on them as the primary debugger. Structured anomaly flags (ENTITY_HALLUCINATION, TOOL_SCHEMA_DRIFT, etc.) are still more reliable for finding the fault origin than asking a model to self-diagnose from raw trace text.
The hallucination probability curves (73% at 5+ tool calls, 71% front-loaded errors) come from analysis across controlled benchmark evaluations — AgentBench 2025, HELM Agentic Evaluation, and Stanford HAI Q4 2025 data — as reported by Markaicode (Feb 2026) [1]. The 89%/52% observability gap comes from Tianpan.co's systematic debugging article (Feb 2026) [2], citing early 2026 survey data. The AgentTelemetry FDR numbers (0.429 vanilla OTel vs. 1.000 full taxonomy) come from the AgentTelemetry paper on OpenReview [10]. The TRAIL benchmark stats (148 traces, 841 errors, 11% Gemini-2.5-Pro score) come from the Patronus AI paper (May 2025) [9]. Real-world production rates depend on input noise, schema consistency, and retrieval quality. Treat them as directional benchmarks, not engineering thresholds to cite in an SLO.