SRE runbooks assume one process, one stack trace, one bad line. Agent failures are distributed across dozens of reasoning steps — the wrong premise gets laundered through 33 more calls before the user sees it. Here is the taxonomy, the triage, the postmortem.
2:37am. The order agent returns a confirmation for an order that does not exist. The on-call pulls logs and finds nothing. No stack trace. No 500. No exception. HTTP 200, well-formatted prose, full confidence — confirming the wrong order against the wrong customer account.
The failure is not in your code. It's distributed across 40 LLM calls, where the actual fault happened around step 7 — a small hallucination about a customer identifier — and every subsequent call built confidently on top of the poisoned premise. By the time the user saw the output, that error had been laundered through 33 more reasoning steps.
Tool calls fail 3–15% of the time in production, even in well-engineered systems [8]. When they do, there's no stack trace. Error rates look normal. Latency is fine. The SLO dashboard is green. The user is filing a ticket.
Traditional SRE asks one question: what crashed? Agentic systems demand a different one: which of these 40 reasoning steps produced the wrong context that cascaded into the wrong action? That's not debugging. That's distributed session forensics.
This playbook maps the four operational failure modes to concrete triage steps, gives you a backward-trace method for locating fault origins inside long sessions, shows you the exact OpenTelemetry span attributes that matter for forensics, and provides a postmortem template built for systems where non-determinism is the default state.
The four failure modes — hallucination propagation, tool misfire, context poisoning, cost runaway — and how to classify them fast
The backward-trace method: find the fault origin without reading 40 calls sequentially
OpenTelemetry GenAI semantic convention attributes (v1.37) that make forensics tractable
A complete triage sequence for the first five minutes of an agent incident
A postmortem template that maps to agent failures, not deployment changes
Structural prevention choices that decide whether the next incident takes ten minutes or ten hours
Hallucination propagation, tool misfire, context poisoning, cost runaway. Misclassify the mode and you investigate the wrong session.
Even well-engineered systems drop 3–15% of tool calls [8]. Each failure is silent — no exception, no alert.
Four LangChain agents entered an infinite loop and ran for 11 days before the bill surfaced. Alerts fired. Nobody acted on them.
Most incidents return HTTP 200. The agent completed. It did the wrong thing.
The mental model designed for crashes does not survive contact with reasoning defects.
Traditional runbooks assume systems that fail loudly. A service crashes. A timeout fires. A null pointer propagates. You find the exception, walk the call stack, identify the line. The bug has a location.
Agent incidents violate every assumption in that model. The agent didn't crash. It ran to completion and returned a result. The result was wrong, and nothing in your observability stack knows that. The failure is closer to a reasoning defect than a code defect. And reasoning defects compound.
A wrong assumption at step 7 shapes the framing at step 8, which selects the wrong tool at step 9, which returns data that reinforces the wrong assumption at step 10. By step 20, the agent has constructed a coherent internal narrative that is entirely wrong — and it has the tool call logs to prove it. The model is not confused. It is confidently wrong, and its own earlier output is now the evidence it cites.
This isn't a hypothetical mode. In multi-agent systems, hallucination cascades are a documented failure pattern: one subagent produces a plausible-sounding but false intermediate result, the orchestrator accepts it as ground truth, and every downstream agent operates on poisoned inputs [5]. Without span-level tracing that captures what was in the context window at each step, that cascade is invisible.
Ask: what process crashed?
Read the stack trace
Identify the failing line of code
Reproduce with identical inputs
Fix the deterministic bug
Verify with a unit test
Ask: which step produced the wrong context?
Read the full session trace across every LLM call
Locate the fault origin in the reasoning chain
Accept that exact reproduction is usually impossible
Fix the prompt, the tool interface, or the memory architecture
Verify with a replay eval against the captured session
Classify before you investigate. The wrong taxonomy sends you down the wrong session.
The agentic AI fault taxonomy literature [1] catalogs dozens of failure patterns. For incident response, they collapse into four operationally distinct modes — each with a different signal, a different blast radius, a different triage path.
1. Hallucination propagation
The model generates a false assertion early in the session. Because agents accumulate context across calls, that assertion gets referenced and re-affirmed in later steps. By call 20, the hallucination is an established fact inside the session. The model is not wrong because it's confused. It's wrong because its own earlier output is now evidence.
Signature: confident, coherent, structured output, internally consistent. Built on a false premise. No tool call failure. No error code.
2. Tool misfire
The model selects the wrong tool, passes malformed arguments, or misinterprets tool output. Unlike hallucination, tool misfire produces real side effects immediately — deleted records, sent emails, processed payments, triggered workflows. The session looks healthy on latency and token counts while causing irreversible damage downstream. Schema drift is a common trigger: a dependency update changes how tool schemas are generated, making them incompatible with the model provider's format [6].
Signature: a tool was called with an unexpected argument pattern, or a tool returned data the model processed without validating against the session's stated goal.
3. Context poisoning
A hallucination or injected content makes it into persistent context — goal state, working memory, retrieved documents. The agent's framing of the entire task warps. Long-running agents are especially exposed because they carry context across many turns and the poisoning compounds [5]. Context poisoning differs from hallucination propagation in what it corrupts. Hallucination propagation corrupts factual claims about the world. Context poisoning corrupts the agent's self-model: what it thinks it is trying to do.
Signature: the agent's stated goal drifts across turns without user instruction. The world model becomes internally inconsistent.
4. Cost runaway
The agent loops: repeated tool calls with similar arguments, infinite retry logic, self-spawning subagents, circular reasoning chains. In November 2025, a market research pipeline running four LangChain agents entered an unintended infinite loop that ran for 11 days and cost $47,000 [11]. Alerts fired. Nobody acted on them in time. Token consumption compounds silently until a budget alarm fires or the session times out.
Signature: per-session token counts spike well above baseline. Tool call frequency is abnormally high. The same tool gets called multiple times with near-identical arguments.
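The repeated-call signature can be checked mechanically. The following is a minimal sketch, assuming tool calls are available as (name, arguments) pairs; `normalize_args`, `detect_loop`, and the tool names are hypothetical, and the normalization rule (trim and lowercase strings) is only one possible definition of "near-identical".

```python
import json
from collections import Counter
from typing import Optional

def normalize_args(args: dict) -> str:
    """Canonical form so trivially different calls compare equal (assumed rule)."""
    cleaned = {k: v.strip().lower() if isinstance(v, str) else v for k, v in args.items()}
    return json.dumps(cleaned, sort_keys=True)

def detect_loop(tool_calls: list, max_repeats: int = 3) -> Optional[str]:
    """Return the first tool called more than max_repeats times with the same
    normalized arguments, or None if no loop is found."""
    seen = Counter()
    for name, args in tool_calls:
        key = (name, normalize_args(args))
        seen[key] += 1
        if seen[key] > max_repeats:
            return name
    return None

# Hypothetical session: four searches that differ only in case and whitespace.
calls = (
    [("search_market", {"query": "EV battery prices "})] * 2
    + [("search_market", {"query": "ev battery prices"})] * 2
    + [("summarize", {"doc_id": "d1"})]
)
looping_tool = detect_loop(calls)
print(looping_tool)  # search_market
```

Counting per (tool, normalized arguments) pair rather than per tool avoids flagging an agent that legitimately calls the same tool many times with different inputs.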
| Failure Mode | Primary Signal | First Triage Step | Blast Radius |
|---|---|---|---|
| Hallucination propagation | Confident, coherent output — factually wrong. No tool errors. | Locate the first false assertion. Trace backward from wrong output. | Low–medium. Informational unless paired with tool calls. |
| Tool misfire | Wrong API call or malformed arguments in the tool log | Audit every tool call after the fault origin. Inspect downstream system state. | High. Real-world effects are immediate and often irreversible. |
| Context poisoning | Goal drifts across turns. World model becomes internally inconsistent. | Find when goal state was overwritten. Inspect injected or retrieved content. | Variable. Depends on how long the session ran after poisoning. |
| Cost runaway | Token counter spike, repeated near-identical tool calls, no final output | Kill the session immediately. Audit total spend. Identify loop entry point. | Financial. No user-visible wrong output. Potentially catastrophic cost. |
A triage sequence that holds when the error log is empty and the SLO dashboard is green.
Kill the active session if the agent is still running. Revoke API credentials if tool misfire is in play. Rate-limit the agent's access to external services until you understand the blast radius. An agent that keeps running while you investigate is an agent that keeps making decisions you haven't authorized.
Export the complete trace before logs rotate. Every LLM call with full prompt and completion. Every tool invocation with arguments and response. The full context window state at each step. Capture now. Reconstruct later. Logs that rotated mid-investigation are not coming back.
Before reading the trace in detail, look at the high-level signals and classify which of the four modes this is. Classification determines the triage path. A cost runaway investigation has nothing in common with a context poisoning investigation. Misclassify and you spend an hour reading the wrong evidence.
Walk the trace backward from the wrong output. Identify the earliest step where the agent's reasoning diverges from what you'd expect. That step is the fault origin — not the step that produced the bad output, but the step that first introduced the wrong premise. Everything between origin and final output is cascade.
Enumerate every external action the agent took after the fault origin. For each, decide whether it's reversible. Deleted data, sent communications, processed transactions, modified state — each needs an explicit reversal plan. Log everything. The postmortem will need it.
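The blast-radius step can be sketched as a filter over the tool log. This is an illustration under assumptions: the `ToolCall` shape, the tool names, and the `REVERSIBLE` map are hypothetical, and a real reversibility map has to come from the owners of each downstream system.

```python
from dataclasses import dataclass

# Hypothetical reversibility map; real entries depend on your downstream systems.
REVERSIBLE = {"create_draft": True, "update_record": True,
              "send_email": False, "charge_card": False}

@dataclass
class ToolCall:
    step: int
    tool: str
    args: dict

def blast_radius(calls: list, fault_origin_step: int) -> list:
    """List every tool call at or after the fault origin with a reversibility verdict."""
    report = []
    for call in sorted(calls, key=lambda c: c.step):
        if call.step < fault_origin_step:
            continue  # actions taken before the fault origin are not part of the cascade
        report.append({
            "step": call.step,
            "tool": call.tool,
            # Unknown tools default to irreversible so they get a human review.
            "reversible": REVERSIBLE.get(call.tool, False),
        })
    return report

calls = [
    ToolCall(2, "update_record", {"id": "r1"}),
    ToolCall(4, "send_email", {"to": "user@example.com"}),
    ToolCall(9, "archive_order", {"id": "o7"}),
]
report = blast_radius(calls, fault_origin_step=3)
needs_reversal_plan = [r["tool"] for r in report if not r["reversible"]]
print(needs_reversal_plan)  # ['send_email', 'archive_order']
```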
GenAI semantic conventions v1.37 give you a standard schema. Here is which fields to instrument first.
The industry is converging on OpenTelemetry (OTel) as the standard for agent telemetry, and the GenAI semantic conventions — currently at v1.37 — define the span attribute schema [10]. Datadog, Langfuse, and Arize Phoenix all map to these conventions natively, so you instrument once and export to any backend [9].
For forensics, not every attribute is equally useful. These are the ones that determine whether you find the fault origin in ten minutes or three hours:
Per-LLM-call spans (gen_ai.client.inference or equivalent)
gen_ai.request.model — which model version processed this step. Model version drift is a silent cause of behavioral regression, especially after provider rollouts.
gen_ai.usage.input_tokens and gen_ai.usage.output_tokens — token counts per call. An unexpectedly large input token count often means the context window accumulated garbage. A sudden spike is your loop detection signal.
gen_ai.response.finish_reasons — why the model stopped. tool_calls means the model handed off. stop means it gave a final answer. length means the output was truncated — a common cause of malformed tool arguments.
gen_ai.system_instructions / full message capture — opt-in, sensitive, but essential. Without capturing what was actually in the context window at the fault step, backward tracing is guesswork.
Per-tool-call spans
gen_ai.operation.name: execute_tool — labels this as a tool execution span
Custom fields to add on every span
session_id — links all spans from one agent run into a single forensic unit
step_number — without this, backward tracing requires time-sorting across potentially out-of-order events
anomaly_flags — structured array, not free text. Free text is not searchable evidence.
Locating the fault origin in a trace that is hundreds of events long.
A complex agent run produces a trace hundreds of events long. Reading sequentially from step one is the wrong move — you'll spend most of your time on steps that were fine. The correct method is forensic: start from the known bad output and work backwards.
The backward trace method
Take the final wrong output — a hallucinated order confirmation, a malformed API call, an incorrectly processed payment — and ask: what would have to be true for this output to make sense? Then find the step where that information first appeared. That is the fault origin.
Take an example trace in which the wrong output is a confirmation linking cust_789 to ord_555. Working backward: the agent called get_order with ord_555 at step 4. Where did ord_555 come from? It appeared in the LLM output at step 3 — an order ID that was never in the conversation context before that point. Entity hallucination. Step 3 is the fault origin.
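The same walk can be sketched in a few lines of Python. The trace below is an illustrative reconstruction of the cust_789 / ord_555 scenario, not captured data, and the event shape (`step`, `kind`, `text`) and the `find_fault_origin` helper are assumptions made for the example.

```python
# Illustrative trace reconstructed from the scenario in the text (not captured data).
TRACE = [
    {"step": 1, "kind": "user", "text": "Where is my order? My account is cust_789."},
    {"step": 2, "kind": "tool_result", "text": "get_customer -> {id: cust_789, open_orders: []}"},
    {"step": 3, "kind": "llm", "text": "Customer cust_789 is asking about ord_555."},
    {"step": 4, "kind": "tool_call", "text": "get_order(order_id='ord_555')"},
    {"step": 5, "kind": "tool_result", "text": "get_order -> {id: ord_555, status: shipped}"},
    {"step": 6, "kind": "llm", "text": "Confirmed: order ord_555 for cust_789 has shipped."},
]

def find_fault_origin(trace, suspect):
    """Walk backward from the final output; the last hit is the earliest mention."""
    origin = None
    for event in sorted(trace, key=lambda e: e["step"], reverse=True):
        if suspect in event["text"]:
            origin = event
    return origin

origin = find_fault_origin(TRACE, "ord_555")
# An ID that first appears in model output, rather than in user or tool input,
# is a hallucination candidate.
hallucinated = origin is not None and origin["kind"] == "llm"
print(origin["step"], hallucinated)  # 3 True
```

Plain substring matching is enough for opaque IDs like these. Free-text premises (a wrong date, a wrong policy) need a human or a classifier to decide where they first appeared.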
Anomaly flags as accelerants
If your tracing infrastructure captures structured anomaly flags, the backward trace collapses from forty events to two or three. Useful flag types to instrument [3]:
ENTITY_HALLUCINATION — the model referenced an entity ID not found in prior context
TOOL_ARG_MISMATCH — a tool was called with arguments that violate its schema
CONTEXT_DRIFT — the agent's stated goal changed between turns without user instruction
LOOP_DETECTED — the same tool was called with near-identical arguments inside one session
ENTITY_MISMATCH — a tool returned data belonging to a different entity than the one being processed
OUTPUT_TRUNCATED — finish_reason: length on a tool-call step; the model's arguments were cut mid-JSON
Reconstruct the context, not just the output
LLM call summaries tell you what the model said. They don't tell you what it believed. At the fault origin step, rebuild the full prompt — system message, conversation history, tool results, working memory — and ask: given exactly this context, is this model output surprising? Sometimes the answer is yes and you've found a model limitation. More often the answer is no: given what was in that context window, the output was entirely predictable. Which means the fix is upstream — in the context, not the model.
This is the most common finding in agent postmortems that get honest: the model did exactly what the context implied it should do. The fault is in what made it into the context.
Standard SRE templates assume one root cause and one deployment. Agent failures violate both.
Standard postmortem templates ask deterministic questions. What changed in the deployment? What was the root cause line of code? How did the change ship? Agent postmortems need different questions because the failure is rarely in the deployment. It's in the combination of model behavior, prompt design, tool interfaces, and the specific data that appeared in the session.
We ran our first three agent postmortems on a standard SRE template. All three came out inconclusive. The template demanded "the root cause" — a single line, a single change — and every agent failure we investigated had three or four contributing factors that were each insufficient alone. Renaming the field "contributing factors" instead of "root cause" changed the remediation conversations immediately. Postmortems that end in "add a unit test" produce different follow-up than ones that end in "redesign the entity validation layer."
A useful agent postmortem answers six things:
1. Session context — What was the session trying to do? Provide session ID, time range, total LLM calls, total tokens consumed, plain-language description of intended behavior. Every other finding grounds in this.
2. Failure classification — Which of the four modes applies? Not a bureaucratic label — it determines which prevention layer was missing and what has to change.
3. Fault origin and cascade path — Which step was the fault origin? What was in the context window at that step? Trace the cascade from origin to final wrong output. Specific step numbers, no hand-waving.
4. Impact assessment — What external actions did the agent take after the fault origin? Reversible or not? User-facing impact? Cost impact?
5. Contributing factors — Model limitation (the base model hallucinates reliably on this input type)? Prompt design flaw (the system prompt didn't constrain entity references)? Tool interface issue (the tool returned ambiguous data)? Data contamination (a retrieved document carried misleading content)? Name the forces clearly.
6. Detection gap — Why did this reach the user? No eval test for the failure pattern? Insufficient monitoring? Blast radius larger than expected because no guardrails? The detection gap is the most important finding. It drives all prevention work. Everything else is description.
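One way to keep these six answers from being skipped is to encode the template as a record that reports its own gaps. This is a sketch, not a prescribed schema: the `AgentPostmortem` class, its field names, and the mode identifiers are hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Optional

FAILURE_MODES = {"hallucination_propagation", "tool_misfire",
                 "context_poisoning", "cost_runaway"}

@dataclass
class AgentPostmortem:
    session_context: str = ""
    failure_classification: str = ""
    fault_origin_step: Optional[int] = None
    cascade_path: str = ""
    impact_assessment: str = ""
    contributing_factors: list = field(default_factory=list)
    detection_gap: str = ""

    def blocking_issues(self) -> list:
        """Names of fields that must be filled before the postmortem can close."""
        issues = [f.name for f in fields(self)
                  if getattr(self, f.name) in ("", None, [])]
        if self.failure_classification and self.failure_classification not in FAILURE_MODES:
            issues.append("failure_classification (unknown mode)")
        return issues

pm = AgentPostmortem(
    session_context="Order lookup session, 40 LLM calls",
    failure_classification="hallucination_propagation",
    fault_origin_step=7,
    cascade_path="step 7 -> step 9 tool call -> final confirmation",
    impact_assessment="Wrong confirmation shown to one user; no writes",
    contributing_factors=["system prompt did not constrain entity references"],
)
print(pm.blocking_issues())  # ['detection_gap']
```

`contributing_factors` is a list rather than a single string so the record cannot collapse back into one root cause.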
Structural choices that decide whether the next incident takes ten minutes or ten hours.
Prevention architecture falls into two tiers. The first tier is visibility — making failures findable. The second tier is containment — limiting blast radius when a failure does occur. Teams that skip straight to containment without visibility end up with circuit breakers that trip for the wrong reasons and no evidence to diagnose why.
Start with visibility. Then contain.
Emit a structured trace event at every LLM call boundary — step number, input token count, model version, finish reason, tool calls triggered
Capture gen_ai.usage.input_tokens per call. A step where input tokens spike 3x above the session median is a context bloat signal — often the step before a hallucination
Store anomaly flags — entity hallucination, tool arg mismatch, context drift, output truncation — as structured JSON arrays on spans. Free-form log text is not searchable evidence
Retain full session traces for at least 30 days. Most agent incidents surface days after the session ran [12]
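A per-call trace event and the 3x-median check might look like the sketch below. It builds plain dictionaries rather than calling any tracing SDK; the `make_trace_event` and `context_bloat_steps` helpers, the model name, and the token counts are illustrative assumptions, while the `gen_ai.*` keys are the attribute names discussed earlier.

```python
from statistics import median

def make_trace_event(session_id, step_number, model, input_tokens,
                     output_tokens, finish_reasons, anomaly_flags=()):
    """One structured event per LLM call boundary, keyed for later forensics."""
    return {
        "session_id": session_id,
        "step_number": step_number,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "gen_ai.response.finish_reasons": list(finish_reasons),
        "anomaly_flags": list(anomaly_flags),  # structured array, not free text
    }

def context_bloat_steps(events, factor=3.0):
    """Steps whose input tokens exceed factor x the session median."""
    baseline = median(e["gen_ai.usage.input_tokens"] for e in events)
    return [e["step_number"] for e in events
            if e["gen_ai.usage.input_tokens"] > factor * baseline]

# Hypothetical session: input tokens per step, with a spike at step 5.
token_counts = [900, 1000, 1100, 1050, 4200]
events = [
    make_trace_event("sess_001", i, "example-model-v1", n, 200, ["stop"])
    for i, n in enumerate(token_counts, start=1)
]
print(context_bloat_steps(events))  # [5]
```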
Hard token budgets per session type — not just overall caps. Per-step limits that fire an alert when a single LLM call is anomalously expensive
Loop detection: same tool called with near-identical arguments more than three times in a session triggers a pause and escalation — before the budget cap is reached
Spend-rate monitoring: if a session's token burn rate exceeds 3x its trailing average in any 15-minute window, auto-throttle and alert immediately [11]
Cost alarms wired to session-level spend, not just account-level monthly totals. A single runaway session must trigger an alert before it hits double digits
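The spend-rate rule reduces to a comparison between the latest window and the trailing average. A minimal sketch, assuming token usage is already aggregated into consecutive 15-minute windows per session; `should_throttle` and the `min_history` guard are hypothetical.

```python
from statistics import mean

def should_throttle(window_tokens, factor=3.0, min_history=2):
    """window_tokens: tokens burned per consecutive 15-minute window, oldest first.

    Returns True when the latest window exceeds factor x the trailing average.
    """
    if len(window_tokens) <= min_history:
        return False  # not enough history to form a trailing average
    *history, current = window_tokens
    return current > factor * mean(history)

print(should_throttle([10_000, 12_000, 11_000, 40_000]))  # True
print(should_throttle([10_000, 12_000, 11_000, 30_000]))  # False
```

The `min_history` guard keeps a brand-new session from being throttled on its first busy window. The right value depends on how long your sessions normally run.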
Validate every key entity at session start — customer IDs, order IDs, account references — before any LLM call runs. A bad entity reference that fails validation at step 0 is not an incident
Add lightweight classifiers to intermediate LLM outputs that flag when the model references entity IDs not present in prior context
Use structured output schemas for steps that require precision. Constrain what the model is allowed to claim in tool call arguments — finish_reason: length on a tool-call step means the JSON was cut mid-object
For multi-agent sessions, treat each subagent's output as untrusted input. Validate before passing downstream [5]
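A lightweight check of this kind does not need a model. The sketch below flags entity IDs that appear in an LLM output but not in prior context, and truncated tool-call steps. It assumes IDs follow a `cust_123` / `ord_123` format; `ENTITY_PATTERN` and `flag_step` are hypothetical names.

```python
import re

# Assumed ID format; adjust the pattern to your own entity naming scheme.
ENTITY_PATTERN = re.compile(r"\b(?:cust|ord)_\d+\b")

def extract_entities(text):
    return set(ENTITY_PATTERN.findall(text))

def flag_step(prior_context, llm_output, finish_reason, has_tool_call):
    """Return structured anomaly flags for one intermediate LLM step."""
    flags = []
    unseen = extract_entities(llm_output) - extract_entities(prior_context)
    if unseen:
        flags.append("ENTITY_HALLUCINATION")
    if finish_reason == "length" and has_tool_call:
        flags.append("OUTPUT_TRUNCATED")
    return flags

context = "Customer cust_789 asked about a late delivery."
print(flag_step(context, "Looking up ord_555 for cust_789.", "stop", False))
# ['ENTITY_HALLUCINATION']
```

The same function can gate a subagent's output before the orchestrator accepts it: any non-empty flag list means the output is held for validation instead of being passed downstream.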
Store sessions in replay-friendly format — every message, tool definition, and tool response captured so you can re-run with a patched prompt
Add every production failure to the eval regression suite. The session trace is the test case
Run weekly regression evals against the failure library to catch prompt or model regressions before they reach users
When a model provider rolls out a new version, run your failure library against both versions before migrating traffic
Why can't I reproduce the agent failure in a test environment?
Non-determinism. Outputs vary with temperature, context window packing, and API version drift between environments. Reproduction is not the goal of agent incident response. Evidence collection from the original session is. The session that failed is your most valuable artifact. Capture it, store it, analyze it directly. Stop trying to recreate it.
My agent uses streaming output — how do I get useful traces for forensics?
Buffer streamed completions to a single trace event before each tool call and at session end. Forensics doesn't need character-level streaming data. It needs span boundaries — start time, end time, full input, full output, finish reason. OTel-native libraries for LLMs — traceloop/openllmetry, Langfuse, Arize Phoenix — handle this buffering with minimal overhead. The gen_ai.client.inference.operation.details event in the v1.37 spec captures full content on an opt-in basis, decoupled from trace lifecycle.
How do I find the fault origin without reading 40 calls?
Backward trace plus anomaly flags. Start from the wrong output. Ask what would have to be true for that output to make sense. Find the earliest step where that information first appeared. Structured anomaly flags — ENTITY_HALLUCINATION, TOOL_ARG_MISMATCH, OUTPUT_TRUNCATED — collapse this from manual review of 40 events to checking 2–3 flagged spans. If you don't have flags, look at finish_reason first: a length-terminated step that triggered a tool call is a common fault origin.
When does a context poisoning incident warrant a full postmortem instead of a quick fix?
Escalate to a full postmortem when the agent took an external action, when cost ran more than 5x the session baseline, or when the same failure pattern appeared in more than one session. A single informational hallucination that triggered no tool calls and stayed inside one session is a bug — add an eval, update the prompt, move on. The postmortem process is for failures that expose architectural gaps, not for every imperfect output.
Should agent postmortems be separate from the standard incident review?
Yes — or at minimum, a dedicated section. Standard reviews focus on deployment changes, service dependencies, and infrastructure state. None of that is usually relevant to agent failures. Agent postmortems focus on session behavior, prompt design, model characteristics, and eval coverage. Mixing them muddies both analyses.
One counter: too many separate processes produce postmortem fatigue and the behavioral analysis never gets completed. If that's the pattern, consolidate — but make the agent-specific questions mandatory fields that cannot be left blank before the postmortem closes. An empty fault origin field means the investigation isn't done.
Which observability platform should I pick for agent tracing?
It depends on your stack. LangSmith has the deepest LangChain/LangGraph integration — node-by-node state diffs, full execution graphs, replay against new model versions. Langfuse is the open-source leader — self-hostable on Postgres + ClickHouse, framework-agnostic, any SDK via OTel. Arize Phoenix ships stronger eval primitives and drift detection if you need ML-rigor beyond basic tracing. All three export to OTel conventions v1.37, so your instrumentation code doesn't change if you switch backends. Start with whichever your framework ships first-class support for, and migrate later if your forensic needs grow.
The agent postmortem that ends with "the model hallucinated" is not a postmortem — it's a shrug. Every agent model hallucinates. The question is which structural condition let that hallucination propagate through 33 more reasoning steps and reach the user. That condition is in your observability stack, your entity validation layer, your eval harness, or your containment design. Name it specifically, fix it structurally, and add it to the failure library. The next incident will test whether you actually did.