Your monitoring fires at 2:37am. The customer-facing order agent returned a confirmation for an order that doesn't exist. The on-call engineer pulls the logs — and finds nothing useful. No stack trace. No 500 status code. No exception. The agent returned 200 and produced a well-formatted, confident response. It just confirmed the wrong order against the wrong customer account.
This is the problem an agent incident playbook has to solve. The failure is not in your code. It's distributed across 40 LLM calls, where the actual root cause happened somewhere around step 7 — a small hallucination about a customer identifier — and every subsequent call built confidently on that poisoned premise. By the time the user saw the output, that error had been laundered through 33 more reasoning steps.
Traditional SRE runbooks ask "what crashed?" Agentic systems require a fundamentally different question: which of these 40 reasoning steps produced the wrong context that cascaded into the wrong action? That's not a debugging problem. It's distributed session forensics.
This playbook maps the four core LLM failure modes to concrete triage steps, gives you a session forensics method for locating fault origins in long traces, and provides a postmortem template built for systems where non-determinism is a given, not an anomaly.
Why Your SRE Runbook Fails at 2am
The mental model mismatch between deterministic systems and agentic ones
Traditional runbooks were designed for systems that fail loudly. A service crashes, a timeout fires, a null pointer propagates. You find the exception, you trace the call stack, you identify the line. The problem has a location.
Agent incidents violate every assumption in that model. The agent didn't crash — it ran to completion and returned a result. The result was wrong, but nothing in your observability stack knows that. Error rates look normal. Latency is fine. The SLO dashboard is green. The user is filing a support ticket.
The failure mode is closer to a reasoning defect than a code defect. And reasoning defects in agentic systems have a property that makes them particularly hard to debug: they compound. A wrong assumption at step 7 shapes the framing at step 8, which selects the wrong tool at step 9, which returns data that reinforces the wrong assumption at step 10. By step 20, the agent has constructed a coherent internal narrative that is entirely wrong — and it has the tool call logs to prove it.
Traditional runbook:

- Ask: what process crashed?
- Read the stack trace
- Identify the failing line of code
- Reproduce with identical inputs
- Fix the deterministic bug
- Verify with a unit test

Agent incident playbook:

- Ask: which step produced the wrong context?
- Read the full session trace across all LLM calls
- Identify the fault origin in the reasoning chain
- Accept that exact reproduction is often impossible
- Fix the prompt, tool interface, or memory architecture
- Verify with a replay eval on the captured session
The Four Failure Modes of Agentic Systems
A taxonomy for classifying agent incidents before you start digging
Research from the agentic AI fault taxonomy literature [1] identifies several recurring failure patterns. For the purposes of incident response, these collapse into four operationally distinct modes — each with a different signal, blast radius, and triage path.
1. Hallucination propagation
The model generates a false assertion early in the session. Because agents accumulate context across calls, this assertion gets referenced and re-affirmed in later steps. By call 20, the hallucination has become an established fact inside the session. The model isn't wrong because it's confused — it's wrong because its own earlier output is now evidence.
The key signature: the agent sounds confident and coherent. The output is structured, grammatically correct, and internally consistent. It's just built on a false premise.
2. Tool misfire
The model selects the wrong tool, passes malformed arguments, or misinterprets tool output. Unlike hallucination, this can produce real side effects immediately — deleted records, sent emails, processed payments, triggered workflows. The session may look healthy in terms of latency and token counts while causing irreversible damage downstream.
The key signature: a tool was called with an unexpected argument pattern, or a tool returned data that the model processed without validating against the session's stated goal.
3. Context poisoning
A hallucination or injected content makes it into persistent context — goal state, working memory, retrieved documents. The model's framing of the entire task becomes warped. Long-running agents are especially vulnerable because they carry context across many turns and the poisoning compounds over time [5].
Context poisoning differs from standard hallucination propagation in that it affects the agent's self-model — what it thinks it's trying to do — rather than its factual claims about the world.
4. Cost runaway
The agent enters a loop: repeated tool calls with similar arguments, infinite retry logic, self-spawning subagents, or circular reasoning chains. No wrong output in the traditional sense — the agent may never surface a result at all. The failure is financial and operational. Token consumption compounds silently until a budget alarm fires or the session times out [6].
The key signature: token counts per session spike well above the baseline. Tool call frequency is abnormally high. The same tool gets called multiple times with nearly identical arguments.
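These signatures can be checked mechanically. Below is a minimal sketch, assuming trace events shaped like the JSONL examples later in this article (`type`, `tool`, `args`, token counts). It flags exact-duplicate tool calls — a simplification of "near-identical" — and session-level token spikes; the field names and thresholds are assumptions, not a standard.

```python
import json
from collections import Counter

def runaway_signals(events, token_baseline, repeat_threshold=3, spike_factor=5):
    """Return (signal, detail) tuples indicating possible cost runaway."""
    signals = []
    # Count tool calls by (tool, canonicalized args) to catch loops.
    call_counts = Counter(
        (e["tool"], json.dumps(e["args"], sort_keys=True))
        for e in events if e["type"] == "tool_call"
    )
    for (tool, args), n in call_counts.items():
        if n >= repeat_threshold:
            signals.append(("LOOP_DETECTED", f"{tool} called {n}x with args {args}"))
    # Compare total session tokens against a per-session-type baseline.
    total_tokens = sum(
        e.get("input_tokens", 0) + e.get("output_tokens", 0)
        for e in events if e["type"] == "llm_call"
    )
    if total_tokens > spike_factor * token_baseline:
        signals.append(("TOKEN_SPIKE", f"{total_tokens} tokens vs baseline {token_baseline}"))
    return signals
```

Running this per completed session (or periodically against in-flight ones) turns the "key signature" above into an alertable check rather than something an on-call engineer has to eyeball.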
The Agent Incident Playbook: First Five Minutes
A structured triage sequence that works when the error log is empty
1. Contain the session
Kill the active session immediately if the agent is still running. Revoke API credentials if tool misfire is suspected. Rate-limit the agent's access to external services until you understand the blast radius. Do not let the agent continue to run while you investigate.

2. Capture the session trace
Export the complete session trace before logs rotate. You need every LLM call with its full prompt and completion, every tool invocation with its arguments and response, and the full context window state at each step. Capture now — reconstruct later.

3. Classify the failure mode
Before reading the trace in detail, look at the high-level signals to classify which of the four failure modes this is. Classification determines your triage path. A cost runaway incident has a completely different investigation sequence than a context poisoning one.

4. Find the fault origin
Walk the session trace backward from the wrong output. Identify the earliest step where the agent's reasoning diverges from what you would expect. That step is your fault origin — not the step that produced the bad output, but the step that first introduced the wrong premise. Everything between the fault origin and the final output is cascade.

5. Assess and document blast radius
List every external action the agent took after the fault origin. For each action, determine whether it's reversible. Deleted data, sent communications, processed transactions, and modified state all need explicit reversal plans. Log everything — you'll need it for the postmortem.
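The blast-radius enumeration in step 5 can be sketched as a filter over the session trace. `SIDE_EFFECT_TOOLS` and the event field names below are illustrative assumptions; you would populate the set from your own tool registry's side-effect annotations.

```python
# Assumed allowlist of tools with external, possibly irreversible effects.
SIDE_EFFECT_TOOLS = {"send_email", "process_payment", "delete_record", "update_order"}

def blast_radius(events, fault_origin_step):
    """List every side-effecting tool call made after the fault origin."""
    return [
        {"step": e["step"], "tool": e["tool"], "args": e.get("args")}
        for e in events
        if e["type"] == "tool_call"
        and e["step"] > fault_origin_step
        and e["tool"] in SIDE_EFFECT_TOOLS
    ]
```

Each returned entry is a candidate for the reversible/irreversible triage described above.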
session-trace.jsonl

```jsonl
// Structured trace format — one event per line
// Required fields for agent incident forensics
{"session_id":"ses_abc123","step":1,"type":"llm_call","model":"claude-3-7-sonnet","input_tokens":847,"output_tokens":312,"ms":1240,"summary":"Initiated customer lookup by account ID"}
{"session_id":"ses_abc123","step":2,"type":"tool_call","tool":"get_customer","args":{"id":"cust_789"},"status":"ok","summary":"Returned Alice Chen, enterprise tier"}
{"session_id":"ses_abc123","step":3,"type":"llm_call","model":"claude-3-7-sonnet","input_tokens":1123,"output_tokens":445,"ms":1890,"summary":"Referenced order_id not present in prior context","flags":["ENTITY_HALLUCINATION"]}
{"session_id":"ses_abc123","step":4,"type":"tool_call","tool":"get_order","args":{"id":"ord_555"},"status":"ok","summary":"Order found — belongs to cust_012, not cust_789","flags":["ENTITY_MISMATCH"]}
{"session_id":"ses_abc123","step":5,"type":"llm_call","model":"claude-3-7-sonnet","input_tokens":1892,"output_tokens":223,"ms":1540,"summary":"Confirmed order for cust_789 — fault origin was step 3"}
```

Session Forensics: Reading the Trace Like Evidence
How to locate the fault origin in a long, non-deterministic execution trace
A session trace for a complex agent run can be hundreds of events long. Reading it sequentially from the start is the wrong approach — you'll spend most of your time reviewing steps that were fine. The correct method is forensic: start from the known bad output and work backwards.
The backward trace method
Take the final wrong output — a hallucinated order confirmation, a malformed API call, an incorrectly processed payment — and ask: what information would have to be true for this output to make sense? Then find the step where that information first appeared in the session. That's your fault origin.
In the example trace above, the wrong output is a confirmation linking cust_789 to ord_555. Working backward: the agent called get_order with ord_555 at step 4. Where did ord_555 come from? It appeared in the LLM output at step 3 — an order ID that was never in the conversation context before that point. That's an entity hallucination, and step 3 is the fault origin.
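That lookup can be sketched in a few lines, assuming a list of trace events like the ones above. The heuristic: if an entity's first appearance in the session is inside an `llm_call` event rather than a user turn or tool result, it was likely hallucinated at that step.

```python
import json

def find_entity_origin(events, entity_id):
    """Return (step, event_type) of the earliest event mentioning entity_id.

    An origin of type "llm_call" suggests the model introduced the entity
    itself; "tool_call" or a user message means it came from outside.
    """
    for e in events:
        # Serialize the whole event so args, summaries, and outputs
        # are all searched for the entity ID.
        if entity_id in json.dumps(e):
            return e["step"], e["type"]
    return None, None
```

On the example trace, `find_entity_origin(events, "ord_555")` would return step 3 with type `llm_call` — the fault origin — because `ord_555` does not appear in any earlier event.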
Anomaly flags as accelerants
If your tracing infrastructure supports it, add automated anomaly flags to your trace events [3]. Useful flag types include:
- `ENTITY_HALLUCINATION` — model referenced an entity ID not found in prior context
- `TOOL_ARG_MISMATCH` — tool was called with arguments that don't match its schema
- `CONTEXT_DRIFT` — the agent's stated goal changed between turns without user instruction
- `LOOP_DETECTED` — same tool called with near-identical arguments within the same session
- `ENTITY_MISMATCH` — a tool returned data belonging to a different entity than the one being processed
These flags don't catch every failure, but they dramatically accelerate the backward trace. In the example above, the ENTITY_HALLUCINATION flag on step 3 points directly to the fault origin without needing to manually compare 40 events.
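A basic `ENTITY_HALLUCINATION` check reduces to a set difference over extracted IDs. The regex below is a hypothetical pattern matching the `cust_`/`ord_`-style identifiers used in this article's examples; a real system would match its own ID formats.

```python
import re

# Hypothetical ID pattern for this article's example identifiers.
ID_PATTERN = re.compile(r"\b(?:cust|ord|acct)_[A-Za-z0-9]+\b")

def hallucinated_entities(prior_context: str, llm_output: str) -> set:
    """Return entity IDs the model mentioned that never appeared upstream."""
    known = set(ID_PATTERN.findall(prior_context))
    mentioned = set(ID_PATTERN.findall(llm_output))
    return mentioned - known
```

Run after each LLM call, a non-empty result attaches the flag to that trace event at write time, so the forensic signal exists before anyone needs it.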
What to look for in context state
Beyond individual LLM call summaries, examine what was actually in the context window at the fault origin step. For agents that use structured working memory or tool-call history as context, the question is: what did the model believe to be true when it made the wrong inference? Reconstruct the full prompt — system message, conversation history, tool results, working memory — at that exact step. The fault origin will be visible in the context, not just the output.
| Failure Mode | Primary Signal | First Triage Step | Blast Radius |
|---|---|---|---|
| Hallucination propagation | Confident, coherent, but factually wrong output | Find first false assertion in trace; trace backward from wrong output | Low to medium — informational unless paired with tool calls |
| Tool misfire | Wrong API call or malformed arguments in tool log | Audit all tool calls after fault origin; check downstream system state | High — real-world effects may be immediate and irreversible |
| Context poisoning | Agent goal drifts across turns; inconsistent world model | Find when goal state was overwritten; check injected or retrieved content | Variable — depends on how long session ran after poisoning |
| Cost runaway | Token counter spike, repeated tool calls, no final output | Kill session; audit total spend; identify loop entry point | Financial — no user-visible wrong output but significant cost exposure |
The Postmortem Template for Agent Incidents
What a non-deterministic postmortem looks like — and what it must answer
Standard postmortem templates ask questions that assume a deterministic system: what changed in the deployment? What was the root-cause line of code? How do we roll back the change? Agent postmortems need different questions because the failure usually isn't in the deployment — it's in the combination of model behavior, prompt design, tool interfaces, and the specific data that appeared in the session.
A useful agent postmortem answers six things:
1. Session context — What was the session trying to do? Provide the session ID, time range, total LLM calls, total tokens consumed, and a plain-language description of the intended behavior. This grounds every other finding.
2. Failure classification — Which of the four failure modes applies? This is not a bureaucratic label — it determines which prevention layers were missing and what needs to change.
3. Fault origin and cascade path — Which step was the fault origin? What was in the context window at that step? Trace the cascade path from fault origin to final wrong output. Be specific about step numbers.
4. Impact assessment — What external actions did the agent take after the fault origin? Which were reversible and which were not? What was the user-facing impact? What was the cost impact?
5. Root cause — Was this a model limitation (the base model hallucinated reliably on this input type)? A prompt design flaw (the system prompt didn't constrain entity references)? A tool interface issue (the tool returned ambiguous data)? A data contamination problem (a retrieved document contained misleading content)? One root cause, clearly named.
6. Detection gap — Why didn't you catch this before it reached the user? Was there no eval test for this failure pattern? Was the monitoring insufficient? Was the blast radius larger than expected because there were no guard rails? The detection gap is the most important finding — it drives the actual prevention work.
Agent Postmortem Checklist
- Session trace exported and stored before log rotation
- Fault origin identified by step number
- Failure mode classified (hallucination / tool misfire / context poisoning / cost runaway)
- Full cascade path documented from fault origin to final output
- All external actions after fault origin enumerated
- Reversible vs. irreversible actions identified and remediated
- Root cause documented with one clear category
- Detection gap described: what monitoring would have caught this earlier?
- New eval test added to regression suite for this failure case
- Alert configured for the failure signature (anomaly flag, token spike, or loop pattern)
- Postmortem distributed to platform, product, and on-call teams within 48 hours
The Prevention Layer: Building for Debuggability
Structural choices that make the next incident faster to diagnose and contain
Structural tracing from day one
- Emit a structured trace event at every LLM call boundary — include step number, input token count, model version, output summary, and any tool calls triggered
- Store anomaly flags (entity hallucination, tool arg mismatch, context drift) as structured fields on trace events, not free-form log text
- Retain full session traces for at least 30 days — most agent incidents surface days after the session ran
Budget guardrails
- Set hard token budgets per session type — not just overall limits, but per-step limits that fire an alert when a single LLM call is anomalously expensive
- Implement loop detection: if the same tool is called with near-identical arguments more than three times in a session, pause and escalate
- Wire cost alarms to session-level spend, not just account-level monthly totals — a single runaway session should trigger an alert before its spend reaches double digits
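A per-session budget guard along these lines can be a small object checked around every LLM call. The limit values below are placeholders, not recommendations; tune them per session type.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a session crosses a hard budget; caller should contain."""

class SessionBudget:
    def __init__(self, max_session_tokens=50_000, max_step_tokens=8_000, max_steps=60):
        self.max_session_tokens = max_session_tokens
        self.max_step_tokens = max_step_tokens
        self.max_steps = max_steps
        self.total_tokens = 0
        self.steps = 0

    def record_step(self, input_tokens, output_tokens):
        """Call after each LLM step; raises before the session runs away."""
        step_tokens = input_tokens + output_tokens
        self.total_tokens += step_tokens
        self.steps += 1
        if step_tokens > self.max_step_tokens:
            raise BudgetExceeded(f"step used {step_tokens} tokens (limit {self.max_step_tokens})")
        if self.total_tokens > self.max_session_tokens:
            raise BudgetExceeded(f"session at {self.total_tokens} tokens (limit {self.max_session_tokens})")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"{self.steps} steps (limit {self.max_steps})")
```

Catching `BudgetExceeded` in the agent loop is the natural place to trigger the containment step from the playbook: pause the session, alert, and preserve the trace.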
Entity validation at session entry
- Validate all key entities at the start of every session — customer IDs, order IDs, account references — before any LLM calls run
- Add lightweight classifiers to intermediate LLM outputs that flag when the model references entity IDs not present in the prior context
- Use structured output schemas for steps that require precision — constrain what the model can claim in tool call arguments
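A sketch of that session-entry validation, where `lookup` stands in for whatever resolves an entity ID to its owning account (a database query, a service call):

```python
def validate_entities(entity_ids, account_id, lookup):
    """Fail fast before any LLM call: every entity must exist and
    belong to the account the session is operating on behalf of."""
    errors = []
    for eid in entity_ids:
        owner = lookup(eid)
        if owner is None:
            errors.append(f"{eid}: not found")
        elif owner != account_id:
            errors.append(f"{eid}: belongs to {owner}, not {account_id}")
    return errors
```

Had this run at session entry in the opening incident, the `cust_789`/`ord_555` mismatch would have been rejected before the agent ever reasoned about it.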
Replay eval harness
- Store sessions in a replay-friendly format: all messages, tool definitions, and tool responses captured so you can re-run the session with a patched prompt
- Add every production failure to your eval regression suite — the session trace is the test case
- Run weekly regression evals against your stored failure library to catch prompt or model regressions before they reach users
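A replay check can be as small as re-running a captured session against the current agent and asserting the known-bad behavior is gone. `run_agent` here is a hypothetical callable wrapping your agent loop, fed replayed (not live) tool responses; the session file layout is an assumption.

```python
import json

def replay_failure(session_file, run_agent, forbidden_entities):
    """Replay a stored failure session; fail if any known-hallucinated
    entity ID reappears in the agent's final output."""
    with open(session_file) as f:
        session = json.load(f)
    output = run_agent(
        messages=session["messages"],
        tools=session["tool_definitions"],
        tool_responses=session["tool_responses"],  # replayed, never live
    )
    leaked = [e for e in forbidden_entities if e in output]
    return {"passed": not leaked, "leaked": leaked}
```

Each production incident contributes one such file plus its forbidden-entity list; the weekly regression run is then a loop over the failure library.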
Why can't I reproduce the agent failure in a test environment?
Non-determinism makes exact reproduction rare. Model outputs vary with temperature, context window packing, and API version differences between environments. The goal of agent incident response isn't reproduction — it's evidence collection from the original session trace. The session that failed is your most valuable artifact. Capture it, store it, and analyze it directly rather than trying to recreate it.
My agent uses streaming output — how do I get useful traces for forensics?
Buffer streamed completions to a single trace event before each tool call and at session end. You don't need character-level streaming data for forensics — you need span boundaries. OpenTelemetry-native libraries for LLMs (traceAI, Langfuse, Phoenix) handle this instrumentation with minimal overhead and integrate with existing observability backends.
How do I know which of the 40 calls is the fault origin without reading them all?
Use the backward trace method: start from the final wrong output and ask what information would have to be true for that output to make sense. Then find the earliest step where that information appeared. Anomaly flags on trace events accelerate this significantly — an ENTITY_HALLUCINATION flag on step 3 tells you where to look without manually reviewing 40 events.
When does a context poisoning incident warrant a full postmortem vs. a bug fix?
Escalate to a full postmortem when the agent took an external action, when cost was more than 5x the session baseline, or when the same failure pattern appeared in more than one session. A single informational hallucination that didn't trigger any tool calls and was contained to one session can usually be handled as a bug — add an eval test, update the prompt, move on. The postmortem process is for failures that expose architectural gaps, not for every imperfect output.
Should the agent postmortem be separate from the standard incident review process?
Yes, or at minimum it needs a dedicated section. Standard incident reviews focus on deployment changes, service dependencies, and infrastructure state — none of which are usually relevant for agent failures. Agent postmortems focus on session behavior, prompt design, model characteristics, and eval coverage gaps. Mixing them muddies both analyses. Run the standard review for infrastructure-level impacts, but add an agent-specific section for the behavioral failure analysis.
- [1] Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes (arxiv.org)
- [2] Microsoft Security Blog: New whitepaper outlines the taxonomy of failure modes in AI agents (microsoft.com)
- [3] Red Hat Developer: Distributed tracing for agentic workflows with OpenTelemetry (developers.redhat.com)
- [4] AG2: OpenTelemetry Tracing — Full Observability for Multi-Agent Systems (docs.ag2.ai)
- [5] Galileo: Multi-Agent AI Gone Wrong — How Coordination Failure Creates Hallucinations (galileo.ai)
- [6] Agent Wiki: Common Agent Failure Modes — Catalog of Production Incidents (agentwiki.org)
- [7] DEV.to: When Your AI Agent Has an Incident, Your Runbook Isn't Ready (dev.to)
- [8] Latitude: Detecting AI Agent Failure Modes in Production (latitude.so)