Agent Incident Playbook: Debug LLM Failures | AI Native Builders

The Agent Incident Playbook: Debugging a Failure Across 40 LLM Calls

SRE runbooks assume one process, one stack trace, one bad line. Agent failures are distributed across dozens of reasoning steps — the wrong premise gets laundered through 33 more calls before the user sees it. Here is the taxonomy, the triage, the postmortem.

AI Engineering PlatformadvancedNov 27, 20256 min read

By Viktor Bezdek · VP Engineering, Groupon

2:37am. The order agent returns a confirmation for an order that does not exist. The on-call pulls logs and finds nothing. No stack trace. No 500. No exception. HTTP 200, well-formatted prose, full confidence — confirming the wrong order against the wrong customer account.

The failure is not in your code. It is distributed across 40 LLM calls, where the actual fault happened around step 7 — a small hallucination about a customer identifier — and every subsequent call built confidently on top of the poisoned premise. By the time the user saw the output, that error had been laundered through 33 more reasoning steps.

Traditional SRE asks one question: what crashed? Agentic systems demand a different one: which of these 40 reasoning steps produced the wrong context that cascaded into the wrong action? That is not debugging. That is distributed session forensics.

This playbook maps the four operational failure modes to concrete triage steps, gives you a backward-trace method for locating fault origins inside long sessions, and provides a postmortem template built for systems where non-determinism is the default state, not the anomaly.

Failure modes that demand different triage paths

Hallucination propagation, tool misfire, context poisoning, cost runaway. Misclassify the mode and you investigate the wrong session.

40+

LLM calls in a single complex session

Any one of them is a candidate fault origin. Reading them sequentially is how you waste an hour.

Stack traces produced by silent agent failures

Most incidents return HTTP 200. The agent completed. It did the wrong thing.

10×

Token spend during a runaway loop

Looping agents burn tokens at 10x baseline before the budget alarm fires. Cost is observability.

Your SRE Runbook Was Written for Loud Failures. Agents Fail Quietly.

The mental model designed for crashes does not survive contact with reasoning defects.

Traditional runbooks assume systems that fail loudly. A service crashes. A timeout fires. A null pointer propagates. You find the exception, walk the call stack, identify the line. The bug has a location.

Agent incidents violate every assumption in that model. The agent did not crash. It ran to completion and returned a result. The result was wrong, and nothing in your observability stack knows that. Error rates look normal. Latency is fine. The SLO dashboard is green. The user is filing a ticket.

The failure is closer to a reasoning defect than a code defect. Reasoning defects compound. A wrong assumption at step 7 shapes the framing at step 8, which selects the wrong tool at step 9, which returns data that reinforces the wrong assumption at step 10. By step 20, the agent has constructed a coherent internal narrative that is entirely wrong — and it has the tool call logs to prove it.

Stack trace runbook

Ask: what process crashed?
Read the stack trace
Identify the failing line of code
Reproduce with identical inputs
Fix the deterministic bug
Verify with a unit test

Session forensics playbook

Ask: which step produced the wrong context?
Read the full session trace across every LLM call
Locate the fault origin in the reasoning chain
Accept that exact reproduction is usually impossible
Fix the prompt, the tool interface, or the memory architecture
Verify with a replay eval against the captured session

Four Failure Modes. Different Signals, Different Blast Radii.

Classify before you investigate. The wrong taxonomy sends you down the wrong session.

The agentic AI fault taxonomy literature ^[1] catalogs many failure patterns. For incident response, they collapse into four operationally distinct modes — each with a different signal, a different blast radius, a different triage path.

1. Hallucination propagation

The model generates a false assertion early in the session. Because agents accumulate context across calls, that assertion gets referenced and re-affirmed in later steps. By call 20, the hallucination is an established fact inside the session. The model is not wrong because it is confused. It is wrong because its own earlier output is now evidence.

Signature: confident, coherent, structured output, internally consistent. Built on a false premise.

2. Tool misfire

The model selects the wrong tool, passes malformed arguments, or misinterprets tool output. Unlike hallucination, tool misfire produces real side effects immediately — deleted records, sent emails, processed payments, triggered workflows. The session looks healthy on latency and token counts while causing irreversible damage downstream.

Signature: a tool was called with an unexpected argument pattern, or a tool returned data the model processed without validating against the session's stated goal.

3. Context poisoning

A hallucination or injected content makes it into persistent context — goal state, working memory, retrieved documents. The agent's framing of the entire task warps. Long-running agents are especially exposed because they carry context across many turns and the poisoning compounds ^[5].

Context poisoning differs from hallucination propagation in what it corrupts. Hallucination poisons factual claims about the world. Poisoning corrupts the agent's self-model — what it thinks it is trying to do.

4. Cost runaway

The agent loops: repeated tool calls with similar arguments, infinite retry logic, self-spawning subagents, circular reasoning chains. There is no wrong output in the traditional sense — the agent may never surface a result. The failure is financial and operational. Token consumption compounds silently until a budget alarm fires or the session times out ^[6].

Signature: per-session token counts spike well above baseline. Tool call frequency is abnormally high. The same tool gets called multiple times with near-identical arguments.

How a Single Hallucination Cascades Across an Agent Session

A false entity reference at step 3 poisons the context. Every downstream step compounds the error. Without span tracing, the fault origin is invisible.

First Five Minutes: Contain, Capture, Classify

A triage sequence that holds when the error log is empty and the SLO dashboard is green.

[01]
Contain the session
Kill the active session if the agent is still running. Revoke API credentials if tool misfire is in play. Rate-limit the agent's access to external services until you understand the blast radius. An agent that keeps running while you investigate is an agent that keeps making decisions you have not authorized.
[02]
Capture the session trace
Export the complete trace before logs rotate. Every LLM call with full prompt and completion. Every tool invocation with arguments and response. The full context window state at each step. Capture now. Reconstruct later. Logs that rotated mid-investigation are not coming back.
[03]
Classify the failure mode
Before reading the trace in detail, look at the high-level signals and classify which of the four modes this is. Classification determines the triage path. A cost runaway investigation has nothing in common with a context poisoning investigation. Misclassify and you spend an hour reading the wrong evidence.
[04]
Find the fault origin
Walk the trace backward from the wrong output. Identify the earliest step where the agent's reasoning diverges from what you would expect. That step is the fault origin — not the step that produced the bad output, but the step that first introduced the wrong premise. Everything between origin and final output is cascade.
[05]
Document the blast radius
Enumerate every external action the agent took after the fault origin. For each, decide whether it is reversible. Deleted data, sent communications, processed transactions, modified state — each needs an explicit reversal plan. Log everything. The postmortem will need it.

session-trace.jsonl

// One event per line. Required fields for incident forensics.
// Fault origin lives in the trace, not the output.

{"session_id":"ses_abc123","step":1,"type":"llm_call","model":"claude-3-7-sonnet","input_tokens":847,"output_tokens":312,"ms":1240,"summary":"Customer lookup initiated by account ID"}
{"session_id":"ses_abc123","step":2,"type":"tool_call","tool":"get_customer","args":{"id":"cust_789"},"status":"ok","summary":"Returned Alice Chen, enterprise tier"}
{"session_id":"ses_abc123","step":3,"type":"llm_call","model":"claude-3-7-sonnet","input_tokens":1123,"output_tokens":445,"ms":1890,"summary":"Referenced order_id never present in prior context","flags":["ENTITY_HALLUCINATION"]}
{"session_id":"ses_abc123","step":4,"type":"tool_call","tool":"get_order","args":{"id":"ord_555"},"status":"ok","summary":"Order belongs to cust_012, not cust_789","flags":["ENTITY_MISMATCH"]}
{"session_id":"ses_abc123","step":5,"type":"llm_call","model":"claude-3-7-sonnet","input_tokens":1892,"output_tokens":223,"ms":1540,"summary":"Confirmed order for cust_789 — fault origin was step 3"}

Read the Trace Like Evidence, Not a Story

Locating the fault origin in a trace that is hundreds of events long.

A complex agent run produces a trace hundreds of events long. Reading sequentially from step one is the wrong move. You will spend most of your time on steps that were fine. The correct method is forensic: start from the known bad output and work backwards.

The backward trace method

Take the final wrong output — a hallucinated order confirmation, a malformed API call, an incorrectly processed payment — and ask: what would have to be true for this output to make sense? Then find the step where that information first appeared. That is the fault origin.

In the example trace above, the wrong output is a confirmation linking cust_789 to ord_555. Working backward: the agent called get_order with ord_555 at step 4. Where did ord_555 come from? It appeared in the LLM output at step 3 — an order ID that was never in the conversation context before that point. Entity hallucination. Step 3 is the fault origin.

Anomaly flags as accelerants

If your tracing infrastructure supports it, attach automated anomaly flags to trace events ^[3]. Useful flag types:

ENTITY_HALLUCINATION — the model referenced an entity ID not found in prior context
TOOL_ARG_MISMATCH — a tool was called with arguments that violate its schema
CONTEXT_DRIFT — the agent's stated goal changed between turns without user instruction
LOOP_DETECTED — the same tool was called with near-identical arguments inside one session
ENTITY_MISMATCH — a tool returned data belonging to a different entity than the one being processed

These flags do not catch every failure. They dramatically accelerate the backward trace. In the example above, the ENTITY_HALLUCINATION flag at step 3 points directly to the fault origin without manually comparing 40 events.

Reconstruct the context, not just the output

LLM call summaries tell you what the model said. They do not tell you what it believed. Examine what was actually in the context window at the fault origin step. For agents using structured working memory or tool-call history as context, the question is: what did the model believe to be true when it made the wrong inference? Rebuild the full prompt — system message, conversation history, tool results, working memory — at that exact step. The fault origin is visible in the context, not just the output.

Failure Mode	Primary Signal	First Triage Step	Blast Radius
Hallucination propagation	Confident, coherent output, factually wrong	Locate the first false assertion. Trace backward from wrong output.	Low to medium. Informational unless paired with tool calls.
Tool misfire	Wrong API call or malformed arguments in the tool log	Audit every tool call after the fault origin. Inspect downstream system state.	High. Real-world effects are immediate and irreversible.
Context poisoning	Goal drifts across turns. World model becomes inconsistent.	Find when goal state was overwritten. Inspect injected or retrieved content.	Variable. Depends on how long the session ran after poisoning.
Cost runaway	Token counter spike, repeated tool calls, no final output	Kill the session. Audit total spend. Identify loop entry point.	Financial. No user-visible wrong output. Significant cost exposure.

The Postmortem Template That Actually Maps to Agent Failures

Standard SRE templates assume one root cause and one deployment. Agent failures violate both.

Standard postmortem templates ask deterministic questions. What changed in the deployment? What was the root cause line of code? How did the change ship? Agent postmortems need different questions because the failure is rarely in the deployment. It is in the combination of model behavior, prompt design, tool interfaces, and the specific data that appeared in the session.

We ran our first three agent postmortems on a standard SRE template. All three came out inconclusive. The template demanded "the root cause" — a single line, a single change — and every agent failure we investigated had three or four contributing factors that were each insufficient alone. We renamed the field "contributing factors" instead of "root cause" and the remediation conversations changed shape immediately. Postmortems that end in "add a unit test" produce different follow-up than ones that end in "redesign the entity validation layer."

A useful agent postmortem answers six things:

1. Session context — What was the session trying to do? Provide session ID, time range, total LLM calls, total tokens consumed, plain-language description of intended behavior. Every other finding grounds in this.

2. Failure classification — Which of the four modes applies? Not a bureaucratic label. It determines which prevention layer was missing and what has to change.

3. Fault origin and cascade path — Which step was the fault origin? What was in the context window at that step? Trace the cascade from origin to final wrong output. Specific step numbers, no hand-waving.

4. Impact assessment — What external actions did the agent take after the fault origin? Reversible or not? User-facing impact? Cost impact?

5. Root cause — Model limitation (the base model hallucinates reliably on this input type)? Prompt design flaw (the system prompt did not constrain entity references)? Tool interface issue (the tool returned ambiguous data)? Data contamination (a retrieved document carried misleading content)? One root cause, clearly named.

6. Detection gap — Why did this reach the user? No eval test for the failure pattern? Insufficient monitoring? Blast radius larger than expected because no guard rails? The detection gap is the most important finding. It drives the prevention work. Everything else is description.

Agent Postmortem Checklist

Session trace exported and stored before log rotation
Fault origin identified by exact step number
Failure mode classified — hallucination, tool misfire, context poisoning, cost runaway
Cascade path documented from fault origin to final output
Every external action after fault origin enumerated
Reversible vs. irreversible actions identified and remediated
Root cause named in one specific category, no hand-waving
Detection gap described — what monitoring would have caught this earlier
Eval test added to regression suite for this failure pattern
Alert configured for the failure signature — anomaly flag, token spike, or loop pattern
Postmortem distributed to platform, product, and on-call within 48 hours

Build for Debuggability Before the Next Incident Forces You To

Structural choices that decide whether the next incident takes ten minutes or ten hours.

Structural tracing from day one

✓
Emit a structured trace event at every LLM call boundary — step number, input token count, model version, output summary, tool calls triggered
✓
Use OpenTelemetry spans for multi-agent sessions. Each agent is a child span of the orchestrator span ^[3] ^[4]
✓
Store anomaly flags — entity hallucination, tool arg mismatch, context drift — as structured fields on trace events. Free-form log text is not searchable evidence
✓
Retain full session traces for at least 30 days. Most agent incidents surface days after the session ran

Budget guardrails

✓
Hard token budgets per session type — not just overall caps. Per-step limits that fire an alert when a single LLM call is anomalously expensive
✓
Loop detection: same tool called with near-identical arguments more than three times in a session pauses and escalates
✓
Cost alarms wired to session-level spend, not just account-level monthly totals. A single runaway session must trigger an alert before it hits double digits

Entity validation at session entry

✓
Validate every key entity at session start — customer IDs, order IDs, account references — before any LLM call runs
✓
Add lightweight classifiers to intermediate LLM outputs that flag when the model references entity IDs not present in prior context
✓
Use structured output schemas for steps that require precision. Constrain what the model is allowed to claim in tool call arguments

Replay eval harness

✓
Store sessions in replay-friendly format — every message, tool definition, and tool response captured so you can re-run with a patched prompt
✓
Add every production failure to the eval regression suite. The session trace is the test case
✓
Run weekly regression evals against the failure library to catch prompt or model regressions before they reach users

Why can't I reproduce the agent failure in a test environment?

Non-determinism. Outputs vary with temperature, context window packing, and API version drift between environments. Reproduction is not the goal of agent incident response. Evidence collection from the original session is. The session that failed is your most valuable artifact. Capture it, store it, analyze it directly. Stop trying to recreate it.

My agent uses streaming output — how do I get useful traces for forensics?

Buffer streamed completions to a single trace event before each tool call and at session end. Forensics does not need character-level streaming data. It needs span boundaries. OpenTelemetry-native libraries for LLMs — traceAI, Langfuse, Phoenix — handle this instrumentation with minimal overhead and integrate with existing observability backends.

How do I find the fault origin without reading 40 calls?

Backward trace. Start from the wrong output. Ask what would have to be true for that output to make sense. Find the earliest step where that information first appeared. Anomaly flags accelerate this dramatically — an ENTITY_HALLUCINATION flag at step 3 tells you where to look without manually reviewing 40 events.

When does a context poisoning incident warrant a full postmortem instead of a quick fix?

Escalate to a full postmortem when the agent took an external action, when cost ran more than 5x the session baseline, or when the same failure pattern appeared in more than one session. A single informational hallucination that triggered no tool calls and stayed inside one session is a bug — add an eval, update the prompt, move on. The postmortem process is for failures that expose architectural gaps, not for every imperfect output.

Should agent postmortems be separate from the standard incident review?

Yes — or at minimum, a dedicated section. Standard reviews focus on deployment changes, service dependencies, and infrastructure state. None of that is usually relevant to agent failures. Agent postmortems focus on session behavior, prompt design, model characteristics, and eval coverage. Mixing them muddies both analyses. Run the standard review for infrastructure-level impacts. Add an agent-specific section for the behavioral failure analysis.

One counter: too many separate processes produce postmortem fatigue and the behavioral analysis never gets completed. If that is the pattern, consolidate — but make the agent-specific questions mandatory fields that cannot be left blank before the postmortem closes. An empty fault origin field means the investigation is not done.

4 modes

Classify before you dig — hallucination propagation, tool misfire, context poisoning, and cost runaway each demand a different triage path

Backward first

Walk the trace from wrong output back to fault origin. Forward reading wastes the hour you do not have

Capture early

Export the full session trace before logs rotate. It is the only forensic artifact for a failure that will not reproduce

Detection gap

The most important finding in any postmortem is why this reached the user. That answer drives every structural prevention decision

Key terms in this piece

agent incident playbookdebugging LLM agent failuresagentic AI observabilityAI agent incident responseLLM failure modescontext poisoningagent postmortem

Sources

[1]Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes(arxiv.org)↩
[2]Microsoft Security Blog: New whitepaper outlines the taxonomy of failure modes in AI agents(microsoft.com)↩
[3]Red Hat Developer: Distributed tracing for agentic workflows with OpenTelemetry(developers.redhat.com)↩
[4]AG2: OpenTelemetry Tracing — Full Observability for Multi-Agent Systems(docs.ag2.ai)↩
[5]Galileo: Multi-Agent AI Gone Wrong — How Coordination Failure Creates Hallucinations(galileo.ai)↩
[6]Agent Wiki: Common Agent Failure Modes — Catalog of Production Incidents(agentwiki.org)↩
[7]DEV.to: When Your AI Agent Has an Incident, Your Runbook Isn't Ready(dev.to)↩
[8]Latitude: Detecting AI Agent Failure Modes in Production(latitude.so)↩

Share this article

X LinkedIn Hacker News

The Agent Incident Playbook: Debugging a Failure Across 40 LLM Calls

AI Engineering PlatformadvancedNov 27, 20256 min read

By Viktor Bezdek · VP Engineering, Groupon

// One event per line. Required fields for incident forensics. // Fault origin lives in the trace, not the output. {"session_id":"ses_abc123","step":1,"type":"llm_call","model":"claude-3-7-sonnet","input_tokens":847,"output_tokens":312,"ms":1240,"summary":"Customer lookup initiated by account ID"} {"session_id":"ses_abc123","step":2,"type":"tool_call","tool":"get_customer","args":{"id":"cust_789"},"status":"ok","summary":"Returned Alice Chen, enterprise tier"} {"session_id":"ses_abc123","step":3,"type":"llm_call","model":"claude-3-7-sonnet","input_tokens":1123,"output_tokens":445,"ms":1890,"summary":"Referenced order_id never present in prior context","flags":["ENTITY_HALLUCINATION"]} {"session_id":"ses_abc123","step":4,"type":"tool_call","tool":"get_order","args":{"id":"ord_555"},"status":"ok","summary":"Order belongs to cust_012, not cust_789","flags":["ENTITY_MISMATCH"]} {"session_id":"ses_abc123","step":5,"type":"llm_call","model":"claude-3-7-sonnet","input_tokens":1892,"output_tokens":223,"ms":1540,"summary":"Confirmed order for cust_789 — fault origin was step 3"}

The backward trace method

Anomaly flags as accelerants

If your tracing infrastructure supports it, attach automated anomaly flags to trace events ^[3]. Useful flag types:

ENTITY_HALLUCINATION — the model referenced an entity ID not found in prior context
TOOL_ARG_MISMATCH — a tool was called with arguments that violate its schema
CONTEXT_DRIFT — the agent's stated goal changed between turns without user instruction
LOOP_DETECTED — the same tool was called with near-identical arguments inside one session
ENTITY_MISMATCH — a tool returned data belonging to a different entity than the one being processed

Reconstruct the context, not just the output

Failure Mode

Primary Signal

First Triage Step

Blast Radius

Hallucination propagation

Confident, coherent output, factually wrong

Locate the first false assertion. Trace backward from wrong output.

Low to medium. Informational unless paired with tool calls.

Tool misfire

Wrong API call or malformed arguments in the tool log

Audit every tool call after the fault origin. Inspect downstream system state.

High. Real-world effects are immediate and irreversible.

Context poisoning

Goal drifts across turns. World model becomes inconsistent.

Find when goal state was overwritten. Inspect injected or retrieved content.

Variable. Depends on how long the session ran after poisoning.

Cost runaway

Token counter spike, repeated tool calls, no final output

Kill the session. Audit total spend. Identify loop entry point.

Financial. No user-visible wrong output. Significant cost exposure.

The Agent Incident Playbook: Debugging a Failure Across 40 LLM Calls

Your SRE Runbook Was Written for Loud Failures. Agents Fail Quietly.

Four Failure Modes. Different Signals, Different Blast Radii.

First Five Minutes: Contain, Capture, Classify

Contain the session

Capture the session trace

Classify the failure mode

Find the fault origin

Document the blast radius

Read the Trace Like Evidence, Not a Story

The Postmortem Template That Actually Maps to Agent Failures

Agent Postmortem Checklist

Build for Debuggability Before the Next Incident Forces You To

Structural tracing from day one

Budget guardrails

Entity validation at session entry

Replay eval harness

Related

The Agent Incident Playbook: Debugging a Failure Across 40 LLM Calls

Your SRE Runbook Was Written for Loud Failures. Agents Fail Quietly.

Four Failure Modes. Different Signals, Different Blast Radii.

First Five Minutes: Contain, Capture, Classify

Contain the session

Capture the session trace

Classify the failure mode

Find the fault origin

Document the blast radius

Read the Trace Like Evidence, Not a Story

The Postmortem Template That Actually Maps to Agent Failures

Agent Postmortem Checklist

Build for Debuggability Before the Next Incident Forces You To

Structural tracing from day one

Budget guardrails

Entity validation at session entry

Replay eval harness

Related