AI Native Builders

The Agentic Incident Runbook: Triage When the Stack Trace Is Clean

Most agent failures return HTTP 200. Here is the triage runbook, failure mode field guide, and postmortem template built for non-deterministic agentic systems.

AI Engineering Platform · Advanced · Apr 15, 2026 · 7 min read
By Viktor Bezdek · VP Engineering, Groupon
The hardest agent incidents to debug are the ones where everything looks fine.

Your loan-eligibility agent returned HTTP 200 at 2:14am. No exceptions. No alerts fired. The income-verification tool logged status: ok. Over the next six hours, the agent rejected 847 valid loan applications — because a third-party API had silently renamed a JSON field overnight. The LLM received a response key it hadn't seen before and, rather than surfacing an error, hallucinated null for income and declined every pending application. [8]

This is the agentic incident triage problem in its purest form: the infrastructure was healthy, the observability stack reported green across the board, and the failure was entirely semantic — distributed across a reasoning chain where one wrong interpretation at step 3 shaped every subsequent decision, 40 calls deep.

This runbook gives you four things: a failure mode taxonomy for classifying incidents before you touch a trace, a minimum viable observability setup that catches silent failures before users do, a five-step triage sequence that works when the error log is empty, and a postmortem template designed for systems where non-determinism is architectural, not accidental.

Traditional SRE runbooks ask "what crashed?" Agentic ones ask a different question: which step produced the wrong context, and how far did it travel before anyone noticed?

  • 73% of runs hallucinate on at least one step for agent tasks involving 5 or more tool calls, per Stanford HAI Q4 2025 evaluation analysis reported by Markaicode. [1]

  • 71% of agent errors originate in steps 1–2: not distributed across all 40 calls, but front-loaded into query reformulation and initial retrieval, per AgentBench 2025 analysis. [1]

  • 37-point gap between observability and eval coverage: 89% of teams running agents in production have observability, but only 52% have eval pipelines that validate behavioral quality. [2]

  • HTTP 200 is the typical status code for a silent agent failure: most agent failures complete successfully at the infrastructure level, and the wrong output is a semantic problem, invisible to latency dashboards.

The Problem With Green Dashboards

Why traditional observability misses the failure class that matters most in agentic systems

The most dangerous agent failures are the ones your monitoring stack calls successes.

Traditional observability — latency histograms, error rates, CPU utilization — was built for deterministic systems where failure means the process exited nonzero. Agents break this model. A successful HTTP 200 from an LLM API is not evidence of correct behavior. A tool call that returned valid JSON is not evidence the agent interpreted it correctly. A completed session with a formatted response is not evidence the response was accurate. [2]

The canonical failure pattern looks like this: a third-party tool changes its response schema. No exception is raised — the tool still returns valid JSON. The LLM receives a field name it hasn't seen in training and, rather than surfacing an error, generates a plausible value from prior context. That plausible-but-wrong value enters the session state, gets referenced in the next three calls, and by the time the user sees the output, the reasoning chain has been built on a hallucinated foundation. None of this appears in your Datadog dashboard. [8]

Across teams running agents in production in early 2026, 89% had implemented some form of agent observability — but only 52% had eval pipelines that actually validate whether the agent's behavior was correct, not just syntactically complete. [2] The 37-point gap between "we have traces" and "we catch behavioral failures" is where most production incidents live.

Here is the counterintuitive part: the model itself is rarely the root cause. Most production agent incidents trace back to boundary failures — a tool returned partial JSON, retrieval pulled the wrong chunk, a planner got stuck looping, or an API contract changed without notice. [4] If the LLM's reasoning is actually the root cause, your team got unlucky. If the tool interface failed, that's just production.

Traditional SRE triage
  • Alert fires on 5xx error or p99 latency spike

  • Read the exception message and stack trace

  • Identify the failing line of code

  • Reproduce deterministically in staging

  • Fix the bug, write the unit test, deploy

  • Close when error rate returns to baseline

Agentic incident triage
  • Alert fires on semantic anomaly, cost spike, or user report — often hours after failure

  • Find the session trace; no exception exists to read

  • Identify the fault origin step inside the reasoning chain

  • Accept that exact reproduction is probabilistically unlikely

  • Fix prompt, tool schema contract, or memory architecture

  • Close after regression eval catches this pattern in future sessions

Why You Should Check Steps 1–3 First

The counterintuitive truth about where agent errors actually originate in a long session

Everyone assumes that in a 40-call agent session, failures are distributed proportionally across all steps. The evidence points the other way.

Across 1,200 logged agent runs with verified hallucinations, analysis drawn from multiple benchmark evaluations found that 71% of errors were introduced in the first two steps — typically during query reformulation or initial retrieval. [1] The final output looks like a complex multi-step failure. Tracing backward, the root cause is almost always an error in how the task was understood or how the first retrieval was executed.

This has a practical implication for triage: when you're looking at a 40-step session, start at steps 1–3. Check what the agent understood the user to be asking. Check what came back from the first retrieval or tool call. Check whether the initial plan was sound. If those steps look correct, the problem is less likely to be a fundamental reasoning failure and more likely to be a specific tool boundary issue later in the session.

Hallucination probability also scales nonlinearly with tool call count. Across agentic benchmark evaluations, the probability of at least one hallucination in a run climbs from roughly 12% at 2 tool calls to 67% at 10, to above 85% at 15 or more. [1] This isn't a reason to avoid complex agents. It is a reason to treat any agentic workflow requiring 10+ tool calls as production-critical infrastructure requiring dedicated semantic observability, not just standard LLM monitoring.

One thing worth being honest about: these benchmark numbers come from controlled evaluation environments, which are cleaner than real production data with noisy inputs, inconsistent schemas, and retrieval systems that drift over time. Real-world hallucination rates in enterprise deployments are likely higher. The directional insight — errors are front-loaded, and complexity compounds risk nonlinearly — holds regardless of the precise numbers.

Minimum Viable Observability for Production Agents

The four specific instrumentation capabilities that make agentic incidents debuggable in under an hour

Standard LLM logging — latency, token count, API errors — is necessary but not sufficient for agentic incident response. The minimum viable observability stack for production agents captures four things that generic LLM monitoring misses entirely.

Session-level correlation. Every LLM call, every tool invocation, and every state mutation in a session must carry the same session_id. Without this, a 40-step session becomes 40 disconnected events in your log store. The session ID is the single most important field you can add — and the one most commonly missing from first-generation agent deployments.
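
One low-friction way to get this correlation (a sketch, assuming Python and one logical session per execution context) is a contextvars-based session ID that every wrapper reads, instead of threading an argument through every function signature:

```python
import contextvars
import uuid

# One session_id per logical session, visible to any code running in
# the same context -- LLM wrapper, tool wrapper, structured logger.
SESSION_ID: contextvars.ContextVar[str] = contextvars.ContextVar("session_id")

def start_session() -> str:
    """Mint the session ID exactly once, at the session boundary."""
    session_id = str(uuid.uuid4())
    SESSION_ID.set(session_id)
    return session_id

def current_session_id() -> str:
    """Read the session ID anywhere downstream without passing it around."""
    return SESSION_ID.get()
```

Every span-emitting helper then calls `current_session_id()` when setting its `session.id` attribute, so a forgotten parameter can never produce an uncorrelated event.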

Tool call payloads at every step. Not just "tool was called with status 200" — the actual input arguments and the actual response payload, including the specific field structure. Schema drift detection (spotting when a tool response has different field names than expected) is one of the highest-signal automated checks you can run. It caught the loan agent incident described above in under two minutes on replay — the TOOL_SCHEMA_DRIFT flag on step 3 pointed directly to the renamed field. [5]

Step-level anomaly flags. Structured fields on trace events that mark specific failure patterns as they occur: ENTITY_HALLUCINATION when the model references an entity ID not found in prior context, TOOL_SCHEMA_DRIFT when a tool response doesn't match its documented schema, LOOP_DETECTED when the same tool is called with near-identical arguments three or more times in a session. [6] These flags don't catch everything — but they reduce the backward trace from "read 40 events manually" to "check the flagged steps first."

Session outcome classification. Every session should end in one of a small number of labeled outcomes: Completed, Tool Error, Bad Output, Timeout, Budget Exceeded. The label enables cohort queries — "show me all Bad Output sessions from the past week" — that individual trace inspection cannot. [4]
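
The outcome set above maps naturally onto a small enum plus a classifier run once at session end. This is a sketch with assumed session-record field names (`tokens_used`, `timed_out`, `tool_calls`, `eval_failed` are illustrative, not a fixed schema):

```python
from enum import Enum

class SessionOutcome(Enum):
    COMPLETED = "completed"
    TOOL_ERROR = "tool_error"
    BAD_OUTPUT = "bad_output"
    TIMEOUT = "timeout"
    BUDGET_EXCEEDED = "budget_exceeded"

def classify_outcome(session: dict) -> SessionOutcome:
    """Label a finished session. Order matters: infrastructure failures
    are checked first, semantic failures last."""
    if session.get("tokens_used", 0) >= session.get("token_budget", float("inf")):
        return SessionOutcome.BUDGET_EXCEEDED
    if session.get("timed_out"):
        return SessionOutcome.TIMEOUT
    if any(call.get("status") == "error" for call in session.get("tool_calls", [])):
        return SessionOutcome.TOOL_ERROR
    if session.get("eval_failed"):  # verdict from a semantic eval, if you run one
        return SessionOutcome.BAD_OUTPUT
    return SessionOutcome.COMPLETED
```

With the label stored on the trace, the cohort query "all Bad Output sessions this week" becomes a single filter instead of a manual trace review.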

agent_tracing.py
from opentelemetry import trace
from anthropic import Anthropic
import json

# Assumes ANTHROPIC_API_KEY is set in the environment.
client = Anthropic()
tracer = trace.get_tracer("agent.session")

# entity_ids_not_in_prior_context, execute_tool, validate_tool_schema, and
# expected_fields are project-specific helpers, not library functions.

# Wrap each LLM call with OpenTelemetry gen_ai.* semantic conventions
def traced_llm_call(step: int, session_id: str, model: str, messages: list):
    with tracer.start_as_current_span(f"agent.llm.step.{step}") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("session.id", session_id)
        span.set_attribute("agent.step", step)

        response = client.messages.create(model=model, messages=messages)

        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)

        # Anomaly flag: entity ID in response not found in any prior session message
        if entity_ids_not_in_prior_context(response.content, messages):
            span.set_attribute("agent.anomaly", "ENTITY_HALLUCINATION")

        return response


# Capture tool call args, response, and schema drift as first-class span data
def traced_tool_call(step: int, session_id: str, tool_name: str, args: dict) -> dict:
    with tracer.start_as_current_span(f"agent.tool.{tool_name}") as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("session.id", session_id)
        span.set_attribute("agent.step", step)
        span.set_attribute("tool.input", json.dumps(args))

        result = execute_tool(tool_name, args)

        # Schema drift: response field names differ from the documented schema
        if not validate_tool_schema(tool_name, result):
            span.set_attribute("agent.anomaly", "TOOL_SCHEMA_DRIFT")
            span.set_attribute("tool.unexpected_fields",
                ",".join(set(result.keys()) - expected_fields(tool_name)))

        span.set_attribute("tool.output_fields", ",".join(result.keys()))
        return result

Required trace fields per LLM call

  • session_id — identical value for every event in the session

  • agent.step — sequential integer starting at 1

  • gen_ai.request.model — exact model version string, not just the family name

  • gen_ai.usage.input_tokens and gen_ai.usage.output_tokens

  • agent.anomaly — structured flag field, null if clean, typed string if flagged

  • step_summary — one-sentence plain-language description of what the model concluded

Required trace fields per tool call

  • session_id and agent.step — matching the parent LLM call

  • tool.name and tool.input — full args payload, redacted if sensitive

  • tool.status — one of: ok, error, empty, schema_drift

  • tool.output_fields — comma-separated field names from response schema

  • tool.latency_ms — high latency is a leading indicator for retry loops

The Triage Decision Tree

From incident alert to classified failure mode in under five minutes

Before reading any trace in detail, classify the failure. Classification determines which 20% of the trace you actually need to examine. Getting this wrong wastes the first 30 minutes of every agent incident investigation.

The decision tree below encodes the four operationally distinct failure modes — not because the taxonomy is academically satisfying, but because each one has a different blast radius and a different first action. A cost runaway incident needs a session kill order before forensics begins. A tool misfire needs a downstream system audit before forensics begins. Starting forensics without classification means you might spend an hour debugging the wrong thing while the agent continues to affect production systems.

Agent Incident Triage Decision Tree
Classify the failure mode before reading the trace. Each branch leads to a different first action — and a different blast radius.
  1. Contain the session

    Kill or pause the active agent process before investigating. An agent that is still running can continue taking external actions — sending messages, processing transactions, deleting records — while you are reading the trace. Containment is not investigation. Do it first, even if it means interrupting a session that might have self-corrected.

  2. Capture the complete session trace

    Export the full trace before logs rotate. You need every LLM call with its complete prompt and completion, every tool invocation with arguments and response, and the session state at each step. The session trace is your only forensic artifact — unlike deterministic software incidents, you cannot reliably reproduce an agent failure in a test environment. Capture now, reconstruct later.

  3. Classify the failure mode using high-level signals

    Before reading any individual trace step, look at four high-level signals to classify the incident. This takes two minutes and determines whether you start with forensics, a downstream system audit, or a cost containment call. Do not skip this step — the triage path for a cost runaway is completely different from the triage path for a hallucination propagation incident.

  4. Find the fault origin using the backward trace method

    Do not read the session trace sequentially from step 1. Start at the final wrong output and work backward. Ask: what information would have to be true for this output to make sense? Then find the earliest step where that information first appeared. That step is your fault origin — not the step that surfaced the wrong output, but the step that introduced the wrong premise everything downstream was built on.

  5. Assess blast radius and initiate remediation

    List every external action the agent took after the fault origin step. For each one, determine whether it is reversible. Sent communications, processed payments, deleted records, and modified state all need explicit reversal plans — and all of it needs to be documented before the postmortem. This is also where you check whether the agent triggered subagents or chained to other workflows that need their own containment assessment.

Failure Mode Field Guide

Quick-reference for classifying and triaging the four agentic failure modes

  • Hallucination Propagation. Primary signal: confident, coherent output that is factually wrong; no external action errors. Detection: ENTITY_HALLUCINATION anomaly flag, or backward trace from the final wrong output. Blast radius: low to medium — informational unless paired with downstream tool calls.

  • Tool Misfire. Primary signal: wrong API called, malformed arguments, or correct API called on the wrong entity. Detection: tool call audit log shows an unexpected argument pattern, or the TOOL_SCHEMA_DRIFT flag. Blast radius: high — real-world effects may be immediate and irreversible.

  • Context Poisoning. Primary signal: the agent's stated goal drifts across turns without user instruction; inconsistent world model. Detection: CONTEXT_DRIFT anomaly flag; compare the agent's stated objective at step 1 vs. step 20. Blast radius: variable — depends on how many actions the agent took after the goal state was corrupted.

  • Cost Runaway. Primary signal: token counter spikes well above session baseline; same tool called repeatedly; no final output. Detection: LOOP_DETECTED flag or token budget alarm before the hard kill; same-tool call frequency spike. Blast radius: financial — no user-visible wrong output, but significant cost exposure per session.

The Agent Postmortem Template

Why standard SRE postmortems fail for agentic incidents — and what to replace them with

The first time we ran an agent postmortem using a standard SRE incident template, the conclusion read: "root cause: LLM hallucination." That is about as useful as writing "root cause: gravity" for a structural failure.

Standard postmortem templates ask for a single root cause — one line, one changeset, one decision that went wrong. Agent failures rarely cooperate with this framing. Every significant agent incident we investigated had at least three contributing factors, each insufficient on its own: a tool response schema changed without notice, a prompt that didn't constrain entity references, and an eval suite with no test coverage for this failure class. Fixing any one of these alone would not have prevented the incident. The gravitational pull toward singular root causes mislabels the problem and produces narrow fixes that prevent the exact failure without addressing the class of failure.

The agent postmortem template replaces "root cause" with "contributing factors" and requires the team to name all of them before closing the incident.

1. Session context — Provide the session ID, time range, total LLM calls, total tokens consumed, and a plain-language description of the intended session behavior. This grounds every other finding in concrete evidence rather than generalized claims about model behavior.

2. Failure classification — Which of the four failure modes applies. This is not a bureaucratic label — it determines which prevention layers were absent and which architectural change is actually needed.

3. Fault origin and cascade path — Which step number introduced the wrong premise. What was in the full context window at that step. The cascade path from fault origin to final wrong output, using specific step numbers — not "around step 10" but "step 7."

4. Impact assessment — Every external action taken after the fault origin, classified as reversible or irreversible, with a concrete remediation plan for each irreversible one.

5. Contributing factors — Name all three or four. Model limitation? Prompt design gap? Tool interface contract change? Retrieval contamination? List each one explicitly without collapsing them into a single cause.

6. Detection gap — Why didn't observability catch this before users were affected? Missing anomaly flag? No eval coverage for this failure pattern? Blast radius larger than expected because there were no guard rails on the tool call? The detection gap is the most important finding in the entire postmortem — it is the only one that drives concrete infrastructure improvement rather than a one-off prompt change.

Agent Postmortem Checklist

  • Session trace exported and stored before log rotation

  • Fault origin identified by specific step number

  • Failure mode classified: hallucination propagation / tool misfire / context poisoning / cost runaway

  • Full cascade path documented from fault origin to final wrong output

  • All external actions after fault origin enumerated and classified reversible vs. irreversible

  • Rollback or remediation initiated and logged for all irreversible actions

  • All contributing factors listed explicitly — no single root cause framing

  • Detection gap described: what monitoring or eval would have caught this earlier?

  • New eval regression test added covering this session failure pattern

  • Alert or anomaly flag configured for this failure signature

  • Postmortem distributed to platform and on-call teams within 48 hours

Anomaly Flags That Pay for Themselves

Five structured checks that transform agent debugging from 40-event manual review to targeted forensics

Anomaly Flag Specifications

ENTITY_HALLUCINATION

Flag when a model response references an entity ID — customer ID, order ID, account number, document reference — not found in any prior message or tool result in the session. Set at LLM call boundaries. High precision, low false positive rate when entity ID formats are consistent. The most commonly actionable flag in production agent forensics.
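
As a sketch of this check, assuming entity IDs follow consistent prefixed formats (the `CUST`/`ORD`/`ACCT` pattern below is hypothetical), a regex scan plus a set difference against prior context is enough:

```python
import re

# Hypothetical ID pattern; a real deployment registers one pattern per
# entity type actually emitted by its tools.
ENTITY_ID = re.compile(r"\b(?:CUST|ORD|ACCT)-\d{4,}\b")

def hallucinated_entities(response_text: str, prior_context: list[str]) -> set[str]:
    """IDs referenced in the response that never appeared in any prior
    message or tool result -- candidates for the ENTITY_HALLUCINATION flag."""
    seen: set[str] = set()
    for text in prior_context:
        seen.update(ENTITY_ID.findall(text))
    return set(ENTITY_ID.findall(response_text)) - seen
```

A non-empty return sets the flag on the LLM span; precision stays high as long as the ID formats are consistent, which is exactly the condition the flag specification calls out.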

TOOL_SCHEMA_DRIFT

Flag when a tool response contains field names not present in the tool's documented schema from the last validated session. Hash the response key set and compare against the expected key set per tool. High signal for API contract changes. Fires before a hallucination has time to propagate — catching it at the tool call boundary rather than in the model output.
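
A minimal version of the key-set comparison, assuming you maintain a registry of documented field names per tool (the registry shape below is an assumption, not a standard):

```python
def schema_drift(tool_name: str, response: dict, expected: dict[str, set[str]]) -> set[str]:
    """Field names in the tool response that are absent from the tool's
    documented key set. A non-empty result means TOOL_SCHEMA_DRIFT fires
    at the tool boundary -- before the model can hallucinate around the
    unfamiliar field."""
    return set(response.keys()) - expected.get(tool_name, set())
```

In the loan-agent incident, this check would have returned the renamed field on the step-3 response, pinpointing the fault origin on replay.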

LOOP_DETECTED

Flag when the same tool is called with arguments that hash to within 90% similarity more than twice in a single session. Precursor flag for cost runaway and context poisoning. Fires early enough to pause-and-escalate before token spend compounds. The threshold of two identical-ish calls is intentionally low — legitimate agents rarely need to re-run the same query three times.
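
One way to approximate the near-identical-arguments check, here using difflib.SequenceMatcher over serialized arguments rather than hashing — an implementation choice for the sketch, not a prescription:

```python
import json
from difflib import SequenceMatcher

def loop_detected(calls: list[tuple[str, dict]],
                  threshold: float = 0.9, limit: int = 2) -> bool:
    """True when any group of >limit near-identical (tool, args) calls
    appears in one session. O(n^2) over session calls, which is fine at
    typical session lengths; production systems might hash or minhash
    arguments instead."""
    serialized = [(tool, json.dumps(args, sort_keys=True)) for tool, args in calls]
    for tool_i, args_i in serialized:
        similar = sum(
            1 for tool_j, args_j in serialized
            if tool_j == tool_i
            and SequenceMatcher(None, args_i, args_j).ratio() >= threshold
        )
        if similar > limit:  # a call counts itself, so 3 near-identical calls trips it
            return True
    return False
```

The third near-identical call trips the flag, matching the spec's deliberately low threshold.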

CONTEXT_DRIFT

Flag when the model's stated objective in its chain-of-thought output diverges from the original system prompt task by more than a threshold semantic distance. Requires embedding comparison — more expensive than the other flags. Run it at session mid-point checkpoints rather than every step to keep overhead manageable. Misses some context poisoning cases but catches the severe ones early.
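
A sketch of the checkpoint comparison, with `embed` left as a placeholder for whatever sentence-embedding model you already run; the 0.7 cutoff is an assumed starting point to tune per deployment, not a recommendation:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; assumes nonzero embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def context_drift(original_task: str, stated_objective: str,
                  embed, threshold: float = 0.7) -> bool:
    """True when the agent's stated objective at a checkpoint has drifted
    semantically from the original system prompt task. `embed` is any
    text -> vector function; run at session mid-point, not every step."""
    return cosine(embed(original_task), embed(stated_objective)) < threshold
```

Because the check takes the embedding function as a parameter, the same code works whether you use a local model or an API-hosted one.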

BUDGET_WARNING

Soft warning at 60% of the session token budget, separate from the hard kill at 100%. The purpose is intervention time — gives the platform team 40% of the remaining budget to investigate and pause before the session terminates abruptly. An agent that is killed at the hard budget limit produces an incomplete session trace that is harder to debug than one that was paused during BUDGET_WARNING.
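
The two-threshold scheme reduces to a few lines; the 60% soft threshold is the article's figure, kept here as a tunable default:

```python
def budget_state(tokens_used: int, budget: int, warn_fraction: float = 0.6) -> str:
    """Soft warning at warn_fraction of the session token budget, hard
    kill at 100%. Returns 'ok', 'BUDGET_WARNING', or 'BUDGET_EXCEEDED'.
    The warning state is the window for pausing a session while its
    trace is still complete enough to debug."""
    if tokens_used >= budget:
        return "BUDGET_EXCEEDED"
    if tokens_used >= warn_fraction * budget:
        return "BUDGET_WARNING"
    return "ok"
```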

When does an agent incident warrant a full postmortem versus a bug fix?

Escalate to a full postmortem when the agent took an irreversible external action, when per-session cost exceeded 5x the baseline, or when the same failure pattern appeared in more than one session. A single hallucination that produced wrong text but didn't trigger any tool calls — and was contained to one session — can be handled as a bug: add an eval test, update the prompt, move on. The postmortem process is for failures that expose architectural gaps, not for every imperfect output. The practical test: if the detection gap finding would require changing something in the infrastructure (anomaly flag, schema validation, guard rail) rather than just a prompt edit, it needs a postmortem.
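
The practical test at the end of that answer can be encoded directly, as a hypothetical helper taking the four escalation conditions as inputs:

```python
def needs_postmortem(irreversible_action: bool, cost_ratio: float,
                     recurrences: int, infra_change_needed: bool) -> bool:
    """Full postmortem when any escalation condition holds: an
    irreversible external action, per-session cost at 5x baseline or
    more, the same failure pattern in more than one session, or a fix
    that requires infrastructure change rather than a prompt edit."""
    return (irreversible_action
            or cost_ratio >= 5.0
            or recurrences > 1
            or infra_change_needed)
```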

Can I use my existing APM tool (Datadog, Grafana, New Relic) for agent observability?

These tools handle infrastructure-level metrics — latency, error rates, cost per session — well. They don't handle behavioral correctness: whether the agent chose the right tool, whether its output was factually accurate, or whether its goal remained consistent across turns. LLM observability is fundamentally semantic, not syntactic. A successful HTTP 200 from an LLM API can contain a hallucinated fact, and no latency graph detects it. For behavioral correctness, you need OpenTelemetry spans with gen_ai.* attributes plus a semantic eval layer — tools like Langfuse, Arize Phoenix, or a custom eval harness that runs assertions against session traces. The existing APM stack stays for infrastructure; it's not a substitute for session-level behavioral monitoring.

How do I instrument an agent built on a third-party framework like LangGraph, LlamaIndex, or AG2?

Most major frameworks now ship native OpenTelemetry support. AG2 has built-in OTel tracing that captures agent turns, LLM calls, tool executions, and speaker selections as structured spans connected by a shared trace ID, exportable to any OTel-compatible backend. LlamaIndex and LangChain support OpenInference instrumentation, which follows the OpenTelemetry GenAI semantic conventions. The minimum you need is a trace per session with step numbers, gen_ai.* attributes on LLM calls, and tool call input/output payloads. If your framework doesn't emit these natively, add a thin wrapper on the LLM call boundary — it is two functions and less than 50 lines of code.

What is the minimum instrumentation to add before shipping an agent to production for the first time?

Three things, in priority order. First, a session correlation ID on every LLM call and every tool call — without this, you cannot reconstruct what happened in a specific session, full stop. Second, tool call input and output logging at every step — most production agent failures trace to a tool interface problem, and you cannot debug it without the actual payloads. Third, a session outcome label that resolves when the session ends: Completed, Tool Error, Bad Output, Timeout, Budget Exceeded. With these three in place, you can debug most production incidents. Anomaly flags, semantic evals, and replay harnesses are high-leverage additions — but without correlation IDs, tool payloads, and outcome labels, your first production incident will be undebuggable regardless of how sophisticated the rest of your stack is.

  • Steps 1–3: 71% of agent errors originate in the first two reasoning steps — check the start of the trace before reading anything else.

  • Backward first: work from the final wrong output backward to the fault origin — reading forward through 40 steps wastes time on clean steps.

  • HTTP 200 ≠ success: most agent failures complete with a clean status code — behavioral correctness requires semantic validation, not syntactic monitoring.

  • Boundary failures: most production agent incidents trace to tool interface problems — schema drift, empty returns, auth failures — not model reasoning failures.

On the benchmark statistics cited in this article

The hallucination probability curves (73% at 5+ tool calls, 71% front-loaded errors) come from analysis across controlled benchmark evaluations — AgentBench 2025, HELM Agentic Evaluation, and Stanford HAI Q4 2025 data — as reported by Markaicode (Feb 2026) [1]. The 89%/52% observability gap comes from Tianpan.co's systematic debugging article (Feb 2026) [2], citing early 2026 survey data. Real-world production rates depend heavily on input noise, schema consistency, and retrieval quality. Treat these as directional benchmarks, not engineering thresholds you can cite in an SLO.

Sources
  [1] Debugging Hallucinations: New Tools for Tracing Agent Logic — Markaicode (Feb 2026) (markaicode.com)
  [2] Systematic Debugging for AI Agents: From Guesswork to Root Cause — Tianpan.co (Feb 2026) (tianpan.co)
  [3] The Complete Guide to Debugging AI Agents in Production — Latitude (Mar 2026) (latitude.so)
  [4] Debugging AI Agent Failures in Production — Warpmetrics (Feb 2026) (warpmetrics.com)
  [5] Distributed Tracing for Agentic Workflows with OpenTelemetry — Red Hat Developer (Apr 2026) (developers.redhat.com)
  [6] AI Agent Observability — Evolving Standards and Best Practices — OpenTelemetry (2025) (opentelemetry.io)
  [7] AG2 OpenTelemetry Tracing: Full Observability for Multi-Agent Systems — AG2 (Feb 2026) (docs.ag2.ai)
  [8] AI Agent Observability: Tracing & Debugging LLM Agents in Production — Md Sanwar Hossain (Mar 2026) (mdsanwarhossain.me)