89% of teams have observability tooling. 62% can map a trace to a failure cause. Seven failure modes grounded in H1 2026 incident data — each with distinct OTel trace signatures and an LLM classifier that routes the incident before the postmortem.
89% of teams running agentic systems have observability tooling installed. 62% can inspect what their agents do at each individual step.[1] That 27-point gap — teams who monitor but can't map traces to causes — isn't a tooling problem. It's a classification problem. The trace exists. The failure mode does not.
When your on-call runbook says 'tool_call_error_rate > 5%: check tool logs', it describes a symptom. It doesn't name a cause. Tool invocation failure means something different from context decay, which means something different from plan drift, which means something different from scope overreach. Each mode has a different trace signature. Each implicates a different remediation layer. Treating them all as variants of 'the agent did something wrong' is why the same class of failure keeps showing up under different surface presentations.
AgentEval, a DAG-structured evaluation framework piloted on 12,847 traces across 18 engineers, measured median root-cause identification time at 4.2 hours without taxonomy-driven triage, and 22 minutes with it.[2] The taxonomy didn't prevent failures. It prevented the same failure from costing 4 hours the second time.
Seven operationally distinct failure modes covering the agentic incident space, grounded in H1 2026 data
Exact OTel span attributes to watch for each mode, including gen_ai.response.finish_reasons, step count ratios, and token monotonicity signals
AgentRx's nine-category taxonomy (Microsoft Research, 115 annotated trajectories) mapped to these seven operational modes
A runnable Python classifier that routes six of seven modes from span exports alone
Schema validation at startup — the structural fix for the most common mode
A four-phase triage pipeline with specific week-by-week implementation targets
A readiness checklist your on-call team can run against current tooling today
Distributed tracing solves context propagation. Failure classification is a different problem — and the one that determines where you fix.
A clean span tree tells you that every step executed. It doesn't tell you whether the right objective was pursued, whether the tool schema was current, whether the context window was intact, or whether the agent operated within its authorized scope.
The failure mode matters because it determines the remediation layer. Tool invocation failure traces to schema validation at startup, not prompt tuning. Context decay traces to context management architecture, not model capability. Scope overreach traces to permission design, not model alignment. When on-call engineers treat every agent failure as a single category — 'agent behavior issue' — the fix targets the most visible layer, which is usually the least causal one.
The Clyro analysis of 591 documented agent incidents from 2023 through early 2026 found that 88% traced to infrastructure gaps: missing permission checks, no execution bounds, no context validation, no quality monitoring. The model worked correctly in the majority of classified failures — it operated on bad inputs inside an inadequate governance structure. Model-focused remediation addressed roughly 12% of underlying causes.[3]
AgentFixer (IBM Research, arXiv:2603.29848, 2026) applied fifteen failure-detection tools and two root-cause analysis modules across input handling, prompt design, and output generation. Applied to IBM's CUGA agent on the AppWorld and WebArena benchmarks, it surfaced systemic weaknesses — controller indexing errors, planner misalignments, schema non-compliance — that neither standard logging nor manual review had caught.[11] The lesson: failure visibility requires structured detection, not more log volume.
591-incident analysis (2023–2026) — Clyro, April 2026[3]
Down from 4.2 hours without mode classification — AgentEval pilot, 12,847 traces[2]
Survey of 1,300+ AI professionals, 2026[1]
115 annotated trajectories across τ-bench, Flash, Magentic-One — Microsoft Research[9]
Coarse enough to classify in five minutes. Granular enough that the mode determines the fix.
H1 2026 incident data — synthesized from Clyro's 591-incident dataset, the AgentRx benchmark (115 annotated trajectories), Zylos Research, and Latitude.so production analysis — identifies seven operationally distinct failure modes. The taxonomy is intentionally practitioner-scoped: coarse enough that an on-call engineer can identify the mode within five minutes of opening the trace, granular enough that the mode determines the remediation target.
AgentRx's research taxonomy runs to nine categories.[6] For production triage, seven is the right level of granularity: AgentRx's Plan Adherence Failure and Intent–Plan Misalignment categories collapse into Plan Drift because the remediation layer is the same (behavioral spec and eval gate at deploy). The distinction matters for research; it doesn't change the on-call action.
| Failure Mode | OTel Trace Signature | Fix Target | Incident Share |
|---|---|---|---|
| Tool Invocation Failure | ERROR spans on tool calls; repeated args with similarity > 0.8; HTTP 200 with empty body | Schema validation at startup; circuit breaker | Top category (AgentRx[6]) |
| Context Decay | gen_ai.response.finish_reasons = max_tokens recurring; input tokens monotonically rising per step | Context compaction architecture; session isolation | 31.6% (Clyro 2026[3]) |
| Runaway Execution | Step count > 3× P95 for agent type; monotonic cost per step; circular tool args | Hard step ceiling; loop detection; cost circuit breaker | 5.1% — highest cost per incident (Clyro 2026[3]) |
| Plan Drift | Required tool categories absent from span tree; no completion span; categories shift mid-workflow | Behavioral spec; LLM-as-judge eval gate at deploy | Moderate (AgentRx categories 1 + 5[6]) |
| Scope Overreach | Out-of-scope tool calls; destructive endpoint access (DELETE/DROP/PATCH); 403 responses | Permission surface enforcement at execution layer | 30.3% (Clyro 2026[3]) |
| Silent Semantic Failure | All spans succeed; normal token counts; evaluator flags quality degradation | Online LLM-as-judge evaluation; quality baseline alerting | 24.9% (Clyro 2026[3]) |
| Cascade Propagation | Orphaned spans at handoffs; downstream agents consuming high tokens on simple inputs | Semantic validation at every agent handoff | Emergent in multi-agent systems[8] |
A single bad call at step N corrupts every subsequent step. The trace shows success. The data is wrong.
Tool invocation failure is the most common failure mode at the individual step level — and the most treacherous because a single malformed call at step N corrupts every subsequent step that depends on that output. The agent proceeds confidently. HTTP 200. Wrong payload.
AgentRx maps this to two of its nine categories: Invalid Invocation (tool call malformed, missing args, schema-invalid) and Misinterpretation of Tool Output (agent acted on incorrect assumptions from the response).[9] Three subtypes produce different trace patterns in practice:
Invalid invocation — the agent constructs a tool call with incorrect argument schema. The dangerous variant: the agent can't distinguish 'the API rejected my request' from 'the task is impossible,' so it retries the same malformed call hundreds of times. One documented incident: $2,000 in API charges from 847 identical retries of the same failed call.[4] Trace signature: identical or near-identical argument hashes across consecutive ERROR-status tool spans — argument similarity above 0.8 on the gen_ai.tool.call.arguments attribute across three or more consecutive calls is a reliable signal.
Silent empty response — the tool returns HTTP 200 with an empty or truncated payload. The agent can't distinguish 'legitimately no results' from 'query truncated by context window pressure.' A success span is recorded. Downstream steps process empty data. The output looks complete. It's wrong.
Schema drift — the tool's API spec changed after the agent deployed. Parameter names renamed, required fields added, authentication flows updated. The agent operates against its cached schema. Certificate expiration running silently for months is another variant.[4] Trace signature: persistent ERROR spans on a previously reliable tool starting at a specific timestamp — the diff in error rate before and after that timestamp is the schema change event.
The OpenTelemetry GenAI semantic conventions (experimental as of mid-2026) define execute_tool as a child span under invoke_agent, with gen_ai.tool.name and gen_ai.tool.call.id as the key attributes.[12] Any alerting rule on tool ERROR rate should group by gen_ai.tool.name first — a spike on a single tool name is schema drift; a spike across all tools is an execution environment problem.
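The retry-loop signature described above (near-identical arguments across three or more consecutive ERROR tool spans) can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes spans have already been flattened into dicts with hypothetical keys `tool`, `status`, and `args` (standing in for gen_ai.tool.name, span status, and gen_ai.tool.call.arguments), and it uses character-level similarity from the standard library as one possible similarity measure.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two serialized argument strings."""
    return SequenceMatcher(None, a, b).ratio()


def find_retry_loop(tool_spans, threshold=0.8, min_run=3):
    """Return the first run of >= min_run consecutive ERROR spans on the same
    tool whose arguments are near-identical, or None if no such run exists.

    Each span is a dict with hypothetical keys: "tool", "status", "args".
    """
    run = []
    for span in tool_spans:
        is_error = span["status"] == "ERROR"
        extends_run = (
            bool(run)
            and is_error
            and span["tool"] == run[-1]["tool"]
            and similarity(span["args"], run[-1]["args"]) > threshold
        )
        if extends_run:
            run.append(span)
        else:
            if len(run) >= min_run:
                return run
            # Start a new candidate run only on an ERROR span.
            run = [span] if is_error else []
    return run if len(run) >= min_run else None
```

Comparing each call only to its predecessor keeps the check cheap, but it will miss a loop that alternates between two argument sets; that case is closer to the soft-loop variant and needs state-change detection instead.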
The model isn't forgetting. The context window is full. That distinction determines the fix.
Context decay looks gradual and hits like an infrastructure incident. The agent accumulates tool results, reasoning history, and conversation turns until the context window fills. No crash. No alert. The model starts losing earlier constraints — the user's original requirements, scope boundaries established at turn 1, data accumulated at steps 2 and 3.
Chroma's 2025 research, testing 18 frontier LLMs on multi-hop reasoning tasks across 10,000–500,000 token contexts, found that all 18 models showed monotonically decreasing F1 scores as input length grew, with the steepest degradation in the 100,000–500,000 token range.[10] The 'lost-in-the-middle' effect causes 30%+ accuracy drops for content positioned away from context boundaries. In a 12-turn agent session, turn 1 constraints are structurally disadvantaged by turn 12.
A complementary finding from a 2026 study on 4,416 trials: omission constraints (things the agent should not do) decay faster than commission constraints (things it should do) as context grows. Hard limits buried in conversation history don't hold.[10]
Three OTel signals flag context decay before output quality collapses:
gen_ai.response.finish_reasons containing max_tokens recurring on a specific agent: not a one-off, but a pattern across multiple sessions. The model is hitting the ceiling.
Input tokens (gen_ai.usage.input_tokens) increasing monotonically per step without a corresponding increase in task complexity. The agent is appending tool outputs without summarization.
Memory corruption is the multi-session variant: user A's session state leaks into user B's session due to inadequate session isolation. The Clyro dataset classifies this as 8.1% of incidents separately, but the remediation is the same: session isolation enforced at the execution layer, not the prompt.[3]
Staged context compaction eliminates most context decay incidents. One materials science workflow that consumed 20 million tokens without compaction was re-implemented with memory pointers — short identifiers replacing full data — and reduced token usage by over 99%.[4] Treat context as a budget with a hard ceiling, not an append-only log.
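As a rough sketch of how the token-growth and max_tokens signals could be checked for a single session: the function below assumes each chat span has been reduced to a dict with hypothetical keys `input_tokens` and `finish_reasons`. The 15% growth threshold comes from the alert table in this article; treating two or more max_tokens steps as "recurring" within one session is an assumption made for illustration.

```python
def context_decay_signals(steps, growth_threshold=0.15, min_truncated_steps=2):
    """Summarize context-decay signals for one session.

    steps: chat spans in execution order, each a dict with hypothetical keys
    "input_tokens" (int) and "finish_reasons" (list of str).
    """
    tokens = [s["input_tokens"] for s in steps]
    pairs = list(zip(tokens, tokens[1:]))

    # Strictly increasing input tokens at every step.
    monotonic = bool(pairs) and all(b > a for a, b in pairs)

    # Mean relative growth per step, ignoring steps with a zero baseline.
    growth = [(b - a) / a for a, b in pairs if a > 0]
    mean_growth = sum(growth) / len(growth) if growth else 0.0

    truncated = sum(1 for s in steps if "max_tokens" in s["finish_reasons"])

    return {
        "monotonic_growth": monotonic,
        "mean_growth_per_step": mean_growth,
        "max_tokens_steps": truncated,
        "suspect_context_decay": (
            (monotonic and mean_growth > growth_threshold)
            or truncated >= min_truncated_steps
        ),
    }
```

The function cannot judge whether task complexity grew along with the token count, so a positive result is a prompt to inspect the session rather than a verdict.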
Each mode has a different blast radius and a different fix target. None of them look the same in the trace.
Runaway Execution is the rarest mode at 5.1% of incidents but produces the highest per-incident financial cost. An 11-day, $47,000 API cost spiral is the canonical example: billing alerts reported the loop only in aggregate, where per-session enforcement would have terminated it at the call level.[4] The 'soft loop' variant is harder to detect than exact repetition: the agent varies arguments slightly each iteration (adds a word to a search, shifts a parameter) while making no measurable progress toward the goal. State change detection between steps catches soft loops where argument similarity hashing doesn't.
Circuit breakers are the structural fix. Production implementations trip on four conditions: cost velocity (spend rate exceeding the per-session ceiling), repeated prompts (argument hash similarity above threshold across N consecutive calls), error rate (ERROR spans exceeding 30% of calls in a window), and growing context (token count increasing faster than a configurable rate per step). A lightweight progress-check node running every 3 steps — asking the model whether measurable progress toward the goal occurred — catches soft loops before cost accumulates.
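The four trip conditions can be combined into a small per-session breaker. The sketch below is illustrative only and simplifies several things: the class name and all default limits except the 30% error rate are hypothetical, cost is checked as a cumulative session ceiling rather than a spend rate, repeats are detected by exact hash match rather than similarity, and the error rate is computed over the whole session instead of a sliding window.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionCircuitBreaker:
    # Limits (hypothetical defaults; tune per agent type).
    max_cost_usd: float = 5.0
    max_repeats: int = 3
    max_error_rate: float = 0.30
    max_token_growth: float = 0.15
    min_calls_for_error_rate: int = 5
    # Running state for the current session.
    cost_usd: float = 0.0
    calls: int = 0
    errors: int = 0
    repeats: int = 0
    last_args_hash: Optional[str] = None
    last_input_tokens: Optional[int] = None

    def record(self, args_hash: str, is_error: bool,
               cost_usd: float, input_tokens: int) -> Optional[str]:
        """Record one call; return the trip reason, or None to continue."""
        self.calls += 1
        self.errors += int(is_error)
        self.cost_usd += cost_usd
        # Count consecutive calls with an identical argument hash.
        self.repeats = self.repeats + 1 if args_hash == self.last_args_hash else 1
        growth = 0.0
        if self.last_input_tokens:
            growth = (input_tokens - self.last_input_tokens) / self.last_input_tokens
        self.last_args_hash = args_hash
        self.last_input_tokens = input_tokens

        if self.cost_usd > self.max_cost_usd:
            return "cost_ceiling"
        if self.repeats >= self.max_repeats:
            return "repeated_calls"
        if (self.calls >= self.min_calls_for_error_rate
                and self.errors / self.calls > self.max_error_rate):
            return "error_rate"
        if growth > self.max_token_growth:
            return "context_growth"
        return None
```

A caller would create one breaker per session, call `record()` after every tool or LLM call, and terminate the session on any non-None return value. The returned reason is also a useful first hint for mode classification.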
Plan Drift is where the agent executes steps correctly but pursues the wrong objective. Tool calls succeed. Logic is internally coherent. The output is confidently wrong relative to the original intent. AgentRx categorizes this under Plan Adherence Failure and Intent–Plan Misalignment — the distinction is whether the agent departed from a correct plan or started with a wrong one.[9] Trace signature: required tool call categories absent from the span tree, tool categories shifting mid-workflow without a decision span to explain the shift, or a completion check span never firing. This mode requires LLM-as-judge evaluation on the reasoning content. Span-level signals are indirect. The eval gate at deploy time — running behavioral spec checks before the agent reaches production — is the right intervention point.
Scope Overreach is structurally the most dangerous mode — and the hardest to recover from. The agent takes actions outside its authorized scope: destructive database operations, out-of-scope API writes, permissions beyond what the task requires. The Clyro dataset classifies 30.3% of documented incidents here.[3] Trace signature: tool calls to endpoints outside the agent's defined permission surface, or successful calls to destructive endpoints (DELETE, DROP, PATCH) not in the task scope. Scope overreach requires permission surface enforcement at the execution layer. Prompt-level instructions don't reliably contain it across 10+ turns. Define the authorized tool endpoints and resource scopes for each agent role at initialization and check every tool span's target against it before execution, not after.
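One way to picture "check every tool span's target before execution" is a permission surface object bound to the agent role at initialization. This is a minimal sketch under stated assumptions: `PermissionSurface`, `guarded_call`, and the (operation, endpoint prefix) representation are hypothetical names chosen for illustration, and prefix matching is a simplification of real resource scoping.

```python
from dataclasses import dataclass

# Operations the article calls out as destructive.
DESTRUCTIVE_OPERATIONS = {"DELETE", "DROP", "PATCH"}


@dataclass(frozen=True)
class PermissionSurface:
    role: str
    allowed: frozenset  # set of (operation, endpoint_prefix) pairs

    def check(self, operation: str, endpoint: str):
        """Return (allowed, reason) for a proposed tool call."""
        operation = operation.upper()
        for allowed_op, prefix in self.allowed:
            if allowed_op == operation and endpoint.startswith(prefix):
                return True, "allowed"
        if operation in DESTRUCTIVE_OPERATIONS:
            return False, "destructive call outside permission surface"
        return False, "endpoint outside permission surface"


def guarded_call(surface: PermissionSurface, operation: str, endpoint: str, execute):
    """Run execute(operation, endpoint) only if the surface allows it."""
    ok, reason = surface.check(operation, endpoint)
    if not ok:
        # Blocked before execution; nothing reaches the tool.
        raise PermissionError(f"{surface.role}: {operation} {endpoint} blocked ({reason})")
    return execute(operation, endpoint)
```

Because the check runs in the execution layer, it holds regardless of what the prompt says or how many turns have elapsed, which is the property prompt-level instructions cannot guarantee.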
Silent Semantic Failure is the mode where all infrastructure signals stay green while output quality degrades. No ERROR spans. No loops. No budget overruns. Normal token counts. The trace is clean because the infrastructure is healthy. The outputs are wrong because the reasoning is wrong. In one documented case, accuracy dropped sharply over a three-month period with no infrastructure alert firing.[3] This is the only mode not classifiable from trace inspection alone. Online evaluation — an LLM-as-judge running against sampled production outputs — is the detection mechanism. Without it, you're blind to 24.9% of documented incidents.
Cascade Propagation emerges in multi-agent architectures and doesn't fit cleanly into the other six modes because the origin and the manifestation are in different agents. Agent A produces wrong output (any mode). Agent B receives that output as input, processes it without semantic validation, and its spans look clean from its own perspective. Trace signature: orphaned spans at handoff points — context propagation failure co-occurring with semantic corruption — and downstream agents consuming unusually high token counts for apparently simple inputs (the model is working hard on garbage). Temporal correlation between agent A's anomalous output and agent B's anomalous behavior is the key diagnostic signal. The VentureBeat chaos engineering analysis documents a related pattern: autonomous remediation agents triggering production cascades by acting on locally-correct decisions without full system state awareness.[8]
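The orphaned-span signal is straightforward to check once a session's spans are exported. The sketch below assumes each span is a dict with hypothetical keys `span_id`, `parent_id` (None for a root), and `name`, that spans are ordered by start time, and that a correctly propagated multi-agent session has exactly one root span.

```python
def find_orphaned_spans(spans):
    """Return spans that indicate a context propagation gap in one session.

    A span is flagged if its parent_id points at a span missing from the
    export, or if it is an additional root beyond the first.
    """
    known_ids = {s["span_id"] for s in spans}

    # Parent referenced but never exported: the link was broken or dropped.
    missing_parent = [
        s for s in spans
        if s["parent_id"] is not None and s["parent_id"] not in known_ids
    ]

    # Assumption: the earliest root is the genuine entry point; any later
    # root is an agent that started a fresh trace at a handoff.
    roots = [s for s in spans if s["parent_id"] is None]
    extra_roots = roots[1:]

    return missing_parent + extra_roots
```

Any non-empty result on a multi-agent session is worth fixing before trusting a classification, since the handoff where the corruption entered may not be visible in the tree at all.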
Not all span attributes are equally diagnostic. These are the ones that map directly to a mode.
The OpenTelemetry GenAI semantic conventions (Development status, mid-2026)[12] define a span hierarchy for agent execution: a top-level invoke_agent span with child chat spans for each LLM call and execute_tool spans for each tool invocation. The attributes that distinguish failure modes are specific and measurable — not all of them are in the spec yet, but the core ones are stable enough to build alerting rules on.
The attributes below are those confirmed in the OTel GenAI spec or derived directly from span data, not invented for this taxonomy:
| Attribute | Span Type | Failure Mode Signal | Alert Threshold |
|---|---|---|---|
| gen_ai.response.finish_reasons | chat (LLM call) | max_tokens = Context Decay | Rate > 10% of sessions for a given agent type |
| gen_ai.usage.input_tokens | chat (LLM call) | Monotonic increase per step = Context Decay | Growth rate > 15% per step without task complexity increase |
| gen_ai.tool.name (on ERROR span) | execute_tool | Persistent errors on one tool = Schema Drift subtype | Single tool ERROR rate > 30% starting at a specific timestamp |
| gen_ai.tool.call.arguments (hash) | execute_tool | Similarity > 0.8 across 3+ consecutive calls = Retry loop | Exact: any 3 consecutive identical argument hashes |
| span.status = ERROR on execute_tool | execute_tool | General tool failure; subtype from argument pattern | Tool ERROR rate > 5% sustained over 5-minute window |
| Step count vs. P95 for agent type | invoke_agent | Runaway Execution | Step count > 3× P95 baseline for that agent type |
| Span parent_id gap at handoff | invoke_agent | Cascade Propagation — context not propagated | Any orphaned root span in a multi-agent session |
AgentRx proves the method. For most teams, a simpler classifier running against the span export is enough to route the incident.
AgentRx (Microsoft Research, arXiv:2602.02475, February 2026) is the first published framework for automated failure classification from execution trajectories.[6] Its pipeline normalizes raw logs into a canonical Trajectory IR, synthesizes static and dynamic invariants from policy, tool schemas, and per-step context, checks invariants against the trajectory, and runs an LLM judge to localize the critical failure step and assign a taxonomy category. Evaluated on 115 annotated trajectories, it achieves +23.6% improvement in failure localization and +22.9% improvement in root-cause attribution over prompting baselines.[9]
AgentFixer (IBM Research, 2026) takes a complementary approach: fifteen rule-based and LLM-as-judge detection tools that surface weaknesses before incidents, rather than classifying failures after the fact.[11]
For teams not ready to run the full AgentRx pipeline, a simpler classifier built against the OTel span export catches six of seven modes with a single prompt. Silent Semantic Failure remains the exception — it requires output content evaluation, which the span tree doesn't carry.
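Before wiring in an LLM prompt, it can help to see how far plain rules get on the same evidence. The sketch below is a rule-based stand-in, not the LLM classifier this article describes: it routes a pre-computed session summary to one of the six trace-classifiable modes using the thresholds from the attribute table. The function name, the summary keys, and the rule ordering are all assumptions made for illustration.

```python
def route_by_rules(summary: dict) -> str:
    """Map a session summary to a failure mode name, or 'inconclusive'.

    Expected (hypothetical) keys, all optional:
      out_of_scope_calls, orphaned_spans, step_count, p95_steps,
      tool_error_rate, max_consecutive_similar_args,
      max_tokens_rate, input_token_growth,
      missing_required_tools, completion_span
    """
    # Checked first: the most damaging modes should never be masked.
    if summary.get("out_of_scope_calls", 0) > 0:
        return "Scope Overreach"
    if summary.get("orphaned_spans", 0) > 0:
        return "Cascade Propagation"
    # Without a P95 baseline this rule cannot fire.
    if summary.get("step_count", 0) > 3 * summary.get("p95_steps", float("inf")):
        return "Runaway Execution"
    if (summary.get("tool_error_rate", 0.0) > 0.05
            or summary.get("max_consecutive_similar_args", 0) >= 3):
        return "Tool Invocation Failure"
    if (summary.get("max_tokens_rate", 0.0) > 0.10
            or summary.get("input_token_growth", 0.0) > 0.15):
        return "Context Decay"
    if summary.get("missing_required_tools") or not summary.get("completion_span", True):
        return "Plan Drift"
    # Silent Semantic Failure is not detectable from span data alone.
    return "inconclusive"
```

A first-match rule chain returns only one mode even when several signals fire, and it produces no confidence score or evidence span IDs. Those are the gaps an LLM classifier over the same summary is meant to close.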
Alert: 'tool_call_error_rate > 5%'
Action: check tool logs → call tool provider
Fix: retry logic added — same failure recurs on schema drift
Postmortem: 'tool error' — category stays at symptom level
Next occurrence: same triage time; same wrong fix layer
Classifier: Tool Invocation Failure — Schema Drift subtype (confidence: 0.91)
Action: run schema diff against live API spec → validate at startup
Fix: schema validation blocks deploy on mismatch — never reaches production
Postmortem: 'Tool Invocation Failure — Schema Drift' — root cause addressed
Next occurrence: blocked at startup before it reaches production
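The "schema validation blocks deploy on mismatch" step in the second flow could look roughly like the following. It is a sketch under simplifying assumptions: both the agent's cached schema and the live spec are taken to be plain dicts mapping a tool name to `{"params": {name: type}, "required": [names]}`, which is a hypothetical shape rather than any particular spec format, and fetching the live spec is left out.

```python
def schema_mismatches(cached: dict, live: dict) -> list:
    """List differences that would break calls built from the cached schema."""
    problems = []
    for tool, spec in cached.items():
        if tool not in live:
            problems.append(f"{tool}: tool missing from live spec")
            continue
        live_spec = live[tool]
        for param, param_type in spec["params"].items():
            if param not in live_spec["params"]:
                problems.append(f"{tool}: parameter '{param}' no longer exists")
            elif live_spec["params"][param] != param_type:
                problems.append(f"{tool}: parameter '{param}' changed type")
        for param in live_spec["required"]:
            if param not in spec["params"]:
                problems.append(f"{tool}: new required parameter '{param}'")
    return problems


def validate_at_startup(cached: dict, live: dict) -> None:
    """Refuse to start the agent if its cached tool schema has drifted."""
    problems = schema_mismatches(cached, live)
    if problems:
        raise RuntimeError(
            "schema drift detected, refusing to start:\n" + "\n".join(problems)
        )
```

Failing hard at startup is the design choice that matters here: a renamed parameter becomes a blocked deploy with a readable diff instead of a stream of ERROR spans discovered in production.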
Each phase is independently useful. The classifier at phase 2 works before phase 4 is complete.
Phase 1. Add mode-specific alert rules before building the classifier. Three high-signal rules: step count > 3× P95 per agent type (runaway execution), gen_ai.response.finish_reasons = max_tokens rate above 10% for a given agent (context decay), and tool ERROR rate on a specific tool with argument similarity > 0.8 across consecutive calls (tool invocation failure). These rules will produce false positives; that's acceptable. The goal at phase 1 is mode visibility, not precision. You can't tune a classifier you haven't run.
Phase 2. Wire the LLM classifier to session termination events: circuit breaker trips, budget ceiling hits, max-step terminations. Every terminated session routes through classify_failure_mode() before the on-call alert fires. The engineer receives a mode classification with confidence score and evidence span IDs, not a threshold breach with no context. Don't wait for mode-specific runbooks to deploy the classifier. Seeing mode distributions in production data tells you which runbook entries to write first.
Phase 3. Write one runbook entry per mode. Each entry specifies: the three diagnostic questions specific to that mode, the remediation layer to target (schema, context architecture, execution bounds, permissions, online eval), the span attributes that confirm the classification, and the regression test that confirms the fix held. Rewrite any existing symptom-level runbook entries that reference 'check tool logs' without specifying which span attribute pattern to look for.
Phase 4. Add online evaluation: an LLM-as-judge evaluator sampling 10–20% of production outputs per agent type. Establish a quality score baseline per agent over seven days. Alert on score drops greater than 15% from the 7-day rolling baseline. Without this phase, the triage pipeline classifies six of seven modes. The missing mode accounts for 24.9% of documented incidents, and it's invisible until customers or auditors surface it.[3]
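The baseline-and-drop rule for online evaluation reduces to a short calculation. The sketch below assumes one aggregated judge score per day per agent type, on any positive scale; the function name and return shape are hypothetical. It deliberately returns no alert until a full seven-day baseline exists.

```python
from statistics import mean


def quality_alert(daily_scores, today_score, window=7, max_drop=0.15):
    """Return alert details if today's score fell more than max_drop
    (relative) below the rolling baseline, else None.

    daily_scores: previous daily mean judge scores, oldest first.
    """
    if len(daily_scores) < window:
        return None  # baseline not yet established; any threshold would be arbitrary
    baseline = mean(daily_scores[-window:])
    if baseline <= 0:
        return None  # relative drop is undefined for a non-positive baseline
    drop = (baseline - today_score) / baseline
    if drop > max_drop:
        return {"baseline": baseline, "score": today_score, "drop": drop}
    return None
```

With a baseline of 0.90, a day scoring 0.70 is a relative drop of about 22% and would alert, while a day scoring 0.85 (about 6%) would not.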
Seven modes cover the space for most production systems. Here's what breaks that assumption.
This taxonomy works for the common case: single-tenant or multi-tenant agentic systems where tools have defined schemas, execution is bounded, and outputs have a quality standard that can be evaluated. It works well enough to replace symptom-level on-call runbooks in the first week.
Three situations where it needs extension:
Heavily domain-specific tool failure modes. In financial services or healthcare workflows, what looks like 'Plan Drift' at the span level may actually be regulatory non-compliance, a distinct failure category with different remediation and a very different incident severity. Add a subtype under Plan Drift for your domain; don't invent a new top-level mode. Taxonomy proliferation kills frequency analysis.
Untooled or black-box agents. If your agent doesn't emit execute_tool spans — because it's a legacy system or a vendor black-box — the classifier is operating on incomplete evidence. Instrument tool boundaries first. Until you can distinguish tool failure from reasoning failure at the span level, all classification results should carry reduced confidence.
True multi-agent orchestrator patterns. When orchestrator and subagents run in different processes or services, context propagation requires explicit trace header passing. Cascade Propagation is both a failure mode and a diagnostic tool for propagation gaps — if you're seeing frequent 'inconclusive' results from the classifier on multi-agent sessions, fix context propagation first. Most inconclusive results in practice trace to orphaned spans that make the failure mechanism invisible.[7]
What do I do when the classifier returns 'inconclusive'?
Inconclusive is a valid output, not a failure. AgentRx's taxonomy includes it explicitly — it signals insufficient trace evidence for automated classification.[6] The on-call action: pull the full session trace, check for context propagation gaps (orphaned root spans are the most common cause), and if the session produced output, run manual output evaluation. Most inconclusive results in practice trace to context propagation failures that make the failure mechanism invisible in the span tree. Fix propagation first; the classification ambiguity usually resolves on the next incident.
Is AgentRx production-ready as a classifier?
AgentRx (microsoft/AgentRx on GitHub, published February 2026) is a research framework — not a production observability platform.[9] Run it offline against exported traces to classify failure modes, not inline in your alert path. The classifier in this article is the production-appropriate version: a single LLM prompt against the OTel span summary, wired to session termination events. AgentRx's value is the grounded taxonomy and the invariant synthesis methodology — both inform this taxonomy. You don't need to run the full research codebase to use it.
My agents are multi-tenant. How do I detect scope overreach without reading other tenants' data?
Scope overreach detection at the trace level doesn't require inspecting content across tenants — it requires a permission surface map. Define the authorized tool endpoints and resource scopes for each agent role, then alert on any tool call outside that surface. The trace records the endpoint called and the response code — a 403 on a destructive endpoint or a call to a resource outside the agent's defined scope is detectable without cross-tenant data access. Bind the permission surface to the session context at initialization and check every tool span's target against it before execution.
Should I build a custom taxonomy or use this one?
Start with this one. The cost of a bespoke taxonomy is postmortems that don't translate across incidents — you lose the frequency distribution that tells you which layer has the structural gap. Tag incidents for 90 days, then add subcategories only for modes that show high frequency and variation in root cause. The AgentRx nine-category taxonomy goes more granular if a single mode dominates your system and needs finer classification.[6] The Clyro five-mode taxonomy is simpler for teams with less trace instrumentation. None of these were derived from a single domain — AgentRx covers structured API workflows, incident management, and open-ended web/file tasks.
What's the right sampling rate for online evaluation (Silent Semantic Failure detection)?
Start at 10–20% of production outputs per agent type. The right rate depends on your output volume, the cost of the evaluator model, and how fast you need to detect quality drops. For high-stakes agents (financial, medical, legal), sample 100% and accept the cost — the exposure from undetected quality degradation exceeds the evaluation cost. For lower-stakes agents, 10% is sufficient to detect a 15%+ quality drop within a reasonable window. Key: establish a seven-day baseline first, then alert on deviations. Without the baseline, the alert threshold is arbitrary.
How does context rot (Chroma 2025) relate to context decay in this taxonomy?
Context rot is the research term; context decay is the operational consequence. Chroma's 2025 finding — all 18 tested frontier models show monotonically decreasing performance as context grows, with steepest degradation in the 100K–500K token range — explains why context decay happens.[10] The lost-in-the-middle effect means turn 1 constraints are structurally disadvantaged by turn 12 regardless of model capability. Operationally, you can't fix context rot in the model — you manage it through context compaction, session isolation, and system-prompt pinning of hard constraints.
The taxonomy doesn't prevent failures. It prevents the same failure from being misclassified, targeting the wrong layer, and recurring in a form that fires a different alert.
When your on-call engineer identifies the failure mode in five minutes and pulls a runbook entry that specifies schema vs. infra vs. permissions vs. eval, debugging shifts from search to confirmation. Three months of mode-tagged postmortems show which layer has the structural gap. Six months show whether your fixes are actually closing it.
The failure library is finite. The on-call expeditions don't have to be.