89% of teams running agentic systems have observability tooling installed. 62% can inspect what their agents do at each individual step.[1] That 27-point gap (teams that monitor but cannot trace) is not a tooling problem. It is a classification problem. The trace exists. The failure mode does not.
When your on-call runbook says 'tool_call_error_rate > 5%: check tool logs', it describes a symptom. It does not name a cause. Tool invocation failure means something different from context decay, which means something different from plan drift, which means something different from scope overreach. Each mode has a different trace signature. Each implicates a different remediation layer. Treating them as variants of 'the agent did something wrong' is why the same class of failure keeps showing up under different surface presentations.
AgentEval, a DAG-structured evaluation framework piloted on 12,847 traces across 18 engineers, measured median root-cause identification time at 4.2 hours without taxonomy-driven triage, and 22 minutes with it.[2] The taxonomy did not prevent failures. It prevented the same failure from costing 4 hours the second time.
Key Takeaways
- 88% of agentic incidents trace to infrastructure and governance gaps, not model quality: the model usually worked; the governance layer did not
- Seven modes cover the classification space: Tool Invocation Failure, Context Decay, Runaway Execution, Plan Drift, Scope Overreach, Silent Semantic Failure, Cascade Propagation
- Six of seven modes are classifiable from OTel span signals alone; Silent Semantic Failure is the exception and requires online output evaluation
- AgentRx (Microsoft Research, arXiv:2602.02475, Feb 2026) automates mode classification from trajectory invariant synthesis and LLM judging
- Failure mode frequency tracked over 90 days reveals which structural layer has the recurring gap
Why Trace Volume Doesn't Buy You Incident Clarity
Distributed tracing solves context propagation. Failure classification is a different problem — and the one that determines where you fix.
A clean span tree tells you that every step executed. It does not tell you whether the right objective was pursued, whether the tool schema was current, whether the context window was intact, or whether the agent operated within its authorized scope.
The failure mode matters because it determines the remediation layer. Tool invocation failure traces to schema validation at startup, not prompt tuning. Context decay traces to context management architecture, not model capability. Scope overreach traces to permission design, not model alignment. When on-call engineers treat every agent failure as a single category — 'agent behavior issue' — the fix targets the most visible layer, which is usually the least causal one.
The Clyro analysis of 591 documented agent incidents from 2023 through early 2026 found that 88% traced to infrastructure gaps: missing permission checks, no execution bounds, no context validation, no quality monitoring. The model worked correctly in the majority of classified failures — it operated on bad inputs inside an inadequate governance structure. Model-focused remediation addressed roughly 12% of underlying causes.[3]
The OWASP Agentic Security Initiative independently validates this finding. Their framework maps the major risk categories — tool misuse, memory poisoning, cascading failures, identity abuse — to the same structural layer: the runtime governance envelope around the model, not the model itself.
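- 88% of incidents traced to infrastructure gaps: 591-incident analysis (2023–2026), Clyro, April 2026[3]
- 22 minutes median root-cause identification, down from 4.2 hours without mode classification: AgentEval pilot, 12,847 traces[2]
- 89% of teams have observability tooling installed; 62% can inspect each step: survey of 1,300+ AI professionals, 2026[1]
- Seven failure modes: synthesized from AgentRx, Clyro 591-incident dataset, Zylos Research, and Latitude production analysis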
The Seven Failure Modes: What the Trace Shows
Coarse enough to classify in five minutes. Granular enough that the mode determines the fix.
H1 2026 incident data — synthesized from Clyro's 591-incident dataset, the AgentRx benchmark (115 annotated trajectories), Zylos Research, and Latitude.so production analysis — identifies seven operationally distinct failure modes. The taxonomy is intentionally practitioner-scoped: coarse enough that an on-call engineer can identify the mode within five minutes of opening the trace, granular enough that the mode determines the remediation target.
| Failure Mode | OTel Trace Signature | Fix Target | Incident Share |
|---|---|---|---|
| Tool Invocation Failure | ERROR spans on tool calls; repeated args; HTTP 200 with empty body | Schema validation; circuit breaker | High (top category — AgentRx[6]) |
| Context Decay | max_tokens finish reason recurring; rising input tokens per step | Context architecture; session isolation | 31.6% (Clyro 2026[3]) |
| Runaway Execution | Step count > 3× P95; monotonic cost per step; circular tool args | Execution bounds; loop detection | 5.1% — highest cost per incident (Clyro 2026[3]) |
| Plan Drift | Tool categories shift mid-workflow; completion span absent | Behavioral spec; eval gate at deploy | Moderate (AgentRx category 5[6]) |
| Scope Overreach | Out-of-scope tool calls; destructive endpoint access; 403s | Permission design; least privilege | 30.3% (Clyro 2026[3]) |
| Silent Semantic Failure | All spans succeed; evaluator flags quality degradation | Online evaluation; quality monitoring | 24.9% (Clyro 2026[3]) |
| Cascade Propagation | Orphaned spans at handoffs; downstream agents on corrupted input | Semantic validation at handoff | Emergent in multi-agent systems[8] |
Tool Invocation Failure: The Most Misdiagnosed Mode
A single bad call at step N corrupts every subsequent step. The trace shows HTTP 200. The data is wrong.
Tool invocation failure is the most common failure mode at the individual step level — and the most treacherous because a single malformed call at step N corrupts every subsequent step that depends on that output. The agent proceeds confidently. HTTP 200. Wrong payload.
Three subtypes produce different trace patterns:
Invalid invocation — the agent constructs a tool call with incorrect argument schema. The specific failure variant that drives runaway cost: the agent cannot distinguish 'the API rejected my request' from 'the task is impossible,' so it retries the same malformed call hundreds of times. One documented incident: $2,000 in API charges from 847 identical retries of the same failed call.[4] Trace signature: identical or near-identical argument hashes across consecutive ERROR-status tool spans.
Silent empty response — the tool returns HTTP 200 with an empty or truncated payload. The agent cannot distinguish 'legitimately no results' from 'query truncated by context window pressure.' A success span is recorded. Downstream steps process empty data. The output looks complete. It is wrong.
Schema drift — the tool's API specification changed after the agent deployed. Parameter names renamed, required fields added, authentication flows updated. The agent operates against its cached schema. Calls fail with validation errors the agent misinterprets as domain failures. Certificate expiration running silently for months is another variant of this pattern.[4] Trace signature: persistent ERROR spans on a previously reliable tool, starting at a specific timestamp.
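The invalid-invocation signature (identical argument hashes across consecutive ERROR-status tool spans) can be checked mechanically. The sketch below is illustrative only: it assumes each tool span has already been flattened into a dict with hypothetical `name`, `status`, and `args` keys, which is not a format the article or OpenTelemetry prescribes.

```python
import hashlib
import json


def arg_hash(args: dict) -> str:
    """Stable hash of tool arguments; key order does not affect the result."""
    canonical = json.dumps(args, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def longest_identical_error_run(tool_spans: list[dict]) -> int:
    """Longest run of consecutive ERROR spans that call the same tool
    with identical arguments. Any non-ERROR span resets the run."""
    longest = current = 0
    previous_key = None
    for span in tool_spans:
        if span["status"] != "ERROR":
            current, previous_key = 0, None
            continue
        key = (span["name"], arg_hash(span["args"]))
        current = current + 1 if key == previous_key else 1
        previous_key = key
        longest = max(longest, current)
    return longest


# Hypothetical flattened spans: three identical failing calls, then a success.
spans = [
    {"name": "search", "status": "ERROR", "args": {"q": "a", "limit": 5}},
    {"name": "search", "status": "ERROR", "args": {"limit": 5, "q": "a"}},
    {"name": "search", "status": "ERROR", "args": {"q": "a", "limit": 5}},
    {"name": "search", "status": "OK", "args": {"q": "b", "limit": 5}},
]
retry_run = longest_identical_error_run(spans)
```

A run length above a small threshold (the right value depends on your retry policy) is a reasonable trigger for a circuit breaker, since exact repetition of a rejected call cannot succeed.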
Schema validation at startup — not at call time — is the structural fix. Validate every tool schema against the live API spec before the agent begins. A mismatch blocks the deploy. It does not manifest as a cryptic production failure at 3am.
startup_schema_validation.py

```python
# Schema drift detection at startup, not at call time.
# Blocks the deploy on mismatch rather than producing cryptic runtime failures.
from dataclasses import dataclass


@dataclass
class ToolSchemaMismatch(Exception):
    tool_name: str
    changed_fields: list[str]


def validate_tool_schemas_on_startup(tools: list) -> None:
    """
    Call before the agent loop starts. Raises ToolSchemaMismatch if any
    tool's cached schema does not match the live API spec.
    """
    for tool in tools:
        # fetch_live_api_spec and compare_schemas are application-specific
        # helpers you supply; they are not defined in this snippet.
        live_spec = fetch_live_api_spec(tool.endpoint)
        diff = compare_schemas(tool.cached_schema, live_spec)
        if diff.has_breaking_changes:
            raise ToolSchemaMismatch(
                tool_name=tool.name,
                changed_fields=diff.changed_fields,
            )


# In your agent startup:
# validate_tool_schemas_on_startup(agent.tools)  # raises before any task runs
# agent.run(task)
```

Context Decay: The Failure That Compounds Over Turns
The model is not forgetting. The context window is full. The distinction matters for the fix.
Context decay looks gradual and hits like an infrastructure incident. The agent accumulates tool results, reasoning history, and conversation turns until the context window fills. No crash. No alert. The model starts losing earlier constraints — the user's original requirements, the scope boundaries established at turn 1, the data accumulated at steps 2 and 3.
Commercial LLM agent analysis shows context retention accuracy drops 15–30% in sessions exceeding 10 turns.[5] The output at turn 12 looks reasonable in isolation. It is wrong relative to constraints from turn 1 that are no longer in the window.
Three OTel signals flag context decay before output quality collapses:
- gen_ai.response.finish_reason = 'max_tokens' recurring on a specific agent: not a one-off, a pattern. The model is hitting the ceiling.
- Input token count increasing monotonically per step without a corresponding increase in task complexity. The agent is appending tool outputs without summarization.
- Context window utilization exceeding 70%. Alerts at this threshold surface the problem before degradation, not after.
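The three signals above can be evaluated together per session. This is a minimal sketch under stated assumptions: each step is a dict with hypothetical `input_tokens` and `finish_reason` keys, and the "recurring" cutoff of two hits is an illustrative default, not a figure from the article. The attribute name follows the article's spelling; the OpenTelemetry GenAI conventions appear to expose it as the array attribute `gen_ai.response.finish_reasons`, and the value for a token-limit stop varies by provider, so check what your instrumentation actually emits.

```python
def context_decay_signals(
    steps: list[dict],
    context_limit: int,
    utilization_threshold: float = 0.70,
    min_max_token_hits: int = 2,
) -> dict:
    """Evaluate the three context decay signals for one session."""
    input_tokens = [s["input_tokens"] for s in steps]
    max_token_hits = sum(1 for s in steps if s["finish_reason"] == "max_tokens")
    # Strictly increasing input size across at least three steps.
    rising = len(input_tokens) >= 3 and all(
        later > earlier for earlier, later in zip(input_tokens, input_tokens[1:])
    )
    peak_utilization = max(input_tokens, default=0) / context_limit
    return {
        "recurring_max_tokens": max_token_hits >= min_max_token_hits,
        "monotonic_input_growth": rising,
        "high_utilization": peak_utilization > utilization_threshold,
    }


# Hypothetical session against a 128k-token window.
decaying = [
    {"input_tokens": 20_000, "finish_reason": "stop"},
    {"input_tokens": 45_000, "finish_reason": "stop"},
    {"input_tokens": 80_000, "finish_reason": "max_tokens"},
    {"input_tokens": 98_000, "finish_reason": "max_tokens"},
]
signals = context_decay_signals(decaying, context_limit=128_000)
```

Returning the three flags separately, instead of a single boolean, keeps the evidence visible to whoever triages the alert.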
Memory corruption is the multi-session variant: user A's session state leaks into user B's session due to inadequate session isolation. The Clyro dataset classifies this as 8.1% of incidents separately, but the remediation is the same: session isolation enforced at the execution layer, not the prompt.[3]
Staged context compaction eliminates most context decay incidents. A materials science workflow that consumed 20 million tokens without compaction was re-implemented with memory pointers — short identifiers replacing full data — and reduced token usage by over 99%.[4] Treat context as a budget, not an append-only log.
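The cited workflow's implementation is not described, so the following is only one way the memory-pointer idea might look: large tool outputs are stored outside the context and replaced by a short identifier that the agent can resolve on demand. `PointerStore`, the `mem://` prefix, and the inline limit are all hypothetical names and values chosen for illustration.

```python
import hashlib


class PointerStore:
    """Keeps large tool outputs out of the context window.

    Small outputs are returned unchanged; large ones are stored and
    replaced by a short pointer string."""

    def __init__(self, inline_limit: int = 200) -> None:
        self.inline_limit = inline_limit
        self._blobs: dict[str, str] = {}

    def store(self, tool_output: str) -> str:
        """Return the text to append to the context for this output."""
        if len(tool_output) <= self.inline_limit:
            return tool_output
        digest = hashlib.sha256(tool_output.encode("utf-8")).hexdigest()[:12]
        pointer = f"mem://{digest}"
        self._blobs[pointer] = tool_output
        return pointer

    def resolve(self, pointer: str) -> str:
        """Fetch the full output when a later step actually needs it."""
        return self._blobs[pointer]


store = PointerStore(inline_limit=50)
big_result = "row," * 10_000           # 40,000 characters of tool output
context_entry = store.store(big_result)  # short pointer goes into the context
small_entry = store.store("ok")          # small outputs stay inline
```

In a real agent the resolve step would typically be exposed as a tool, so the model decides when the full payload is worth its token cost.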
The Remaining Five Failure Modes
Each failure mode has a different blast radius and a different fix target. None of them look the same in the trace.
Runaway Execution is the rarest mode at 5.1% of incidents but produces the highest per-incident financial cost. An 11-day, $47,000 API cost spiral is the canonical example: billing alerts reported the spend only in aggregate, while per-session enforcement would have terminated the loop at the call level.[4] The 'soft loop' variant is harder to detect than exact repetition: the agent varies arguments slightly each iteration (adds a word to a search, shifts a parameter) while making no measurable progress toward the goal. State change detection between steps catches soft loops where argument similarity hashing does not. Trace signature: step count exceeding 3× P95 for the agent type, and per-step token cost increasing monotonically (context window inflation from accumulated iterations).
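State change detection can be sketched as a fingerprint comparison between consecutive steps, combined with the 3× P95 step bound. The sketch assumes you can snapshot some task-progress state after each step (here a hypothetical dict such as `{"docs_found": 2}`); what counts as "state" and the stall limit of three steps are assumptions, not values from the article.

```python
import hashlib
import json
from typing import Optional


def state_fingerprint(state: dict) -> str:
    """Hash of the task-progress state, independent of key order."""
    canonical = json.dumps(state, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def termination_reason(
    step_states: list[dict], p95_steps: int, max_stalled_steps: int = 3
) -> Optional[str]:
    """Return why the session should be stopped, or None to continue."""
    if len(step_states) > 3 * p95_steps:
        return "step_budget_exceeded"
    stalled = 0
    for previous, current in zip(step_states, step_states[1:]):
        if state_fingerprint(previous) == state_fingerprint(current):
            stalled += 1
        else:
            stalled = 0
        if stalled >= max_stalled_steps:
            return "no_state_change"
    return None


# The agent keeps varying its search arguments, but progress stays flat.
soft_loop = [{"docs_found": 0}] + [{"docs_found": 2}] * 4
reason = termination_reason(soft_loop, p95_steps=10)
```

Because the check looks at outcomes and not arguments, it still fires when every call in the loop is textually different.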
Plan Drift is where the agent executes steps correctly but pursues the wrong objective. Tool calls succeed. Logic is internally coherent. The output is confidently wrong relative to the original intent. Trace signature: required tool call categories absent from the span tree — an agent that should retrieve before writing, going directly to write — tool call categories shifting mid-workflow without a decision span to explain the shift, or a completion check span never firing. This mode requires LLM-as-judge evaluation on the reasoning content. Span-level signals are indirect.
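The structural part of this signature (missing or misordered tool categories, no completion check) can be screened cheaply before an LLM judge is invoked. The sketch below assumes tool calls have already been mapped to hypothetical category labels such as `retrieve` and `write`; it only produces indirect flags for routing and does not replace evaluation of the reasoning content.

```python
def plan_drift_flags(
    tool_categories: list[str], required_order: list[str], completion_seen: bool
) -> list[str]:
    """Indirect plan drift signals from the ordered tool categories of a session."""
    flags = []
    missing = [c for c in required_order if c not in tool_categories]
    if missing:
        flags.append("missing_categories:" + ",".join(missing))
    # Position of the first call in each required category that did occur.
    first_seen = [tool_categories.index(c) for c in required_order if c in tool_categories]
    if first_seen != sorted(first_seen):
        flags.append("out_of_order")
    if not completion_seen:
        flags.append("no_completion_check")
    return flags


# An agent that should retrieve before writing, going directly to write.
flags = plan_drift_flags(
    ["write", "notify"], required_order=["retrieve", "write"], completion_seen=False
)
```

An empty list means only that the structure looks plausible; a session can still pursue the wrong objective with every expected category present.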
Scope Overreach is structurally the most dangerous mode — and the hardest to recover from. The agent takes actions outside its authorized scope: destructive database operations, out-of-scope API writes, permissions beyond what the task requires. The Clyro dataset classifies 30.3% of documented incidents here, including a McKinsey breach (46.5M chat messages exposed) and a $3.2M AI procurement fraud incident.[3] Trace signature: tool calls to endpoints outside the agent's defined permission surface, or successful calls to destructive endpoints (DELETE, DROP, PATCH) that were not in the task scope. Scope overreach requires permission surface enforcement at the execution layer. Prompt-level instructions do not reliably contain it.
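Enforcement at the execution layer means the check runs in code that wraps every tool call, so a prompt cannot talk its way past it. A minimal sketch, assuming HTTP-style tools and a hypothetical allow-list of (method, path prefix) pairs; `PermissionSurface`, `ScopeViolation`, and `guarded_call` are illustrative names, and plain prefix matching is deliberately naive (a real guard would normalize paths first).

```python
from dataclasses import dataclass
from typing import Callable


class ScopeViolation(PermissionError):
    """Raised when a tool call falls outside the agent's permission surface."""


@dataclass(frozen=True)
class PermissionSurface:
    allowed: frozenset  # of (HTTP method, path prefix) pairs

    def permits(self, method: str, path: str) -> bool:
        return any(
            method.upper() == allowed_method and path.startswith(prefix)
            for allowed_method, prefix in self.allowed
        )


def guarded_call(
    surface: PermissionSurface, method: str, path: str, execute: Callable[[str, str], str]
) -> str:
    """Run the tool call only if the surface permits it; otherwise refuse."""
    if not surface.permits(method, path):
        raise ScopeViolation(f"{method.upper()} {path} is outside the permission surface")
    return execute(method, path)


def fake_execute(method: str, path: str) -> str:
    # Stand-in for the real tool transport.
    return f"executed {method.upper()} {path}"


surface = PermissionSurface(frozenset({("GET", "/orders/"), ("POST", "/orders/refunds")}))
allowed_result = guarded_call(surface, "get", "/orders/42", fake_execute)
```

Anything not listed is denied by default, which is the least-privilege posture the fix target in the table calls for.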
Silent Semantic Failure is the mode where all infrastructure signals stay green while output quality degrades. No ERROR spans. No loops. No budget overruns. Normal token counts. The trace is clean. The outputs are wrong. In one documented case, math accuracy reportedly dropped from 97.6% to 2.4% over three months with no infrastructure alert firing.[3] This is the only mode that cannot be classified from trace inspection alone. Online evaluation — an LLM-as-judge running against sampled production outputs — is the detection mechanism. A team without online evaluation is flying blind on this mode regardless of how many spans they collect.
Cascade Propagation emerges in multi-agent architectures and does not fit cleanly into the other six modes because the origin and the manifestation are in different agents. Agent A produces wrong output (any mode). Agent B receives that output as input, processes it without semantic validation, and its spans look clean from its own perspective. Trace signature: orphaned spans at handoff points (context propagation failure co-occurring with semantic corruption), downstream agents consuming unusually high token counts for apparently simple inputs — the model is working hard on garbage — and temporal correlation between agent A's anomalous output and agent B's anomalous behavior. The VentureBeat chaos engineering analysis documents a related pattern: autonomous remediation agents triggering production cascades by acting on locally-correct decisions without full system state awareness.[8]
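Orphaned spans are the most mechanical of these signals: a span names a parent that never appears in the exported trace. The sketch below assumes spans exported as dicts carrying `span_id` and `parent_span_id`; the dict layout is an assumption about your export format.

```python
def find_orphaned_spans(spans: list[dict]) -> list[str]:
    """IDs of spans whose parent is missing from the exported trace.

    Root spans (no parent_span_id) are not orphans."""
    known_ids = {s["span_id"] for s in spans}
    return [
        s["span_id"]
        for s in spans
        if s.get("parent_span_id") and s["parent_span_id"] not in known_ids
    ]


# Agent B's span points at a parent that was lost at the handoff.
trace = [
    {"span_id": "a1", "parent_span_id": None, "name": "agent_a.run"},
    {"span_id": "a2", "parent_span_id": "a1", "name": "agent_a.tool"},
    {"span_id": "b1", "parent_span_id": "zz", "name": "agent_b.run"},
]
orphans = find_orphaned_spans(trace)
```

An orphan on its own only proves a propagation gap; it becomes cascade evidence when it coincides with anomalous behavior in the downstream agent.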
Automating Classification: The LLM-as-Judge Approach
AgentRx shows the full pipeline. For most teams, a simpler classifier running against the span export is enough to route the incident.
AgentRx (Microsoft Research, arXiv:2602.02475, February 2026) is the first published framework for automated failure classification from execution trajectories.[6] Its pipeline normalizes raw logs into a canonical Trajectory IR, synthesizes static and dynamic invariants from policy, tool schemas, and per-step context, checks invariants against the trajectory, and runs an LLM judge to localize the critical failure step and assign a taxonomy category. The framework ships with a 10-category taxonomy derived from 115 annotated failed trajectories across structured API workflows, incident management, and web/file tasks.
For teams not ready to run the full AgentRx pipeline, a simpler classifier built against the OTel span export catches six of seven modes with a single prompt. Silent Semantic Failure remains the exception — it requires output content evaluation, which the span tree does not carry.
failure_mode_classifier.py

```python
# Minimal failure mode classifier; runs against an OTel span export.
# Classifies 6 of 7 modes from span signals; Silent Semantic requires eval.
import json

FAILURE_MODE_PROMPT = """
You are classifying a failed agent execution trace. Based on the span evidence,
identify the primary failure mode from this taxonomy:

TAXONOMY:
1. TOOL_INVOCATION_FAILURE: ERROR spans on tool calls; repeated args; silent empty body
2. CONTEXT_DECAY: max_tokens finish reasons; rising input tokens per step
3. RUNAWAY_EXECUTION: step count > 3x baseline; monotonic cost; circular tool args
4. PLAN_DRIFT: tool categories shift mid-workflow; no completion span
5. SCOPE_OVERREACH: out-of-scope calls; destructive endpoint access; 403s
6. SILENT_SEMANTIC_FAILURE: all spans succeed; needs output eval to confirm
7. CASCADE_PROPAGATION: orphaned spans at handoffs; downstream corruption pattern

SPAN EVIDENCE:
{span_summary}

Output JSON only:
{{"mode": "<mode>", "confidence": 0.0, "evidence": "<key span IDs>",
"fix_target": "<schema|context|execution|permissions|eval|handoff>"}}
"""


def classify_failure_mode(spans: list[dict]) -> dict:
    """
    Pass a list of OTel span dicts from a failed session.
    Returns mode, confidence, evidence span IDs, and fix_target.
    """
    # build_span_summary, llm, and RUNBOOKS are placeholders for your own
    # summary helper, LLM client, and mode-to-runbook mapping.
    span_summary = build_span_summary(spans)
    response = llm.complete(FAILURE_MODE_PROMPT.format(span_summary=span_summary))
    result = json.loads(response.content)
    return {**result, "runbook_url": RUNBOOKS.get(result["mode"])}


# Wire this to your circuit breaker and budget ceiling terminations:
# def on_session_terminated(session_id, reason, spans):
#     if reason in ("max_steps", "budget_ceiling", "circuit_breaker"):
#         classification = classify_failure_mode(spans)
#         page_oncall(session_id, classification)
```

Symptom-level triage:
- Alert: 'tool_call_error_rate > 5%'
- Action: check tool logs → call tool provider
- Fix: retry logic added; same failure recurs on schema drift
- Postmortem: 'tool error'; category stays at symptom level
- Next occurrence: same triage time; same wrong fix layer

Mode-level triage:
- Classifier identifies: Tool Invocation Failure, Schema Drift subtype
- Action: run schema diff → validate on startup → add drift alert
- Fix: schema validation blocks deploy on mismatch
- Postmortem: 'Tool Invocation Failure, Schema Drift'; root cause addressed
- Next occurrence: blocked at startup before it reaches production
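The classifier calls `build_span_summary` without defining it. One possible shape for that helper is sketched here, purely as an assumption: it condenses the export into a few lines a prompt can carry, using hypothetical flattened span keys (`span_id`, `name`, `status`, `finish_reason`). What belongs in the summary depends on which trace signatures you want the judge to see.

```python
from collections import Counter


def build_span_summary(spans: list[dict], max_error_lines: int = 10) -> str:
    """Condense a span export into a short, prompt-sized text summary."""
    status_counts = Counter(s.get("status", "UNSET") for s in spans)
    finish_reasons = Counter(s["finish_reason"] for s in spans if s.get("finish_reason"))
    lines = [
        f"total_spans={len(spans)}",
        "status_counts=" + ", ".join(f"{k}:{v}" for k, v in sorted(status_counts.items())),
        "finish_reasons=" + ", ".join(f"{k}:{v}" for k, v in sorted(finish_reasons.items())),
    ]
    # List the first few failing spans so the judge can cite span IDs as evidence.
    errors = [s for s in spans if s.get("status") == "ERROR"][:max_error_lines]
    lines += [f"ERROR span_id={s['span_id']} name={s['name']}" for s in errors]
    return "\n".join(lines)


example_spans = [
    {"span_id": "s1", "name": "llm.chat", "status": "OK", "finish_reason": "max_tokens"},
    {"span_id": "s2", "name": "tool.search", "status": "ERROR"},
    {"span_id": "s3", "name": "tool.search", "status": "OK"},
]
summary = build_span_summary(example_spans)
```

Capping the number of error lines keeps the prompt bounded for runaway sessions, where the raw export can run to thousands of spans.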
Four Phases to a Working Triage Pipeline
Each phase is independently useful. The classifier at phase 2 works before phase 4 is complete.
- [01] Phase 1: Mode-Level Alerting (Week 1)
  Add mode-specific alert rules before building the classifier. Three high-signal rules: step count > 3× P95 per agent type (runaway execution signal), gen_ai.response.finish_reason = max_tokens rate above threshold (context decay signal), and tool ERROR rate on a specific tool with argument similarity > 0.8 across consecutive calls (tool invocation failure signal). These rules will produce false positives. That is acceptable: the goal at phase 1 is mode visibility, not precision. You cannot tune a classifier you have not run.
- [02] Phase 2: Deploy the Failure Mode Classifier (Week 2)
  Wire the LLM classifier to session termination events (circuit breaker trips, budget ceiling hits, max-step terminations). Every terminated session routes through classify_failure_mode() before the on-call alert fires. The engineer receives a mode classification with evidence span IDs, not a threshold breach with no context. Do not wait for mode-specific runbooks to deploy the classifier. Seeing mode distributions in production data is what tells you which runbook entries to write first.
- [03] Phase 3: Mode-Specific Runbooks (Week 2–3)
  Write one runbook entry per mode. Each entry specifies: the three diagnostic questions specific to that mode, the remediation layer to target (schema, context architecture, execution bounds, permissions, online eval), and the regression test that confirms the fix held. Rewrite any existing symptom-level runbook entries that reference 'check tool logs' without specifying which span attribute pattern to look for. A runbook entry that reads identically for three different failure modes is still a symptom runbook.
- [04] Phase 4: Silent Semantic Failure Detection (Week 3–4)
  Add online evaluation: an LLM-as-judge evaluator sampling 10–20% of production outputs per agent type. Establish a quality score baseline per agent. Alert on score drops greater than 15% from the 7-day baseline. Without this phase, the triage pipeline classifies six of seven modes. The missing mode accounts for 24.9% of documented incidents, and it is invisible until customers or auditors surface it.
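The Phase 4 alert rule reduces to a sampling decision plus a comparison against a rolling baseline. The sketch below reads "drops greater than 15%" as a relative drop from the mean of the last seven daily scores; that reading, the function names, and the score scale are assumptions. The judge that produces the scores is out of scope here.

```python
import random
from statistics import mean


def should_sample(sample_rate: float, rng: random.Random) -> bool:
    """Decide whether this production output is sent to the evaluator."""
    return rng.random() < sample_rate


def quality_drop_alert(
    baseline_scores: list[float], current_score: float, max_relative_drop: float = 0.15
) -> bool:
    """True when the current score fell more than max_relative_drop
    below the mean of the baseline window."""
    baseline = mean(baseline_scores)
    if baseline <= 0:
        return False
    return (baseline - current_score) / baseline > max_relative_drop


# Seven daily mean judge scores for one agent (hypothetical values).
last_seven_days = [0.90, 0.92, 0.91, 0.89, 0.90, 0.93, 0.92]
alert_fires = quality_drop_alert(last_seven_days, current_score=0.75)
alert_quiet = quality_drop_alert(last_seven_days, current_score=0.80)
```

Passing an explicit `random.Random` instance into the sampler makes the sampling reproducible in tests, which matters when the evaluator itself is the thing under suspicion.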
Triage Infrastructure Readiness
On-call runbook entries map to failure modes, not symptom categories
Failure mode classifier runs on every session terminated by circuit breaker or budget ceiling
Schema validation runs on agent startup — every tool, every deploy
Step count alert fires at 3× P95 baseline per agent type
gen_ai.response.finish_reason = max_tokens rate tracked per agent, with a threshold alert configured
Per-session context window utilization gauge — alerts at 70% and 90% capacity
Tool call permission surface defined and enforced at the execution layer, not prompt level
Online evaluator sampling 10–20% of production outputs per agent for Silent Semantic Failure
Postmortem template requires failure mode classification before the root cause section
Failure mode frequency tracked over time — trending modes signal structural gaps in a specific layer
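Tracking mode frequency needs little more than a counter over mode-tagged postmortems. In this sketch the mode-to-layer mapping is an assumption derived from the fix targets listed in the taxonomy table, and the postmortem record shape (`mode`, `date`) is hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

# Assumed mapping from failure mode to remediation layer.
FIX_LAYER = {
    "TOOL_INVOCATION_FAILURE": "schema",
    "CONTEXT_DECAY": "context",
    "RUNAWAY_EXECUTION": "execution",
    "PLAN_DRIFT": "eval",
    "SCOPE_OVERREACH": "permissions",
    "SILENT_SEMANTIC_FAILURE": "eval",
    "CASCADE_PROPAGATION": "handoff",
}


def layer_frequency(
    postmortems: list[dict], today: date, window_days: int = 90
) -> list[tuple[str, int]]:
    """Count incidents per remediation layer inside the window, most frequent first."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(FIX_LAYER[p["mode"]] for p in postmortems if p["date"] >= cutoff)
    return counts.most_common()


postmortems = [
    {"mode": "SCOPE_OVERREACH", "date": date(2026, 6, 1)},
    {"mode": "SCOPE_OVERREACH", "date": date(2026, 5, 10)},
    {"mode": "CONTEXT_DECAY", "date": date(2026, 4, 20)},
    {"mode": "TOOL_INVOCATION_FAILURE", "date": date(2026, 1, 5)},  # outside the window
]
ranking = layer_frequency(postmortems, today=date(2026, 6, 30))
```

The layer at the top of the ranking is the candidate structural gap; comparing successive windows shows whether a fix actually moved it.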
What do I do when the classifier returns 'inconclusive'?
Inconclusive is a valid output, not a failure. AgentRx's grounded taxonomy includes it explicitly — it signals insufficient trace evidence for automated classification. The on-call action: pull the full session trace, check for context propagation gaps (orphaned root spans are the most common reason for inconclusive results), and if the session produced output, run manual output evaluation. Most inconclusive results in practice trace to context propagation failures that make the failure mechanism invisible in the span tree. Fixing propagation first usually resolves the classification ambiguity on the next incident.
Is AgentRx production-ready as a classifier?
AgentRx is a research framework (microsoft/AgentRx on GitHub, published February 2026) — not a production observability platform. Run it offline against exported traces to classify failure modes, not inline in your alert path. The classifier in this article is the production-appropriate version: a single LLM prompt against the OTel span summary, wired to session termination events. AgentRx's value is the grounded taxonomy and the invariant synthesis methodology, both of which informed the taxonomy here. You do not need to run the full research codebase to use the taxonomy.
My agents are multi-tenant. How do I detect scope overreach without reading other tenants' data?
Scope overreach detection at the trace level does not require inspecting content across tenants — it requires a permission surface map. Define the authorized tool endpoints and resource scopes for each agent role, then alert on any tool call outside that surface. The trace records the endpoint called and the response code — a 403 on a destructive endpoint or a call to a resource outside the agent's defined scope is detectable without cross-tenant data access. Bind the permission surface to the session context at initialization and check every tool span's target against it before execution.
Should I build a custom taxonomy or use this one?
Start with this taxonomy. The cost of a bespoke taxonomy is postmortems that don't translate across incidents — you lose the frequency distribution that tells you which layer has the structural gap. Tag incidents for 90 days, then add subcategories only for modes that show high frequency and variation in root cause. The AgentRx 10-category taxonomy goes more granular if a single mode in your system accounts for most incidents and needs finer classification. The Clyro 5-mode taxonomy is simpler for teams with less trace instrumentation. None of these taxonomies are domain-specific — they were derived from incident data across retail workflows, incident management, and web/file tasks.
The taxonomy does not prevent failures. It prevents the same failure from being misclassified, targeting the wrong layer, and recurring in a form that fires a different alert.
When your on-call engineer identifies the failure mode in five minutes and pulls a runbook entry that specifies schema vs. infra vs. permissions vs. eval, debugging shifts from search to confirmation. Three months of mode-tagged postmortems tells you which layer has the structural gap. Six months tells you whether your fixes are actually closing it.
The failure library is finite. The on-call expeditions do not have to be.
- [1] AI Agent Silent Failures: Why 200 OK Is the Most Dangerous Response in Production (Runcycles, March 2026) (runcycles.io)
- [2] AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows; pilot data: MTTR 4.2h to 22min on 12,847 traces (github.com)
- [3] We Analyzed 100 AI Agent Failures; 591-incident dataset, failure mode distribution (Clyro, April 2026) (clyro.dev)
- [4] SRE for AI Agents: What Actually Breaks at 3am; $47K loop incident, hallucinated API retries (tianpan.co, April 2026) (tianpan.co)
- [5] Detecting AI Agent Failure Modes in Production: A Framework for Observability-Driven Diagnosis (Latitude, March 2026) (latitude.so)
- [6] AgentRx: Diagnosing AI Agent Failures from Execution Trajectories (Barke et al., Microsoft Research, arXiv:2602.02475, February 2026) (arxiv.org)
- [7] Site Reliability Engineering for AI Agent Systems: Observability, Incident Response, and Operational Patterns (Zylos Research, March 2026) (zylos.ai)
- [8] AI agents are quietly generating chaos engineering failures enterprises don't track yet (VentureBeat, May 2026) (venturebeat.com)