Detection tells you something is wrong. The four-step diagnostic pipeline — behavioral telemetry, failure clustering, root cause attribution, eval generation — tells you what failed, why, and how to stop it from shipping again. Most teams build partial detection and stop there.
A compliance-checking agent ran for 18 days before anyone noticed. HTTP 200 on every request. Valid JSON on every response. No alert fired. The first signal was a regulatory audit finding: the agent had been citing fabricated policy references since a retrieval index update three weeks earlier.
The instrumentation layer was complete by standard measures — latency tracked, error rates monitored, token costs attributed. What it never captured: whether responses were grounded in retrieved documents, whether a fallback path had been silently invoked, or whether downstream systems could actually use the output. Eighty-eight percent of production agent failures trace to infrastructure gaps — absent monitoring, missing guardrails, inadequate trace instrumentation — not model quality.[1] The model was working exactly as designed. The diagnostic layer was never built.
This isn't a rare edge case. Research analyzing 1,600+ annotated multi-agent traces across seven popular frameworks found 14 distinct failure modes clustered into three categories: specification issues, inter-agent misalignment, and task verification gaps.[9] The researchers explicitly found that "improvements in base model capabilities will be insufficient to address the full taxonomy" — meaning better models alone won't fix what's fundamentally a systems and observability problem.
Behavioral telemetry is the instrumentation layer between infrastructure metrics and output quality. It captures execution-time signals — grounded, fallback, confidence, downstream appropriateness — that describe how an agent behaved on a specific run, not just whether it returned 200. Without it, detection tells you something is degrading. Diagnosis remains impossible.
This is the four-step pipeline that teams research, partially prototype, and rarely complete: trace collection with behavioral signals → failure clustering → root cause attribution → eval generation. Each step compounds the last. The fourth step is the one that actually prevents recurrence.
Behavioral telemetry adds four execution-time signals to every agent span: grounded (retrieval used?), fallback (primary path abandoned?), confidence (certainty of tool selection?), downstream_ok (output actionable downstream?).
Failure clustering before root cause analysis: one cluster with 40 members is worth investigating. Forty individual failure events are not.
63% of step-level agent failures propagate from an upstream step — the root cause is not where the failure manifests.[3]
Root cause identification improved from a median of 4.2 hours to 22 minutes with structured diagnostic frameworks.[3]
The eval-from-failure loop compounds: 23 manually written tests can become 147 automatically generated from production failures in 60 days.[5]
The MAST taxonomy (NeurIPS 2025) found specification issues account for 41.77% of multi-agent failures — failures that instrumentation plus structured failure review would catch before an audit does.[9]
The gap between HTTP 200 and production correctness is where agent failures live. Four signals close it without additional model calls.
The OpenTelemetry GenAI semantic conventions — now implemented by frameworks including LangChain, CrewAI, AutoGen, and the OpenAI Agents SDK — define what every LLM call emits as standard span attributes: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons.[8] These are infrastructure signals. They tell you the bytes arrived on time and under budget. They have no opinion on whether the agent's reasoning was grounded in retrieved evidence, whether it silently fell back to a degraded path, or whether the output was structured in a way downstream systems could consume.
Behavioral telemetry fills that gap. It is the set of model-free, execution-time signals that describe how the agent behaved — not just that it completed. Each signal attaches as a span attribute on the existing invoke_agent span. No new infrastructure. No added latency. Richer attributes on spans you're already emitting.
Four signals cover the most frequent classes of silent failure:
agent.behavioral.grounded — Did the agent's response draw from retrieved context, or from its training weights? For any agent that does retrieval, this is the single most important behavioral signal. A grounded rate that drops from 0.94 to 0.71 over two weeks is a retrieval degradation signal that no infrastructure metric will surface. Compute it by checking whether retrieved document IDs appear in the reasoning trace or output attribution — a set membership check, not a model call.
agent.behavioral.fallback — Did the agent invoke a fallback path (secondary tool, default response, escalation trigger) rather than completing the primary workflow? A rising fallback rate says the primary path is failing silently. The agent is handling the situation, but not as designed. This catches silent workflow degradation before output quality metrics do.
agent.behavioral.confidence — What was the agent's assessed certainty about its tool selection or decision? Some providers return logprobs for tool selection; others require a confidence estimation prompt. This signal catches the plausible-but-wrong failure class — outputs the agent produced while uncertain, which are the first candidates for downstream validation or human review.
agent.behavioral.downstream_ok — Could a downstream system actually use this output? Schema validation passes but field values are malformed. A date is in the wrong timezone. A required ID is null. This catches outputs that are structurally valid but operationally useless — the failure class that hallucination detection misses because the format is correct and the content is plausibly wrong.
A note on where to start: if you build only two of these, build grounded and downstream_ok first. Grounding failures and downstream rejection events are the highest-frequency silent failure classes, and both are computable deterministically with no inference cost.
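As a rough illustration of how cheap the two deterministic signals can be, here is a minimal sketch. The payload fields (`policy_id`, `effective_date`) and the function names are hypothetical, and the strict rule that every cited ID must come from the retrieved set is one possible reading of the set membership check, not the only one.

```python
from datetime import datetime


def grounded(retrieved_ids: set[str], cited_ids: set[str]) -> bool:
    """True when the response cites something and every cited ID was retrieved."""
    return bool(cited_ids) and cited_ids <= retrieved_ids


def downstream_ok(payload: dict) -> bool:
    """Value-level checks that plain schema validation would not catch.

    Field names here are hypothetical; substitute your own output contract.
    """
    policy_id = payload.get("policy_id")
    if not isinstance(policy_id, str) or not policy_id:
        return False  # required ID is null or empty
    try:
        effective = datetime.fromisoformat(payload.get("effective_date", ""))
    except (TypeError, ValueError):
        return False  # missing or unparseable date
    return effective.tzinfo is not None  # reject timestamps with no timezone


def behavioral_attributes(
    retrieved_ids: set[str], cited_ids: set[str], payload: dict, fallback_used: bool
) -> dict:
    """Flat attribute dict, ready to copy onto an existing span."""
    return {
        "agent.behavioral.grounded": grounded(retrieved_ids, cited_ids),
        "agent.behavioral.fallback": fallback_used,
        "agent.behavioral.downstream_ok": downstream_ok(payload),
    }
```

With the OpenTelemetry Python API, each key/value pair in the returned dict could be copied onto the active span with `span.set_attribute(key, value)`.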
| Captured by gen_ai.* spans | Not captured — requires agent.behavioral.* |
|---|---|
| gen_ai.request.model — model identifier | Whether the response was grounded in retrieved context |
| gen_ai.usage.input_tokens + gen_ai.usage.output_tokens | Whether a fallback path was invoked vs. primary workflow |
| gen_ai.response.finish_reasons (stop, tool_calls) | Agent confidence about tool selection or decision |
| Latency per LLM call and per tool invocation | Whether downstream systems can actually use the output |
| Tool call names and argument structure (when content recording enabled) | Which step in a multi-step workflow originated a failure that manifested later |
Raw failure counts are noise. Behavioral signal clusters are the unit of diagnostic work. The transition point is 20 failures per workflow type per week.
Detection gives you a list of failed runs. That list, unprocessed, is not actionable. A team investigating 40 individual failures will converge on nothing useful. A team investigating three failure clusters with 13 members each will converge on root causes in under an hour.[4]
Failure clustering groups failed runs by behavioral signature — shared failure type, common signal pattern, same execution step, similar tool call sequence — to surface the underlying issue rather than individual incidents. The operational test: when you look at the cluster and immediately recognize the common cause, the clustering is working. When each member looks different, the signal dimensions need refinement.
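A first pass at grouping by behavioral signature can be plain exact-match bucketing. The sketch below assumes each failed run is a dict with hypothetical keys (`run_id`, `failed_step`, and the three boolean signals); real clustering may need fuzzier matching on tool call sequences.

```python
from collections import defaultdict


def signature(run: dict) -> tuple:
    """Behavioral signature: the failing step plus the state of each signal."""
    return (
        run["failed_step"],
        run["grounded"],
        run["fallback"],
        run["downstream_ok"],
    )


def cluster_failures(runs: list[dict]) -> list[tuple[tuple, list[str]]]:
    """Group failed runs by signature and return the largest cluster first."""
    clusters = defaultdict(list)
    for run in runs:
        clusters[signature(run)].append(run["run_id"])
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)
```

Reading one representative trace from the largest bucket is then the unit of investigation, rather than every run in the list.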
The MAST taxonomy, presented at NeurIPS 2025, provides a useful reference structure of 14 failure modes across three primary categories: specification issues (41.77% of observed failures), inter-agent misalignment (36.94%), and task verification gaps (21.30%).[9] This isn't a framework to import wholesale; it's a starting shape. Your taxonomy will differ based on your agents' architecture and failure history. But the three-category structure (what was specified, how agents coordinated, and what was validated) maps cleanly onto the behavioral signals: grounding failures sit in verification gaps, fallback spikes sit in coordination failures, and confidence collapses often trace to specification ambiguity.
The bootstrapping challenge is real. You can't build a failure taxonomy before you've seen failures, and you can't cluster failures without a taxonomy. The approach that works:
Open coding first. Have a domain expert read 20–50 recent failed traces without a category framework. Write unstructured observations. Don't try to categorize — just observe. The question at this stage is: "What went wrong here?" not "Which category does this belong to?" This is the same method Langfuse and others recommend for production error analysis: let categories emerge from notes rather than checking traces against a predefined list.
Axial coding second. Group open-coded observations into a tentative taxonomy. Count occurrences per cluster. Re-read the largest clusters and confirm the grouping is coherent. Repeat until new traces stop producing new categories — that's the point at which you have enough taxonomy coverage to automate.
For teams early in the process: read the first 20 failures manually. At 20 failures per week, manual review is still tractable and produces a better taxonomy than premature automation. Automating a premature taxonomy generates better-organized noise. The transition is worth the investment at roughly 50–100 failures per week depending on team bandwidth.
63% of step-level failures propagate from upstream. Behavioral signal patterns map to cause classes — the trail starts from the cluster, not the failure node.
Root cause attribution is where behavioral signals earn their instrumentation cost. Without them, you have a failure timestamp and an output. With them, you have a behavioral fingerprint of the run: what grounding rate the agent maintained, whether it invoked fallback, where confidence dropped, and whether the downstream output passed validation.
The most common diagnostic error is investigating the failure node rather than tracing backward to its cause. Research on multi-step agent workflows found that 63% of step-level failures are propagated from upstream errors — not locally caused.[3] When a downstream synthesis step produces a hallucinated fact, the root cause is often a retrieval failure or a tool argument malformation three steps earlier. Debugging the synthesis step finds nothing. Following the behavioral signal trail backward finds the break. Root cause identification improved from a median of 4.2 hours to 22 minutes with structured diagnostic frameworks.[3]
This propagation problem is especially acute in multi-agent systems, where one agent's corrupted output becomes another agent's ground truth. The MAST taxonomy found that inter-agent misalignment failures — coordination breakdowns and conflicting objectives — account for 36.94% of all failure events.[9] No single agent's logs will surface these; only a cross-agent trace view, anchored by behavioral signals at each hop, will.
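Tracing backward can be approximated by walking the steps of a trace in execution order and stopping at the earliest one whose behavioral signals look wrong. This is a minimal sketch with hypothetical step fields; it assumes every step or agent hop records the same three boolean signals.

```python
from typing import Optional


def first_break(steps: list[dict]) -> Optional[dict]:
    """Return the earliest step with an abnormal behavioral signal, if any.

    `steps` must be in execution order. The step where the failure was
    observed is usually later in the list than the step returned here.
    """
    for step in steps:
        abnormal = (
            not step["grounded"]
            or step["fallback"]
            or not step["downstream_ok"]
        )
        if abnormal:
            return step
    return None
```

In a multi-agent trace the same walk applies across hops, provided each agent's span carries the signals.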
Four root cause patterns account for the majority of silent production failures:[4]
Provider model drift. Confidence drops and the execution fingerprint shifts — more fallback invocations, different tool selection ratios — but no code or prompt changed on your side. GPT-4o's behavior change in February 2025 broke production applications that had been stable for months; teams without behavioral telemetry found out from users. Correlate with the provider's changelog. A confidence collapse on a day you deployed nothing is a provider-side model update until proven otherwise.
Prompt regression. Grounded rate drops and downstream_ok falls. Correlate with prompt hash changes. A silent template change that removed a constraint produces exactly this: the model still runs, the tools still call, but responses stop being grounded and stop satisfying downstream validators.
Retrieval drift. Grounded rate drops while confidence stays stable. The agent attempts retrieval, finds nothing useful, and proceeds on weights. Check the retrieval index for changes: new documents that diluted relevance, stale embeddings, a query expansion change that altered what gets retrieved. In zero-shot RAG configurations, retrieval failures increase by 40% relative to deployments using a query rewriter or fine-tuned embedding adapter — making retrieval drift the silent failure mode most teams encounter first.[10]
Tool schema change. Fallback rate spikes and downstream_ok fails. An upstream API changed its response schema. The tool call succeeded. The output is parsed without error. The values are malformed or missing. This is the failure mode that schema validation at the tool boundary would catch and almost never does.
| Signal pattern | Most likely root cause | First diagnostic step |
|---|---|---|
| Confidence drops + fingerprint shifts, no deploy on your side | Provider model drift | Check provider changelog and model version log against failure timestamp |
| Grounded rate drops + downstream_ok falls, prompt hash changed | Prompt regression | Diff system prompt and tool descriptions against last known-good hash |
| Grounded rate drops, confidence stable, fallback flat | Retrieval drift | Inspect index changes; rerun retrieval queries manually on representative inputs |
| Fallback spikes + downstream_ok fails, no prompt change | Tool schema change | Compare current tool response schema against the schema from the last clean window |
| Confidence collapses on specific task type only | Capability boundary hit | Check whether task type appeared in training distribution; add explicit prompt constraint |
| Inter-agent signal degradation: downstream agent grounded drops without upstream agent showing failure | Inter-agent misalignment (MAST category) | Inspect cross-agent trace: check output format of upstream agent against input contract of downstream agent |
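The first four rows of the table can be encoded as a small lookup function. This is a sketch, not a classifier: the boolean inputs are hypothetical summaries you would derive from your own baselines, and the rule order is an assumption about which pattern should win when several match.

```python
def likely_root_cause(
    *,
    grounded_dropped: bool = False,
    confidence_dropped: bool = False,
    fallback_spiked: bool = False,
    downstream_failing: bool = False,
    prompt_changed: bool = False,
    deployed: bool = False,
) -> str:
    """Map a cluster's signal pattern to the most likely cause class."""
    if confidence_dropped and not deployed:
        return "provider model drift"
    if grounded_dropped and downstream_failing and prompt_changed:
        return "prompt regression"
    if grounded_dropped and not confidence_dropped and not fallback_spiked:
        return "retrieval drift"
    if fallback_spiked and downstream_failing and not prompt_changed:
        return "tool schema change"
    return "unclassified: review manually"
```

The return value is a starting hypothesis to check against the deployment log, provider changelog, or index change log, not a verdict.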
The diagnostic loop closes only when the failure becomes a test case. Without that step, the same failure ships after the next model update.
The eval generation step is what most teams skip. They diagnose the root cause, fix the immediate problem, and close the incident. Three months later, after a model upgrade or a prompt refactoring, the same failure ships again. Nobody connects it to the previous incident because the previous incident produced no artifact.
The pattern that prevents recurrence: every diagnosed root cause produces one eval case. Every eval case gets a severity and a grader.[5]
Severity determines CI behavior: P0 blocks the deploy. P1 warns in CI and requires an explicit override. P2 logs and tracks. Safety violations and compliance failures are P0 regardless of frequency. Quality regressions and grounding failures are P1. Formatting issues and downstream validation failures are P2 until they affect a compliance workflow.
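One way to express the severity rules as a CI gate is sketched below. It reads "requires an explicit override" as meaning that a failing P1 eval stops the pipeline unless an override flag is set; that reading, and all the names, are assumptions.

```python
from enum import Enum


class Severity(Enum):
    P0 = "block"
    P1 = "warn"
    P2 = "log"


def ci_exit_code(failed: list[Severity], override_approved: bool = False) -> int:
    """Return 0 to let the deploy proceed, 1 to stop the pipeline."""
    if Severity.P0 in failed:
        return 1  # P0 always blocks, even with an override
    if Severity.P1 in failed and not override_approved:
        return 1  # P1 passes only with an explicit override
    return 0  # P2 failures are logged and tracked, never blocking
```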
Grader type follows failure class. Deterministic failures — the agent cited a document ID that was not in the retrieved set, the output JSON failed schema validation, fallback was invoked on a task type that should never trigger fallback — get assertion-based graders. Cheap to run, cheap to maintain, exact on recurrence. Semantic failures — the agent produced a plausible but factually wrong synthesis — get LLM-as-judge graders. More expensive, needed for cases where exact matching will never work.
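For the deterministic failure classes named above, an assertion-based grader can be a few lines. The two functions below are illustrative sketches with hypothetical names and inputs; a semantic failure would need an LLM-as-judge grader instead, which is not shown.

```python
def grade_citations(
    retrieved_ids: set[str], cited_ids: set[str]
) -> tuple[bool, list[str]]:
    """Pass only if every cited document ID was in the retrieved set.

    Returns (passed, offending_ids) so a failure report names the bad IDs.
    """
    unknown = sorted(cited_ids - retrieved_ids)
    return (not unknown, unknown)


def grade_no_forbidden_fallback(
    task_type: str, fallback_used: bool, no_fallback_tasks: set[str]
) -> bool:
    """Pass unless fallback fired on a task type that must never use it."""
    return not (fallback_used and task_type in no_fallback_tasks)
```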
The compounding effect is real. Chronicle Labs documented a team that started with 23 manually written evals and grew to 147 automatically generated from production failures over 60 days.[5] Regression coverage grows as a function of production failure rate — not developer time. The more failures the system sees, the harder it becomes to regress on known failure modes.
One honest constraint: evals generated from production failure clusters cover failure modes the agent has already exhibited. They don't catch novel failure modes. The production-to-eval loop needs companion coverage from adversarial testing and boundary analysis for failure classes the agent hasn't yet seen but will eventually encounter.
Signal-based trajectory sampling to select which failures warrant eval cases improves the efficiency of this work. Research on agent trajectory triage found that signal-based sampling achieves 82% informativeness versus 54% for random sampling — a 1.52× efficiency gain per informative trajectory selected.[2] The behavioral signals you already collected at execution time are the selection criteria. Grounding failures with downstream_ok=false and confidence below 0.70 are the highest-priority candidates. Clean runs with all signals in normal range contribute nothing new to the eval corpus.
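A simple scoring rule is enough to turn the collected signals into a selection order. The sketch below uses the 0.70 confidence threshold from the text; the additive scoring and the field names are assumptions, not the method used in the cited research.

```python
def eval_priority(run: dict, confidence_floor: float = 0.70) -> int:
    """Count how many behavioral signals are abnormal on this run."""
    score = 0
    if not run["grounded"]:
        score += 1
    if not run["downstream_ok"]:
        score += 1
    if run["confidence"] < confidence_floor:
        score += 1
    return score


def select_candidates(runs: list[dict], limit: int) -> list[dict]:
    """Drop clean runs, then keep the highest-scoring ones up to `limit`."""
    flagged = [run for run in runs if eval_priority(run) > 0]
    return sorted(flagged, key=eval_priority, reverse=True)[:limit]
```

Clean runs score zero and are excluded, which matches the point that they contribute nothing new to the eval corpus.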
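88% of production agent failures trace to infrastructure gaps: missing monitoring and instrumentation, not model quality (591 documented incidents, 2023–2026).[1]
63% of step-level failures propagate from upstream. The root cause is not at the failure node; trace backward through behavioral signals.[3]
4.2 hours to 22 minutes: median root cause identification time with structured diagnostic frameworks versus unstructured log investigation.[3]
1.52× efficiency gain from signal-based sampling: 82% informativeness versus 54% for random sampling when selecting which failures to act on.[2]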
| Step | Without the diagnostic pipeline | With the diagnostic pipeline |
|---|---|---|
| 1 | Alert fires from downstream consequence: audit finding, user complaint, report error | Behavioral signal drop surfaces within one detection window, before downstream consequence |
| 2 | Engineer reviews logs for the triggering request without behavioral context | Failure cluster groups similar events; engineer investigates one representative trace |
| 3 | Root cause identified after hours of unstructured log reading | Root cause attributed in ~22 minutes via behavioral signal pattern matching |
| 4 | Fix deployed. Incident closed. No artifact produced. | Fix deployed AND eval case generated with CI severity gate |
| 5 | Same failure ships again after the next model update or prompt refactor | Same failure class blocked at CI gate on the next model update or prompt refactor |
LangSmith, Langfuse, and Arize Phoenix give you trace collection and eval running. They don't automatically close the loop from failure cluster to CI gate.
Gartner now predicts that 40% of organizations deploying AI will use dedicated AI observability tools by 2028 to monitor model performance, with 60% of software engineering teams adopting AI evaluation and observability platforms — up from just 18% in 2025.[11] That adoption curve means the tooling ecosystem is maturing fast, but not uniformly.
The current landscape splits into three categories:
Trace collection + visualization (LangSmith, Langfuse, Arize Phoenix, Datadog LLM Observability): These give you full agent traces, with every LLM call, tool invocation, retrieval step, and sub-agent hop as a structured span tree. Arize Phoenix supports typed span kinds (CHAIN, LLM, TOOL, RETRIEVER, EMBEDDING, AGENT, RERANKER, GUARDRAIL, EVALUATOR, among others) and is OpenTelemetry-native via OpenInference. Langfuse is self-hostable and framework-agnostic. Both can attach custom attributes, which is where your agent.behavioral.* signals land. What they don't do automatically: cluster your failures by behavioral signature, or route clusters to an eval case generator.
Eval running (Braintrust, Confident AI, LangSmith Evals): These run your eval cases, score them, and surface regressions. They don't generate eval cases from production failures. That generation step — the one that closes the loop — is yours to build.
Full-stack APM + LLM (Datadog, New Relic, Honeycomb): Infrastructure observability plus LLM-native layers. The pattern that scales: pair an LLM-native observability platform (for agent traces, eval, and LLM-specific metrics) with whole-stack APM (for infrastructure health). One without the other leaves blind spots on both ends.
The honest assessment: no single platform today fully automates the path from production failure → behavioral signal cluster → attributed root cause → generated eval case → CI gate. That path requires instrumentation code you write, a failure taxonomy you build, and a postmortem process you enforce. The tools handle the storage, querying, and execution layers. The diagnostic logic is the team's job.
Phased delivery ships something useful on day three. A complete design that is half-implemented is not an observability system — it's a plan.
Add agent.behavioral.grounded, fallback, confidence, and downstream_ok as span attributes on every invoke_agent span in production. Start with grounded and downstream_ok: these two catch the highest-frequency failure classes at near-zero compute cost. Collect 200+ runs per major workflow type before computing any baselines. You're building the vocabulary before the diagnostic conversation starts.
Store behavioral signal attributes in a queryable format — a time-series database or an OTel backend with custom attribute support. Run weekly failure reviews: pull all runs where grounded < 0.75 OR fallback=true OR downstream_ok=false from the past seven days. Apply open-coding to the first 30–50 failures. Don't automate the taxonomy before you've manually coded at least 100 failures across multiple failure types.
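The weekly review pull is a three-condition filter over a seven-day window. The sketch below runs it over in-memory dicts for illustration; in practice the same predicate would be a query against your trace backend. It assumes `grounded` is stored as a numeric score per run and that timestamps are timezone-aware, and the field names are hypothetical.

```python
from datetime import datetime, timedelta


def needs_review(
    run: dict,
    now: datetime,
    window_days: int = 7,
    grounded_floor: float = 0.75,
) -> bool:
    """True for runs in the review window with any abnormal signal."""
    age = now - run["timestamp"]
    recent = timedelta(0) <= age <= timedelta(days=window_days)
    flagged = (
        run["grounded"] < grounded_floor
        or run["fallback"]
        or not run["downstream_ok"]
    )
    return recent and flagged


def weekly_review(runs: list[dict], now: datetime) -> list[dict]:
    """All runs from the past week that warrant manual open coding."""
    return [run for run in runs if needs_review(run, now)]
```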
Build a root cause lookup table using the signal patterns from the attribution table above. For each failure cluster, match the behavioral signal pattern to the most likely cause class, then check the correlation evidence — deployment log, provider changelog, index change log. The lookup table speeds attribution from hours to minutes. It structures judgment; it doesn't replace it.
Add an eval case export step at the end of every incident postmortem. Assign severity and grader type based on failure class: P0 for safety and compliance failures, P1 for accuracy and grounding regressions, P2 for formatting and downstream validation failures. Wire P0 and P1 evals into CI — they gate deploys on the next model update, tool schema change, or prompt refactor. At 30 days you have a working loop. At 60 days you have production-derived eval coverage that catches regressions you never thought to write test cases for.
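The export step can be as small as serializing one record per diagnosed incident. This sketch assumes a JSON-lines file as the eval corpus; the `EvalCase` fields and allowed values are hypothetical and would need to match whatever your eval runner expects.

```python
import json
from dataclasses import asdict, dataclass

SEVERITIES = {"P0", "P1", "P2"}
GRADERS = {"assertion", "llm_judge"}


@dataclass
class EvalCase:
    incident_id: str
    root_cause: str
    severity: str  # one of SEVERITIES
    grader: str  # one of GRADERS
    input_payload: dict  # the input that reproduced the failure

    def __post_init__(self) -> None:
        # Reject cases the CI gate would not know how to handle.
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.grader not in GRADERS:
            raise ValueError(f"unknown grader: {self.grader}")


def export_line(case: EvalCase) -> str:
    """Serialize one eval case as a single JSON line."""
    return json.dumps(asdict(case), sort_keys=True)
```

Appending the output of `export_line` to a versioned file gives each postmortem a durable artifact that CI can read.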
If the failure is checkable by code (wrong doc ID, schema validation failure, fallback invoked on a forbidden task type), an assertion grader is cheaper, faster, and exact. LLM judge on a deterministic failure is 10× the cost for the same precision.
If the failure is plausible-but-wrong (factually incorrect synthesis, reasonable-sounding policy misquote), assertion graders will miss recurrences that differ in phrasing. LLM judge is the right call even at higher cost.
Safety violations and compliance errors that get an override without documentation are the failures that eventually become regulatory findings.
A case that passes today documents a failure mode the system has seen. Delete it and you lose the constraint. Model updates, prompt changes, and tool schema changes can reopen closed failure modes.
40 events in one cluster produce one eval case targeting the cluster root cause. 40 individual evals for the same failure mode create maintenance overhead without improving coverage.
How is behavioral telemetry different from the detection layer — fingerprinting, semantic drift?
Detection layers compare output distributions to baselines. They tell you when something has changed across a population of runs. Behavioral telemetry is execution-time: it captures signals during the agent's reasoning loop on a specific run, before the output is produced. The two are complementary. Detection finds that a population is drifting. Behavioral signals diagnose why a specific run failed and which component failed it. Without detection you don't know to look. Without behavioral telemetry, detection gives you an anomaly you can't attribute.
How do I compute the grounded score without an LLM evaluator on every request?
Deterministically. The grounded score requires no model call. For agents with attribution data, check whether response references include IDs from the retrieved document set — a set membership check. For agents that append retrieved content to context, check whether the response contains phrases from retrieved passages — a string containment check. For agents without explicit attribution, track whether the retrieval tool was invoked and returned non-empty results on runs where the response was later flagged — a correlation proxy. All three are O(messages) operations with no inference cost.
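The string containment variant might look like the sketch below, which checks whether any run of consecutive words from a retrieved passage appears in the response. The window size of five words is an arbitrary assumption, and the matching is deliberately naive: it lowercases and collapses whitespace but does not strip punctuation.

```python
def shares_phrase(response: str, passages: list[str], n: int = 5) -> bool:
    """True if any n-word window from a retrieved passage is in the response."""
    # Pad with spaces so matches start and end on word boundaries.
    haystack = " " + " ".join(response.lower().split()) + " "
    for passage in passages:
        words = passage.lower().split()
        for i in range(len(words) - n + 1):
            phrase = " " + " ".join(words[i : i + n]) + " "
            if phrase in haystack:
                return True
    return False
```

A response that paraphrases its sources would score as ungrounded under this check, so it works best where the agent is expected to quote or closely follow retrieved text.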
What if my agent doesn't do retrieval — is grounded relevant?
No. The four signals are a starting taxonomy, not a mandatory checklist. Match signals to your agent's architecture. A tool-calling agent without retrieval benefits most from confidence and downstream_ok. An agent with heavy retrieval needs grounded and downstream_ok. A multi-step workflow agent needs fallback. The principle of capturing behavioral execution signals alongside infrastructure completion signals applies regardless of architecture. The specific signals depend on what your agent actually does.
At what failure volume does failure clustering pay off?
Start reading failures manually at any volume. Manual review produces a better taxonomy than premature automation. The automation becomes worth the investment when weekly failure volume exceeds what a human can review — roughly 50–100 failures per week. Below 20 per week, individual review is tractable and surfaces taxonomy patterns that automated clustering would miss. The transition point is not a fixed number. It's when the volume of repetitive-looking failures exceeds the bandwidth to read each one.
Does every production failure need an eval case?
No. One-off data issues and transient external service failures don't warrant evals — they warrant upstream infrastructure fixes. The deciding criterion: could this failure recur after a model update, a prompt change, or a tool schema change without anyone noticing? If yes, write an eval. If the failure was caused by a corrupted input or a temporary outage that has since been resolved, close it without an eval and document why. P0 failures — safety violations, compliance errors, complete task failures — always get evals regardless of root cause classification.
What's the difference between OpenInference and OpenTelemetry GenAI semantic conventions?
OpenInference is an Apache 2.0 spec originally from Arize, now adopted across the ecosystem (Arize Phoenix, LlamaIndex, others). OpenTelemetry GenAI semantic conventions are governed by the OpenTelemetry project and define gen_ai.* attribute names. They are converging: Phoenix is OpenTelemetry-native via OpenInference, and Datadog LLM Observability natively maps OTel GenAI semconv to its product UI. Use OTel GenAI for the gen_ai.* infrastructure attributes and agent.behavioral.* for your custom behavioral signals. Both can coexist on the same span.
The framework is not architecturally complex. Four behavioral attributes on existing spans. Weekly failure cluster reviews. A root cause lookup table that fits in a spreadsheet. One eval case per diagnosed incident, wired into CI.
By 2028, Gartner projects 40% of AI-deploying organizations will have dedicated observability tooling — which means 60% still won't.[11] The teams in that majority will keep discovering 18-day compliance failures through audit findings rather than behavioral signal drops.
What makes this hard isn't the instrumentation. It's the organizational habit of closing incidents at the fix stage. The eval generation step looks like overhead until the same failure ships again after a model upgrade six weeks later — and nobody connects it to the previous incident because the previous incident produced no artifact.
The test of whether your observability stack is complete is not whether you can detect degradation. Detection is the minimum. The test is whether every production failure that passes through the diagnostic pipeline becomes a constraint on the next deploy. If it doesn't, the loop isn't closed. The failure is waiting for its next deployment.