A compliance-checking agent ran for 18 days before anyone noticed. HTTP 200 on every request. Valid JSON on every response. No alert fired. The first signal was a regulatory audit finding: the agent had been citing fabricated policy references since a retrieval index update three weeks earlier.
The instrumentation layer was complete by standard measures — latency tracked, error rates monitored, token costs attributed. What it never captured: whether responses were grounded in retrieved documents, whether a fallback path had been silently invoked, or whether downstream systems could actually use the output. Eighty-eight percent of production agent failures trace to infrastructure gaps — absent monitoring, missing guardrails, inadequate trace instrumentation — not model quality.[1] The model was working exactly as designed. The diagnostic layer was never built.
Behavioral telemetry is the instrumentation layer between infrastructure metrics and output quality. It captures execution-time signals — grounded, fallback, confidence, downstream appropriateness — that describe how an agent behaved on a specific run, not just whether it returned 200. Without it, detection tells you something is degrading. Diagnosis remains impossible.
This is the four-step pipeline that teams research, partially prototype, and rarely complete: trace collection with behavioral signals → failure clustering → root cause attribution → eval generation. Each step compounds the last. The fourth step is the one that actually prevents recurrence.
Key Takeaways
- Behavioral telemetry adds four execution-time signals to every agent span: grounded (retrieval used?), fallback (primary path abandoned?), confidence (certainty of tool selection?), downstream_ok (output actionable downstream?).
- Failure clustering before root cause analysis: one cluster with 40 members is worth investigating. Forty individual failure events are not.
- 63% of step-level agent failures propagate from an upstream step — the root cause is not where the failure manifests.[3]
- Root cause identification improved from a median of 4.2 hours to 22 minutes with structured diagnostic frameworks.[3]
- The eval-from-failure loop compounds: 23 manually written tests can become 147 automatically generated from production failures in 60 days.[5]
Infrastructure Metrics Tell You the Request Landed. Behavioral Signals Tell You What It Did.
The gap between HTTP 200 and production correctness is where agent failures live. Four signals close it without additional model calls.
Standard OpenTelemetry GenAI semantic conventions cover four things for every LLM call: model, token counts, latency, and finish reason.[8] These are infrastructure signals. They tell you the bytes arrived on time and under budget. They have no opinion on whether the agent's reasoning was grounded in retrieved evidence, whether it silently fell back to a degraded path, or whether the output was structured in a way downstream systems could consume.
Behavioral telemetry fills that gap. It is the set of model-free, execution-time signals that describe how the agent behaved — not just that it completed. Each signal attaches as a span attribute on the existing invoke_agent span. No new infrastructure. No added latency. Richer attributes on spans you are already emitting.
Four signals cover the most frequent classes of silent failure:
agent.behavioral.grounded — Did the agent's response draw from retrieved context, or from its training weights? For any agent that does retrieval, this is the single most important behavioral signal. A grounded rate that drops from 0.94 to 0.71 over two weeks is a retrieval degradation signal that no infrastructure metric will surface. Compute it by checking whether retrieved document IDs appear in the reasoning trace or output attribution — a set membership check, not a model call.
agent.behavioral.fallback — Did the agent invoke a fallback path (secondary tool, default response, escalation trigger) rather than completing the primary workflow? A rising fallback rate says the primary path is failing silently. The agent is handling the situation, but not as designed. This catches silent workflow degradation before output quality metrics do.
agent.behavioral.confidence — What was the agent's assessed certainty about its tool selection or decision? Some providers return logprobs for tool selection; others require a confidence estimation prompt. This signal catches the plausible-but-wrong failure class — outputs the agent produced while uncertain, which are the first candidates for downstream validation or human review.
agent.behavioral.downstream_ok — Could a downstream system actually use this output? Schema validation passes but field values are malformed. A date is in the wrong timezone. A required ID is null. This catches outputs that are structurally valid but operationally useless — the failure class that hallucination detection misses because the format is correct and the content is plausibly wrong.
```python
# behavioral_telemetry.py — model-free execution-time signal collection.
# Attaches four behavioral attributes to the current invoke_agent span.
# Zero additional model calls. Computed from trace data already available.
from dataclasses import dataclass

from opentelemetry import trace

tracer = trace.get_tracer("app.agent")


@dataclass
class BehavioralSignals:
    grounded: float      # 0.0–1.0: fraction of response citing retrieved context
    fallback: bool       # True if fallback path invoked instead of primary workflow
    confidence: float    # 0.0–1.0: agent certainty about tool selection or decision
    downstream_ok: bool  # True if output passes downstream appropriateness checks


def record_behavioral_signals(
    span: trace.Span, signals: BehavioralSignals
) -> None:
    """Attach behavioral signals to the current agent span.

    Call after the reasoning loop completes, before the span closes.
    Namespace is agent.behavioral.* to avoid collision with gen_ai.*.
    """
    span.set_attribute("agent.behavioral.grounded", signals.grounded)
    span.set_attribute("agent.behavioral.fallback", signals.fallback)
    span.set_attribute("agent.behavioral.confidence", signals.confidence)
    span.set_attribute("agent.behavioral.downstream_ok", signals.downstream_ok)


def compute_grounded_score(
    retrieved_doc_ids: list[str], response_citations: list[str]
) -> float:
    """Fraction of response citations that trace to retrieved documents.

    Returns 0.0 if the agent produced no citations (treat as ungrounded).
    """
    if not response_citations:
        return 0.0
    matched = sum(1 for c in response_citations if c in set(retrieved_doc_ids))
    return matched / len(response_citations)
```

Forty Failure Events Is Not a Pattern. One Cluster Is.
Raw failure counts are noise. Behavioral signal clusters are the unit of diagnostic work. The transition point is 20 failures per workflow type per week.
Detection gives you a list of failed runs. That list, unprocessed, is not actionable. A team investigating 40 individual failures will converge on nothing useful. A team investigating three failure clusters with 13 members each will converge on root causes in under an hour.[4]
Failure clustering groups failed runs by behavioral signature — shared failure type, common signal pattern, same execution step, similar tool call sequence — to surface the underlying issue rather than individual incidents. The operational test: when you look at the cluster and immediately recognize the common cause, the clustering is working. When each member looks different, the signal dimensions need refinement.
The bootstrapping challenge is real. You cannot build a failure taxonomy before you have seen failures, and you cannot cluster failures without a taxonomy. The approach that works:
Open coding first. Have a domain expert read 20–50 recent failed traces without a category framework. Write unstructured observations. Do not try to categorize — just observe. The question at this stage is: "What went wrong here?" not "Which category does this belong to?"
Axial coding second. Group open-coded observations into a tentative taxonomy. Count occurrences per cluster. Re-read the largest clusters and confirm the grouping is coherent. Repeat until new traces stop producing new categories — that is the point at which you have enough taxonomy coverage to automate.
Research on production agent fault characterization across 13,602 documented failures identified five behavioral signal patterns that cluster reliably: grounding failures, fallback rate spikes, confidence collapses, downstream rejection events, and tool call anomalies.[7] Each maps to a distinct root cause class. The taxonomy is the key artifact. Building it before you have enough failures means it will miss the most common modes. Automating clustering before you have a validated taxonomy generates better-organized noise.
For teams early in the process: read the first 20 failures manually. At 20 failures per week, manual review is still tractable and produces a better taxonomy than premature automation. The automation becomes worth the investment when the weekly volume exceeds what a human can review — around 50–100 failures per week depending on team bandwidth.
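When the volume does justify automation, the grouping step itself can stay small. Below is a minimal sketch of signature-based clustering, assuming failed runs arrive as dicts of span attributes; the signature dimensions and thresholds are illustrative choices to replace with your own taxonomy, not prescribed values.

```python
# cluster_failures.py: group failed runs by behavioral signature (sketch only).
# The bucket thresholds below are illustrative, not validated constants.
from collections import defaultdict


def signature(run: dict) -> tuple:
    """Reduce a failed run's behavioral signals to a coarse signature.
    Runs that share a signature are likely to share a root cause."""
    return (
        "low_grounding" if run.get("agent.behavioral.grounded", 1.0) < 0.75 else "grounded",
        "fallback" if run.get("agent.behavioral.fallback") else "primary",
        "low_confidence" if run.get("agent.behavioral.confidence", 1.0) < 0.70 else "confident",
        "downstream_fail" if not run.get("agent.behavioral.downstream_ok", True) else "downstream_ok",
        run.get("workflow_type", "unknown"),
    )


def cluster_failures(failed_runs: list[dict]) -> dict[tuple, list[dict]]:
    """Group failed runs by signature, largest clusters first for review."""
    clusters: dict[tuple, list[dict]] = defaultdict(list)
    for run in failed_runs:
        clusters[signature(run)].append(run)
    return dict(sorted(clusters.items(), key=lambda kv: -len(kv[1])))
```

Reviewing the largest cluster's representative traces first is usually the fastest route to a root cause hypothesis.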
The Failure Didn't Start Where It Manifested. Trace Backward.
63% of step-level failures propagate from upstream. Behavioral signal patterns map to cause classes — the trail starts from the cluster, not the failure node.
Root cause attribution is where behavioral signals earn their instrumentation cost. Without them, you have a failure timestamp and an output. With them, you have a behavioral fingerprint of the run: what grounding rate the agent maintained, whether it invoked fallback, where confidence dropped, and whether the downstream output passed validation.
The most common diagnostic error is investigating the failure node rather than tracing backward to its cause. Research on multi-step agent workflows found that 63% of step-level failures are propagated from upstream errors — not locally caused.[3] When a downstream synthesis step produces a hallucinated fact, the root cause is often a retrieval failure or a tool argument malformation three steps earlier. Debugging the synthesis step finds nothing. Following the behavioral signal trail backward finds the break. Root cause identification improved from a median of 4.2 hours to 22 minutes with structured diagnostic frameworks.[3]
Four root cause patterns account for the majority of silent production failures:[4]
Provider model drift. Confidence drops and the execution fingerprint shifts — more fallback invocations, different tool selection ratios — but no code or prompt changed on your side. Correlate with the provider's changelog. A confidence collapse on a day you deployed nothing is a provider-side model update until proven otherwise.
Prompt regression. Grounded rate drops and downstream_ok falls. Correlate with prompt hash changes. A silent template change that removed a constraint produces exactly this: the model still runs, the tools still call, but responses stop being grounded and stop satisfying downstream validators.
Retrieval drift. Grounded rate drops while confidence stays stable. The agent attempts retrieval, finds nothing useful, and proceeds on weights. Check the retrieval index for changes: new documents that diluted relevance, stale embeddings, a query expansion change that altered what gets retrieved.
Tool schema change. Fallback rate spikes and downstream_ok fails. An upstream API changed its response schema. The tool call succeeded. The output is parsed without error. The values are malformed or missing. This is the failure mode that schema validation at the tool boundary would have caught and almost never does.
| Signal pattern | Most likely root cause | First diagnostic step |
|---|---|---|
| Confidence drops + fingerprint shifts, no deploy on your side | Provider model drift | Check provider changelog and model version log against failure timestamp |
| Grounded rate drops + downstream_ok falls, prompt hash changed | Prompt regression | Diff system prompt and tool descriptions against last known-good hash |
| Grounded rate drops, confidence stable, fallback flat | Retrieval drift | Inspect index changes; rerun retrieval queries manually on representative inputs |
| Fallback spikes + downstream_ok fails, no prompt change | Tool schema change | Compare current tool response schema against the schema from the last clean window |
| Confidence collapses on specific task type only | Capability boundary hit | Check whether task type appeared in training distribution; add explicit prompt constraint |
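The same table can be expressed as a small rule set. This is a sketch under stated assumptions: the SignalDelta fields and the thresholds that count as a "drop" or "spike" are illustrative values to tune against your own baselines, not part of any cited framework.

```python
# root_cause_lookup.py: map a cluster's signal pattern to a likely cause class.
# Sketch only; the delta thresholds are assumptions, tune against baselines.
from dataclasses import dataclass
from typing import Callable


@dataclass
class SignalDelta:
    grounded_drop: float         # baseline grounded rate minus current
    confidence_drop: float       # baseline confidence minus current
    fallback_spike: float        # current fallback rate minus baseline
    downstream_fail_rate: float  # fraction of runs with downstream_ok=False
    prompt_hash_changed: bool
    deployed_recently: bool


RULES: list[tuple[str, Callable[[SignalDelta], bool], str]] = [
    ("provider_model_drift",
     lambda d: d.confidence_drop > 0.15 and not d.deployed_recently,
     "Check provider changelog and model version log against failure timestamp"),
    ("prompt_regression",
     lambda d: d.grounded_drop > 0.10 and d.downstream_fail_rate > 0.10 and d.prompt_hash_changed,
     "Diff system prompt and tool descriptions against last known-good hash"),
    ("retrieval_drift",
     lambda d: d.grounded_drop > 0.10 and d.confidence_drop < 0.05 and d.fallback_spike < 0.05,
     "Inspect index changes; rerun retrieval queries on representative inputs"),
    ("tool_schema_change",
     lambda d: d.fallback_spike > 0.10 and d.downstream_fail_rate > 0.10 and not d.prompt_hash_changed,
     "Compare current tool response schema against the last clean window"),
]


def attribute_root_cause(delta: SignalDelta) -> tuple[str, str]:
    """Return (cause_class, first_diagnostic_step) for the first matching rule."""
    for cause, matches, next_step in RULES:
        if matches(delta):
            return cause, next_step
    return "unclassified", "Open-code the cluster manually and extend the rule table"
```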
A Diagnosed Failure With No Eval Is a Problem You Will Debug Again
The diagnostic loop closes only when the failure becomes a test case. Without that step, the same failure ships after the next model update.
The eval generation step is what most teams skip. They diagnose the root cause, fix the immediate problem, and close the incident. Three months later, after a model upgrade or a prompt refactoring, the same failure ships again. Nobody connects it to the previous incident because the previous incident produced no artifact.
The pattern that prevents recurrence: every diagnosed root cause produces one eval case. Every eval case gets a severity and a grader.[5]
Severity determines CI behavior: P0 blocks the deploy. P1 warns in CI and requires an explicit override. P2 logs and tracks. Safety violations and compliance failures are P0 regardless of frequency. Quality regressions and grounding failures are P1. Formatting issues and downstream validation failures are P2 until they affect a compliance workflow.
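A minimal sketch of how those severities could translate into CI exit behavior follows; the OVERRIDE_P1 environment variable and the exit-code convention are hypothetical, not part of any existing tool.

```python
# ci_gate.py: translate eval severities into CI behavior (sketch).
import os
import sys


def gate(results: list[tuple[str, str, bool]]) -> None:
    """results: (eval_id, severity, passed). P0 failures block the deploy,
    P1 failures warn and block unless explicitly overridden, P2 only logs."""
    p0_failures = [r for r in results if r[1] == "P0" and not r[2]]
    p1_failures = [r for r in results if r[1] == "P1" and not r[2]]
    for eval_id, _, _ in p1_failures:
        print(f"WARN: P1 regression {eval_id}")
    if p0_failures:
        print(f"BLOCK: {len(p0_failures)} P0 regression(s)")
        sys.exit(1)
    if p1_failures and os.environ.get("OVERRIDE_P1") != "1":
        print("BLOCK: P1 regressions present and no explicit override")
        sys.exit(1)
```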
Grader type follows failure class. Deterministic failures — the agent cited a document ID that was not in the retrieved set, the output JSON failed schema validation, fallback was invoked on a task type that should never trigger fallback — get assertion-based graders. Cheap to run, cheap to maintain, exact on recurrence. Semantic failures — the agent produced a plausible but factually wrong synthesis — get LLM-as-judge graders. More expensive, needed for the cases where exact matching will never work.
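For the deterministic class, a behavioral-signal grader can be a handful of comparisons. The sketch below assumes expectations are either literals (True/False) or comparison strings such as ">= 0.80", the convention used in the eval generation example further down.

```python
# signal_grader.py: assertion-based grader for behavioral signal expectations.
# Sketch; assumes every expected signal is present in the observed dict.
def grade_signals(expected: dict, observed: dict) -> dict[str, bool]:
    """Return per-signal pass/fail for one eval run."""
    results = {}
    for name, expectation in expected.items():
        value = observed[name]
        if isinstance(expectation, str):  # e.g. ">= 0.80"
            op, raw_threshold = expectation.split()
            threshold = float(raw_threshold)
            passed = {
                ">=": value >= threshold,
                "<=": value <= threshold,
                ">": value > threshold,
                "<": value < threshold,
            }[op]
        else:  # exact match for booleans and other literals
            passed = value == expectation
        results[name] = passed
    return results
```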
The compounding effect is real. Chronicle Labs documented a team that started with 23 manually written evals and grew to 147 automatically generated from production failures over 60 days.[5] Regression coverage grows as a function of production failure rate — not developer time. The more failures the system sees, the harder it becomes to regress on known failure modes.
One honest constraint: evals generated from production failure clusters cover failure modes the agent has already exhibited. They do not catch novel failure modes. The production-to-eval loop needs companion coverage from adversarial testing and boundary analysis for failure classes the agent has not yet seen but will eventually encounter.
Signal-based trajectory sampling to select which failures warrant eval cases improves the efficiency of this work. Research on agent trajectory triage found that signal-based sampling achieves 82% informativeness versus 54% for random sampling — a 1.52× efficiency gain per informative trajectory selected.[2] The behavioral signals you already collected at execution time are the selection criteria. Grounding failures with downstream_ok=false and confidence below 0.70 are the highest-priority candidates. Clean runs with all signals in normal range contribute nothing new to the eval corpus.
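As a sketch of that selection filter, using the thresholds mentioned above; the priority weights are an illustrative choice, not values from the cited work.

```python
# select_for_eval.py: signal-based selection of failure traces worth turning
# into eval cases (sketch). Clean runs score 0 and are skipped.
def eval_priority(run: dict) -> int:
    """Higher scores are more informative candidates for eval generation."""
    grounded = run.get("agent.behavioral.grounded", 1.0)
    confidence = run.get("agent.behavioral.confidence", 1.0)
    downstream_ok = run.get("agent.behavioral.downstream_ok", True)
    score = 0
    if grounded < 0.75:
        score += 2
    if not downstream_ok:
        score += 2
    if confidence < 0.70:
        score += 1
    return score


def select_candidates(runs: list[dict], limit: int = 20) -> list[dict]:
    """Return the highest-priority failure traces, capped at `limit`."""
    scored = [(eval_priority(r), r) for r in runs]
    return [r for s, r in sorted(scored, key=lambda x: -x[0]) if s > 0][:limit]
```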
```python
# eval_from_failure.py — generate a regression eval case from a diagnosed failure cluster.
# P0 blocks CI. P1 warns with explicit override required. P2 logs.
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    P0 = "P0"  # blocks deploy — safety, compliance, complete task failures
    P1 = "P1"  # warns in CI — quality regressions, grounding failures
    P2 = "P2"  # logged and tracked — formatting, downstream validation issues


@dataclass
class EvalCase:
    id: str
    cluster_id: str         # links back to the failure cluster
    input: dict             # agent input that triggered the failure
    expected_signals: dict  # behavioral signal assertions for a correct run
    grader: str             # "assertion" | "llm_judge" | "behavioral_signals"
    severity: Severity
    root_cause: str         # from root cause attribution step
    tags: list[str] = field(default_factory=list)


def generate_eval_from_cluster(
    cluster_id: str,
    representative_trace: dict,
    root_cause: str,
    severity: Severity,
) -> EvalCase:
    """Extract a regression eval from the cluster representative trace.

    Never delete passing cases — the corpus is institutional failure memory.
    """
    return EvalCase(
        id=f"PROD-{cluster_id}",
        cluster_id=cluster_id,
        input=representative_trace["input"],
        expected_signals={
            "grounded": ">= 0.80",    # must cite retrieved context
            "fallback": False,        # must not invoke fallback on primary path
            "confidence": ">= 0.70",  # must be confident about decision
            "downstream_ok": True,    # output must pass downstream validation
        },
        grader="behavioral_signals",
        severity=severity,
        root_cause=root_cause,
        tags=[f"cluster:{cluster_id}", f"root_cause:{root_cause}"],
    )
```
The diagnostic loop, without and with behavioral telemetry:

| Without behavioral telemetry | With behavioral telemetry |
|---|---|
| Alert fires from downstream consequence — audit finding, user complaint, report error | Behavioral signal drop surfaces within one detection window — before downstream consequence |
| Engineer reviews logs for the triggering request without behavioral context | Failure cluster groups similar events; engineer investigates one representative trace |
| Root cause identified after hours of unstructured log reading | Root cause attributed in ~22 minutes via behavioral signal pattern matching |
| Fix deployed. Incident closed. No artifact produced. | Fix deployed AND eval case generated with CI severity gate |
| Same failure ships again after the next model update or prompt refactor | Same failure class blocked at CI gate on the next model update or prompt refactor |
Build Order: Four Phases. Working Signal at Each Phase.
Phased delivery ships something useful on day three. A complete design that is half-implemented is not an observability system. It is a plan.
1. Phase 1: Behavioral Signal Instrumentation (Days 1–5). Add agent.behavioral.grounded, fallback, confidence, and downstream_ok as span attributes on every invoke_agent span in production. Start with grounded and downstream_ok — these two catch the highest-frequency failure classes at near-zero compute cost. Collect 200+ runs per major workflow type before computing any baselines. You are building the vocabulary before the diagnostic conversation starts.
2. Phase 2: Failure Clustering Setup (Days 5–14). Store behavioral signal attributes in a queryable format — a time-series database or an OTel backend with custom attribute support. Run weekly failure reviews: pull all runs where grounded < 0.75 OR fallback=true OR downstream_ok=false from the past seven days (see the query sketch after this list). Apply open coding to the first 30–50 failures. Do not automate the taxonomy before you have manually coded at least 100 failures across multiple failure types. Automating a premature taxonomy generates better-organized noise.
3. Phase 3: Root Cause Attribution Templates (Days 14–21). Build a root cause lookup table using the signal patterns from the table above. For each failure cluster, match the behavioral signal pattern to the most likely cause class, then check the correlation evidence — deployment log, provider changelog, index change log. The lookup table speeds attribution from hours to minutes. It structures judgment; it does not replace it. Update the table when a new root cause class appears that no existing pattern covers.
4. Phase 4: Eval Generation Integration (Days 21–30). Add an eval case export step at the end of every incident postmortem. Assign severity and grader type based on failure class: P0 for safety and compliance failures, P1 for accuracy and grounding regressions, P2 for formatting and downstream validation failures. Wire P0 and P1 evals into the CI pipeline — they gate deploys on the next model update, tool schema change, or prompt refactor. At 30 days you have a working loop. At 60 days you have enough production-derived eval coverage to catch regressions you never thought to write test cases for.
Production Behavioral Telemetry Readiness
- agent.behavioral.grounded recorded on every invoke_agent span — not just sampled
- agent.behavioral.fallback recorded — distinguishes primary path completions from fallback invocations
- agent.behavioral.confidence recorded — sourced from logprobs, tool selection scores, or confidence estimation
- agent.behavioral.downstream_ok recorded — output passes downstream validation before span closes
- Failed runs queryable by behavioral signal pattern, not just error message or HTTP status
- Failure clusters reviewed weekly via open coding before any automated taxonomy is applied
- Root cause lookup table maintained and updated when a new failure class appears
- Every diagnosed failure cluster produces at least one eval case before the incident closes
- P0 and P1 eval cases wired into CI pipeline — not stored only in a document
- Eval corpus tagged with cluster ID and root cause — each test case links to the production failure that produced it
How is behavioral telemetry different from the detection layer — fingerprinting, semantic drift?
Detection layers compare output distributions to baselines. They tell you when something has changed across a population of runs. Behavioral telemetry is execution-time: it captures signals during the agent's reasoning loop on a specific run, before the output is produced. The two are complementary. Detection finds that a population is drifting. Behavioral signals diagnose why a specific run failed and which component failed it. Without detection you don't know to look. Without behavioral telemetry, detection gives you an anomaly you cannot attribute.
How do I compute the grounded score without an LLM evaluator on every request?
Deterministically. The grounded score requires no model call. For agents with attribution data, check whether response references include IDs from the retrieved document set — a set membership check. For agents that append retrieved content to context, check whether the response contains phrases from retrieved passages — a string containment check. For agents without explicit attribution, track whether the retrieval tool was invoked and returned non-empty results on runs where the response was later flagged — a correlation proxy. All three are O(messages) operations with no inference cost.
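A sketch of the string-containment variant for agents without explicit citations follows; the eight-word phrase length is an illustrative choice, not a recommended constant.

```python
# grounded_containment.py: grounded score without citation data (sketch).
def grounded_by_containment(
    response: str, retrieved_passages: list[str], n: int = 8
) -> float:
    """Fraction of retrieved passages contributing at least one n-word
    phrase that appears verbatim in the response."""
    if not retrieved_passages:
        return 0.0
    response_lower = response.lower()
    hits = 0
    for passage in retrieved_passages:
        words = passage.lower().split()
        if not words:
            continue
        phrases = (
            " ".join(words[i:i + n])
            for i in range(max(1, len(words) - n + 1))
        )
        if any(p in response_lower for p in phrases):
            hits += 1
    return hits / len(retrieved_passages)
```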
What if my agent doesn't do retrieval — is grounded relevant?
No. The four signals are a starting taxonomy, not a mandatory checklist. Match signals to your agent's architecture. A tool-calling agent without retrieval benefits most from confidence and downstream_ok. An agent with heavy retrieval needs grounded and downstream_ok. A multi-step workflow agent needs fallback. The principle — capture behavioral execution signals alongside infrastructure completion signals — applies regardless of architecture. The specific signals depend on what your agent actually does.
At what failure volume does failure clustering pay off?
Start reading failures manually at any volume. Manual review produces a better taxonomy than premature automation. The automation becomes worth the investment when weekly failure volume exceeds what a human can review — roughly 50–100 failures per week. Below 20 per week, individual review is tractable and surfaces taxonomy patterns that automated clustering would miss. The transition point is not a fixed number. It is when the volume of repetitive-looking failures exceeds the bandwidth to read each one.
Does every production failure need an eval case?
No. One-off data issues and transient external service failures do not warrant evals — they warrant upstream infrastructure fixes. The deciding criterion: could this failure recur after a model update, a prompt change, or a tool schema change without anyone noticing? If yes, write an eval. If the failure was caused by a corrupted input or a temporary outage that has since been resolved, close it without an eval and document why. P0 failures — safety violations, compliance errors, complete task failures — always get evals regardless of root cause classification.
The framework is not architecturally complex. Four behavioral attributes on existing spans. Weekly failure cluster reviews. A root cause lookup table that fits in a spreadsheet. One eval case per diagnosed incident, wired into CI.
What makes it hard is the organizational habit of closing incidents at the fix stage. The eval generation step looks like overhead until the same failure ships again after a model upgrade six weeks later — and nobody connects it to the previous incident because the previous incident produced no artifact.
The test of whether your observability stack is complete is not whether you can detect degradation. Detection is the minimum. The test is whether every production failure that passes through the diagnostic pipeline becomes a constraint on the next deploy. If it does not, the loop is not closed. The failure is waiting for its next deployment.
- [1] Trace-Driven Debugging for AI Agent Failures: From Production Incident to Regression Test — Zylos Research, Apr 2026 (zylos.ai)
- [2] Signals: Trajectory Sampling and Triage for Agentic Interactions — Chen et al., arXiv:2604.00356, Apr 2026 (arxiv.org)
- [3] AGENTEVAL: Evaluation Infrastructure for Multi-Step Agent Workflows — OpenReview, 2026 (openreview.net)
- [4] Detecting AI Agent Failure Modes in Production: A Framework for Observability-Driven Diagnosis — Latitude (latitude.so)
- [5] Incident-to-Eval Synthesis: Production Failures as Evals — AgentPatterns.ai (agentpatterns.ai)
- [6] AgentRx: Diagnosing AI Agent Failures from Execution Trajectories — arXiv:2602.02475, Feb 2026 (arxiv.org)
- [7] Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes — arXiv:2603.06847, Mar 2026 (arxiv.org)
- [8] Runtime Evals and Observability for Agentic Systems — Vikas Goyal, Mar 2026 (vikasgoyal.github.io)