Every dismiss, modify, and escalate is a labeled training signal. Most teams log it as a debug artifact and move on. Here is the audit schema, the weekly tuner, and the human approval gate that turn that signal into thresholds that converge in eight weeks.
The dashboard goes green while the model invents a refund policy. Status codes are not a quality signal for generative output. The fix is an eval stack: CI gates, judge models, sampled production scoring, and a dataset that compounds with every failure.
SRE runbooks assume one process, one stack trace, one bad line. Agent failures are distributed across dozens of reasoning steps — the wrong premise gets laundered through 33 more calls before the user sees it. Here is the taxonomy, the triage, the postmortem.
Most agent failures return HTTP 200. The dashboard stays green while the reasoning chain quietly compounds the wrong premise. Here is the triage runbook, the failure-mode field guide, and the postmortem template that survives non-deterministic systems.
Train once, control the weights, call it sovereignty. Twelve months later the model is confidently wrong about pricing, policy, and headcount. The playbook for when to retrain, what to retrain on, and how to validate without breaking live agents.
Latency, error rate, and token cost stay green while LLM output quality degrades for weeks. The infrastructure layer cannot see semantic failure. Sampled evals, prompt hash drift, and distribution alerts are the signals that catch it before users do.
Most teams promote to multi-agent before proving the single agent. Three gates — observability, override readiness, behavioral consistency — decide whether orchestration is earned or inherited. Skip them and a $3.50 task becomes a $47,000 incident.
How to apply semantic versioning and consumer-driven contract testing to AI agent system prompts — treating prompts as versioned API contracts with explicit breaking change classification, agent manifests, and CDC-style registration for multi-agent production systems.
Valid JSON, clean dashboards, no alerts — and the agent's reasoning depth dropped 67% between two model updates. Three detection layers catch what HTTP error rates structurally cannot: execution fingerprinting, semantic drift, and user-signal triangulation.
Most production agents run on intentions nobody wrote down. Here is how to write the behavioral spec — scope, invariants, testable success criteria, and failure modes — that translates business intent into something your infrastructure can enforce.
Detection tells you something is wrong. The four-step diagnostic pipeline — behavioral telemetry, failure clustering, root cause attribution, eval generation — tells you what failed, why, and how to stop it from shipping again. Most teams build partial detection and stop there.