Quality at five users is self-regulating. At fifty, it is a liability. Build the rubric layer, gate stack, and federated ownership model before consensus rots into theater — or your AI program gets cancelled with the next budget cycle.
Cosine similarity scores look fine while your RAG pipeline gives wrong answers. Four failure modes that produce confident, wrong outputs — and the retrieval stack that actually fixes them.
Every dismiss, modify, and escalate is a labeled training signal. Most teams log it as a debug artifact and move on. Here is the audit schema, the weekly tuner, and the human approval gate that turn that signal into thresholds that converge in eight weeks.
The dashboard goes green while the model invents a refund policy. Status codes are not a quality signal for generative output. The fix is an eval stack: CI gates, judge models, sampled production scoring, and a dataset that compounds with every failure.
SRE runbooks assume one process, one stack trace, one bad line. Agent failures are distributed across dozens of reasoning steps — the wrong premise gets laundered through 33 more calls before the user sees it. Here is the taxonomy, the triage, the postmortem.
Most agent failures return HTTP 200. The dashboard stays green while the reasoning chain quietly compounds the wrong premise. Triage runbook, failure-mode field guide, observability instrumentation, and postmortem template for non-deterministic systems.
Train once, control the weights, call it sovereignty. Twelve months later the model is confidently wrong about pricing, policy, and headcount. The playbook for when to retrain, what to retrain on, and how to validate without breaking live agents.
Latency, error rate, and token cost stay green while LLM output quality degrades for weeks. The infrastructure layer cannot see semantic failure. Sampled evals, prompt hash drift, and distribution alerts are the signals that catch it before users do.
Most teams promote to multi-agent before proving the single agent. Three gates — observability, override readiness, behavioral consistency — decide whether orchestration is earned or inherited. Skip them and a $3.50 task becomes a $47,000 incident.
How to apply semantic versioning and consumer-driven contract testing to AI agent system prompts — treating prompts as versioned API contracts with explicit breaking change classification, agent manifests, and CDC-style registration for multi-agent production systems.
Valid JSON, clean dashboards, no alerts — and the agent's reasoning depth dropped 67% between two model updates. Three detection layers catch what HTTP error rates structurally cannot: execution fingerprinting, semantic drift, and user-signal triangulation.
Most production agents run on intentions nobody wrote down. Here is how to write the behavioral spec — scope, invariants, testable success criteria, and failure modes — that translates business intent into something your infrastructure can enforce.
Detection tells you something is wrong. The four-step diagnostic pipeline — behavioral telemetry, failure clustering, root cause attribution, eval generation — tells you what failed, why, and how to stop it from shipping again. Most teams build partial detection and stop there.
Teams default to vibe-based model selection because they lack production data pre-launch. A profiling harness resolves the catch-22: build a synthetic task corpus, define quality oracles as code, run every tier, and read cost_per_success from data before the first request.
89% of teams have observability tooling. 62% can map a trace to a failure cause. Seven failure modes grounded in H1 2026 incident data — each with distinct OTel trace signatures and an LLM classifier that routes the incident before the postmortem.
MAST (NeurIPS 2025, UC Berkeley) identifies 14 MAS failure modes across 3 structural categories. This playbook maps them to 3 diagnostic questions — and tells you which layer to fix before touching the model.
MAST's 14 agent failure modes cluster into 3 structural categories, each preventable at a different pre-production stage. This playbook maps them to 12 deployment gate questions with pass criteria and named ownership.
When production agents fail, teams default to prompt tuning regardless of structural root cause. This MAST-based triage protocol gives engineering leaders three speed-ordered checks — 30 seconds, 5 minutes, 20 minutes — each routing to a different structural owner before anyone changes a line.