SRE runbooks assume one process, one stack trace, one bad line. Agent failures are distributed across dozens of reasoning steps — the wrong premise gets laundered through 33 more calls before the user sees it. Here is the taxonomy, the triage, the postmortem.
Most agent failures return HTTP 200. The dashboard stays green while the reasoning chain quietly compounds the wrong premise. Here is the triage runbook, the failure-mode field guide, and the postmortem template that survives non-deterministic systems.
Most teams promote to multi-agent before proving the single agent. Three gates — observability, override readiness, behavioral consistency — decide whether orchestration is earned or inherited. Skip them and a $3.50 task becomes a $47,000 incident.
Valid JSON, clean dashboards, no alerts — and the agent's reasoning depth dropped 67% between two model updates. Three detection layers catch what HTTP error rates structurally cannot: execution fingerprinting, semantic drift, and user-signal triangulation.
Four agents coordinate. The trace backend shows 3 to 10 orphaned root spans, no causal thread. The model is not the failure. Context propagation is. Five gaps, the minimal code to close each, and the build order that actually ships.
Detection tells you something is wrong. The four-step diagnostic pipeline — behavioral telemetry, failure clustering, root cause attribution, eval generation — tells you what failed, why, and how to stop it from shipping again. Most teams build partial detection and stop there.