Four agents coordinate. The trace backend shows 3 to 10 orphaned root spans, no causal thread. The model is not the failure. Context propagation is. Five gaps, the minimal code to close each, and the build order that actually ships.
89% of teams have observability tooling. 62% can map a trace to a failure cause. Seven failure modes grounded in H1 2026 incident data, each with a distinct OTel trace signature, plus an LLM classifier that routes the incident before the postmortem.
MAST (NeurIPS 2025, UC Berkeley) identifies 14 multi-agent system (MAS) failure modes across 3 structural categories. This playbook maps them to 3 diagnostic questions and tells you which layer to fix before touching the model.
MAST's 14 agent failure modes cluster into 3 structural categories, each preventable at a different pre-production stage. This playbook maps them to 12 deployment-gate questions, each with pass criteria and a named owner.
When production agents fail, teams default to prompt tuning regardless of the structural root cause. This MAST-based triage protocol gives engineering leaders three speed-ordered checks (30 seconds, 5 minutes, 20 minutes), each routing to a different structural owner before anyone changes a line.