When production agents fail, teams default to prompt tuning regardless of the structural root cause. This MAST-based triage protocol gives engineering leaders three speed-ordered checks (30 seconds, 5 minutes, 20 minutes), each routing to a different structural owner before anyone changes a line of the prompt.
IBM Research applied MAST to 310 real IT automation traces in early 2026 and produced something more operationally useful than failure rates: each model class had a different dominant failure pattern, and each pointed to a different structural intervention.[4] GPT-oss-120b showed systematic FM-2.6 failures — reasoning mismatches that poisoned the task context over multiple turns and caused total session derailment. The fix was context hygiene and early error detection. Not a prompt rewrite. Kimi-K2 couldn't recognize task completion, generating runaway loops. The fix was a deterministic state machine. Gemini-3-Flash's most common fatal failure was FM-3.3: the agent self-terminated based on its own assessment rather than tool-mediated evidence that the alert was actually cleared. The fix was requiring a hard verification step — an AlertManager clearance or a Kubernetes state change — before marking any run successful.
Three architectures. Three dominant structural failures. None solved by touching the system prompt.
This is the wrong-layer problem. Every agent failure produces visible symptoms that look like model behavior: repeated steps, agent disagreement, degraded output. The team reaches for the nearest control surface. But MAST's 14 failure modes cluster into three structural categories that each require interventions at different stages:[1] specification failures (FC1, 41.77%) are fixable pre-deploy; coordination failures (FC2, 36.94%) require runtime architecture changes; verification failures (FC3, 21.30%) need an eval layer that infrastructure observability cannot replace. Applying the wrong fix delays the right one by at least one incident cycle. After enough incident cycles the pattern becomes invisible: every postmortem says 'model behavior was inconsistent' because nobody classified the structural root cause.
This protocol is for engineering leaders who need to classify a live agent failure in under 20 minutes and route it to the right team. MAST's categories become a speed-ordered triage sequence: fastest check first, ownership built in.
Specification issues are the largest single category — and the only one fully preventable before any agent runs.[1]
FM-2.6 (reasoning-action mismatch) and FM-2.2 (failure to ask for clarification) are the two most common modes — neither fixable via system prompt.[1]
TRAIL benchmark, Gemini-2.5-pro: models are poor at freeform trace diagnosis. Structured MAST classification is necessary.[6]
IBM Research + UC Berkeley applied MAST to real SRE, security, and FinOps agent tasks — yielding model-specific architectural prescriptions.[4]
The MAST category tells you which team has the capacity to prevent the failure from recurring — not just what went wrong.
The critical structural insight from MAST isn't the count of 14 failure modes. It's that the three categories map to fundamentally different ownership domains. The wrong-layer problem happens when incident response ignores that mapping.
FC1 — Specification Issues originate before the first agent turn. The agent misunderstood the task, stepped outside its role definition, repeated work already done, lost conversational context, or ran without a termination signal it could evaluate. Every FC1 failure is readable from the spec. The owner is whoever writes and reviews behavioral specifications before agents deploy — the system architect, the platform engineer who owns the agent's system prompt, or the team that governs the deployment review gate. The fix window is pre-deploy, making FC1 the cheapest failure category to close.
FC2 — Inter-Agent Misalignment emerges during execution from agent-to-agent interaction. Agents proceed on wrong assumptions (FM-2.2), reasoning doesn't match action taken (FM-2.6), task objectives drift mid-run without a decision point (FM-2.3). The owner is the platform or orchestration team — responsible for coordination protocol design, inter-agent message schemas, and shared task state. This isn't fixable from the system prompt. It requires architecture changes: clarification gates, explicit handoff validation, reasoning-action logging at each step. The MAST authors are direct: communication protocol improvements are insufficient for FC2 failures — coordination design is required.[1]
FC3 — Task Verification produces the most dangerous diagnostic signal: everything completes, all infrastructure checks pass, output is wrong. Premature termination (FM-3.1), missing verification (FM-3.2), incorrect verification (FM-3.3) — all three look like successful runs from the infrastructure layer. The owner is the eval or quality team running LLM-as-judge or domain validators against production outputs. Infrastructure observability cannot catch FC3 failures. The eval layer either exists or this entire failure category is invisible.[1]
| FC Category | Share of Failures | Fix Timing | Team Owner | Wrong Fix Pattern |
|---|---|---|---|---|
| FC1 — Specification | 41.77%[1] | Pre-deploy | Spec author + review gate | Prompt tone adjustment ('be more precise') |
| FC2 — Coordination | 36.94%[1] | Runtime redesign | Platform / orchestration | More retries, stronger coordination directives |
| FC3 — Verification | 21.30%[1] | Eval layer | Eval / quality team | Infrastructure alerting, schema validation |
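The ownership table can be encoded so that an incident tool routes by category instead of by symptom. The following is a minimal sketch: the `FC` enum, the `Route` dataclass, and `route_for` are illustrative names introduced here, not part of MAST, and the owner labels are simply copied from the table.

```python
from dataclasses import dataclass
from enum import Enum


class FC(Enum):
    SPECIFICATION = "FC1"
    COORDINATION = "FC2"
    VERIFICATION = "FC3"


@dataclass(frozen=True)
class Route:
    fix_timing: str
    owner: str
    wrong_fix_pattern: str


# Values are copied from the table; the structure and names are assumptions.
ROUTES = {
    FC.SPECIFICATION: Route("pre-deploy", "spec author + review gate",
                            "prompt tone adjustment"),
    FC.COORDINATION: Route("runtime redesign", "platform / orchestration",
                           "more retries, stronger coordination directives"),
    FC.VERIFICATION: Route("eval layer", "eval / quality team",
                           "infrastructure alerting, schema validation"),
}


def route_for(mast_code: str) -> Route:
    """Resolve 'FC2' or a mode code such as 'FM-2.6' to its route."""
    code = mast_code.strip().upper()
    if code.startswith("FM-"):
        # In 'FM-<category>.<mode>' the character after 'FM-' is the category digit.
        code = "FC" + code[3]
    return ROUTES[FC(code)]
```

Calling `route_for("FM-2.6")` returns the FC2 route, so a postmortem tagged with a fine-grained mode still lands with the platform or orchestration owner.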
The three checks don't take the same time. FC1 takes 30 seconds from the spec. FC2 takes 5 minutes in the coordination logs. FC3 takes 20 minutes to confirm absent eval coverage.
Under incident conditions, you need to route to the right fix team fast. The three checks are ordered by time to answer, not by complexity of the failure.
The 30-second check (FC1): Pull up the system prompt and role definition alongside the failed trace. Can you point to a missing constraint, an ambiguous role boundary, or a termination condition the agent couldn't evaluate? If yes — that's FC1. You don't need logs, tool traces, or inter-agent messages. You need the spec and the first point of divergence. FC1 is the largest category at 41.77%[1] and the cheapest to diagnose.
The 5-minute check (FC2): Go to the inter-agent message log at the step where behavior diverged. Compare stated reasoning to the actual tool call or message at that step. Check whether the agent had a decision point to ask for clarification before it proceeded on an assumption. If individual agent outputs look valid but the coordination produced the wrong result, you're in FC2.
The 20-minute check (FC3): If the first two checks come up clean, verify what the verifier actually checked. Did verification run at all? Did it run at intermediate steps or only at terminal output? Did it test task-level correctness rather than format or schema? FC3 takes longer because confirming absent verification requires understanding what eval coverage was supposed to exist.
Teams that skip the spec check and go directly to logs are doing the 20-minute check on a 30-second problem. Start with the fastest check that can falsify the hypothesis.
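The speed ordering can be written down as a small decision routine. This is a sketch under stated assumptions: the answers come from a human reviewer, and `TriageAnswers` and `triage` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TriageAnswers:
    # Each field is a reviewer's answer; None means the check has not been run.
    spec_gap_found: Optional[bool] = None          # 30-second check
    coordination_gap_found: Optional[bool] = None  # 5-minute check
    verification_gap_found: Optional[bool] = None  # 20-minute check


def triage(answers: TriageAnswers) -> str:
    """Return the first confirmed category, or the next check to run."""
    checks = [
        ("FC1", "30-second spec check", answers.spec_gap_found),
        ("FC2", "5-minute coordination log check", answers.coordination_gap_found),
        ("FC3", "20-minute verification review", answers.verification_gap_found),
    ]
    for category, name, answer in checks:
        if answer is None:
            return f"run {name}"   # never skip ahead to a slower check
        if answer:
            return category        # stop at the fastest check that confirms
    return "unclassified"
```

The routine refuses to consider the 5-minute or 20-minute checks until the faster ones have an answer, which is the ordering rule the protocol depends on.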
Pull the system prompt, task description, and role definition. Read them against the failed trace. Find the step where agent behavior diverged from stated intent. If you can identify a missing constraint, an undefined role boundary, a step repeated with no state guard, or a termination condition the agent couldn't evaluate — that's FC1. The agent followed its spec. The spec was incomplete. IBM Research's ITBench analysis showed that spec-level interventions were consistently the most architecturally impactful per model class tested.[4]
Open the inter-agent message log at the step where the failure emerged. Check for FM-2.6 (Reasoning-Action Mismatch): does the agent's stated reasoning match the actual tool call or message it sent? Check FM-2.2 (Failure to Ask for Clarification): was the instruction genuinely ambiguous, and did the agent proceed on an assumption without a clarification gate? IBM Research found that GPT-oss-120b traces showed systematic FM-2.6 failures compounding across turns — small reasoning mismatches in early turns poisoned the task context and caused total derailment by session end.[4] The intervention was context hygiene and early error detection: coordination-layer design changes, not prompt language.
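The FM-2.6 comparison can be partly mechanized if the trace records what the agent said it would do next to what it actually did. The sketch below assumes a hypothetical log schema in which each step carries an `intended_tool` field extracted from the agent's stated reasoning; real traces may need a parsing step to produce it, and this catches only tool-name mismatches, not argument-level or semantic ones.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    index: int
    intended_tool: str  # tool the agent said it would call (hypothetical field)
    called_tool: str    # tool actually invoked at this step


def first_mismatch(steps: List[Step]) -> Optional[int]:
    """Return the index of the earliest step whose action differs from its stated intent."""
    for step in steps:
        if step.intended_tool != step.called_tool:
            return step.index
    return None


trace = [
    Step(0, "get_alerts", "get_alerts"),
    Step(1, "describe_pod", "restart_pod"),  # reasoning and action diverge here
    Step(2, "get_alerts", "get_alerts"),
]
```

Reporting the earliest mismatch matters because, as described above, early mismatches contaminate the context for every later turn.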
If the spec check and the coordination check both come up clean (the spec was explicit, coordination looked intact) but the output is wrong, you're in FC3. Verify what the verifier actually checked. IBM Research found FM-3.3 (Incorrect Verification) was the most common fatal failure for Gemini-3-Flash in surgical-mode IT operations: the agent self-terminated when it assessed the task as complete, with no tool-mediated evidence that the alert had actually cleared.[4] Their fix: require hard, tool-grounded verification steps before any run is marked successful, rather than accepting the model's self-assessment. ReliabilityBench confirms the gap: pass@1 metrics overestimate production reliability by 20–40% because they measure single-run success, not output correctness.[5]
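A tool-grounded verification step can be expressed as a rule that the agent's own "done" signal is never sufficient. This is a minimal sketch, not IBM Research's implementation: `run_status` and the evidence callables are hypothetical, and in a real system each callable would wrap a query against the alerting or cluster API.

```python
from typing import Callable, Iterable


def run_status(agent_says_done: bool,
               evidence_checks: Iterable[Callable[[], bool]]) -> str:
    """Decide the run status from external evidence, not self-assessment."""
    checks = list(evidence_checks)
    if not agent_says_done:
        return "in_progress"
    if not checks:
        # Self-assessment alone is never recorded as success.
        return "unverified"
    if all(check() for check in checks):
        return "success"
    return "failed_verification"


def alert_cleared() -> bool:
    # Stub standing in for a real query to the alerting system.
    return False
```

With these stubs, `run_status(True, [alert_cleared])` yields `"failed_verification"` even though the agent reported completion.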
Applying FC3 fixes to FC1 failures doesn't close the structural gap. It gives the same failure a different surface presentation at the next incident.
The most common wrong fix across all three failure categories is the same: reaching for the system prompt.
For FC1 failures, the prompt already contains an imprecise or absent constraint. More natural language doesn't add missing structure. An agent without an evaluatable termination condition cannot infer one from 'make sure to stop when the task is done.' The missing constraint has to be stated in terms the agent can check against its current state.
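One way to make a termination condition checkable is to hold task state outside the prompt and evaluate completion against it. The sketch below assumes a task can be described as a set of named required steps; `TaskState` and its methods are illustrative names, and the same state doubles as a guard against step repetition.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class TaskState:
    required_steps: Set[str]
    completed_steps: Set[str] = field(default_factory=set)

    def should_run(self, step: str) -> bool:
        """State guard: refuse unknown steps and steps that already completed."""
        return step in self.required_steps and step not in self.completed_steps

    def record(self, step: str) -> None:
        self.completed_steps.add(step)

    def is_done(self) -> bool:
        """Termination condition evaluated against current state, not inferred."""
        return self.required_steps <= self.completed_steps
```

The orchestrator, not the model, calls `is_done()`, so "stop when the task is done" becomes a set comparison rather than an instruction the agent has to interpret.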
For FC2 failures, the wrong fix is more aggressive natural language coordination: 'be sure to ask for clarification when uncertain,' 'always verify with other agents before proceeding.' FM-2.2 requires a conditional in the execution flow — a structural gate that evaluates confidence before proceeding — not a sentence in the system prompt. You can add that instruction and FM-2.2 will recur on the same ambiguous input at the same rate.
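A clarification gate is a branch in the execution flow rather than a sentence in the prompt. The sketch below is one possible shape under loud assumptions: the `Instruction` fields, the way confidence is produced, and the 0.8 threshold are all hypothetical and would need to be defined by the platform team.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instruction:
    text: str
    candidate_interpretations: List[str]  # produced upstream (hypothetical)
    confidence: float                     # 0.0 to 1.0; how it is scored is an assumption


def clarification_gate(instruction: Instruction, min_confidence: float = 0.8) -> str:
    """Return 'ask_clarification' or 'proceed' before any action is taken."""
    ambiguous = len(instruction.candidate_interpretations) != 1
    if ambiguous or instruction.confidence < min_confidence:
        return "ask_clarification"
    return "proceed"
```

Because the gate runs before the action, an ambiguous instruction produces a clarification request every time, independent of how the system prompt is worded.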
For FC3 failures, the wrong fix is infrastructure alerting: circuit breakers that trip on span duration, budget ceiling alerts, tool error monitoring. FC3 failures produce clean infrastructure signals while output quality degrades. A Kubernetes automation agent completing a task and returning incorrect configuration is undetectable by span duration or error rate. Automated attribution frameworks like ErrorProbe are beginning to tackle this at scale by operationalizing MAST classification against live traces[8] — but the core fix is simpler: an LLM-as-judge or domain validator sampling real production outputs against task-level success criteria.
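The sampling approach can be sketched in a few lines. This example assumes production outputs are available as dictionaries and that a domain validator exists as a plain function; `sample_and_validate` and `replicas_match` are hypothetical names, and an LLM-as-judge could be substituted for the validator.

```python
import random
from typing import Any, Callable, Dict, Sequence


def sample_and_validate(outputs: Sequence[Dict[str, Any]],
                        validator: Callable[[Dict[str, Any]], bool],
                        sample_size: int,
                        seed: int = 0) -> Dict[str, Any]:
    """Sample outputs and check each against a task-level success criterion."""
    rng = random.Random(seed)
    k = min(sample_size, len(outputs))
    sample = rng.sample(list(outputs), k)
    failures = [output for output in sample if not validator(output)]
    return {"sampled": k, "failed": len(failures), "failures": failures}


def replicas_match(output: Dict[str, Any]) -> bool:
    # Task-level check: the applied value must equal the requested value.
    return output["applied_replicas"] == output["requested_replicas"]
```

A run that applied the wrong replica count finishes with clean spans and no tool errors; it only shows up in the `failures` list produced here.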
Fix ordering matters when a failure looks hybrid. FC1 is cheapest. If a trace shows both a spec gap (FC1) and a coordination failure (FC2), fix the spec first — the coordination failure may be a downstream effect of the ambiguous spec. Fixing FC3 before FC1 or FC2 means your eval layer is running over outputs from a broken spec or broken coordination protocol. You'll catch more failures but continue producing them at the same rate.
Without structural classification:
Agent repeats steps → add 'don't repeat yourself' → same behavior next run
Agents disagree → add coordination directive → FM-2.2 recurs next sprint
Output quality degrades → model upgrade considered → FC3 persists at higher cost
Postmortem: 'model behavior was inconsistent' → no structural change
Next incident: different surface, same structural cause, same wrong fix layer
With MAST triage:
Agent repeats steps → 30-sec spec check → FM-1.3 → state guard added pre-deploy
Agents disagree → 5-min log check → FM-2.2 → clarification gate owned by platform team
Output quality degrades → 20-min eval review → FM-3.2 → multi-level verifier deployed
Postmortem: 'FC2, FM-2.2, clarification gate absent' → platform team owns fix, timeline clear
Next incident: different FC, same triage protocol, structural pattern visible at 30 days
A single MAST-tagged incident tells you what failed. Thirty tagged incidents tell you which architectural layer has the recurring structural gap.
MAST classification per incident is triage. MAST distribution across 30 incidents is diagnosis.
If FC1 failures cluster — different agents, different tasks, same pattern of spec ambiguity or missing termination conditions — the problem is upstream of the agents. The behavioral spec review process doesn't exist or doesn't check the right properties. The fix is a deployment gate: a pre-deploy checklist that blocks agent shipments without explicit, evaluatable termination conditions and role constraint statements.
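A deployment gate of this kind can start as a presence check on the spec. The sketch below assumes the spec is a dictionary with `role_constraints` and `termination_condition` fields, which are hypothetical names; it can only confirm the fields are filled in, so a human reviewer still has to judge whether the termination condition is actually evaluatable.

```python
from typing import Dict, List

# Field names are assumptions about how a team might structure its spec.
REQUIRED_FIELDS = ("role_constraints", "termination_condition")


def deployment_gate(spec: Dict[str, str]) -> List[str]:
    """Return blocking reasons; an empty list means the spec may proceed to review."""
    problems = []
    for name in REQUIRED_FIELDS:
        if not spec.get(name, "").strip():
            problems.append(f"missing {name}")
    return problems
```

Wiring this into the deployment pipeline turns the checklist into a hard block rather than a convention.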
FC2 clusters indicate the coordination layer was designed as an afterthought. Inter-agent message schemas, clarification gates, and shared task state are infrastructure decisions, not prompt additions. More than three FM-2.6 or FM-2.2 failures in 30 postmortems almost always means there's no coordination protocol documentation. The fix is redesigning that layer from the protocol up.
FC3 clusters — multiple incidents that passed all infrastructure checks but produced degraded output — mean the eval coverage is measuring format instead of correctness. These are the most dangerous failures because downstream systems act on wrong outputs before engineering knows there's a problem.
IBM Research's ITBench application converted 'open models struggle on SRE tasks' into three concrete per-model architectural prescriptions.[4] Your production incident history can do the same. Tag 30 postmortems with FC category and failure mode code. Query the distribution. That distribution is the architectural backlog — not a list of individual model failures.
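Querying the distribution needs nothing more than a counter over the tags. In this sketch each postmortem is assumed to be reduced to a `(category, failure_mode)` pair; the function names are illustrative, and the threshold of more than three FM-2.2 or FM-2.6 failures is the rule of thumb stated earlier in this section.

```python
from collections import Counter
from typing import Dict, List, Tuple

Tag = Tuple[str, str]  # (FC category, failure mode code), e.g. ("FC2", "FM-2.2")


def fc_distribution(tags: List[Tag]) -> Dict[str, float]:
    """Share of postmortems per FC category."""
    counts = Counter(category for category, _ in tags)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()} if total else {}


def coordination_gap_suspected(tags: List[Tag], threshold: int = 3) -> bool:
    """True when FM-2.2 and FM-2.6 together exceed the threshold."""
    hits = sum(1 for _, mode in tags if mode in {"FM-2.2", "FM-2.6"})
    return hits > threshold
```

The output of `fc_distribution` is the input to backlog prioritization: the category with the largest share names the layer that needs the architectural fix.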
FC1 owner named: person or team who reviews behavioral specs before any agent deploys
FC2 owner named: platform or orchestration team responsible for coordination protocol design
FC3 owner named: eval or quality team running production output sampling
Postmortem template requires MAST category (FC1/FC2/FC3) before root cause is written
Escalation path from MAST category to fix team is documented and tested on a non-incident day
FC distribution review scheduled: monthly where incident volume allows, quarterly at minimum
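The postmortem requirement in this checklist can be enforced at submission time. This is a sketch with hypothetical field names (`mast_category`, `fix_owner`, `root_cause`); it rejects records that lack a category or owner, or that use the symptom-only phrasing as a root cause.

```python
from typing import Dict, List

VALID_CATEGORIES = {"FC1", "FC2", "FC3"}
SYMPTOM_ONLY_PHRASES = ("model behavior was inconsistent",)


def validate_postmortem(record: Dict[str, str]) -> List[str]:
    """Return validation errors; an empty list means the record can be filed."""
    errors = []
    if record.get("mast_category") not in VALID_CATEGORIES:
        errors.append("mast_category must be FC1, FC2, or FC3")
    if not record.get("fix_owner", "").strip():
        errors.append("fix_owner is required")
    root_cause = record.get("root_cause", "").lower()
    if any(phrase in root_cause for phrase in SYMPTOM_ONLY_PHRASES):
        errors.append("root_cause describes a symptom, not a structural cause")
    return errors
```

A record such as `{"mast_category": "FC2", "fix_owner": "platform", "root_cause": "FM-2.2, clarification gate absent"}` passes with no errors.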
What if the failure fits FC1 and FC2 at the same time?
Co-occurrence is common. An ambiguous termination condition (FC1) combined with an agent that doesn't ask for clarification when stuck (FC2) produces a trace that reads as FC2 from the output. Fix FC1 first — it's cheaper and faster. The spec ambiguity may be causing the coordination failure as a downstream effect. After the FC1 fix ships, run the same trace through an equivalent eval case. If FC2 symptoms persist, the coordination gate is a genuine structural gap. Fix ordering: FC1 first, FC2 second, FC3 third. Fixing FC3 before fixing FC1 or FC2 means your eval layer is running over outputs from broken upstream layers — you'll catch more failures but continue producing them.
Is MAST reliable enough for live incident triage, or is it a post-hoc research instrument?
The three-category classification (FC1/FC2/FC3) is reliable for live triage. The 14 fine-grained modes are better suited to the postmortem. MAST was validated at κ = 0.88 inter-annotator agreement on 1,642 benchmark traces[1], and IBM Research applied it to 310 real IT automation traces in 2026 with per-model prescriptions that shipped as architectural interventions.[4] The MAST LLM annotator achieves κ = 0.77 with human experts on offline traces[3]: not fast enough for bridge calls, but usable for automated postmortem classification. Use the three-question protocol on the live incident. Run the full 14-mode classification in the postmortem to build the FC distribution.
My agents are single-agent, not multi-agent. Does any of this apply?
FC1 applies fully — specification issues affect any agentic system regardless of agent count. FC3 applies fully. FC2 reduces to the two most common modes: FM-2.6 (Reasoning-Action Mismatch) and FM-2.2 (Failure to Ask for Clarification), both detectable from a single agent's reasoning traces. The triage protocol is the same. For single-agent deployments, the 30-second FC1 check is even more diagnostic because spec issues account for a larger share of failures without coordination complexity adding noise.
We already have an incident taxonomy. Why isn't it giving us structural fixes?
Most production incident taxonomies describe symptoms: wrong output, tool error, loop, timeout. MAST categorizes structural root causes with fix timing. If your taxonomy allows 'model behavior was inconsistent' as a root cause, you have a symptom taxonomy. The operational difference is routing: FC1 routes to the spec team, FC2 routes to the platform team, FC3 routes to the eval team. Symptom categories don't carry routing. Teams with symptom taxonomies commonly fix the same structural gap multiple times under different symptom names before anyone recognizes the pattern. The 30-day FC distribution test makes this visible: if the same category keeps recurring, the architectural fix hasn't shipped.
Every incident that exits with 'we adjusted the prompt' — without first running the 30-second FC1 check — is a misdiagnosis on record. The structural gap that caused the incident is still open. The next incident in the same FC category is already scheduled.
MAST doesn't require an ML research background to apply. It requires three questions, in speed order, before anyone changes the prompt. The first question takes 30 seconds. The model is usually not the answer.
MAST (NeurIPS 2025, UC Berkeley) identifies 14 MAS failure modes across 3 structural categories. This playbook maps them to 3 diagnostic questions — and tells you which layer to fix before touching the model.