MAST's 14 agent failure modes cluster into 3 structural categories, each preventable at a different pre-production stage. This playbook maps them to 12 deployment gate questions with pass criteria and named ownership.
Every FC1 failure in the MAST dataset could have been prevented in writing before the agent ran. Specification Issues account for 41.77% of all failures across 1,642 annotated traces from seven open-source multi-agent frameworks, and they don't emerge from runtime surprises.[1] They're embedded in the spec the author wrote before the first token was generated. The agent violated its task constraints because the constraints were absent. It repeated completed steps because the execution flow had no state-tracking guard. It ran past task completion because "return when done" is not a terminal condition an agent can evaluate against anything.
Researchers at UC Berkeley published MAST at NeurIPS 2025 after analyzing traces across MetaGPT, ChatDev, HyperAgent, OpenManus, AppWorld, Magentic, and AG2. Failure rates ranged from 41% to 86.7% depending on the system.[1] That range isn't primarily explained by model choice. It's explained by how completely each system's specs, coordination protocols, and verification layers were designed before execution. The MAST paper's targeted interventions on ChatDev improved performance by +15.6% without touching the model — only the spec.[1]
The MAST triage playbook answers: "My agent failed — which of the 14 failure modes is this, and which layer do I fix?" This playbook answers a different question: what must be true before the agent ships so that those failure modes cannot occur. The answer is three sequential pre-production review gates — one per MAST category — with 12 questions, explicit pass criteria, and assigned ownership. No postmortem required.
FC1 (41.77% of failures): 5 specification failure modes — all preventable before deploy with a structured spec review gate
FC2 (36.94%): 6 inter-agent coordination modes — all requiring architecture design review, not prompt revision
FC3 (21.30%): 3 verification failure modes — invisible to infrastructure without an eval layer built before shipping
12 pre-production gate questions (4 per MAST category) with pass criteria and named ownership
A sequential gate diagram showing which review unblocks the next
Guidance on which gates can be automated in CI and which require human architecture review
All five FC1 modes are preventable before any agent runs. Most teams discover them in a postmortem.[1]
On ChatDev (+15.6%), the model didn't change. Only the spec changed. This is the ceiling improvement available from FC1 prevention.[1]
Agents with explicit behavioral contracts detect violations uncontracted baselines miss entirely.[5]
Single-run staging tests don't catch FC3 failures. The gap is closed by production sampling with correctness verification.[6]
Five failure modes, all caused by missing design decisions made before execution begins. The fix belongs in the spec review, not the postmortem.
FC1's five failure modes — Disobey Task Specifications (FM-1.1), Disobey Role Specifications (FM-1.2), Step Repetition (FM-1.3), Conversation Loss (FM-1.4), and Agents Unaware of Termination Conditions (FM-1.5) — all share an origin: they're consistent with the specification the agent was given.[1] The model isn't malfunctioning. The design is underspecified.
The test for FC1 is cheap: pull the system prompt and role definition next to the failed trace. If you can point to a line that's missing — a termination condition, a role constraint, a state check — that's FC1. If you find yourself writing "I assumed the model would know not to do that," you found the gap.
FM-1.5 (Termination Conditions) is the most consequential because it compounds. Without a verifiable terminal state, the agent continues until a hard limit fires — token budget, API timeout, or circuit breaker. "Return when the task is complete" is not a termination condition. "Return when the diff is clean, all tests pass, and the output matches the expected schema" is. Every agent needs a terminal condition it can evaluate against current state.
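As a minimal sketch of what a verifiable terminal condition might look like, the check below turns the "diff is clean, all tests pass, output matches the schema" example into a function the agent loop can evaluate against current state. The `TaskState` fields and `REQUIRED_OUTPUT_KEYS` are hypothetical names chosen for illustration, not part of MAST or any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskState:
    """Snapshot of the state the agent can inspect (hypothetical fields)."""
    diff_is_clean: bool
    tests_passed: bool
    output: dict


# Assumed output schema for this example: the keys the final output must contain.
REQUIRED_OUTPUT_KEYS = {"summary", "files_changed"}


def is_terminal(state: TaskState) -> bool:
    """Return True only when every verifiable completion check holds."""
    schema_ok = REQUIRED_OUTPUT_KEYS.issubset(state.output)
    return state.diff_is_clean and state.tests_passed and schema_ok


done = TaskState(True, True, {"summary": "ok", "files_changed": []})
not_done = TaskState(True, False, {"summary": "ok", "files_changed": []})
```

Here `is_terminal(done)` holds and `is_terminal(not_done)` does not, because one of the three checks fails. The point of the sketch is that each clause can be falsified by inspecting state, which "task complete" cannot.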
FM-1.3 (Step Repetition) shows up most in long-horizon tasks. AppWorld had the highest concentration in MAST-Data.[1] The prevention is a state-tracking guard at step start: before the agent executes a step, it checks whether the target state already exists. Without this guard, looping is the default behavior whenever the task is partially complete.
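The guard described above can be sketched as a thin wrapper around each step. This is an illustrative pattern under simple assumptions: `run_step`, `target_state_holds`, and the in-memory `files` set are hypothetical stand-ins for whatever state your agent actually reads.

```python
from typing import Callable


def run_step(
    name: str,
    target_state_holds: Callable[[], bool],
    action: Callable[[], None],
) -> str:
    """Run `action` only if the step's target state is not already in place."""
    if target_state_holds():
        return f"{name}: skipped"
    action()
    return f"{name}: executed"


# Toy environment: a set of file names standing in for real workspace state.
files: set[str] = set()


def create_config() -> None:
    files.add("config.yaml")


first = run_step("create_config", lambda: "config.yaml" in files, create_config)
second = run_step("create_config", lambda: "config.yaml" in files, create_config)
```

The first call executes the step; the second sees that the target state already holds and skips it, which is the behavior that prevents a partially completed task from looping.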
The Agent Behavioral Contracts (ABC) framework formalizes what MAST implies here: contracted agents operating with explicit preconditions, invariants, and recovery rules detect 5.2–6.8 soft violations per session that uncontracted baselines miss entirely.[5] The gap between an underspecified system prompt and a behavioral contract is exactly where FC1 failures live — and it's entirely preventable before deploy.
Six coordination modes that require structural architecture changes before deploy. Stronger system prompts don't create clarification gates. They create more articulate agents that still don't ask.
FC2's most common mode is Reasoning-Action Mismatch (FM-2.6, 13.98% of all MAST failures[1]). The agent states an intention and executes something different. This isn't a prompt problem — it's a monitoring gap. You can't catch FM-2.6 without logging reasoning-action pairs at each step, and you can't build that logging after the agent ships without changing the execution flow. The logging decision belongs in the pre-production design review.
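One way to picture reasoning-action logging is a per-step record that stores what the agent said it would do next to what it actually did. The sketch below assumes the stated intent has already been reduced to a tool name, which is a simplification; `StepRecord` and `ReasoningActionLog` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    step: int
    stated_tool: str   # tool the agent said it would call
    actual_tool: str   # tool that was actually invoked

    @property
    def mismatch(self) -> bool:
        return self.stated_tool != self.actual_tool


@dataclass
class ReasoningActionLog:
    records: list[StepRecord] = field(default_factory=list)

    def log(self, stated_tool: str, actual_tool: str) -> StepRecord:
        """Append one reasoning-action pair, numbered from 1."""
        record = StepRecord(len(self.records) + 1, stated_tool, actual_tool)
        self.records.append(record)
        return record

    def mismatches(self) -> list[StepRecord]:
        """Return every step where stated intent and action diverged."""
        return [r for r in self.records if r.mismatch]


log = ReasoningActionLog()
log.log("search_docs", "search_docs")
log.log("run_tests", "delete_branch")
```

In this toy run, `log.mismatches()` returns only the second step. Because the log call sits inside the execution loop, adding it later means changing that loop, which is why the decision belongs in the design review.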
Failure to Ask for Clarification (FM-2.2, 11.65%[1]) is structural, not behavioral. An agent that receives an ambiguous instruction will proceed on an assumption — that's what language models do. The prevention requires a designed clarification gate: a conditional in the execution flow that evaluates confidence before proceeding. Adding this after production is an architecture change. Doing it before deploy is a design decision.
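A clarification gate can be as small as one conditional, provided it exists in the execution flow. The following is a sketch under stated assumptions: the confidence score is supplied by some upstream estimator that is not shown, and `CONFIDENCE_THRESHOLD` and the `human_operator` escalation target are illustrative values rather than recommendations.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.8  # assumed value; set per agent in the design review


@dataclass(frozen=True)
class GateDecision:
    proceed: bool
    escalate_to: Optional[str]


def clarification_gate(
    confidence: float, escalate_to: str = "human_operator"
) -> GateDecision:
    """Block execution and name an escalation path when confidence is too low."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return GateDecision(proceed=True, escalate_to=None)
    return GateDecision(proceed=False, escalate_to=escalate_to)


clear_input = clarification_gate(0.93)
ambiguous_input = clarification_gate(0.41)
```

`clear_input.proceed` is true, while `ambiguous_input` blocks and carries the escalation target. The threshold and the escalation path are the two artifacts the architecture review needs to see written down.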
The MAST paper's FC2 finding is direct: communication protocol improvements — cleaner message formats, structured handoffs, turn-taking confirmations — are insufficient for coordination failures.[1] You can enforce message schemas across every agent boundary and still ship a system where no gate stops an agent from proceeding on a wrong assumption. The gate is an architectural decision, not a protocol tweak.
Task Derailment (FM-2.3, 7.15%[1]) is the FC2 mode most likely to produce plausible-looking output. The agent is productive — completing steps, returning results, interacting with other agents — but on a drifted objective. Prevention requires shared task state visible to all agents: a single authoritative task definition that any agent can check against its instructions before continuing. This is a design artifact. If it doesn't exist at review time, it won't exist at runtime either.
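To make "single authoritative task definition" concrete, the sketch below keeps one versioned definition and lets any agent check whether the view it holds is still current. It only detects stale views, not semantic drift in what an agent is doing, and all class and method names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDefinition:
    objective: str
    version: int

    def fingerprint(self) -> str:
        """Stable hash an agent can store and compare later."""
        payload = f"{self.version}:{self.objective}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


class SharedTaskState:
    """Single authoritative holder of the current task definition."""

    def __init__(self, task: TaskDefinition) -> None:
        self._task = task

    def current(self) -> TaskDefinition:
        return self._task

    def is_current(self, fingerprint: str) -> bool:
        return fingerprint == self._task.fingerprint()

    def update(self, objective: str) -> None:
        self._task = TaskDefinition(objective, self._task.version + 1)


state = SharedTaskState(TaskDefinition("Summarize Q3 churn drivers", 1))
agent_view = state.current().fingerprint()
fresh_before = state.is_current(agent_view)
state.update("Summarize Q3 and Q4 churn drivers")
fresh_after = state.is_current(agent_view)
```

After the update, the agent's stored fingerprint no longer matches, so a check before its next step would tell it to re-read the task definition instead of continuing on the old objective.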
A practical signal for FC2 readiness: when you review the architecture diagram, can you point to where the clarification gate sits in the execution flow? Can you show where reasoning-action pairs are logged? If both answers are "we'll add that later," the FC2 gate fails.
Premature termination, incomplete verification, incorrect verification — all three produce clean infrastructure signals. You can't retrofit eval coverage after output quality has already degraded.
FC3's three modes — Premature Termination (FM-3.1, 7.82%), No or Incomplete Verification (FM-3.2, 6.82%), and Incorrect Verification (FM-3.3, 6.66%[1]) — share one signature: the workflow completes, no alert fires, infrastructure is clean. The problem is in the output.
This makes FC3 the most expensive category to catch in production because it looks like no incident. Output quality degrades. Downstream systems process the output. The problem surfaces in a sprint retro or a customer complaint three weeks later.
The ReliabilityBench study found that pass@1 metrics overestimate production reliability by 20–40%.[6] That gap is the FC3 failure surface: agents that pass staging tests and fail at correctness in production, because staging verified format while production exposed the correctness failures. The root cause is the same across modes: verification checked the wrong thing, or ran only at the terminal output instead of at intermediate steps.
FM-3.3 (Incorrect Verification) is the trap in evaluation design: you built an eval, but it tests the wrong property. A code agent verified by a linter that checks syntax but not execution. A research agent verified by citation count rather than claim accuracy. An FC3 eval that checks format is FM-3.3 applied to your own quality process.
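The linter example can be reproduced in a few lines. In this sketch, a syntax-only check built on Python's `ast.parse` accepts a function that is plainly wrong, while a behavior check that executes it does not. The `candidate` source and the expected values are made up for illustration, and `exec` is only appropriate here because the input is trusted.

```python
import ast

# Agent output: syntactically valid, behaviorally wrong (subtracts instead of adds).
candidate = "def add(a, b):\n    return a - b\n"


def passes_syntax_check(source: str) -> bool:
    """Format-level check: does the code parse?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def passes_behavior_check(source: str) -> bool:
    """Correctness-level check: does add() return the expected results?"""
    namespace: dict = {}
    exec(source, namespace)  # only safe for trusted or sandboxed code
    add = namespace["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0


syntax_ok = passes_syntax_check(candidate)
behavior_ok = passes_behavior_check(candidate)
```

`syntax_ok` is true and `behavior_ok` is false. An eval layer that stops at the first check reports a pass on exactly the output it was meant to catch.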
The design question for FC3 pre-production review: what does "correct output" mean for this specific agent, and can you evaluate that before shipping? An LLM-as-judge sampling 10–20% of production outputs provides FC3 coverage without reviewing everything — but only if the judge tests task-level correctness. Defining the correctness criterion is the work that must happen before the eval layer ships alongside the agent. Teams that defer it ship an eval layer that catches the wrong failures.[8]
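Sampling a fixed share of production outputs for the judge can be done deterministically by hashing an output identifier, so the same output is always either in or out of the sample. This is one possible approach, not something the cited sources prescribe; `SAMPLE_RATE`, `should_evaluate`, and the `run-N` identifiers are assumptions for the example.

```python
import hashlib

SAMPLE_RATE = 0.15  # assumed value inside the 10-20% range discussed above


def should_evaluate(output_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Map the ID to a stable bucket in [0, 1) and sample if it falls below `rate`."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


ids = [f"run-{i}" for i in range(10_000)]
sampled = sum(should_evaluate(output_id) for output_id in ids)
sampled_fraction = sampled / len(ids)
```

Over many outputs the sampled fraction lands close to the configured rate. The sampler only decides which outputs the judge sees; whether the judge tests task-level correctness is still the criterion that has to be defined first.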
Four questions per MAST category, each with a pass condition and a named owner. The gate is sequential: FC1 must pass before FC2 is reviewed, FC2 before FC3.
| # | MAST Category | Pre-Production Question | Pass Condition | Review Owner |
|---|---|---|---|---|
| 1 | FC1 — Spec | Can you write the agent's terminal state as a verifiable assertion — not 'task complete' but a falsifiable condition? | Yes — documented and testable | Agent author |
| 2 | FC1 — Spec | Is each role boundary expressed as an explicit prohibition — what the agent is not authorized to do? | Yes — explicit 'not authorized' list per agent | Agent author |
| 3 | FC1 — Spec | Does a step-repetition guard exist — does the agent check whether the target state already holds before starting each step? | Yes — state check in the execution flow | Agent author |
| 4 | FC1 — Spec | For long-horizon tasks: is context window degradation treated as a design constraint, not an edge case? | Yes — windowing or summarization strategy documented in spec | Tech lead |
| 5 | FC2 — Coordination | Is there a clarification gate — a decision point that blocks execution when input confidence falls below a defined threshold? | Yes — threshold and escalation path in architecture design | Architect |
| 6 | FC2 — Coordination | Are reasoning-action pairs logged at each agent step — can you compare stated intent to actual tool call? | Yes — logging deployed before staging tests begin | Platform team |
| 7 | FC2 — Coordination | Is shared task state visible to all agents — no agent operates on a partial or stale view of the current task definition? | Yes — single authoritative state source in design spec | Architect |
| 8 | FC2 — Coordination | Does system-wide monitoring cover inter-agent handoffs, not only per-agent spans? | Yes — cross-agent trace coverage confirmed | Platform team |
| 9 | FC3 — Verification | Does verification run at intermediate steps, not only at the terminal pipeline output? | Yes — multi-level verification in design spec | QA/Eval lead |
| 10 | FC3 — Verification | Does the verifier check output correctness — not just format or schema compliance? | Yes — correctness criterion documented for this agent type | QA/Eval lead |
| 11 | FC3 — Verification | Is an eval layer deployed that samples production output before the agent ships? | Yes — LLM-as-judge or domain validator in place | Platform team |
| 12 | FC3 — Verification | Has the eval layer been validated against human reviewers for this agent's specific task type? | Yes — ≥85% agreement on held-out examples before relying on it as a gate | QA/Eval lead |
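Question 12's pass condition can be checked with a short calculation. The sketch below assumes "agreement" means raw percent agreement between judge and human pass/fail labels on a held-out set; the labels are invented for illustration, and teams may prefer a chance-corrected statistic instead.

```python
AGREEMENT_THRESHOLD = 0.85  # pass condition from question 12


def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of held-out examples where judge and human labels match."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(judge)


# Hypothetical held-out labels: True means "output judged correct".
judge_labels = [True, True, False, True, False, True, True, False, True, True]
human_labels = [True, True, False, True, True, True, True, False, True, True]

rate = agreement_rate(judge_labels, human_labels)
gate_passes = rate >= AGREEMENT_THRESHOLD
```

With one disagreement in ten examples the rate is 0.9, so the gate passes. A real validation set would need to be much larger than ten examples for the number to mean anything.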
Prevention doesn't survive shared responsibility. Each category needs a named owner with a clear pass/fail call — not a process that everyone participates in and nobody owns.
FC1 review belongs to the agent author and tech lead. It's the cheapest review to run — pulling the spec and working through the four FC1 gate questions takes under an hour. The failure mode for FC1 ownership is diffuse accountability: everyone knows the spec exists, nobody is responsible for verifying it's complete. Make the FC1 review an explicit deployment ticket item with a named approver, not an implicit assumption about what passed staging.
FC2 review belongs to the architect who designed the coordination protocol. If no architect owns the coordination design, FC2 review ownership is undefined — which means FC2 failures are shipping by default. This is a structural assignment the engineering lead makes before the agent is built, not after it's built. FC2 review is not asking whether the agent "communicates correctly with other agents." It's asking: does a clarification gate exist in the design spec? Are reasoning-action logs part of the implementation plan? Is shared task state a named artifact? These are architecture decisions, reviewed against the design doc, before implementation is considered complete.
FC3 review belongs to the platform or eval team with LLM evaluation tooling. FC3 review is the hardest to complete before shipping because it requires the eval layer to already exist. Teams that don't have an eval layer can't do FC3 review — and that itself is a deployment blocker. If the eval layer doesn't exist when the agent is ready to ship, the sequencing problem is upstream of the agent: the agent is blocked on platform infrastructure, not development. That's the right resolution. Shipping without FC3 coverage is a named risk decision, not an unnoticed gap.
The most common failure of this ownership model is treating FC3 as optional for "low-risk" agents. The MAST paper doesn't distinguish risk tiers — all three verification modes occur across agent types regardless of task complexity.[1] The team decides the blast radius, not the taxonomy.
Agent passes staging test cases → approved for production
Coordination looked fine in isolated tests → ship it
Output schema validates → verification complete
Production failure → postmortem → 'the spec was ambiguous'
Fix applied → redeploy → discover the next structural gap
FC1 gate passes (terminal state, role constraints, step guard) → move to FC2 review
FC2 gate passes (clarification gate, reasoning logs, shared state) → move to FC3 review
FC3 gate passes (multi-level verification, eval deployed, judge validated) → approved for production
Production failure → MAST tag → structural category identified in 5 minutes, FC and FM code recorded
Postmortem closes with the gate question that should have caught the failure — loop closes
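The sequential flow above can be sketched as a short evaluator that walks the gates in order and stops at the first failure. `GateResult` and `review` are hypothetical names, and the answers are sample data; question numbers and owners follow the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    category: str             # "FC1", "FC2", or "FC3"
    owner: str
    answers: dict[int, bool]  # gate question number -> passed?

    @property
    def passed(self) -> bool:
        return all(self.answers.values())


def review(gates: list[GateResult]) -> str:
    """Evaluate gates in order and stop at the first one that fails."""
    for gate in gates:
        if not gate.passed:
            failed = sorted(q for q, ok in gate.answers.items() if not ok)
            return (
                f"blocked at {gate.category} "
                f"(owner: {gate.owner}, questions: {failed})"
            )
    return "approved for production"


gates = [
    GateResult("FC1", "agent author", {1: True, 2: True, 3: True, 4: True}),
    GateResult("FC2", "architect", {5: False, 6: True, 7: True, 8: True}),
    GateResult("FC3", "QA/eval lead", {9: False, 10: True, 11: True, 12: True}),
]
result = review(gates)
```

The review stops at FC2 and names the owner and the failing question, even though FC3 would also fail. That ordering is what "FC1 must pass before FC2 is reviewed" means in practice.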
Does the MAST pre-production gate apply to single-agent systems, not just multi-agent?
FC1 applies fully: specification issues are not specific to multi-agent orchestration, and FM-1.1 through FM-1.5 all occur in single-agent systems with task specs. FC2 applies partially: FM-2.6 (Reasoning-Action Mismatch) and FM-2.2 (Failure to Ask for Clarification) both translate directly to single-agent systems, even without an explicit coordination protocol. FM-2.6 shows up whenever an agent's stated reasoning diverges from its tool call, which is system-agnostic. FC3 applies fully. For single-agent deployments, run the full FC1 gate, the two FC2 questions that cover those modes (5 and 6 in the table), and the full FC3 gate.
What if our team doesn't have an eval layer yet — do we block the deploy?
Yes, by design. The FC3 gate fails when no eval layer exists — and that's the intended behavior. The pre-production gate surfaces an infrastructure gap as a deployment blocker before the agent ships, not after the first production failure. The decision to ship without FC3 coverage is now an explicit risk call logged in the deployment ticket, not an unnoticed gap. For agents with bounded blast radius and reversible outputs, some teams make that call deliberately. For agents touching financial data, customer-facing output, or downstream automation chains, skipping FC3 belongs in the risk register.
Which MAST gate questions can be automated in CI, and which require human review?
FC1 questions 1–4 (terminal state, role constraints, step guard, context strategy) can be partially automated: a spec review prompt running against the system prompt and role definitions can flag missing termination conditions and absent prohibition lists. FC3 questions 9–12 are directly CI-integrable: the agent runs against a test suite, the eval pipeline scores outputs, and CI fails on quality threshold breach. FC2 questions 5–8 require human architecture review — no automated tool can verify that a clarification gate exists in the design before seeing the agent fail to ask at runtime. Automate FC1 and FC3 checks; keep FC2 as a human architecture sign-off.
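For the FC1 side of that automation, here is a deliberately crude sketch of a CI check. It uses keyword patterns as a stand-in for the spec review prompt described above, so it will miss paraphrases and flag false positives; the patterns, messages, and sample specs are all assumptions made for this example.

```python
import re

TERMINATION = re.compile(r"\b(return|stop|terminate)\s+when\b", re.IGNORECASE)
VAGUE_TERMINATION = re.compile(
    r"\bwhen\s+(done|finished|the task is complete)\b", re.IGNORECASE
)
PROHIBITION = re.compile(r"\b(not authorized|must not|never)\b", re.IGNORECASE)


def lint_spec(spec: str) -> list[str]:
    """Return findings for gate questions 1 and 2; an empty list means no flags."""
    findings = []
    if not TERMINATION.search(spec):
        findings.append("no termination condition found")
    elif VAGUE_TERMINATION.search(spec):
        findings.append("termination condition is not verifiable")
    if not PROHIBITION.search(spec):
        findings.append("no explicit prohibition list found")
    return findings


weak_spec = "You are a coding agent. Return when done."
strong_spec = (
    "Return when all tests pass and the diff is clean. "
    "You are not authorized to push to main."
)
weak_findings = lint_spec(weak_spec)
strong_findings = lint_spec(strong_spec)
```

The weak spec yields two findings and the strong one yields none. A CI job could fail the build on any non-empty result, with a human reviewer still deciding whether the flagged spec is actually incomplete.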
How does this pre-production gate close the loop with the MAST postmortem process?
The loop closes through MAST tagging in postmortems. When an agent failure is tagged with an FC category and FM code in the postmortem (e.g., FC1/FM-1.5 — Agents Unaware of Termination Conditions), that tag maps directly to the pre-production gate question that should have blocked the deploy: Question 1 in the FC1 gate. The postmortem answers 'what failed.' The pre-production gate answers 'which review was skipped or gave a false pass.' Thirty days of MAST-tagged postmortems with gate attribution tells you whether your recurring failures trace to FC1 spec reviews that don't cover termination conditions, FC2 architecture reviews that skip clarification gate design, or FC3 eval coverage that tests format instead of correctness. That's not a quality metric — it's a process audit.
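Gate attribution over a batch of tagged postmortems can be sketched as a lookup plus a tally. Only the FM-1.5 to Question 1 link is stated above; the other entries in `FM_TO_GATE_QUESTION` are illustrative guesses based on the table's wording, and the tag list is sample data.

```python
from collections import Counter

# FM code -> gate question that should have blocked the deploy.
# Only FM-1.5 -> 1 comes from the text; the rest are illustrative assumptions.
FM_TO_GATE_QUESTION = {
    "FM-1.5": 1,   # termination conditions -> verifiable terminal state
    "FM-1.2": 2,   # role specifications -> explicit prohibitions
    "FM-1.3": 3,   # step repetition -> step-repetition guard
    "FM-2.2": 5,   # failure to ask -> clarification gate
    "FM-2.6": 6,   # reasoning-action mismatch -> reasoning-action logging
    "FM-2.3": 7,   # task derailment -> shared task state
    "FM-3.3": 10,  # incorrect verification -> correctness criterion
}


def gate_attribution(postmortem_tags: list[str]) -> Counter:
    """Count how often each gate question is implicated; skip unmapped tags."""
    return Counter(
        FM_TO_GATE_QUESTION[tag]
        for tag in postmortem_tags
        if tag in FM_TO_GATE_QUESTION
    )


tags = ["FM-1.5", "FM-2.2", "FM-1.5", "FM-3.3", "FM-1.5"]
attribution = gate_attribution(tags)
worst_question, count = attribution.most_common(1)[0]
```

In this sample, Question 1 is implicated three times out of five, which points at FC1 reviews that are not covering termination conditions rather than at any single incident.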
The MAST paper showed that targeted FC1 spec fixes improved performance by +15.6%, then immediately noted that simple fixes are insufficient — more fundamental system design changes are required.[1] That's not a pessimistic conclusion. It's a precise one: the failure modes are structural, the fixes are structural, and the right time to make them is before the system runs.
Thirty deployed agents without FC1 gates, FC2 coordination design, or FC3 eval coverage are thirty scheduled postmortems. The failures are already written in the specs. The production incidents just haven't run yet.
The three gates force three decisions that every agent deployment requires anyway — made before shipping, with named owners, against explicit pass criteria, instead of discovered under pressure after a failure with an ambiguous root cause and no clear fix layer.