MAST (NeurIPS 2025, UC Berkeley) identifies 14 MAS failure modes across 3 structural categories. This playbook maps them to 3 diagnostic questions — and tells you which layer to fix before touching the model.
The MAST dataset shows failure rates from 41% to 86.7% across seven leading open-source multi-agent systems.[1] That's not the alarming part. The alarming part is what engineering teams do after the failure fires: they tune the prompt when the spec was wrong, add retries when the coordination protocol had no clarification gate, or upgrade the model when the verification layer was missing entirely. Three structurally distinct problems. Three different intervention timings. Each invisible from the vantage point of the other two.
MAST — the Multi-Agent System Failure Taxonomy, published at NeurIPS 2025 by researchers at UC Berkeley — is the first empirically grounded classification of why multi-agent systems fail.[1] It identifies 14 fine-grained failure modes validated with inter-annotator agreement of κ = 0.88 across six expert annotators.[1] The taxonomy is organized into three overarching categories that are operationally meaningful: Specification Issues (FC1, 41.77% of failures), Inter-Agent Misalignment (FC2, 36.94%), and Task Verification (FC3, 21.30%).[1]
Separate validation from Patronus AI's TRAIL benchmark makes the diagnostic challenge sharper: even the best long-context LLMs (Gemini-2.5-pro) score only 11% when asked to debug agent execution traces.[7] Your model cannot reliably diagnose its own failures. You need a structural framework, and that is what this playbook provides.
MAST's 14 failure modes mapped to 3 root-cause categories — FC1 (spec), FC2 (coordination), FC3 (verification) — each requiring a different fix at a different stage
Three diagnostic questions that classify any failed trace in under five minutes, with sub-checks for each FC category
FC1 failures are preventable before deploy: a behavioral spec review and explicit termination conditions can prevent many of the failures in the 41.77% category
FC2 failures (36.94%) demand coordination design, not just protocol hardening — adding retries or stronger system prompts doesn't close the gap
FC3 failures (21.30%) are invisible to infrastructure: spans succeed, retries pass, output quality degrades silently — only an eval layer catches this category
A runnable Python classifier, pre-production audit checklist, and a decision tree diagram for operationalizing MAST in your incident process
Measured in MAST-Data: 1,642 annotated traces from MetaGPT, ChatDev, HyperAgent, OpenManus, AppWorld, Magentic, and AG2 — NeurIPS 2025[1]
FC1 (Specification Issues, 41.77%) is the largest single category. Most FC1 failures are preventable before any agent runs[1]
Reasoning-action mismatch (FM-2.6, 13.98%): the agent states one intention and takes a different action. Detectable in inter-agent trace logs but not from task output alone[1]
TRAIL benchmark: 148 annotated traces, 841 errors, and a best-model score of 11% on trace debugging. Your model cannot reliably self-diagnose, so you need structured triage[7]
MAST's categories aren't a reading list — they map to three different intervention points. The category determines when you fix it, not just what you fix.
The MAST paper's most important finding is not the 14 failure modes. It's the three-category structure and what each category implies about where the fault lives.
FC1 — Specification Issues are failures that originate before execution. The agent misunderstood the task, violated its role definition, repeated steps already completed, lost conversational context, or didn't know when to stop. Every FM-1.x mode traces back to a design decision made before the first token was generated. This makes them largely preventable — not through better models, but through tighter specs.
FC2 — Inter-Agent Misalignment emerges during execution as agents coordinate, hand off state, and negotiate task progress. Agents proceed on wrong assumptions, reasoning doesn't match action, crucial information gets withheld, task objectives drift mid-run. The MAST paper makes a pointed observation here: solutions focused on communication protocols are often insufficient for FC2 failures — they demand deeper coordination design.[1] You can't add retries to fix a clarification failure. You need a gate that forces the agent to ask before proceeding.
FC3 — Task Verification is the most expensive category to miss. Three modes — premature termination, incomplete verification, and incorrect verification — all produce the same external signature: the workflow completes, no alert fires, output looks plausible. The MAST paper shows that relying on final-stage verification alone is structurally inadequate.[1] Multi-level verification is required, not a single check at pipeline end.
The operational consequence of this structure is strict: FC1 failures are preventable pre-deploy. FC2 failures are detectable at runtime. FC3 failures are only catchable with an active eval layer. Each category requires different tooling, different ownership, and a different cadence.
| Code | Failure Mode | Category | Fix Timing | Frequency in MAST-Data |
|---|---|---|---|---|
| FM-1.1 | Disobey Task Specifications | FC1 — Spec Issues | Pre-deploy | Part of FC1's 41.77%[1] |
| FM-1.2 | Disobey Role Specifications | FC1 — Spec Issues | Pre-deploy | Part of FC1's 41.77%[1] |
| FM-1.3 | Step Repetition | FC1 — Spec Issues | Pre-deploy | AppWorld most affected[1] |
| FM-1.4 | Conversation Loss | FC1 — Spec Issues | Pre-deploy | Part of FC1's 41.77%[1] |
| FM-1.5 | Agents Unaware of Termination Conditions | FC1 — Spec Issues | Pre-deploy | Part of FC1's 41.77%[1] |
| FM-2.1 | Conversation Reset | FC2 — Coordination | Runtime | 2.33%[1] |
| FM-2.2 | Failure to Ask for Clarification | FC2 — Coordination | Runtime | 11.65%[1] |
| FM-2.3 | Task Derailment | FC2 — Coordination | Runtime | 7.15%[1] |
| FM-2.4 | Information Withholding | FC2 — Coordination | Runtime | 1.66%[1] |
| FM-2.5 | Ignored Other Agent's Input | FC2 — Coordination | Runtime | 0.17%[1] |
| FM-2.6 | Reasoning-Action Mismatch | FC2 — Coordination | Runtime | 13.98%[1] |
| FM-3.1 | Premature Termination | FC3 — Verification | Eval layer | 7.82%[1] |
| FM-3.2 | No or Incomplete Verification | FC3 — Verification | Eval layer | 6.82%[1] |
| FM-3.3 | Incorrect Verification | FC3 — Verification | Eval layer | 6.66%[1] |
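The table above can be carried into tooling as a plain lookup. The following is a minimal sketch, not part of the MAST release: the names `MAST_MODES`, `FIX_TIMING`, `category_of`, `fix_timing`, and `category_share` are hypothetical, and FC1 per-mode frequencies are left as `None` because the table does not list them.

```python
# Lookup sketch built from the table above. FC1 per-mode frequencies are not
# listed there, so they are stored as None rather than guessed.
MAST_MODES = {
    "FM-1.1": ("Disobey Task Specifications", None),
    "FM-1.2": ("Disobey Role Specifications", None),
    "FM-1.3": ("Step Repetition", None),
    "FM-1.4": ("Conversation Loss", None),
    "FM-1.5": ("Agents Unaware of Termination Conditions", None),
    "FM-2.1": ("Conversation Reset", 2.33),
    "FM-2.2": ("Failure to Ask for Clarification", 11.65),
    "FM-2.3": ("Task Derailment", 7.15),
    "FM-2.4": ("Information Withholding", 1.66),
    "FM-2.5": ("Ignored Other Agent's Input", 0.17),
    "FM-2.6": ("Reasoning-Action Mismatch", 13.98),
    "FM-3.1": ("Premature Termination", 7.82),
    "FM-3.2": ("No or Incomplete Verification", 6.82),
    "FM-3.3": ("Incorrect Verification", 6.66),
}
FIX_TIMING = {"FC1": "Pre-deploy", "FC2": "Runtime", "FC3": "Eval layer"}


def category_of(code):
    """Map a mode code such as 'FM-2.6' to its category 'FC2'."""
    return "FC" + code.split("-")[1].split(".")[0]


def fix_timing(code):
    """Return the fix timing for a mode code, per the table's Fix Timing column."""
    return FIX_TIMING[category_of(code)]


def category_share(category):
    """Sum the listed per-mode frequencies; None if any frequency is missing."""
    shares = [freq for code, (_, freq) in MAST_MODES.items()
              if category_of(code) == category]
    if any(share is None for share in shares):
        return None
    return round(sum(shares), 2)


print(fix_timing("FM-2.6"), category_share("FC2"), category_share("FC3"))
```

The FC2 and FC3 sums reproduce the category totals quoted earlier (36.94% and 21.30%), which is a quick consistency check on the per-mode figures.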
Five failure modes, all caused by design decisions you made before the first agent turn. The fix isn't in the prompt — it's in the specification you wrote before the prompt.
FC1 failures feel like model failures because they manifest as incorrect agent behavior. The agent ignores task requirements (FM-1.1), steps outside its role (FM-1.2), repeats work it already completed (FM-1.3), loses prior context (FM-1.4), or runs indefinitely without finding a stopping condition (FM-1.5). The model is working correctly. The spec was incomplete.
The diagnostic signal for FC1: when you look at the failed trace and can identify exactly what constraint was missing from the system prompt or role definition — that's FC1. The agent didn't hallucinate the behavior; the behavior was consistent with an underspecified instruction.
Step repetition (FM-1.3) is the most visible FC1 mode in production because it generates volume — a looping agent produces trace data that's hard to miss. The AppWorld benchmark had the highest concentration of this mode in MAST-Data.[1] The fix is an explicit state-tracking mechanism: the agent should check at each step whether the target state has already been achieved. This is a spec addition, not a prompt tone adjustment.
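One way to express the state-tracking check is to give every step a predicate that says whether its target state already holds. This is a minimal sketch under that assumption; `Step`, `achieved`, `apply`, and `run_plan` are hypothetical names, and the state is modeled as a plain dict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    achieved: Callable[[dict], bool]  # True if the step's target state already holds
    apply: Callable[[dict], dict]     # performs the step and returns the new state


def run_plan(state, steps):
    """Run steps in order, skipping any whose target state is already achieved."""
    log = []
    for step in steps:
        if step.achieved(state):
            log.append((step.name, "skipped"))
            continue
        state = step.apply(state)
        log.append((step.name, "executed"))
    return state, log


create_report = Step(
    name="create_report",
    achieved=lambda s: "report" in s,
    apply=lambda s: {**s, "report": "draft"},
)

# The same step is requested twice; the second request is skipped.
final_state, log = run_plan({}, [create_report, create_report])
print(log)  # [('create_report', 'executed'), ('create_report', 'skipped')]
```

The design choice is that the check lives in the step definition, not in the prompt, so a repeated request becomes a visible "skipped" entry in the log.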
Agents unaware of termination conditions (FM-1.5) is the subtler FC1 failure. The agent has no explicit completion signal, so it continues past the point where the task is done — or worse, treats the absence of failure as permission to keep going. Every agentic workflow needs a stated terminal condition that the agent can evaluate against its current state. 'Return when the task is complete' is not a termination condition. 'Return when the output file matches the expected schema and all validation checks pass' is.
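The difference between the two phrasings is that the second one can be evaluated. A rough sketch of such a terminal condition follows; `is_terminal`, the field names, and the row-count check are illustrative assumptions, not taken from MAST.

```python
def is_terminal(output, required_fields, checks):
    """True only when the output matches the expected schema and every check passes.

    required_fields maps field name -> expected Python type (a stand-in for a schema).
    checks is a list of callables that take the output and return True or False.
    """
    for field_name, expected_type in required_fields.items():
        if not isinstance(output.get(field_name), expected_type):
            return False
    return all(check(output) for check in checks)


schema = {"path": str, "rows": int}
checks = [lambda out: out["rows"] > 0]

print(is_terminal({"path": "out.json", "rows": 3}, schema, checks))  # True
print(is_terminal({"path": "out.json"}, schema, checks))             # False
```

Schema checks run first, so the validation callables can assume the fields they read are present.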
A 2026 paper on Agent Behavioral Contracts (ABC) formalizes what MAST implies: when contracted agents operate with explicit preconditions, invariants, and recovery rules, they detect 5.2–6.8 soft violations per session that uncontracted baselines miss entirely.[8] The implication is that the gap between MAST's FC1 failures and spec-complete systems is real and closable — but it requires treating behavioral constraints as first-class artifacts, not incidental prompt additions.
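To make "first-class artifact" concrete, a contract can be a small data object that is checked in code. The sketch below is an illustration only and is not the ABC paper's API; `Contract`, `enforce`, the rule names, and the default `"escalate"` recovery are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    preconditions: dict = field(default_factory=dict)  # rule name -> predicate(state)
    invariants: dict = field(default_factory=dict)     # rule name -> predicate(state)
    recovery: dict = field(default_factory=dict)       # rule name -> recovery rule label


def enforce(contract, state, phase):
    """Return (violated rule, recovery rule) pairs for the given phase.

    phase "before" checks preconditions; any other phase checks invariants.
    Rules without an explicit recovery entry default to "escalate".
    """
    rules = contract.preconditions if phase == "before" else contract.invariants
    return [
        (name, contract.recovery.get(name, "escalate"))
        for name, holds in rules.items()
        if not holds(state)
    ]


contract = Contract(
    preconditions={"has_task": lambda s: "task" in s},
    invariants={"within_budget": lambda s: s.get("tokens", 0) <= 1000},
    recovery={"within_budget": "halt_and_summarize"},
)

print(enforce(contract, {}, "before"))
print(enforce(contract, {"task": "x", "tokens": 5000}, "during"))
```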
Six failure modes rooted in agent-to-agent interaction. The most common — reasoning-action mismatch at 13.98% — is invisible from task output alone. It only surfaces in inter-agent logs.
FC2 failures are where most multi-agent debugging tools fall short. The failures emerge from agent interaction — not from any single agent's behavior in isolation — so neither the orchestrator logs nor the individual agent logs clearly isolate the cause.
The most frequent FC2 mode is reasoning-action mismatch (FM-2.6, 13.98% of all MAST failures[1]). The agent explains in its reasoning step that it will take action X, then takes action Y. This produces outputs that are internally incoherent — the reasoning chain looks valid, the action looks valid, but they describe different things. Teams that review only action outputs miss this entirely. You need to compare stated reasoning to actual action at each agent step.
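The per-step comparison can be sketched as follows, assuming each logged step already carries a structured `stated_action` (what the reasoning said it would do) and `actual_action` (the tool call or message that was sent). Those field names and `find_mismatches` are hypothetical, and extracting the stated action from free-text reasoning is assumed to happen upstream.

```python
def find_mismatches(steps):
    """Return (index, agent, stated, actual) for steps where reasoning and action differ."""
    return [
        (i, step["agent"], step["stated_action"], step["actual_action"])
        for i, step in enumerate(steps)
        if step["stated_action"] != step["actual_action"]
    ]


trace = [
    {"agent": "planner", "stated_action": "search_docs", "actual_action": "search_docs"},
    {"agent": "coder", "stated_action": "run_tests", "actual_action": "write_file"},
]

print(find_mismatches(trace))  # [(1, 'coder', 'run_tests', 'write_file')]
```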
The second most common FC2 mode is failure to ask for clarification (FM-2.2, 11.65%[1]). An agent encounters an ambiguous instruction, makes a plausible assumption, and proceeds — sometimes correctly, often not. The critical design gap is the absence of a clarification gate: a decision point that checks whether the agent has enough information to proceed with confidence before continuing. Most MAS designs skip this gate because it requires additional latency. The alternative is a wrong execution at scale.
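A clarification gate can be as small as a function that refuses to proceed while required inputs are missing. This is a minimal sketch under the assumption that an instruction can be represented as a dict with known required fields; `clarification_gate` and the field names are hypothetical.

```python
def clarification_gate(instruction, required):
    """Block execution and produce a question when required fields are missing or empty."""
    missing = [key for key in required if instruction.get(key) in (None, "")]
    if missing:
        return {"proceed": False, "ask": "Please specify: " + ", ".join(missing)}
    return {"proceed": True, "ask": None}


request = {"task": "export report", "format": "", "recipient": None}
print(clarification_gate(request, ["task", "format", "recipient"]))
# {'proceed': False, 'ask': 'Please specify: format, recipient'}
```

Real ambiguity is rarely just a missing field, so a production gate would likely need a richer sufficiency check; the structural point is that the agent cannot continue until the gate returns `proceed: True`.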
Task derailment (FM-2.3, 7.15%[1]) is the hardest FC2 mode to catch retrospectively because the agent looks productive throughout. It's completing steps, returning results, interacting with other agents — just on the wrong objective. The point where the task went sideways is usually many turns before the point at which the output reveals it.
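If each turn records which objective it believes it is serving, locating the point of derailment becomes a scan instead of a read-through. The sketch below assumes a hypothetical `objective` field on every logged turn; most frameworks do not emit one by default, so this presumes you add it to the shared task state.

```python
def first_drift(turns, objective):
    """Return the index of the first turn working on a different objective, or None."""
    for index, turn in enumerate(turns):
        if turn["objective"] != objective:
            return index
    return None


turns = [
    {"agent": "planner", "objective": "fix login bug"},
    {"agent": "coder", "objective": "fix login bug"},
    {"agent": "coder", "objective": "refactor auth module"},
    {"agent": "tester", "objective": "refactor auth module"},
]

print(first_drift(turns, "fix login bug"))  # 2
```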
The MAST paper's FC2 insight is worth stating plainly: communication protocol improvements — cleaner message formats, more structured handoffs — are insufficient to prevent FC2 failures.[1] These failures require structural coordination design: explicit clarification gates, shared task state that all agents can inspect, and reasoning traces checked against action outputs, not just presented as justification.
The LumiMAS monitoring framework (AAMAS 2026) offers a concrete detection path: its three-layer architecture — monitoring, anomaly detection, and anomaly explanation — detects FM-2.6-style coordination failures in under 0.07 seconds with a low false positive rate.[10] The key observation is that system-wide anomaly detection, not per-agent log review, is what surfaces FC2 failures reliably.
Premature termination, incomplete verification, incorrect verification — three modes that produce clean infrastructure signals while output quality degrades. Infrastructure observability doesn't help here.
FC3 is where the gap between technical observability and actual quality lives. All three FC3 modes produce the same surface signature: the workflow completes, every span succeeds, no circuit breaker trips, no budget ceiling fires. Your dashboard is green. The output is wrong.
Premature termination (FM-3.1, 7.82%[1]) is the most benign-looking FC3 mode. The agent stops when it shouldn't have — it reached a partial completion state that satisfied its termination condition but not the user's intent. The trace looks like a successful short run. Downstream systems process truncated output without knowing it's truncated.
Incomplete or absent verification (FM-3.2, 6.82%[1]) is the architectural pattern where the system produces output but never checks whether that output is correct. Teams that rely on final-stage checks only — a single validation at the end of a long pipeline — find that multi-step errors compound before the check runs. The MAST paper specifically calls this out: multi-level verification is required, not just terminal verification.[1]
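Multi-level verification can be sketched as a pipeline in which every stage carries its own check, so an error is caught at the stage that produced it. The names `run_pipeline` and `stages`, and the toy stages themselves, are illustrative assumptions.

```python
def run_pipeline(data, stages):
    """Run (name, transform, verify) stages in order; stop at the first failed check."""
    for name, transform, verify in stages:
        data = transform(data)
        if not verify(data):
            return {"ok": False, "failed_stage": name, "output": data}
    return {"ok": True, "failed_stage": None, "output": data}


stages = [
    ("parse", lambda text: [int(x) for x in text.split(",")], lambda xs: len(xs) > 0),
    ("dedupe", lambda xs: sorted(set(xs)), lambda xs: len(xs) == len(set(xs))),
    ("sum", lambda xs: sum(xs), lambda total: total >= 0),
]

result = run_pipeline("3,1,3,-9", stages)
print(result)  # {'ok': False, 'failed_stage': 'sum', 'output': -5}
```

The result names the failing stage, which a single terminal check cannot do.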
Incorrect verification (FM-3.3, 6.66%[1]) is the most expensive FC3 mode to discover because the team believes they have quality controls. The verifier runs. The verifier says 'pass.' The output is wrong. This happens when the verifier tests the wrong property — checking format when correctness matters, checking syntax when semantics matter, or checking against stale expectations.
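A toy example makes the wrong-property problem visible: a format check and a correctness check can disagree on the same output. The invoice scenario and both function names are hypothetical.

```python
def format_check(answer):
    """Passes whenever a numeric 'total' field is present, regardless of its value."""
    return isinstance(answer.get("total"), (int, float))


def correctness_check(answer, line_items):
    """Passes only when the total actually equals the sum of the line items."""
    return format_check(answer) and answer["total"] == sum(line_items)


line_items = [20, 5, 10]   # true total is 35
wrong_answer = {"total": 30}

print(format_check(wrong_answer))                   # True
print(correctness_check(wrong_answer, line_items))  # False
```

A verifier built on `format_check` alone would report "pass" on this output, which is the FM-3.3 pattern described above.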
The ReliabilityBench study independently confirms the verification gap: pass@1 metrics overestimate production reliability by 20–40% because they measure single-run success, not whether the agent's output is actually correct.[6] FC3 failures plausibly account for much of that gap. Teams without an active eval layer (an LLM-as-judge or domain-specific validator sampling production outputs) are flying blind on the entire FC3 category. Survey data from 2025–2026 shows LLM-as-judge adoption at 52% among teams doing model-based evaluation, but the critical constraint is what the judge evaluates: it must check task-level correctness, not schema validity or format compliance.[5]
Symptom-driven debugging (without MAST):
Agent produces wrong output → adjust system prompt → redeploy
Agent loops on same steps → add 'don't repeat yourself' instruction → same behavior
Agents disagree on next step → add 'coordinate carefully' directive → FM-2.2 recurs
Output quality degrades over time → model seems weaker → consider model upgrade
Next failure: different surface presentation, same structural cause, same wrong fix layer
MAST-driven triage:
Agent produces wrong output → check spec against trace → identify FM-1.1 → add explicit constraint
Agent loops → identify FM-1.3 (Step Repetition) → add state-tracking check at each step
Agents disagree → identify FM-2.2 (Failure to Ask for Clarification) → add clarification gate before execution
Output quality degrades → FC3 confirmed → add multi-level verification + LLM-as-judge eval
Next failure: different mode, same diagnostic, targeted fix layer, postmortem has root cause
Run these in order against a failed trace. The first question that answers 'yes' determines your category and your fix target.
Question 1 (FC1): Can you point to a gap in the spec?
Read the system prompt, task description, and role definition alongside the failed trace. Find the step where behavior diverged from intent. If you can point to a missing constraint, an ambiguous role boundary, or a termination condition that wasn't defined — you're in FC1. The fix is a spec change, not a model change. Add the constraint explicitly. Define the termination condition precisely. Test the updated spec against the failed trace before deploying. The MAST paper shows +15.6% performance improvement in ChatDev after targeted FC1 interventions — but also notes that superficial fixes are insufficient; structural spec redesigns are what close the gap.[1] The Agent Behavioral Contracts framework formalizes this: ABC-contracted agents detect 5.2–6.8 soft violations per session that uncontracted baselines miss entirely.[8]
Question 2 (FC2): Did coordination between agents break down?
Look at the inter-agent message logs at the step where the failure emerged. Check for reasoning-action mismatch (FM-2.6): does the stated reasoning in the agent's output match the tool call or message it actually sent? Check for missing clarification (FM-2.2): did the agent proceed on an assumption when the instruction was genuinely ambiguous? Did task objectives drift mid-run (FM-2.3) without any explicit decision to change direction? FC2 failures are visible in coordination logs but not in task output. If the individual agent outputs look plausible but the multi-agent interaction produced the wrong result, you're in FC2. The fix is a coordination design change: a clarification gate, shared task state visible to all agents, or explicit handoff validation. Adding retries or stronger language in the system prompt won't close FC2. System-wide monitoring frameworks like LumiMAS detect these failures in real time (under 0.07 seconds[10]) — individual span review won't.
Question 3 (FC3): Did verification miss a wrong output?
If questions 1 and 2 come up negative — the spec was fine, coordination looked intact — but the output is wrong, you're in FC3. Check whether verification ran at all (FM-3.2), whether it ran at the right level of granularity (multi-step verification, not just terminal), and whether it checked the right property (FM-3.3). Infrastructure observability cannot help you here. The fix is an eval layer: a verifier that runs against intermediate outputs, not just the final result, and checks correctness rather than format. An LLM-as-judge sampling production outputs provides FC3 coverage — but what the judge evaluates determines whether it catches FM-3.3 or becomes one itself.
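The ordering of the three questions can be sketched as a small decision function. The answers are assumed to come from a person reading the trace, not from automated detection; `triage`, `FIX_TARGET`, and the return labels are hypothetical. The hybrid branch is an assumption about how to handle a trace where both of the first two questions answer yes.

```python
FIX_TARGET = {
    "FC1": "spec change (pre-deploy)",
    "FC2": "coordination design change (runtime)",
    "FC3": "eval layer with multi-level verification",
}


def triage(spec_gap, coordination_gap, output_wrong):
    """Apply the three diagnostic questions in order and return a category label."""
    if spec_gap and coordination_gap:
        return "ESCALATE"  # hybrid FC1 + FC2 case: route to manual review
    if spec_gap:
        return "FC1"
    if coordination_gap:
        return "FC2"
    if output_wrong:
        return "FC3"
    return "UNCLASSIFIED"


category = triage(spec_gap=False, coordination_gap=False, output_wrong=True)
print(category, "->", FIX_TARGET.get(category, "manual review"))
```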
MAST categories tell you what to fix. The tradeoffs tell you what to expect when you do.
| Category | Correct Fix | Common Wrong Fix | Why the Wrong Fix Fails | Time to Impact |
|---|---|---|---|---|
| FC1 — Spec Issues | Rewrite the behavioral spec; add verifiable terminal states and role constraints | Prompt tone adjustment ('be more precise', 'follow instructions carefully') | Natural language instructions can't substitute for explicit constraints — the model already followed the spec it was given | 1–2 deploy cycles after spec review |
| FC2 — Coordination | Add a clarification gate; implement shared task state; log reasoning-action pairs | More retries; stronger system prompt language about coordination | Retries replay the same broken coordination with the same outcome; prompt language can't create structural gates | Next sprint if coordination is redesigned, not patched |
| FC3 — Verification | Multi-level verifier checking intermediate outputs; LLM-as-judge for correctness, not format | Infrastructure alerting; output schema validation | Format checks pass on incorrect outputs; schema validation is FM-3.3 applied to your own eval pipeline | Immediate if eval layer is deployed — but coverage ramps over time |
MAST tags on individual postmortems are table stakes. The structural signal comes from 30 days of tagged incidents — the distribution tells you which architectural layer has the recurring gap.
A single MAST-tagged postmortem tells you what went wrong. Thirty days of them tells you where your system has a structural gap.
If FC1 failures cluster — agents repeatedly violating spec, stepping out of role, or running past termination conditions — the problem is a process gap in how behavioral specs are written and reviewed before deploy, not individual agent failures. The fix is a spec review gate in your deployment pipeline.
If FC2 failures cluster — coordination breakdowns, reasoning-action mismatches, agents proceeding on wrong assumptions — the problem is that your coordination design is treating inter-agent interaction as incidental rather than structural. The fix is redesigning the coordination layer: shared state, explicit clarification gates, handoff validation.
If FC3 failures cluster — outputs that pass infrastructure checks but fail at correctness — the problem is that your eval coverage is measuring the wrong thing. The fix is an eval layer that assesses task-level correctness, not just format or schema validity.
Tag 30 postmortems. Query the distribution. That distribution is your architectural backlog. MAST doesn't just explain what went wrong — it tells you where to invest before the next incident fires.
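Querying the distribution needs very little machinery. A minimal sketch, assuming each postmortem record is a dict with a `date` and a MAST `category` tag (both field names are hypothetical):

```python
from collections import Counter
from datetime import date, timedelta


def fc_distribution(postmortems, today, window_days=30):
    """Share of FC1/FC2/FC3 tags among postmortems inside the trailing window."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(p["category"] for p in postmortems if p["date"] >= cutoff)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fc: counts[fc] / total for fc in ("FC1", "FC2", "FC3")}


postmortems = [
    {"date": date(2025, 6, 20), "category": "FC1"},
    {"date": date(2025, 6, 25), "category": "FC1"},
    {"date": date(2025, 6, 28), "category": "FC3"},
    {"date": date(2025, 3, 1), "category": "FC2"},  # outside the 30-day window
]

print(fc_distribution(postmortems, today=date(2025, 6, 30)))
```

In this toy data the window is dominated by FC1, which under the reading above would point at the spec review process before individual agents.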
MAST was built on benchmark traces, not production data. Does it apply to real deployments?
This is the right question to ask. MAST was derived from traces of seven open-source MAS frameworks — MetaGPT, ChatDev, AG2, and others — running coding, math, and general agent tasks.[1] These are not production deployments at scale. The taxonomy's validity comes from its κ = 0.88 inter-annotator agreement and balanced failure distribution across categories, not from production incident data. In practice, MAST's three categories map well to production failure patterns described in other empirical datasets — but the specific mode frequencies (41.77% FC1 etc.) should be treated as directional signals from benchmark conditions, not as precise production measurements. Run MAST classification on your own incident history for 90 days before assuming its distribution matches your system.
My agents are single-agent, not multi-agent. Does MAST apply?
FC1 applies fully — specification issues are not specific to multi-agent systems. FM-1.1 through FM-1.5 describe failures that occur in any agentic system with a task specification. FC2 requires inter-agent interaction to be relevant; for single-agent systems, FM-2.6 (Reasoning-Action Mismatch) and FM-2.2 (Failure to Ask for Clarification) are the two modes that translate most directly. FC3 applies to any system producing output that needs verification. If you're operating single agents, MAST gives you FC1 and FC3 as near-direct diagnostics, and a subset of FC2.
Can I use the MAST LLM annotator directly instead of writing my own classifier?
The MAST GitHub repository ships a pip-installable annotator via pip install agentdash.[3] The annotator uses OpenAI's o1 model and achieves κ = 0.77 agreement with human experts on held-out traces.[1] It's production-usable for offline trace analysis — not appropriate for inline, real-time classification. For real-time use, the single-prompt approach in this article is the better path: lighter, faster, and tunable to your system's vocabulary. Run the full MAST annotator offline against your incident history to build baseline FC distributions; use the single-prompt classifier for live triage. For real-time system-wide anomaly detection, LumiMAS (arXiv:2508.12412) operates under 0.07 seconds per detection event.[10]
FC3 failures need an eval layer. How do I get coverage without reviewing every output?
Sampling is sufficient if the sample is representative. LLM-as-judge coverage at 10–20% of production outputs per agent type gives you FC3 signal without reviewing everything. The critical design decision is what the judge evaluates: it must check task-level correctness, not output format or schema validity. For coding agents, that means execution correctness, not code style. For research agents, that means claim accuracy against sources, not citation format. Define the success criterion for each agent type before building the evaluator. A judge that tests the wrong property is FM-3.3 (Incorrect Verification) applied to your own quality process.
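One way to hold a fixed sampling rate per agent type is to hash the output identifier and compare it to a threshold. This sketch assumes outputs have stable string IDs; `should_judge` and the 15% default are illustrative choices inside the 10–20% range mentioned above.

```python
import hashlib


def should_judge(output_id, agent_type, rate=0.15):
    """Deterministically select roughly `rate` of outputs per agent type for judging."""
    digest = hashlib.sha256(f"{agent_type}:{output_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64


ids = [f"out-{i}" for i in range(10000)]
sampled = sum(should_judge(output_id, "coding") for output_id in ids)
print(sampled / len(ids))  # close to 0.15
```

Hash-based selection is repeatable, so the same output is always in or out of the sample. Whether the sample is representative still depends on the IDs not correlating with output quality.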
How does MAST relate to the TRAIL benchmark for agent trace debugging?
They're complementary. MAST classifies failure modes taxonomically: it tells you what category a failure belongs to. TRAIL (Patronus AI, 2025) measures how well LLMs can debug agent traces autonomously.[7] TRAIL's finding that Gemini-2.5-pro scores only 11% on trace debugging explains why MAST's structured classification is necessary: LLMs are poor at unstructured trace debugging without a taxonomy to reason against. Giving a model the MAST framework as a prompt scaffold, as the classifier in this article does, should improve diagnostic accuracy over asking it to debug a trace freeform.
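A prompt scaffold of this kind can be sketched without committing to any model API. The prompt wording, `build_prompt`, and `parse_label` below are hypothetical and are not the MAST annotator's prompt; the category descriptions are paraphrased from this article. Sending the prompt to a model is left out on purpose.

```python
import re

CATEGORIES = {
    "FC1": "Specification Issues: task or role spec violated, steps repeated, "
           "context lost, or no termination condition.",
    "FC2": "Inter-Agent Misalignment: wrong assumptions, reasoning-action mismatch, "
           "withheld information, or task derailment during coordination.",
    "FC3": "Task Verification: premature termination, missing or incomplete "
           "verification, or incorrect verification.",
}


def build_prompt(trace):
    """Assemble a single classification prompt that embeds the three categories."""
    lines = ["Classify the failed trace into exactly one category."]
    lines += [f"{code}: {description}" for code, description in CATEGORIES.items()]
    lines += ["Answer with the category code first, then one sentence of evidence.",
              "", "TRACE:", trace]
    return "\n".join(lines)


def parse_label(response):
    """Extract the first FC1/FC2/FC3 code from a model response, or None."""
    match = re.search(r"\bFC[123]\b", response)
    return match.group(0) if match else None


prompt = build_prompt("coder: I will run the tests. [tool_call: write_file]")
print(parse_label("FC2. The stated action and the tool call differ."))  # FC2
```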
Is the fix-timing boundary between FC categories really that clean?
No, and acknowledging this matters. FC1 and FC2 failures often co-occur: a poor termination spec (FM-1.5) combined with an agent that doesn't ask for clarification when it's stuck (FM-2.2) produces a failure that looks like a different problem from each angle. The diagnostic tree routes hybrid cases to 'Escalate for manual review' for exactly this reason. Fix ordering matters: address FC1 first (spec is cheapest to change), then FC2 (coordination redesign), then FC3 (eval coverage). Fixing FC3 before FC1/FC2 means your eval layer is running over outputs from broken specs and coordination. You'll catch more failures, but you're treating symptoms without closing the structural gap.
MAST's most useful contribution is not the 14 failure modes. It's the proof that these 14 modes are not a flat list — they're three structurally distinct failure surfaces, each requiring a different intervention at a different stage. Treating FC3 with spec changes is as wrong as treating FC1 with an eval layer. The failure mode determines the fix timing. The fix timing determines which team owns it.
Thirty days of MAST-tagged postmortems tells you whether your recurring failures cluster in FC1 (spec discipline problem), FC2 (coordination design problem), or FC3 (eval coverage problem). That's not a diagnosis of a single incident. It's a map of which architectural layer has the structural gap.
The fix is never the model.