The MAST dataset shows failure rates from 41% to 86.7% across seven leading open-source multi-agent systems.[1] That's not the alarming part. The alarming part is what engineering teams do after the failure fires: they tune the prompt when the spec was wrong, add retries when the coordination protocol had no clarification gate, or upgrade the model when the verification layer was missing entirely. Three structurally distinct problems. Three different intervention timings. Each invisible from the vantage point of the other two.
MAST — the Multi-Agent System Failure Taxonomy, published at NeurIPS 2025 by researchers at UC Berkeley — is the first empirically grounded classification of why multi-agent systems fail.[1] It identifies 14 fine-grained failure modes validated with inter-annotator agreement of κ = 0.88 across six expert annotators.[1] The taxonomy is organized into three overarching categories that are operationally meaningful: Specification Issues (FC1, 41.77% of failures), Inter-Agent Misalignment (FC2, 36.94%), and Task Verification (FC3, 21.30%).[1]
The taxonomy is real. The operational playbook is missing. No published guide tells an engineering leader how to determine which of the three categories their current failure belongs to, or which layer the fix targets. That gap is what this article closes.
Key Takeaways
- ✓ MAST identifies 14 MAS failure modes in 3 categories: FC1 (spec issues), FC2 (coordination), and FC3 (verification). Each requires a different fix at a different stage.
- ✓ FC1 failures are preventable before deploy: a behavioral spec review and explicit termination conditions eliminate most of the 41.77% category.
- ✓ FC2 failures (36.94%) are detectable at runtime, but adding communication protocols alone is insufficient; the MAST paper shows FC2 demands deeper coordination design, not just more retries.
- ✓ FC3 failures (21.30%) are invisible to infrastructure: all spans succeed, all retries pass, and output quality degrades silently. Only an eval layer can catch this category.
- ✓ Three diagnostic questions determine which FC you're in within five minutes of reading a failed trace; each question points to a different fix layer and a different timeline.
- 41%–86.7% failure rate. Measured in MAST-Data: 1,642 annotated traces from MetaGPT, ChatDev, HyperAgent, OpenManus, AppWorld, Magentic, and AG2 (NeurIPS 2025).[1]
- 41.77% of failures are FC1 (Specification Issues). The largest single category. Most FC1 failures are preventable before any agent runs.[1]
- 13.98% of failures are FM-2.6 (Reasoning-Action Mismatch). The agent states one intention and takes a different action. Detectable in inter-agent trace logs but not from task output alone.[1]
- κ = 0.88 inter-annotator agreement. Validated across 6 expert annotators and 3 refinement rounds; high enough to run reliable automated classification.[1]
The Three Structural Failures Behind Every Agent Incident
MAST's categories aren't a reading list — they map to three different intervention points. The category determines when you fix it, not just what you fix.
The MAST paper's most important finding is not the 14 failure modes. It's the three-category structure and what each category implies about where the fault lives.
FC1 — Specification Issues are failures that originate before execution. The agent misunderstood the task, violated its role definition, repeated steps already completed, lost conversational context, or didn't know when to stop. Every FM-1.x mode traces back to a design decision made before the first token was generated. This means they're largely preventable — not through better models, but through tighter specs.
FC2 — Inter-Agent Misalignment emerges during execution as agents coordinate, hand off state, and negotiate task progress. Agents proceed on wrong assumptions, reasoning doesn't match action, crucial information gets withheld, task objectives drift mid-run. The MAST paper offers a pointed insight here: solutions focused on communication protocols are often insufficient for FC2 failures — they demand deeper coordination design.[1] You can't add retries to fix a clarification failure. You need a gate that forces the agent to ask before proceeding.
FC3 — Task Verification is the most expensive category to miss. Three modes — premature termination, incomplete verification, and incorrect verification — all produce the same external signature: the workflow completes, no alert fires, output looks plausible. The MAST paper shows that relying on final-stage verification alone is structurally inadequate.[1] You need multi-level verification, not a single check at the end.
The operational consequence of this structure: FC1 failures are preventable pre-deploy. FC2 failures are detectable at runtime. FC3 failures are only catchable with an active eval layer. Each category requires different tooling and a different team cadence to address.
| Code | Failure Mode | Category | Fix Timing | Frequency in MAST-Data |
|---|---|---|---|---|
| FM-1.1 | Disobey Task Specifications | FC1 — Spec Issues | Pre-deploy | Part of FC1's 41.77%[1] |
| FM-1.2 | Disobey Role Specifications | FC1 — Spec Issues | Pre-deploy | Part of FC1's 41.77%[1] |
| FM-1.3 | Step Repetition | FC1 — Spec Issues | Pre-deploy | AppWorld most affected[1] |
| FM-1.4 | Conversation Loss | FC1 — Spec Issues | Pre-deploy | Part of FC1's 41.77%[1] |
| FM-1.5 | Agents Unaware of Termination Conditions | FC1 — Spec Issues | Pre-deploy | Part of FC1's 41.77%[1] |
| FM-2.1 | Conversation Reset | FC2 — Coordination | Runtime | 2.33%[1] |
| FM-2.2 | Failure to Ask for Clarification | FC2 — Coordination | Runtime | 11.65%[1] |
| FM-2.3 | Task Derailment | FC2 — Coordination | Runtime | 7.15%[1] |
| FM-2.4 | Information Withholding | FC2 — Coordination | Runtime | 1.66%[1] |
| FM-2.5 | Ignored Other Agent's Input | FC2 — Coordination | Runtime | 0.17%[1] |
| FM-2.6 | Reasoning-Action Mismatch | FC2 — Coordination | Runtime | 13.98%[1] |
| FM-3.1 | Premature Termination | FC3 — Verification | Eval layer | 7.82%[1] |
| FM-3.2 | No or Incomplete Verification | FC3 — Verification | Eval layer | 6.82%[1] |
| FM-3.3 | Incorrect Verification | FC3 — Verification | Eval layer | 6.66%[1] |
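The table's category-to-timing mapping is mechanical enough to encode, so a postmortem tool can derive the fix layer from a failure-mode code alone. A minimal sketch follows; the function name `fix_timing_for` and the constants are illustrative assumptions, not part of the MAST tooling.

```python
# Illustrative lookup (not part of MAST tooling): failure-mode code -> fix layer.
FIX_TIMING = {"FC1": "pre-deploy", "FC2": "runtime", "FC3": "eval"}
MODES_PER_CATEGORY = {"1": 5, "2": 6, "3": 3}  # counts taken from the table above


def fix_timing_for(mode: str) -> tuple[str, str]:
    """Map a code such as 'FM-2.6' to its (category, fix timing)."""
    prefix, _, number = mode.partition("-")
    major, _, minor = number.partition(".")
    if prefix != "FM" or major not in MODES_PER_CATEGORY or not minor.isdigit():
        raise ValueError(f"not a MAST failure-mode code: {mode!r}")
    if not 1 <= int(minor) <= MODES_PER_CATEGORY[major]:
        raise ValueError(f"unknown MAST failure mode: {mode!r}")
    category = f"FC{major}"
    return category, FIX_TIMING[category]
```

Rejecting unknown codes matters here: a mistyped tag in a postmortem should fail loudly rather than be silently filed under the wrong layer.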
FC1: Your Spec Is the Bug, Not the Model
Five failure modes, all caused by design decisions you made before the first agent turn. The fix isn't in the prompt — it's in the specification you wrote before the prompt.
FC1 failures feel like model failures because they manifest as incorrect agent behavior. The agent ignores task requirements (FM-1.1), steps outside its role (FM-1.2), repeats work it already completed (FM-1.3), loses prior context (FM-1.4), or runs indefinitely without finding a stopping condition (FM-1.5). The model is working correctly. The spec was incomplete.
The diagnostic signal for FC1: if you can read the failed trace and identify exactly which constraint was missing from the system prompt or role definition, you're looking at FC1. The agent didn't hallucinate the behavior; the behavior was consistent with an underspecified instruction.
Step repetition (FM-1.3) is the most visible FC1 mode in production because it generates volume — a looping agent produces trace data that's hard to miss. The AppWorld benchmark had the highest concentration of this mode in MAST-Data.[1] The fix is an explicit state-tracking mechanism: the agent should check at each step whether the target state has already been achieved. This is a spec addition, not a prompt tone adjustment.
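One way to picture the state-tracking check is a guard that refuses to rerun a step whose target state is already recorded. This is a sketch under stated assumptions: `TaskState`, `run_step_once`, and the idea of identifying steps by a string id are all hypothetical, and a real system would persist the state somewhere the agent can read it.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TaskState:
    """Hypothetical record of which steps have already been completed."""
    completed: set[str] = field(default_factory=set)


def run_step_once(state: TaskState, step_id: str, action: Callable[[], None]) -> bool:
    """Run `action` only if `step_id` is not yet completed; return True if it ran."""
    if step_id in state.completed:
        return False  # target state already achieved: skip instead of repeating
    action()
    state.completed.add(step_id)
    return True
```

The guard lives in the orchestration code rather than in the prompt, which is the point: the agent cannot loop on a step the state already marks as done.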
Agents unaware of termination conditions (FM-1.5) is the subtler FC1 failure. The agent has no explicit completion signal, so it continues past the point where the task is done — or worse, treats the absence of failure as permission to keep going. Every agentic workflow needs a stated terminal condition that the agent can evaluate against its current state. 'Return when the task is complete' is not a termination condition. 'Return when the output file matches the expected schema and all validation checks pass' is.
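A terminal condition of that second kind can be written as a predicate the agent loop evaluates against its current output. The sketch below assumes the output is a dictionary and that validation checks are plain callables; `is_done`, `required_fields`, and `checks` are illustrative names.

```python
from typing import Callable


def is_done(output: dict, required_fields: set[str],
            checks: list[Callable[[dict], bool]]) -> bool:
    """Terminal condition: every required field is present and every check passes."""
    if not required_fields.issubset(output):
        return False  # schema incomplete: the task cannot be finished yet
    return all(check(output) for check in checks)
```

For example, `is_done({"rows": 3, "path": "out.csv"}, {"rows", "path"}, [lambda o: o["rows"] > 0])` evaluates to `True`, while an output with `rows` equal to 0 or a missing `path` does not terminate the run.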
FC2: The Coordination Protocol Doesn't Exist Yet
Six failure modes rooted in agent-to-agent interaction. The most common — reasoning-action mismatch at 13.98% — is invisible from task output alone. It only surfaces in inter-agent logs.
FC2 failures are where most multi-agent debugging tools fall short. The failures emerge from agent interaction — not from any single agent's behavior in isolation — so neither the orchestrator logs nor the individual agent logs clearly isolate the cause.
The most frequent FC2 mode is reasoning-action mismatch (FM-2.6, 13.98% of all MAST failures[1]). The agent explains in its reasoning step that it will take action X, then takes action Y. This produces outputs that are internally incoherent — the reasoning chain looks valid, the action looks valid, but they describe different things. Teams that review only action outputs miss this entirely. You need to compare stated reasoning to actual action at each agent step.
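If each agent step logs the tool it said it would use alongside the tool it actually called, the comparison becomes a one-line scan. The sketch assumes that structured logging exists; `AgentStep` and its field names are hypothetical, and free-text reasoning would need an extraction step (or an LLM judge) before this check applies.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    agent: str
    intended_tool: str  # tool named in the agent's reasoning (assumed to be logged)
    called_tool: str    # tool the agent actually invoked


def find_mismatches(steps: list[AgentStep]) -> list[int]:
    """Return indices of steps whose stated intent differs from the action taken."""
    return [i for i, step in enumerate(steps)
            if step.intended_tool != step.called_tool]
```

Running this over an exported trace gives candidate FM-2.6 steps to review, even when the final task output looks plausible.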
The second most common FC2 mode is failure to ask for clarification (FM-2.2, 11.65%[1]). An agent encounters an ambiguous instruction, makes a plausible assumption, and proceeds; sometimes the assumption is right, often it is not. The critical design gap is the absence of a clarification gate: a decision point that checks whether the agent has enough information to proceed with confidence before continuing. Most MAS designs skip this gate because it adds latency. The alternative is wrong execution at scale.
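A clarification gate can be as small as a function that returns the questions an agent must ask before it is allowed to act. This sketch makes a simplifying assumption: ambiguity is modeled as missing or empty required fields in a structured instruction. The name `clarification_gate` and the question wording are illustrative; a confidence-based gate would replace the missing-field test with a score threshold.

```python
def clarification_gate(instruction: dict, required: list[str]) -> list[str]:
    """Return questions to ask before proceeding; an empty list means go ahead."""
    missing = [name for name in required if instruction.get(name) in (None, "")]
    return [f"Please specify '{name}' before I proceed." for name in missing]
```

The orchestrator executes the step only when the returned list is empty; otherwise it routes the questions back to the requesting agent or user.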
Task derailment (FM-2.3, 7.15%[1]) is the hardest FC2 mode to catch retrospectively because the agent looks productive throughout. It's completing steps, returning results, interacting with other agents — just on the wrong objective. The point at which the task went sideways is usually many turns before the point at which the output reveals it.
The MAST paper's FC2 insight is worth stating plainly: communication protocol improvements — cleaner message formats, more structured handoffs — are insufficient to prevent FC2 failures.[1] These failures require structural coordination design: explicit clarification gates, shared task state that all agents can inspect, and reasoning traces that are checked against action outputs, not just presented as justification.
FC3: The Only Failures Your Infrastructure Cannot See
Premature termination, incomplete verification, incorrect verification — three modes that produce clean infrastructure signals while output quality degrades. Infrastructure observability doesn't help here.
FC3 is where the gap between technical observability and actual quality lives. All three FC3 modes produce the same surface signature: the workflow completes, every span succeeds, no circuit breaker trips, no budget ceiling fires. Your dashboard is green. The output is wrong.
Premature termination (FM-3.1, 7.82%[1]) is the most benign-looking FC3 mode. The agent stops when it shouldn't have — it reached a partial completion state that satisfied its termination condition but not the user's intent. The trace looks like a successful short run. Downstream systems process truncated output without knowing it's truncated.
Incomplete or absent verification (FM-3.2, 6.82%[1]) is the architectural pattern where the system produces output but never checks whether that output is correct. Teams that rely on final-stage checks only — a single validation at the end of a long pipeline — find that multi-step errors compound before the check runs. The MAST paper specifically calls this out: multi-level verification is needed, not just terminal verification.[1]
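Multi-level verification can be sketched as a pipeline in which every stage carries its own check, so an error is caught at the stage that introduced it rather than at the end. The `Stage` tuple layout and `run_with_checks` are assumptions made for illustration, not a prescribed MAST design.

```python
from typing import Any, Callable

# (stage name, transform, verifier for that stage's output)
Stage = tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


def run_with_checks(value: Any, stages: list[Stage]) -> Any:
    """Run each stage and verify its output immediately; raise at the first failure."""
    for name, transform, verify in stages:
        value = transform(value)
        if not verify(value):
            raise ValueError(f"verification failed after stage {name!r}")
    return value
```

Because the exception names the failing stage, the postmortem starts at the step where the output first went wrong instead of at the terminal check.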
Incorrect verification (FM-3.3, 6.66%[1]) is the most expensive FC3 mode to discover because the team believes they have quality controls. The verifier runs. The verifier says 'pass.' The output is wrong. This happens when the verifier is not testing the right property — checking format when correctness matters, checking syntax when semantics matter, or checking against stale expectations.
The ReliabilityBench study independently confirms the verification gap: pass@1 metrics overestimate production reliability by 20–40% because they measure single-run success, not whether the agent's output is actually correct.[6] FC3 failures are what that 20–40% gap is made of. Teams without an active eval layer — an LLM-as-judge or domain-specific validator sampling production outputs — are flying blind on the entire FC3 category, which accounts for 21.30% of documented failures.
Without a MAST diagnosis:
- Agent produces wrong output → adjust system prompt → redeploy
- Agent loops on same steps → add 'don't repeat yourself' instruction → same behavior
- Agents disagree on next step → add 'coordinate carefully' directive → FM-2.2 recurs
- Output quality degrades over time → model seems weaker → consider model upgrade
- Next failure: different surface presentation, same structural cause, same wrong fix layer
With a MAST diagnosis:
- Agent produces wrong output → check spec against trace → identify FM-1.1 → add explicit constraint
- Agent loops → identify FM-1.3 (Step Repetition) → add state-tracking check at each step
- Agents disagree → identify FM-2.2 (Failure to Ask for Clarification) → add clarification gate before execution
- Output quality degrades → FC3 confirmed → add multi-level verification + LLM-as-judge eval
- Next failure: different mode, same diagnostic, targeted fix layer, postmortem has root cause
The Three Diagnostic Questions
Run these in order against a failed trace. The first question that answers 'yes' determines your category and your fix target.
Question 1: Did the agent violate or misinterpret its specification?
Read the system prompt, task description, and role definition alongside the failed trace. Find the step where behavior diverged from intent. If you can point to a missing constraint, an ambiguous role boundary, or a termination condition that wasn't defined, you're in FC1. The fix is a spec change, not a model change. Add the constraint explicitly. Define the termination condition precisely. Test the updated spec against the failed trace before deploying. The MAST paper reports a 15.6% performance improvement in ChatDev after targeted FC1 interventions, but also notes that superficial fixes are insufficient; structural spec redesigns are what close the gap.[1]
Question 2: Did inter-agent coordination break down?
Look at the inter-agent message logs at the step where the failure emerged. Check for reasoning-action mismatch (FM-2.6): does the stated reasoning in the agent's output match the tool call or message it actually sent? Check for missing clarification (FM-2.2): did the agent proceed on an assumption when the instruction was genuinely ambiguous? Did task objectives drift mid-run (FM-2.3) without any explicit decision to change direction? FC2 failures are visible in coordination logs but not in task output. If the individual agent outputs look plausible but the multi-agent interaction produced the wrong result, you're in FC2. The fix is a coordination design change: a clarification gate, shared task state visible to all agents, or explicit handoff validation. Adding retries or stronger language in the system prompt won't close FC2.
Question 3: Did the workflow complete but produce degraded or unverified output?
If questions 1 and 2 come up negative — the spec was fine, coordination looked intact — but the output is wrong, you're in FC3. Check whether verification ran at all (FM-3.2), whether it ran at the right level of granularity (multi-step verification, not just terminal), and whether it checked the right property (FM-3.3). Infrastructure observability cannot help you here. The fix is an eval layer: a verifier that runs against intermediate outputs, not just the final result, and checks correctness rather than format. An LLM-as-judge sampling 10–20% of production outputs provides FC3 coverage. Without it, FC3 failures accumulate silently until a customer or auditor surfaces them.
mast_classifier.py
```python
# MAST failure mode classifier: runs offline against exported trace data.
# Returns the FC category, specific failure mode, and fix_timing.
# fix_timing tells you WHEN to intervene, not just what's wrong.
import json

MAST_PROMPT = """
Classify this failed multi-agent execution trace using the MAST taxonomy.

FC1 - SPECIFICATION ISSUES (failures from design or ambiguous specs, fix pre-deploy):
  FM-1.1: Disobey Task Specifications - agent violates stated task requirements
  FM-1.2: Disobey Role Specifications - agent acts outside its assigned role
  FM-1.3: Step Repetition - agent repeats already-completed steps
  FM-1.4: Conversation Loss - agent loses prior conversation context
  FM-1.5: Agents Unaware of Termination Conditions - agent cannot determine when to stop

FC2 - INTER-AGENT MISALIGNMENT (failures from coordination breakdown, fix at runtime):
  FM-2.1: Conversation Reset - conversation history resets unexpectedly
  FM-2.2: Failure to Ask for Clarification - agent proceeds on wrong assumptions
  FM-2.3: Task Derailment - task veers off-course mid-execution
  FM-2.4: Information Withholding - crucial info not shared between agents
  FM-2.5: Ignored Other Agent's Input - agent ignores inputs from other agents
  FM-2.6: Reasoning-Action Mismatch - stated reasoning differs from action taken

FC3 - TASK VERIFICATION (failures from inadequate output checking, fix with eval layer):
  FM-3.1: Premature Termination - task ends before completion
  FM-3.2: No or Incomplete Verification - output correctness not checked
  FM-3.3: Incorrect Verification - verification runs but gives wrong result

TRACE:
{trace_text}

Output JSON only:
{{"category": "FC1|FC2|FC3",
  "failure_mode": "FM-X.X",
  "confidence": 0.0,
  "evidence": "key evidence from the trace",
  "fix_timing": "pre-deploy|runtime|eval",
  "fix_target": "spec|role|termination|protocol|clarification|verification|multi-level-eval"}}
"""


def classify_mast(trace: str, llm) -> dict:
    """
    Classify a failed MAS trace against the MAST taxonomy.
    Returns category, specific mode, confidence, and fix_timing.

    fix_timing == 'pre-deploy' → The spec was wrong. Fix before shipping.
    fix_timing == 'runtime'    → Add coordination gate or protocol enforcement.
    fix_timing == 'eval'       → Infrastructure won't catch this. Add an eval layer.

    `llm` is your own LLM client. This snippet assumes it exposes
    complete(prompt) and returns an object with a `.content` string.
    """
    response = llm.complete(MAST_PROMPT.format(trace_text=trace))
    # json.loads raises a ValueError if the model returns anything but JSON.
    return json.loads(response.content)


# Wire to your postmortem process:
# After any agent incident, run classify_mast(trace, llm) before writing the root cause.
# The fix_timing field tells the team which layer owns the fix: spec, runtime, or eval.
# Tag postmortems with MAST codes to track which category is recurring.
```
Pre-Production MAST Audit
- FC1 check: Every agent has an explicit terminal condition (a verifiable end state, not 'complete the task')
- FC1 check: Role boundaries are stated as constraints, not descriptions; each agent knows what it is NOT authorized to do
- FC1 check: State tracking prevents step repetition; the agent checks current state before starting any step
- FC2 check: A clarification gate exists for inputs below a confidence threshold; the agent asks before assuming
- FC2 check: Reasoning-action pairs are logged at each agent step, so you can compare intent to action after the fact
- FC2 check: Shared task state is visible to all agents; no agent operates on a stale or partial view of the task
- FC3 check: Verification runs at intermediate steps, not only at terminal output (multi-level, not single-check)
- FC3 check: The verifier tests correctness of the output, not just format, checked against the task's success criteria
- FC3 check: An eval layer samples production outputs using an LLM-as-judge or domain validator, not just infrastructure metrics
- The postmortem template requires a MAST category (FC1/FC2/FC3) and failure mode (FM-X.X) before the root cause is written
MAST was built on benchmark traces, not production data. Does it apply to real deployments?
This is the right question to ask. MAST was derived from traces of seven open-source MAS frameworks — MetaGPT, ChatDev, AG2, and others — running coding, math, and general agent tasks.[1] These are not production deployments at scale. The taxonomy's validity comes from its κ = 0.88 inter-annotator agreement and balanced failure distribution across categories, not from production incident data. In practice, MAST's three categories map well to production failure patterns described in other empirical datasets — but the specific mode frequencies (41.77% FC1 etc.) should be treated as directional signals from benchmark conditions, not as precise production measurements. Run MAST classification on your own incident history for 90 days before assuming its distribution matches your system.
My agents are single-agent, not multi-agent. Does MAST apply?
FC1 applies fully — specification issues are not specific to multi-agent systems. FM-1.1 through FM-1.5 describe failures that occur in any agentic system with a task specification. FC2 requires inter-agent interaction to be relevant; for single-agent systems, FM-2.6 (Reasoning-Action Mismatch) and FM-2.2 (Failure to Ask for Clarification) are the two modes that translate most directly. FC3 applies to any system producing output that needs verification. If you're operating single agents, MAST gives you FC1 and FC3 as near-direct diagnostics, and a subset of FC2.
Can I use the MAST LLM annotator directly instead of writing my own classifier?
The MAST GitHub repository (github.com/multi-agent-systems-failure-taxonomy/MAST)[3] ships a pip-installable annotator via pip install agentdash. The annotator uses OpenAI's o1 model and achieves κ = 0.77 agreement with human experts on held-out traces.[1] It's production-usable for offline trace analysis — not appropriate for inline, real-time classification. For real-time use, the single-prompt approach in this article is the better path: lighter, faster, and tunable to your system's vocabulary. Run the full MAST annotator offline against your incident history to build baseline FC distributions; use the single-prompt classifier for live triage.
FC3 failures need an eval layer. How do I get coverage without reviewing every output?
Sampling is sufficient if the sample is representative. LLM-as-judge coverage at 10–20% of production outputs per agent type gives you FC3 signal without reviewing everything. The critical design decision is what the judge evaluates: it must check task-level correctness, not just output format or schema validity. For coding agents, that means execution correctness, not code style. For research agents, that means claim accuracy against sources, not citation format. Define the success criterion for each agent type before building the evaluator. A judge that tests the wrong property is FM-3.3 (Incorrect Verification) applied to your own quality process.
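Sampling at a fixed rate is easiest to audit when the decision is deterministic per output, so the same output id always lands in or out of the judged set. The sketch below uses a hash of the id for that purpose; `should_sample`, the `output_id` argument, and the 15% default are assumptions chosen to sit inside the 10–20% range, not a recommendation from the MAST paper.

```python
import hashlib


def should_sample(output_id: str, rate: float = 0.15) -> bool:
    """Deterministically select roughly `rate` of outputs for judge review."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64
```

Hash-based selection avoids storing sampling decisions and keeps them reproducible across reruns. To sample per agent type, one option is to prefix the id with the agent type and pass a rate for that type.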
MAST's most useful contribution is not the 14 failure modes. It's the proof that these 14 modes are not a flat list — they're three structurally distinct failure surfaces, each requiring a different intervention at a different stage. Treating FC3 with spec changes is as wrong as treating FC1 with an eval layer. The failure mode determines the fix timing. The fix timing determines which team owns it.
Thirty days of MAST-tagged postmortems tells you whether your recurring failures cluster in FC1 (spec discipline problem), FC2 (coordination design problem), or FC3 (eval coverage problem). That's not a diagnosis of a single incident. It's a map of which architectural layer has the structural gap.
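The clustering itself is a simple tally over tagged postmortems. A minimal sketch, assuming each postmortem is a dictionary with a `category` key holding `FC1`, `FC2`, or `FC3`; the function name `category_shares` is illustrative.

```python
from collections import Counter


def category_shares(postmortems: list[dict]) -> dict[str, float]:
    """Share of tagged postmortems per MAST category; empty dict if none are tagged."""
    counts = Counter(p["category"] for p in postmortems)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {fc: counts.get(fc, 0) / total for fc in ("FC1", "FC2", "FC3")}
```

A result dominated by one category points at the layer with the structural gap: spec discipline for FC1, coordination design for FC2, eval coverage for FC3.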
The fix is never the model.
- [1] Why Do Multi-Agent LLM Systems Fail? MAST taxonomy, 14 failure modes, 1,642 annotated traces (Cemri et al., UC Berkeley, arXiv:2503.13657, NeurIPS 2025 Datasets and Benchmarks Track). arxiv.org
- [2] NeurIPS 2025: Why Do Multi-Agent LLM Systems Fail? (official proceedings). proceedings.neurips.cc
- [3] MAST GitHub repository: taxonomy definitions, MAST-Data dataset, LLM annotator. github.com
- [4] MAST project page, UC Berkeley Sky Computing Lab. sky.cs.berkeley.edu
- [5] State of Agent Engineering 2025: LangChain survey, 1,340 respondents; 57% agents in production, 89% have observability, 32% cite quality as top barrier. langchain.com
- [6] ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions. pass@1 overestimates production reliability by 20-40% (arXiv:2601.06112). arxiv.org