MAST Pre-Production Gate: Prevent Agent Failures Before They Ship

Every FC1 failure in the MAST dataset could have been caught in writing before the agent ran. Specification Issues, which account for 41.77% of all failures across 1,642 annotated traces from seven open-source multi-agent frameworks, don't emerge from runtime surprises.^[1] They're embedded in the spec the author wrote before the first token was generated. The agent violated its task constraints because the constraints were absent. It repeated completed steps because the execution flow had no state-tracking guard. It ran past task completion because "return when done" is not a terminal condition an agent can evaluate against anything.

Researchers at UC Berkeley published MAST at NeurIPS 2025 after analyzing traces across MetaGPT, ChatDev, HyperAgent, OpenManus, AppWorld, Magentic, and AG2. Failure rates ranged from 41% to 86.7% depending on the system.^[1] That range isn't primarily explained by model choice. It's explained by how completely each system's specs, coordination protocols, and verification layers were designed before execution. The MAST paper's targeted interventions on ChatDev improved performance by +15.6% without touching the model — only the spec.^[1]

The MAST triage playbook answers: "My agent failed — which of the 14 failure modes is this, and which layer do I fix?" This playbook answers a different question: what must be true before the agent ships so that those failure modes cannot occur. The answer is three sequential pre-production review gates — one per MAST category — with 12 questions, explicit pass criteria, and assigned ownership. No postmortem required.

What This Covers

✓ FC1 (41.77% of failures): 5 specification failure modes, all preventable before deploy with a structured spec review gate
✓ FC2 (36.94%): 6 inter-agent coordination modes, all requiring architecture design review, not prompt revision
✓ FC3 (21.30%): 3 verification failure modes, invisible to infrastructure without an eval layer built before shipping
✓ 12 pre-production gate questions (4 per MAST category) with pass criteria and named ownership
✓ A sequential gate diagram showing which review unblocks the next
✓ Guidance on which gates can be automated in CI and which require human architecture review

41.77%

Share of MAST failures from FC1 — Specification Issues

All five FC1 modes are preventable before any agent runs. Most teams discover them in a postmortem.^[1]

+15.6%

Performance gain in ChatDev from targeted FC1 spec fixes

The model didn't change. Only the spec changed. This is the gain the paper measured from targeted fixes alone.^[1]

5.2–6.8

Soft violations per session detected by contracted agents and missed by uncontracted baselines

Agents with explicit behavioral contracts detect violations uncontracted baselines miss entirely.^[5]

20–40%

Pass@1 overestimate of true production reliability (ReliabilityBench)

Single-run staging tests don't catch FC3 failures. The gap is closed by production sampling with correctness verification.^[6]

FC1 Failures Are Committed at Spec Time

Five failure modes, all caused by missing design decisions made before execution begins. The fix belongs in the spec review, not the postmortem.

FC1's five failure modes are Disobey Task Specifications (FM-1.1), Disobey Role Specifications (FM-1.2), Step Repetition (FM-1.3), Loss of Conversation History (FM-1.4), and Agents Unaware of Termination Conditions (FM-1.5). All five share an origin: they're consistent with the specification the agent was given.^[1] The model isn't malfunctioning. The design is underspecified.

The test for FC1 is cheap: pull the system prompt and role definition next to the failed trace. If you can point to a line that's missing — a termination condition, a role constraint, a state check — that's FC1. If you find yourself writing "I assumed the model would know not to do that," you found the gap.

FM-1.5 (Termination Conditions) is the most consequential because it compounds. Without a verifiable terminal state, the agent continues until a hard limit fires — token budget, API timeout, or circuit breaker. "Return when the task is complete" is not a termination condition. "Return when the diff is clean, all tests pass, and the output matches the expected schema" is. Every agent needs a terminal condition it can evaluate against current state.
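A minimal sketch of the "diff is clean, all tests pass, output matches the schema" example as a condition the agent can actually evaluate. The `TaskState` fields and `EXPECTED_SCHEMA` are hypothetical names chosen for illustration, not part of MAST or any framework.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """Hypothetical snapshot of the state the agent can observe."""
    diff_is_clean: bool
    tests_passed: int
    tests_total: int
    output: dict


# Assumed schema for illustration: required output keys and their types.
EXPECTED_SCHEMA = {"summary": str, "files_changed": list}


def matches_schema(output: dict, schema: dict) -> bool:
    """True when the output has exactly the expected keys with the expected types."""
    return output.keys() == schema.keys() and all(
        isinstance(output[key], expected) for key, expected in schema.items()
    )


def is_terminal(state: TaskState) -> bool:
    """Falsifiable terminal condition: every clause is checked against current state."""
    return (
        state.diff_is_clean
        and state.tests_total > 0
        and state.tests_passed == state.tests_total
        and matches_schema(state.output, EXPECTED_SCHEMA)
    )
```

Each clause can fail independently, which is what makes the condition reviewable: a reviewer can ask which clause is missing rather than whether "done" was defined.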

FM-1.3 (Step Repetition) shows up most in long-horizon tasks. AppWorld had the highest concentration in MAST-Data.^[1] The prevention is a state-tracking guard at step start: before the agent executes a step, it checks whether the target state already exists. Without this, loops are the default state when the task is partially complete.
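The state-tracking guard described above can be sketched as a wrapper that checks the target state before a step runs. `run_step` and the in-memory `created_files` set are illustrative stand-ins; a real agent would check a file system, database, or task tracker.

```python
from typing import Callable


def run_step(
    name: str,
    target_state_holds: Callable[[], bool],
    execute: Callable[[], None],
) -> str:
    """Run a step only if its target state does not already hold."""
    if target_state_holds():
        return f"skipped {name}: target state already exists"
    execute()
    # Re-check after execution so a silent no-op is reported, not retried forever.
    if not target_state_holds():
        raise RuntimeError(f"{name} ran but its target state still does not hold")
    return f"executed {name}"


# Usage with an in-memory stand-in for real task state.
created_files: set[str] = set()

first = run_step(
    "create_report",
    target_state_holds=lambda: "report.md" in created_files,
    execute=lambda: created_files.add("report.md"),
)
second = run_step(
    "create_report",
    target_state_holds=lambda: "report.md" in created_files,
    execute=lambda: created_files.add("report.md"),
)
```

The second call is skipped because the state check, not the agent's memory of the conversation, decides whether the step still needs to run.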

The Agent Behavioral Contracts (ABC) framework formalizes what MAST implies here: contracted agents operating with explicit preconditions, invariants, and recovery rules detect 5.2–6.8 soft violations per session that uncontracted baselines miss entirely.^[5] The gap between an underspecified system prompt and a behavioral contract is exactly where FC1 failures live — and it's entirely preventable before deploy.

FC2 Failures Are Designed Into the Architecture, Then Discovered in Production

Six coordination modes that require structural architecture changes before deploy. Stronger system prompts don't create clarification gates. They create more articulate agents that still don't ask.

FC2's most common mode is Reasoning-Action Mismatch (FM-2.6, 13.98% of all MAST failures^[1]). The agent states an intention and executes something different. This isn't a prompt problem — it's a monitoring gap. You can't catch FM-2.6 without logging reasoning-action pairs at each step, and you can't build that logging after the agent ships without changing the execution flow. The logging decision belongs in the pre-production design review.
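One way to log reasoning-action pairs, sketched under the assumption that the agent's reasoning names the tool it intends to call. `StepRecord` and `log_step` are hypothetical names, and comparing tool names is a crude proxy for mismatch, not a method from the MAST paper.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class StepRecord:
    step: int
    stated_intent: str   # what the agent said it would do
    intended_tool: str   # tool named in the agent's reasoning
    actual_tool: str     # tool that was actually invoked

    @property
    def mismatch(self) -> bool:
        """True when the invoked tool differs from the one the agent named."""
        return self.intended_tool != self.actual_tool


def log_step(record: StepRecord, sink: list) -> None:
    """Append one JSON line per step so intent and action can be compared later."""
    entry = asdict(record)
    entry["mismatch"] = record.mismatch
    sink.append(json.dumps(entry))
```

The design point is that the record is written inside the execution flow at every step, which is why the decision has to be made before the agent is built rather than bolted on afterward.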

Failure to Ask for Clarification (FM-2.2, 11.65%^[1]) is structural, not behavioral. An agent that receives an ambiguous instruction will proceed on an assumption — that's what language models do. The prevention requires a designed clarification gate: a conditional in the execution flow that evaluates confidence before proceeding. Adding this after production is an architecture change. Doing it before deploy is a design decision.
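A clarification gate can be as small as one conditional in the execution flow. This sketch assumes a confidence score between 0 and 1 is already available from somewhere (a classifier, a self-assessment prompt, a rules check); how to compute it is outside the sketch, and the `0.8` threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENCE_THRESHOLD = 0.8  # assumed value; set per agent during architecture review


@dataclass
class GateDecision:
    proceed: bool
    question: Optional[str] = None


def clarification_gate(
    instruction: str,
    confidence: float,
    threshold: float = CONFIDENCE_THRESHOLD,
) -> GateDecision:
    """Block execution and escalate a question when confidence is below threshold."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= threshold:
        return GateDecision(proceed=True)
    return GateDecision(
        proceed=False,
        question=f"Instruction is ambiguous, please clarify: {instruction!r}",
    )
```

What the review should look for is the two things this function makes explicit: a defined threshold and a defined escalation path when the gate blocks.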

The MAST paper's FC2 finding is direct: communication protocol improvements — cleaner message formats, structured handoffs, turn-taking confirmations — are insufficient for coordination failures.^[1] You can enforce message schemas across every agent boundary and still ship a system where no gate stops an agent from proceeding on a wrong assumption. The gate is an architectural decision, not a protocol tweak.

Task Derailment (FM-2.3, 7.15%^[1]) is the FC2 mode most likely to produce plausible-looking output. The agent is productive — completing steps, returning results, interacting with other agents — but on a drifted objective. Prevention requires shared task state visible to all agents: a single authoritative task definition that any agent can check against its instructions before continuing. This is a design artifact. If it doesn't exist at review time, it won't exist at runtime either.
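A sketch of a single authoritative task definition that agents can check themselves against. `TaskStore` and the fingerprint comparison are illustrative assumptions; the check detects an agent working from a stale copy of the task, which is one precondition for catching drift, not a complete drift detector.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDefinition:
    objective: str
    constraints: tuple

    def fingerprint(self) -> str:
        """Stable hash of the task so agents can compare views cheaply."""
        payload = self.objective + "\n" + "\n".join(self.constraints)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class TaskStore:
    """Single authoritative source for the current task definition."""

    def __init__(self, definition: TaskDefinition) -> None:
        self._definition = definition

    def current(self) -> TaskDefinition:
        return self._definition

    def update(self, definition: TaskDefinition) -> None:
        self._definition = definition


def is_on_task(store: TaskStore, agent_fingerprint: str) -> bool:
    """True when the agent's copy of the task matches the authoritative one."""
    return store.current().fingerprint() == agent_fingerprint
```

An agent would call `is_on_task` before continuing a step and re-read the definition when the check fails.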

A practical signal for FC2 readiness: when you review the architecture diagram, can you point to where the clarification gate sits in the execution flow? Can you show where reasoning-action pairs are logged? If both answers are "we'll add that later," the FC2 gate fails.

MAST Pre-Production Gate: Three Sequential Reviews Before Deploy

Each gate has different ownership and a different correction path. Gates are sequential: FC1 must pass before FC2, FC2 before FC3. Running FC3 eval coverage over a broken FC2 coordination design produces well-evaluated wrong outputs.
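The sequencing rule can be sketched as a runner that stops at the first failing gate, so a later review never runs over an earlier failure. The gate checks here are placeholder callables; in practice each would wrap that gate's four questions.

```python
from typing import Callable

Gate = tuple[str, Callable[[], bool]]


def run_gates(gates: list[Gate]) -> tuple[bool, list[str]]:
    """Run gates in order and stop at the first failure."""
    log: list[str] = []
    for name, check in gates:
        if not check():
            log.append(f"{name}: FAIL")
            return False, log
        log.append(f"{name}: PASS")
    return True, log


# Example: FC2 fails, so FC3 is never evaluated.
approved, log = run_gates([
    ("FC1", lambda: True),
    ("FC2", lambda: False),
    ("FC3", lambda: True),
])
```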

FC3 Failures Are Only Catchable with an Eval Layer You Built Before Shipping

Premature termination, incomplete verification, incorrect verification — all three produce clean infrastructure signals. You can't retrofit eval coverage after output quality has already degraded.

FC3's three modes — Premature Termination (FM-3.1, 7.82%), No or Incomplete Verification (FM-3.2, 6.82%), and Incorrect Verification (FM-3.3, 6.66%^[1]) — share one signature: the workflow completes, no alert fires, infrastructure is clean. The problem is in the output.

This makes FC3 the most expensive category to catch in production because it looks like no incident. Output quality degrades. Downstream systems process the output. The problem surfaces in a sprint retro or a customer complaint three weeks later.

The ReliabilityBench study found that pass@1 metrics overestimate production reliability by 20–40%.^[6] That gap is the FC3 failure surface — agents that pass staging tests and fail at correctness in production, because staging verified format and production revealed correctness. The root cause is the same across modes: verification checked the wrong thing, or ran only at the terminal output instead of at intermediate steps.

FM-3.3 (Incorrect Verification) is the trap in evaluation design: you built an eval, but it tests the wrong property. A code agent verified by a linter that checks syntax but not execution. A research agent verified by citation count rather than claim accuracy. An FC3 eval that checks format is FM-3.3 applied to your own quality process.
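The linter example can be made concrete. In this sketch a syntax-only check passes code that an execution check rejects; the function names and test cases are illustrative, and `exec` is used only because the snippet is trusted, so a real verifier would run agent output in a sandbox.

```python
import ast


def passes_syntax_check(source: str) -> bool:
    """Format-level verification: does the code parse?"""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def passes_execution_check(source: str, func_name: str, cases: list) -> bool:
    """Correctness-level verification: does the code return the expected values?"""
    namespace: dict = {}
    try:
        exec(source, namespace)  # trusted input only; sandbox real agent output
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        return False


# Syntactically valid, functionally wrong: subtraction instead of addition.
buggy = "def add(a, b):\n    return a - b\n"
cases = [((2, 3), 5), ((0, 0), 0)]
```

A gate built on `passes_syntax_check` alone would approve `buggy`, which is the FM-3.3 pattern applied to the eval itself.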

The design question for FC3 pre-production review: what does "correct output" mean for this specific agent, and can you evaluate that before shipping? An LLM-as-judge sampling 10–20% of production outputs provides FC3 coverage without reviewing everything — but only if the judge tests task-level correctness. Defining the correctness criterion is the work that must happen before the eval layer ships alongside the agent. Teams that defer it ship an eval layer that catches the wrong failures.^[8]
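Sampling a fixed share of production outputs for the judge can be done deterministically by hashing an output ID. This is one possible approach, not something the cited sources prescribe; `SAMPLE_RATE` and the ID format are assumptions, and the judge call itself is left out.

```python
import hashlib

SAMPLE_RATE = 0.15  # assumed value inside the 10–20% range discussed above


def should_sample(output_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select roughly `rate` of outputs by hashing their ID."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < rate * 2**64
```

Hashing instead of calling a random generator means the same output is always in or out of the sample, so a judged result can be reproduced later during a postmortem.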

The 12-Question MAST Pre-Production Gate

Four questions per MAST category, each with a pass condition and a named owner. The gate is sequential: FC1 must pass before FC2 is reviewed, FC2 before FC3.

#	MAST Category	Pre-Production Question	Pass Condition	Review Owner
1	FC1 — Spec	Can you write the agent's terminal state as a verifiable assertion — not 'task complete' but a falsifiable condition?	Yes — documented and testable	Agent author
2	FC1 — Spec	Is each role boundary expressed as an explicit prohibition — what the agent is not authorized to do?	Yes — explicit 'not authorized' list per agent	Agent author
3	FC1 — Spec	Does a step-repetition guard exist — does the agent check whether the target state already holds before starting each step?	Yes — state check in the execution flow	Agent author
4	FC1 — Spec	For long-horizon tasks: is context window degradation treated as a design constraint, not an edge case?	Yes — windowing or summarization strategy documented in spec	Tech lead
5	FC2 — Coordination	Is there a clarification gate — a decision point that blocks execution when input confidence falls below a defined threshold?	Yes — threshold and escalation path in architecture design	Architect
6	FC2 — Coordination	Are reasoning-action pairs logged at each agent step — can you compare stated intent to actual tool call?	Yes — logging deployed before staging tests begin	Platform team
7	FC2 — Coordination	Is shared task state visible to all agents — no agent operates on a partial or stale view of the current task definition?	Yes — single authoritative state source in design spec	Architect
8	FC2 — Coordination	Does system-wide monitoring cover inter-agent handoffs, not only per-agent spans?	Yes — cross-agent trace coverage confirmed	Platform team
9	FC3 — Verification	Does verification run at intermediate steps, not only at the terminal pipeline output?	Yes — multi-level verification in design spec	QA/Eval lead
10	FC3 — Verification	Does the verifier check output correctness — not just format or schema compliance?	Yes — correctness criterion documented for this agent type	QA/Eval lead
11	FC3 — Verification	Is an eval layer deployed that samples production output before the agent ships?	Yes — LLM-as-judge or domain validator in place	Platform team
12	FC3 — Verification	Has the eval layer been validated against human reviewers for this agent's specific task type?	Yes — ≥85% agreement on held-out examples before relying on it as a gate	QA/Eval lead
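Question 12's pass condition can be checked with a few lines once a human-labeled held-out set exists. This sketch uses raw agreement on pass/fail labels, which does not correct for chance agreement; the function names are illustrative and the `0.85` default mirrors the threshold in the table.

```python
def agreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of held-out examples where the judge and the human agree."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def judge_is_trusted(
    judge_labels: list[bool],
    human_labels: list[bool],
    threshold: float = 0.85,
) -> bool:
    """Gate check: only rely on the judge when agreement meets the threshold."""
    return agreement_rate(judge_labels, human_labels) >= threshold
```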

Who Reviews What: Ownership by MAST Category

Prevention doesn't survive shared responsibility. Each category needs a named owner with a clear pass/fail call — not a process that everyone participates in and nobody owns.

FC1 review belongs to the agent author and tech lead. It's the cheapest review to run — pulling the spec and working through the four FC1 gate questions takes under an hour. The failure mode for FC1 ownership is diffuse accountability: everyone knows the spec exists, nobody is responsible for verifying it's complete. Make the FC1 review an explicit deployment ticket item with a named approver, not an implicit assumption about what passed staging.

FC2 review belongs to the architect who designed the coordination protocol. If no architect owns the coordination design, FC2 review ownership is undefined — which means FC2 failures are shipping by default. This is a structural assignment the engineering lead makes before the agent is built, not after it's built. FC2 review is not asking whether the agent "communicates correctly with other agents." It's asking: does a clarification gate exist in the design spec? Are reasoning-action logs part of the implementation plan? Is shared task state a named artifact? These are architecture decisions, reviewed against the design doc, before implementation is considered complete.

FC3 review belongs to the platform or eval team with LLM evaluation tooling. FC3 review is the hardest to complete before shipping because it requires the eval layer to already exist. Teams that don't have an eval layer can't do FC3 review — and that itself is a deployment blocker. If the eval layer doesn't exist when the agent is ready to ship, the sequencing problem is upstream of the agent: the agent is blocked on platform infrastructure, not development. That's the right resolution. Shipping without FC3 coverage is a named risk decision, not an unnoticed gap.

The most common failure of this ownership model is treating FC3 as optional for "low-risk" agents. The MAST paper doesn't distinguish risk tiers — all three verification modes occur across agent types regardless of task complexity.^[1] The team decides the blast radius, not the taxonomy.

Ship When Staging Passes

Agent passes staging test cases → approved for production
Coordination looked fine in isolated tests → ship it
Output schema validates → verification complete
Production failure → postmortem → 'the spec was ambiguous'
Fix applied → redeploy → discover the next structural gap

Ship When MAST Gate Passes

FC1 gate passes (terminal state, role constraints, step guard) → move to FC2 review
FC2 gate passes (clarification gate, reasoning logs, shared state) → move to FC3 review
FC3 gate passes (multi-level verification, eval deployed, judge validated) → approved for production
Production failure → MAST tag → structural category identified in 5 minutes, FC and FM code recorded
Postmortem closes with the gate question that should have caught the failure — loop closes

FC1 — Pre-Deploy

Spec completeness: terminal states, role prohibitions, step-repetition guards. Owner: agent author + tech lead. Under 1 hour to review.

FC2 — Architecture Review

Coordination design: clarification gates, reasoning logs, shared task state. Owner: architect. Requires design doc, not just code.

FC3 — Eval Layer

Verification coverage: multi-level checks, correctness criterion, production sampling. Owner: platform/eval team. Blocks deploy if eval layer is absent.

Sequential Order

FC1 must pass before FC2 review. FC2 before FC3. Running eval coverage over a broken spec evaluates the wrong system.

Does the MAST pre-production gate apply to single-agent systems, not just multi-agent?

FC1 applies fully — specification issues are not specific to multi-agent orchestration. FM-1.1 through FM-1.5 all occur in single-agent systems with task specs. FC2 applies partially: FM-2.6 (Reasoning-Action Mismatch) and FM-2.2 (Failure to Ask for Clarification) both translate directly to single-agent systems, even without an explicit coordination protocol. FM-2.6 shows up whenever an agent's stated reasoning diverges from its tool call, which is system-agnostic. FC3 applies fully. For single-agent deployments, run the full FC1 gate, the two most common FC2 questions (5 and 6 in the table), and the full FC3 gate.

What if our team doesn't have an eval layer yet — do we block the deploy?

Yes, by design. The FC3 gate fails when no eval layer exists — and that's the intended behavior. The pre-production gate surfaces an infrastructure gap as a deployment blocker before the agent ships, not after the first production failure. The decision to ship without FC3 coverage is now an explicit risk call logged in the deployment ticket, not an unnoticed gap. For agents with bounded blast radius and reversible outputs, some teams make that call deliberately. For agents touching financial data, customer-facing output, or downstream automation chains, skipping FC3 belongs in the risk register.

Which MAST gate questions can be automated in CI, and which require human review?

FC1 questions 1–4 (terminal state, role constraints, step guard, context strategy) can be partially automated: a spec review prompt running against the system prompt and role definitions can flag missing termination conditions and absent prohibition lists. FC3 questions 9–11 are directly CI-integrable: the agent runs against a test suite, the eval pipeline scores outputs, and CI fails on quality threshold breach. Question 12 needs a human-labeled held-out set first; once that exists, the agreement check can run in CI as well. FC2 questions 5–8 require human architecture review, because no automated tool can verify that a clarification gate exists in the design before seeing the agent fail to ask at runtime. Automate the FC1 and FC3 checks where you can; keep FC2 as a human architecture sign-off.
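As a rough illustration of the partially automatable FC1 check, here is a keyword-based lint in place of the spec review prompt the answer describes. It is a swapped-in, much cruder technique: the required section labels and the vague-phrase pattern are hypothetical conventions a team would have to adopt, and a pass here does not mean the spec is complete.

```python
import re

# Hypothetical section labels a team might require in every agent spec.
REQUIRED_SECTIONS = {
    "terminal condition": re.compile(r"^\s*terminal condition\s*:", re.I | re.M),
    "not authorized list": re.compile(r"^\s*not authorized\s*:", re.I | re.M),
}

# Phrases like "when done" that name no checkable state.
VAGUE_TERMINATION = re.compile(
    r"\b(when|once)\s+(you are\s+)?(done|finished|complete)\b", re.I
)


def lint_spec(spec_text: str) -> list[str]:
    """Return findings; an empty list lets the CI step pass."""
    findings = [
        f"missing section: {name}"
        for name, pattern in REQUIRED_SECTIONS.items()
        if not pattern.search(spec_text)
    ]
    if VAGUE_TERMINATION.search(spec_text):
        findings.append("vague termination phrase found")
    return findings
```

A CI job would fail the build when `lint_spec` returns any findings and leave the judgment calls (is the terminal condition actually falsifiable?) to the named FC1 approver.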

How does this pre-production gate close the loop with the MAST postmortem process?

The loop closes through MAST tagging in postmortems. When an agent failure is tagged with an FC category and FM code in the postmortem (e.g., FC1/FM-1.5 — Agents Unaware of Termination Conditions), that tag maps directly to the pre-production gate question that should have blocked the deploy: Question 1 in the FC1 gate. The postmortem answers 'what failed.' The pre-production gate answers 'which review was skipped or gave a false pass.' Thirty days of MAST-tagged postmortems with gate attribution tells you whether your recurring failures trace to FC1 spec reviews that don't cover termination conditions, FC2 architecture reviews that skip clarification gate design, or FC3 eval coverage that tests format instead of correctness. That's not a quality metric — it's a process audit.
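The tag-to-gate attribution can be sketched as a lookup plus a counter. The category-to-question ranges come from the 12-question table; the only mode-level mapping included is the FM-1.5 to Question 1 example given above, and a team would need to fill in the rest themselves. The `FC1/FM-1.5` tag format is an assumption.

```python
from collections import Counter

# Question ranges per category, taken from the 12-question gate table.
GATE_QUESTIONS_BY_CATEGORY = {
    "FC1": range(1, 5),
    "FC2": range(5, 9),
    "FC3": range(9, 13),
}

# Only the mapping stated in the text; other modes are left for the team to assign.
GATE_QUESTION_BY_MODE = {"FM-1.5": 1}


def attribute(tag: str) -> dict:
    """Split a tag like 'FC1/FM-1.5' and map it to gate questions."""
    category, mode = tag.split("/")
    return {
        "category": category,
        "mode": mode,
        "candidate_questions": list(GATE_QUESTIONS_BY_CATEGORY[category]),
        "question": GATE_QUESTION_BY_MODE.get(mode),  # None when not yet mapped
    }


def failures_by_gate(tags: list[str]) -> Counter:
    """Count tagged postmortems per gate to see which review keeps missing."""
    return Counter(attribute(tag)["category"] for tag in tags)
```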

The MAST paper showed that targeted FC1 spec fixes improved performance by +15.6%, then immediately noted that simple fixes are insufficient — more fundamental system design changes are required.^[1] That's not a pessimistic conclusion. It's a precise one: the failure modes are structural, the fixes are structural, and the right time to make them is before the system runs.

Thirty deployed agents without FC1 gates, FC2 coordination design, or FC3 eval coverage are thirty scheduled postmortems. The failures are already written in the specs. The production incidents just haven't run yet.

The three gates force three decisions that every agent deployment requires anyway — made before shipping, with named owners, against explicit pass criteria, instead of discovered under pressure after a failure with an ambiguous root cause and no clear fix layer.

Key terms in this piece

MAST pre-production agent failure prevention
agent deployment gate engineering leaders
FC1 spec review before agent ships
multi-agent failure prevention checklist
LLM agent behavioral spec review
agent failure modes pre-deploy audit
MAST categories prevention playbook

Sources



Related

Model Selection Isn't a Configuration Choice. It's Architecture.

MAST Agent Failure Triage: 14 Failure Modes, 3 Root Causes, 1 Question Each

Agentic System Failure Modes: 7 Trace Signatures On-Call Teams Miss
