Q4 2025. One engineering team's biggest invoice line was $47,000 — billed across four weeks against two agents stuck in a coordination loop for eleven days. No exception. No alert. The system did exactly what it was built to do: each agent returned valid JSON to the other while the meter ran.[1]
The part that never makes it into the postmortem: those agents were promoted to multi-agent nine days after the single-agent prototype was declared "working." No baseline success rate. No tested manual override. No drift detection on the roadmap. They graduated because the single agent felt ready and multi-agent felt like the obvious next move.
The single-agent-first framework is the structural answer to that decision: three gates (Observability, Override Readiness, Behavioral Consistency), each with measurable pass criteria, each blocking promotion until evidence clears it. The gates are not a checklist. They are confidence signals. Each one closes a specific failure class that becomes dramatically harder to recover from once coordination logic is in the path.
This is not an argument against multi-agent systems. It is an argument that the right to build one has to be earned.
What this covers
- ✓ Why reliability math makes multi-agent systems riskier than they look on paper
- ✓ The MAST failure taxonomy: 14 failure modes, 3 root categories, all architectural
- ✓ New research: single agents match or beat multi-agent under equal compute budgets
- ✓ Gate 1 (Observability): specific thresholds and a runnable pass/fail check
- ✓ Gate 2 (Override Readiness): circuit breaker pattern with working Python implementation
- ✓ Gate 3 (Behavioral Consistency): agent drift measurement and the 30-day stability window
- ✓ Decision matrix: when multi-agent genuinely earns its complexity
- ✓ Monday-morning rules list for teams currently in the promotion decision
Pilot failure has climbed from 60% in 2024 to 78% in Q1 2026 — while model capability improved. The driver is orchestration complexity, not model quality.[6]
One team reported MTTR moving from 18 minutes to 67 minutes after promoting a single agent to a multi-agent pipeline. The failure surface is now distributed across coordination logic, not localized to one component.[5]
Standard probability: three agents at 97% behind an orchestrator at 99% lands at ~90% end-to-end. Each agent you add multiplies the failure surface.[4]
Reliability Compounds. So Does the Bill.
The math is multiplicative. The failure topology is distributed. Both push in the same direction.
Reliability across multiple agents multiplies. It does not add. Orchestrator at 99%, three subagents at 97%, end-to-end lands at roughly 99% × 97% × 97% × 97% ≈ 90%.[4] Four components in the high nineties on the slide deck, nearly ten percentage points of failure surface in production. Every additional agent adds another factor that has to land green for a request to succeed.
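The multiplication is easy to check directly. The following is a minimal sketch, assuming independent failures and a strictly serial chain in which every component must succeed; the function name is illustrative.

```python
from math import prod


def end_to_end_reliability(component_success_rates):
    """Probability that every component in a serial chain succeeds.

    Assumes failures are independent, the same simplification the
    multiplication in the text makes.
    """
    return prod(component_success_rates)


orchestrator = 0.99
subagents = [0.97, 0.97, 0.97]

three_agents = end_to_end_reliability([orchestrator, *subagents])
four_agents = end_to_end_reliability([orchestrator, *subagents, 0.97])

print(f"orchestrator + 3 subagents: {three_agents:.1%}")
print(f"orchestrator + 4 subagents: {four_agents:.1%}")
```

For these inputs the product comes out at about 90.4%, and a fourth 97% subagent takes it to about 87.6%. Real systems with retries or correlated failures will not follow this model exactly.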
The topology compounds the math. Single-agent failures are local — one component, one trace, one fix. Multi-agent failures travel: a handoff carries a wrong assumption forward, a subagent reasons over stale context from three steps back, an orchestrator misreads a subagent's output and routes the session into the wrong branch. One team reported MTTR moving from 18 minutes to 67 minutes after promoting to a three-agent customer service pipeline.[5] The system was not worse. The investigation surface was bigger.
Cost moves the same way. A four-agent research pipeline that runs cleanly at $3.50 per session in development can spike to $40+ per failed session in production once retries, context re-injection, and cascading delegation start to compound. One modeled scenario — 50% failure rate against 1,000 daily sessions — produces $10,950 of daily spend against a $3,500 baseline.[8] Real numbers depend on your failure rate, agent count, retry config. The direction does not.
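The modeled scenario can be reproduced with a two-term cost model. This is a sketch under stated assumptions: successful sessions stay at $3.50, and the $18.40 average for a failed session is not in the source; it is simply the value that reproduces the quoted $10,950 total.

```python
def daily_spend(sessions, failure_rate, cost_ok_usd, cost_failed_usd):
    """Expected daily spend when a fraction of sessions fail expensively."""
    failed = sessions * failure_rate
    succeeded = sessions - failed
    return succeeded * cost_ok_usd + failed * cost_failed_usd


SESSIONS_PER_DAY = 1_000
CLEAN_SESSION_USD = 3.50
# Assumption: back-solved from the quoted $10,950 total, not stated in the source.
IMPLIED_FAILED_SESSION_USD = 18.40

baseline = daily_spend(SESSIONS_PER_DAY, 0.0, CLEAN_SESSION_USD, IMPLIED_FAILED_SESSION_USD)
modeled = daily_spend(SESSIONS_PER_DAY, 0.5, CLEAN_SESSION_USD, IMPLIED_FAILED_SESSION_USD)

print(f"baseline: ${baseline:,.0f}/day")
print(f"50% failure: ${modeled:,.0f}/day ({modeled / baseline:.1f}x baseline)")
```

Swapping in your own failure rate and measured failed-session cost shows how quickly the multiplier moves.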
None of this is an argument against multi-agent. It is an argument that the decision to add orchestration deserves the rigor you would apply to any architectural change that multiplies blast radius.
Without the gates:
- Promote to multi-agent when the single agent 'feels ready'
- Discover failure modes after they hit production at volume
- MTTR averages 67 minutes, tracing across coordination logic
- No behavioral baseline. Drift between deployments is invisible
- Override is an ad-hoc kill command with no clean-state guarantee
- $47K loop runs eleven days before anyone checks the bill
With the gates:
- Promote only when three gates pass with documented evidence
- Failure modes flushed out at single-agent scale before scale exists
- Any production failure traces to a specific step in under five minutes
- 30-day behavioral baseline catches drift before it reaches the orchestrator
- Override path drilled monthly with verified clean-state recovery
- Session budget cap is a hard stop. Cost runaway dies at the session boundary
Failures Are Architectural. Prompt Tuning Won't Save You.
The systematic evidence on where multi-agent systems break — and what that implies for the promotion decision.
The most useful research on multi-agent failure is not a vendor blog. It is MAST — Multi-Agent System Failure Taxonomy — a UC Berkeley paper from March 2025 that analyzed 1,600+ execution traces across seven open-source frameworks: MetaGPT, ChatDev, HyperAgent, OpenManus, AppWorld, Magentic, AG2.[2][3]
MAST identified 14 distinct failure modes, organized into three categories: system design issues, inter-agent misalignment, task verification failures. The most common single mode was step repetition — the agent loops previously executed steps without progress — at 15.7% of all annotated failures. Disobeying task specification came in at 11.8%. Unrecognized termination conditions at 12.4%. Action-reasoning mismatch — chain-of-thought concludes one thing, the action taken contradicts it — is present in a staggering 92% of Kimi-K2's failures when tested against the IT-Bench SRE benchmark.[11]
The headline: multi-agent system (MAS) failures are architectural, not capability gaps. ChatDev, a widely used multi-agent coding framework running on state-of-the-art models, landed at 33.33% correctness on the ProgramDev benchmark. Targeted interventions such as prompt engineering for memory-related failures yield at most about a 15.6% performance improvement. Introducing new structural mechanisms, such as a Summarizer Agent or explicit context management, can achieve up to 53% improvement. The paper's conclusion: mitigating these failures requires changes in system design, not surface prompt tuning.[2][11]
The implication for promotion is direct. If your single agent has undocumented failure modes — patterns you have not seen because your observability is thin — you carry those patterns forward into a multi-agent system where they interact, compound, and become much harder to isolate. Gate 1 exists precisely to flush those patterns out at single-agent scale, while debugging is still tractable.
The Compute Parity Problem: Single Agents Already Win at Equal Budget
New research flips the standard multi-agent justification — and sharpens what 'genuinely needs orchestration' actually means.
Multi-agent systems report strong benchmark numbers. Most of those benchmarks give multi-agent pipelines proportionally more compute than the single-agent baseline — more tokens, more inference calls, more steps. That is not a comparison. It is an uncontrolled experiment.
Tran and Kiela (April 2026) ran the controlled version.[9] Their paper — Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets — tested Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5 across multi-hop reasoning benchmarks, holding total thinking-token budget constant across conditions. Single agents matched or exceeded multi-agent performance across all three model families.
The theoretical backbone is the Data Processing Inequality: every inter-agent handoff can only lose information, never create it. Each time an orchestrator summarizes a subagent's output, some signal is lost. Each time a subagent reasons from a truncated version of the original context, it starts from an information deficit. A single agent with access to the full context is — under equal compute — more information-efficient than a pipeline that re-summarizes across handoffs.
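Stated formally, the inequality says that processing cannot increase mutual information. As an illustrative mapping (an idealization, not a claim about the paper's exact setup), let X be the original task context, Y the output a subagent hands over, and Z whatever the orchestrator derives from that output alone:

```latex
\[
X \;\to\; Y \;\to\; Z
\quad\Longrightarrow\quad
I(X; Z) \;\le\; I(X; Y)
\]
```

Whatever Z knows about X, it learned through Y, so a lossy summary at the handoff caps everything downstream.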
Multi-agent systems become competitive when a single agent's effective context utilization genuinely degrades — long document analysis where context windows hit physical limits, or tasks where parallel execution compresses wall-clock time in ways that matter. That is the narrow band where orchestration earns its complexity. Outside it, the architectural overhead costs you and pays you nothing.
Three Gates. Each Closes a Failure Class.
Gates with measurable pass criteria. Each one gates a phase. Each one blocks a class of incident orchestration would amplify.
Treat agent maturity the way you treat a deployment pipeline: nothing moves until the previous phase passes, and passage requires evidence, not intent.
Each gate targets one failure class. Gate 1 (Observability) prevents undebuggable failures — incidents where you have no trace of what the agent was doing or why. Gate 2 (Override Readiness) prevents runaway cost and irreversible external actions taken by an agent that has gone sideways. Gate 3 (Behavioral Consistency) prevents drift — the slow degradation that stays invisible until it surfaces as a coordinated incident at multi-agent scale.
The framework does not block multi-agent. It demands you have answered three questions with artifacts: can you debug failures? Can you stop and recover from failures? Can you detect when behavior is drifting before it costs you? Three yeses, with evidence — that is the entry ticket.
- [01] Gate 1: Observability - Can Any Failure Be Debugged?
The Gate 1 test is not 'do you have logging.' It is 'can you reconstruct the root cause of any session failure from your traces in under five minutes?' If the answer is no for any failure class, the gate fails. The constraint exists because multi-agent failures are 3.7× harder to debug, and you cannot back-fill instrumentation after orchestration is in the path. OpenTelemetry's GenAI semantic conventions are the emerging standard for vendor-neutral agent tracing — structured spans for every model invocation, tool call, memory read, and decision branch with causal links between steps.
- [02] Gate 2: Override Readiness - Can You Stop and Recover?
An eleven-day cost runaway is what happens when no override path was ever drilled. Gate 2 demands that you can halt a misbehaving session within 30 seconds, that you can recover to a clean state without data corruption, and that a hard budget cap enforces itself without a human in the loop. 'Drilled' means evidence from the last monthly run, not a runbook entry that says you could. Multi-layered cost controls are now standard: per-session caps, per-agent-per-hour caps, and a loop detector that fires when a session exceeds a step threshold without forward progress.
- [03] Gate 3: Behavioral Consistency - Can You See Drift Before It Hurts?
Multi-agent topologies amplify drift. Research on agent behavioral degradation (Agent Drift, arXiv 2601.04170) finds that mitigation strategies reduce error rates by 67–81% — but only when drift is measured and detected early.[10] A prompt change that nudges single-agent success by 4% will move a three-agent pipeline by more, because each agent's drift compounds along the chain. Gate 3 demands 30 consecutive days of stable behavior at single-agent scale and an eval suite that covers the failure surface — before drift becomes a multi-agent problem to manage.
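Gate 1's five-minute reconstruction test depends on every step carrying a causal link to the step that triggered it. The sketch below shows that idea with the standard library only. The `StepSpan` fields and the `causal_chain` helper are hypothetical names for illustration; this is not the OpenTelemetry API or its GenAI schema.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StepSpan:
    """One traced step. Field names are illustrative, not an OpenTelemetry schema."""
    session_id: str
    kind: str                        # e.g. "model_call", "tool_call", "decision"
    name: str
    parent_id: Optional[str] = None  # causal link to the step that triggered this one
    status: str = "ok"
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def causal_chain(spans, failed_span_id):
    """Walk parent links from a failed span back to the session root."""
    by_id = {span.span_id: span for span in spans}
    chain = []
    current = by_id.get(failed_span_id)
    while current is not None:
        chain.append(current)
        current = by_id.get(current.parent_id)
    return chain[::-1]  # root first, failing step last


root = StepSpan("sess-42", "model_call", "plan")
branch = StepSpan("sess-42", "decision", "route_to_refund", parent_id=root.span_id)
failed = StepSpan("sess-42", "tool_call", "refund_api", parent_id=branch.span_id, status="error")

chain = causal_chain([root, branch, failed], failed.span_id)
print(" -> ".join(f"{span.name}[{span.status}]" for span in chain))
```

If a failed step cannot be walked back to its root this way, that failure counts against the "no traceable root step" threshold.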
agent_gate_check.py

```python
# Gate 1 pass/fail check. Run before flipping Observability to PASSED.
# Every threshold has to clear. One failure blocks promotion.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Gate1Check:
    """Observability gate: every failure has to be debuggable."""

    # Coverage requirements: must be 1.0 (100%), except outcome labeling (>= 0.99)
    session_correlation_coverage: float  # fraction of sessions with a session_id
    tool_call_logged_rate: float         # fraction of tool calls with input+output logged
    outcome_labeling_coverage: float     # fraction of sessions with an outcome label

    # Operational visibility: all must be True
    cost_per_success_tracked: bool       # cost per completed task, not token totals
    step_count_per_session_tracked: bool
    p95_step_latency_tracked: bool

    # Debuggability requirement
    max_reconstruction_time_minutes: float  # worst-case trace reconstruction time from logs
    undebugable_failures_last_30d: int      # failures with no traceable root step

    def passed(self) -> tuple[bool, Optional[str]]:
        if self.session_correlation_coverage < 1.0:
            return False, f"Session correlation coverage {self.session_correlation_coverage:.1%}, must be 100%"
        if self.tool_call_logged_rate < 1.0:
            return False, f"Tool call log coverage {self.tool_call_logged_rate:.1%}, must be 100%"
        if self.outcome_labeling_coverage < 0.99:
            return False, f"Outcome labeling coverage {self.outcome_labeling_coverage:.1%}, must be ≥99%"
        if not self.cost_per_success_tracked:
            return False, "cost_per_success not tracked: token totals are an alibi, not a metric"
        # The two remaining visibility flags were declared as required but never checked.
        if not self.step_count_per_session_tracked:
            return False, "step count per session not tracked"
        if not self.p95_step_latency_tracked:
            return False, "p95 step latency not tracked"
        if self.max_reconstruction_time_minutes > 5:
            return False, f"Worst-case trace reconstruction {self.max_reconstruction_time_minutes}min, must be ≤5min"
        if self.undebugable_failures_last_30d > 0:
            return False, f"{self.undebugable_failures_last_30d} failures with no root step: every failure must be traceable"
        return True, None


# Run after the 30-day production window
check = Gate1Check(
    session_correlation_coverage=1.0,
    tool_call_logged_rate=0.97,  # FAIL: 3% of tool calls missing
    outcome_labeling_coverage=0.999,
    cost_per_success_tracked=True,
    step_count_per_session_tracked=True,
    p95_step_latency_tracked=True,
    max_reconstruction_time_minutes=4.2,
    undebugable_failures_last_30d=0,
)
passed, reason = check.passed()
print(f"Gate 1: {'PASS' if passed else 'FAIL'}")
if reason:
    print(f"Blocking issue: {reason}")
# Gate 1: FAIL
# Blocking issue: Tool call log coverage 97.0%, must be 100%
```

agent_budget_circuit_breaker.py

```python
# Gate 2: Session-level budget circuit breaker.
# Hard-stops any session that exceeds cost OR step thresholds.
# Wire this into your agent loop as a raised exception, not as a soft alert.
from dataclasses import dataclass, field
from typing import Optional
import time


class BudgetExceeded(Exception):
    """Raised when a session crosses a hard cost or wall-clock limit."""


class LoopDetected(Exception):
    """Raised when a session makes no forward progress past the step ceiling."""


@dataclass
class SessionBudgetGuard:
    """
    Hard limits for a single agent session.
    Raise on violation; never just log and continue.
    """

    max_cost_usd: float = 5.0    # hard ceiling per session
    max_steps: int = 20          # step ceiling; beyond this without success = loop
    max_wall_seconds: int = 300  # 5-minute wall-clock limit

    _cost_usd: float = field(default=0.0, init=False)
    _steps: int = field(default=0, init=False)
    _started_at: float = field(default_factory=time.monotonic, init=False)
    _last_success_step: Optional[int] = field(default=None, init=False)

    def record_step(self, cost_usd: float, produced_output: bool) -> None:
        """Call after every agent step. Raises on any limit breach."""
        self._steps += 1
        self._cost_usd += cost_usd
        if produced_output:
            self._last_success_step = self._steps
        elapsed = time.monotonic() - self._started_at

        if self._cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"Session cost ${self._cost_usd:.2f} exceeded hard cap ${self.max_cost_usd:.2f}"
            )
        if elapsed > self.max_wall_seconds:
            raise BudgetExceeded(
                f"Session wall time {elapsed:.0f}s exceeded limit {self.max_wall_seconds}s"
            )
        if self._steps >= self.max_steps:
            # If we haven't had a successful output in the last half of our steps, it's a loop
            steps_since_success = (
                self._steps - self._last_success_step
                if self._last_success_step is not None
                else self._steps
            )
            if steps_since_success > self.max_steps // 2:
                raise LoopDetected(
                    f"No forward progress in {steps_since_success} steps: loop detected"
                )


# Usage in an agent loop.
# `agent` and `metrics` stand in for your own agent runtime and metrics client.
guard = SessionBudgetGuard(max_cost_usd=5.0, max_steps=20)
try:
    for step in agent.run():
        guard.record_step(
            cost_usd=step.token_cost,
            produced_output=step.has_output,
        )
except BudgetExceeded as e:
    agent.terminate(reason=str(e))
    metrics.increment("session.budget_exceeded")
except LoopDetected as e:
    agent.terminate(reason=str(e))
    metrics.increment("session.loop_detected")
```

Agent Drift: The Failure Mode That Looks Like Stability
Behavioral degradation in production agents is gradual, measurable, and already well-documented. The gate exists because drift you can see at single-agent scale becomes drift you can't isolate at multi-agent scale.
Agent drift is not a configuration problem. It is a property of the system over time. Rath et al. (January 2026) define it as the progressive degradation of agent behavior, decision quality, and inter-agent coherence across extended interaction sequences.[10] Their measurement framework — the Agent Stability Index, a composite across twelve dimensions including response consistency, tool usage patterns, reasoning pathway stability, and inter-agent agreement rates — found that behavioral degradation could affect nearly half of long-running agents. The projected task success rate reduction: 42%. The human intervention rate increase: 3.2×.
Three mitigation strategies reduced error rates by 67–81% — episodic memory consolidation, drift-aware routing, and adaptive behavioral anchoring. All three require knowing your baseline. You cannot detect drift without a prior measurement to drift from.[10]
This is why Gate 3's 30-day window is not bureaucratic caution. It is the minimum observation period to establish a behavioral baseline worth anchoring to. A one-week snapshot catches acute regressions. It misses the slow drift that looks like normal variance week over week but compounds over a month. And in a three-agent pipeline, each agent's monthly drift compounds along the chain before it surfaces as an incident.
The practical implementation: run your eval suite on a daily production sample. Track the score distribution, not just the mean. A mean that holds while the tail gets heavier is still drift — you have just not waited for the tail to bite yet.
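A stable mean with a heavier tail is easy to miss unless the tail is measured explicitly. The sketch below compares the 10th percentile of a day's eval scores against a baseline. The function names and the 0.05 drop threshold are assumptions for illustration, not values from the cited research.

```python
from statistics import mean, quantiles


def summarize(scores):
    """Mean and 10th percentile (first decile cut point) of eval scores."""
    return {"mean": mean(scores), "p10": quantiles(scores, n=10)[0]}


def tail_drift(baseline_scores, current_scores, max_p10_drop=0.05):
    """True when the low tail has dropped by more than max_p10_drop.

    The 0.05 default is an arbitrary illustrative threshold.
    """
    base = summarize(baseline_scores)
    current = summarize(current_scores)
    return (base["p10"] - current["p10"]) > max_p10_drop


# Same mean (0.90) in both samples, but the second has a heavier low tail.
baseline_day = [0.90] * 20
drifting_day = [0.95] * 18 + [0.45] * 2

print(summarize(baseline_day))
print(summarize(drifting_day))
print("tail drift detected:", tail_drift(baseline_day, drifting_day))
```

A mean-only alert would stay silent on the second sample; the tail check is what flags it.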
Gate Thresholds at a Glance
Every row in a gate has to pass. One missing row blocks promotion.
| Gate | What It Measures | Threshold | Why It Holds |
|---|---|---|---|
| Gate 1: Observability | Session trace completeness, tool call logging, outcome labeling, debuggability | 100% session correlation, cost-per-success tracked, every failure root-traced under 5 min, zero 'unknown cause' closures in 30 days | Multi-agent failures are 3.7× harder to debug.[5] Instrumentation that is missing now becomes a permanent blind spot once orchestration ships |
| Gate 2: Override Readiness | Manual kill path, session budget cap, loop detection, rollback quality, human review queue | Override confirmed under 30 sec, hard budget cap enforced, loop detector live at ≥20 steps, rollback produces clean state, monthly drill on the books | The $47K loop ran eleven days because no override path had been drilled.[1] Override is verified or it is fiction |
| Gate 3: Behavioral Consistency | Success rate stability, eval coverage, drift detection, first-seen failure rate | ±3% week-over-week variance for 30 consecutive days, evals cover top 20 failure patterns, no new failure modes in 14 days | MAST: failures are architectural, not prompt problems.[2] Agent drift research shows 42% task success reduction in drifting agents.[10] A 30-day window surfaces both |
When Single-Agent Genuinely Runs Out of Room
The narrow set of conditions where multi-agent earns its complexity — and what the promotion decision should actually look like.
One team ran a multi-agent customer service pipeline for three months before benchmarking against the single-agent alternative. Accuracy delta: 2.1 percentage points (94.3% vs 92.2%). Monthly cost delta: $24,700 in orchestration overhead.[5] The right call depended on whether 2.1 points were worth the complexity. They never asked the question before shipping, which means they never calculated the breakeven.
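The breakeven that team skipped is a one-line division. A minimal sketch follows, using the accuracy and overhead figures above. The monthly session volume of 100,000 is a made-up input, since the source does not give one.

```python
def cost_per_extra_correct_session(monthly_overhead_usd, accuracy_gain, monthly_sessions):
    """Overhead paid for each additional session the multi-agent version gets right."""
    extra_correct_sessions = accuracy_gain * monthly_sessions
    return monthly_overhead_usd / extra_correct_sessions


MONTHLY_OVERHEAD_USD = 24_700
ACCURACY_GAIN = 0.943 - 0.922  # 2.1 percentage points
MONTHLY_SESSIONS = 100_000     # assumption: not given in the source

breakeven = cost_per_extra_correct_session(
    MONTHLY_OVERHEAD_USD, ACCURACY_GAIN, MONTHLY_SESSIONS
)
print(f"Each additional correct session costs about ${breakeven:.2f}")
```

At that assumed volume the overhead works out to a little under $12 per additional correct session. Multi-agent pays for itself only if a corrected session is worth more than that to the business.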
The cases where multi-agent earns its complexity are real. They are also narrower than the discourse suggests — and narrower still after Tran and Kiela's compute-controlled benchmarks.[9]
Genuinely parallel subtasks. Three independent tasks running sequentially can be parallelized via specialized agents to compress wall-clock latency. The operative word is independent — one agent's output does not feed another's input. The moment that dependency exists, sequential single-agent chaining is usually cleaner.
Context window exhaustion. A single agent session that legitimately needs more context than fits — full document analysis, large codebase traversal — justifies decomposition into specialized subagents with scoped context. Note: this is a physical constraint, not a preference.
Verification and self-critique. A second agent playing adversarial reviewer against the first agent's output is a real leverage point for high-stakes decisions. It works as a deliberate two-step pattern. Not as a sprawling five-agent pipeline.
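The "genuinely parallel subtasks" case can be sketched with `asyncio.gather`. The subtask names are hypothetical, and `asyncio.sleep` stands in for real subagent calls.

```python
import asyncio


async def run_subtask(name, seconds):
    """Stand-in for an independent subagent call; sleeps instead of calling a model."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def fan_out():
    # No subtask reads another's output, so all three can run concurrently.
    # gather() returns results in the order the awaitables were passed in.
    return await asyncio.gather(
        run_subtask("pricing", 0.1),
        run_subtask("inventory", 0.1),
        run_subtask("reviews", 0.1),
    )


results = asyncio.run(fan_out())
print(results)
```

If one subtask needed another's result, the `gather` call would have to become sequential awaits, which is the signal that a single-agent chain is the simpler design.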
The promotion decision itself deserves the same rigor as a production deployment: a specific proposal, a named owner, documented justification for why the three-gate-proven single agent is no longer enough, a rollback plan if multi-agent underperforms. Teams that treat promotion as natural evolution rather than an architectural decision are the ones who find $47,000 of API charges four weeks later.[1]
| Scenario | Single-Agent? | Multi-Agent? | Key Signal |
|---|---|---|---|
| Linear task, single context window fits | Yes — default choice | No — overhead without benefit | No parallelism available; full context fits in one session |
| Context window exhaustion — document too large | No — physical limit | Yes — decompose with scoped subagents | Context length hits model ceiling, not a preference |
| Genuinely parallel subtasks with no inter-dependency | No — sequential is slow | Yes — fan-out to specialized agents | Subtask A does not read subtask B's output |
| High-stakes output needing adversarial review | Marginal — single pass | Yes — two-agent: generator + critic | Error cost justifies the verification overhead |
| Task where single agent hits 92% accuracy | Yes, if the accuracy target is ≤92% | Only if the 2.1-point gap is worth ~$24.7K/month overhead | Run the breakeven calculation before promoting |
| Multi-agent because 'it seems more capable' | Yes — default choice | No — Tran-Kiela: single agent matches under equal compute | There is no named bottleneck — the justification is vibes |
What to Do on Monday
Concrete actions for teams currently facing a promotion decision or mid-way through one.
Single-Agent First: Operational Rules
Name the specific bottleneck before opening the multi-agent ticket
Context window exhaustion, verified accuracy gap, genuine parallelism need — one of these, with a measurement. 'Multi-agent feels more powerful' is not a bottleneck.
If you can't reconstruct last week's failure in under 5 minutes, Gate 1 is not passed
Pull up your traces and time yourself. If you reach for Slack or ask a teammate before reaching the root step, your instrumentation is thin. Fix it before building the orchestrator.
The budget cap must raise an exception, not send an alert
Alerts depend on humans. Exceptions stop the session. The $47K incident had no exception — it had no cap at all. Wire the circuit breaker into the agent loop itself.
Drill the override path before the next sprint ends, not before the incident
Monthly evidence means a calendar event and a logged outcome, not a runbook that says you could do it. Run the drill. Log the execution time.
Track success rate as a distribution, not a mean
A stable mean with a widening tail is drift you haven't caught yet. Plot the weekly distribution. Alert when the P10 drops, not just when the mean does.
Each new capability added to a single agent resets the Gate 3 observation clock for that capability
You can't carry behavioral stability credit from the old agent to the expanded agent. New capability surface means new observation period before that surface goes into the orchestrated system.
Multi-Agent Promotion Readiness
Gate 1 PASSED: any session failure reconstructable from traces in under 5 minutes
Gate 1 PASSED: cost per successful task tracked as a first-class metric
Gate 1 PASSED: zero 'unknown cause' failure closures in the last 30 days
Gate 2 PASSED: override path drilled in the last 30 days, under 30-second termination confirmed
Gate 2 PASSED: session budget cap enforces a hard stop, not a soft alert
Gate 2 PASSED: loop detector live — fires when session exceeds 20 steps without forward progress
Gate 2 PASSED: rollback to prior version drilled with clean state confirmed
Gate 3 PASSED: success rate within ±3% week-over-week for 30 consecutive days
Gate 3 PASSED: eval suite covers top 20 failure patterns from production traces
Gate 3 PASSED: tool schema drift detection live with under 15-minute alert latency
Gate 3 PASSED: zero first-seen failure modes in the last 14 days
Multi-agent justification documented: a specific bottleneck single-agent cannot close
Compute parity check done: single agent tested under equal token budget before committing to multi-agent
Rollback plan for the multi-agent version named before first deployment
Operating Questions
Practical questions from platform engineers and engineering leads working through the gates
What counts as a 'passing' baseline success rate before Gate 3?
Stability, not absolute level. A 72% success rate that holds at 72% ± 3% for 30 consecutive days passes Gate 3. A 95% success rate that swings between 88% and 97% over the same window does not. Instability at single-agent scale becomes amplified instability at multi-agent scale, and the 30-day window exists to surface it before you are tracing the wobble across coordination layers. If your absolute success rate is unacceptably low, fix it. That is a different problem from the gate.
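One way to turn the stability criterion into a check is to compare consecutive weekly success rates. The sketch below interprets "±3%" as three absolute percentage points between adjacent weeks, which is an assumption. A band around a fixed baseline would be an equally reasonable reading.

```python
def weekly_stability(weekly_success_rates, tolerance=0.03):
    """True when every week-over-week change stays within +/- tolerance.

    Rates are fractions (0.72 means 72%); tolerance is in absolute points.
    """
    changes = [
        abs(later - earlier)
        for earlier, later in zip(weekly_success_rates, weekly_success_rates[1:])
    ]
    return all(change <= tolerance for change in changes)


# Five weekly readings covering roughly a 30-day window (illustrative data).
steady_but_low = [0.72, 0.73, 0.71, 0.72, 0.74]
high_but_swinging = [0.95, 0.88, 0.97, 0.91, 0.96]

print("72% agent passes:", weekly_stability(steady_but_low))
print("95% agent passes:", weekly_stability(high_but_swinging))
```

The lower but steadier agent passes and the higher but swinging one does not, which is the point of the gate.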
We're under deadline pressure to ship multi-agent faster. Can we run the gates in parallel?
Gates 2 and 3 have temporal components — a drilled override needs monthly evidence; Gate 3 needs 30 consecutive days — so parallelizing them with Gate 1 buys nothing. You can build override infrastructure during the Gate 1 observability window. You can build evals during the observation period. What you cannot compress is the observation window itself. Slow-moving drift will not show up on a one-time test. If the deadline is real, the honest move is scope reduction: the smallest single-agent workflow that clears all three gates, not a large multi-agent system that clears none.
Our agent talks to external systems we don't fully control. How does the override gate handle that?
Gate 2 demands human review queues for irreversible action classes. It does not demand control over the external system. If your agent can trigger a payment, send an email, or modify an external record, those action types need a review queue before the action fires — not a kill switch after. The gate is asking: can you stop an irreversible action by a misbehaving agent before it executes? If the answer is no for any action class, that is a blocker. The fix is usually a draft-mode pattern: the agent stages the action and queues it for human approval instead of firing directly.
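The draft-mode pattern can be sketched as a gate that stages irreversible actions instead of executing them. The action names, the `ActionGate` class, and the in-memory queue are illustrative assumptions; a real system would use a durable queue and an approval UI.

```python
from dataclasses import dataclass, field

# Assumption: which action types count as irreversible is decided per system.
IRREVERSIBLE_ACTIONS = {"send_payment", "send_email", "update_external_record"}


@dataclass
class StagedAction:
    action_type: str
    payload: dict


@dataclass
class ActionGate:
    """Executes reversible actions directly; stages irreversible ones for review."""
    review_queue: list = field(default_factory=list)

    def submit(self, action_type, payload, execute):
        if action_type in IRREVERSIBLE_ACTIONS:
            self.review_queue.append(StagedAction(action_type, payload))
            return "queued_for_review"
        execute(payload)
        return "executed"

    def approve(self, index, execute):
        """Called by a human reviewer: fire the staged action and remove it."""
        action = self.review_queue.pop(index)
        execute(action.payload)
        return action


executed = []
gate = ActionGate()
print(gate.submit("lookup_order", {"order_id": 17}, executed.append))
print(gate.submit("send_payment", {"amount_usd": 250}, executed.append))
print("pending review:", len(gate.review_queue))
```

Stopping the agent at any point leaves the payment unsent, because nothing irreversible fires until `approve` is called.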
What if my single agent's failure modes keep changing as I add new capabilities?
That is the gate working. Gate 3 demands no first-seen failure modes in the last 14 days before promotion. Shipping new capabilities resets the observation clock on those capability areas. The implication: scope the initial promotion against a stable, well-defined capability set, not against a rapidly evolving agent. New capabilities can be added later, but each significant expansion of the single agent's capability surface should trigger re-evaluation of Gates 1 and 3 for that capability: instrumented, observed, and confirmed stable before that capability is promoted into the multi-agent context.
Is there a case where you skip the framework and ship multi-agent first?
One narrow case. A proof-of-concept where the explicit goal is to learn multi-agent failure modes at small scale, with no production traffic, no real user data, and a hard budget cap enforced before the first run. Exploration is legitimate. The framework governs production deployments, not experiments. The mistake is treating a successful PoC as evidence of production readiness. MAST showed even popular, well-maintained multi-agent frameworks had failure rates that made them unsuitable for production without architectural revision.[2] A clean demo means multi-agent is worth investigating. It does not mean it is ready to serve users.
The Tran-Kiela paper says single agents match multi-agent at equal compute. Does that mean multi-agent is never worth it?
No — it means multi-agent justification now requires a higher bar than 'the benchmark numbers looked better.' The paper controls for compute, which most benchmark comparisons don't. Under that control, single agents hold. Multi-agent wins when there is a genuine physical constraint — context window exhaustion, true task parallelism — or when the coordination overhead is demonstrably paid back by an accuracy or latency gain you have measured, not assumed. The decision matrix in this article covers the cases. If your scenario isn't in the 'yes' column, you're adding complexity for noise.
How do I set the session step limit for Gate 2's loop detector?
Start with 20 steps as the ceiling and monitor the P95 step count of successful sessions over your first 30-day window. If your legitimate successful sessions peak at 12 steps, tighten to 18. If complex tasks routinely need 30, adjust up, but document the reasoning and add a secondary check: sessions beyond 30 steps that haven't produced a partial output in the last 10 steps should fire a loop alert regardless. The circuit-breaker example in this article combines the two ideas in one check: once a session reaches the step ceiling, it raises if more than half the ceiling's worth of steps have passed without output.
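Deriving the ceiling from observed data might look like the sketch below. The 1.5× headroom factor is an assumption inferred from the "12 steps, tighten to 18" example, not a stated rule, and the sample step counts are made up.

```python
import math
from statistics import quantiles


def suggested_step_ceiling(successful_step_counts, headroom=1.5):
    """P95 step count of successful sessions, scaled by a headroom factor.

    headroom=1.5 is an assumption, chosen so that a P95 of 12 suggests 18.
    """
    p95 = quantiles(successful_step_counts, n=20)[-1]  # last of 19 cut points
    return math.ceil(p95 * headroom)


# Illustrative step counts from 20 successful sessions.
observed = [6, 7, 7, 8, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12]
print("suggested max_steps:", suggested_step_ceiling(observed))
```

Re-run the calculation whenever the task mix changes, since a ceiling tuned on short tasks will misfire on legitimately longer ones.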
The teams who build reliable multi-agent systems are the ones who treated the single-agent phase as the real engineering work — not the prototype. Three gates, evidence-based, in order. The orchestrator gets built on a foundation that has already been stress-tested, debugged, and proven stable. That is not caution. It is how you avoid finding $47,000 on your next invoice.
On the statistics cited in this article
The $47K incident[1] is documented across multiple independent sources reporting the same event. The 78% pilot failure rate[6] is industry reporting from early 2026, not primary survey data — directional, not precise. The MTTR figures (18 min vs 67 min)[5] come from one team's reported experience at Iterathon, not a broad study. The compound reliability math is standard probability applied to the scenario in [4], not measured production data. The MAST findings[2][3] come from controlled benchmark evaluation of open-source frameworks, which may not reflect production deployments with stronger engineering practices. The IT-Bench/MAST enterprise findings[11] are from 310 annotated traces across three model classes — a meaningful sample but not representative of all enterprise deployments. The Tran-Kiela results[9] apply to multi-hop reasoning benchmarks under controlled compute — results on other task types may differ. The agent drift projections[10] are research estimates with projected effectiveness ranges, not measured production outcomes. Use these numbers for directional intuition, not SLO targets.
- [1] We Spent $47K on AI Agents in Production. Here's What Nobody Tells You. Towards AI (Oct 2025). pub.towardsai.net
- [2] Why Do Multi-Agent LLM Systems Fail? (MAST). arXiv / UC Berkeley (Mar 2025). arxiv.org
- [3] MAST: Multi-Agent System Failure Taxonomy. UC Berkeley Sky Computing Lab (2025). sky.cs.berkeley.edu
- [4] Multi-Agent vs Single-Agent Architecture: A Production Decision Framework. Towards AI (Mar 2026). pub.towardsai.net
- [5] Multi-Agent Orchestration Economics: When Single Agents Win 2026. Iterathon (Jan 2026). iterathon.tech
- [6] The Orchestration Illusion: Why Multi-AI Fails. iEnable (Mar 2026). ienable.ai
- [7] Multi-Agent Orchestration: The Handoff Problem That Quietly Destroys Production Systems. Ravoid (Apr 2026). ravoid.com
- [8] Multi-Agent Systems Fail Up to 87% of the Time. Runcycles (Mar 2026). runcycles.io
- [9] Single-Agent LLMs Outperform Multi-Agent Systems on Multi-Hop Reasoning Under Equal Thinking Token Budgets. Tran & Kiela, arXiv (Apr 2026). arxiv.org
- [10] Agent Drift: Quantifying Behavioral Degradation in Multi-Agent LLM Systems Over Extended Interactions. Rath et al., arXiv (Jan 2026). arxiv.org
- [11] IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST. IBM Research / HuggingFace (NeurIPS 2025). huggingface.co