One orchestrator decomposes the question. Subagents work the threads in isolation. Synthesis weighs the evidence. The brief lands in twenty minutes — not because the model is faster, but because the topology stopped wasting wall-clock on serial wait.
The three-phase model: decomposition, parallel execution, synthesis — and how each fails differently
Exact decomposition rules: the independence test, scope boundaries, output format enforcement
Production failure modes: context overflow, silent failures, synthesis averaging, slow-thread drag
A complete worked example with confidence scores, partial failures, and the resulting brief
When NOT to use this pattern — the question type that breaks it
Implementation checklist and decision table for getting it running Monday
Slack pings at 9:14 AM: Conversion in Germany dropped 8% last week. What happened?
You know the topology. Open the experiment tracker. Cross-reference the incident log. Scan three weeks of release notes. Check what a competitor shipped. Each source lives behind a different surface, demands its own context, and burns 30 to 90 minutes of focused attention. By the time a working hypothesis exists, the day is half spent.
Now change the topology. One question hits a system that decomposes it into four independent threads, dispatches a specialized subagent to each, waits for all four to report, and synthesizes a weighted hypothesis brief with confidence levels. Wall-clock cost: about twenty minutes when the decomposition matches the problem.
This is the orchestrator-subagent pattern for knowledge work. Anthropic reports that its multi-agent research system, which runs Claude Opus 4 as orchestrator with Claude Sonnet 4 subagents, outperformed single-agent Claude Opus 4 by 90.2% on complex research tasks.[1] That number comes with a precise caveat Anthropic documents explicitly: token usage alone explains 80% of the performance variance on the BrowseComp evaluation.[1] The wall-clock advantage is real. What drives it is compute surface area, not magic decomposition.
Decomposition, parallel execution, synthesis. Each fails differently. Each failure poisons the whole pipeline.
The pattern has three phases. Most teams that fail at this collapse two of them into one and never figure out why their briefs come out incoherent.
Phase 1: Decomposition. The orchestrator takes a question and splits it into independent threads. Independent is the load-bearing word. If thread B needs the output of thread A, they cannot run in parallel. Good decomposition produces threads that execute simultaneously without coordination.
Phase 2: Parallel execution. Each subagent runs its assigned thread inside its own context window with its own tools. Subagents don't communicate during execution. That isolation removes the coordination tax during execution and keeps every context window clean.[2] Attention is quadratic in sequence length: a single 20K-token pass costs about 25x as much attention compute as a 4K pass, because (20/4)² = 25. Keeping each subagent's context narrow isn't architectural tidiness; it's an economic necessity.
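As a rough check on the quadratic claim, the sketch below treats attention cost as proportional to the square of sequence length and ignores everything else (feed-forward layers, KV caching, constant factors). The function name is illustrative, not part of any library.

```python
def relative_attention_cost(tokens: int, baseline_tokens: int) -> float:
    """Ratio of self-attention cost, assuming cost scales with tokens squared."""
    return (tokens / baseline_tokens) ** 2


# One 20K-token pass compared with one 4K-token pass.
ratio = relative_attention_cost(20_000, 4_000)
print(f"20K vs 4K: {ratio:.0f}x")  # prints "20K vs 4K: 25x"
```

Real inference cost also includes terms that grow linearly with length, so treat the ratio as an upper-end intuition rather than a billing estimate.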
Phase 3: Synthesis. The orchestrator collects every report and produces the unified brief. Confidence levels, weighted hypotheses, cross-thread correlations all emerge here. Synthesis is the phase that turns raw findings into a decision — and the phase most often written as "summarize these reports," which produces nothing actionable.
Microsoft's multi-agent orchestration documentation and Anthropic's engineering post both describe the same topology from different implementation angles.[1][2] As of 2026, practice has converged on one pattern: orchestrator plus isolated subagents that return summaries, not full context.
Bad splits create duplicate effort or blind spots. Good splits produce a brief no single analyst could assemble in the same window.
Take the live example: Why did conversion drop 8% in Germany? A capable orchestrator splits this into four independent threads, each owning a different causal category. The table below maps them — notice that each thread specifies not just what to investigate, but explicitly what falls outside its scope.
| Subagent | Thread Focus | Data Sources | Explicit Out-of-Scope |
|---|---|---|---|
| Experiment Tracker | Active and recently concluded A/B tests touching the DE funnel | Experiment platform API, feature flag system | Global tests with <1% DE traffic allocation |
| Incident Log Analyst | Production incidents, latency spikes, payment errors in EU-west | PagerDuty, Datadog, payment gateway logs | Incidents resolved in under 2 minutes; non-EU regions |
| Release Notes Parser | Deploys touching checkout, pricing, or localization | GitHub releases, deploy logs, changelog | Infrastructure changes with no user-facing impact |
| Competitive Monitor | Competitor launches, pricing moves, campaigns in DACH market | Web search, press releases, app store updates | Markets outside DACH; organic traffic shifts |
Every subagent gets four things: an objective, an output format, tool guidance, and a scope boundary. The boundary is the part teams skip. Without it, subagents drift outside their lane and burn tokens exploring territory another thread already owns. Vague task descriptions produce vague results. Vague results poison synthesis.
The decomposition prompt follows a predictable shape. Here's what the orchestrator generates internally for each thread.
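A minimal sketch of one such task, assuming a hypothetical `SubagentTask` schema: the four fields mirror the objective, output format, tool guidance, and scope boundary described above, and the example values come from the Incident Log Analyst row of the table. The class name, field names, and JSON keys are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubagentTask:
    """One thread of the decomposition (hypothetical schema)."""
    name: str
    objective: str
    output_format: str
    tools: tuple[str, ...]
    out_of_scope: tuple[str, ...]

    def to_prompt(self) -> str:
        """Render the task as the instruction block a subagent receives."""
        lines = [
            f"Objective: {self.objective}",
            f"Output format: {self.output_format}",
            "Tools you may use: " + ", ".join(self.tools),
            "Out of scope (do not investigate):",
        ]
        lines += [f"- {item}" for item in self.out_of_scope]
        return "\n".join(lines)


incident_task = SubagentTask(
    name="incident_log_analyst",
    objective="Find production incidents, latency spikes, and payment errors in EU-west.",
    output_format="JSON with keys: status, completeness, findings, caveats",  # assumed keys
    tools=("PagerDuty", "Datadog", "payment gateway logs"),
    out_of_scope=("Incidents resolved in under 2 minutes", "Non-EU regions"),
)
print(incident_task.to_prompt())
```

Writing the out-of-scope list as a required field makes it hard to dispatch a thread without a boundary, which is the part teams tend to skip.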
Three constraints on parallel execution. Drop any one and the system either crashes on the first API error or silently produces a worse brief.
Once the task list exists, execution is mechanically simple but operationally strict. Three concerns dominate.
Isolation. Each subagent runs in its own context window with its own tool sessions. No shared state during execution. This isn't just architectural cleanliness. Because attention scales with the square of sequence length, concatenating five agents' contexts into one prompt costs roughly 5x as much attention compute as running them separately ((5n)² versus 5 × n²), and it exposes answers to position bias, where agents in the middle of a long prompt receive less attention weight than those at the edges.[7] Separate contexts solve both problems at once.
Bounded resources. No subagent gets disproportionate compute. Token budget per thread. Wall-clock timeout per thread. Simple fact-finding gets 3–10 tool calls; deep analysis might need 30+. The orchestrator sets the budget at decomposition time, not in flight. One slow subagent drags the entire synthesis window if you skip this.
Progress tracking. Promise.allSettled() in JavaScript (or its equivalent elsewhere) is the right primitive: it waits for every promise to settle, fulfilled or rejected, and reports each outcome. Promise.all() rejects as soon as any one promise rejects, and the results of the others are lost. In a four-thread research run, that means one flaky API destroys the pipeline. Fire-and-forget is not progress tracking; it's hope.
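The three concerns can be sketched together in Python, where `asyncio.gather(..., return_exceptions=True)` plays roughly the role of `Promise.allSettled()` and `asyncio.wait_for` enforces a per-thread timeout. The subagent coroutines below are stand-ins invented for the example; a real run would call models and tools.

```python
import asyncio


async def run_all(threads: dict) -> dict:
    """threads maps name -> (coroutine, timeout in seconds).

    Returns name -> result, or name -> the exception that thread raised.
    """
    names = list(threads)
    outcomes = await asyncio.gather(
        *(asyncio.wait_for(coro, timeout=t) for coro, t in threads.values()),
        return_exceptions=True,  # capture each failure instead of aborting on the first
    )
    return dict(zip(names, outcomes))


# Stand-in subagents (hypothetical): one succeeds, one errors, one is too slow.
async def experiment_tracker() -> str:
    await asyncio.sleep(0.01)
    return "2 active experiments"


async def incident_log() -> str:
    raise ConnectionError("metrics API rate limit")


async def competitive_monitor() -> str:
    await asyncio.sleep(5)
    return "never returned in time"


async def demo() -> dict:
    return await run_all({
        "experiments": (experiment_tracker(), 1.0),
        "incidents": (incident_log(), 1.0),
        "competitors": (competitive_monitor(), 0.05),  # budget set at decomposition time
    })


results = asyncio.run(demo())
for name, outcome in results.items():
    state = "failed" if isinstance(outcome, Exception) else "ok"
    print(name, state)
```

Every thread ends up with an entry in `results`, so the orchestrator can decide what a failure means for the brief instead of losing the whole run.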
All-or-nothing pipelines crash on the first rate limit. Graceful degradation, not retry storms, is the production answer.
Subagents fail. APIs go dark. Rate limits trigger. Context windows overflow on a dataset nobody sized for. The question is never will something fail — it's what does the system do when it does?[4]
The worst design treats failure as binary: every subagent succeeds or the run aborts. Real analysis tolerates partial information all the time. A senior analyst who can't reach the incident log still produces a useful brief from the other three sources — they just flag lower confidence on the infrastructure angle. The parallel research machine has to behave the same way.
Production data on multi-agent deployments is sobering: 40% of agentic AI projects fail to deliver expected returns.[8] Context window overflow is among the most common architectural failure modes — the orchestrator accumulates context from every worker, and at four or more workers with rich outputs, the orchestrator's own window frequently exceeds limits.[4] The fix is structured reports: subagents return summaries, not raw tool outputs.
| All-or-nothing pipeline | Graceful degradation |
|---|---|
| Promise.all(): one rejection kills the run | Promise.allSettled(): every outcome captured |
| Retry indefinitely, or until the billing alert fires | Retry once, then fall back to cached data or mark the thread unavailable |
| Silently drop failed threads from the brief | Flag missing threads explicitly with confidence impact |
| One global timeout regardless of task complexity | Per-task timeouts calibrated to expected data volume |
| No confidence adjustment when data is missing | Confidence scores drop in proportion to missing sources |
| Subagents return raw tool output; orchestrator context overflows | Subagents return structured summaries; orchestrator context stays bounded |
Every subagent report carries a structured status field. That structure is what lets synthesis weight findings honestly instead of averaging away the gaps.
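One possible shape for that report, as a sketch: the `Status` values, field names, and the `unavailable` helper are assumptions made for illustration, not a schema taken from any of the cited systems. The example values follow the incident-log thread in the worked example (partial data, 70% completeness).

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(str, Enum):
    COMPLETE = "complete"
    PARTIAL = "partial"
    NOTHING_FOUND = "nothing_found"  # searched successfully, found no signal
    FAILED = "failed"


@dataclass
class SubagentReport:
    thread: str
    status: Status
    completeness: float  # 0.0 to 1.0: share of planned sources actually read
    summary: str         # short structured summary, never raw tool output
    caveats: list[str] = field(default_factory=list)


def unavailable(thread: str, error: Exception) -> SubagentReport:
    """Turn a failed thread into an explicit report instead of dropping it."""
    return SubagentReport(thread, Status.FAILED, 0.0, "", [f"{type(error).__name__}: {error}"])


incident_report = SubagentReport(
    thread="incident_log",
    status=Status.PARTIAL,
    completeness=0.7,
    summary="One P2 payment timeout affecting SEPA direct debit.",
    caveats=["Metrics API rate-limited; duration estimate is approximate."],
)
print(incident_report.status.value, incident_report.completeness)
```

Keeping `NOTHING_FOUND` separate from `FAILED` matters for synthesis: the first is evidence against a hypothesis, the second is a gap.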
The naive synthesis prompt averages confidence and produces nothing actionable. The good one weighs evidence, stacks convergent signals, and names the gaps.
Synthesis is where the orchestrator-subagent pattern diverges from simple parallelization. The naive version concatenates reports and asks the model to summarize. The result is a worse version of what the subagents already produced separately.
One thing we got wrong early: synthesis prompts that naively averaged confidence scores across threads. A 70% finding from the incident log and a 70% finding from the experiment tracker don't combine to 70%. Either they reinforce — pushing confidence higher — or they describe different causal factors and should stay separate. Synthesis prompts need explicit rules for combining versus stacking evidence. "Produce a summary" isn't a rule. It's an abdication.
The synthesis prompt has to do five things: ingest every report with its completeness metadata, identify convergent evidence across threads, generate ranked hypotheses, assign confidence scores tied to evidence weight, and surface the gaps that cap confidence.
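To make "combining versus stacking" concrete, here is one possible rule set expressed as code. It is an assumption for illustration, not the rule any cited system uses: convergent findings about the same cause are combined with a noisy-OR, which presumes the threads are independent signals, and confidence is scaled down by the completeness of the underlying report.

```python
def combine_convergent(confidences: list[float]) -> float:
    """Noisy-OR: independent signals pointing at the same cause reinforce each other."""
    remaining_doubt = 1.0
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence must be in [0, 1], got {c}")
        remaining_doubt *= 1.0 - c
    return 1.0 - remaining_doubt


def scale_by_completeness(confidence: float, completeness: float) -> float:
    """Reduce confidence in proportion to the data a thread could not read."""
    return confidence * completeness


# Two 70% findings about the same cause reinforce; they do not average to 70%.
reinforced = combine_convergent([0.7, 0.7])
print(round(reinforced, 2))  # prints 0.91

# A single-thread finding backed by a report that was only 70% complete.
# The 0.5 starting confidence is an illustrative number, not from the example run.
penalized = scale_by_completeness(0.5, 0.7)
```

Findings that describe different causal factors should not pass through `combine_convergent` at all; they stay as separate hypotheses with their own scores.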
Trace the full pipeline from question to brief. Watch the partial failure at thread two and how synthesis handles it.
Trace the full pipeline against the running example. The product lead asks: Why did conversion drop 8% in Germany last week?
The orchestrator splits the question into four threads and dispatches subagents simultaneously. Eighteen minutes of parallel execution later, here's what comes back.
Thread 1, Experiment Tracker: Two active experiments. DE-checkout-v3 allocated 30% of DE traffic to a new checkout flow. The variant ran 12% below control on conversion. The experiment was still live; nobody had stopped it.
Thread 2, Incident Log Analyst (partial, 70% completeness): One P2 incident: a payment provider timeout hitting SEPA direct debit between Tuesday and Thursday. Roughly 3,200 users hit the error. Datadog metrics were blocked at the API rate limit, so the duration estimate is approximate and flagged in the report.
Thread 3, Release Notes Parser: Release v4.12.0 shipped Monday with updated VAT calculation logic for EU countries. Changelog flag: 'corrected edge case for German reverse-charge invoices.' No rollback issued.
Thread 4, Competitive Monitor: Main competitor launched a 15%-off spring promotion in DACH on Wednesday. Campaign running across Google Shopping and social. No pricing moves from any other competitor.
Synthesis converts raw findings into ranked hypotheses with explicit confidence — and surfaces the gaps that cap it.
The orchestrator feeds all four reports into the synthesis prompt. What comes out is structured enough for a product lead to act on immediately. Notice how the partial failure on thread 2 (70% completeness) reduces confidence on the SEPA hypothesis rather than silently disappearing.
Hypothesis 1: Checkout experiment DE-checkout-v3 is the dominant cause. Estimated contribution: about 3.6% out of the 8% drop (30% traffic allocation × 12% variant underperformance).
Convergent evidence: experiment data (thread 1) and the VAT logic change (thread 3) both touch the checkout funnel in the same window.
The VAT update plausibly introduced a display bug that compounds the variant's negative effect, so these two factors may not be independent.
Hypothesis 2: Competitor spring promotion siphoned intent-stage traffic starting Wednesday.
Timing aligns with the steepest part of the conversion decline.
Single-thread evidence, capped at 60% before the partial-data penalty applies.
Hypothesis 3: SEPA payment timeout hit ~3,200 users. Alone it accounts for roughly 0.5–1% of the drop.
Confidence reduced because the Datadog metrics were only partially available. Thread 2 finished at 70% completeness, and the ceiling drops accordingly.
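The 3.6% estimate in the first hypothesis is simple arithmetic, sketched below. It assumes the 8% figure is a relative drop in conversion and that the 70% of traffic outside the experiment converted at the control rate; the function name is illustrative.

```python
def blended_relative_drop(traffic_share: float, variant_underperformance: float) -> float:
    """Overall relative conversion drop caused by a variant serving part of the traffic."""
    return traffic_share * variant_underperformance


# 30% of DE traffic saw a variant converting 12% below control.
experiment_effect = blended_relative_drop(0.30, 0.12)
share_of_drop = experiment_effect / 0.08  # against the observed 8% drop

print(f"{experiment_effect:.1%} drop from the experiment, "
      f"{share_of_drop:.0%} of the observed decline")
```

On those assumptions the experiment explains a little under half of the decline, which is why the brief still needs the other hypotheses.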
The pattern breaks on ambiguous questions and iterative investigations. Knowing the failure condition is as important as knowing the setup.
The 90.2% performance lift Anthropic reports is real — for the right question type. On the wrong one, confident decomposition actively damages the output.
Parallel research machines work best on well-structured recurring questions where the causal categories are known in advance: "What changed?" "What broke?" "What did competitors do?" The decomposition almost writes itself.
They work badly on truly novel questions where the relevant dimensions are themselves uncertain. When you don't know what you're looking for, an orchestrator that confidently spawns four threads in four directions can miss the fifth direction that actually matters. A single thoughtful agent exploring iteratively — following the thread, noticing the unexpected signal, changing tack — will often produce better output. You also spend three cycles refining a decomposition that was wrong from the start, and the wall-clock advantage evaporates.
A contrarian threshold worth using: if you can't write the decomposition table before you see any findings, the question probably isn't ready for parallel execution.
| Signal | Parallelize | Single exploratory agent |
|---|---|---|
| Question structure | Known causal categories, recurring question pattern | Ambiguous scope, first-time investigation |
| Data sources | Distinct APIs you can enumerate before running | Unknown which sources matter; discovery is the first step |
| Failure tolerance | Brief remains useful even with one or two missing threads | Missing a key angle makes the whole output misleading |
| Cost appetite | ~15x token spend is justified by time saved | Exploration budget is uncertain; serial is cheaper to abort |
| Iteration speed | Template can reuse the same decomposition repeatedly | Each run needs a different investigation approach |
Adopting the pattern wholesale on Monday produces a debugging swamp. Start with one recurring question that costs the team hours.
Massive infrastructure isn't the prerequisite. The pattern works on any LLM with tool use, and the orchestration logic fits in a few hundred lines of TypeScript or Python. What matters most is what you build it around.
Pick one recurring question that costs the team more than two hours every time it surfaces. Map the data sources. Write the decomposition template. Build a minimal orchestrator that spawns subagents, collects reports, runs the synthesis prompt.
The first run will be rough. The decomposition will miss an angle. A subagent will fail in a way nobody anticipated. The synthesis confidence levels will feel arbitrary. All of that is expected, and all of it improves quickly with iteration on the templates. Log every decomposition. Review the ones that produced weak briefs. The failure mode is almost always one of three things: boundaries too narrow (subagents finish instantly with nothing), boundaries too broad (subagents time out), or synthesis averaging (no hypothesis breaks 30% confidence).
Cross-thread dependencies force serial execution and erase the wall-clock advantage. Resolve them in synthesis or in the decomposition itself — never at runtime.
Structured outputs make synthesis predictable. Explicit boundaries prevent thread overlap. Omit either and briefs drift.
The orchestrator accumulates context from every subagent. At 4+ workers with rich outputs, the orchestrator's context window overflows. Summaries bound this.
One slow subagent should not delay synthesis. Per-task timeouts paired with fallback strategies keep throughput intact.
Synthesis needs failure context to adjust confidence honestly and to recommend follow-ups that close the right gap.
Convergent evidence from two threads and single-thread evidence require different treatment. State the rules explicitly or the model averages, which produces nothing actionable.
Multi-agent runs cost ~15x a single chat. Where that spend goes — and where caching claws it back — determines whether the economics hold.
Token volume is the primary lever. Anthropic's research makes this explicit: token usage explains 80% of success variance on BrowseComp, with model choice and tool call count as the other two factors.[1] That means your primary cost optimization is controlling subagent token budgets — not prompt brevity.
Use a capable but efficient model for subagents and reserve the most capable model for orchestration and synthesis, where reasoning depth carries the weight. Anthropic's own system uses Claude Opus 4 for the orchestrator and synthesis pass, Claude Sonnet 4 for the research subagents.[1] One operational caution: mixing model families makes debugging harder when outputs look stylistically inconsistent. Standardize on one family per tier and document the routing logic.
Caching is the other high-impact optimization. Many research questions share subqueries. If the competitive monitor scanned the DACH market yesterday, today's run starts from cached results and only checks the delta. A competitive intelligence subagent that runs daily can cache its baseline and compute incremental updates in a fraction of the token budget.
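A minimal sketch of the baseline-reuse half of that idea, assuming a hypothetical in-memory `SubqueryCache` keyed by subquery string with a time-to-live. It does not compute deltas; a production version would also persist entries and re-check only what changed since the cached run.

```python
import time
from typing import Callable


class SubqueryCache:
    """Cache subagent results by subquery key with a time-to-live (hypothetical helper)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get_or_run(self, key: str, run: Callable[[], str]) -> tuple[str, bool]:
        """Return (result, was_cached). Calls run() only on a miss or an expired entry."""
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1], True
        result = run()
        self._entries[key] = (now, result)
        return result, False


cache = SubqueryCache(ttl_seconds=24 * 3600)
scan = lambda: "DACH competitor baseline"  # stand-in for an expensive subagent run
first, cached_first = cache.get_or_run("competitors:DACH", scan)
second, cached_second = cache.get_or_run("competitors:DACH", scan)
print(cached_first, cached_second)  # prints "False True"
```

The injectable `clock` parameter is there so expiry can be tested without waiting for real time to pass.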
My subagents keep producing overlapping findings. How do I fix this?
Tighten the scope boundaries. Each task specifies what to investigate and what falls outside scope. If the experiment tracker and release parser both surface the same checkout change, write explicit exclusions: experiment tracker covers A/B test impacts only, release parser covers code changes and intended behavior only. Overlapping findings in the brief are fine — that's convergent evidence, which strengthens confidence. Overlapping investigation wastes tokens and introduces duplication that synthesis can't distinguish from real convergence.
How many subagents? Is more better?
No. Each subagent adds coordination cost and token spend. For most business research questions, three to five threads hit the right window. Past five, synthesis starts struggling to integrate everything coherently and the brief loses sharpness. The orchestrator's context accumulates all subagent reports, and with detailed outputs context overflow becomes a real constraint once you reach four or five workers. Start narrow. Add threads only when you can name a distinct data source the current threads miss.
What does a bad decomposition look like?
Three symptoms: subagents finish instantly with nothing (boundaries too narrow), subagents time out repeatedly (boundaries too broad), or synthesis can't push any hypothesis past 30% confidence (threads aren't capturing the right causal categories). The fix is operational — log every decomposition, tag the ones that produced weak briefs, and refine the templates. Good decomposition improves with 5–10 iterations on real questions. It doesn't emerge from the first design session.
Different models for different subagents — yes or no?
Yes, but with discipline. Route simple data-retrieval threads to faster, cheaper models. Reserve capable models for threads that need judgment — competitive analysis, hypothesis ranking. Anthropic's own system uses Claude Opus 4 for orchestration and Claude Sonnet 4 for subagents.[1] One operational caution: mixing model families makes debugging harder when outputs look stylistically inconsistent. Standardize on one family per tier and document the routing logic — future maintainers need to know why the SQL subagent runs on a different model than competitive intelligence.
How do I test before connecting real data sources?
Build mock subagents that return canned responses with varying status and completeness. Stress-test synthesis and failure handling without burning API credits. Include at least one mock that returns partial data, one that fails outright, one that returns a clean nothing-found result (which is itself a signal), and one with findings that contradict another thread. If the synthesis prompt can't handle all four cleanly, it won't handle production either.
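A sketch of such fixtures, using plain dictionaries with assumed field names. The `gap_notes` helper is hypothetical; it shows one check worth running on mocks before any real data source is wired in, namely that partial and failed threads are surfaced rather than dropped.

```python
# Canned reports covering the four cases; all values are invented test data.
MOCK_REPORTS = [
    {"thread": "incidents", "status": "partial", "completeness": 0.7,
     "summary": "One P2 payment timeout; duration approximate."},
    {"thread": "releases", "status": "failed", "completeness": 0.0,
     "summary": ""},
    {"thread": "competitors", "status": "nothing_found", "completeness": 1.0,
     "summary": "No competitor launches or pricing moves found."},
    {"thread": "experiments", "status": "complete", "completeness": 1.0,
     "summary": "No payment errors seen in variant; contradicts the incident thread."},
]


def gap_notes(reports: list[dict]) -> list[str]:
    """List every thread whose data is missing or incomplete, so none is dropped silently."""
    notes = []
    for r in reports:
        if r["status"] == "failed":
            notes.append(f"{r['thread']}: unavailable")
        elif r["status"] == "partial":
            notes.append(f"{r['thread']}: partial ({r['completeness']:.0%} complete)")
    return notes


for note in gap_notes(MOCK_REPORTS):
    print(note)
```

The same fixtures can be fed to the synthesis prompt to confirm that the contradiction is named in the brief and that the nothing-found thread is treated as evidence, not as a gap.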
What's the right token budget per subagent?
Simple data retrieval (API query, filter, format): 2,000–4,000 tokens. Moderate analysis (compare multiple sources, compute impact): 6,000–10,000 tokens. Deep judgment (competitive landscape, causal inference): up to 20,000 tokens. Set these at decomposition time. Don't let subagents inherit a global budget — tasks with wildly different complexity should have wildly different budgets.
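Those tiers can be pinned down as configuration so that no thread falls back to a global default. The tier names and the `budget_for` helper below are illustrative; the ceilings are the upper bounds given above.

```python
# Upper token bounds per complexity tier, taken from the ranges above.
MAX_TOKENS_BY_TIER = {
    "retrieval": 4_000,           # API query, filter, format
    "moderate_analysis": 10_000,  # compare sources, compute impact
    "deep_judgment": 20_000,      # competitive landscape, causal inference
}


def budget_for(tier: str) -> int:
    """Look up a thread's token ceiling; unknown tiers are an error, not a default."""
    try:
        return MAX_TOKENS_BY_TIER[tier]
    except KeyError:
        raise ValueError(f"Unknown tier {tier!r}; assign one at decomposition time") from None


# Hypothetical tier assignment for the four threads in the running example.
thread_tiers = {
    "experiment_tracker": "retrieval",
    "incident_log": "moderate_analysis",
    "release_notes": "retrieval",
    "competitive_monitor": "deep_judgment",
}
budgets = {thread: budget_for(tier) for thread, tier in thread_tiers.items()}
print(budgets)
```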
Nobody adopts the orchestrator-subagent pattern wholesale. Start with the one recurring question, the decomposition template, and the minimal orchestrator described above.
The first run will be rough. That improves quickly with iteration on the templates, but only if you're logging the decompositions and reviewing the weak briefs.
The wall-clock won't feel rough. The first time a weighted brief lands in twenty minutes for a question that used to eat a morning, the pattern stops needing a sales pitch. After that, serial research starts feeling like coordination tax you no longer have to pay — and you start recognizing every recurring question that deserves its own machine.