Slack pings at 9:14 AM: Conversion in Germany dropped 8% last week. What happened?
You know the topology. Open the experiment tracker. Cross-reference the incident log. Scan three weeks of release notes. Check what a competitor shipped. Each source lives behind a different surface, demands its own context, and burns 30 to 90 minutes of focused attention. By the time a working hypothesis exists, the day is half spent.
Now change the topology. One question hits a system that decomposes it into four independent threads, dispatches a specialized subagent to each, waits for all four to report, and synthesizes a weighted hypothesis brief with confidence levels. Wall-clock cost: about twenty minutes when the decomposition matches the problem.
This is the orchestrator-subagent pattern for knowledge work. Anthropic's multi-agent research reports up to a 90% performance lift over single-agent runs on specific complex research tasks[1] — with the obvious caveat that the gain depends entirely on whether the decomposition matches the problem structure. The number stops being abstract the first time you watch four subagents work the same question simultaneously while you read the brief that comes out the other side.
Three Phases. Confuse Them and the System Breaks.
Decomposition, parallel execution, synthesis. Each fails differently. Each fails the whole pipeline.
The pattern has three phases. Most teams that fail at this collapse two of them into one and never figure out why their briefs come out incoherent.
Phase 1: Decomposition. The orchestrator takes a question and splits it into independent threads. Independent is the load-bearing word. If thread B needs the output of thread A, they cannot run in parallel. Good decomposition produces threads that execute simultaneously without coordination.
Phase 2: Parallel execution. Each subagent runs its assigned thread inside its own context window with its own tools. Subagents do not communicate during execution. That isolation is a feature — it cuts the coordination tax to zero and keeps every context window clean.[2]
Phase 3: Synthesis. The orchestrator collects every report and produces the unified brief. Confidence levels, weighted hypotheses, cross-thread correlations all emerge here. Synthesis is the phase that turns raw findings into a decision.
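The three phases can be sketched as one function. This is a minimal illustration under stated assumptions, not a reference implementation: `Thread`, `Outcome`, `runPipeline`, and the injected `decompose`, `runSubagent`, and `synthesize` callbacks are hypothetical names standing in for the LLM calls.

```typescript
// Hypothetical shapes for illustration; none of these names come from a real SDK.
type Thread = { id: string; objective: string };
type Outcome =
  | { id: string; ok: true; report: string }
  | { id: string; ok: false; error: string };

async function runPipeline(
  question: string,
  decompose: (q: string) => Thread[],
  runSubagent: (t: Thread) => Promise<string>,
  synthesize: (q: string, outcomes: Outcome[]) => string,
): Promise<string> {
  // Phase 1: decomposition into independent threads.
  const threads = decompose(question);

  // Phase 2: parallel execution. Each thread runs without seeing the others.
  const settled = await Promise.allSettled(threads.map(t => runSubagent(t)));

  // Keep failures as data so synthesis can account for the gap.
  const outcomes = settled.map((s, i): Outcome =>
    s.status === "fulfilled"
      ? { id: threads[i].id, ok: true, report: s.value }
      : { id: threads[i].id, ok: false, error: String(s.reason) },
  );

  // Phase 3: synthesis across every outcome, including the failed ones.
  return synthesize(question, outcomes);
}
```

Keeping the phases as separate arguments makes it harder to collapse two of them into one by accident, and each can be swapped for a mock during testing.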
Microsoft's multi-agent orchestration guidance and Anthropic's multi-agent research are worth reading alongside this — they document the same topology from different angles.
Decomposition Is Where the Orchestrator Earns Its Keep
Bad splits create duplicate effort or blind spots. Good splits produce a brief no single analyst could assemble in the same window.
Decomposition is where the orchestrator earns its keep. Split the question badly and your subagents either duplicate each other's work or miss the angle that matters. Split it well and the brief lands in a window no single analyst could match.
Take the live example: Why did conversion drop 8% in Germany? A capable orchestrator splits this into four independent threads, each owning a different causal category.
| Subagent | Thread Focus | Data Sources | Output Format |
|---|---|---|---|
| Experiment Tracker | Active and recently concluded A/B tests touching the DE funnel | Experiment platform API, feature flag system | Experiment list with traffic allocation and measured impact |
| Incident Log Analyst | Production incidents, latency spikes, payment errors in the DE region | PagerDuty, Datadog, payment gateway logs | Incident timeline with duration and affected user count |
| Release Notes Parser | Deploys that touched checkout, pricing, or localization | GitHub releases, deploy logs, changelog | Annotated release list with change descriptions |
| Competitive Monitor | Competitor launches, pricing moves, campaigns in the DE market | Web search, press releases, app store updates | Competitive event summary with timing and relevance |
Every subagent gets four things: an objective, an output format, tool guidance, and a scope boundary. The boundary is the part teams skip. Without it, subagents drift outside their lane and burn tokens exploring territory another thread already owns. Vague task descriptions produce vague results. Vague results poison synthesis.
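One way to keep the boundary from being skipped is to reject incomplete briefs before dispatch. The sketch below assumes a hypothetical `TaskBrief` shape holding the four fields named above; `missingFields` is an illustrative helper, not part of any library.

```typescript
// Hypothetical brief shape: the four things the text says every subagent needs.
interface TaskBrief {
  objective: string;
  outputFormat: string;
  tools: string[];
  boundaries: string;
}

// Returns the names of fields that are missing or blank.
// An empty array means the brief is complete.
function missingFields(brief: Partial<TaskBrief>): string[] {
  const missing: string[] = [];
  if (!brief.objective?.trim()) missing.push("objective");
  if (!brief.outputFormat?.trim()) missing.push("outputFormat");
  if (!brief.tools || brief.tools.length === 0) missing.push("tools");
  if (!brief.boundaries?.trim()) missing.push("boundaries");
  return missing;
}

const draft: Partial<TaskBrief> = {
  objective: "Find A/B tests touching the DE funnel",
  outputFormat: "JSON array",
  tools: ["experiment-api"],
};

// The scope boundary was left out, so the check reports it.
console.log(missingFields(draft));
```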
The decomposition prompt itself follows a predictable shape. Here is what the orchestrator generates internally.
orchestrator-decomposition.ts

```typescript
interface SubagentTask {
  id: string;
  objective: string;
  outputFormat: string;
  tools: string[];
  boundaries: string;
  timeoutMs: number;
  fallbackStrategy: "skip" | "retry-once" | "use-cached";
}

function decomposeQuestion(question: string): SubagentTask[] {
  // Orchestrator LLM emits these from the question. Boundaries are not optional.
  return [
    {
      id: "experiment-tracker",
      objective: `Find every A/B test running in the DE market during the
        affected window. Report traffic allocation, variant performance,
        and whether any test concluded with a winner deployed.`,
      outputFormat: "JSON array of { testName, status, deImpact, confidence }",
      tools: ["experiment-api", "feature-flags"],
      boundaries: "DE-impacting tests only. Skip global tests with <1% DE traffic.",
      timeoutMs: 120_000,
      fallbackStrategy: "retry-once"
    },
    {
      id: "incident-log",
      objective: `Identify production incidents in EU-west region during
        the affected window. Focus on checkout, payment, page-load.`,
      outputFormat: "JSON array of { incident, severity, startTime, duration, usersAffected }",
      tools: ["pagerduty-api", "datadog-api"],
      boundaries: "EU-west only. Drop incidents resolved in under 2 minutes.",
      timeoutMs: 90_000,
      fallbackStrategy: "use-cached"
    },
    // ... release-notes-parser, competitive-monitor
  ];
}
```

Isolation, Bounded Resources, Progress Tracking
Three constraints. Drop any one and parallel execution stops behaving as parallel.
Once the task list exists, execution is mechanically simple but operationally strict. Three concerns dominate.
Isolation. Each subagent runs in its own context window with its own tool sessions. No shared state during execution. If subagent A discovers something subagent B should know, that connection is made during synthesis — never mid-flight. Shared state during execution introduces race conditions and debugging nightmares.[4]
Bounded resources. No subagent gets to consume disproportionate compute. Token budget per thread. Wall-clock timeout per thread. Simple fact-finding gets 3-10 tool calls. Deep analysis might need 30+. The orchestrator picks the budget at decomposition time, not in flight.
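A per-thread budget can be enforced with a small guard that every tool call passes through. The sketch below is one possible shape, assuming the orchestrator fixes both limits at decomposition time; `ThreadBudget` and `BudgetGuard` are hypothetical names and the numbers are illustrative.

```typescript
// Hypothetical per-thread budget, fixed by the orchestrator at decomposition time.
interface ThreadBudget {
  maxToolCalls: number;
  maxTokens: number;
}

class BudgetGuard {
  private readonly budget: ThreadBudget;
  private toolCalls = 0;
  private tokens = 0;

  constructor(budget: ThreadBudget) {
    this.budget = budget;
  }

  // Records one tool call and its token cost.
  // Returns false, and records nothing, if either limit would be exceeded.
  tryCharge(tokensUsed: number): boolean {
    if (this.toolCalls + 1 > this.budget.maxToolCalls) return false;
    if (this.tokens + tokensUsed > this.budget.maxTokens) return false;
    this.toolCalls += 1;
    this.tokens += tokensUsed;
    return true;
  }

  // What is left of each limit.
  remaining(): ThreadBudget {
    return {
      maxToolCalls: this.budget.maxToolCalls - this.toolCalls,
      maxTokens: this.budget.maxTokens - this.tokens,
    };
  }
}

// Illustrative numbers only: a fact-finding thread capped at 10 tool calls and 8,000 tokens.
const guard = new BudgetGuard({ maxToolCalls: 10, maxTokens: 8_000 });
```

A subagent loop would call `guard.tryCharge(...)` before each tool call and stop with a partial report once it returns `false`.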
Progress tracking. The orchestrator needs to know which subagents have finished, which are still running, which have failed. Promise.allSettled() in JavaScript, or its equivalent elsewhere, is the right primitive: it waits for every promise to settle, whether it fulfills or rejects, and reports each outcome individually. Fire-and-forget is not progress tracking. It is hope.
- [01] Spawn isolated subagent instances with task-specific prompts

```typescript
const subagentPromises = tasks.map(task =>
  spawnSubagent({
    systemPrompt: buildSubagentPrompt(task),
    tools: task.tools,
    tokenBudget: 8_000,
    timeout: task.timeoutMs,
  })
);
```

- [02] Run every subagent concurrently and collect outcomes

```typescript
const results = await Promise.allSettled(subagentPromises);
// results: Array<{status: 'fulfilled', value} | {status: 'rejected', reason}>
```

- [03] Classify outcomes and apply per-task fallback strategies

```typescript
const classified = results.map((result, i) => ({
  taskId: tasks[i].id,
  status: result.status,
  data: result.status === 'fulfilled' ? result.value : null,
  error: result.status === 'rejected' ? result.reason : null,
  fallback: tasks[i].fallbackStrategy,
}));
```
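The classification step records a fallback strategy per task but does not act on it. One way to apply it is to wrap each subagent call, as in the sketch below. It assumes the three strategy names from the task definition; `resolveWithFallback`, `run`, and `cache` are hypothetical stand-ins for the subagent call and a result cache.

```typescript
type FallbackStrategy = "skip" | "retry-once" | "use-cached";

// What the orchestrator hands to synthesis; `source` records how the data was obtained.
type Resolved<T> =
  | { taskId: string; source: "live" | "retry" | "cache"; data: T }
  | { taskId: string; source: "unavailable"; data: null; reason: string };

async function resolveWithFallback<T>(
  taskId: string,
  strategy: FallbackStrategy,
  run: () => Promise<T>,
  cache: Map<string, T>,
): Promise<Resolved<T>> {
  try {
    return { taskId, source: "live", data: await run() };
  } catch (firstError) {
    if (strategy === "retry-once") {
      try {
        // Exactly one more attempt, never a retry loop.
        return { taskId, source: "retry", data: await run() };
      } catch (secondError) {
        return { taskId, source: "unavailable", data: null, reason: String(secondError) };
      }
    }
    if (strategy === "use-cached" && cache.has(taskId)) {
      return { taskId, source: "cache", data: cache.get(taskId) as T };
    }
    // "skip", or nothing cached: report the gap instead of hiding it.
    return { taskId, source: "unavailable", data: null, reason: String(firstError) };
  }
}
```

Because the wrapper never rejects, it can be combined with `Promise.allSettled()` or `Promise.all()` alike, and synthesis can lower confidence for any thread whose `source` is `"cache"` or `"unavailable"`.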
Subagents Will Fail. Design for Partial Success.
All-or-nothing pipelines crash on the first rate limit. The fix is graceful degradation, not retry storms.
Subagents fail. APIs go dark. Rate limits trigger. Context windows overflow on a dataset nobody sized for. The question is never will something fail — it is what does the system do when it does.[4]
The worst design treats failure as binary: every subagent succeeds or the run aborts. Real analysis tolerates partial information all the time. A senior analyst who cannot reach the incident log still produces a useful brief from the other three sources — they just flag lower confidence on the infrastructure angle.
The parallel research machine has to behave the same way. Otherwise one flaky API takes the whole pipeline down on a recurring basis, and the team learns to distrust the system.
| All-or-nothing design | Partial-success design |
|---|---|
| Promise.all(): one rejection kills the run | Promise.allSettled(): every outcome captured |
| Retry indefinitely until success, or until billing notices | Retry once. Then fall back to cached data or mark unavailable. |
| Silently drop failed threads from the brief | Flag missing threads explicitly with confidence impact |
| One global timeout regardless of task complexity | Per-task timeouts calibrated to expected data volume |
| No confidence adjustment when data is missing | Confidence scores drop in proportion to missing sources |
Every subagent report carries a structured status field. That structure is what lets synthesis weight the findings honestly instead of averaging away the gaps.
subagent-report.ts

```typescript
// Assumed minimal shape: the synthesis prompt only reads `summary`.
interface Finding {
  summary: string;
}

interface SubagentReport {
  taskId: string;
  status: "complete" | "partial" | "failed";
  completeness: number; // 0.0 to 1.0
  findings: Finding[];
  sourcesConsulted: string[];
  sourcesUnavailable: string[];
  executionTimeMs: number;
  tokenUsage: number;
  notes: string; // Free-text context on what was missing or weird
}
```

Synthesis Is Reasoning Across Reports, Not Concatenation
The naive synthesis prompt averages confidence and produces nothing actionable. The good one weighs evidence.
Synthesis is where the orchestrator-subagent pattern diverges from simple parallelization. The naive version concatenates reports and asks the model to summarize. The result is a worse version of what the subagents already produced separately.
The sharp version asks the model to reason across reports — find correlations between threads, weigh evidence by completeness, flag contradictions instead of averaging them away, and assign confidence levels to ranked hypotheses.
The synthesis prompt has to do five things: ingest every report with its completeness metadata, identify convergent evidence across threads, generate ranked hypotheses, assign confidence scores tied to evidence weight, and surface the gaps that cap confidence.
synthesis-prompt.ts

```typescript
function buildSynthesisPrompt(reports: SubagentReport[]): string {
  const reportSummaries = reports.map(r => `
## ${r.taskId} (${r.status}, ${Math.round(r.completeness * 100)}% complete)
${r.findings.map(f => `- ${f.summary}`).join('\n')}
Sources unavailable: ${r.sourcesUnavailable.join(', ') || 'none'}
`).join('\n');

  return `You are a senior research analyst synthesizing findings from
${reports.length} parallel investigation threads.

Reports:
${reportSummaries}

Produce a hypothesis brief with this structure:
1. **Top hypothesis** with confidence (0-100%) and supporting evidence
2. **Alternative hypotheses** ranked by likelihood
3. **Evidence gaps**: what was missing or incomplete
4. **Recommended next steps** to close the gaps

Confidence rules (apply them, do not average them away):
- Start at 50% prior. Adjust based on evidence weight.
- Convergent evidence from 2+ threads: +15-25%
- Single-thread evidence only: cap at 60%
- Each failed/partial subagent: -10% ceiling reduction
- Contradictory evidence: flag explicitly. Never resolve by averaging.`;
}
```

Worked Example: The Germany Drop, Eighteen Minutes In
Trace the full pipeline from question to brief. Watch the partial failure at thread two.
Trace the full pipeline against the running example. The product lead asks: Why did conversion drop 8% in Germany last week?
The orchestrator splits the question into four threads and dispatches subagents. Eighteen minutes of parallel execution later, here is what comes back.
- [01] Experiment Tracker reports back (status: complete, 100%)
  Two active experiments. DE-checkout-v3 allocated 30% of DE traffic to a new checkout flow. Variant ran 12% below control on conversion. The experiment was still running. Nobody had stopped it.
- [02] Incident Log Analyst reports back (status: partial, 70%)
  One P2 incident: a payment provider timeout hitting SEPA direct debit between Tuesday and Thursday. Roughly 3,200 users hit the error. Datadog metrics blocked at the API rate limit, so the duration estimate is approximate, which the report flags.
- [03] Release Notes Parser reports back (status: complete, 100%)
  Release v4.12.0 shipped Monday with updated VAT calculation logic for EU countries. Changelog flag: 'corrected edge case for German reverse-charge invoices.' No rollback issued.
- [04] Competitive Monitor reports back (status: complete, 95%)
  Main competitor launched a 15%-off spring promotion in DACH on Wednesday. Campaign running across Google Shopping and social. No pricing moves from any other competitor.
The Brief: Four Reports, One Decision Surface
Synthesis converts raw findings into ranked hypotheses with explicit confidence — and the gaps that cap it.
The orchestrator feeds all four reports into the synthesis prompt. What comes out is structured enough for a product lead to act on immediately.
Primary Hypothesis (Confidence 72%)
- Checkout experiment DE-checkout-v3 is the dominant cause. It accounts for an estimated 3.6% relative decline in conversion, a bit under half of the 8% drop (30% traffic allocation × 12% variant underperformance).
- Convergent evidence: experiment data (thread 1) and the VAT logic change (thread 3) both touch the checkout funnel in the same window.
- The VAT update plausibly introduced a display bug that compounds the variant's negative effect.

Secondary Hypothesis (Confidence 45%)
- Competitor spring promotion siphoned intent-stage traffic starting Wednesday.
- Timing aligns with the steepest part of the conversion decline.
- Single-thread evidence, so capped at 60% before the partial-data penalty pulls it lower.

Contributing Factor (Confidence 35%)
- SEPA payment timeout hit ~3,200 users. Alone it accounts for 0.5-1% of the drop.
- Confidence reduced because the Datadog metrics were only partially available: thread 2 finished at 70% completeness.
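The 3.6% figure in the primary hypothesis is plain arithmetic, sketched below. It assumes the remaining 70% of DE traffic converted at the control rate; `aggregateImpact` is an illustrative helper name.

```typescript
// Aggregate relative drop caused by a variant that only part of the traffic sees.
// Assumes the rest of the traffic converts at the control rate.
function aggregateImpact(trafficShare: number, variantShortfall: number): number {
  return trafficShare * variantShortfall;
}

// 30% of DE traffic saw a variant converting 12% below control.
const experimentImpact = aggregateImpact(0.30, 0.12);

// Fraction of the observed 8% drop that the experiment would explain.
const shareOfDrop = experimentImpact / 0.08;

console.log(`${(experimentImpact * 100).toFixed(1)}% relative decline`); // 3.6% relative decline
console.log(`${Math.round(shareOfDrop * 100)}% of the observed drop`);   // 45% of the observed drop
```

The remaining share of the drop is what the secondary hypothesis and the contributing factor have to explain.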
Build It Around One Question First
Adopting the pattern wholesale on Monday produces a debugging swamp. Start with one recurring question that costs you hours.
Massive infrastructure is not the prerequisite. The pattern works on any LLM with tool use, and the orchestration logic fits in a few hundred lines of TypeScript or Python. The thing that matters most is which question you build it around.
One thing we got wrong early: we wrote synthesis prompts that naively averaged confidence scores across threads. A 70% finding from the incident log and a 70% finding from the experiment tracker do not combine to 70%. Either they reinforce — pushing confidence higher — or they describe different causal factors and should stay separate. Synthesis prompts need explicit rules for combining versus stacking evidence. "Produce a summary" is not a rule. It is an abdication.
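The confidence rules from the synthesis prompt can also be written down as a deterministic check, which is useful for sanity-testing what the model returns. The sketch below is one reading of those rules, not the author's implementation. It assumes the ceiling starts at 100 (or 60 for single-thread evidence) and that the caller picks the convergence bonus; `HypothesisEvidence` and `scoreConfidence` are hypothetical names.

```typescript
// Inputs for one hypothesis. Field names are illustrative.
interface HypothesisEvidence {
  supportingThreads: number; // threads whose findings support the hypothesis
  convergenceBonus: number;  // 15 to 25, used only when 2+ threads agree
  degradedThreads: number;   // subagents that reported "partial" or "failed"
}

function scoreConfidence(e: HypothesisEvidence): number {
  const convergent = e.supportingThreads >= 2;

  // Start at the 50% prior; reinforce only when independent threads converge.
  let score = 50;
  if (convergent) {
    // Clamp the bonus into the 15-25 band the rules allow.
    score += Math.min(25, Math.max(15, e.convergenceBonus));
  }

  // Single-thread evidence is capped at 60; every degraded subagent lowers the ceiling by 10.
  const ceiling = (convergent ? 100 : 60) - 10 * e.degradedThreads;

  return Math.max(0, Math.min(score, ceiling));
}

// Two converging threads, a bonus of 22, and one partial subagent.
console.log(scoreConfidence({ supportingThreads: 2, convergenceBonus: 22, degradedThreads: 1 })); // 72
```

Two 70% findings never pass through an average here: they either earn the convergence bonus together or are scored as separate hypotheses.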
Implementation Readiness Checklist
- Three to five recurring research questions identified, each currently costing >2 hours of focused work
- Data sources mapped per question: APIs, databases, web search endpoints
- Programmatic access verified for every source: API keys, query endpoints, rate limits
- Decomposition templates designed for the most common question patterns
- Token budgets and timeouts set per subagent type, not globally
- Promise.allSettled() or equivalent in place, never Promise.all()
- Structured report format with status and completeness fields, enforced
- Synthesis prompt written with explicit confidence-scoring rules, not "summarize"
- Per-thread logging for execution time, token usage, failure rate
- Tested with intentional failures: graceful degradation verified, not assumed
Decomposition Design Rules
- Every subagent thread runs without output from any other thread
  Cross-thread dependencies force serial execution and erase the wall-clock advantage. Resolve them in synthesis or in the decomposition itself, not at runtime.
- Every task specifies an output format, not just an objective
  Structured outputs make synthesis predictable. Free-form responses produce inputs the synthesis prompt cannot weight, and the brief drifts.
- Subagents never share context windows during execution
  Shared context introduces race conditions and context pollution. Cross-thread insights belong in synthesis, where the orchestrator has the whole picture.
- Timeouts are per-task, not global
  One slow subagent should not delay the pipeline. Per-task timeouts paired with fallback strategies keep throughput intact.
- Failed subagents report why they failed, not just that they failed
  Synthesis needs failure context to adjust confidence honestly and to recommend follow-ups that close the right gap.
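The first rule can be checked mechanically before anything is dispatched. The sketch below assumes each planned thread declares the ids of any threads whose output it would need; `PlannedThread`, `needs`, `findDependencies`, and the `impact-estimator` thread are all hypothetical.

```typescript
// Hypothetical plan entry: `needs` lists ids of threads whose output this thread requires.
interface PlannedThread {
  id: string;
  needs: string[];
}

// Returns every cross-thread dependency as "consumer <- producer".
// An empty array means the plan can run fully in parallel.
function findDependencies(plan: PlannedThread[]): string[] {
  const ids = new Set(plan.map(t => t.id));
  const violations: string[] = [];
  for (const thread of plan) {
    for (const needed of thread.needs) {
      if (ids.has(needed)) violations.push(`${thread.id} <- ${needed}`);
    }
  }
  return violations;
}

const plan: PlannedThread[] = [
  { id: "experiment-tracker", needs: [] },
  { id: "release-notes-parser", needs: [] },
  // This thread would have to wait for the tracker, so the plan is not parallel-safe.
  { id: "impact-estimator", needs: ["experiment-tracker"] },
];

console.log(findDependencies(plan));
```

A non-empty result is a signal to move that work into synthesis or to redraw the split.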
Cost, Caching, and the Question Type That Breaks the Pattern
Multi-agent runs cost roughly 15x a single chat. The cost is justified — except for the question type that should not be decomposed at all.
Multi-agent runs consume roughly 15x the tokens of a single chat — a rough estimate that swings significantly with subagent complexity. The cost is justified when the alternative is 4 to 8 hours of senior analyst time. It still demands attention.
A contrarian point worth holding: parallel research machines work best on well-structured recurring questions. They work badly on truly novel ones. When the question itself is ambiguous — when the dimensions that matter are themselves uncertain — an orchestrator that confidently decomposes and dispatches can do worse than a single thoughtful agent exploring iteratively. The wall-clock advantage evaporates when you spend three cycles refining a decomposition that was wrong from the start.
The primary tuning levers are token usage, tool call frequency, and model selection. Use a capable but efficient model for subagents — Claude Sonnet 4 handles most research threads — and reserve the most capable model (Claude Opus 4) for orchestration and synthesis where reasoning depth carries the weight.[2]
Caching is the other high-impact optimization. Many research questions share subqueries. If the competitive monitor scanned the market yesterday, today's run starts from cached results and only checks the delta. Gartner reports a surge in multi-agent system inquiries between 2024 and 2025[6] — the tooling is maturing, but enterprise adoption is early and the patterns are still settling.
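A minimal form of that cache is a time-to-live map keyed by subagent and scope. The sketch below covers only freshness, not the delta check, and it assumes a one-day lifetime purely for illustration; `ResultCache` and the key scheme are hypothetical.

```typescript
// Minimal time-to-live cache for subagent results.
// The clock is injected so expiry can be tested without waiting.
class ResultCache<T> {
  private readonly entries = new Map<string, { value: T; storedAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, storedAt: this.now() });
  }

  // Returns the cached value while it is fresh.
  // Expired entries are evicted and reported as a miss.
  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}

// Hypothetical key scheme: subagent id plus the market it scanned.
const DAY_MS = 24 * 60 * 60 * 1000;
const competitiveCache = new ResultCache<string[]>(DAY_MS);
competitiveCache.set("competitive-monitor:DE", ["15%-off spring promotion in DACH"]);
```

On a cache hit, the subagent's prompt can be seeded with the stored findings and asked only for what changed since they were stored.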
Where This Pattern Fails in Practice
My subagents keep producing overlapping findings. How do I fix this?
Tighten the boundaries. Each task specifies what to investigate and what falls outside scope. If the experiment tracker and the release parser both surface the same checkout change, write explicit exclusions: experiment tracker covers A/B test impacts only, release parser covers code changes and intended behavior. Overlapping findings in the brief are fine. Overlapping investigation wastes tokens.
How many subagents? Is more better?
No. Each subagent adds coordination cost and token spend. For most business research questions, three to five threads is the right range. Past five, synthesis starts struggling to integrate everything coherently and the brief loses sharpness. Start narrow. Add threads only when you can name a distinct data source the current threads miss.
What does a bad decomposition look like?
Bad decomposition is the single most common failure mode. The signs: subagents that finish instantly with nothing (boundaries too narrow), subagents that time out repeatedly (boundaries too broad), or synthesis that cannot push any hypothesis past 30% confidence. The fix is operational — log every decomposition, review the ones that produced weak briefs, refine the templates over time.
Different models for different subagents — yes or no?
Yes. Route simple data-retrieval to faster, cheaper models. Reserve capable models for threads that need judgment — competitive analysis, hypothesis ranking. Anthropic's own system uses Claude Opus 4 for orchestration and Claude Sonnet 4 for subagents. One operational caution: mixing model families makes debugging harder when outputs look stylistically inconsistent. Standardize on one family per tier and document the routing logic. Future maintainers need to know why the SQL subagent runs on a different model than the competitive intelligence one.
How do I test before connecting real data sources?
Build mock subagents that return canned responses with varying status and completeness. This stress-tests synthesis and failure handling without burning API credits. Include at least one mock that returns partial data, one that fails outright, and one with contradictory findings. If the synthesis prompt cannot handle all three cleanly, it will not handle production either.
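A mock harness for that last question might look like the sketch below. It assumes a trimmed-down report shape, and every name in it (`MockReport`, `mockSubagents`, `runMocks`, the `analytics-check` thread) is hypothetical. It includes one partial report, one outright failure, and two complete reports that contradict each other.

```typescript
type MockStatus = "complete" | "partial" | "failed";

interface MockReport {
  taskId: string;
  status: MockStatus;
  completeness: number; // 0.0 to 1.0
  findings: string[];
}

// Canned subagents keyed by task id.
const mockSubagents: Record<string, () => Promise<MockReport>> = {
  "incident-log": async () => ({
    taskId: "incident-log",
    status: "partial",
    completeness: 0.7,
    findings: ["Payment timeout, duration approximate"],
  }),
  "release-notes": async () => {
    throw new Error("simulated API outage");
  },
  "experiment-tracker": async () => ({
    taskId: "experiment-tracker",
    status: "complete",
    completeness: 1,
    findings: ["Checkout variant underperformed control"],
  }),
  "analytics-check": async () => ({
    taskId: "analytics-check",
    status: "complete",
    completeness: 1,
    findings: ["Checkout variant matched control"], // deliberately contradicts the tracker
  }),
};

// Runs every mock concurrently and converts rejections into explicit "failed" reports.
async function runMocks(): Promise<MockReport[]> {
  const ids = Object.keys(mockSubagents);
  const settled = await Promise.allSettled(ids.map(id => mockSubagents[id]()));
  return settled.map((s, i): MockReport =>
    s.status === "fulfilled"
      ? s.value
      : { taskId: ids[i], status: "failed", completeness: 0, findings: [] },
  );
}
```

Feeding the output of `runMocks()` into the synthesis prompt shows whether the brief flags the failed thread, discounts the partial one, and surfaces the contradiction instead of averaging it away.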
Pick One Question. Build the Machine Around It.
Nobody adopts the orchestrator-subagent pattern wholesale. Pick one recurring question that costs the team more than two hours every time it surfaces. Map the data sources. Write the decomposition template. Build a minimal orchestrator that spawns subagents, collects reports, runs the synthesis prompt.
The first run will be rough. The decomposition will miss an angle. A subagent will fail in a way nobody anticipated. The synthesis confidence levels will feel arbitrary. All of that is expected, and all of it improves quickly with iteration on the templates.
The wall-clock will not feel rough. The first time a weighted brief lands in twenty minutes for a question that used to eat a morning, the pattern stops needing a sales pitch. After that, decomposing every complex question into parallel threads is no longer a choice — serial research starts feeling like coordination tax you no longer have to pay.
- [1] Anthropic: Building a Multi-Agent Research System (anthropic.com)
- [2] Microsoft: Multi-Agent Orchestrator and Sub-Agent Architecture (learn.microsoft.com)
- [3] Eesel: Subagent Orchestration Patterns (eesel.ai)
- [4] Maxim AI: Multi-Agent System Reliability: Failure Patterns, Root Causes, and Production Validation Strategies (getmaxim.ai)
- [5] Kanerika: AI Agent Orchestration (kanerika.com)
- [6] Machine Learning Mastery: 7 Agentic AI Trends to Watch in 2026 (machinelearningmastery.com)