Why single-inference cost estimates fail for agentic workflows — the four-component inference multiplier (call count, context accumulation, tool schema overhead, retry tax) with concrete workflow examples and measurement patterns.
Why N-step agents cost far more than N × single-call price — the O(N²) accumulation math
All four multiplier components: call count, context growth, tool schema overhead, retry tax
How to instrument and measure your actual multiplier from staging runs before launch
Three interventions — dynamic tool loading, structured handoffs, prompt caching — with real reduction numbers
A decision table for when each intervention applies and a pre-production audit checklist
Failure modes: retry storms, async cache invalidation, and context drift at step boundaries
Every cost estimate for an agentic workflow starts the same way: tokens per call × model price × expected requests. That arithmetic is correct. It's also describing a system that doesn't exist in production.
A procurement approval agent doesn't make one inference call per user request. It classifies the request, extracts vendor and budget details, checks the approved vendor list, verifies department budget, routes to the right approval tier, drafts the approval message, parses the manager response, runs a compliance check, writes to the PO system, and notifies the requestor. Ten inference calls on the happy path. When the vendor lookup returns ambiguous results, add two more. Twelve calls is typical — and that's a well-scoped, single-purpose agent.
The same pattern appears everywhere: a code review agent working through diff analysis, context retrieval, security review, and comment synthesis makes 7–9 inference calls per PR. A customer support agent that retrieves history, classifies intent, drafts a response, and checks policy compliance makes 6–8. Multi-agent orchestration stacks more on top.
Call count alone doesn't explain the full cost gap. Agentic models require 5–30× more tokens per task than a standard chat interaction; multi-agent research pipelines push higher[8]. That gap isn't model pricing — it's the inference multiplier: a compound of call count, context accumulation, tool schema overhead, and retry tax. Understanding all four components is what separates a realistic budget from a billing surprise.
One team running a LangChain four-agent research pipeline discovered this the hard way: the agents entered an infinite retry loop and ran for eleven days before anyone noticed. The bill was $47,000[8].
Call count is visible. The other three are structural — they compound against every call you add.
Component 1: Call count. This is the part every team eventually notices. A workflow with 12 inference calls per user request costs at least 12× a single call — before anything else compounds. Most cost estimates stop here. The problem: call count is a multiplier on the three components below, not the final answer.
Component 2: Context accumulation. Every subsequent call in a multi-turn agent loop re-sends the full accumulated conversation history. Step 5 doesn't pay for step 5's reasoning — it pays for steps 1 through 5 in full, as input tokens. In a documented 5-step agent run, per-call input token counts grew: 888 → 3,400 → 8,900 → 14,200 → 18,900[2]. The cost of step 5 alone exceeded steps 1 and 2 combined. The formula is triangular, not linear: total input tokens across N steps equals roughly N(N+1)/2 × average tokens per step. At 12 steps, the context accumulation factor is approximately 6.5× what a flat per-step estimate predicts.
Component 3: Tool schema overhead. Agent frameworks inject every registered tool's full schema into every inference call — whether or not the tool gets called. A single tool definition runs 550–1,400 tokens depending on description length and parameter detail[3]. In multi-server MCP deployments, the problem scales fast: three MCP servers at 30 tools each means 90 tool definitions injected before a single character of user input. At 200 tokens per definition, that's 18,000 tokens of overhead per call — before any reasoning tokens[9]. Scale to ten servers and the number climbs past 100,000 tokens per request. In static toolsets without dynamic loading, tool schema overhead accounts for 60–80% of total token usage[1].
Component 4: Retry tax. Failed tool calls don't disappear. The error, the model's recovery attempt, and all intermediate reasoning accumulate in context and get re-sent on every subsequent step. Most agent frameworks — LangChain, LlamaIndex, the OpenAI and Anthropic tool-use SDKs — retry by replaying conversation history, not by retrying the single failed step in isolation[8]. A 10% per-step failure rate, compounded across 10 steps without circuit breakers, multiplies costs several times over. The compounding is quadratic in context size, not linear in step count — that's the part practitioners miss. One team cut their per-task tool call count from 14 to 2 by adding explicit SUCCESS/FAILED terminal states to tool responses, eliminating retry loops that were consuming the majority of their budget[2].
Gartner analysis, March 2026: range spans single-purpose agents to multi-agent research pipelines[8]
Measured across multi-server MCP deployments; overhead precedes any user input or reasoning tokens[1]
January 2026 multi-provider evaluation of prompt caching for agentic tasks[10]
Linear per-step estimates are structurally wrong for any workflow beyond 3 steps
The structural reason cost estimates fail is that engineers price multi-step agent workflows as N independent inference calls. They're not independent. Each call carries the full history of everything that came before it, because that's how the API works.
For a leadership audience, the math does not need to live in a code-style box. The practical rule is simpler: every additional step carries the history of the steps before it, so the budget grows faster than the visible workflow count.
A 10-step workflow does not behave like 10 independent calls. The repeated context behaves closer to 55 units of accumulated history. A 12-step workflow behaves closer to 78 units. That is why the invoice feels disconnected from the initial estimate: the team priced the visible steps, not the accumulated history each step drags forward.
Apply this to a 12-step procurement agent on claude-sonnet-4-6 ($3/M input, $15/M output)[6]. Reasonable assumptions: 3,000-token system prompt plus tool schemas, 200-token average user input per step, 500-token average tool result per step.
Naive estimate: 12 × (3,000 + 200 + 500) = 44,400 input tokens.
Formula estimate: N·S is 36,000; N·u is 2,400; N(N+1)/2 × r is 78 × 500 = 39,000. Total: approximately 77,400 input tokens — 1.74× the naive estimate before a single output token is counted.
Add output tokens, the retry path on vendor ambiguity, and tool schema overhead from a moderately sized MCP toolset, and the real multiplier on the naive budget reaches 3–5× for this single workflow. That's before any multi-agent orchestration layer.
This compounds further when you account for token price asymmetry. Output tokens cost 5× more than input tokens on Sonnet-class models. At high step counts, output token costs dominate — and unlike input tokens, output tokens don't benefit from prompt caching.
| Step | Procurement Agent Input Tokens | Code Review Agent Input Tokens | Driver of growth |
|---|---|---|---|
| 1 | ~3,200 | ~4,500 | System prompt + tool schemas + first message |
| 2 | ~4,800 | ~7,200 | Step 1 output + tool result added to context |
| 3 | ~6,900 | ~10,400 | All prior outputs accumulating |
| 5 | ~12,000 | ~18,500 | Context now larger than step 1 initial cost |
| 8 | ~22,000 | — | Every prior step re-billed; context dominates |
| 10 | ~30,000 | — | Step 10 alone exceeds steps 1–3 combined |
| 12 (happy path) | ~40,000 | — | 3–4× the naive 44,400-token estimate |
Static tool loading in multi-server MCP is a cost multiplier hiding in plain sight
MCP's tool-injection model has a predictable failure mode at enterprise scale: every connected server contributes its full schema to every inference call, regardless of which tools the current step needs.
A GitHub MCP server with 30 tools, each averaging 550 tokens, contributes 16,500 tokens of overhead per call — before a single token of user input. Add a Jira server (20 tools × 800 tokens average = 16,000 tokens) and a Salesforce server (25 tools × 700 tokens = 17,500 tokens) and you have 50,000 tokens of schema overhead on every step of a 12-step workflow. Across the full workflow, that's 600,000 tokens of pure overhead — at Sonnet pricing, $1.80 per workflow run[9].
This pattern explains why tool schema overhead can dwarf all other cost components in enterprise MCP deployments. The measured range in production multi-server setups is 10,000–100,000+ tokens per call[1][9].
The fix is dynamic tool loading: classify the current step type first using a cheap model (Haiku 4.5 at $0.25/M), then inject only the tool subset relevant to that step class. A classification call costs a few hundred tokens; the savings on subsequent steps return that cost many times over. The GitHub MCP server that consumed 55,000 tokens per call in a static deployment drops to roughly 1,000 tokens per call with lazy-loaded schemas — a 55× reduction in schema overhead for that component alone[9].
| Intervention | Primary target | When to apply | When NOT to apply | Typical reduction |
|---|---|---|---|---|
| Dynamic tool loading | Component 3 (tool schemas) | MCP with 3+ servers, tool schema > 10K tokens/call | Simple single-server setups where all tools are used every step | 50–85% of schema overhead[9] |
| Structured handoffs | Component 2 (accumulation) | Workflows > 5 steps; tool results > 1K tokens each | Workflows where downstream steps require reasoning over full transcript | 50–70% of accumulation cost[11] |
| Prompt caching | Components 2 and 3 (static context) | System prompt > 1,024 tokens; same system prompt across many calls | Steps separated by >5 min (cache invalidates); highly dynamic system prompts | 41–80% of cached input cost[10] |
| Model tier routing | Component 1 (call cost) | Steps have clearly different reasoning requirements | Steps where you can't classify difficulty cheaply enough to offset routing cost | 30–60% of total call cost |
Retry tax doesn't compound linearly — it cascades
Most engineers model retry costs as a simple percentage uplift: 10% failure rate means 10% more cost. That's wrong. The error accumulates in context and gets re-sent on every subsequent step.
Here's the mechanics. Step 7 fails — the tool returns a malformed response. The agent appends the error to conversation history and retries. The retry itself costs step-7 context size × the full input token price. If it fails again, both the original error and the first retry error are now in context for retry 2. The cost of each retry grows with the number of prior failures.
Layer on retries at multiple levels — SDK retries inside a tool call, middleware retries wrapping the tool, agent-level retries wrapping the loop — and you hit multiplicative chains. Three retries per layer across five chained tools is the canonical retry storm: a single user request can produce up to 3^5 = 243 backend calls in the worst case[8].
The four failure patterns that trigger the worst retry storms:
Budget protection from retry storms requires circuit breakers at the workflow level, not just the call level. A per-workflow token budget ceiling that triggers graceful exit is the floor; an escalation path that pages the on-call before the budget is exhausted is the ceiling.
Staging traces are the only honest baseline — spreadsheet estimates are wrong by construction for workflows beyond 3 steps
The only way to know your workflow's actual inference multiplier is to instrument it and measure from staging runs. A spreadsheet model can get the formula right; it cannot know your tool schema sizes, your retry rates, or your actual context accumulation slope.
The measurement pattern: wrap every inference call inside a parent workflow trace span, record per-call input and output token counts, track cumulative context size at each step boundary, and compute the ratio of total measured token spend to your step-1 token count × step count. That ratio is your multiplier.
OpenTelemetry's GenAI semantic conventions (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) give you per-call token data at the instrumentation layer[7]. As of March 2026, most GenAI conventions are in experimental status, but the token usage attributes are stable enough for production cost tracking. The gap in most observability setups: teams see per-call cost but not per-user-action cost. Aggregating across the full workflow under a parent span closes that gap.
Run at least 20 staging executions — not 3. The distribution of per-workflow cost is heavy-tailed; a 3-run benchmark will almost certainly miss the retry paths that drive P95 spend.
For teams using Python, the same pattern maps directly to the Anthropic SDK:
Single average call cost × step count
Each step treated as independent
Tool schema overhead counted once
No retry path in the model
Multi-agent handoff overhead missing
Discovered to be wrong at billing time
Actual token spend measured from 20+ staging runs
Context accumulation formula applied per workflow
Tool schema overhead × step count, per step
P95 retry path included in ceiling
Per-agent handoff context cost measured separately
Multiplier ratio locked before production launch
Each targets a different component — in order of effort-to-impact ratio
Understanding the four-component multiplier tells you where to intervene. The highest-impact changes target components 2 and 3 — they compound against every call in the workflow.
Dynamic tool loading (targets Component 3 — tool schema overhead). The fastest win in most systems. Instead of injecting every registered tool's schema on every call, load only the tools relevant to the current step. In a GitHub MCP server deployment, switching from static to lazy-loaded schemas cut tool schema tokens from 55,000 to roughly 1,000 per initialization — a 55× reduction for that component[9]. Implementation: classify the step type first with a cheap Haiku 4.5 call ($0.25/M), then load only the tool subset needed for that step class. The classification call costs a few hundred tokens; the savings across subsequent steps return that cost many times over.
Structured handoffs instead of full transcripts (targets Component 2 — context accumulation). Most agent loops pass full conversation transcripts between steps. Replacing those with structured JSON summaries — the decision made, the data retrieved, the constraints that apply — cuts context size per step by 50–70% while preserving what the next step actually needs[11]. The formula doesn't change; the per-step r value in N(N+1)/2 × r does. Halving r on a 12-step workflow halves the accumulation component. The production consensus as of 2025–2026: anchored iterative summarization, where each compression updates a persistent JSON document rather than regenerating it from scratch, gives the most consistent results[11].
Prompt caching on stable context (targets Components 2 and 3). System prompts and tool schemas are identical across calls within a workflow. Marking them with cache_control: { type: "ephemeral" } prices cache reads at 10% of standard input cost[6]. A January 2026 multi-provider evaluation measured 41–80% cost reduction in long-horizon agentic tasks from prompt caching alone[10]. One constraint: the Anthropic cache TTL is five minutes[6]. Async approval flows and human-in-the-loop waits that span longer than five minutes re-incur the cache write cost on the next step — factor this into workflows with human review stages.
Combining dynamic tool loading and prompt caching addresses components 2 and 3 without touching workflow logic. Teams that implement both report 70–85% reduction from unoptimized baselines[4].
Operational heuristics you can apply before the next sprint
A token multiplier survives model price changes; a dollar multiplier goes stale every time pricing updates. Compute cost from the token multiplier × current model pricing at budget time. When you route steps to cheaper models, apply the model cost ratio to the measured token budget directly.
Treat multiplier measurement from 20+ staging runs as a non-negotiable pre-launch requirement — same as load testing. If you don't know the P95 token spend per workflow before launch, you don't know your cost baseline.
A global API spend limit protects your account but not individual workflows. Set a token budget ceiling at the workflow level — in code, before the loop — that triggers a graceful exit and surfaces the partial result rather than running to completion at any cost.
At $3/M input tokens, 10,000 tokens of schema overhead per call costs $0.00003 per call — trivial for tens of calls, significant for millions. Measure the threshold at which dynamic loading ROI turns positive for your volume and apply accordingly.
After applying caching, dynamic tool loading, and structured handoffs, a multiplier above 15× usually points to step count inflation or unnecessarily large tool schemas — optimization won't fix it, redesign will.
Steps that classify, route, or check preconditions — not those that reason deeply or synthesize — run adequately on Haiku 4.5 at $0.25/M input. That's 12× cheaper than Sonnet for the steps that don't need Sonnet. Measure quality before routing; don't assume.
The answers don't get simpler the closer you look
What's a typical multiplier for a production agent with 8–10 steps?
Teams commonly report multipliers in the 6–15× range for well-designed single-purpose agents at 8–10 steps, measured against their initial single-call estimate. Multi-agent pipelines with handoffs push higher. The only reliable way to know your specific number is to measure it from staging traces. Public benchmarks don't capture your tool schemas, retry rates, or context accumulation slope — and those three variables dominate the actual multiplier.
Should I measure the multiplier in tokens or dollars?
Tokens, with a per-model cost translation. The multiplier is a ratio that survives model price changes; a dollar-denominated multiplier becomes stale every time pricing updates. Compute cost from the token multiplier × current model pricing at budget time. This also makes routing impact easier to model: if staging shows a 6.5× token multiplier and you're routing steps 1–3 to Haiku 4.5 instead of Sonnet 4.6, apply the model cost ratio directly to the measured token budget without re-running the full benchmark.
My workflow uses MCP — how does that change the multiplier?
MCP multi-server deployments carry the highest tool schema overhead of any agent pattern, because each connected server contributes its full schema to every call. Measured deployments have found 10,000–100,000+ tokens of tool-schema overhead per call in multi-server MCP setups[1][9] — before any user input or reasoning tokens. If you have more than 3–4 MCP servers connected, dynamic tool loading isn't optional; it's the primary cost control. Measure the schema overhead of your specific MCP configuration explicitly — it's often larger than everything else in the workflow combined.
At what multiplier does a workflow have a design problem rather than an optimization problem?
A multiplier above 20× on a single-purpose agent (not a multi-agent pipeline) usually signals a design problem: too many steps with no shared context compression, tool schemas injected that the workflow doesn't need, or retry loops without terminal states. Optimization — caching, dynamic tool loading, structured handoffs — can reduce the multiplier by 50–70%; only redesign reduces it further. The diagnostic test: after applying all three interventions, if the multiplier is still above 15×, audit the step count and tool schema surface area before accepting the cost as fixed.
Does the multiplier calculation change for streaming responses?
The token math is identical — streaming doesn't change how tokens are counted or billed. What changes is when you can measure it. With streaming, you get token counts in the final message_delta event rather than in a synchronous response object. Structure your tracing to collect usage from the stream's final event, not from intermediate chunks. The multiplier formula and measurement methodology are the same either way.
How do I handle the 5-minute prompt cache TTL in long-running workflows?
Design for cache misses at any step boundary that might involve human review or external system waits. Cache the system prompt and static tool schemas as usual, but budget for a cache write penalty on any step that follows a wait exceeding five minutes. In practice, structure human-in-the-loop workflows so the human-facing step is the terminal action of one agent and the continuation is the first action of the next — each continuation writes a fresh cache entry and benefits from cache reads for the remainder of its run[6].
The inference multiplier is the real unit of measurement for agent cost engineering. A 12-step workflow doesn't have a cost — it has a multiplier, and that multiplier interacts with model tier, tool schema size, and retry rate in ways a single-call benchmark can't capture.
Measure the multiplier from staging traces. Document it per workflow, at P95. Set the production budget ceiling against that number, not against the per-call estimate. The teams that avoid billing surprises share one discipline: they treat multiplier measurement as a launch gate, not a post-incident activity.
When production agents fail, teams default to prompt tuning regardless of structural root cause. This MAST-based triage protocol gives engineering leaders three speed-ordered checks — 30 seconds, 5 minutes, 20 minutes — each routing to a different structural owner before anyone changes a line.
MAST's 14 agent failure modes cluster into 3 structural categories, each preventable at a different pre-production stage. This playbook maps them to 12 deployment gate questions with pass criteria and named ownership.
Why frontier model defaults bloat inference bills, and the per-task quality SLO framework that makes model selection explicit, testable, and owned — instead of inherited from prototype defaults.