The Inference Multiplier: Why Agent Costs Exceed Estimates

Your Agent Budget Is Built on One Call. Production Takes Twelve.

Why single-inference cost estimates fail for agentic workflows — the four-component inference multiplier (call count, context accumulation, tool schema overhead, retry tax) with concrete workflow examples and measurement patterns.

AI Engineering Platform · advanced · May 25, 2026 · 6 min read

By Viktor Bezdek · VP Engineering, Groupon

Every cost estimate for an agentic workflow starts the same way: tokens per call × model price × expected requests. That arithmetic is correct. It also describes a system that doesn't exist in production.

A procurement approval agent doesn't make one inference call per user request. It classifies the request, extracts vendor and budget details, checks the approved vendor list, verifies department budget, routes to the right approval tier, drafts the approval message, parses the manager response, runs a compliance check, writes to the PO system, and notifies the requestor. That's ten inference calls on the happy path. When the vendor lookup returns ambiguous results, add two more. Twelve calls is typical — and that's a well-scoped, single-purpose agent.

The same pattern appears everywhere. A code review agent working through diff analysis, context retrieval, security review, and comment synthesis makes 7–9 inference calls per PR. A customer support agent that retrieves history, classifies intent, drafts a response, and checks policy compliance makes 6–8. Multi-agent orchestration adds more on top.

The call count alone doesn't explain the full cost gap. Anthropic's engineering team measured production agent systems and found agents burn roughly 4× more tokens than direct chat interactions; multi-agent research pipelines burn 15× more^[1]. That gap isn't model pricing — it's the inference multiplier: a compound of call count, context accumulation, tool schema overhead, and retry tax. Understanding all four components is what separates a realistic budget from a billing surprise.

40% of agentic AI pilots get cancelled before production, and runaway inference costs are the most common reason^[2].

The Multiplier Has Four Components. Most Teams Model One.

Call count is visible. The other three are structural — they compound against every call you add.

Component 1: Call count. This is the part every team eventually notices. A workflow with 12 inference calls per user request costs at least 12× a single call — before anything else compounds. Most cost estimates stop here. The problem: call count is a multiplier on the three components below, not the final answer.

Component 2: Context accumulation. Every subsequent call in a multi-turn agent loop re-sends the full accumulated conversation history. Step 5 doesn't pay for step 5's reasoning — it pays for steps 1 through 5 in full, as input tokens. In a documented 5-step agent run, per-call input token counts grew: 888 → 3,400 → 8,900 → 14,200 → 18,900^[2]. The cost of step 5 alone exceeded steps 1 and 2 combined. The formula is triangular, not linear: total input tokens across N steps equals roughly N(N+1)/2 × average tokens per step. At 12 steps, the context accumulation factor is approximately 6.5× what a flat per-step estimate predicts.

Component 3: Tool schema overhead. Agent frameworks inject every registered tool's full schema into every inference call — whether or not the tool gets called. A single tool definition runs 550–1,400 tokens depending on description length and parameter detail^[3]. In multi-server MCP deployments, tool schema overhead reaches 10,000–60,000 tokens per call^[1], accounting for 60–80% of token usage in static toolsets. That overhead is billed on every step, before a single token of actual work happens.

Component 4: Retry tax. Failed tool calls don't disappear. The error, the model's recovery attempt, and all intermediate reasoning accumulate in context and get re-sent on every subsequent step. A 10% per-step failure rate, compounded across 10 steps without circuit breakers, multiplies costs several times over^[2]. One team cut their per-task tool call count from 14 to 2 by adding explicit SUCCESS/FAILED terminal states to tool responses — eliminating the retry loops that were consuming the majority of their budget.
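The context accumulation arithmetic in Component 2 can be checked directly against the per-call counts from the documented 5-step run. The following is a minimal sketch of that arithmetic only; the function names are hypothetical and nothing here calls a real API.

```typescript
// Per-call input token counts from the documented 5-step run cited above.
const perCallInput: number[] = [888, 3_400, 8_900, 14_200, 18_900];

// Total input tokens billed across the whole run.
function totalInputTokens(calls: number[]): number {
  return calls.reduce((sum, tokens) => sum + tokens, 0);
}

// Ratio of the measured total to a flat "step 1 × N" estimate.
function flatEstimateRatio(calls: number[]): number {
  if (calls.length === 0 || calls[0] === 0) return 0;
  return totalInputTokens(calls) / (calls[0] * calls.length);
}

// Triangular factor (N + 1) / 2: how much larger N(N+1)/2 is than N.
function triangularFactor(steps: number): number {
  return (steps + 1) / 2;
}

console.log(totalInputTokens(perCallInput)); // 46288
console.log(flatEstimateRatio(perCallInput).toFixed(1)); // 10.4
console.log(triangularFactor(12)); // 6.5
```

In that run, a budget built from the first call alone would have underestimated input tokens by roughly a factor of ten.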

4× tokens vs direct chat for single production agents (Anthropic production measurement, 2026^[1])

15× tokens vs direct chat for multi-agent research systems (Anthropic production measurement, 2026^[1])

60–80% of the token budget consumed by tool schemas alone in static toolsets (Speakeasy benchmarking, cited in Augment Code guide, 2026^[1])

Procurement Approval Agent: 12 Inference Calls per User Request

Each node is one inference call. Context size grows at every step — the cost of step 10 exceeds steps 1–4 combined. The retry path on ambiguous vendor lookup adds two more calls not visible on the happy path.

Context Accumulation Is O(N²), Not O(N). That's the Cost Surprise.

Linear per-step estimates are structurally wrong for any workflow beyond 3 steps

The structural reason cost estimates fail is that engineers price multi-step agent workflows as N independent inference calls. They are not independent. Each call carries the full history of everything that came before it, because that's how the API works.

For a workflow with N steps, average static context S (system prompt + tool schemas), average user input u per step, and average tool result r per step, total input token consumption follows:

Total input ≈ N·S + N·u + N(N+1)/2 · r

The triangular term N(N+1)/2 is the problem. For a 10-step workflow it's 55. For 12 steps it's 78. Multiply that by the average tool result size and you're paying for far more than 12 separate steps.

Apply this to a 12-step procurement agent on Sonnet 4.6 ($3/M input, $15/M output)^[6]. Reasonable assumptions: 3,000-token system prompt plus tool schemas, 200-token average user input per step, 500-token average tool result per step.

Naive estimate: 12 × (3,000 + 200 + 500) = 44,400 input tokens.

Formula estimate: N·S is 36,000; N·u is 2,400; N(N+1)/2 × r is 78 × 500 = 39,000. Total: approximately 77,400 input tokens — 1.74× the naive estimate before a single output token is counted.
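The two estimates above can be reproduced with a few lines of code, which makes it easy to plug in your own N, S, u, and r. This is a sketch of the formula as stated in the text; the interface and function names are hypothetical.

```typescript
interface WorkflowAssumptions {
  steps: number;         // N
  staticContext: number; // S: system prompt + tool schemas, in tokens
  userInput: number;     // u: average user input per step, in tokens
  toolResult: number;    // r: average tool result per step, in tokens
}

// Naive estimate: every step priced as an independent call.
function naiveInputTokens(w: WorkflowAssumptions): number {
  return w.steps * (w.staticContext + w.userInput + w.toolResult);
}

// Formula from the text: N·S + N·u + N(N+1)/2 · r
function accumulatedInputTokens(w: WorkflowAssumptions): number {
  const triangular = (w.steps * (w.steps + 1)) / 2;
  return (
    w.steps * w.staticContext +
    w.steps * w.userInput +
    triangular * w.toolResult
  );
}

const procurement: WorkflowAssumptions = {
  steps: 12,
  staticContext: 3_000,
  userInput: 200,
  toolResult: 500,
};

const naiveTotal = naiveInputTokens(procurement);         // 44400
const formulaTotal = accumulatedInputTokens(procurement); // 77400
console.log(naiveTotal, formulaTotal, (formulaTotal / naiveTotal).toFixed(2)); // 44400 77400 1.74
```

Only the r term grows quadratically here, so the gap between the two estimates widens as tool results get larger or steps are added.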

Add output tokens, the retry path on vendor ambiguity, and tool schema overhead from a moderately sized MCP toolset, and the real multiplier on the naive budget reaches 3–5× for this single workflow. That's before any multi-agent orchestration layer.

Step	Procurement agent input tokens (per call)	Code review agent input tokens (per call)	Driver of growth
1	~3,200	~4,500	System prompt + tool schemas + first message
2	~4,800	~7,200	Step 1 output + tool result added to context
3	~6,900	~10,400	All prior outputs accumulating
5	~12,000	~18,500	Context now roughly 4× the step 1 cost
8	~22,000	—	Every prior step re-billed; context dominates
10	~30,000	—	Step 10 alone exceeds steps 1–3 combined
12 (happy path)	~40,000	—	This one call approaches the entire naive 44,400-token estimate for all 12 steps

Instrument the Multiplier Before You Budget. Not After You're Billed.

Staging traces are the only honest baseline — spreadsheet estimates are wrong by construction for workflows beyond 3 steps

The only way to know your workflow's actual inference multiplier is to instrument it and measure from staging runs. A spreadsheet model can get the formula right; it cannot know your tool schema sizes, your retry rates, or your actual context accumulation slope.

The measurement pattern: wrap every inference call inside a parent workflow trace span, record per-call input and output token counts, track cumulative context size at each step boundary, and compute the ratio of total measured token spend to your step-1 token count × step count. That ratio is your multiplier.

OpenTelemetry's GenAI semantic conventions (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) give you per-call token data at the instrumentation layer^[7]. The gap in most observability setups: teams see per-call cost but not per-user-action cost. Aggregating across the full workflow under a parent span closes that gap.

measure-inference-multiplier.ts

import Anthropic from "@anthropic-ai/sdk";
import { trace, SpanStatusCode } from "@opentelemetry/api";

const anthropic = new Anthropic();
const tracer = trace.getTracer("agent-cost-tracker");

interface WorkflowCostProfile {
  workflowId: string;
  stepCount: number;
  totalInputTokens: number;
  totalOutputTokens: number;
  singleCallEstimate: number; // naive: step-1 tokens × N
  multiplier: number;         // actual total / naive estimate
  costUsd: number;
}

async function runStep(
  workflowSpan: ReturnType<typeof tracer.startSpan>,
  stepIndex: number,
  messages: Anthropic.MessageParam[],
  model: string
): Promise<{ content: string; inputTokens: number; outputTokens: number }> {
  return tracer.startActiveSpan(
    `workflow.step.${stepIndex}`,
    { attributes: { "workflow.step_index": stepIndex } },
    async (stepSpan) => {
      try {
        const response = await anthropic.messages.create({
          model,
          max_tokens: 2048,
          messages,
        });

        const { input_tokens: inputTokens, output_tokens: outputTokens } =
          response.usage;

        // Step-level visibility: useful for debugging context growth
        stepSpan.setAttributes({
          "gen_ai.usage.input_tokens": inputTokens,
          "gen_ai.usage.output_tokens": outputTokens,
          "workflow.context_size_tokens": inputTokens,
        });

        // Record a per-step event on the parent span for the workflow-level cost view
        workflowSpan.addEvent("step_complete", {
          step: stepIndex,
          input_tokens: inputTokens,
          output_tokens: outputTokens,
        });

        stepSpan.setStatus({ code: SpanStatusCode.OK });

        // Only the first content block is read; a non-text block yields ""
        const first = response.content[0];
        const content = first?.type === "text" ? first.text : "";
        return { content, inputTokens, outputTokens };
      } catch (err) {
        // Mark the step as failed so retries show up in the trace
        stepSpan.recordException(err as Error);
        stepSpan.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        // Always close the span, even when the API call throws
        stepSpan.end();
      }
    }
  );
}

export async function measureWorkflowMultiplier(
  workflowId: string,
  stepFunctions: Array<
    (history: Anthropic.MessageParam[]) => Promise<Anthropic.MessageParam>
  >,
  model = "claude-sonnet-4-6"
): Promise<WorkflowCostProfile> {
  return tracer.startActiveSpan(`workflow.${workflowId}`, async (workflowSpan) => {
    const history: Anthropic.MessageParam[] = [];
    let totalInput = 0;
    let totalOutput = 0;
    let firstStepInput = 0;

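    for (let i = 0; i < stepFunctions.length; i++) {
      const nextMessage = await stepFunctions[i](history);
      history.push(nextMessage);

      const { content, inputTokens, outputTokens } = await runStep(
        workflowSpan,
        i,
        history,
        model
      );

      // Feed the model's reply back into the history so later steps re-send it;
      // without this, the measured context never includes prior outputs.
      if (content) history.push({ role: "assistant", content });

      totalInput += inputTokens;
      totalOutput += outputTokens;
      if (i === 0) firstStepInput = inputTokens;
    }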

    // Sonnet 4.6: $3/M input, $15/M output
    const costUsd =
      (totalInput / 1_000_000) * 3 + (totalOutput / 1_000_000) * 15;

    // Naive estimate: step-1 input × N (what most spreadsheets compute)
    const naiveEstimate = firstStepInput * stepFunctions.length;
    // Multiplier: measured total input ÷ naive estimate, rounded to one decimal
    const multiplier =
      naiveEstimate > 0
        ? Math.round((totalInput / naiveEstimate) * 10) / 10
        : 0;

    const profile: WorkflowCostProfile = {
      workflowId,
      stepCount: stepFunctions.length,
      totalInputTokens: totalInput,
      totalOutputTokens: totalOutput,
      singleCallEstimate: naiveEstimate,
      multiplier,
      costUsd: Math.round(costUsd * 10_000) / 10_000,
    };

    // Surface the multiplier on the parent span for dashboard aggregation
    workflowSpan.setAttributes({
      "workflow.id": workflowId,
      "workflow.step_count": stepFunctions.length,
      "workflow.multiplier": multiplier,
      "workflow.total_input_tokens": totalInput,
      "workflow.cost_usd": profile.costUsd,
    });

    workflowSpan.setStatus({ code: SpanStatusCode.OK });
    workflowSpan.end();

    return profile;
  });
}

Naive estimate

Single average call cost × step count
Each step treated as independent
Tool schema overhead counted once
No retry path in the model
Multi-agent handoff overhead missing
Discovered to be wrong at billing time

Multiplier-aware estimate

Actual token spend measured from 20+ staging runs
Context accumulation formula applied per workflow
Tool schema overhead × step count, per step
P95 retry path included in ceiling
Per-agent handoff context cost measured separately
Multiplier ratio locked before production launch
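Turning 20+ staging runs into a P50 and a P95 is a small amount of code once each run has produced a multiplier. The sketch below uses a nearest-rank percentile; the function name is hypothetical and the sample multipliers are invented for illustration, not measurements.

```typescript
// Nearest-rank percentile over a list of measured values (p in 1..100).
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Hypothetical multipliers from 20 staging runs (illustrative values only).
const stagingMultipliers: number[] = [
  3.1, 3.2, 3.2, 3.3, 3.4, 3.4, 3.5, 3.5, 3.6, 3.6,
  3.7, 3.8, 3.8, 3.9, 4.0, 4.1, 4.3, 4.6, 5.2, 6.8,
];

const p50 = percentile(stagingMultipliers, 50); // 3.6
const p95 = percentile(stagingMultipliers, 95); // 5.2
console.log({ p50, p95 });
```

With a long tail like this one, budgeting against P50 would leave the retry-heavy runs unfunded, which is why the ceiling is set at P95.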

Three Interventions That Actually Move the Multiplier

Each targets a different component — in order of effort-to-impact ratio

Understanding the four-component multiplier tells you where to intervene. The highest-impact changes target components 2 and 3 — because they compound against every call in the workflow.

Dynamic tool loading (targets Component 3, tool schema overhead). The fastest win in most systems. Instead of injecting every registered tool's schema on every call, load only the tools relevant to the current step. One team reduced tool-definition overhead from 134,000 tokens to 8,700 per call by switching to dynamic loading^[2], which by those figures is a reduction of roughly 94% in that component. At 12 steps, that eliminates over 1.5M tokens of overhead across the full workflow. Implementation: classify the step type first with a cheap Haiku 4.5 call, then load only the tool subset needed for that step class. The classification call costs a few hundred tokens; the savings across subsequent steps return that cost many times over.
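One way to picture the tool-subset selection is a registry that tags each tool with the step classes that can use it. The sketch below assumes the step has already been classified; the step classes, tool names, and token sizes are all hypothetical.

```typescript
// Hypothetical step classes and tool registry; names and sizes are illustrative.
type StepClass = "lookup" | "approval" | "notify";

interface ToolDef {
  name: string;
  schemaTokens: number;     // measured size of the tool's schema, in tokens
  stepClasses: StepClass[]; // step classes that can actually call this tool
}

const registry: ToolDef[] = [
  { name: "vendor_lookup", schemaTokens: 900, stepClasses: ["lookup"] },
  { name: "budget_check", schemaTokens: 1_100, stepClasses: ["lookup", "approval"] },
  { name: "po_write", schemaTokens: 1_400, stepClasses: ["approval"] },
  { name: "send_notification", schemaTokens: 600, stepClasses: ["notify"] },
];

// Return only the tools relevant to the classified step.
function toolsForStep(step: StepClass, tools: ToolDef[]): ToolDef[] {
  return tools.filter((tool) => tool.stepClasses.some((c) => c === step));
}

// Schema overhead that would be injected into one call.
function schemaOverhead(tools: ToolDef[]): number {
  return tools.reduce((sum, tool) => sum + tool.schemaTokens, 0);
}

const staticOverhead = schemaOverhead(registry);                         // 4000
const notifyOverhead = schemaOverhead(toolsForStep("notify", registry)); // 600
console.log(staticOverhead, notifyOverhead);
```

The per-step subset is what gets passed as the tool list for that call, so the saving applies to every step whose class needs fewer tools than the full registry.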

Structured handoffs instead of full transcripts (targets Component 2 — context accumulation). Most agent loops pass full conversation transcripts between steps. Replacing those with structured JSON summaries — the decision made, the data retrieved, the constraints that apply — cuts context size per step by 50–70% while preserving what the next step actually needs. The formula doesn't change; the per-step r value in N(N+1)/2 × r does. Halving r on a 12-step workflow halves the accumulation component.
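A structured handoff can be as small as a typed record serialized into the next step's input. The following sketch assumes a hypothetical `StepHandoff` shape built from the three elements named above (decision, data, constraints); the field names and sample values are illustrative, not a standard format.

```typescript
// Hypothetical handoff shape; field names are illustrative.
interface StepHandoff {
  decision: string;                      // what the step decided
  data: Record<string, string | number>; // the data it retrieved
  constraints: string[];                 // constraints the next step must respect
}

// Render the handoff as the single user message the next step receives,
// in place of the full prior transcript.
function handoffMessage(
  fromStep: string,
  handoff: StepHandoff
): { role: "user"; content: string } {
  return {
    role: "user",
    content: `Handoff from ${fromStep}:\n${JSON.stringify(handoff)}`,
  };
}

const budgetHandoff: StepHandoff = {
  decision: "within_budget",
  data: { vendorId: "V-1042", amountUsd: 18_500, department: "engineering" },
  constraints: ["requires_director_approval"],
};

const nextInput = handoffMessage("verify_budget", budgetHandoff);
console.log(nextInput.content);
```

Because the record has a fixed shape, its token size stays roughly constant per step, which is what keeps r small in the accumulation term.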

Prompt caching on stable context (targets Components 2 and 3). System prompts and tool schemas are identical across calls within a workflow. Marking them with cache_control: { type: "ephemeral" } prices cache reads at 10% of standard input cost^[6]. For a 3,000-token system prompt across 12 steps, that's 36,000 tokens billed at full rate without caching, versus 3,000 tokens billed once as a cache write on step 1 and 33,000 billed at the cache-read rate on steps 2 through 12. One constraint worth knowing: the cache TTL is five minutes^[1]. Async approval flows and human-in-the-loop waits that span longer than five minutes will re-incur the cache write cost on the next step.
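The effect of caching on a stable prefix is easy to estimate before enabling it. The sketch below prices a 3,000-token prefix over 12 steps with and without caching. The $3/M base rate and the 10% cache-read rate come from the text; the 1.25× cache-write rate is an assumption that should be checked against current pricing, and the function names are hypothetical.

```typescript
// Pricing assumptions in USD per million input tokens.
const BASE_PER_M = 3;            // from the text
const CACHE_READ_FACTOR = 0.1;   // from the text: reads at 10% of standard input
const CACHE_WRITE_FACTOR = 1.25; // assumption: verify against current pricing

// Cost of re-sending a stable prefix on every step with no caching.
function uncachedPrefixCost(prefixTokens: number, steps: number): number {
  return ((prefixTokens * steps) / 1_000_000) * BASE_PER_M;
}

// Cost when step 1 writes the cache and every later step reads it.
// Assumes no gap between steps exceeds the cache TTL.
function cachedPrefixCost(prefixTokens: number, steps: number): number {
  if (steps <= 0) return 0;
  const write = (prefixTokens / 1_000_000) * BASE_PER_M * CACHE_WRITE_FACTOR;
  const reads =
    ((prefixTokens * (steps - 1)) / 1_000_000) * BASE_PER_M * CACHE_READ_FACTOR;
  return write + reads;
}

const withoutCache = uncachedPrefixCost(3_000, 12); // 36,000 tokens at full rate
const withCache = cachedPrefixCost(3_000, 12);      // 3,000 written + 33,000 read
console.log(withoutCache, withCache);
```

Under these assumptions the prefix costs roughly a fifth as much with caching, and the gap grows with step count because only the first call pays the write premium.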

Combining dynamic tool loading and prompt caching addresses components 2 and 3 without touching workflow logic. Teams that implement both report 70–85% reduction from unoptimized baselines^[4].

Pre-production multiplier audit

Inference call count mapped for both the happy path and the primary retry path
OpenTelemetry tracing enabled; per-call token counts visible per workflow run
Multiplier measured from at least 20 staging runs — not 1–3 benchmark calls
P95 token spend documented per workflow, not just P50
Tool schema overhead measured per call; dynamic loading evaluated where overhead exceeds 10,000 tokens
Prompt caching enabled for system prompt and stable tool schemas
Structured handoff format defined for each step boundary
Production cost estimate built from measured multiplier × expected request volume

Questions Teams Ask When They First See Their Multiplier

The answers don't get simpler the closer you look

What's a typical multiplier for a production agent with 8–10 steps?

Teams commonly report multipliers in the 6–15× range for well-designed single-purpose agents at 8–10 steps, measured against their initial single-call estimate. Multi-agent pipelines with handoffs push higher — up to 28× for a 4-agent team running 20 steps, according to published benchmarks^[5]. The only reliable way to know your specific number is to measure it from staging traces. Public benchmarks don't capture your tool schemas, retry rates, or context accumulation slope.

Should I measure the multiplier in tokens or dollars?

Tokens, with a per-model cost translation. The multiplier is a ratio that survives model price changes; a dollar-denominated multiplier becomes stale every time pricing updates. Compute cost from the token multiplier × current model pricing at budget time. This also makes routing impact easier to model: if staging shows a 6.5× token multiplier and you're routing steps 1–3 to Haiku 4.5 instead of Sonnet 4.6, you can apply the model cost ratio directly to the measured token budget without re-running the full benchmark.
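Keeping the measurement in tokens means a routing change can be priced by recomputing cost from the same per-step counts. The sketch below shows one way to do that. The Sonnet prices come from the text; the Haiku prices are placeholders to replace with current pricing, and the model keys, function name, and token counts are hypothetical.

```typescript
interface ModelPrice {
  inputPerM: number;  // USD per million input tokens
  outputPerM: number; // USD per million output tokens
}

interface StepTokens {
  input: number;
  output: number;
}

// Price a measured token profile under a given step-to-model routing.
function workflowCostUsd(
  steps: StepTokens[],
  modelForStep: (index: number) => string,
  prices: Record<string, ModelPrice>
): number {
  return steps.reduce((sum, step, i) => {
    const price = prices[modelForStep(i)];
    if (!price) throw new Error(`no price configured for step ${i}`);
    return (
      sum +
      (step.input / 1_000_000) * price.inputPerM +
      (step.output / 1_000_000) * price.outputPerM
    );
  }, 0);
}

const prices: Record<string, ModelPrice> = {
  sonnet: { inputPerM: 3, outputPerM: 15 }, // from the text
  haiku: { inputPerM: 1, outputPerM: 5 },   // placeholder, check current pricing
};

// Hypothetical measured profile for a 5-step workflow.
const measured: StepTokens[] = [
  { input: 3_200, output: 300 },
  { input: 4_800, output: 400 },
  { input: 6_900, output: 350 },
  { input: 9_500, output: 500 },
  { input: 12_000, output: 600 },
];

const allSonnet = workflowCostUsd(measured, () => "sonnet", prices);
const routed = workflowCostUsd(measured, (i) => (i < 3 ? "haiku" : "sonnet"), prices);
console.log(allSonnet, routed);
```

When pricing changes, only the `prices` table needs updating; the measured token profile and the multiplier stay valid.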

My workflow uses MCP — how does that change the multiplier?

MCP multi-server deployments tend to carry the highest tool schema overhead of any agent pattern, because each connected server contributes its full schema to every call. Measured deployments have found 10,000–60,000 tokens of tool-schema overhead per call^[1] in multi-server MCP setups — before any user input or reasoning tokens. If you have more than 3–4 MCP servers connected, dynamic tool loading isn't optional; it's the primary cost control. Measure the schema overhead of your specific MCP configuration explicitly — it's often larger than everything else in the workflow combined.

At what multiplier does a workflow have a design problem rather than an optimization problem?

A multiplier above 20× on a single-purpose agent (not a multi-agent pipeline) usually signals a design problem: too many steps with no shared context compression, tool schemas injected that the workflow doesn't need, or retry loops without terminal states. Optimization — caching, dynamic tool loading, structured handoffs — can reduce the multiplier by 50–70%; only redesign reduces it further. The diagnostic test: after applying all three interventions, if the multiplier is still above 15×, audit the step count and tool schema surface area before accepting the cost as fixed.

The inference multiplier is the real unit of measurement for agent cost engineering. A 12-step workflow doesn't have a cost — it has a multiplier, and that multiplier interacts with model tier, tool schema size, and retry rate in ways a single-call benchmark cannot capture.

Measure the multiplier from staging traces. Document it per workflow, at P95. Set the production budget ceiling against that number, not against the per-call estimate. The teams that avoid billing surprises share one discipline: they treat multiplier measurement as a launch gate, not a post-incident activity.

Key terms in this piece

agent inference cost multiplier · LLM multi-agent hidden costs · context accumulation token cost · agentic workflow cost estimation · production agent cost tracing · inference cost per user action

Sources

[1] Augment Code, "Multi-Agent Cost Compounding: Why 3 Agents Cost 10x" (augmentcode.com)
[2] Tian Pan, "The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think" (tianpan.co)
[3] Tian Pan, "The Hidden Token Tax: How Overhead Silently Drains Your LLM Context Window" (tianpan.co)
[4] Zylos Research, "AI Agent Cost Engineering: Production Token Economics" (zylos.ai)
[5] CostLayer, "Multi-Agent AI Costs 4x More: Token Bloat Hidden Expense" (costlayer.ai)
[6] Anthropic, "Claude API Pricing", Anthropic Documentation (docs.anthropic.com)
[7] DEV Community, "LLM Cost Monitoring with OpenTelemetry" (dev.to)
