Inference Budget Governance for Scaling AI Agents

Inference Budget Governance: The Hidden Finance Problem in Scaling Agents

Why production inference bills always exceed estimates — and the Finance-Engineering governance framework for per-agent budgets, model routing, and cost forecasting without capability degradation.

Governance & Adoption · advanced · Apr 30, 2026 · 7 min read

By Viktor Bezdek · VP Engineering, Groupon

Three months after shipping their research pipeline, a team found an $89,000 API bill. The agents were running successfully — every quality metric was green, latency was acceptable, users were happy. But each research request was feeding a fresh 150,000-token context window to Claude Opus on every reasoning step. Nobody modeled that. The bill arrived on Tuesday. The CFO asked for an explanation on Wednesday. By Thursday, the team had manually throttled the pipeline to a trickle.

This is inference bill shock, and it follows a pattern almost every team scaling agents eventually hits. They build cost estimates from single-call benchmarks — accurate for a clean system prompt plus response. Then they deploy agents that iterate: accumulating context across steps, spinning up subagents that inherit that context, retrying on failures without any budget awareness. A 10-turn research agent with 20,000 tokens of context per turn doesn't cost 200,000 tokens — it costs 1.1 million, because each step retransmits everything that came before it.

Agents make 3–10x more LLM calls than simple chatbots per user request^[1], output tokens cost 3–5x more than input tokens across major providers^[2], and multi-agent pipelines multiply these effects by the number of agents operating in parallel. The combination can turn a $0.05 per-request estimate into a $5 per-request reality. Inference budget governance, meaning token capacity treated as a managed resource with per-agent quotas, model routing policies, and a Finance-Engineering operating cadence, is how teams stop discovering this math on the billing page.

Why production inference bills always exceed estimates

The structural reasons single-agent benchmarks fail to predict multi-agent costs

The core problem isn't that agents are expensive — it's that single-agent cost models don't predict multi-agent costs at all.

When engineering estimates cost per request, they measure a clean single call: system prompt + user message + response. That measurement is accurate. It just doesn't describe what happens in production, where agents iterate.

A research agent analyzing a competitor landscape doesn't ask one question. It asks eight, accumulates the answers into a growing context window, then runs a synthesis pass against all of it. Each reasoning step retransmits the full accumulated context. Add multi-agent orchestration, and the multiplier compounds: each subagent receives a context including the orchestrator's full reasoning trace, its own history, and every tool call output accumulated so far. A three-agent pipeline doesn't cost 3x the single-agent price. When each agent in the chain inherits everything upstream, it costs closer to 6x (1 + 2 + 3 units of context), because every fan-out step carries accumulated context from everything upstream.
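The accumulation arithmetic can be checked in a few lines of TypeScript. This is a minimal sketch under a simplifying assumption that does not come from any measured trace: every turn adds the same number of new tokens, and the full history is resent on each call. The function names are illustrative.

```typescript
// Total input tokens billed when every turn resends the whole history.
// Assumption: each turn adds a fixed `tokensPerTurn` of new context.
function cumulativeInputTokens(turns: number, tokensPerTurn: number): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    // Turn k pays for its own new tokens plus the k-1 turns before it.
    total += turn * tokensPerTurn;
  }
  return total;
}

// Naive estimate that ignores retransmission.
function naiveInputTokens(turns: number, tokensPerTurn: number): number {
  return turns * tokensPerTurn;
}

const billed = cumulativeInputTokens(10, 20000);
const naive = naiveInputTokens(10, 20000);
console.log({ billed, naive, multiplier: billed / naive });
```

For 10 turns at 20,000 tokens each, the billed total is 1.1 million against a naive 200,000. The multiplier grows with the number of turns, which is why iteration depth matters more to the forecast than single-call size.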

Then there's the loop problem. In November 2025, two LangChain-based agents entered an infinite conversation cycle that ran for 11 days, generating a $47,000 API bill before anyone caught it^[3]. According to a 2025 State of AI Cost Management survey, 80% of enterprises underestimate their AI infrastructure costs by more than 25%^[4]. The teams that avoid surprises share one trait: they treat inference budget as a first-class design constraint before launch, not a monitoring dashboard to check after something goes wrong.

96% of enterprises report that AI costs exceeded their initial projections^[1]. That's not a forecasting failure — it's an architectural one. The cost model wasn't wrong; it wasn't modeling the right system.

96%: Enterprises reporting AI costs exceeded initial projections (Zylos Research, Feb 2026)

3–10×: More LLM calls per request in agentic vs. chat systems (Zylos Research, Feb 2026)

$47,000: Cost of a single agent loop left uncaught for 11 days (LangChain incident, November 2025)

The three cost failure modes in production agent systems

Each is preventable — but only if identified before launch, not during incident review

Most inference budget problems trace to three structural failures that compound each other.

Context accumulation without pruning. Multi-turn agents retransmit full conversation history on every step. A 20-step research agent with 10,000 tokens of context per turn spends 2.1 million input tokens total, not 200,000, because step N pays for all N-1 previous steps plus the new input (10,000 × (1 + 2 + … + 20) = 10,000 × 210). The fix isn't reducing context size; it's passing structured summaries between steps rather than raw transcripts. Agent output contracts (structured JSON rather than conversational history) cut this multiplier significantly.

Reflexive frontier model use. Teams default to their best model for everything, which feels safe but doesn't make economic sense. Sonnet 4.6 runs at $3/$15 per million tokens; Opus 4.6 runs at $5/$25^[5] — a 1.67x per-token gap. On tasks where Sonnet delivers 98% of Opus quality (most everyday coding, classification, and content generation), using Opus is a 67% cost premium for near-zero quality gain. Routing 90% of production requests to Sonnet instead of Opus cuts that portion of the bill by 40%.

Hard stops without graceful degradation. Engineering teams often implement budget caps as exceptions: when the limit is hit, the agent throws an error. Hard stops cause silent failures, retry storms, and compounding costs. Users see errors and retry, and the retry itself may breach the next budget tier. The correct pattern is a three-tier response: warn at 80% of budget, downgrade model and compress context at 90%, return a partial result at 100%. An honest partial result preserves more user trust than an unhandled exception.
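The three-tier response reduces to a small pure function that an agent runtime can call before each step. The sketch below is one possible shape, not a prescribed implementation; the type and function names are hypothetical, and the default thresholds simply mirror the 80% and 90% figures above.

```typescript
type BudgetAction = "proceed" | "warn" | "downgrade" | "partial";

interface BudgetThresholds {
  warn: number;      // fraction of budget that triggers a warning, e.g. 0.8
  downgrade: number; // fraction that triggers model downgrade, e.g. 0.9
}

// Decide what the agent should do before its next step.
function budgetAction(
  tokensUsed: number,
  maxTokens: number,
  thresholds: BudgetThresholds = { warn: 0.8, downgrade: 0.9 }
): BudgetAction {
  // A missing or zero budget is treated as exhausted rather than unlimited.
  if (maxTokens <= 0) return "partial";
  const ratio = tokensUsed / maxTokens;
  if (ratio >= 1) return "partial";                      // return what has been computed
  if (ratio >= thresholds.downgrade) return "downgrade"; // cheaper model, compressed context
  if (ratio >= thresholds.warn) return "warn";           // log only, no behavior change
  return "proceed";
}
```

Keeping the decision separate from the model call makes the policy easy to unit-test and easy for Finance to review, since the thresholds live in one place.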

Inference Budget Governance: Request Flow

Every inference request passes through budget check, task classification, and model routing before executing. Cost is attributed per-agent and per-workflow, then aggregated for anomaly detection and weekly governance review.

Building the Finance-Engineering contract before launch

The organizational agreement that prevents bill shock — structured for teams that don't know where to start

The organizational piece of inference budget governance is systematically underbuilt. Platform engineers implement cost controls. Finance tracks monthly invoices. Neither talks to the other until the bill is already a surprise. By that point, the conversation is adversarial rather than collaborative.

The contract that prevents this has three components.

A shared cost model built before launch. Engineering provides Finance with: estimated token counts per agent task class at the P50 and P95 percentiles, expected request volume by workflow, and model tier assignments per task type. Finance provides: a monthly budget envelope per product area and escalation thresholds that trigger review. The shared output is a per-workflow cost forecast. P50 is the operating target; P95 defines the alert boundary. Without this document, nobody knows whether a 40% month-over-month cost increase signals success (volume grew 60%) or failure (efficiency collapsed).

Per-agent cost attribution. Every inference request needs to emit a structured cost event: agent ID, workflow ID, product area, model tier, input tokens, output tokens, and cache hit status. Without attribution, you can see total spend but not the cause. With it, anomalies become diagnosable: "the document analysis agent in the compliance workflow consumed 3x its P95 budget because a new document format triggered verbose extraction chains" is actionable. "Our total API spend was 40% over forecast" is not.

A weekly governance cadence. Finance and platform engineering review the cost report together — weekly, not monthly. The latency between anomaly and detection should be measured in days. A team reviewing weekly catches a runaway agent in the current sprint. A team reviewing monthly catches it after four weeks of damage. This cadence also builds the usage history needed to negotiate provider discounts: committed spend tiers typically yield 15–30% reductions at moderate volumes^[2], but you need reliable projections to negotiate confidently.
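A per-workflow forecast of the kind described in the shared cost model can be derived directly from trace samples. The following is a rough sketch under stated assumptions: percentiles use the nearest-rank method, a single blended price per million tokens stands in for separate input and output prices, and the names `percentile` and `forecastWorkflow` are illustrative.

```typescript
// Nearest-rank percentile over observed per-request token counts.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  // Clamp so p = 0 and p = 100 stay inside the array.
  const index = Math.min(sorted.length, Math.max(1, rank)) - 1;
  return sorted[index];
}

interface WorkflowForecast {
  p50MonthlyUsd: number; // operating target
  p95MonthlyUsd: number; // alert boundary
}

// Assumption: one blended USD price per million tokens for the workflow.
function forecastWorkflow(
  tokensPerRequest: number[],
  requestsPerMonth: number,
  usdPerMTok: number
): WorkflowForecast {
  const toUsd = (tokens: number): number =>
    (tokens * requestsPerMonth * usdPerMTok) / 1000000;
  return {
    p50MonthlyUsd: toUsd(percentile(tokensPerRequest, 50)),
    p95MonthlyUsd: toUsd(percentile(tokensPerRequest, 95)),
  };
}
```

Feeding it a week of staging traces per task class gives Finance two numbers per workflow, which is usually enough to start the budget conversation.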

Reactive cost management

Cost modeled from single-call benchmarks
Provider spend caps as the only guardrail
Monthly billing review catches overruns after the fact
Frontier model used by default for safety
No attribution below the team level
Engineering owns cost; Finance sees invoices

Inference budget governance

Per-workflow cost forecasts with P50 and P95 projections
SDK-level per-agent quota enforcement with graceful degradation
Weekly Finance-Engineering review with anomaly triage
Task-class model routing: Haiku → Sonnet → Opus by complexity
Cost attributed to agent ID, workflow ID, and product area
Shared cost model with Finance sign-off before launch

Per-agent budget enforcement in code

SDK-level controls with graceful degradation — the implementation teams actually ship

Budget enforcement has to happen at the SDK layer. Monitoring tells you what you spent. SDK-level enforcement controls what you can spend — and shapes how the agent behaves when approaching a limit.

The minimal pattern: wrap every agent execution with a budget tracker that holds a per-trace token ceiling. At 80% of the ceiling, the tracker logs a warning. At 90%, the agent downgrades to a cheaper model and compresses working context. At 100%, it returns a partial result rather than an exception. The user gets a degraded-but-useful response. No retry storm. No silent failure.

budget-enforced-agent.ts

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

interface AgentBudgetConfig {
  agentId: string;
  workflowId: string;
  maxTokens: number;
  warnThreshold: number;      // 0.8 = warn at 80% budget
  downgradeThreshold: number; // 0.9 = downgrade model at 90%
  primaryModel: string;
  fallbackModel: string;      // one tier down — never two
}

interface ExecutionResult {
  result: string;
  tokensUsed: number;
  modelUsed: string;
  partial: boolean;
}

async function budgetEnforcedExec(
  config: AgentBudgetConfig,
  prompt: string,
  tokensUsedSoFar: number
): Promise<ExecutionResult> {
  const budgetRatio = tokensUsedSoFar / config.maxTokens;

  // Budget exhausted — return partial result, never throw
  if (budgetRatio >= 1.0) {
    return {
      result: `[Budget exhausted at ${tokensUsedSoFar.toLocaleString()} tokens. Returning partial result.]`,
      tokensUsed: 0,
      modelUsed: "none",
      partial: true,
    };
  }

  // Downgrade one tier under budget pressure — never skip tiers mid-reasoning
  const model =
    budgetRatio >= config.downgradeThreshold
      ? config.fallbackModel
      : config.primaryModel;

  if (budgetRatio >= config.warnThreshold) {
    console.warn(
      `[${config.agentId}] Budget at ${Math.round(budgetRatio * 100)}% — routing to ${model}`
    );
  }

  const response = await anthropic.messages.create({
    model,
    max_tokens: Math.min(4096, config.maxTokens - tokensUsedSoFar),
    messages: [{ role: "user", content: prompt }],
  });

  const tokensUsed =
    response.usage.input_tokens + response.usage.output_tokens;

  // Structured cost event — emit to observability platform
  console.log(
    JSON.stringify({
      event: "inference_cost",
      agentId: config.agentId,
      workflowId: config.workflowId,
      model,
      tokensUsed,
      cumulativeTokens: tokensUsedSoFar + tokensUsed,
      budgetRatio: (tokensUsedSoFar + tokensUsed) / config.maxTokens,
    })
  );

  return {
    result:
      response.content[0].type === "text" ? response.content[0].text : "",
    tokensUsed,
    modelUsed: model,
    partial: false,
  };
}
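The listing above takes `tokensUsedSoFar` as a parameter, so the caller has to thread the running total through each step. One way to do that is sketched below. To keep it runnable without network access, it uses a hypothetical stand-in (`fakeStep`) instead of a real model call; `runAgentLoop` and `BudgetedStep` are illustrative names.

```typescript
// Shape mirrors the fields of ExecutionResult that the loop needs.
interface StepResult {
  result: string;
  tokensUsed: number;
  partial: boolean;
}

// Any budget-aware executor: (prompt, tokensUsedSoFar) -> result.
type BudgetedStep = (prompt: string, tokensUsedSoFar: number) => Promise<StepResult>;

interface LoopResult {
  outputs: string[];
  totalTokens: number;
  partial: boolean;
}

async function runAgentLoop(prompts: string[], step: BudgetedStep): Promise<LoopResult> {
  const outputs: string[] = [];
  let totalTokens = 0;
  for (const prompt of prompts) {
    const r = await step(prompt, totalTokens);
    totalTokens += r.tokensUsed;
    if (r.partial) {
      // Budget exhausted: stop and return completed work instead of retrying.
      return { outputs, totalTokens, partial: true };
    }
    outputs.push(r.result);
  }
  return { outputs, totalTokens, partial: false };
}

// Hypothetical stand-in for a model call: 400 tokens per step, 1,000-token budget.
const fakeStep: BudgetedStep = async (prompt, used) =>
  used >= 1000
    ? { result: "", tokensUsed: 0, partial: true }
    : { result: `done: ${prompt}`, tokensUsed: 400, partial: false };
```

In a real pipeline the step argument would wrap the function above, for example `(p, used) => budgetEnforcedExec(config, p, used)`. The loop returns the outputs completed so far whenever a step reports `partial`, which is what keeps a budget stop from turning into a retry storm.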

Model routing without capability degradation

Task-class routing is the highest-impact cost control — and the one most teams implement wrong

Model routing is the highest-impact inference cost control available. Teams either route everything to the frontier model (expensive by default) or try to route everything to the cheapest model (breaks tasks, causes retries that compound costs). Both miss the point.

The routing decision lives at the task level, not the session level. A single agent session might use Opus for the planning step, Sonnet for document extraction and analysis, and Haiku for formatting the final output. This isn't a quality compromise — it's matching cognitive load to model capability.

Production data from routing implementations supports this: routing 90% of requests to Sonnet 4.6 while reserving Opus 4.6 for high-complexity tasks typically yields 40–60% cost reduction with less than 5% measurable quality degradation^[6]. The word measurable matters — there's a common assumption that cheaper models perform dramatically worse on real workloads, but benchmark data shows the gap is narrow for the task types that dominate most production traffic.
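As a sanity check on list prices alone, the effect of routing can be computed from the per-token rates quoted in this article. The sketch below assumes identical token counts on both models and no retries, which real workloads will not match exactly; `routedSavings` and the 10,000/1,000 token mix are illustrative.

```typescript
interface Price {
  inputUsdPerMTok: number;
  outputUsdPerMTok: number;
}

// List prices as quoted in this article.
const SONNET: Price = { inputUsdPerMTok: 3, outputUsdPerMTok: 15 };
const OPUS: Price = { inputUsdPerMTok: 5, outputUsdPerMTok: 25 };

function requestCostUsd(inputTokens: number, outputTokens: number, price: Price): number {
  return (
    (inputTokens * price.inputUsdPerMTok + outputTokens * price.outputUsdPerMTok) / 1000000
  );
}

// Fraction of spend saved when `shareToCheaper` of requests move off the baseline model.
function routedSavings(
  shareToCheaper: number,
  inputTokens: number,
  outputTokens: number,
  cheaper: Price,
  baseline: Price
): number {
  const baselineCost = requestCostUsd(inputTokens, outputTokens, baseline);
  const cheaperCost = requestCostUsd(inputTokens, outputTokens, cheaper);
  const blended = shareToCheaper * cheaperCost + (1 - shareToCheaper) * baselineCost;
  return 1 - blended / baselineCost;
}

console.log(routedSavings(0.9, 10000, 1000, SONNET, OPUS));
```

At these prices, moving 90% of traffic from Opus to Sonnet saves about 36% overall, since each routed request is 40% cheaper. Savings above that level would have to come from something other than list price, such as caching or sending some traffic to Haiku; that attribution is an inference here, not something the cited source states.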

The routing decision needs its own benchmark. Public benchmark scores (SWE-bench, GPQA) are proxies. Your task distribution is the ground truth. Before committing to routing rules, run your specific task mix through both models and measure quality on your own evaluation criteria. Teams that get routing wrong almost always set rules based on intuition rather than their own task benchmarks.

Task Type	Default Model	Cost per MTok, USD (input / output)	Escalate to Opus When (or why Opus is the default)
Text classification, labeling	Haiku 4.5	$1 / $5	Ambiguous categories exceed 30% of samples
Structured data extraction, formatting	Haiku 4.5	$1 / $5	Fields require domain inference, not pattern matching
Content generation, summarization	Sonnet 4.6	$3 / $15	Specialized domain, regulatory, or high-stakes output
Code generation and review	Sonnet 4.6	$3 / $15	Security-critical paths, large codebase refactors
Multi-step planning, task decomposition	Opus 4.6	$5 / $25	Always appropriate — this is Opus's core advantage
Compliance review, security audits	Opus 4.6	$5 / $25	Always appropriate — error cost exceeds model premium
Long-context synthesis (100K+ tokens)	Opus 4.6	$5 / $25	Default — smaller models lose coherence at scale
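The table above maps naturally onto a lookup keyed by task type. The sketch below uses tier names rather than concrete model identifiers, and the `TaskType` labels and the `escalate` flag are hypothetical; deciding when to escalate (the conditions in the last column) is left to the caller's own classifier or benchmark.

```typescript
type ModelTier = "haiku" | "sonnet" | "opus";

type TaskType =
  | "classification"
  | "extraction"
  | "content_generation"
  | "code"
  | "planning"
  | "compliance"
  | "long_context_synthesis";

// Default tier per task class, following the routing table.
const DEFAULT_TIER: Record<TaskType, ModelTier> = {
  classification: "haiku",
  extraction: "haiku",
  content_generation: "sonnet",
  code: "sonnet",
  planning: "opus",
  compliance: "opus",
  long_context_synthesis: "opus",
};

// `escalate` is set by the caller when the task meets its escalation condition.
function routeModel(task: TaskType, escalate: boolean = false): ModelTier {
  return escalate ? "opus" : DEFAULT_TIER[task];
}
```

Because routing is a per-task decision, a single session can call `routeModel` several times and land on different tiers for planning, extraction, and formatting.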

Context caching as a cost multiplier

Often the fastest win in production agent systems — and the one teams implement last

Model routing handles the model selection decision. Context caching handles the repeated-context problem — and it's often the easier win to implement first.

Most agent systems send the same system prompt and shared knowledge base on every request. Without caching, those tokens are billed at full input rate every call. Anthropic's prompt caching prices cache reads at 10% of standard input cost^[5]. For a 50,000-token system prompt called 1,000 times per day on Sonnet 4.6 ($3 per million input tokens), that's the difference between paying $150 per day and $15 per day for the system prompt portion alone, before cache write costs.

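The implementation is lightweight: mark stable context with cache_control in the API request. By default a cache entry lives for 5 minutes, and the timer refreshes each time the entry is read; a 1-hour TTL is available at a higher write price. For high-frequency agents, the cache write cost (1.25x standard input price for the 5-minute TTL) is recovered after a single subsequent cache read^[5]: two uncached calls cost 2x the prompt's input price, while one write plus one read costs 1.35x.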

Caching and routing compose. A cached Haiku call costs approximately 50x less on the system prompt portion than an uncached Opus call, because the cache read costs 10% of Haiku's already-low input rate ($0.10 per million tokens) versus full Opus input pricing ($5). For a 50,000-token system prompt at 30,000 requests per month, that difference on system prompt alone can reach $7,350^[7].
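The caching figures in this section follow from two multipliers given in the text: cache reads at 10% of input price and 5-minute cache writes at 1.25x. The sketch below reproduces them under a steady-state assumption (every call after the first is a cache hit); the function names are illustrative.

```typescript
const CACHE_READ_MULTIPLIER = 0.1;   // cache reads at 10% of input price
const CACHE_WRITE_MULTIPLIER = 1.25; // cache write price for the 5-minute TTL

// Steady-state cost of the system prompt portion only, ignoring the initial write.
function promptCostUsd(
  promptTokens: number,
  calls: number,
  inputUsdPerMTok: number,
  cached: boolean
): number {
  const multiplier = cached ? CACHE_READ_MULTIPLIER : 1;
  return (promptTokens * calls * inputUsdPerMTok * multiplier) / 1000000;
}

// Cost of `calls` consecutive requests, in units of one uncached prompt:
// one cache write followed by cache reads.
function cachedCostUnits(calls: number): number {
  if (calls <= 0) return 0;
  return CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * (calls - 1);
}

const sonnetDailyUncached = promptCostUsd(50000, 1000, 3, false);
const sonnetDailyCached = promptCostUsd(50000, 1000, 3, true);
const opusMonthlyUncached = promptCostUsd(50000, 30000, 5, false);
const haikuMonthlyCached = promptCostUsd(50000, 30000, 1, true);
console.log({ sonnetDailyUncached, sonnetDailyCached, gap: opusMonthlyUncached - haikuMonthlyCached });
```

`cachedCostUnits(2)` comes to 1.35 against 2 for two uncached calls, so the write premium is paid back by the first hit. A single call with no follow-up inside the TTL costs more with caching than without, which is why hit rate is worth monitoring.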

cached-agent-call.ts

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Large, stable context: system prompt, regulatory docs, shared instructions
const STABLE_SYSTEM_PROMPT = `You are a compliance review agent...
[50,000 tokens of regulatory context and examples]`;

async function cachedAgentCall(userMessage: string): Promise<string> {
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 2048,
    system: [
      {
        type: "text",
        text: STABLE_SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }, // default TTL is 5 minutes, refreshed on each cache hit
      },
    ],
    messages: [{ role: "user", content: userMessage }],
  });

  // Monitor cache efficiency (target >70% hit rate for stable prompts).
  // The cache fields may be null or absent, so default them to 0.
  const { input_tokens } = response.usage;
  const cacheReadTokens = response.usage.cache_read_input_tokens ?? 0;
  const cacheWriteTokens = response.usage.cache_creation_input_tokens ?? 0;

  const totalInput = input_tokens + cacheReadTokens + cacheWriteTokens;
  // Share of input tokens served from cache; guard against division by zero.
  const cacheHitRate = totalInput > 0 ? cacheReadTokens / totalInput : 0;

  console.log(
    JSON.stringify({
      event: "cache_efficiency",
      cacheHitRate: Math.round(cacheHitRate * 100), // percent
      cachedTokens: cacheReadTokens,
      uncachedTokens: input_tokens,
    })
  );

  // A cache hit rate below 70% on a stable prompt suggests the cache TTL
  // is expiring between requests — increase request frequency or use 1-hour writes.

  return response.content[0].type === "text" ? response.content[0].text : "";
}

The governance operating rhythm

Technical controls without organizational cadence decay — here's the rhythm that keeps them sharp

1. Pre-launch: Define the cost model with Finance
Before any agent ships, engineering produces token estimates by task class at P50 and P95. Finance approves a monthly budget envelope per product area and sets alert thresholds — typically at 50%, 80%, and 100% of the envelope — with a defined response for each threshold: monitor, review, halt. The output is a mutual cost contract, not a dashboard to check later. This document also defines what happens when each threshold fires, so the response isn't improvised during an incident.
2. At launch: Enable per-agent cost attribution
Every inference request must emit a structured cost event: agent ID, workflow ID, product area, model used, input tokens, output tokens, and cache hit status. Wire these to your observability platform — OpenTelemetry span attributes work well. Without attribution, governance is blind. You can see total monthly spend but cannot identify which agent or workflow drove it, which makes optimization guesswork.
3. Weekly: Review and triage anomalies
Finance and platform engineering spend 30 minutes reviewing the cost report together. Flag any workflow spending above P95 for its task class. Assign an investigation owner with a 48-hour resolution window. Most anomalies have simple explanations — context size creep from a new document type, a high-volume customer, a tool schema change that bloated prompt size. Finding these weekly keeps incidents at sprint scope rather than quarterly scope.
4. Monthly: Rebalance model routing
Model pricing changes. New tiers emerge. What required Opus six months ago may be Sonnet territory now — capability gaps narrow, fine-tuning options emerge, and benchmark scores shift. Every month, re-run the routing decision matrix against current pricing and current benchmark data for your specific task mix. Routing rules set six months ago and never revisited are almost certainly leaving money on the table.
5. Quarterly: Renegotiate the budget envelope
As usage patterns mature, P50/P95 estimates improve. Refine the budget commitments to match. If spend consistently runs 30% below budget, propose reallocating that margin to new agent capabilities. If spend consistently exceeds P95, audit model routing decisions before requesting a budget increase — the routing rules may simply need updating, which costs nothing. This cadence also produces the usage history needed to negotiate committed-spend discounts with providers.
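The weekly triage step in the rhythm above can be partly automated once cost events carry a workflow ID. The sketch below aggregates events per workflow and flags any that exceed their P95 token forecast. The `CostEvent` shape is a reduced, hypothetical version of the attribution fields described earlier, and treating a workflow with no forecast as an anomaly is a design assumption, not something the article prescribes.

```typescript
interface CostEvent {
  agentId: string;
  workflowId: string;
  inputTokens: number;
  outputTokens: number;
}

interface Anomaly {
  workflowId: string;
  tokens: number;
  ratioToP95: number; // Infinity when the workflow has no forecast
}

function flagAnomalies(
  events: CostEvent[],
  weeklyP95Tokens: Record<string, number>
): Anomaly[] {
  // Sum tokens per workflow for the review window.
  const totals: Record<string, number> = {};
  for (const e of events) {
    totals[e.workflowId] = (totals[e.workflowId] ?? 0) + e.inputTokens + e.outputTokens;
  }

  const anomalies: Anomaly[] = [];
  for (const workflowId of Object.keys(totals)) {
    const tokens = totals[workflowId];
    const p95 = weeklyP95Tokens[workflowId] ?? 0;
    // A workflow without a forecast is itself a finding, so it is flagged.
    const ratioToP95 = p95 > 0 ? tokens / p95 : Infinity;
    if (ratioToP95 > 1) anomalies.push({ workflowId, tokens, ratioToP95 });
  }

  // Worst offenders first; equal ratios (including two Infinities) keep their order.
  return anomalies.sort((a, b) =>
    a.ratioToP95 === b.ratioToP95 ? 0 : b.ratioToP95 - a.ratioToP95
  );
}
```

The output is a short list to bring to the 30-minute review, with each entry ready to be assigned an investigation owner.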

Common traps in inference budget implementation

Most of these only reveal themselves after they've already cost money

Technical failure modes

Hard stops instead of graceful degradation — throwing exceptions when budgets are exceeded causes user-facing errors and retry storms that compound the exact costs you were trying to control
Jumping two model tiers mid-task (Opus → Haiku) — breaks multi-step reasoning continuity and triggers retries that cost more than staying on Opus would have
Org-level spend caps without per-agent attribution — you know you're over budget but cannot identify which agent, workflow, or customer caused it
Treating context window as a default rather than a ceiling — filling 200K windows on every step when 5,000 tokens of structured summary would suffice
Treating provider spend caps as primary controls — by the time a provider cap fires, requests are failing for users; SDK enforcement is the operational layer

Organizational failure modes

Finance reviewing cost monthly instead of weekly — a runaway agent running 21 days before detection costs 4x more than one caught in five days
Cost model built from demo benchmarks, not production traces — underestimates by 3–10x because demos don't have the context accumulation of real iterative agent tasks
No pre-launch cost approval gate — engineering ships agents without Finance alignment on budget expectations, turning every bill surprise into a trust problem
Token budget governance owned entirely by the platform team — Finance has no visibility until the invoice, leaving no opportunity for proactive reallocation or renegotiation

Questions practitioners actually ask

The governance questions without clean answers in the documentation

What's a reasonable per-agent monthly token budget for a production research agent?

There's no generalizable number — it varies too much by request volume, context size, and task complexity. The useful approach: instrument one week of staging traces, compute P50 and P95 token spend per task class, then multiply by expected production request volume. Use the P95 number as your budget ceiling for Finance approval, and set your alert threshold at 80% of that. For a rough sanity check, a research agent making 100 requests per day with 50,000-token contexts on Sonnet 4.6 runs roughly $2,000–$3,000 per month before caching — but your actual number could be 5x higher or lower depending on iteration depth and context accumulation.
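The sanity-check range above can be reproduced with a back-of-envelope function. This is a rough sketch with loud assumptions: it counts input tokens only, assumes every reasoning step sends the same context size (no accumulation), and takes `stepsPerRequest` as a hypothetical parameter, since the article does not state an iteration depth.

```typescript
// Monthly input-token cost in USD.
// Assumptions: input tokens only, constant context per step, 30-day month by default.
function monthlyInputCostUsd(
  requestsPerDay: number,
  contextTokens: number,
  stepsPerRequest: number,
  inputUsdPerMTok: number,
  daysPerMonth: number = 30
): number {
  const tokensPerMonth = requestsPerDay * daysPerMonth * stepsPerRequest * contextTokens;
  return (tokensPerMonth * inputUsdPerMTok) / 1000000;
}

const singleStep = monthlyInputCostUsd(100, 50000, 1, 3);
const fiveSteps = monthlyInputCostUsd(100, 50000, 5, 3);
console.log({ singleStep, fiveSteps });
```

With one model call per request, 100 requests per day at 50,000 tokens on Sonnet pricing comes to about $450 per month. Five calls per request gives $2,250, which lands inside the quoted range and shows that iteration depth, not request volume, drives most of the estimate.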

Should budget caps be enforced at the provider level or the SDK level?

Both — but for different reasons. Provider caps are your last-resort backstop against catastrophic incidents; set them high enough to not interfere with normal operations. SDK-level caps are your operational controls: they enforce graceful degradation, give you per-agent visibility, and let you route dynamically based on budget state. Relying only on provider caps means your first signal of a runaway agent is requests failing wholesale for users — worse for experience and harder to diagnose than a graceful partial-result response.

How much reasoning quality do we actually lose when routing to cheaper models?

On most standard production tasks — extraction, summarization, formatting, routine code generation — the measurable quality difference between Sonnet 4.6 and Opus 4.6 is under 5%. The gap widens on tasks requiring sustained multi-step reasoning, novel domain synthesis, or high-stakes judgment with sparse signal. Before committing to routing rules, run your specific task mix through both models and measure quality on your own criteria. Public benchmark scores on academic datasets are a proxy; your task distribution is the ground truth. Teams that get routing wrong almost always set rules based on intuition rather than their own benchmarks.

What's the fastest way to cut inference spend if we're already over budget?

In order of implementation speed: (1) Enable prompt caching on stable system prompts — takes hours, saves 40–80% on input tokens for high-repetition context. (2) Add context compression between agent steps — pass structured summaries of prior results instead of full conversation transcripts. (3) Audit model assignments — identify any task classes currently running on Opus that your own benchmark data shows Sonnet handles adequately. These three steps together typically reduce spend by 50–70% without any product-visible changes.

When should the agent warn vs. downgrade vs. return a partial result?

Warn at 80% of budget — log the event, optionally surface to an operator, do not change agent behavior yet. Downgrade model at 90% — switch to the next-cheaper tier (one tier only), compress context by summarizing earlier work rather than retransmitting it. Return a partial result at 100% — complete the current reasoning step and halt gracefully with whatever has been computed. Never throw an unhandled exception as the budget enforcement response. Silent failures and unhandled errors erode user trust far faster than an honest partial result.

Pre-launch inference budget governance checklist

Token estimates completed for each agent by task class — P50 and P95 from staging traces
Monthly budget envelope reviewed and approved by Finance before launch
Alert thresholds defined at 50%, 80%, and 100% of monthly envelope with documented responses
Per-agent cost attribution wired to observability platform — queryable by agent ID and workflow ID
Model routing rules defined and benchmarked against your task mix, not just public benchmarks
SDK-level budget enforcement implemented with three-tier response: warn → downgrade → partial result
Prompt caching enabled for stable system prompts — target over 70% cache hit rate
Context compression between agent steps — structured summaries, not full conversation transcripts
Provider spend cap set as backstop only, not as primary enforcement control
Weekly Finance-Engineering cost review cadence established before first production traffic

Key terms in this piece

inference budget governance · LLM agent cost control · per-agent token budget · model routing cost · AI inference cost forecasting · agent cost management

Sources

[1] Zylos Research, "AI Agent Cost Optimization: Token Economics and FinOps in Production" (zylos.ai)
[2] Digital Applied, "LLM API Pricing Index: AI Agent Deployment Costs Guide" (digitalapplied.com)
[3] DEV Community, "How an AI Agent Ran Up a $47,000 Bill in 11 Days (And How to Stop It)" (dev.to)
[4] Matthew Diakonov, "Stop Burning Money on API Fees" (fazm.ai)
[5] Anthropic, "Claude API Pricing", Anthropic Documentation (docs.anthropic.com)
[6] NxCode, "Sonnet vs Opus: Which Claude Model to Pick (2026)" (nxcode.io)
[7] Claude Skills Guide, "Claude Haiku vs Sonnet vs Opus Cost Breakdown 2026" (claudecodeguides.com)
