AI Native Builders

The Agent Cost Circuit Breaker: Stop the $15K Spike Before It Hits Your Invoice

How to apply the circuit breaker pattern from distributed systems to agent cost control — including pre-production cost envelope estimation, trip threshold calibration, and graceful degradation modes.

AI Engineering Platform · advanced · Mar 31, 2026 · 6 min read
The circuit breaker pattern from distributed systems, applied to the most expensive new failure mode in production engineering.

A Friday afternoon deploy. New research agent, caching enabled, rate limiting in place. By Monday morning, Slack fires: your API bill hit $12,000 over the weekend. The agent entered a retry loop at 6 PM Friday, re-querying the same documents with slightly different prompts. Each iteration appended the full tool response to a growing context window. By Saturday midnight it had made roughly 8,400 API calls, according to the ZenML production survey that documented this scenario[1]. By Sunday, the cost per call was approximately 40 times what it was at the start. No alert fired. No circuit tripped. The system worked exactly as designed.

This is not a monitoring problem. Monitoring shows you the fire after it starts. The agent cost circuit breaker is the pattern that stops ignition — an architectural control sitting in your orchestration layer that measures the cost envelope of each workflow session and trips before the damage compounds.

The concept comes directly from distributed systems. Hystrix popularized it for service availability: when a downstream service starts failing, you stop sending traffic rather than hammering a broken endpoint. The adaptation for agent cost is exact: when a workflow session's spend exceeds its expected envelope, you stop executing rather than continuing to burn tokens on what is almost certainly a runaway path.

This is not a dashboard article. It's an architecture guide for teams who already have agents in production and have either had the billing surprise or are afraid of the first one.

  • Up to $47K — maximum documented spike from a single agent retry loop in a 4-week period, per ZenML's 2025 production survey. Most spikes are smaller, but the tail risk is real.

  • ~8,400 — API calls generated in one reported Friday-night retry loop before any alert fired; your environment's numbers will differ based on tool call patterns.

  • ~50× — typical cost multiplier between the 1st and 100th call in a compounding session, due to context window growth. The exact multiplier depends on response sizes.

  • 24–48h — typical lag for cloud billing anomaly alerts, often long after significant damage has accumulated.

Why Agent Cost Defies Normal Budget Controls

The mechanics of how $0.03 calls become $180 sessions — and why standard cloud alerts always arrive too late.

Cloud cost alerts work because cloud resources scale roughly linearly: ten EC2 instances cost about 10× one instance. Agents don't behave that way. Each tool call appends its result to the conversation context, so the cost of the 10th call in a session is not 10× the first — it can be 20×, because the context window feeding that call is 20× longer[6]. A multi-agent workflow that normally costs $0.45 can hit $87 when one sub-agent enters a tool loop and each retry appends a 2,000-token response to the shared context.
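The compounding is easy to see with a few lines of arithmetic. This is an illustrative sketch, not real pricing: the per-token prices, context sizes, and response sizes below are all assumed values chosen to show the shape of the curve.

```python
INPUT_PRICE = 3.00 / 1_000_000    # $/input token (assumed pricing)
OUTPUT_PRICE = 15.00 / 1_000_000  # $/output token (assumed pricing)

def session_cost(calls: int, base_context: int = 1_000,
                 tool_response: int = 1_500, output: int = 500) -> list[float]:
    """Per-call cost when every tool response is appended to the context."""
    costs, context = [], base_context
    for _ in range(calls):
        costs.append(context * INPUT_PRICE + output * OUTPUT_PRICE)
        context += tool_response + output  # the response stays in context
    return costs

costs = session_cost(100)
print(f"call 1: ${costs[0]:.4f}  call 100: ${costs[-1]:.4f}  "
      f"multiplier: {costs[-1] / costs[0]:.0f}x")
```

With these assumed numbers the 100th call costs tens of times more than the first — the context grows linearly per call, so session cost grows quadratically with call count.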

The failure modes that trigger runaway cost all share a common structure: the agent keeps executing because the termination condition is never cleanly met.

Infinite tool loops: the agent calls the same tool repeatedly because each response is slightly different and its internal stopping condition never triggers. Each call adds tokens to context. Cost per call climbs with every iteration.

Hallucinated tool chains: the agent generates a multi-step plan requiring dozens of API calls — steps that are individually cheap but collectively catastrophic. Each step anchors the next, so the agent has no natural exit until the chain completes or an error breaks it.

Runaway research tasks: a research agent finds one more source, then another, with no internal concept of diminishing returns. Without a hard turn limit, it optimizes for completeness over cost.

Standard AWS Cost Anomaly Detection and similar tools have a 24–48 hour aggregation lag. By the time the alert fires, the loop has already run for 36 hours. The circuit breaker has to operate in-process, in real time, at the level of each session.

How Agent Costs Compound in a Retry Loop
Each iteration appends tool responses to context. The cost-per-call grows with every cycle. Without a session-level trip threshold, the loop continues until the termination condition is finally met — or until Monday morning.

The Agent Cost Circuit Breaker Pattern

Mapping the three-state Hystrix model — CLOSED, OPEN, HALF-OPEN — to session-level cost thresholds.

The circuit breaker pattern has three states, each with a specific behavior:

CLOSED — Normal operation. Traffic flows through. The breaker silently tracks the session's cumulative cost. Below the threshold, nothing changes.

OPEN — The threshold has been crossed. All further agent calls in this session are immediately blocked and routed to a fallback. The session doesn't retry or queue — it degrades. An alert fires to oncall.

HALF-OPEN — After a configurable cooldown (typically 5 minutes), the breaker allows a small probe request through with a tight sub-budget. If the probe completes within the probe budget, the breaker resets to CLOSED. If the probe itself spirals, the breaker immediately returns to OPEN.

The mapping from distributed systems is direct: instead of HTTP 500 error rate as the failure signal, you use cost-per-session exceeding a defined envelope. The trip threshold replaces the failure rate threshold. Degraded response replaces a static fallback page.

One distinction matters here: the circuit state must be per-session, not global. One runaway session should not block other users' concurrent requests. Each workflow session gets its own breaker instance with its own cost accumulator. If your orchestration layer processes 200 concurrent sessions, you maintain 200 independent breakers — each tracking only its own session's spend.
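The per-session requirement can be sketched as a small registry keyed by session ID. `SessionBreaker` here is a minimal stand-in for the full breaker implementation later in this article; the registry shape and `evict` hook are assumptions about your orchestration layer, not a prescribed API.

```python
from dataclasses import dataclass, field

@dataclass
class SessionBreaker:
    """Stand-in for the full CostCircuitBreaker shown later in this article."""
    trip_threshold_usd: float
    session_cost: float = 0.0

@dataclass
class BreakerRegistry:
    """One breaker instance per session ID — never shared across sessions."""
    trip_threshold_usd: float
    _breakers: dict[str, SessionBreaker] = field(default_factory=dict)

    def for_session(self, session_id: str) -> SessionBreaker:
        if session_id not in self._breakers:
            self._breakers[session_id] = SessionBreaker(self.trip_threshold_usd)
        return self._breakers[session_id]

    def evict(self, session_id: str) -> None:
        """Drop the breaker when the session ends so memory stays bounded."""
        self._breakers.pop(session_id, None)

registry = BreakerRegistry(trip_threshold_usd=5.40)
assert registry.for_session("sess-a") is registry.for_session("sess-a")
assert registry.for_session("sess-a") is not registry.for_session("sess-b")
```

Eviction matters: at 200 concurrent sessions the memory cost is trivial, but without cleanup the registry grows without bound over days of traffic.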

Without a cost circuit breaker
  • Budget alert fires 24–48h after the spike starts

  • By alert time, session has already spent $12,000+

  • Engineer wakes up Monday to a billing surprise

  • No record of which workflow or input caused it

  • Full refund request to AWS or Anthropic — sometimes works

With a cost circuit breaker
  • Breaker trips at 3× P95 cost ($2.40 for a $0.80 P95 workflow)

  • Session degrades gracefully; user gets a partial result

  • Oncall gets a PagerDuty notification within seconds

  • OTel span captures workflow ID, session cost, and trip reason

  • Total incident spend: $2.43 instead of $12,000+

  • Infinite loops — agents re-calling the same tool repeatedly because the termination condition is never cleanly met

  • Hallucinated chains — multi-step plans requiring dozens of API calls that individually look cheap but collectively spike cost

  • Runaway research — research agents finding one more source, then another, with no internal concept of diminishing returns

  • Context compounding — each tool call appending results to the context window, making every subsequent call progressively more expensive

Estimating Cost Envelopes Before You Deploy

The hard part: calculating what 'normal' looks like so you can define the trip threshold with confidence.

You can't calibrate a circuit breaker without knowing your cost distribution. The wrong threshold is as bad as no threshold: too tight and you're tripping on legitimate traffic, too loose and the breaker never fires until you're already deep in the damage.

The estimation process has three steps:

Step 1 — Profile representative inputs. Run your agent against 200–500 production-like inputs before deploying. Measure token counts at each step: input tokens, output tokens, tool call tokens, and the tokens appended to context from each tool response. Log the full cost trace per session.

Step 2 — Calculate P50, P95, and P99 session costs. Session cost = Σ(tokens_in_step × input_price + tokens_out_step × output_price) across all steps in that session. Plot the distribution. P50 is your happy path; P99 marks your outliers. Anything above P99 is a runaway candidate.

Step 3 — Set the trip threshold at 3× P95 as a starting point. Not 10× (too loose to catch spirals early), not 1.5× (you'll trip on legitimate variance). 3× P95 gives enough headroom for real-world variance while catching genuine runaways before they compound — calibrate based on your workload[2]. Practitioners report that Cox Automotive uses a P95 cost threshold as the trip point for its customer service agents: when a conversation exceeds it, the agent hands off to a human rather than continuing.
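Steps 2 and 3 need nothing beyond the standard library. A sketch, with an assumed sample of profiled session costs standing in for your real 200–500-input profiling data:

```python
import statistics

# Assumed sample — replace with the per-session costs from your profiling pass
session_costs = [0.31, 0.42, 0.45, 0.48, 0.52, 0.61, 0.74, 0.88, 1.35, 1.80]

# statistics.quantiles with n=100 yields the 1st..99th percentile cut points
pcts = statistics.quantiles(session_costs, n=100, method="inclusive")
p50, p95, p99 = pcts[49], pcts[94], pcts[98]

trip_threshold = 3 * p95  # starting point; calibrate per workload
print(f"P50={p50:.2f}  P95={p95:.2f}  P99={p99:.2f}  trip={trip_threshold:.2f}")
```

With real data you would compute this per workflow type, since a research agent and a simple Q&A agent have very different distributions.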

A separate threshold for conversation turns is worth adding alongside the cost threshold. Long turn counts often precede cost spikes — a workflow that normally takes 8 turns but has hit 35 is almost certainly in a loop, even if the cost hasn't crossed the threshold yet.

Workflow              | P50 Cost | P95 Cost | Trip Threshold (3× P95) | Degradation Tier
Simple Q&A agent      | $0.03    | $0.09    | $0.27                   | Return cached answer
Research agent        | $0.45    | $1.80    | $5.40                   | Cap tool calls at 3
Code review agent     | $0.22    | $0.85    | $2.55                   | Skip style checks
Data extraction agent | $0.60    | $2.40    | $7.20                   | Return partial results
Multi-agent pipeline  | $1.20    | $4.80    | $14.40                  | Skip non-critical subagents

Building the Agent Cost Circuit Breaker at the Orchestration Layer

Where to place the breaker, what it tracks, and the implementation that handles per-session isolation.

The breaker belongs at the orchestration layer — not on individual model calls. Placing it on chat.completions.create() gives you per-call visibility but misses the emergent session cost. The breaker needs to see the running total.

The minimal implementation tracks three things per session: cumulative cost, current state (CLOSED/OPEN/HALF-OPEN), and the timestamp of when the breaker last opened (for cooldown tracking).

cost_circuit_breaker.py
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CostCircuitBreaker:
    """Per-session cost circuit breaker for LLM agent workflows."""
    trip_threshold_usd: float
    cooldown_seconds: int = 300
    probe_budget_usd: float = 0.0  # defaults to 10% of trip threshold

    _state: BreakerState = field(default=BreakerState.CLOSED, init=False)
    _session_cost: float = field(default=0.0, init=False)
    _opened_at: float = field(default=0.0, init=False)
    _on_trip: Callable | None = field(default=None, init=False)

    def __post_init__(self) -> None:
        if self.probe_budget_usd == 0.0:
            self.probe_budget_usd = self.trip_threshold_usd * 0.10

    def on_trip(self, callback: Callable) -> None:
        """Register a callback to fire when the breaker opens (e.g. PagerDuty alert)."""
        self._on_trip = callback

    def record_call(self, cost_usd: float) -> None:
        """Call after each LLM response with its cost."""
        self._session_cost += cost_usd
        if (
            self._state == BreakerState.CLOSED
            and self._session_cost >= self.trip_threshold_usd
        ):
            self._open()
        elif (
            self._state == BreakerState.HALF_OPEN
            and self._session_cost >= self.probe_budget_usd
        ):
            # Failed probe: reopen immediately so the cooldown restarts
            self._open()

    def should_allow(self) -> bool:
        """Call before each LLM invocation. False means route to fallback."""
        if self._state == BreakerState.CLOSED:
            return True
        if self._state == BreakerState.OPEN:
            elapsed = time.monotonic() - self._opened_at
            if elapsed >= self.cooldown_seconds:
                self._state = BreakerState.HALF_OPEN
                self._session_cost = 0.0
                return True
            return False
        # HALF_OPEN: allow only while under probe budget
        return self._session_cost < self.probe_budget_usd

    def reset(self) -> None:
        """Call after a successful HALF_OPEN probe to return to CLOSED."""
        self._state = BreakerState.CLOSED
        self._session_cost = 0.0

    @property
    def state(self) -> str:
        return self._state.value

    @property
    def session_cost(self) -> float:
        return self._session_cost

    def _open(self) -> None:
        self._state = BreakerState.OPEN
        self._opened_at = time.monotonic()
        if self._on_trip:
            self._on_trip({
                "session_cost": self._session_cost,
                "threshold": self.trip_threshold_usd,
            })


# Usage: one breaker instance per session, not shared across sessions
breaker = CostCircuitBreaker(trip_threshold_usd=5.40)
breaker.on_trip(lambda ctx: notify_pagerduty(ctx))

def run_agent_step(prompt: str) -> dict:
    if not breaker.should_allow():
        return degraded_response(breaker.state, breaker.session_cost)

    response = llm_client.chat(prompt)
    breaker.record_call(response.usage.cost_usd)
    return response
Circuit Breaker Architecture — Request Flow
Each workflow session maintains its own breaker instance. OPEN state routes to degradation; HALF-OPEN allows a probe request with a tight sub-budget before resetting.

Degradation Modes: What Happens When the Breaker Trips

The response when the circuit opens determines whether users see an error, a partial result, or a transparent handoff. Define these before you deploy.

  1. Tier 1 — Constrained execution: reduce the tool call limit to 3, skip non-essential subagents, and return partial results with a note. The agent still runs, just with reduced scope.

  2. Tier 2 — Cached fallback: return the most recent successful response for this workflow type, if one exists in your response cache.

  3. Tier 3 — Explicit handoff: return a structured response with context about what was completed and route to a human queue. Required for customer-facing agents where silent failure is the worst outcome.
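The tier-per-workflow mapping can live in plain configuration. A sketch — the workflow names, field shapes, and dispatcher are illustrative assumptions, not a fixed schema:

```python
from enum import Enum

class DegradationTier(Enum):
    CONSTRAINED = 1  # reduced tool budget, partial results
    CACHED = 2       # last good response for this workflow type
    HANDOFF = 3      # structured handoff to a human queue

# Configured before deploy, per workflow type (names assumed)
TIER_BY_WORKFLOW = {
    "research": DegradationTier.CONSTRAINED,
    "qa": DegradationTier.CACHED,
    "support": DegradationTier.HANDOFF,
}

def degraded_response(workflow: str, session_cost: float) -> dict:
    """Build the response a tripped session returns instead of erroring."""
    tier = TIER_BY_WORKFLOW.get(workflow, DegradationTier.HANDOFF)
    return {
        "degraded": True,
        "tier": tier.name,
        "session_cost_usd": round(session_cost, 2),
        "note": "Circuit breaker tripped; returning reduced-scope result.",
    }

print(degraded_response("research", 5.43)["tier"])  # CONSTRAINED
```

Defaulting unknown workflows to Tier 3 is the conservative choice: an explicit handoff is recoverable, a silent wrong answer is not.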

Wiring the Agent Cost Circuit Breaker to Langfuse and OpenTelemetry

Where the cost numbers come from, and how to propagate them across services for cross-workflow analysis.

The circuit breaker itself is pure control logic — it just needs cost numbers. The question is where those numbers come from in practice.

Langfuse provides per-generation cost data through generation.usage.cost, which it calculates from token counts and model pricing automatically. You can hook into its Python SDK callback to feed cost into your breaker in real time rather than polling after the fact.

OpenTelemetry spans give you the infrastructure layer: propagate cost data across services, correlate it with workflow IDs, user IDs, and feature flags, and feed it into your observability stack for cross-session analysis. When a breaker trips, the span carries enough context to know exactly which workflow, which input type, and which tool call sequence caused the spiral.

TrueFoundry's cost observability layer adds workflow-level cost aggregation with built-in alerting, which can serve as a second line of defense — not a replacement for the in-process breaker, but a useful catch for cost trends that develop over multiple sessions rather than within a single session.

cost_tracing.py
from opentelemetry import trace
from langfuse.decorators import langfuse_context, observe

tracer = trace.get_tracer("agent.cost.breaker")

@observe()  # Langfuse decorator — auto-captures token usage + cost
def execute_with_cost_tracing(
    workflow_id: str,
    session_id: str,
    breaker: CostCircuitBreaker,
    prompt: str,
) -> dict:
    with tracer.start_as_current_span("agent.execute") as span:
        span.set_attribute("workflow.id", workflow_id)
        span.set_attribute("session.id", session_id)
        span.set_attribute("cost.budget.usd", breaker.trip_threshold_usd)
        span.set_attribute("circuit.state", breaker.state)

        if not breaker.should_allow():
            span.set_attribute("circuit.tripped", True)
            span.set_attribute("cost.session.usd", breaker.session_cost)
            return degraded_response(workflow_id, breaker.state)

        response = llm_client.chat(prompt)

        # Langfuse captures generation cost automatically via @observe.
        # Feed the breaker from the client's usage data — the field name
        # here is an assumption; adapt to wherever your stack exposes
        # per-call cost (token counts x model pricing works too).
        cost_usd = response.usage.cost_usd
        breaker.record_call(cost_usd)

        span.set_attribute("cost.call.usd", cost_usd)
        span.set_attribute("cost.session.usd", breaker.session_cost)
        span.set_attribute("circuit.state", breaker.state)

        return response

Pre-Production Circuit Breaker Checklist

  • Run 200+ representative inputs through the agent and log full token traces

  • Calculate P50, P90, P95, P99 session costs from profiling data

  • Set trip threshold at 3× P95 for each workflow type

  • Set conversation turn limit alongside cost threshold (typically 2–3× expected turns)

  • Define degradation tier (1/2/3) for each workflow before deploying

  • Implement one breaker instance per session — never shared across sessions

  • Wire trip callback to PagerDuty or Slack oncall channel

  • Add OTel span attributes: workflow.id, cost.budget.usd, cost.session.usd, circuit.state

  • Test trip behavior explicitly: run a synthetic loop that exceeds the threshold

  • Validate HALF-OPEN probe budget is ≤10% of trip threshold

Does the circuit breaker add meaningful latency to every agent call?

No. The should_allow() check is a local in-memory operation — it adds microseconds, not milliseconds. The record_call() update happens synchronously after the LLM response returns. At 100 concurrent sessions, the overhead is negligible compared to LLM call latency (typically 500ms–3s).

How do I handle multi-agent workflows where cost accumulates across several agents?

Pass the session's cumulative cost through your execution context so all sub-agents share the same cost accumulator. Each sub-agent checks the same breaker instance before executing. This prevents a scenario where each sub-agent has a clean budget while the parent session is already deep into a spiral.
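A toy sketch of that shared-accumulator pattern — the budget object, sub-agent function, and costs below are illustrative stand-ins, not the article's full breaker:

```python
from dataclasses import dataclass

@dataclass
class SharedBudget:
    """One accumulator threaded through every sub-agent in the session."""
    trip_threshold_usd: float
    session_cost: float = 0.0

    def should_allow(self) -> bool:
        return self.session_cost < self.trip_threshold_usd

    def record(self, cost_usd: float) -> None:
        self.session_cost += cost_usd

def run_subagent(name: str, budget: SharedBudget, step_cost: float) -> bool:
    if not budget.should_allow():
        return False          # parent session already over budget
    budget.record(step_cost)  # spend accrues to the SHARED accumulator
    return True

budget = SharedBudget(trip_threshold_usd=1.00)
results = [run_subagent(f"agent-{i}", budget, 0.30) for i in range(5)]
print(results)  # [True, True, True, True, False]
```

Because every sub-agent reads and writes the same accumulator, the fifth one is blocked even though it has spent nothing itself — the failure mode the answer above warns about cannot occur.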

What if the HALF-OPEN probe request also spirals?

The probe budget should be tight — 5–10% of the trip threshold. If the probe exceeds its budget, the breaker immediately returns to OPEN and resets the cooldown timer. This prevents HALF-OPEN from becoming a second spike vector. Set probe_budget_usd low enough that even a misbehaving probe causes minimal damage.

Should I set a global monthly budget as well, in addition to per-session breakers?

Yes, but treat it as a last-resort backstop, not a primary control. Per-session breakers catch in-flight spirals. A monthly budget cap in your AI gateway (Portkey, Helicone, or a custom middleware) catches slow accumulation patterns that are too distributed to trigger individual session breakers — like 50,000 sessions each spending 20% above their expected cost.
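A minimal sketch of that backstop as in-process middleware. In practice this belongs in your gateway with durable storage and atomic updates; the class and its behavior here are assumptions for illustration:

```python
import time

class MonthlyBudgetBackstop:
    """Last-resort monthly cap, checked after per-session breakers."""

    def __init__(self, monthly_cap_usd: float):
        self.monthly_cap_usd = monthly_cap_usd
        self._spend = 0.0
        self._month = time.gmtime().tm_mon

    def record(self, cost_usd: float) -> None:
        month = time.gmtime().tm_mon
        if month != self._month:  # new month: reset the accumulator
            self._month, self._spend = month, 0.0
        self._spend += cost_usd

    def allow(self) -> bool:
        return self._spend < self.monthly_cap_usd

backstop = MonthlyBudgetBackstop(monthly_cap_usd=10_000.0)
backstop.record(9_999.50)
print(backstop.allow())  # True — just under the cap
backstop.record(1.00)
print(backstop.allow())  # False — cap hit, block until the month rolls over
```

This is exactly the distributed-overspend case the answer describes: no single session trips its breaker, but the aggregate still hits a hard ceiling.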

Signs you need a circuit breaker now

  • You've deployed agents in production but your only cost control is a monthly cloud budget cap

  • Your cloud billing anomaly alerts have a 24–48h lag — meaning a Friday spike runs all weekend

  • You have multi-agent workflows where sub-agents share a context that grows with each tool call

  • You've had at least one billing surprise that required a refund request or an awkward finance conversation

  • Your agents run research or summarization tasks with no hard limit on source documents fetched

Signs your current threshold needs recalibration

  • The breaker trips more than twice per week on workflows that complete successfully

  • False positives are causing degraded responses for legitimate edge-case inputs

  • Your P95 session cost has shifted by more than 20% since you last profiled (model pricing changes, new tools added)

  • You added new tool types that have significantly different token output sizes than the original profiling set

The 200-input profiling pass felt like overhead we couldn't afford before launch. We skipped it. Two weeks later we were on the phone with our cloud provider asking for a $9,000 credit. The profiling pass now takes one afternoon and is non-negotiable before any agent ships.

Staff Platform Engineer, Series C AI infrastructure company
Key terms in this piece
agent cost circuit breaker, LLM cost control, agent billing spike prevention, cost envelope estimation, LLM orchestration guardrails, agent cost observability, token budget middleware
Sources
  [1] ZenML: What 1,200 Production Deployments Reveal About LLMOps in 2025 (zenml.io)
  [2] Portkey: Retries, Fallbacks, and Circuit Breakers in LLM Apps (portkey.ai)
  [3] Galileo: AI Agent Cost Optimization and Observability (galileo.ai)
  [4] BMD Pat: AI Agent Cost Control with AgentGuard in Python (bmdpat.com)
  [5] TrueFoundry: AI Cost Observability (truefoundry.com)
  [6] Dev.to: How to Stop AI Agent Cost Spirals Before They Start (dev.to)