Agent Cost Circuit Breaker: The In-Process Control That Holds

The Cloud Bill Is Not Your Cost Control. The Circuit Breaker Is.

Billing anomaly alerts run on a 24–48 hour lag. The retry loop is already an invoice by the time anyone sees it. The control that catches it is per-session, in-process, and lives in the orchestration layer — profiled envelope, 3x P95 trip, defined degradation.

AI Engineering Platform · advanced · Nov 23, 2025 · 6 min read

By Viktor Bezdek · VP Engineering, Groupon

Friday, 6 PM. A research agent ships behind caching and rate limits. By Saturday midnight it has fired roughly 8,400 API calls, re-querying the same documents with slightly different prompts and appending every tool response to a growing context. Cost per call is now forty times baseline. The ZenML production survey of 1,200 deployments documents the shape of this failure.^[1] No alert fired. No circuit tripped. The system worked exactly as designed. Monday: a $12,000 invoice.

Monitoring did not fail. Monitoring just showed up after the burn. Cloud anomaly detection aggregates on a 24–48 hour lag, which is another way of saying it tells you about Friday's incident on Sunday afternoon. The control that catches the loop in time is not in the billing dashboard. It lives in the orchestration layer — in-process, per-session, measuring cumulative spend against a defined envelope and tripping before the loop compounds.

This is the circuit breaker pattern, ported. Hystrix used it for service availability: when a downstream endpoint starts failing, stop sending traffic instead of hammering a broken system. The adaptation for cost is mechanical. Replace HTTP 500 rate with session spend exceeding a profiled envelope. Everything else maps one to one.

If you already run agents in production, you have either had the billing surprise or you are queued for the first one. What follows is what catches it.

Up to $47K

Largest single-session spike in a 4-week window (ZenML, 2025). The mean is small. The tail is what burns the budget.

~8,400

Calls one Friday-night retry loop generated before anything alerted. Your tool surface produces a different number. The failure mode is identical.

~50×

Cost multiplier between call 1 and call 100 in a compounding session. The driver is context length, not call count.

24–48h

Cloud anomaly alert lag. By the time it names the spike, the spend is already locked in.

Cloud Cost Controls Were Built for Linear Resources. Agents Are Not Linear.

The mechanism that turns $0.03 calls into $180 sessions, and why every standard alert lands after the spend is already final.

Cloud cost alerts work because cloud resources consume linearly. Ten EC2 instances cost ten times one. Agents break that assumption. Each tool call appends its result to the conversation context, so the tenth call in a session does not cost the same as the first. It can cost twenty times as much, because the context feeding that call is twenty times longer.^[6] A multi-agent workflow with a typical cost of $0.45 can reach $87 once one sub-agent enters a tool loop and each retry appends a 2,000-token response to the shared context.
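
To make the compounding concrete, here is a minimal sketch of the arithmetic. The price and the starting prompt size are illustrative assumptions, not figures from any provider or from the deployments cited here; only the 2,000-token tool response comes from the example above.

```python
# Illustrative numbers only: the price and base prompt size are assumptions.
INPUT_PRICE_PER_TOKEN = 3.00 / 1_000_000   # assumed $3 per million input tokens
BASE_PROMPT_TOKENS = 1_000                 # assumed context size on the first call
TOOL_RESPONSE_TOKENS = 2_000               # appended to context after every call


def call_cost(call_index: int) -> float:
    """Input cost of the Nth call (1-based) when every prior tool response stays in context."""
    context_tokens = BASE_PROMPT_TOKENS + (call_index - 1) * TOOL_RESPONSE_TOKENS
    return context_tokens * INPUT_PRICE_PER_TOKEN


def session_cost(calls: int) -> float:
    """Cumulative input cost after `calls` calls."""
    return sum(call_cost(i) for i in range(1, calls + 1))


if __name__ == "__main__":
    for n in (1, 10, 100):
        ratio = call_cost(n) / call_cost(1)
        print(f"call {n}: ${call_cost(n):.4f} per call ({ratio:.0f}x call 1), "
              f"session total ${session_cost(n):.2f}")
```

Under these assumptions the per-call cost grows linearly with the call index, so the session total grows roughly with the square of the number of calls. That is the shape a linear billing alert is not built to anticipate.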

The failure modes share one structural property. The termination condition never cleanly fires. The agent keeps going because nothing tells it to stop.

Infinite tool loops. The agent calls the same tool repeatedly. Each response is slightly different. The internal stopping criterion never trips. Tokens stack onto context. Cost per call climbs every iteration.

Hallucinated tool chains. The agent emits a multi-step plan with dozens of API calls. Each step looks cheap on its own. Each step anchors the next. The exit condition is the end of the plan or an error — whichever lands first.

Runaway research. The agent finds one more source. Then another. Diminishing returns is not a concept it has. Without a hard turn limit, it optimizes for completeness over cost.

AWS Cost Anomaly Detection and its peers aggregate on a 24–48 hour delay. By the time the alert lands, the loop has been running for 36 hours. The control that matters runs in-process, in real time, scoped to one session.

How Session Cost Compounds Inside a Retry Loop

Every iteration appends tool responses to context. Cost-per-call grows with every cycle. Without a session-level trip threshold, the loop runs until the termination condition finally fires — or until Monday morning.

Three States, Mapped From Hystrix to Session Cost

CLOSED, OPEN, HALF-OPEN — the failure signal swaps from HTTP 500 rate to session spend exceeding the envelope.

Three states, each with one job.

CLOSED — Normal operation. Traffic flows through. The breaker silently tracks cumulative session cost. Below threshold, it stays out of the way.

OPEN — Threshold crossed. Every further agent call in this session is blocked and routed to a fallback. The session does not retry or queue. It degrades. An alert fires to oncall.

HALF-OPEN — After a configurable cooldown (5 minutes is a reasonable default), the breaker lets one probe request through with a tight sub-budget. Probe completes inside its budget, breaker resets to CLOSED. Probe spirals, breaker returns to OPEN and resets the cooldown.

The mapping is direct. Replace HTTP 500 rate with cost-per-session exceeding a defined envelope. Trip threshold takes the place of failure-rate threshold. Degraded response takes the place of a static fallback page.

One distinction is load-bearing: state must be per-session, not global. One runaway must not block 199 healthy concurrent sessions. Each workflow session owns its breaker instance and its cost accumulator. If your orchestration layer runs 200 concurrent sessions, you run 200 independent breakers — each tracking exactly one session's spend.
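
One way to keep state per-session is a small registry keyed by session ID. The sketch below uses a deliberately tiny stand-in (`SessionBudget`) rather than a full breaker, and both class names are hypothetical; the point is only that a runaway session trips its own accumulator and nothing else.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SessionBudget:
    """Hypothetical stand-in for a per-session breaker: one accumulator, one flag."""
    trip_threshold_usd: float
    spent_usd: float = 0.0
    tripped: bool = False

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        if self.spent_usd >= self.trip_threshold_usd:
            self.tripped = True


class BreakerRegistry:
    """Maps session IDs to their own budget objects; nothing is shared."""

    def __init__(self, trip_threshold_usd: float) -> None:
        self._threshold = trip_threshold_usd
        self._by_session: Dict[str, SessionBudget] = {}

    def for_session(self, session_id: str) -> SessionBudget:
        # Create lazily on first use so each session gets a fresh accumulator.
        if session_id not in self._by_session:
            self._by_session[session_id] = SessionBudget(self._threshold)
        return self._by_session[session_id]

    def close_session(self, session_id: str) -> None:
        # Drop state when the session ends so the registry does not grow forever.
        self._by_session.pop(session_id, None)


registry = BreakerRegistry(trip_threshold_usd=2.40)
registry.for_session("runaway").record(3.00)   # this session trips
registry.for_session("healthy").record(0.10)   # this one is unaffected
```

In a real orchestrator the registry entry would be created when the session starts and removed when it ends, so the lifetime of the breaker matches the lifetime of the session.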

Dashboard observability

Budget alert lands 24–48h after the spike starts
By alert time, the session is already $12,000+ deep
Engineer opens Monday to a billing surprise, not an incident
No record of which workflow or input fed the loop
Refund request to AWS or Anthropic — sometimes honored

In-process enforcement

Breaker trips at 3× P95 ($2.40 for a $0.80 P95 workflow)
Session degrades cleanly; user gets a partial result
Oncall gets a PagerDuty page in seconds
OTel span carries workflow ID, session cost, trip reason
Total incident spend: $2.43 instead of $12,000+

Infinite loops

Agents re-calling the same tool because the termination condition never cleanly fires

Hallucinated chains

Multi-step plans of dozens of API calls — each cheap on its own, ruinous in aggregate

Runaway research

Research agents fetching one more source, then another, with no internal model of diminishing returns

Context compounding

Every tool response appended to the window, making the next call non-linearly more expensive

You Cannot Calibrate a Breaker You Have Not Profiled

The hard work is upstream of the breaker — measuring what 'normal' looks like before you set the trip threshold.

A breaker without a cost distribution behind it is theater. The wrong threshold is as costly as no threshold: too tight, you trip on legitimate traffic and the team learns to ignore the alert; too loose, the breaker never fires until the damage is already done.

We got this wrong on the first deployment. We ran 50 test inputs, saw a P95 of $0.80, set the threshold at $2.40 (3×), and shipped. What we had not profiled was end-of-month reporting queries, which naturally pull more context and run 2–3× longer than typical interactions. The breaker tripped 17 times in week one, every time on legitimate traffic. The team started treating the alerts as noise. The next month, a real runaway happened, the alert fired, and oncall dismissed it alongside 14 false positives. False-positive fatigue is the same outcome as no threshold, with extra steps.

The profiling process is three steps.

Step 1 — Profile representative inputs. Run the agent against 200–500 production-like inputs before deploying. Measure token counts at each step: input, output, tool call, and tokens appended to context from each tool response. Log the full per-session cost trace.

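Step 2 — Compute P50, P90, P95, and P99 session costs. Session cost is $C = \sum_{s} \left( t^{\text{in}}_{s}\, p_{\text{in}} + t^{\text{out}}_{s}\, p_{\text{out}} \right)$, summed over every step $s$ in the session, where $t^{\text{in}}_{s}$ and $t^{\text{out}}_{s}$ are the input and output token counts at step $s$, and $p_{\text{in}}$ and $p_{\text{out}}$ are the per-token input and output prices. Plot the distribution. P50 is the happy path. P95 is the number the trip threshold is built on. P99 is the outlier shoulder. Anything above P99 is a runaway candidate.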

Step 3 — Set the trip threshold at 3× P95 as the starting point. Not 10× (too loose to catch spirals early), not 1.5× (you trip on legitimate variance). 3× P95 buys enough headroom for real-world variance while still catching genuine runaways before they compound — calibrate against your workload from there^[2]. Practitioners report Cox Automotive uses P95 cost as the trip point for customer service agents: when a conversation crosses that line, the agent hands off to a human instead of continuing.
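
The three steps above can be sketched with the standard library alone. The function names are hypothetical, the 3× multiplier is the starting point from the text rather than a universal constant, and the input is whatever per-session cost trace your profiling run logged.

```python
import statistics
from typing import Dict, List, Tuple


def session_cost_usd(steps: List[Tuple[int, int]],
                     input_price: float, output_price: float) -> float:
    """Sum per-step cost. Each step is (tokens_in, tokens_out); prices are per token."""
    return sum(t_in * input_price + t_out * output_price for t_in, t_out in steps)


def profile(costs_usd: List[float], multiplier: float = 3.0) -> Dict[str, float]:
    """Summarize profiled session costs and derive a starting trip threshold."""
    if len(costs_usd) < 2:
        raise ValueError("need at least two profiled sessions")
    # quantiles(n=100) returns the 99 cut points P1..P99; index k-1 holds Pk.
    cuts = statistics.quantiles(costs_usd, n=100, method="inclusive")
    p95 = cuts[94]
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p95": p95,
        "p99": cuts[98],
        "trip_threshold": multiplier * p95,
    }
```

Feed `profile` one cost per profiled session, not one cost per call. With only 50 inputs the P95 and P99 estimates rest on two or three data points each, which is one reason the text recommends 200 to 500.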

Add a turn-count threshold alongside the cost threshold. Long turn counts usually precede cost spikes — a workflow that normally takes 8 turns and has hit 35 is almost certainly in a loop, even if cumulative cost has not crossed the dollar line yet.
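
A turn-count guard can be as small as a counter. This is a sketch under assumptions: `TurnGuard` is a hypothetical name, and the default multiplier of 3 is taken from the upper end of the range this article suggests, not from any measured workload.

```python
from dataclasses import dataclass


@dataclass
class TurnGuard:
    """Counts turns in one session and flags when the count looks like a loop."""
    expected_turns: int          # typical turn count from profiling
    multiplier: float = 3.0      # assumption: trip at 3x the expected turns
    turns: int = 0

    @property
    def limit(self) -> int:
        return int(self.expected_turns * self.multiplier)

    def record_turn(self) -> bool:
        """Count one turn; return True while the session is still within its limit."""
        self.turns += 1
        return self.turns <= self.limit


guard = TurnGuard(expected_turns=8)
within_limit = [guard.record_turn() for _ in range(35)]
```

For a workflow that normally takes 8 turns, this guard goes false on turn 25, well before the 35-turn session in the example above. Checking it next to the cost check gives the orchestrator two independent reasons to stop.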

Workflow	P50 Cost	P95 Cost	Trip Threshold (3× P95)	Degradation Tier
Simple Q&A agent	$0.03	$0.09	$0.27	Return cached answer
Research agent	$0.45	$1.80	$5.40	Cap tool calls at 3
Code review agent	$0.22	$0.85	$2.55	Skip style checks
Data extraction agent	$0.60	$2.40	$7.20	Return partial results
Multi-agent pipeline	$1.20	$4.80	$14.40	Skip non-critical subagents
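
The table translates directly into configuration. In this sketch the workflow keys and the `WorkflowBudget` class are hypothetical; the P95 values and degradation tiers are the ones listed above, and the threshold is derived rather than hand-typed so it cannot drift from the P95 it is based on.

```python
from dataclasses import dataclass

TRIP_MULTIPLIER = 3.0  # starting point from the text; recalibrate per workload


@dataclass(frozen=True)
class WorkflowBudget:
    p95_usd: float
    degradation: str

    @property
    def trip_threshold_usd(self) -> float:
        # Round to cents so the derived value matches what appears on a dashboard.
        return round(self.p95_usd * TRIP_MULTIPLIER, 2)


BUDGETS = {
    "simple_qa": WorkflowBudget(0.09, "return cached answer"),
    "research": WorkflowBudget(1.80, "cap tool calls at 3"),
    "code_review": WorkflowBudget(0.85, "skip style checks"),
    "data_extraction": WorkflowBudget(2.40, "return partial results"),
    "multi_agent_pipeline": WorkflowBudget(4.80, "skip non-critical subagents"),
}
```

When a workflow is re-profiled, only `p95_usd` changes and the threshold follows.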

The Breaker Belongs in the Orchestrator, Not on the Model Call

Per-call instrumentation cannot see the running session total. The orchestration layer can. That is where the breaker has to live.

Place the breaker at the orchestration layer, not on individual model calls. Wrapping chat.completions.create() gives you per-call visibility and exactly nothing about the emergent session cost. The breaker has to see the running total or it cannot do its job.

The minimal state per session is three fields: cumulative cost, current state (CLOSED/OPEN/HALF-OPEN), and the timestamp the breaker last opened (for cooldown tracking). Everything else is a callback.

cost_breaker.py

import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

@dataclass
class CostCircuitBreaker:
    """One instance per session. Never shared."""
    trip_threshold_usd: float
    cooldown_seconds: int = 300
    probe_budget_usd: float = 0.0  # defaults to 10% of trip threshold

    _state: BreakerState = field(default=BreakerState.CLOSED, init=False)
    _session_cost: float = field(default=0.0, init=False)
    _opened_at: float = field(default=0.0, init=False)
    _on_trip: Callable | None = field(default=None, init=False)

    def __post_init__(self) -> None:
        if self.probe_budget_usd == 0.0:
            self.probe_budget_usd = self.trip_threshold_usd * 0.10

    def on_trip(self, callback: Callable) -> None:
        """Fires when the breaker opens. Wire this to PagerDuty."""
        self._on_trip = callback

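    def record_call(self, cost_usd: float) -> None:
        """Call after every LLM response. Trips on threshold breach."""
        self._session_cost += cost_usd
        if (
            self._state == BreakerState.CLOSED
            and self._session_cost >= self.trip_threshold_usd
        ):
            self._open()
        elif (
            self._state == BreakerState.HALF_OPEN
            and self._session_cost >= self.probe_budget_usd
        ):
            # Probe blew its sub-budget: back to OPEN, cooldown restarts.
            self._open()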

    def should_allow(self) -> bool:
        """Gate every LLM invocation. False routes to fallback."""
        if self._state == BreakerState.CLOSED:
            return True
        if self._state == BreakerState.OPEN:
            elapsed = time.monotonic() - self._opened_at
            if elapsed >= self.cooldown_seconds:
                self._state = BreakerState.HALF_OPEN
                self._session_cost = 0.0
                return True
            return False
        # HALF_OPEN: probe runs only under sub-budget
        return self._session_cost < self.probe_budget_usd

    def reset(self) -> None:
        """Probe completed clean. Return to CLOSED."""
        self._state = BreakerState.CLOSED
        self._session_cost = 0.0

    @property
    def state(self) -> str:
        return self._state.value

    @property
    def session_cost(self) -> float:
        return self._session_cost

    def _open(self) -> None:
        self._state = BreakerState.OPEN
        self._opened_at = time.monotonic()
        if self._on_trip:
            self._on_trip({
                "session_cost": self._session_cost,
                "threshold": self.trip_threshold_usd,
            })


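# One breaker per session. Never shared across sessions.
# notify_pagerduty, degraded_response, and llm_client are placeholders for
# your own alerting hook, fallback builder, and model client.
def new_session_breaker() -> CostCircuitBreaker:
    breaker = CostCircuitBreaker(trip_threshold_usd=5.40)
    breaker.on_trip(notify_pagerduty)
    return breaker

def run_agent_step(breaker: CostCircuitBreaker, prompt: str) -> dict:
    if not breaker.should_allow():
        return degraded_response(breaker.state, breaker.session_cost)

    response = llm_client.chat(prompt)
    # cost_usd is assumed to be exposed by the client; otherwise compute it
    # from token usage and model pricing before recording.
    breaker.record_call(response.usage.cost_usd)
    return response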

Where the Breaker Sits in the Request Flow

Each workflow session owns its breaker instance. OPEN routes straight to degradation. HALF-OPEN allows one probe under a tight sub-budget before reset.
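
Walking the three states with a synthetic loop is easier when the clock is injectable, so the cooldown does not need real waiting. The sketch below is a condensed, hypothetical variant (`MiniBreaker`) written for that purpose, not a copy of the class in this article; the $0.50 per-call cost and the probe budget are arbitrary test values.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class MiniBreaker:
    """Condensed, hypothetical breaker with an injectable clock for testing."""

    def __init__(self, trip_usd: float, probe_usd: float,
                 cooldown_s: float, clock: Callable[[], float]) -> None:
        self.trip_usd, self.probe_usd = trip_usd, probe_usd
        self.cooldown_s, self.clock = cooldown_s, clock
        self.state, self.cost, self.opened_at = State.CLOSED, 0.0, 0.0

    def should_allow(self) -> bool:
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.cooldown_s:
            self.state, self.cost = State.HALF_OPEN, 0.0   # fresh probe budget
        return self.state is not State.OPEN

    def record_call(self, cost_usd: float) -> None:
        self.cost += cost_usd
        limit = self.probe_usd if self.state is State.HALF_OPEN else self.trip_usd
        if self.state is not State.OPEN and self.cost >= limit:
            self.state, self.opened_at = State.OPEN, self.clock()

    def reset(self) -> None:
        self.state, self.cost = State.CLOSED, 0.0


now = [0.0]  # fake clock, advanced by hand
b = MiniBreaker(trip_usd=2.40, probe_usd=0.24, cooldown_s=300, clock=lambda: now[0])

calls = 0
while b.should_allow() and calls < 1_000:   # synthetic runaway at $0.50 per call
    b.record_call(0.50)
    calls += 1
trace = [b.state]            # state once the loop is cut off

now[0] += 300                # cooldown elapses
b.should_allow()             # first check after cooldown admits a probe
trace.append(b.state)
b.record_call(0.50)          # probe overshoots its sub-budget
trace.append(b.state)

now[0] += 300                # second cooldown
b.should_allow()
b.record_call(0.125)         # probe stays inside its sub-budget
b.reset()                    # orchestrator judges the probe clean
trace.append(b.state)
```

The `trace` list should read OPEN, HALF_OPEN, OPEN, CLOSED. If a change to the breaker alters that sequence, the pre-production trip test has done its job.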

What Happens After the Trip Decides Whether the Breaker Is Useful

Open the circuit and you have two seconds to decide what the user sees. Define the degradation tier before deploy, not after the page fires.

Tier 1 — Constrained execution
Cut the tool call ceiling to 3, drop non-essential subagents, return partial results with a note. The agent still runs, with reduced scope.
Tier 2 — Cached fallback
Return the most recent clean response for this workflow type from the response cache, if one exists.
Tier 3 — Explicit handoff
Return a structured response describing what completed and route the conversation to a human queue. Mandatory for customer-facing agents where silent failure is the worst outcome.
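
The tiers can be encoded as a single dispatch function so the decision is made in code before deploy rather than during the page. This is a sketch with hypothetical names and response fields (`build_degraded_response`, `human_queue`); the one deliberate design choice is that a cache miss in Tier 2 falls through to an explicit handoff instead of returning nothing.

```python
from enum import Enum
from typing import List, Optional


class Tier(Enum):
    CONSTRAINED = 1   # keep running with reduced scope
    CACHED = 2        # serve the last clean response
    HANDOFF = 3       # route to a human


def build_degraded_response(tier: Tier, completed_steps: List[str],
                            cached_answer: Optional[str] = None) -> dict:
    """Build the payload the user sees once the breaker is OPEN."""
    if tier is Tier.CONSTRAINED:
        return {
            "mode": "constrained",
            "max_tool_calls": 3,
            "partial": completed_steps,
            "note": "Partial result: this session reached its cost limit.",
        }
    if tier is Tier.CACHED and cached_answer is not None:
        return {"mode": "cached", "answer": cached_answer}
    # Tier 3, or Tier 2 with an empty cache: never fail silently.
    return {"mode": "handoff", "completed": completed_steps, "route_to": "human_queue"}
```

Which tier a workflow uses belongs in the same configuration as its trip threshold, so both are reviewed together.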

The Breaker Is Pure Logic. The Cost Numbers Have to Come From Somewhere.

Wire it to the cost source — Langfuse for per-generation pricing, OpenTelemetry for cross-service propagation, monthly caps as a backstop.

The breaker is pure control logic. It just needs a cost number per call. The interesting question is where that number lives in your stack.

Langfuse exposes per-generation cost through generation.usage.cost, computed from token counts and model pricing automatically. Hook the Python SDK callback to feed cost into the breaker in real time, instead of polling after the fact.

OpenTelemetry spans give you the infrastructure layer. Propagate cost across services, correlate with workflow ID, user ID, and feature flag, and pipe the result into the observability stack for cross-session analysis. When a breaker trips, the span carries enough context to name the workflow, the input type, and the tool call sequence that fed the spiral.

TrueFoundry's cost observability layer adds workflow-level aggregation with built-in alerting. Treat it as a second line — not a replacement for the in-process breaker, but a useful catch for cost trends that develop across sessions rather than inside one.

cost_tracing.py

from opentelemetry import trace
from langfuse.decorators import observe

tracer = trace.get_tracer("agent.cost.breaker")

@observe()  # Langfuse records this call as an observation on the trace
def execute_with_cost_tracing(
    workflow_id: str,
    session_id: str,
    breaker: CostCircuitBreaker,
    prompt: str,
) -> dict:
    with tracer.start_as_current_span("agent.execute") as span:
        span.set_attribute("workflow.id", workflow_id)
        span.set_attribute("session.id", session_id)
        span.set_attribute("cost.budget.usd", breaker.trip_threshold_usd)
        span.set_attribute("circuit.state", breaker.state)

        if not breaker.should_allow():
            span.set_attribute("circuit.tripped", True)
            span.set_attribute("cost.session.usd", breaker.session_cost)
            return degraded_response(breaker.state, breaker.session_cost)

        response = llm_client.chat(prompt)

        # The breaker needs the cost in-process, immediately. This sketch reads
        # it from the client response (cost_usd is an assumed field, as in
        # cost_breaker.py) rather than waiting on the tracing backend.
        cost_usd = response.usage.cost_usd
        breaker.record_call(cost_usd)

        span.set_attribute("cost.call.usd", cost_usd)
        span.set_attribute("cost.session.usd", breaker.session_cost)
        span.set_attribute("circuit.state", breaker.state)

        return response

Pre-Production Breaker Checklist

200+ representative inputs profiled through the agent with full token traces logged
P50, P90, P95, P99 session costs computed from profiling data
Trip threshold set to 3× P95 for each workflow type
Turn-count threshold set alongside cost threshold (typically 2–3× expected turns)
Degradation tier (1/2/3) named for each workflow before deploy
One breaker instance per session — never shared across sessions
Trip callback wired to PagerDuty or the Slack oncall channel
OTel span attributes present: workflow.id, cost.budget.usd, cost.session.usd, circuit.state
Trip behavior tested explicitly with a synthetic loop that exceeds threshold
HALF-OPEN probe budget validated at ≤10% of trip threshold

Does the breaker add meaningful latency to every agent call?

No. should_allow() is a local in-memory check — microseconds, not milliseconds. record_call() runs synchronously after the LLM response returns. At 100 concurrent sessions, overhead is negligible against LLM call latency (typically 500ms–3s). If you measure a hot path bottleneck, the bottleneck is the model, not the breaker.

How do I handle multi-agent workflows where cost accumulates across sub-agents?

Pass the session's cumulative cost through the execution context so every sub-agent shares the same accumulator. Each sub-agent checks the same breaker instance before executing. This prevents the symmetric failure where each sub-agent sees a clean budget while the parent session is already deep into a spiral.
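
One way to pass a shared accumulator through the execution context in Python is `contextvars`. The sketch below is an assumption about how an orchestrator might wire it, with hypothetical names (`SharedBudget`, `run_sub_agent`, `run_session`) and a fixed cost standing in for a real model call.

```python
import contextvars
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SharedBudget:
    """One accumulator for the whole session, visible to every sub-agent."""
    trip_threshold_usd: float
    spent_usd: float = 0.0

    def allow(self) -> bool:
        return self.spent_usd < self.trip_threshold_usd

    def record(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd


_current_budget: contextvars.ContextVar[SharedBudget] = contextvars.ContextVar("session_budget")


def run_sub_agent(name: str, cost_usd: float) -> str:
    budget = _current_budget.get()      # the same object the parent installed
    if not budget.allow():
        return f"{name}: skipped"
    budget.record(cost_usd)             # stand-in for a real model call
    return f"{name}: ran"


def run_session(trip_threshold_usd: float, plan: List[Tuple[str, float]]) -> List[str]:
    token = _current_budget.set(SharedBudget(trip_threshold_usd))
    try:
        return [run_sub_agent(name, cost) for name, cost in plan]
    finally:
        _current_budget.reset(token)    # do not leak the budget into the next session


results = run_session(1.00, [("planner", 0.60), ("researcher", 0.60), ("writer", 0.60)])
```

The third sub-agent is skipped because it sees the spend of the first two. With one budget per sub-agent, all three would have run and the session would have ended at $1.80 against a $1.00 limit.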

What if the HALF-OPEN probe also spirals?

Keep the probe budget tight — 5–10% of the trip threshold. If the probe exceeds its sub-budget, the breaker returns to OPEN and resets the cooldown. HALF-OPEN does not get to become a second spike vector. Set probe_budget_usd low enough that a misbehaving probe causes minimal damage.

Do I still need a global monthly budget cap on top of per-session breakers?

Yes, but only as a backstop. Per-session breakers catch in-flight spirals. A monthly cap in the AI gateway (Portkey, Helicone, or custom middleware) catches slow accumulation patterns too distributed to trip an individual session breaker — 50,000 sessions each running 20% over expected cost is the canonical case. One counterintuitive finding: teams with very tight monthly caps sometimes generate worse incentives than teams with none. Engineers start designing agents to land just under the cap instead of optimizing for cost-per-outcome, producing slightly-too-expensive-but-technically-compliant behavior across the fleet. Use the monthly cap as a hard stop. Use per-workflow P95 trends as the primary efficiency signal.

Signals you need a breaker now, not next quarter

Agents are in production and your only cost control is a monthly cloud budget cap
Cloud anomaly alerts run on a 24–48h lag — a Friday spike runs all weekend
Multi-agent workflows share a context that grows with every tool call
You have already had one billing surprise that needed a refund request or an awkward finance conversation
Research and summarization agents run without a hard limit on source documents fetched

Signals the threshold has drifted and needs recalibration

✓ The breaker trips more than twice per week on workflows that complete cleanly
✓ False positives are forcing degraded responses on legitimate edge-case inputs
✓ P95 session cost has moved more than 20% since you last profiled (model pricing change, new tools added)
✓ New tool types have meaningfully different token output sizes than the original profiling set
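
The P95 drift signal is simple enough to automate. A minimal sketch, assuming you keep the P95 from the last profiling run and can compute a current P95 from recent production sessions; the function name is hypothetical and the 20% tolerance is the figure from the list above.

```python
def needs_recalibration(profiled_p95_usd: float, current_p95_usd: float,
                        tolerance: float = 0.20) -> bool:
    """True when P95 has moved more than `tolerance` (as a fraction) in either direction."""
    if profiled_p95_usd <= 0:
        raise ValueError("profiled P95 must be positive")
    drift = abs(current_p95_usd - profiled_p95_usd) / profiled_p95_usd
    return drift > tolerance
```

Drift downward matters as much as drift upward: if sessions have become cheaper, a threshold built on the old P95 is now looser than 3× and will catch runaways later than intended.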

Key terms in this piece

agent cost circuit breaker, LLM cost control, agent billing spike prevention, cost envelope estimation, LLM orchestration guardrails, agent cost observability, token budget middleware

Sources

[1] ZenML: What 1,200 Production Deployments Reveal About LLMOps in 2025 (zenml.io)
[2] Portkey: Retries, Fallbacks, and Circuit Breakers in LLM Apps (portkey.ai)
[3] Galileo: AI Agent Cost Optimization and Observability (galileo.ai)
[4] BMD Pat: AI Agent Cost Control with AgentGuard in Python (bmdpat.com)
[5] TrueFoundry: AI Cost Observability (truefoundry.com)
[6] Dev.to: How to Stop AI Agent Cost Spirals Before They Start (dev.to)
