A Friday afternoon deploy. New research agent, caching enabled, rate limiting in place. By Monday morning, a Slack alert fires: your API bill hit $12,000 over the weekend. The agent entered a retry loop at 6 PM Friday, re-querying the same documents with slightly different prompts. Each iteration appended the full tool response to a growing context window. By Saturday midnight it had made roughly 8,400 API calls, according to the ZenML production survey that documented this scenario[1]. By Sunday, the cost per call was approximately 40 times what it was at the start. No alert fired. No circuit tripped. The system worked exactly as designed.
This is not a monitoring problem. Monitoring shows you the fire after it starts. The agent cost circuit breaker is the pattern that stops ignition — an architectural control sitting in your orchestration layer that measures the cost envelope of each workflow session and trips before the damage compounds.
The concept comes directly from distributed systems. Hystrix popularized it for service availability: when a downstream service starts failing, you stop sending traffic rather than hammering a broken endpoint. The adaptation for agent cost is exact: when a workflow session's spend exceeds its expected envelope, you stop executing rather than continuing to burn tokens on what is almost certainly a runaway path.
This is not a dashboard article. It's an architecture guide for teams who already have agents in production and have either had the billing surprise or are afraid of the first one.
Why Agent Cost Defies Normal Budget Controls
The mechanics of how $0.03 calls become $180 sessions — and why standard cloud alerts always arrive too late.
Cloud cost alerts work because cloud resources consume roughly linearly: 10 more EC2 instances cost approximately 10x more than one. Agents don't behave that way. Each tool call appends its result to the conversation context, so the cost of the 10th call in a session is not 10x the first — it can be 20x, because the context window feeding that call is 20x longer[6]. A multi-agent workflow that normally costs $0.45 can hit $87 when one sub-agent enters a tool loop and each retry appends a 2,000-token response to the shared context.
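The superlinear growth is easy to reproduce with arithmetic. A sketch, using hypothetical per-token prices, of a session where every call re-reads all prior tool output:

```python
# Hypothetical per-token prices for illustration only.
INPUT_PRICE = 3.00 / 1_000_000    # $ per input token
OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token

def session_cost(n_calls: int, base_prompt: int = 1_000,
                 tool_response: int = 2_000) -> float:
    """Cost of a session where each call's input is the base prompt
    plus every prior tool response appended to the context."""
    total = 0.0
    for i in range(n_calls):
        input_tokens = base_prompt + i * tool_response  # context grows each call
        total += input_tokens * INPUT_PRICE + tool_response * OUTPUT_PRICE
    return total

for n in (1, 10, 50):
    print(f"{n:>2} calls: ${session_cost(n):.2f}")
```

With these assumed prices, ten calls cost about 18 times one call rather than 10 times, and fifty calls run to roughly $9; the quadratic term in the input tokens dominates.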
The failure modes that trigger runaway cost all share a common structure: the agent keeps executing because the termination condition is never cleanly met.
Infinite tool loops: the agent calls the same tool repeatedly because each response is slightly different and its internal stopping condition never triggers. Each call adds tokens to context. Cost per call climbs with every iteration.
Hallucinated tool chains: the agent generates a multi-step plan requiring dozens of API calls — steps that are individually cheap but collectively catastrophic. Each step anchors the next, so the agent has no natural exit until the chain completes or an error breaks it.
Runaway research tasks: a research agent finds one more source, then another, with no internal concept of diminishing returns. Without a hard turn limit, it optimizes for completeness over cost.
Standard AWS Cost Anomaly Detection and similar tools have a 24-48 hour aggregation lag. By the time the alert fires, the loop has already run for 36 hours. The circuit breaker has to operate in-process, in real time, at the level of each session.
The Agent Cost Circuit Breaker Pattern
Mapping the three-state Hystrix model — CLOSED, OPEN, HALF-OPEN — to session-level cost thresholds.
The circuit breaker pattern has three states, each with a specific behavior:
CLOSED — Normal operation. Traffic flows through. The breaker silently tracks the session's cumulative cost. Below the threshold, nothing changes.
OPEN — The threshold has been crossed. All further agent calls in this session are immediately blocked and routed to a fallback. The session doesn't retry or queue — it degrades. An alert fires to oncall.
HALF-OPEN — After a configurable cooldown (typically 5 minutes), the breaker allows a small probe request through with a tight sub-budget. If the probe completes within the probe budget, the breaker resets to CLOSED. If the probe itself spirals, the breaker immediately returns to OPEN.
The mapping from distributed systems is direct: instead of HTTP 500 error rate as the failure signal, you use cost-per-session exceeding a defined envelope. The trip threshold replaces the failure rate threshold. Degraded response replaces a static fallback page.
One distinction matters here: the circuit state must be per-session, not global. One runaway session should not block other users' concurrent requests. Each workflow session gets its own breaker instance with its own cost accumulator. If your orchestration layer processes 200 concurrent sessions, you maintain 200 independent breakers — each tracking only its own session's spend.
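That per-session isolation can be sketched with a registry keyed by session ID. The `BreakerRegistry` and `SessionBreaker` names here are illustrative, and the breaker is simplified to a single accumulator:

```python
from dataclasses import dataclass

@dataclass
class SessionBreaker:
    """Simplified stand-in for a full breaker: one cost accumulator."""
    trip_threshold_usd: float
    session_cost: float = 0.0

    def record(self, cost_usd: float) -> bool:
        """Returns False once this session's spend crosses its threshold."""
        self.session_cost += cost_usd
        return self.session_cost < self.trip_threshold_usd

class BreakerRegistry:
    """One independent breaker per session ID; never shared across sessions."""

    def __init__(self, trip_threshold_usd: float):
        self.trip_threshold_usd = trip_threshold_usd
        self._breakers: dict[str, SessionBreaker] = {}

    def for_session(self, session_id: str) -> SessionBreaker:
        # Lazily create a breaker the first time a session is seen.
        if session_id not in self._breakers:
            self._breakers[session_id] = SessionBreaker(self.trip_threshold_usd)
        return self._breakers[session_id]

registry = BreakerRegistry(trip_threshold_usd=2.40)
ok_a = registry.for_session("session-a").record(3.00)  # runaway session trips
ok_b = registry.for_session("session-b").record(0.10)  # neighbor unaffected
```

One session blowing its budget leaves every other session's breaker untouched.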
Without a circuit breaker:
- Budget alert fires 24–48h after the spike starts
- By alert time, the session has already spent $12,000+
- Engineer wakes up Monday to a billing surprise
- No record of which workflow or input caused it
- Full refund request to AWS or Anthropic — sometimes works

With a circuit breaker:
- Breaker trips at 3× P95 cost ($2.40 for a $0.80 P95 workflow)
- Session degrades gracefully; the user gets a partial result
- Oncall gets a PagerDuty notification within seconds
- OTel span captures workflow ID, session cost, and trip reason
- Total incident spend: $2.43 instead of $12,000+
Estimating Cost Envelopes Before You Deploy
The hard part: calculating what 'normal' looks like so you can define the trip threshold with confidence.
You can't calibrate a circuit breaker without knowing your cost distribution. The wrong threshold is as bad as no threshold: too tight and you're tripping on legitimate traffic, too loose and the breaker never fires until you're already deep in the damage.
The estimation process has three steps:
Step 1 — Profile representative inputs. Run your agent against 200–500 production-like inputs before deploying. Measure token counts at each step: input tokens, output tokens, tool call tokens, and the tokens appended to context from each tool response. Log the full cost trace per session.
Step 2 — Calculate P50, P95, and P99 session costs. Session cost = Σ over all steps of (input_tokens × input_price + output_tokens × output_price). Plot the distribution. P50 is your happy path. P99 captures your outliers. Anything above P99 is a runaway candidate.
Step 3 — Set the trip threshold at 3× P95 as a starting point. Not 10x (too loose to catch spirals early), not 1.5x (you'll trip on legitimate variance). 3× P95 gives enough headroom for real-world variance while catching genuine runaways before they compound — calibrate based on your workload[2]. According to practitioners, Cox Automotive uses a P95 cost threshold as their trip point for customer service agents: when a conversation exceeds that, the agent automatically hands off to a human rather than continuing.
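The three steps condense into a small helper using only the standard library. The `trip_threshold` function and its sample data are illustrative, not a library API:

```python
import statistics

def trip_threshold(session_costs: list[float], multiplier: float = 3.0) -> dict:
    """Derive a trip threshold from profiled per-session costs (Step 1 output)."""
    qs = statistics.quantiles(session_costs, n=100)  # qs[k-1] approximates Pk
    p50, p95, p99 = qs[49], qs[94], qs[98]
    return {"p50": p50, "p95": p95, "p99": p99, "trip": multiplier * p95}

# Stand-in for 200 profiled sessions; feed your real cost traces here.
costs = [0.40 + 0.005 * i for i in range(200)]
print(trip_threshold(costs))
```

Re-run this whenever model pricing or the tool set changes, since either shifts the whole distribution.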
A separate threshold for conversation turns is worth adding alongside the cost threshold. Long turn counts often precede cost spikes — a workflow that normally takes 8 turns but has hit 35 is almost certainly in a loop, even if the cost hasn't crossed the threshold yet.
| Workflow | P50 Cost | P95 Cost | Trip Threshold (3× P95) | Degradation Tier |
|---|---|---|---|---|
| Simple Q&A agent | $0.03 | $0.09 | $0.27 | Return cached answer |
| Research agent | $0.45 | $1.80 | $5.40 | Cap tool calls at 3 |
| Code review agent | $0.22 | $0.85 | $2.55 | Skip style checks |
| Data extraction agent | $0.60 | $2.40 | $7.20 | Return partial results |
| Multi-agent pipeline | $1.20 | $4.80 | $14.40 | Skip non-critical subagents |
Building the Agent Cost Circuit Breaker at the Orchestration Layer
Where to place the breaker, what it tracks, and the implementation that handles per-session isolation.
The breaker belongs at the orchestration layer — not on individual model calls. Placing it on `chat.completions.create()` gives you per-call visibility but misses the emergent session cost. The breaker needs to see the running total.
The minimal implementation tracks three things per session: cumulative cost, current state (CLOSED/OPEN/HALF-OPEN), and the timestamp of when the breaker last opened (for cooldown tracking).
cost_circuit_breaker.py

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CostCircuitBreaker:
    """Per-session cost circuit breaker for LLM agent workflows."""

    trip_threshold_usd: float
    cooldown_seconds: int = 300
    probe_budget_usd: float = 0.0  # defaults to 10% of trip threshold
    _state: BreakerState = field(default=BreakerState.CLOSED, init=False)
    _session_cost: float = field(default=0.0, init=False)
    _opened_at: float = field(default=0.0, init=False)
    _on_trip: Callable | None = field(default=None, init=False)

    def __post_init__(self) -> None:
        if self.probe_budget_usd == 0.0:
            self.probe_budget_usd = self.trip_threshold_usd * 0.10

    def on_trip(self, callback: Callable) -> None:
        """Register a callback to fire when the breaker opens (e.g. PagerDuty alert)."""
        self._on_trip = callback

    def record_call(self, cost_usd: float) -> None:
        """Call after each LLM response with its cost."""
        self._session_cost += cost_usd
        if (
            self._state == BreakerState.CLOSED
            and self._session_cost >= self.trip_threshold_usd
        ):
            self._open()
        elif (
            self._state == BreakerState.HALF_OPEN
            and self._session_cost >= self.probe_budget_usd
        ):
            self._open()  # probe itself spiraled: back to OPEN, cooldown restarts

    def should_allow(self) -> bool:
        """Call before each LLM invocation. False means route to fallback."""
        if self._state == BreakerState.CLOSED:
            return True
        if self._state == BreakerState.OPEN:
            elapsed = time.monotonic() - self._opened_at
            if elapsed >= self.cooldown_seconds:
                self._state = BreakerState.HALF_OPEN
                self._session_cost = 0.0
                return True
            return False
        # HALF_OPEN: allow only while under probe budget
        return self._session_cost < self.probe_budget_usd

    def reset(self) -> None:
        """Call after a successful HALF_OPEN probe to return to CLOSED."""
        self._state = BreakerState.CLOSED
        self._session_cost = 0.0

    @property
    def state(self) -> str:
        return self._state.value

    @property
    def session_cost(self) -> float:
        return self._session_cost

    def _open(self) -> None:
        self._state = BreakerState.OPEN
        self._opened_at = time.monotonic()
        if self._on_trip:
            self._on_trip({
                "session_cost": self._session_cost,
                "threshold": self.trip_threshold_usd,
            })


# Usage: one breaker instance per session, not shared across sessions.
# notify_pagerduty, degraded_response, and llm_client are app-specific stand-ins.
breaker = CostCircuitBreaker(trip_threshold_usd=5.40)
breaker.on_trip(lambda ctx: notify_pagerduty(ctx))


def run_agent_step(prompt: str) -> dict:
    if not breaker.should_allow():
        return degraded_response(breaker.state, breaker.session_cost)
    response = llm_client.chat(prompt)
    breaker.record_call(response.usage.cost_usd)
    return response
```

Degradation Modes: What Happens When the Breaker Trips
The response when the circuit opens determines whether users see an error, a partial result, or a transparent handoff. Define these before you deploy.
Tier 1 — Constrained execution. Reduce the tool call limit to 3, skip non-essential subagents, return partial results with a note. The agent still runs, just with reduced scope.

Tier 2 — Cached fallback. Return the most recent successful response for this workflow type, if one exists in your response cache.

Tier 3 — Explicit handoff. Return a structured response with context about what was completed and route to a human queue. Required for customer-facing agents where silent failure is the worst outcome.
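The three tiers can be wired together in a small dispatcher. This is one possible shape for the `degraded_response` fallback; `cached_response` and `enqueue_for_human` are hypothetical stand-ins for your own cache and queue integrations, and Tier 2 falls through to Tier 3 when the cache is empty:

```python
def cached_response(workflow_type: str):
    """Hypothetical cache lookup; returns None when nothing is cached."""
    return None

def enqueue_for_human(state: dict) -> None:
    """Hypothetical integration with a human review queue."""
    pass

def degraded_response(workflow_type: str, tier: int, state: dict) -> dict:
    """Dispatch to the degradation tier configured for this workflow type."""
    if tier == 1:  # Tier 1: constrained execution with reduced scope
        return {"status": "partial", "max_tool_calls": 3,
                "note": "budget exceeded; scope reduced"}
    if tier == 2:  # Tier 2: cached fallback, if a cache entry exists
        cached = cached_response(workflow_type)
        if cached is not None:
            return {"status": "cached", "body": cached}
        tier = 3   # no cache entry: fall through to explicit handoff
    # Tier 3: structured handoff carrying what was already completed
    enqueue_for_human(state)
    return {"status": "handed_off",
            "completed": state.get("completed_steps", [])}
```

Choosing the tier per workflow type, not globally, keeps customer-facing agents on explicit handoff while internal tools can settle for partial results.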
Wiring the Agent Cost Circuit Breaker to Langfuse and OpenTelemetry
Where the cost numbers come from, and how to propagate them across services for cross-workflow analysis.
The circuit breaker itself is pure control logic — it just needs cost numbers. The question is where those numbers come from in practice.
Langfuse provides per-generation cost data through generation.usage.cost, which it calculates from token counts and model pricing automatically. You can hook into its Python SDK callback to feed cost into your breaker in real time rather than polling after the fact.
OpenTelemetry spans give you the infrastructure layer: propagate cost data across services, correlate it with workflow IDs, user IDs, and feature flags, and feed it into your observability stack for cross-session analysis. When a breaker trips, the span carries enough context to know exactly which workflow, which input type, and which tool call sequence caused the spiral.
TrueFoundry's cost observability layer adds workflow-level cost aggregation with built-in alerting, which can serve as a second line of defense — not a replacement for the in-process breaker, but a useful catch for cost trends that develop over multiple sessions rather than within a single session.
cost_tracing.py

```python
from opentelemetry import trace
from langfuse.decorators import langfuse_context, observe

tracer = trace.get_tracer("agent.cost.breaker")


@observe()  # Langfuse decorator — auto-captures token usage + cost
def execute_with_cost_tracing(
    workflow_id: str,
    session_id: str,
    breaker: CostCircuitBreaker,
    prompt: str,
) -> dict:
    with tracer.start_as_current_span("agent.execute") as span:
        span.set_attribute("workflow.id", workflow_id)
        span.set_attribute("session.id", session_id)
        span.set_attribute("cost.budget.usd", breaker.trip_threshold_usd)
        span.set_attribute("circuit.state", breaker.state)

        if not breaker.should_allow():
            span.set_attribute("circuit.tripped", True)
            span.set_attribute("cost.session.usd", breaker.session_cost)
            return degraded_response(workflow_id, breaker.state)

        response = llm_client.chat(prompt)

        # Langfuse captures generation cost automatically via @observe.
        # We also pull it out to feed the breaker.
        cost_usd = langfuse_context.get_current_observation_cost()
        breaker.record_call(cost_usd)

        span.set_attribute("cost.call.usd", cost_usd)
        span.set_attribute("cost.session.usd", breaker.session_cost)
        span.set_attribute("circuit.state", breaker.state)
        return response
```

Pre-Production Circuit Breaker Checklist
- Run 200+ representative inputs through the agent and log full token traces
- Calculate P50, P90, P95, P99 session costs from profiling data
- Set trip threshold at 3× P95 for each workflow type
- Set conversation turn limit alongside cost threshold (typically 2–3× expected turns)
- Define degradation tier (1/2/3) for each workflow before deploying
- Implement one breaker instance per session — never shared across sessions
- Wire trip callback to PagerDuty or Slack oncall channel
- Add OTel span attributes: workflow.id, cost.budget.usd, cost.session.usd, circuit.state
- Test trip behavior explicitly: run a synthetic loop that exceeds the threshold
- Validate HALF-OPEN probe budget is ≤10% of trip threshold
Does the circuit breaker add meaningful latency to every agent call?
No. The `should_allow()` check is a local in-memory operation — it adds microseconds, not milliseconds. The `record_call()` update happens synchronously after the LLM response returns. At 100 concurrent sessions, the overhead is negligible compared to LLM call latency (typically 500ms–3s).
How do I handle multi-agent workflows where cost accumulates across several agents?
Pass the session's cumulative cost through your execution context so all sub-agents share the same cost accumulator. Each sub-agent checks the same breaker instance before executing. This prevents a scenario where each sub-agent has a clean budget while the parent session is already deep into a spiral.
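A sketch of that shared accumulator, assuming a simple execution-context object; the `ExecutionContext` and `run_subagent` names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ExecutionContext:
    """Shared by the parent agent and every sub-agent in one session."""
    session_id: str
    trip_threshold_usd: float
    session_cost: float = 0.0

    def charge(self, cost_usd: float) -> None:
        self.session_cost += cost_usd

    @property
    def allowed(self) -> bool:
        return self.session_cost < self.trip_threshold_usd

def run_subagent(name: str, ctx: ExecutionContext, step_cost: float) -> bool:
    """Hypothetical sub-agent: checks the SHARED accumulator before running."""
    if not ctx.allowed:
        return False  # the parent session is already over budget
    ctx.charge(step_cost)  # spend counts against the shared envelope
    return True

ctx = ExecutionContext("s1", trip_threshold_usd=1.00)
ran = [run_subagent(f"sub-{i}", ctx, step_cost=0.40) for i in range(4)]
# later sub-agents are refused once the shared spend crosses $1.00
```

Because all four sub-agents charge the same accumulator, the fourth is refused even though it has spent nothing itself.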
What if the HALF-OPEN probe request also spirals?
The probe budget should be tight — 5–10% of the trip threshold. If the probe exceeds its budget, the breaker immediately returns to OPEN and resets the cooldown timer. This prevents HALF-OPEN from becoming a second spike vector. Set `probe_budget_usd` low enough that even a misbehaving probe causes minimal damage.
Should I set a global monthly budget as well, in addition to per-session breakers?
Yes, but treat it as a last-resort backstop, not a primary control. Per-session breakers catch in-flight spirals. A monthly budget cap in your AI gateway (Portkey, Helicone, or a custom middleware) catches slow accumulation patterns that are too distributed to trigger individual session breakers — like 50,000 sessions each spending 20% above their expected cost.
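An in-memory sketch of such a backstop. In production the counter would live in your gateway or middleware with persistent storage; `MonthlyBudgetBackstop` is a hypothetical name:

```python
import threading

class MonthlyBudgetBackstop:
    """Last-resort global accumulator sitting behind per-session breakers."""

    def __init__(self, monthly_cap_usd: float):
        self.monthly_cap_usd = monthly_cap_usd
        self._spent = 0.0
        self._lock = threading.Lock()  # many sessions report concurrently

    def charge(self, cost_usd: float) -> bool:
        """Record spend; False means the global cap is exhausted."""
        with self._lock:
            self._spent += cost_usd
            return self._spent < self.monthly_cap_usd
```

Per-session breakers fire within a session; this counter fires only when aggregate spend across all sessions drifts past the cap.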
Signs you need a circuit breaker now
- You've deployed agents in production but your only cost control is a monthly cloud budget cap
- Your cloud billing anomaly alerts have a 24–48h lag — meaning a Friday spike runs all weekend
- You have multi-agent workflows where sub-agents share a context that grows with each tool call
- You've had at least one billing surprise that required a refund request or an awkward finance conversation
- Your agents run research or summarization tasks with no hard limit on source documents fetched
Signs your current threshold needs recalibration
- The breaker trips more than twice per week on workflows that complete successfully
- False positives are causing degraded responses for legitimate edge-case inputs
- Your P95 session cost has shifted by more than 20% since you last profiled (model pricing changes, new tools added)
- You added new tool types that have significantly different token output sizes than the original profiling set
> The 200-input profiling pass felt like overhead we couldn't afford before launch. We skipped it. Two weeks later we were on the phone with our cloud provider asking for a $9,000 credit. The profiling pass now takes one afternoon and is non-negotiable before any agent ships.
[1] ZenML: What 1,200 Production Deployments Reveal About LLMOps in 2025 (zenml.io)
[2] Portkey: Retries, Fallbacks, and Circuit Breakers in LLM Apps (portkey.ai)
[3] Galileo: AI Agent Cost Optimization and Observability (galileo.ai)
[4] BMD Pat: AI Agent Cost Control with AgentGuard in Python (bmdpat.com)
[5] TrueFoundry: AI Cost Observability (truefoundry.com)
[6] Dev.to: How to Stop AI Agent Cost Spirals Before They Start (dev.to)