Friday, 6 PM. A research agent ships behind caching and rate limits. By Saturday midnight it has fired roughly 8,400 API calls, re-querying the same documents with slightly different prompts and appending every tool response to a growing context. Cost per call is now forty times baseline. The ZenML production survey of 1,200 deployments documents the shape of this failure.[1] No alert fired. No circuit tripped. The system worked exactly as designed. Monday: a $12,000 invoice.
Monitoring did not fail. Monitoring just showed up after the burn. Cloud anomaly detection aggregates on a 24–48 hour lag, which is another way of saying it tells you about Friday's incident on Sunday afternoon. The control that catches the loop in time is not in the billing dashboard. It lives in the orchestration layer — in-process, per-session, measuring cumulative spend against a defined envelope and tripping before the loop compounds.
This is the circuit breaker pattern, ported. Hystrix used it for service availability: when a downstream endpoint starts failing, stop sending traffic instead of hammering a broken system. The adaptation for cost is mechanical. Replace HTTP 500 rate with session spend exceeding a profiled envelope. Everything else maps one to one.
If you already run agents in production, you have either had the billing surprise or you are queued for the first one. What follows is what catches it.
Cloud Cost Controls Were Built for Linear Resources. Agents Are Not Linear.
The mechanism that turns $0.03 calls into $180 sessions, and why every standard alert lands after the spend is already final.
Cloud cost alerts work because cloud resources consume linearly. Ten EC2 instances cost ten times one. Agents break that assumption. Each tool call appends its result to the conversation context, so the tenth call in a session does not cost what the first did: it can cost twenty times as much, because the context feeding that call is twenty times longer.[6] A multi-agent workflow with a typical cost of $0.45 can reach $87 once one sub-agent enters a tool loop and each retry appends a 2,000-token response to the shared context.
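To make the compounding concrete, here is a minimal sketch of a loop in which every tool response is appended to the context, so each call pays again for everything that came before it. The function name, token counts, and per-token price are illustrative assumptions, not figures from the cited sources.

```python
def per_call_input_costs(
    calls: int,
    base_context_tokens: int = 1_000,       # assumed starting context size
    appended_tokens_per_call: int = 2_000,  # assumed size of each tool response
    input_price_per_token: float = 3e-6,    # assumed price, USD per input token
) -> list[float]:
    """Input cost of each call when every tool response is appended to context."""
    costs = []
    context_tokens = base_context_tokens
    for _ in range(calls):
        # The whole accumulated context is re-sent, and billed, on every call.
        costs.append(context_tokens * input_price_per_token)
        context_tokens += appended_tokens_per_call
    return costs


costs = per_call_input_costs(calls=50)
growth = costs[-1] / costs[0]            # how much pricier the last call is than the first
session_total = sum(costs)               # what the loop actually cost
linear_estimate = costs[0] * len(costs)  # what a linear budget model would predict
```

Under these assumed numbers the 50th call costs 99 times the first, and the session totals about $7.50 where a linear model predicts $0.15. The gap between those two totals is what a linear cost alert cannot see.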
The failure modes share one structural property. The termination condition never cleanly fires. The agent keeps going because nothing tells it to stop.
Infinite tool loops. The agent calls the same tool repeatedly. Each response is slightly different. The internal stopping criterion never trips. Tokens stack onto context. Cost per call climbs every iteration.
Hallucinated tool chains. The agent emits a multi-step plan with dozens of API calls. Each step looks cheap on its own. Each step anchors the next. The exit condition is the end of the plan or an error — whichever lands first.
Runaway research. The agent finds one more source. Then another. Diminishing returns is not a concept it has. Without a hard turn limit, it optimizes for completeness over cost.
AWS Cost Anomaly Detection and its peers aggregate on a 24–48 hour delay. By the time the alert lands, the loop has been running for 36 hours. The control that matters runs in-process, in real time, scoped to one session.
Three States, Mapped From Hystrix to Session Cost
CLOSED, OPEN, HALF-OPEN — the failure signal swaps from HTTP 500 rate to session spend exceeding the envelope.
Three states, each with one job.
CLOSED — Normal operation. Traffic flows through. The breaker silently tracks cumulative session cost. Below threshold, it stays out of the way.
OPEN — Threshold crossed. Every further agent call in this session is blocked and routed to a fallback. The session does not retry or queue. It degrades. An alert fires to oncall.
HALF-OPEN — After a configurable cooldown (5 minutes is a reasonable default), the breaker lets one probe request through with a tight sub-budget. Probe completes inside its budget, breaker resets to CLOSED. Probe spirals, breaker returns to OPEN and resets the cooldown.
The mapping is direct. Replace HTTP 500 rate with cost-per-session exceeding a defined envelope. Trip threshold takes the place of failure-rate threshold. Degraded response takes the place of a static fallback page.
One distinction is load-bearing: state must be per-session, not global. One runaway must not block 199 healthy concurrent sessions. Each workflow session owns its breaker instance and its cost accumulator. If your orchestration layer runs 200 concurrent sessions, you run 200 independent breakers — each tracking exactly one session's spend.
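One way to keep breaker state per-session is a small registry keyed by session ID. The sketch below is an assumption about how an orchestrator might hold that state; `SessionBreaker` and `BreakerRegistry` are hypothetical names, and `SessionBreaker` is a deliberately reduced stand-in that only tracks cost and a tripped flag.

```python
from dataclasses import dataclass


@dataclass
class SessionBreaker:
    """Hypothetical stand-in holding only what this sketch needs."""
    trip_threshold_usd: float
    session_cost: float = 0.0
    tripped: bool = False

    def record_call(self, cost_usd: float) -> None:
        self.session_cost += cost_usd
        if self.session_cost >= self.trip_threshold_usd:
            self.tripped = True


class BreakerRegistry:
    """Hands out exactly one breaker per session ID."""

    def __init__(self, trip_threshold_usd: float) -> None:
        self._threshold = trip_threshold_usd
        self._breakers: dict[str, SessionBreaker] = {}

    def for_session(self, session_id: str) -> SessionBreaker:
        # Create the breaker lazily on first use; later calls reuse it.
        if session_id not in self._breakers:
            self._breakers[session_id] = SessionBreaker(self._threshold)
        return self._breakers[session_id]

    def close_session(self, session_id: str) -> None:
        # Drop state when the session ends so the registry does not grow forever.
        self._breakers.pop(session_id, None)


registry = BreakerRegistry(trip_threshold_usd=2.40)
registry.for_session("runaway").record_call(3.00)   # this session trips
registry.for_session("healthy").record_call(0.05)   # this one is unaffected
```

The runaway session's breaker trips while the healthy session keeps its own untouched accumulator, which is the isolation property described above.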
Without a session breaker:
- Budget alert lands 24–48h after the spike starts
- By alert time, the session is already $12,000+ deep
- Engineer opens Monday to a billing surprise, not an incident
- No record of which workflow or input fed the loop
- Refund request to AWS or Anthropic, sometimes honored
With a session breaker:
- Breaker trips at 3× P95 ($2.40 for a $0.80 P95 workflow)
- Session degrades cleanly; user gets a partial result
- Oncall gets a PagerDuty page in seconds
- OTel span carries workflow ID, session cost, trip reason
- Total incident spend: $2.43 instead of $12,000+
You Cannot Calibrate a Breaker You Have Not Profiled
The hard work is upstream of the breaker — measuring what 'normal' looks like before you set the trip threshold.
A breaker without a cost distribution behind it is theater. The wrong threshold is as costly as no threshold: too tight, you trip on legitimate traffic and the team learns to ignore the alert; too loose, the breaker never fires until the damage is already done.
We got this wrong on the first deployment. We ran 50 test inputs, saw a P95 of $0.80, set the threshold at $2.40 (3×), shipped. What we had not profiled was end-of-month reporting queries — they naturally pull more context and run 2–3× longer than typical interactions. The breaker tripped 17 times in week one, all legitimate. The team started treating the alerts as noise. The next month, a real runaway happened, the alert fired, and oncall dismissed it alongside 14 legitimate ones. False-positive fatigue is the same outcome as no threshold, with extra steps.
The profiling process is three steps.
Step 1 — Profile representative inputs. Run the agent against 200–500 production-like inputs before deploying. Measure token counts at each step: input, output, tool call, and tokens appended to context from each tool response. Log the full per-session cost trace.
Step 2 — Compute P50, P90, P95, and P99 session costs. Session cost = Σ_step (tokens_in_step × input_price + tokens_out_step × output_price), summed over every step in the session. Plot the distribution. P50 is the happy path. P99 is the outlier shoulder. Anything above P99 is a runaway candidate.
Step 3 — Set the trip threshold at 3× P95 as the starting point. Not 10× (too loose to catch spirals early), not 1.5× (you trip on legitimate variance). 3× P95 buys enough headroom for real-world variance while still catching genuine runaways before they compound — calibrate against your workload from there[2]. Practitioners report Cox Automotive uses P95 cost as the trip point for customer service agents: when a conversation crosses that line, the agent hands off to a human instead of continuing.
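The three steps reduce to a few lines once per-session costs are logged. The sketch below uses the standard library's `statistics.quantiles`; the function name `profile_thresholds` and the synthetic cost list are assumptions for illustration, and real profiling data would replace the list.

```python
import statistics


def profile_thresholds(
    session_costs_usd: list[float], multiplier: float = 3.0
) -> dict[str, float]:
    """Summarize a cost profile and derive a starting trip threshold."""
    # quantiles(n=100) returns 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(session_costs_usd, n=100)
    p95 = cuts[94]
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p95": p95,
        "p99": cuts[98],
        "trip_threshold_usd": multiplier * p95,
    }


# Synthetic stand-in for 200 profiled sessions costing $0.01 to $2.00.
synthetic_costs = [round(0.01 * i, 2) for i in range(1, 201)]
profile = profile_thresholds(synthetic_costs)
```

Keeping the multiplier as a parameter makes it easy to start at 3× and tighten or loosen it per workflow as false positives and misses show up.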
Add a turn-count threshold alongside the cost threshold. Long turn counts usually precede cost spikes — a workflow that normally takes 8 turns and has hit 35 is almost certainly in a loop, even if cumulative cost has not crossed the dollar line yet.
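A turn guard can sit next to the cost breaker with almost no code. This is a minimal sketch; `TurnGuard` and its 3× default multiplier are assumptions chosen to mirror the cost threshold, not values from the sources.

```python
from dataclasses import dataclass


@dataclass
class TurnGuard:
    """Counts turns in one session and flags when the turn budget is exceeded."""
    expected_turns: int
    multiplier: float = 3.0  # assumed default; tune per workflow
    turns: int = 0

    @property
    def max_turns(self) -> int:
        return int(self.expected_turns * self.multiplier)

    def record_turn(self) -> bool:
        """Count one turn; return True while the session is inside its budget."""
        self.turns += 1
        return self.turns <= self.max_turns


guard = TurnGuard(expected_turns=8)
# Simulate the 35-turn session from the example above.
within_budget = [guard.record_turn() for _ in range(35)]
first_blocked_turn = within_budget.index(False) + 1
```

For a workflow that normally takes 8 turns, this guard stops allowing turns at turn 25, well before the 35-turn session in the example finishes looping.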
| Workflow | P50 Cost | P95 Cost | Trip Threshold (3× P95) | Degradation Tier |
|---|---|---|---|---|
| Simple Q&A agent | $0.03 | $0.09 | $0.27 | Return cached answer |
| Research agent | $0.45 | $1.80 | $5.40 | Cap tool calls at 3 |
| Code review agent | $0.22 | $0.85 | $2.55 | Skip style checks |
| Data extraction agent | $0.60 | $2.40 | $7.20 | Return partial results |
| Multi-agent pipeline | $1.20 | $4.80 | $14.40 | Skip non-critical subagents |
The Breaker Belongs in the Orchestrator, Not on the Model Call
Per-call instrumentation cannot see the running session total. The orchestration layer can. That is where the breaker has to live.
Place the breaker at the orchestration layer, not on individual model calls. Wrapping chat.completions.create() gives you per-call visibility and exactly nothing about the emergent session cost. The breaker has to see the running total or it cannot do its job.
The minimal state per session is three fields: cumulative cost, current state (CLOSED/OPEN/HALF-OPEN), and the timestamp the breaker last opened (for cooldown tracking). Everything else is a callback.
cost_breaker.py

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class BreakerState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class CostCircuitBreaker:
    """One instance per session. Never shared."""

    trip_threshold_usd: float
    cooldown_seconds: int = 300
    probe_budget_usd: float = 0.0  # 0.0 means "use 10% of trip threshold"
    _state: BreakerState = field(default=BreakerState.CLOSED, init=False)
    _session_cost: float = field(default=0.0, init=False)
    _opened_at: float = field(default=0.0, init=False)
    _on_trip: Optional[Callable[[dict], None]] = field(default=None, init=False)

    def __post_init__(self) -> None:
        if self.probe_budget_usd == 0.0:
            self.probe_budget_usd = self.trip_threshold_usd * 0.10

    def on_trip(self, callback: Callable[[dict], None]) -> None:
        """Fires when the breaker opens. Wire this to PagerDuty."""
        self._on_trip = callback

    def record_call(self, cost_usd: float) -> None:
        """Call after every LLM response. Trips on threshold breach."""
        self._session_cost += cost_usd
        if self._state == BreakerState.CLOSED:
            limit = self.trip_threshold_usd
        elif self._state == BreakerState.HALF_OPEN:
            # A probe that overspends its sub-budget re-opens the breaker
            # and restarts the cooldown.
            limit = self.probe_budget_usd
        else:
            return
        if self._session_cost >= limit:
            self._open()

    def should_allow(self) -> bool:
        """Gate every LLM invocation. False routes to fallback."""
        if self._state == BreakerState.CLOSED:
            return True
        if self._state == BreakerState.OPEN:
            elapsed = time.monotonic() - self._opened_at
            if elapsed >= self.cooldown_seconds:
                # Cooldown elapsed: meter the probe from zero.
                self._state = BreakerState.HALF_OPEN
                self._session_cost = 0.0
                return True
            return False
        # HALF_OPEN: probe runs only under sub-budget
        return self._session_cost < self.probe_budget_usd

    def reset(self) -> None:
        """Probe completed clean. Return to CLOSED."""
        self._state = BreakerState.CLOSED
        self._session_cost = 0.0

    @property
    def state(self) -> str:
        return self._state.value

    @property
    def session_cost(self) -> float:
        return self._session_cost

    def _open(self) -> None:
        self._state = BreakerState.OPEN
        self._opened_at = time.monotonic()
        if self._on_trip:
            self._on_trip({
                "session_cost": self._session_cost,
                "threshold": self.trip_threshold_usd,
            })


# One breaker per session. Never shared across sessions.
# notify_pagerduty, degraded_response, and llm_client are assumed to be
# defined elsewhere in your application.
breaker = CostCircuitBreaker(trip_threshold_usd=5.40)
breaker.on_trip(lambda ctx: notify_pagerduty(ctx))


def run_agent_step(prompt: str) -> dict:
    if not breaker.should_allow():
        return degraded_response(breaker.state, breaker.session_cost)
    response = llm_client.chat(prompt)
    breaker.record_call(response.usage.cost_usd)
    return response
```

What Happens After the Trip Decides Whether the Breaker Is Useful
Open the circuit and you have two seconds to decide what the user sees. Define the degradation tier before deploy, not after the page fires.
- [01] Tier 1 — Constrained execution. Cut the tool call ceiling to 3, drop non-essential subagents, return partial results with a note. The agent still runs, with reduced scope.
- [02] Tier 2 — Cached fallback. Return the most recent clean response for this workflow type from the response cache, if one exists.
- [03] Tier 3 — Explicit handoff. Return a structured response describing what completed and route the conversation to a human queue. Mandatory for customer-facing agents where silent failure is the worst outcome.
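The three tiers can be expressed as one dispatch function that the orchestrator calls when the breaker refuses a call. This is a sketch under assumptions: `Tier`, `build_degraded_response`, and the response field names are hypothetical, and falling through to a handoff when Tier 2 finds no cached response is a design choice made here, not something the tiers above prescribe.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    CONSTRAINED = 1  # Tier 1: keep running with reduced scope
    CACHED = 2       # Tier 2: serve the last clean response
    HANDOFF = 3      # Tier 3: route to a human queue


def build_degraded_response(
    tier: Tier,
    completed_steps: list[str],
    cached_result: Optional[str] = None,
) -> dict:
    """Shape the user-facing response once the breaker is open."""
    if tier is Tier.CONSTRAINED:
        return {
            "status": "partial",
            "max_tool_calls": 3,  # the reduced tool call ceiling
            "completed": completed_steps,
            "note": "Partial result: the session reached its cost limit.",
        }
    if tier is Tier.CACHED and cached_result is not None:
        return {"status": "cached", "result": cached_result}
    # Tier 3, or Tier 2 with an empty cache (assumed fallback).
    return {
        "status": "handoff",
        "completed": completed_steps,
        "route_to": "human_queue",
    }


partial = build_degraded_response(Tier.CONSTRAINED, ["search", "summarize"])
cache_miss = build_degraded_response(Tier.CACHED, ["search"], cached_result=None)
```

Naming the tier per workflow in configuration, rather than deciding inside the trip callback, keeps the decision out of the incident path.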
The Breaker Is Pure Logic. The Cost Numbers Have to Come From Somewhere.
Wire it to the cost source — Langfuse for per-generation pricing, OpenTelemetry for cross-service propagation, monthly caps as a backstop.
The breaker is pure control logic. It just needs a cost number per call. The interesting question is where that number lives in your stack.
Langfuse exposes per-generation cost through generation.usage.cost, computed from token counts and model pricing automatically. Hook the Python SDK callback to feed cost into the breaker in real time, instead of polling after the fact.
OpenTelemetry spans give you the infrastructure layer. Propagate cost across services, correlate with workflow ID, user ID, and feature flag, and pipe the result into the observability stack for cross-session analysis. When a breaker trips, the span carries enough context to name the workflow, the input type, and the tool call sequence that fed the spiral.
TrueFoundry's cost observability layer adds workflow-level aggregation with built-in alerting. Treat it as a second line — not a replacement for the in-process breaker, but a useful catch for cost trends that develop across sessions rather than inside one.
cost_tracing.py

```python
from opentelemetry import trace
from langfuse.decorators import langfuse_context, observe

from cost_breaker import CostCircuitBreaker

# llm_client and degraded_response are assumed to be defined elsewhere
# in your application.
tracer = trace.get_tracer("agent.cost.breaker")


@observe()  # Langfuse captures token usage + cost on every call
def execute_with_cost_tracing(
    workflow_id: str,
    session_id: str,
    breaker: CostCircuitBreaker,
    prompt: str,
) -> dict:
    with tracer.start_as_current_span("agent.execute") as span:
        span.set_attribute("workflow.id", workflow_id)
        span.set_attribute("session.id", session_id)
        span.set_attribute("cost.budget.usd", breaker.trip_threshold_usd)
        span.set_attribute("circuit.state", breaker.state)

        if not breaker.should_allow():
            span.set_attribute("circuit.tripped", True)
            span.set_attribute("cost.session.usd", breaker.session_cost)
            return degraded_response(workflow_id, breaker.state)

        response = llm_client.chat(prompt)

        # Pull cost from Langfuse and feed the breaker.
        # NOTE: confirm that get_current_observation_cost() exists in your
        # Langfuse SDK version. If it does not, compute the cost from the
        # response's token usage and your model price table instead.
        cost_usd = langfuse_context.get_current_observation_cost()
        breaker.record_call(cost_usd)

        span.set_attribute("cost.call.usd", cost_usd)
        span.set_attribute("cost.session.usd", breaker.session_cost)
        span.set_attribute("circuit.state", breaker.state)
        return response
```

Pre-Production Breaker Checklist
- 200+ representative inputs profiled through the agent with full token traces logged
- P50, P90, P95, P99 session costs computed from profiling data
- Trip threshold set to 3× P95 for each workflow type
- Turn-count threshold set alongside cost threshold (typically 2–3× expected turns)
- Degradation tier (1/2/3) named for each workflow before deploy
- One breaker instance per session, never shared across sessions
- Trip callback wired to PagerDuty or the Slack oncall channel
- OTel span attributes present: workflow.id, cost.budget.usd, cost.session.usd, circuit.state
- Trip behavior tested explicitly with a synthetic loop that exceeds threshold
- HALF-OPEN probe budget validated at ≤10% of trip threshold
Does the breaker add meaningful latency to every agent call?
No. should_allow() is a local in-memory check — microseconds, not milliseconds. record_call() runs synchronously after the LLM response returns. At 100 concurrent sessions, overhead is negligible against LLM call latency (typically 500ms–3s). If you measure a hot path bottleneck, the bottleneck is the model, not the breaker.
How do I handle multi-agent workflows where cost accumulates across sub-agents?
Pass the session's cumulative cost through the execution context so every sub-agent shares the same accumulator. Each sub-agent checks the same breaker instance before executing. This prevents the symmetric failure where each sub-agent sees a clean budget while the parent session is already deep into a spiral.
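A minimal sketch of the shared accumulator, assuming sub-agents run in the same process and receive the budget object explicitly. `SharedBudget` and `run_sub_agent` are hypothetical names, and the step costs are made up to show the effect.

```python
from dataclasses import dataclass, field


@dataclass
class SharedBudget:
    """One accumulator per parent session, passed to every sub-agent."""
    trip_threshold_usd: float
    spent_usd: float = 0.0
    by_agent: dict = field(default_factory=dict)

    def allow(self) -> bool:
        return self.spent_usd < self.trip_threshold_usd

    def record(self, agent: str, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        self.by_agent[agent] = self.by_agent.get(agent, 0.0) + cost_usd


def run_sub_agent(name: str, step_costs: list[float], budget: SharedBudget) -> int:
    """Run steps until done or until the shared budget is exhausted."""
    completed = 0
    for cost in step_costs:
        if not budget.allow():
            break  # the parent session is over budget, whoever spent it
        budget.record(name, cost)  # cost is only known after the call returns
        completed += 1
    return completed


budget = SharedBudget(trip_threshold_usd=5.40)
retriever_steps = run_sub_agent("retriever", [2.0, 2.0], budget)
summarizer_steps = run_sub_agent("summarizer", [1.0, 1.0, 1.0], budget)
```

The summarizer is cut off after two of its three steps even though it spent only $2.00 itself, because the retriever had already consumed $4.00 of the shared envelope. The session can overshoot the threshold by at most one call, since a call's cost is recorded after it completes.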
What if the HALF-OPEN probe also spirals?
Keep the probe budget tight — 5–10% of the trip threshold. If the probe exceeds its sub-budget, the breaker returns to OPEN and resets the cooldown. HALF-OPEN does not get to become a second spike vector. Set probe_budget_usd low enough that a misbehaving probe causes minimal damage.
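The probe path is easiest to verify with an injectable clock, so the cooldown can be advanced without sleeping. The sketch below is a reduced, hypothetical breaker (`ProbeAwareBreaker`) written for this test; the `clock` parameter is an assumption added for determinism.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class ProbeAwareBreaker:
    def __init__(self, trip_threshold_usd, cooldown_seconds, probe_budget_usd, clock):
        self.trip_threshold_usd = trip_threshold_usd
        self.cooldown_seconds = cooldown_seconds
        self.probe_budget_usd = probe_budget_usd
        self._clock = clock  # callable returning the current time in seconds
        self.state = State.CLOSED
        self.cost = 0.0
        self.opened_at = 0.0

    def _open(self):
        self.state = State.OPEN
        self.opened_at = self._clock()  # (re)start the cooldown

    def should_allow(self):
        cooled = self._clock() - self.opened_at >= self.cooldown_seconds
        if self.state is State.OPEN and cooled:
            self.state = State.HALF_OPEN
            self.cost = 0.0  # the probe is metered from zero
        return self.state is not State.OPEN

    def record_call(self, cost_usd):
        self.cost += cost_usd
        if self.state is State.HALF_OPEN:
            limit = self.probe_budget_usd
        else:
            limit = self.trip_threshold_usd
        if self.state is not State.OPEN and self.cost >= limit:
            self._open()


now = [0.0]  # fake clock, advanced by hand
breaker = ProbeAwareBreaker(2.40, 300, 0.24, clock=lambda: now[0])
breaker.record_call(2.50)                    # runaway crosses the threshold
blocked_while_open = breaker.should_allow()  # still inside the cooldown
now[0] = 300.0                               # cooldown elapses
probe_allowed = breaker.should_allow()       # breaker moves to HALF_OPEN
breaker.record_call(0.30)                    # probe overspends its $0.24 budget
state_after_probe = breaker.state
```

After the overspending probe, the breaker is OPEN again and its cooldown restarts from the moment the probe failed, so the probe cannot become a second spike.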
Do I still need a global monthly budget cap on top of per-session breakers?
Yes, but only as a backstop. Per-session breakers catch in-flight spirals. A monthly cap in the AI gateway (Portkey, Helicone, or custom middleware) catches slow accumulation patterns too distributed to trip an individual session breaker — 50,000 sessions each running 20% over expected cost is the canonical case. One counterintuitive finding: teams with very tight monthly caps sometimes generate worse incentives than teams with none. Engineers start designing agents to land just under the cap instead of optimizing for cost-per-outcome, producing slightly-too-expensive-but-technically-compliant behavior across the fleet. Use the monthly cap as a hard stop. Use per-workflow P95 trends as the primary efficiency signal.
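The slow-accumulation case is easy to reproduce. In the sketch below, `MonthlyCap` is a hypothetical in-memory stand-in for whatever the gateway enforces, and the $0.80 expected session cost, $2.40 session threshold, and $45,000 cap are assumed numbers.

```python
from collections import defaultdict


class MonthlyCap:
    """Fleet-wide backstop keyed by calendar month, e.g. "2025-06"."""

    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self._spent = defaultdict(float)

    def record(self, month: str, cost_usd: float) -> None:
        self._spent[month] += cost_usd

    def spent(self, month: str) -> float:
        return self._spent[month]

    def allow(self, month: str) -> bool:
        return self._spent[month] < self.cap_usd


SESSION_TRIP_USD = 2.40      # assumed per-session breaker threshold
EXPECTED_SESSION_USD = 0.80  # assumed typical session cost

cap = MonthlyCap(cap_usd=45_000.0)
tripped_sessions = 0
for _ in range(50_000):
    session_cost = EXPECTED_SESSION_USD * 1.20  # every session runs 20% over
    if session_cost >= SESSION_TRIP_USD:
        tripped_sessions += 1                   # never happens at $0.96
    cap.record("2025-06", session_cost)
```

No session comes near its own threshold, yet the fleet spends about $48,000 against an expected $40,000, and only the monthly cap notices.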
Signals you need a breaker now, not next quarter
- Agents are in production and your only cost control is a monthly cloud budget cap
- Cloud anomaly alerts run on a 24–48h lag, so a Friday spike runs all weekend
- Multi-agent workflows share a context that grows with every tool call
- You have already had one billing surprise that needed a refund request or an awkward finance conversation
- Research and summarization agents run without a hard limit on source documents fetched
Signals the threshold has drifted and needs recalibration
- ✓ The breaker trips more than twice per week on workflows that complete cleanly
- ✓ False positives are forcing degraded responses on legitimate edge-case inputs
- ✓ P95 session cost has moved more than 20% since you last profiled (model pricing change, new tools added)
- ✓ New tool types have meaningfully different token output sizes than the original profiling set
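The two numeric signals in this list can be checked automatically after each profiling run. This is a minimal sketch; `needs_recalibration` is a hypothetical helper, with the 20% drift limit and the twice-per-week trip limit taken from the list above.

```python
def needs_recalibration(
    baseline_p95_usd: float,
    current_p95_usd: float,
    false_trips_per_week: int,
    drift_limit: float = 0.20,
    false_trip_limit: int = 2,
) -> bool:
    """True when P95 has drifted or clean workflows keep tripping the breaker."""
    drift = abs(current_p95_usd - baseline_p95_usd) / baseline_p95_usd
    return drift > drift_limit or false_trips_per_week > false_trip_limit


drifted = needs_recalibration(0.80, 1.00, false_trips_per_week=0)  # P95 up 25%
stable = needs_recalibration(0.80, 0.88, false_trips_per_week=1)   # P95 up 10%
noisy = needs_recalibration(0.80, 0.80, false_trips_per_week=3)    # too many false trips
```

Drift is measured in both directions on purpose: a P95 that falls by more than 20% leaves the old threshold too loose to catch spirals early.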
- [1] ZenML: What 1,200 Production Deployments Reveal About LLMOps in 2025 (zenml.io)
- [2] Portkey: Retries, Fallbacks, and Circuit Breakers in LLM Apps (portkey.ai)
- [3] Galileo: AI Agent Cost Optimization and Observability (galileo.ai)
- [4] BMD Pat: AI Agent Cost Control with AgentGuard in Python (bmdpat.com)
- [5] TrueFoundry: AI Cost Observability (truefoundry.com)
- [6] Dev.to: How to Stop AI Agent Cost Spirals Before They Start (dev.to)