Billing anomaly alerts run on a 24–48 hour lag. The retry loop is already an invoice by the time anyone sees it. The control that catches it is per-session, in-process, and lives in the orchestration layer — profiled envelope, 3x P95 trip, defined degradation.
Friday, 6 PM. A research agent ships behind caching and rate limits. By Saturday midnight it has fired roughly 8,400 API calls, re-querying the same documents with slightly different prompts and appending every tool response to a growing context. Cost per call is now forty times baseline. The ZenML production survey of 1,200 deployments documents the shape of this failure.[1] No alert fired. No circuit tripped. The system worked exactly as designed. Monday: a $12,000 invoice.
Monitoring did not fail. Monitoring just showed up after the burn. Cloud anomaly detection aggregates on a 24–48 hour lag, which means it tells you about Friday's incident on Sunday afternoon. Meanwhile, Datadog's 2026 State of AI Engineering report found that 5% of all LLM call spans report errors, and 60% of those errors are rate-limit exceeded: the provider's backstop, not yours.[7] Your engineering team should not rely on a vendor's rate limiter as a cost control. The control that works runs in the orchestration layer: in-process, per-session, measuring cumulative spend against a defined envelope and tripping before the loop compounds.
This is the circuit breaker pattern, ported. Hystrix used it for service availability: when a downstream endpoint starts failing, stop sending traffic instead of hammering a broken system. The adaptation for cost is mechanical. Replace HTTP 500 rate with session spend exceeding a profiled envelope. Everything else maps one to one.
If you already run agents in production, you have either had the billing surprise or you are queued for the first one. What follows is what catches it.
Why cloud cost alerts structurally cannot catch agent cost spirals in time
The three-state circuit breaker model mapped to session cost (not HTTP error rate)
How to profile a cost envelope — P50/P95/P99 method, 200-input minimum
Python and TypeScript implementations you can drop into your orchestration layer
Gateway-level vs in-process enforcement: when each applies and when you need both
Degradation tiers — what the user sees after the trip — defined before deploy
Wiring to Langfuse, OpenTelemetry, and LangGraph's event stream
Pre-production checklist, decision table, and FAQ for common objections
The mechanism that turns $0.03 calls into $180 sessions, and why every standard alert lands after the spend is already final.
Cloud cost alerts work because cloud resources consume linearly. Ten EC2 instances cost ten times one. Agents break that assumption. Each tool call appends its result to the conversation context, and every later call re-sends that context as input. The tenth call in a session therefore does not cost the same as the first: it can cost twenty times as much, because the context feeding it is twenty times longer.[6] A multi-agent workflow with a typical cost of $0.45 can reach $87 once one sub-agent enters a tool loop and each retry appends a 2,000-token response to the shared context.
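To make the compounding concrete, here is a rough sketch of a tool loop in which every retry appends a 2,000-token response and every call re-sends the full context. The per-token price and the starting context size are illustrative assumptions, not figures from any provider or from the cited reports.

```python
# Illustrative assumptions: the price and base context are made up for this sketch.
INPUT_PRICE_PER_TOKEN = 3.00 / 1_000_000   # hypothetical $3 per million input tokens
BASE_CONTEXT_TOKENS = 4_000                # hypothetical system prompt plus user query
TOOL_RESPONSE_TOKENS = 2_000               # appended on every retry, as in the text


def loop_input_cost(calls: int) -> list[float]:
    """Return the input cost of each call when every call re-sends the full context."""
    costs = []
    context = BASE_CONTEXT_TOKENS
    for _ in range(calls):
        costs.append(context * INPUT_PRICE_PER_TOKEN)
        context += TOOL_RESPONSE_TOKENS  # the tool response joins the context
    return costs


costs = loop_input_cost(100)
print(f"first call: ${costs[0]:.4f}, last call: ${costs[-1]:.4f}")
print(f"cumulative input cost after 100 calls: ${sum(costs):.2f}")
```

Per-call cost rises linearly with the loop count, so the cumulative bill rises roughly with its square. That is the property a linear-consumption alert model misses.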
Datadog's 2026 engineering survey makes the scale concrete: token usage per request more than doubled for median customers year-over-year and quadrupled for 90th-percentile power users.[7] And that's excluding the compounding effect inside runaway sessions — that figure measures normal operation. Runaway sessions sit above the P99 shoulder entirely.
The failure modes share one structural property: the termination condition never cleanly fires. The agent keeps going because nothing tells it to stop.
Infinite tool loops. The agent calls the same tool repeatedly. Each response is slightly different. The internal stopping criterion never trips. Tokens stack onto context. Cost per call climbs every iteration.
Hallucinated tool chains. The agent emits a multi-step plan with dozens of API calls. Each step looks cheap on its own. Each step anchors the next. The exit condition is the end of the plan or an error — whichever lands first.
Runaway research. The agent finds one more source. Then another. Diminishing returns is not a concept it has. Without a hard turn limit, it optimizes for completeness over cost.
System prompt bloat. Datadog found that 69% of all input tokens in customer traces were system prompts — internal instructions, policy definitions, and tool guidance executing down the chain from the initial user query.[7] That baseline is not the problem, but it compounds the spiral: a large system prompt means every loop iteration starts from a higher token floor before the first tool response appends anything.
AWS Cost Anomaly Detection and its peers aggregate on a 24–48 hour delay. By the time the alert lands, the loop has been running for 36 hours. The control that matters runs in-process, in real time, scoped to one session.
CLOSED, OPEN, HALF-OPEN — the failure signal swaps from HTTP 500 rate to session spend exceeding the envelope.
Three states, each with one job.
CLOSED — Normal operation. Traffic flows through. The breaker silently tracks cumulative session cost. Below threshold, it stays out of the way.
OPEN — Threshold crossed. Every further agent call in this session is blocked and routed to a fallback. The session does not retry or queue. It degrades. An alert fires to oncall.
HALF-OPEN — After a configurable cooldown (5 minutes is a reasonable default), the breaker lets one probe request through with a tight sub-budget. Probe completes inside its budget, breaker resets to CLOSED. Probe spirals, breaker returns to OPEN and resets the cooldown.
The mapping is direct. Replace HTTP 500 rate with cost-per-session exceeding a defined envelope. Trip threshold takes the place of failure-rate threshold. Degraded response takes the place of a static fallback page.
One distinction is load-bearing: state must be per-session, not global. One runaway must not block 199 healthy concurrent sessions. Each workflow session owns its breaker instance and its cost accumulator. If your orchestration layer runs 200 concurrent sessions, you run 200 independent breakers — each tracking exactly one session's spend.
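A minimal sketch of that isolation, assuming sessions are identified by a string ID. The `record_and_check` helper and the module-level registry are hypothetical names used only for illustration; a fuller breaker would hold state and cooldown as well.

```python
from collections import defaultdict

TRIP_THRESHOLD_USD = 2.40  # 3x a $0.80 P95, matching the running example

# One accumulator per session ID; nothing is shared across sessions.
session_spend: defaultdict[str, float] = defaultdict(float)


def record_and_check(session_id: str, call_cost_usd: float) -> bool:
    """Add a call's cost to its own session; return True while under the threshold."""
    session_spend[session_id] += call_cost_usd
    return session_spend[session_id] < TRIP_THRESHOLD_USD


# One runaway session next to a healthy one.
for _ in range(10):
    record_and_check("runaway", 0.50)
healthy_ok = record_and_check("healthy-1", 0.05)   # unaffected by the runaway
runaway_ok = record_and_check("runaway", 0.50)     # blocked on its own spend
```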
Without a session breaker:
Budget alert lands 24–48h after the spike starts
By alert time, the session is already $12,000+ deep
Engineer opens Monday to a billing surprise, not an incident
No record of which workflow or input fed the loop
Refund request to AWS or Anthropic (sometimes honored)
With a session breaker:
Breaker trips at 3× P95 ($2.40 for a $0.80 P95 workflow)
Session degrades cleanly; user gets a partial result
Oncall gets a PagerDuty page in seconds
OTel span carries workflow ID, session cost, trip reason
Total incident spend: $2.43 instead of $12,000+
MLflow AI Gateway and Portkey enforce spend at the API routing level. That's useful — and orthogonal to what the in-process breaker does.
Several AI gateways now ship with budget enforcement built in. MLflow AI Gateway lets you define a budget policy with a USD threshold and reset period, then either alert or reject requests once the spend crosses the line.[8] Portkey and Helicone offer similar controls at the API routing layer. These are genuinely useful. They're also not what this article is about.
Gateway-level budgets enforce spend across all sessions for a given workspace or API key. In-process breakers enforce spend within one session. The failure modes they catch are different.
What gateway budgets catch: slow accumulation across thousands of sessions each running slightly over expected cost. If 50,000 sessions each cost 20% more than profiled, no individual session trips an in-process breaker — but the fleet blows the monthly cap. A gateway budget with a REJECT action stops new requests once the aggregate threshold is hit. That's the right tool for fleet-level drift.
What in-process breakers catch: a single session spiraling to 50× expected cost within hours. A gateway budget set at $10,000/month will not stop one session from burning through most of that budget over a weekend, because the aggregate threshold trips only after the money is spent. Per-session enforcement stops the session while the spend is still small.
Run both. They protect against distinct failure classes.
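The two failure classes can be separated with a few lines of arithmetic. In this sketch only the 50,000 sessions, the 20% drift, the 50× spiral, and the $10,000 cap come from the text; the profiled cost per session and the per-session threshold are assumed values.

```python
# Assumed values for illustration only.
PROFILED_COST_USD = 0.20        # hypothetical expected cost per session
SESSION_TRIP_USD = 2.40         # hypothetical per-session trip threshold
MONTHLY_CAP_USD = 10_000.00     # gateway budget from the example above


def session_trips(cost_usd: float) -> bool:
    """In-process check: does this one session cross its own threshold?"""
    return cost_usd >= SESSION_TRIP_USD


def fleet_over_cap(session_costs: list[float]) -> bool:
    """Gateway check: does aggregate spend cross the monthly cap?"""
    return sum(session_costs) >= MONTHLY_CAP_USD


# Fleet drift: 50,000 sessions, each 20% over profile.
drift = [PROFILED_COST_USD * 1.2] * 50_000
# Single spiral: one session at 50x expected cost among 999 normal ones.
spiral = [PROFILED_COST_USD] * 999 + [PROFILED_COST_USD * 50]

print("drift  -> gateway:", fleet_over_cap(drift),
      "session:", any(session_trips(c) for c in drift))
print("spiral -> gateway:", fleet_over_cap(spiral),
      "session:", any(session_trips(c) for c in spiral))
```

Under these assumptions each control fires on exactly one of the two scenarios, which is the argument for running both.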
| Dimension | AI Gateway Budget (MLflow, Portkey) | In-Process Session Breaker |
|---|---|---|
| Scope | Workspace / API key — all sessions aggregate | One session — each workflow gets its own instance |
| Failure class caught | Fleet-level drift: many sessions each over-budget by a small amount | Single-session spiral: one session at 20–50× expected cost |
| Response time | Rejects new requests once monthly cap is hit | Trips mid-session, before the loop makes its next call |
| Granularity | Coarse: same limit for all workflow types | Fine: separate trip threshold per workflow type |
| Degradation control | HTTP 429 — no partial result, no user-visible fallback | Configurable: partial result, cached fallback, or handoff |
| Observability | Aggregate spend dashboard | Per-session OTel span with workflow ID, trip reason, partial cost |
| When to use | Always — as a fleet-level backstop | Any agent workflow with non-deterministic turn count or tool calls |
The hard work is upstream of the breaker — measuring what 'normal' looks like before you set the trip threshold.
A breaker without a cost distribution behind it is theater. The wrong threshold is as costly as no threshold: too tight, you trip on legitimate traffic and the team learns to ignore the alert; too loose, the breaker never fires until the damage is already done.
We got this wrong on the first deployment. We ran 50 test inputs, saw a P95 of $0.80, set the threshold at $2.40 (3×), shipped. What we had not profiled was end-of-month reporting queries — they naturally pull more context and run 2–3× longer than typical interactions. The breaker tripped 17 times in week one, all legitimate. The team started treating the alerts as noise. The next month, a real runaway happened, the alert fired, and oncall dismissed it alongside 14 legitimate ones. False-positive fatigue is the same outcome as no threshold, with extra steps.
The profiling process is three steps.
Step 1. Profile representative inputs. Run the agent against 200–500 production-like inputs before deploying. Measure token counts at each step: input, output, tool call, and tokens appended to context from each tool response. Log the full per-session cost trace.
Step 2. Compute P50, P90, P95, P99 session costs. For each session, sum the cost of every step: $\text{session cost} = \sum_{s} \left(\text{tokens\_in}_s \times \text{input\_price} + \text{tokens\_out}_s \times \text{output\_price}\right)$, where $s$ ranges over the steps in the session. Plot the distribution. P50 is the happy path. P99 is the outlier shoulder. Anything above P99 is a runaway candidate.
Step 3. Set the trip threshold at 3× P95 as the starting point. Not 10× (too loose to catch spirals early), not 1.5× (you trip on legitimate variance). 3× P95 buys enough headroom for real-world variance while still catching genuine runaways before they compound. Calibrate against your workload from there.[2]
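The three steps can be sketched with the standard library alone. The function names, the `min_samples` guard, and the synthetic log-normal data are assumptions for illustration; in practice the cost list would come from your logged per-session traces.

```python
import random
import statistics


def session_cost(steps, input_price, output_price):
    """Sum per-step cost: tokens_in * input_price + tokens_out * output_price."""
    return sum(t_in * input_price + t_out * output_price for t_in, t_out in steps)


def envelope(costs, multiplier=3.0, min_samples=200):
    """Compute P50/P90/P95/P99 and the trip threshold from profiled session costs."""
    if len(costs) < min_samples:
        raise ValueError(f"need at least {min_samples} profiled sessions")
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(costs, n=100, method="inclusive")
    result = {f"p{k}": cuts[k - 1] for k in (50, 90, 95, 99)}
    result["trip_threshold"] = multiplier * result["p95"]
    return result


# Synthetic stand-in for 200 profiled sessions (not real data).
random.seed(0)
profiled = [random.lognormvariate(-0.8, 0.6) for _ in range(200)]
env = envelope(profiled)
print({name: round(value, 2) for name, value in env.items()})
```

Keeping the multiplier as a parameter makes later calibration a one-line change instead of a code edit.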
Add a turn-count threshold alongside the cost threshold. Long turn counts usually precede cost spikes — a workflow that normally takes 8 turns and has hit 35 is almost certainly in a loop, even if cumulative cost has not crossed the dollar line yet.
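A sketch of the combined check, assuming the turn limit is set as a multiple of the typical turn count. The source does not prescribe a multiplier, so the 3× used here (24 turns for an 8-turn workflow) is a placeholder, and `trip_reason` is a hypothetical helper name.

```python
from typing import Optional


def trip_reason(cum_cost_usd: float, turns: int,
                cost_threshold_usd: float, turn_threshold: int) -> Optional[str]:
    """Return why the session should trip, or None if it is inside both limits."""
    if cum_cost_usd >= cost_threshold_usd:
        return "cost"
    if turns >= turn_threshold:
        return "turns"
    return None


# An 8-turn workflow that has reached 35 turns while still under the dollar line.
reason = trip_reason(cum_cost_usd=1.10, turns=35,
                     cost_threshold_usd=2.40, turn_threshold=24)
print(reason)
```

Returning the reason rather than a boolean lets the alert say which limit fired.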
One number worth knowing: agent framework adoption has doubled from 9% to 18% of production services in a single year.[7] Most of those teams are profiling cost for the first time. If your agent is newer than 6 months, your P95 is probably underfit — run the profiling pass again with inputs collected from real production traffic, not test suites.
| Workflow | P50 Cost | P95 Cost | Trip Threshold (3× P95) | Degradation Tier |
|---|---|---|---|---|
| Simple Q&A agent | $0.03 | $0.09 | $0.27 | Return cached answer |
| Research agent | $0.45 | $1.80 | $5.40 | Cap tool calls at 3 |
| Code review agent | $0.22 | $0.85 | $2.55 | Skip style checks |
| Data extraction agent | $0.60 | $2.40 | $7.20 | Return partial results |
| Multi-agent pipeline | $1.20 | $4.80 | $14.40 | Skip non-critical subagents |
Per-call instrumentation cannot see the running session total. The orchestration layer can. That is where the breaker has to live.
Place the breaker at the orchestration layer, not on individual model calls. Wrapping chat.completions.create() gives you per-call visibility and exactly nothing about the emergent session cost. The breaker has to see the running total or it cannot do its job.
The minimal state per session is three fields: cumulative cost, current state (CLOSED/OPEN/HALF-OPEN), and the timestamp the breaker last opened (for cooldown tracking). Everything else is a callback.
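What follows is a minimal Python sketch of such a breaker, not a definitive implementation. The method names `should_allow()` and `record_call()` and the `probe_budget_usd` parameter follow the names used elsewhere in this article. The class name, the `on_trip` callback, the injectable `clock`, and the choice to restart the accumulator after a successful probe are assumptions. It also assumes one session calls it from a single thread.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


@dataclass
class SessionCostBreaker:
    trip_threshold_usd: float
    probe_budget_usd: float                 # tight sub-budget for the HALF-OPEN probe
    cooldown_s: float = 300.0               # 5 minutes
    on_trip: Optional[Callable[["SessionCostBreaker"], None]] = None
    clock: Callable[[], float] = time.monotonic
    # The three fields of per-session state:
    cumulative_cost_usd: float = 0.0
    state: State = State.CLOSED
    opened_at: Optional[float] = None

    def should_allow(self) -> bool:
        """Check before each agent call. OPEN blocks until the cooldown elapses."""
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.cooldown_s:
                self.state = State.HALF_OPEN    # let one probe request through
                return True
            return False
        return True

    def record_call(self, cost_usd: float) -> None:
        """Call after each model response with that call's cost."""
        self.cumulative_cost_usd += cost_usd
        if self.state is State.HALF_OPEN:
            if cost_usd <= self.probe_budget_usd:
                # Assumption: a successful probe starts a fresh envelope.
                self.state = State.CLOSED
                self.cumulative_cost_usd = 0.0
            else:
                self._open()
        elif (self.state is State.CLOSED
              and self.cumulative_cost_usd >= self.trip_threshold_usd):
            self._open()

    def _open(self) -> None:
        self.state = State.OPEN
        self.opened_at = self.clock()
        if self.on_trip:
            self.on_trip(self)              # e.g. page oncall, emit a span
```

An orchestration loop would call `should_allow()` before each model or tool call and `record_call(cost)` after it, routing to the degradation path whenever `should_allow()` returns False.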
If your orchestration layer is TypeScript — LangGraph, Vercel AI SDK, or a custom loop — the same logic translates directly. LangGraph exposes an event stream where every model call is tagged with input/output token counts, so you can compute cost in a stream handler and check the breaker before each graph node executes.
Open the circuit and you have two seconds to decide what the user sees. Define the degradation tier before deploy, not after the page fires.
Reduced scope. Cut the tool call ceiling to 3, drop non-essential subagents, return partial results with a note. The agent still runs, with reduced scope.
Cached fallback. Return the most recent clean response for this workflow type from the response cache, if one exists.
Human handoff. Return a structured response describing what completed and route the conversation to a human queue. Mandatory for customer-facing agents where silent failure is the worst outcome.
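One way to pin these responses down before deploy is to encode them as an explicit dispatch. The sketch below is hypothetical throughout: the `Tier` enum, the response fields, and the `human-review` queue name are placeholders for whatever your application actually returns.

```python
from enum import Enum


class Tier(Enum):
    REDUCED_SCOPE = "reduced_scope"
    CACHED_FALLBACK = "cached_fallback"
    HUMAN_HANDOFF = "human_handoff"


def degrade(tier: Tier, workflow: str, partial: list[str],
            cache: dict[str, str]) -> dict:
    """Build the user-facing response once the breaker has opened."""
    if tier is Tier.CACHED_FALLBACK and workflow in cache:
        return {"status": "cached", "answer": cache[workflow]}
    if tier is Tier.HUMAN_HANDOFF:
        return {"status": "handoff", "completed": partial, "queue": "human-review"}
    # Reduced scope, which also covers a cache miss on the cached tier.
    return {"status": "partial", "completed": partial, "max_tool_calls": 3,
            "note": "Stopped early: session cost limit reached."}


response = degrade(Tier.CACHED_FALLBACK, "research", ["fetched 4 sources"], cache={})
print(response["status"])
```

Falling through to a partial result on a cache miss keeps the user from ever receiving an empty response.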
Wire it to the cost source — Langfuse for per-generation pricing, OpenTelemetry for cross-service propagation, monthly caps as a backstop.
The breaker is pure control logic. It just needs a cost number per call. The interesting question is where that number lives in your stack.
Langfuse exposes per-generation cost through generation.usage.cost, computed from token counts and model pricing automatically. Hook the Python SDK callback to feed cost into the breaker in real time, instead of polling after the fact.
OpenTelemetry spans give you the infrastructure layer. Propagate cost across services, correlate with workflow ID, user ID, and feature flag, and pipe the result into the observability stack for cross-session analysis. When a breaker trips, the span carries enough context to name the workflow, the input type, and the tool call sequence that fed the spiral.
LangGraph's event stream emits on_chat_model_end events with usage_metadata carrying input and output token counts on every model call. That's your cost hook in a TypeScript graph: multiply by the per-token price for the active model and call breaker.recordCall(); the breaker handles the rest.
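The conversion from token counts to dollars is small enough to show directly. This sketch is in Python for consistency with the other examples and assumes the usage payload is a mapping with `input_tokens` and `output_tokens` keys. The price table and model name are hypothetical; substitute your provider's rate card.

```python
# Hypothetical per-million-token prices; not a real rate card.
PRICES_PER_MTOK = {"example-model": {"input": 3.00, "output": 15.00}}


def cost_from_usage(usage_metadata: dict, model: str) -> float:
    """Convert one model call's token counts into USD."""
    price = PRICES_PER_MTOK[model]
    input_tokens = usage_metadata.get("input_tokens", 0)
    output_tokens = usage_metadata.get("output_tokens", 0)
    return (input_tokens * price["input"]
            + output_tokens * price["output"]) / 1_000_000


call_cost = cost_from_usage({"input_tokens": 12_000, "output_tokens": 800},
                            "example-model")
print(f"${call_cost:.4f}")
```

The returned number is what gets passed to the breaker's record call after each model response.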
TrueFoundry's rate-limiting layer adds a three-tier enforcement model: per-user, per-team, and per-application token budgets enforced at the gateway before the request reaches the model.[9] Use it as a second line alongside the in-process breaker — not a replacement.
One prompt-caching note: only 28% of LLM calls in Datadog's telemetry utilize prompt caching, despite most production models supporting it.[7] If your system prompt is large and stable — likely, given that 69% of input tokens are system prompts — enabling caching materially reduces the baseline cost per call, which in turn lowers both your P95 and your trip threshold. Profile after enabling caching, not before.
Five specific mistakes that turn a cost breaker into a false-confidence artifact. Each one has a corrected form.
Does the breaker add meaningful latency to every agent call?
No. should_allow() is a local in-memory check — microseconds, not milliseconds. record_call() runs synchronously after the LLM response returns. At 100 concurrent sessions, overhead is negligible against LLM call latency (typically 500ms–3s). If you measure a hot path bottleneck, the bottleneck is the model, not the breaker.
How do I handle multi-agent workflows where cost accumulates across sub-agents?
Pass the session's cumulative cost through the execution context so every sub-agent shares the same accumulator. Each sub-agent checks the same breaker instance before executing. This prevents the symmetric failure where each sub-agent sees a clean budget while the parent session is already deep into a spiral. In LangGraph, thread the breaker instance through the graph state or use a session-scoped dependency injected at graph compile time.
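A simplified sketch of the shared-accumulator idea, with the breaker reduced to a threshold check so the sharing is easy to see. `SharedBudget` and `run_subagent` are hypothetical names, the per-call costs are invented, and the $14.40 threshold is the multi-agent pipeline row from the table above.

```python
class SharedBudget:
    """One accumulator per session, handed to every sub-agent."""

    def __init__(self, trip_threshold_usd: float):
        self.trip_threshold_usd = trip_threshold_usd
        self.spent_usd = 0.0

    def should_allow(self) -> bool:
        return self.spent_usd < self.trip_threshold_usd

    def record_call(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd


def run_subagent(budget: SharedBudget, call_costs: list[float]) -> int:
    """Run calls until done or the shared budget blocks; return calls executed."""
    executed = 0
    for cost in call_costs:
        if not budget.should_allow():
            break
        budget.record_call(cost)   # stands in for a real model call
        executed += 1
    return executed


budget = SharedBudget(trip_threshold_usd=14.40)
retriever_calls = run_subagent(budget, [3.00] * 4)    # spends 12.00 of the session
summarizer_calls = run_subagent(budget, [3.00] * 4)   # sees the parent's spend
print(retriever_calls, summarizer_calls, budget.spent_usd)
```

Because both sub-agents hold the same object, the second one inherits the first one's spend instead of starting from a clean budget. If sub-agents run in parallel threads, the accumulator would additionally need a lock.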
What if the HALF-OPEN probe also spirals?
Keep the probe budget tight — 5–10% of the trip threshold. If the probe exceeds its sub-budget, the breaker returns to OPEN and resets the cooldown. HALF-OPEN does not get to become a second spike vector. Set probe_budget_usd (or probeBudget in TypeScript) low enough that a misbehaving probe causes minimal damage.
Do I still need a global monthly budget cap on top of per-session breakers?
Yes, but only as a backstop. Per-session breakers catch in-flight spirals. A monthly cap in the AI gateway (Portkey, Helicone, MLflow AI Gateway) catches slow accumulation patterns too distributed to trip an individual session breaker — 50,000 sessions each running 20% over expected cost is the canonical case. One counterintuitive finding: teams with very tight monthly caps sometimes generate worse incentives than teams with none. Engineers start designing agents to land just under the cap instead of optimizing for cost-per-outcome, producing slightly-too-expensive-but-technically-compliant behavior across the fleet. Use the monthly cap as a hard stop. Use per-workflow P95 trends as the primary efficiency signal.
When should I recalibrate the trip threshold?
Recalibrate when: (1) model pricing changes — a new Claude or GPT-4o pricing tier shifts your P95 in either direction; (2) you add new tools with significantly different output sizes; (3) the breaker is tripping more than twice per week on legitimate traffic; or (4) P95 session cost has moved more than 20% since the last profiling run. Set a calendar reminder for quarterly recalibration at minimum.
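These triggers are mechanical enough to check in a scheduled job. A sketch, assuming you already track weekly trips on legitimate traffic and the current P95; the function and parameter names are hypothetical.

```python
def recalibration_reasons(legit_trips_per_week: int,
                          profiled_p95_usd: float,
                          current_p95_usd: float,
                          pricing_changed: bool = False,
                          new_tools_added: bool = False) -> list[str]:
    """Return the triggers that currently call for a new profiling run."""
    reasons = []
    if pricing_changed:
        reasons.append("model pricing changed")
    if new_tools_added:
        reasons.append("new tools with different output sizes")
    if legit_trips_per_week > 2:
        reasons.append("more than two trips per week on legitimate traffic")
    drift = abs(current_p95_usd - profiled_p95_usd) / profiled_p95_usd
    if drift > 0.20:
        reasons.append(f"P95 moved {drift:.0%} since last profiling run")
    return reasons


print(recalibration_reasons(3, profiled_p95_usd=0.80, current_p95_usd=1.00))
```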
Can I use the breaker pattern with streaming responses?
Yes, with a small adjustment. For streaming, you don't get the full token count until the stream closes. Compute cost at stream completion (in the onFinish callback in Vercel AI SDK, or from usage_metadata in LangGraph's on_chat_model_end event), then call recordCall(). For very long streaming responses, you can estimate cost mid-stream using a running character count heuristic (roughly 4 chars per token), but call recordCall() with the actual count at stream end to keep the accumulator precise.
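A sketch of the mid-stream estimate and the end-of-stream reconciliation. The class name and the output price are assumptions, and the 4-characters-per-token ratio is the rough heuristic from the answer above, not an exact figure for any tokenizer.

```python
CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary by language and content


class StreamCostEstimator:
    def __init__(self, output_price_per_token: float):
        self.price = output_price_per_token
        self.chars = 0

    def on_chunk(self, text: str) -> float:
        """Accumulate streamed text; return the running cost estimate in USD."""
        self.chars += len(text)
        return (self.chars / CHARS_PER_TOKEN) * self.price

    def final_cost(self, actual_output_tokens: int) -> float:
        """At stream end, use the provider-reported count instead of the estimate."""
        return actual_output_tokens * self.price


est = StreamCostEstimator(output_price_per_token=15.00 / 1_000_000)  # hypothetical price
running = 0.0
for chunk in ["a" * 400, "b" * 400]:
    running = est.on_chunk(chunk)      # compare against remaining budget mid-stream
actual = est.final_cost(actual_output_tokens=210)
print(f"estimate ${running:.5f}, actual ${actual:.5f}")
```

Only the final figure should be recorded against the session, so estimation error never accumulates across calls.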
You need a per-session breaker if:
Agents are in production and your only cost control is a monthly cloud budget cap
Cloud anomaly alerts run on a 24–48h lag, so a Friday spike runs all weekend
Multi-agent workflows share a context that grows with every tool call
You have already had one billing surprise that needed a refund request or an awkward finance conversation
Research and summarization agents run without a hard limit on source documents fetched
Recalibrate the threshold if:
The breaker trips more than twice per week on workflows that complete cleanly
False positives are forcing degraded responses on legitimate edge-case inputs
P95 session cost has moved more than 20% since you last profiled (model pricing change, new tools added)
New tool types have meaningfully different token output sizes than the original profiling set