In November 2025, a LangChain research pipeline burned $47,000 over eleven days.[1] Two agents, one generating queries and the other validating responses, locked into a handshake loop that never terminated. Neither agent was malfunctioning. Both did exactly what their prompts instructed. The validator kept finding the generator's output incomplete. The generator kept trying to improve it. Back and forth, for eleven days, at a burn rate that only became visible when someone finally checked the billing dashboard.
The team had monitoring in place. It did not stop the loop. Cloud cost anomaly detection aggregates spend over 24–48 hour windows — by the time a threshold fires, a session burning $0.045 per minute has already been running for days.[2] Monitoring that fires 36 hours into a runaway doesn't prevent $47,000 in damage. It documents it.
The real failure happened before any code ran. The team modeled their expected agent cost as an average: typical tokens times model price equals expected session cost. That formula is correct for the 65–70% of runs that complete cleanly. It has no term for the 2–5% of runs where termination conditions fail and context grows without bound. In multi-agent systems, that small tail accounts for nearly all your money.
This isn't about adding monitoring or tightening alerts. It's about the upstream problem: cost-by-distribution thinking instead of cost-by-average thinking, and enforcement that operates synchronously inside your process rather than asynchronously from outside it.
The Average Is a Lie: How Agent Costs Actually Distribute
Standard cost forecasting assumes costs cluster around a predictable mean. Agent workflows violate that assumption at every layer.
Standard API cost modeling treats requests as having deterministic cost profiles: a request hits a backend, returns a response, costs a predictable amount. Agentic workflows don't work that way. The number of tool calls per session is variable. Retry cycles are variable. Context length — which determines the cost of every subsequent call — grows with each iteration.
An ICLR 2026 paper analyzed token consumption across agentic coding tasks and found that for identical task specifications, some runs consumed 10 times more tokens than others, with no corresponding quality benefit. Higher token usage was actually associated with lower task accuracy on average.[3] The same task, the same agent, different inputs — and a 10x spread in cost. You cannot average that distribution into a useful budget.
Briefcase AI's analysis of 1.4 million production LLM conversations found that 95th percentile costs exceed the median by a factor of 3–4x, and that the tail of massive conversations — roughly 9% of total sessions — accounts for over half of total cost.[4] Model from the mean and you've invisibly excluded the expensive half.
Three tiers describe where agent costs actually land. The happy path is your cheapest, most common case — task completes in expected tool calls and turns. Iterative search is the middle tier: the agent retries, multi-hops, or refines based on partial results. Edge-case recovery is the tail: a termination condition fails, context explodes, or two agents enter the handshake pattern that generates five-figure bills. Budget from your P95. Enforce at 3× P95. The edge-case recovery tier is what the third multiple is there to catch — and it's also what every average-based budget ignores.
| Tier | What happens | Typical probability | Cost vs happy path | Failure modes that push here |
|---|---|---|---|---|
| Happy path | Task completes in expected tool calls and turns | 65–70% | 1× | None |
| Iterative search | Agent retries, multi-hops, or refines on partial results | 25–30% | 3–8× | Ambiguous inputs, soft failures, partial tool results |
| Edge-case recovery | Termination condition fails, context explodes, retry storm | 2–5% | 50–200×+ | Incompatible agent termination conditions, missing loop guards, context handoff without compression |
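A back-of-envelope expected-cost calculation over these tiers shows why the tail dominates. The baseline happy-path cost and the tier midpoints below are illustrative assumptions drawn from the table, not measured data:

```python
# Expected session cost by tier, using midpoint values from the table above.
happy_path_usd = 0.25  # assumed P50 cost of a clean run

tiers = {
    # name: (probability, cost multiplier vs happy path)
    "happy_path": (0.675, 1),
    "iterative_search": (0.275, 5.5),
    "edge_case_recovery": (0.035, 125),
}

contribution = {
    name: p * mult * happy_path_usd for name, (p, mult) in tiers.items()
}
expected = sum(contribution.values())

for name, usd in contribution.items():
    print(f"{name:20s} ${usd:.3f}  ({usd / expected:.0%} of expected cost)")
```

With these assumptions, the 3.5% edge-case tier contributes roughly two-thirds of expected spend, which is exactly the term an average-based budget drops.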
The Context Multiplication Tax in Multi-Agent Pipelines
Each agent handoff doesn't just pass the result — it passes the full accumulated context. The cost math is not additive. It's compounding.
Multi-agent systems have a cost property that single-agent systems don't: context multiplies at every handoff. When an orchestrator passes work to a research agent, then forwards the result to an analysis agent, the analysis agent doesn't receive only the research output. It receives the accumulated context of everything that has happened in the session — the original request, the orchestrator's planning steps, the research agent's full output — as its input.
A 3-agent sequential pipeline where each agent produces 500-token outputs doesn't consume 3 × (1,000 input + 500 output) = 4,500 tokens. It consumes roughly 1,000 + 1,500 + 2,000 = 4,500 input tokens across the three agents, before any output tokens — a 4.5× multiplier on inputs alone, even before accounting for any retry overhead.[5] Group-chat patterns are worse: a 5-agent group chat running 10 rounds at 300 tokens per message burns 15,000 tokens in shared context alone before any actual work occurs, because every agent reads the full message history on every turn.
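The handoff arithmetic above can be sketched directly. This is a simplified model that ignores system prompts and tool schemas:

```python
def sequential_pipeline_input_tokens(base_context: int, output_per_agent: int,
                                     n_agents: int) -> int:
    """Total input tokens across a sequential pipeline where each agent
    receives the full accumulated context (base request plus every prior
    agent's output), not just the previous result."""
    total = 0
    context = base_context
    for _ in range(n_agents):
        total += context               # this agent reads everything so far
        context += output_per_agent    # and its output joins the context
    return total

def group_chat_history_tokens(n_agents: int, rounds: int,
                              tokens_per_message: int) -> int:
    """Size of the shared message history a group chat accumulates; every
    agent re-reads this history on each subsequent turn."""
    return n_agents * rounds * tokens_per_message

print(sequential_pipeline_input_tokens(1_000, 500, 3))  # 1000 + 1500 + 2000 = 4500
print(group_chat_history_tokens(5, 10, 300))            # 15000
```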
This is why chatbot cost intuitions don't transfer. Anthropic's own research found that agents typically burn 4 times more tokens than direct chat interactions, and multi-agent systems burn 15 times more.[6] One team that migrated from a simple RAG chatbot to an agentic pipeline watched their monthly inference spend jump from $4,200 to $31,000 — same underlying tasks, different architecture.[7]
What you pass at each agent boundary is a cost decision. Full conversation history is the most expensive handoff format. A compressed structured summary — key facts, decisions made, constraints that must persist — is cheaper and often produces better downstream reasoning, because the receiving agent gets signal rather than noise. This is an interface design problem, not just an infrastructure one.
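One way to sketch a compressed handoff is a small structured summary type. The field names here are illustrative, not from any particular framework:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffSummary:
    """Compressed context passed at an agent boundary in place of full
    conversation history."""
    task: str                                             # original request, restated once
    key_facts: list[str] = field(default_factory=list)    # findings worth keeping
    decisions: list[str] = field(default_factory=list)    # choices already made
    constraints: list[str] = field(default_factory=list)  # must persist downstream

    def to_prompt(self) -> str:
        """Render the summary as the receiving agent's context block."""
        sections = [
            f"Task: {self.task}",
            "Key facts:\n" + "\n".join(f"- {f}" for f in self.key_facts),
            "Decisions:\n" + "\n".join(f"- {d}" for d in self.decisions),
            "Constraints:\n" + "\n".join(f"- {c}" for c in self.constraints),
        ]
        return "\n\n".join(sections)
```

The downstream agent receives a few hundred tokens of signal instead of the full accumulated transcript, and the summary itself stops growing as the pipeline lengthens.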
Why Monitoring Alerts Fire Too Late
Asynchronous billing aggregation and synchronous in-flight sessions operate on incompatible time scales. Monitoring sees the output of the process — it cannot stop the process itself.
Every team that hits a cost spiral reaches for the same immediate fix: better monitoring. That reflex is correct for cost visibility — you need to understand what happened. It cannot solve the prevention problem, because monitoring is structurally asynchronous.
Cloud cost alerts aggregate spend over 24–48 hour rolling windows. A session burning $10 per hour accumulates $240 before a daily anomaly threshold can fire. And that assumes your threshold is calibrated tightly — a team that also runs legitimate large batch jobs will face persistent false positives, train engineers to dismiss cost alerts as noise, and then miss the real event. One engineer who traced a similar failure in a production multi-agent framework put it directly: 'Per-session accounting without a synchronous enforcement point tends to lag behind the actual spike. By the time you observe the overage, the burst has already happened.'[9]
Prometheus-based cost dashboards face the same constraint. Even at one-minute scrape intervals, an alert rule must evaluate, match a condition, and notify before any action is possible. A zombie agent stuck in a reasoning loop can consume $4–5 in a single query.[7] Multiply that by concurrent sessions and by the 30–90 minutes it takes for an alert to reach someone who can act, and you understand why monitoring is necessary but not sufficient.
Monitoring sees outputs; it cannot stop the process mid-flight. For that, you need enforcement that operates in-process, synchronously, and is checked before each API call, not after.
| Asynchronous monitoring | Synchronous SDK enforcement |
|---|---|
| Alert fires 24–48h after the spike starts (billing aggregation window) | BudgetExhausted raised synchronously before the next API call |
| Cannot stop an in-flight session; only documents what happened | Stops the session mid-flight; accumulation cannot continue |
| False positives from batch jobs reduce alert credibility over time | Per-session enforcement: one runaway doesn't block concurrent sessions |
| By alert time, a $47K loop has been running for days | Trips at 3× P95 ($2.40 on a $0.80 workflow), not at $47,000 |
| Requires accurate baseline calibration to avoid alert fatigue | Stable regardless of whether your monitoring baseline is well calibrated |
SDK-Level Hard Enforcement: The Only Real Prevention
Enforcement must happen synchronously, in-process, before each API call — not after the response returns, and not from an external monitoring system.
SDK-level budget enforcement wraps the model client and raises an exception synchronously before the next API call is made, when cumulative token spend crosses its limit. This is fundamentally different from circuit breakers (which check session cost and trip after calls complete) and from monitoring (which observes cost from outside the process entirely).
Two open-source libraries implement this pattern. agentbudget[8] patches the Anthropic and OpenAI SDKs to track every call in dollar terms, with soft limits that fire warning callbacks and a hard limit that raises BudgetExhausted before the next request is made. It also adds loop detection — configurable repeated-call patterns within a time window that trip the breaker before cost can accumulate. tokencap[10] wraps the client and tracks in token counts rather than dollars, which is more stable across provider pricing changes since token counts are always accurate while dollar equivalents become stale after pricing updates.
The enforcement scope matters as much as the mechanism. A global token limit that trips when any session exceeds budget will deny service to all concurrent sessions when one runaway agent hits its ceiling. Per-session enforcement — one budget instance per agent run, never shared — means the 200 clean sessions running alongside one runaway are unaffected. One broken session trips its own limit; everyone else keeps working.
Set your limit in tokens, not dollars. Token counts come directly from provider response metadata and never become stale. Dollar limits calculated from per-token prices silently degrade after pricing changes — and providers change pricing frequently. Translate your session budget to tokens once at configuration time, then enforce in tokens.
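Translating the dollar budget into a token limit once, at configuration time, might look like the following. The prices and the input/output split are assumptions you supply from your provider's current price sheet:

```python
def usd_to_token_limit(budget_usd: float, usd_per_million_input: float,
                       usd_per_million_output: float,
                       output_fraction: float = 0.2) -> int:
    """Convert a dollar session budget into a total-token limit using a
    blended per-token price. Enforcement afterwards compares raw token
    counts from response metadata, so the limit never goes stale even if
    the prices used here do."""
    blended_per_million = (
        (1 - output_fraction) * usd_per_million_input
        + output_fraction * usd_per_million_output
    )
    return int(budget_usd / blended_per_million * 1_000_000)

# e.g. a $2.40 hard limit (3x a $0.80 P95), with assumed prices of $3/M
# input and $15/M output and roughly 20% of tokens being output:
limit = usd_to_token_limit(2.40, 3.00, 15.00)
```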
agent_session.py

```python
from agentbudget import AgentBudget
import anthropic
import logging

log = logging.getLogger(__name__)

# notify_oncall (paging hook) and agent_loop (the agent's main loop) are
# assumed to be defined elsewhere in the codebase.

def run_agent_session(task: str, p95_usd: float = 0.80) -> dict:
    """
    Wrap every session in its own budget instance. Never share across
    concurrent sessions; one runaway must not block others.

    p95_usd: the P95 session cost measured in staging profiling.
    The hard limit is set at 3x P95; soft warning at 2x P95.
    """
    budget = AgentBudget(
        max_spend=f"${p95_usd * 3:.2f}",  # hard stop at 3x P95
        soft_limit=0.67,                  # soft warning at 2x P95 (67% of 3x)
        max_repeated_calls=8,             # loop detection: same tool, same args
        loop_window_seconds=60.0,         # within a 60-second window
        on_soft_limit=lambda r: log.warning(
            "session approaching limit: $%.4f / $%.4f spent",
            r.spent, r.limit,
        ),
        on_hard_limit=lambda r: notify_oncall(
            session_cost=r.spent,
            task_preview=task[:80],
        ),
        on_loop_detected=lambda r: log.error(
            "loop detected: %d repeated calls within %ds",
            r.repeated_count, 60,
        ),
    )
    client = anthropic.Anthropic()
    try:
        with budget.session() as session:
            return agent_loop(client, session, task)
    except budget.BudgetExhausted as err:
        log.error(
            "hard limit hit: $%.4f / $%.4f; task: %s",
            err.spent, err.limit, task[:80],
        )
        return {
            "status": "budget_exceeded",
            "partial_result": getattr(err, "last_response", None),
        }
```

Pre-Deployment Cost Modeling in Three Steps
Setting enforcement limits requires knowing your actual cost distribution first. You cannot calibrate a meaningful trip threshold from intuition or from the happy path alone.
1. Map the decision graph and cost each node. Before deploying any agent, map every possible execution path. If you use LangGraph or a similar framework, export the workflow graph. For each node, record: average input token count, average output token count, tool call frequency, retry probability, and the context size it inherits from upstream steps. For multi-agent pipelines, track cumulative context at each agent boundary: not just the current agent's input, but the full accumulated context it receives.
2. Profile 200+ representative inputs and compute the full distribution. Run your agent against 200–500 production-like inputs in staging. Log full token traces per session: tokens per tool call, tokens per retry cycle, and total session cost. Calculate P50, P90, P95, and P99. Sort runs by total cost and identify which tier each falls into. Look specifically for runs where cost per step accelerated across iterations rather than remaining roughly stable; those are your loop candidates, and they reveal whether your termination conditions are robust.
3. Set enforcement limits at 3× P95, and re-profile when anything changes. Set your hard session limit at 3× P95 from your profiling data. Set a soft-limit callback at 2× P95; this gives you a warning before the hard stop fires, which is useful for debugging. For multi-agent pipelines, set per-agent sub-limits and an aggregate pipeline limit; neither is sufficient alone. Re-profile whenever you add new tools, change model selection, or modify workflow branching logic. Adding a tool that fetches external documents can push P95 up by 3–5×, breaking every threshold calibrated without it.
Pre-Deployment Cost Modeling Checklist
- Mapped the decision graph: every node, its average token cost, its retry probability, and the context it inherits from upstream
- Profiled 200+ production-representative inputs in staging with full per-step token traces
- Calculated P50, P90, P95, P99 session costs from profiling data, not from happy-path estimates
- Identified which tier each profiled run falls into (happy path / iterative / edge-case recovery)
- Flagged runs where per-step cost accelerated across iterations; these reveal missing termination guards
- Set hard session limit at 3× P95 in agentbudget or tokencap
- Set soft-limit callback at 2× P95 for pre-hard-stop warning
- Per-session enforcement only: one budget instance per agent run, never shared across concurrent sessions
- Logged cumulative context size at each agent handoff boundary in multi-agent pipelines
- Wired BudgetExhausted handler to oncall alerting, not just log files
- Tested the hard limit explicitly: ran a synthetic loop through staging that exceeds it
- Scheduled re-profiling as a required step after any tool, model, or workflow branch change
If I set hard token limits, won't they kill legitimate complex tasks?
Yes, if your limit is calibrated too tight. That's why the profiling step is non-negotiable. A limit at 3× P95 means roughly 5% of sessions may trip it. Track your false positive rate in the first two weeks after deployment. More than 1–2 false positives per hundred sessions suggests your profiling sample was too small or not representative of production inputs, or that your true P95 is meaningfully higher than what you measured in staging. Adjust upward incrementally. The alternative, a very loose limit set to avoid thinking through calibration, is not a budget control; it's a monitoring delay.
How does SDK-level enforcement interact with the circuit breaker pattern?
They complement each other at different granularities. SDK-level token limits check before every single API call — they prevent each new call from adding to a runaway context. Circuit breakers operate at the session or service level, typically with a trip threshold at several multiples of expected session cost, and they add graceful degradation modes (partial results, cached fallbacks, human handoffs). The SDK limit is more granular and catches problems earlier; the circuit breaker manages session-level cost envelopes and controls what happens when a session needs to be stopped gracefully. Production agent systems benefit from both.
My agents are async and run across multiple processes. Can per-session enforcement still work?
Yes, but you need a shared backend rather than an in-memory or SQLite store. Both tokencap and agentbudget support Redis backends for multi-process coordination. The critical requirement is that budget checks and budget updates are atomic operations — if two concurrent processes both read 'budget at 50%' before either update registers, you effectively double spend before enforcement fires. Redis atomic increments eliminate this race condition. For most teams, the SQLite default works fine for single-process agent runtimes; switch to Redis when you have agents running in separate processes or containers.
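The atomic reserve pattern can be sketched against any redis-py-style client, meaning one exposing `incrby`, `decrby`, and `expire`. This is an illustrative sketch of the race-free check, not the actual tokencap or agentbudget backend:

```python
class SharedTokenBudget:
    """Cross-process, per-session token budget backed by an atomic counter.
    INCRBY returns the post-increment total in the same operation, so two
    processes can never both observe 'under budget' for the same tokens."""

    def __init__(self, client, session_id: str, limit_tokens: int,
                 ttl_seconds: int = 3600):
        self.client = client                 # redis-py-style client
        self.key = f"budget:{session_id}"    # key naming is illustrative
        self.limit = limit_tokens
        self.ttl = ttl_seconds

    def reserve(self, tokens: int) -> int:
        """Atomically reserve tokens before the next API call; raises if
        the reservation would exceed the session limit."""
        new_total = self.client.incrby(self.key, tokens)
        self.client.expire(self.key, self.ttl)  # GC abandoned sessions
        if new_total > self.limit:
            # Roll back so a smaller follow-up request can still proceed.
            self.client.decrby(self.key, tokens)
            raise RuntimeError(
                f"session budget exhausted: {new_total}/{self.limit} tokens"
            )
        return new_total
```

The same read-then-check done as two separate Redis calls would reintroduce the double-spend race; the increment and the readback must be one operation.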
We already have a monthly spend cap at the API organization level. Isn't that sufficient?
Organization-level spend caps are a last-resort backstop, not a session-level control. They cap your total monthly API spend across all sessions, all users, and all workflows — they don't prevent a single runaway session from consuming $47,000 of that budget before the cap fires. They also provide zero per-session attribution: when the cap triggers, you don't know which workflow caused it or which input pattern triggered the loop. Use the org-level cap as a safety net for truly catastrophic failure. Use per-session enforcement as your primary control, calibrated from actual cost distribution profiling.
How do I profile cost distribution for a multi-agent pipeline where agents run in parallel?
Profile the aggregate pipeline, not individual agents. For each test run, log the total input tokens across all agents that fired, the total output tokens, and the wall-clock elapsed time. For each agent invocation, also log: which agent, input token count at that specific invocation, output token count, and cumulative context size at the point it was invoked. This gives you both the pipeline-level distribution (for your aggregate hard limit) and per-agent distributions (for per-agent sub-limits). Parallel agents make this slightly more complex because you can't sum context sequentially — you need to track each agent's full context input separately.
Hard Rules for Agent Cost Safety
Model agent cost as a distribution, not a point estimate
Build P50, P95, and P99 cost estimates from staging profiling before setting any budget limits. A budget built from average cost leaves the edge-case recovery tier — where runaway loops live — completely unaccounted for. Averages are accurate summaries of the past. They are useless predictors of the tail.
Enforce per-session, never globally
A global token limit that trips when any session exceeds budget denies service to every concurrent session when one fails. Enforcement must be scoped to the individual agent session so one broken run cannot affect others. At scale — 1,000 concurrent sessions with even a 1% runaway rate — a global limit becomes a reliability problem, not just a cost control.
Check the budget before the API call, not after
Monitoring, dashboards, and post-call circuit breakers all observe cost after it has been incurred. The only mechanism that prevents accumulation is one that checks the budget synchronously before each API call is made. By the time you see a cost spike in monitoring, a multi-agent loop has already multiplied context through multiple handoffs and the damage is done.
Re-profile when your tooling or models change
Adding a new tool that fetches external content can change your P95 session cost by 3–5×. Switching model tiers changes per-token costs. Any time your agent's tool set or model selection changes, your existing profiling data is stale and your enforcement thresholds need recalibration. Treat re-profiling as a required step in the same checklist as updating tests — not an optional follow-up.
- [1]Stop Burning Money on API Fees — $47K Infinite Loop Incident Documentation (fazm.ai, Dec 2025)(fazm.ai)↩
- [2]I Spent $0.20 Reproducing the Multi-Agent Loop That Cost Someone $47K (Medium, Feb 2026)(medium.com)↩
- [3]Analyzing and Predicting Token Consumptions in Agentic Coding Tasks (ICLR 2026 submission)(openreview.net)↩
- [4]The Hidden Economics of Token-Based LLM Pricing: Why Your AI Costs Are Unpredictable (Briefcase AI, Jan 2026)(blogs.briefcasebrain.com)↩
- [5]How LLMs Process Information: Tokens and Context Windows — The Compound Effect in Multi-Agent Systems (Agentic Academy, Feb 2026)(agentic-academy.ai)↩
- [6]Multi-Agent Orchestration: The Complexity Trap (Amit Kothari, Nov 2025)(amitkoth.com)↩
- [7]Why Lower Token Prices Didn't Reduce Our Inference Bill (Nick Vasylyna, Generative AI pub, Apr 2026)(generativeai.pub)↩
- [8]AgentBudget: Real-Time Cost Enforcement for AI Agent Sessions (GitHub, Feb 2026)(github.com)↩
- [9]Unbounded LLM Token Usage and Cost Amplification in Tool Loops — Hive Issue #393 (Jan 2026)(github.com)↩
- [10]tokencap: Token Usage Visibility and Budget Enforcement for AI Agents (GitHub, Mar 2026)(github.com)↩