Two LangChain agents burned $47,000 over eleven days in November 2025.[1] One generated queries. The other validated responses. They locked into a handshake. Neither was malfunctioning — they did exactly what their prompts instructed. The validator kept rejecting the generator's output as incomplete. The generator kept trying again. Eleven days. The burn rate became visible when someone finally opened the billing dashboard.
Monitoring was in place. It did not stop the loop. Cloud cost anomaly detection aggregates spend over 24-48 hour windows — by the time a threshold fires, a session burning $0.045 per minute has been running for days.[2] Monitoring that catches a runaway 36 hours in does not prevent $47,000 in damage. It documents it.
The real failure happened before any code ran. The team modeled expected agent cost as an average: typical tokens times model price equals expected session cost. That formula is correct for the 65-70% of runs that complete cleanly. It has no term for the 2-5% of runs where termination conditions fail and context grows without bound. In multi-agent systems, that small tail is where almost all your money goes.
More monitoring will not fix this. Tighter alerts will not fix this. The fix is upstream: model cost as a distribution, not a point estimate, and enforce the budget synchronously and in-process, checked before each API call, instead of asynchronously from outside the process.
Both agents executed their prompts correctly. The cost model had no term for what they did.
Higher token usage correlated with lower accuracy. Spending more bought worse answers.
Budget from the average and you have already excluded the tier where the money goes.
Chatbot cost intuitions do not transfer. Agent architectures pay for context at every handoff.
Why the Average Is a Lie: Agent Cost Is a Heavy Tail
Standard cost forecasting assumes costs cluster around a predictable mean. Agent workflows violate that assumption at every layer.
Standard API cost modeling treats requests as deterministic: hit the backend, return a response, pay a predictable amount. Agentic workflows do not behave that way. Tool calls per session are variable. Retry cycles are variable. Context length — which sets the cost of every subsequent call — grows on every iteration.
A study of agentic coding submitted to ICLR 2026 found that for identical task specifications, some runs consumed 10 times more tokens than others. There was no quality bonus. The expensive runs were less accurate on average.[3] Same task, same agent, different runs, 10x spread in cost. No mean compresses that into a useful budget.
Briefcase AI analyzed 1.4 million production LLM conversations. P95 cost ran 3-4x the median. The tail of massive conversations — about 9% of sessions — accounted for more than half of total spend.[4] Model from the mean and you have silently excluded the expensive half of your bill.
Three tiers describe where agent cost actually lands. Happy path is the cheap, common case: the task completes in the tool calls and turns you predicted. Iterative search is the middle: retries, multi-hops, refinement on partial results. Edge-case recovery is the tail: a termination condition fails, context explodes, or two agents lock into the handshake that generates five-figure bills. Budget from your P95. Enforce at 3x P95. The 3x multiple exists to catch the edge-case tier, which is exactly what every average-based budget pretends does not exist.
| Tier | What happens | Typical probability | Cost vs happy path | Failure modes that push here |
|---|---|---|---|---|
| Happy path | Task completes in the tool calls and turns you predicted | 65–70% | 1× | None |
| Iterative search | Retries, multi-hops, refinement on partial results | 25–30% | 3–8× | Ambiguous inputs, soft failures, partial tool results |
| Edge-case recovery | Termination fails, context explodes, retry storm fires | 2–5% | 50–200×+ | Incompatible termination conditions, missing loop guards, uncompressed context handoff |
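To make the table concrete, here is a minimal sketch of how much of the expected spend each tier contributes. The probabilities and cost multiples are illustrative values picked from inside the ranges in the table (they are assumptions, not measurements), and `spend_share_by_tier` is a hypothetical helper name.

```python
# Illustrative tier values chosen inside the ranges in the table above.
# They are assumptions for this sketch, not measured data.
TIERS = {
    "happy_path": {"probability": 0.68, "cost_multiple": 1},
    "iterative_search": {"probability": 0.28, "cost_multiple": 5},
    "edge_case_recovery": {"probability": 0.04, "cost_multiple": 100},
}


def spend_share_by_tier(tiers):
    """Return each tier's share of total expected spend."""
    contributions = {
        name: tier["probability"] * tier["cost_multiple"]
        for name, tier in tiers.items()
    }
    total = sum(contributions.values())
    return {name: value / total for name, value in contributions.items()}


shares = spend_share_by_tier(TIERS)
for name, share in shares.items():
    print(f"{name}: {share:.0%} of expected spend")
```

With these assumed values, the 4% of sessions in the edge-case tier account for roughly two thirds of expected spend, while the 68% on the happy path account for about a tenth.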
Every Handoff Pays the Context Tax. Twice.
Each agent boundary forwards the full accumulated context, not just the previous result. The cost math is not additive. It compounds.
Multi-agent systems carry a cost property single-agent systems do not: context multiplies at every handoff. When an orchestrator passes work to a research agent, then forwards the result to an analysis agent, the analysis agent does not receive the research output alone. It receives the accumulated context of the entire session — the original request, the orchestrator's planning steps, the research agent's full output — as its input.
A 3-agent sequential pipeline where each agent produces 500-token outputs looks, on a naive estimate, like 3 x (1,000 input + 500 output) = 4,500 tokens in total. In practice it consumes roughly 1,000 + 1,500 + 2,000 = 4,500 input tokens across the three agents before counting any output tokens: 4.5x the original 1,000-token request on inputs alone, and 1.5x the naive input estimate of 3,000, before retry overhead enters the picture.[5] Group-chat patterns are worse. Five agents, ten rounds, 300 tokens per message: 15,000 tokens in shared context before any work happens, because every agent reads the full message history on every turn.
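The arithmetic above can be reproduced with a short sketch. It assumes the simplest accumulation model, in which every agent receives the full context so far and appends its own output. Both function names are hypothetical.

```python
def sequential_input_tokens(initial_context, output_tokens, num_agents):
    """Input tokens each agent reads when the full context is forwarded."""
    per_agent = []
    context = initial_context
    for _ in range(num_agents):
        per_agent.append(context)  # the agent reads everything accumulated so far
        context += output_tokens   # its output joins the forwarded context
    return per_agent


def group_chat_history_tokens(num_agents, rounds, tokens_per_message):
    """Shared history size once every agent has posted in every round."""
    return num_agents * rounds * tokens_per_message


pipeline = sequential_input_tokens(1_000, 500, 3)
print(pipeline, sum(pipeline))               # [1000, 1500, 2000] 4500
print(group_chat_history_tokens(5, 10, 300)) # 15000
```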
Chatbot cost intuitions do not transfer. Anthropic's own research finds agents burn 4x more tokens than direct chat, and multi-agent systems burn 15x more.[6] One team migrated a simple RAG chatbot to an agentic pipeline. Monthly inference spend jumped from $4,200 to $31,000 — same underlying tasks, different architecture.[7]
The handoff format is a cost decision. Full conversation history is the most expensive option. A compressed structured summary — key facts, decisions made, constraints that must persist — is cheaper and produces better downstream reasoning, because the receiving agent reads signal instead of noise. This is an interface design problem. Infrastructure cannot fix what the interface gets wrong.
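One way to make the handoff format explicit is a small structured record that carries only what the next agent needs. The sketch below is one possible shape under that assumption: `HandoffSummary`, its fields, and the example values are all hypothetical, not an interface from any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class HandoffSummary:
    """Hypothetical structured handoff passed between agents."""
    task: str
    key_facts: list[str] = field(default_factory=list)
    decisions: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the summary as a compact prompt section, skipping empty parts."""
        sections = [
            ("Task", [self.task]),
            ("Key facts", self.key_facts),
            ("Decisions made", self.decisions),
            ("Constraints that must persist", self.constraints),
        ]
        lines = []
        for title, items in sections:
            if items:
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


summary = HandoffSummary(
    task="Compare the three vendor quotes",
    key_facts=["Vendor B is cheapest per seat"],
    constraints=["Budget ceiling is fixed"],
)
prompt = summary.to_prompt()
print(prompt)
```

The receiving agent gets `prompt` instead of the full message history, so its input size depends on the number of facts carried forward, not on how long the upstream conversation ran.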
Monitoring Fires Too Late. By Design.
Asynchronous billing aggregation and synchronous in-flight sessions live on incompatible time scales. Monitoring records the process. It cannot stop it.
Every team that hits a cost spiral reaches for the same fix first: better monitoring. The reflex is correct for visibility — you need to know what happened. It cannot solve prevention, because monitoring is structurally asynchronous.
Cloud cost alerts aggregate spend over 24-48 hour rolling windows. A session burning $10 per hour accumulates $240 before a daily anomaly threshold even has the data to fire on. That assumes a tightly calibrated threshold. A team running legitimate large batch jobs alongside agents will eat persistent false positives, train its engineers to dismiss cost alerts as noise, and then miss the real event when it arrives. One engineer who traced this failure mode in a production multi-agent framework put it cleanly: 'Per-session accounting without a synchronous enforcement point tends to lag behind the actual spike. By the time you observe the overage, the burst has already happened.'[9]
Prometheus-based dashboards have the same property. Even at one-minute scrape intervals, an alert rule must evaluate, match a condition, and route a notification before anyone can act. A zombie agent stuck in a reasoning loop burns $4-5 in a single query.[7] Multiply by concurrent sessions and the 30-90 minutes it takes for an alert to reach someone with permission to kill the process. Monitoring is necessary. It is not sufficient.
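The exposure from detection delay is simple multiplication, sketched below. The function name and the 20-session figure in the second call are illustrative assumptions. The $10 per hour and 24-hour figures come from the text above.

```python
def spend_before_intervention(burn_usd_per_hour, delay_hours, concurrent_sessions=1):
    """Spend accumulated between the start of a runaway and the moment it is stopped."""
    return burn_usd_per_hour * delay_hours * concurrent_sessions


# One session at $10/hour with a 24-hour aggregation window.
print(spend_before_intervention(10, 24))       # 240

# Hypothetical: 20 concurrent runaways and a 90-minute alert-to-action delay.
print(spend_before_intervention(10, 1.5, 20))  # 300.0
```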
The dashboard sees outputs. It cannot stop the process in flight. For that you need enforcement in-process, synchronous, checked before each API call — not after the response returns.
| Monitoring and alerts | SDK-level enforcement |
|---|---|
| Alert fires 24-48h after the spike starts (billing aggregation window) | BudgetExhausted raised synchronously before the next API call |
| Cannot stop an in-flight session, only narrate what happened to it | Halts the session mid-flight, so accumulation cannot continue |
| Batch-job false positives erode alert credibility until nobody reads them | Per-session scope: one runaway does not deny service to concurrent runs |
| By alert time, a $47K loop has been compounding for days | Trips at 3x P95 ($2.40 on a $0.80 workflow), not at $47,000 |
| Requires tight baseline calibration just to avoid alert fatigue | Stable regardless of whether your monitoring baseline is calibrated |
SDK-Level Enforcement Is the Only Mechanism That Prevents Spend
Enforcement runs synchronously, in-process, before each API call. Not after the response. Not from an external monitor. There is no other place it works.
SDK-level budget enforcement wraps the model client and raises an exception synchronously before the next API call when cumulative spend crosses its limit. This is structurally different from circuit breakers (which evaluate after calls complete) and from monitoring (which observes the process from outside it entirely).
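The pattern can be illustrated without any particular library. The sketch below is a minimal, library-free stand-in, not the implementation of agentbudget or tokencap: `SessionTokenBudget`, `BudgetExceeded`, and `fake_model_call` are hypothetical names, and the fake call simply reports a fixed token count where a real client would return usage metadata.

```python
class BudgetExceeded(Exception):
    """Raised before a call when the session has already used its budget."""


class SessionTokenBudget:
    """One instance per agent session; never shared between sessions."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.used_tokens = 0

    def call(self, send_request, *args, **kwargs):
        # Enforcement point: runs before the request leaves the process.
        if self.used_tokens >= self.max_tokens:
            raise BudgetExceeded(f"{self.used_tokens} / {self.max_tokens} tokens used")
        response, tokens_used = send_request(*args, **kwargs)
        self.used_tokens += tokens_used  # a real client would read usage metadata here
        return response


def fake_model_call(prompt):
    """Stand-in for a real client call; returns (text, tokens billed)."""
    return f"echo: {prompt}", 400


budget = SessionTokenBudget(max_tokens=1_000)
calls_made = 0
try:
    while True:  # simulated runaway loop
        budget.call(fake_model_call, "retry")
        calls_made += 1
except BudgetExceeded as err:
    print(f"halted after {calls_made} calls: {err}")
```

Because the check runs before each call and usage is only known after it, this sketch can overshoot the limit by at most one call's worth of tokens. The fourth call is never sent.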
Two open-source libraries implement the pattern. agentbudget[8] patches the Anthropic and OpenAI SDKs, tracks every call in dollar terms, fires soft-limit callbacks for warning, and raises BudgetExhausted before the next request when the hard limit hits. It also detects loops — repeated-call patterns within a configurable time window — and trips the breaker before cost can accumulate. tokencap[10] wraps the client and tracks in token counts rather than dollars. Token counts stay accurate when providers reprice. Dollar equivalents go stale.
Scope matters as much as mechanism. A global token limit that trips when any session exceeds budget denies service to every concurrent session the moment one runaway hits its ceiling. Per-session enforcement — one budget instance per agent run, never shared — means the 200 clean sessions running alongside one runaway never notice. The broken session trips its own limit. Everyone else keeps working.
Set the limit in tokens, not dollars. Token counts come from provider response metadata and stay correct. Dollar limits derived from per-token prices silently degrade when providers change pricing — and providers change pricing often. Translate your session budget to tokens once at configuration time, then enforce in tokens.
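A minimal sketch of that one-time translation follows. The $15 per million tokens price is a placeholder assumption, not a quoted rate, and `usd_budget_to_tokens` is a hypothetical helper. Using the most expensive per-token rate that applies to the workflow yields the lowest token limit for a given dollar budget.

```python
from decimal import Decimal


def usd_budget_to_tokens(budget_usd: str, usd_per_million_tokens: str) -> int:
    """Convert a dollar budget to a token limit once, at configuration time.

    Amounts are passed as strings so Decimal keeps them exact.
    """
    budget = Decimal(budget_usd)
    price = Decimal(usd_per_million_tokens)
    return int(budget / price * 1_000_000)


# 3x P95 of a $0.80 workflow at an assumed $15.00 per million tokens.
SESSION_TOKEN_LIMIT = usd_budget_to_tokens("2.40", "15.00")
print(SESSION_TOKEN_LIMIT)  # 160000
```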
agent_session.py

```python
# One budget per session. Never shared. One runaway never blocks the others.
import logging

import anthropic
# Assumes the exception class is exported from the package root;
# check this against the installed agentbudget version.
from agentbudget import AgentBudget, BudgetExhausted

log = logging.getLogger(__name__)


def notify_oncall(session_cost, task_preview):
    """Placeholder: route this to your paging system. Not part of agentbudget."""
    log.critical("budget breach: $%.4f on task %r", session_cost, task_preview)


def agent_loop(client, session, task: str) -> dict:
    """Placeholder for the application's own agent loop."""
    raise NotImplementedError


def run_agent_session(task: str, p95_usd: float = 0.80) -> dict:
    """
    p95_usd is the P95 session cost measured in staging profiling.
    Hard stop at 3x P95. Soft warning at 2x P95. Calibrate from data, not intuition.
    """
    budget = AgentBudget(
        max_spend=f"${p95_usd * 3:.2f}",  # hard stop at 3x P95
        soft_limit=0.67,                  # soft warning at 2x P95 (67% of 3x)
        max_repeated_calls=8,             # loop detection: same tool, same args
        loop_window_seconds=60.0,         # within a 60-second window
        on_soft_limit=lambda r: log.warning(
            "session approaching limit: $%.4f / $%.4f spent",
            r.spent, r.limit,
        ),
        on_hard_limit=lambda r: notify_oncall(
            session_cost=r.spent,
            task_preview=task[:80],
        ),
        on_loop_detected=lambda r: log.error(
            "loop detected: %d repeated calls within %ds",
            r.repeated_count, 60,
        ),
    )
    client = anthropic.Anthropic()

    # Halt and escalate on breach. Never retry through a tripped budget.
    try:
        with budget.session() as session:
            return agent_loop(client, session, task)
    except BudgetExhausted as err:
        log.error(
            "hard limit hit: $%.4f / $%.4f, task: %s",
            err.spent, err.limit, task[:80],
        )
        return {
            "status": "budget_exceeded",
            "partial_result": getattr(err, "last_response", None),
        }
```

Calibrate the Trip Threshold From Data, Not Intuition
Enforcement limits without a profiled cost distribution are guesses. Three steps. The middle one is non-negotiable.
- [01] Map the decision graph. Cost every node.
  Before any deployment, map every execution path the agent can take. Export the workflow graph from LangGraph or your equivalent. For each node, record: average input token count, average output token count, tool call frequency, retry probability, and the context size inherited from upstream. For multi-agent pipelines, track cumulative context at every boundary: not just the current agent's input, but the full accumulated context it receives.
- [02] Profile 200+ representative inputs. Compute the full distribution.
  Run the agent against 200-500 production-shaped inputs in staging. Log full token traces per session: tokens per tool call, tokens per retry cycle, total session cost. Compute P50, P90, P95, P99. Sort runs by cost and assign each to a tier. Hunt for runs where per-step cost accelerated across iterations rather than holding roughly stable. Those are your loop candidates, and they tell you whether your termination conditions hold under real inputs.
- [03] Set the hard limit at 3x P95. Re-profile when anything changes.
  Hard session limit: 3x P95 from your profiling data. Soft-limit callback: 2x P95, a debugging warning before the hard stop. For multi-agent pipelines, you need per-agent sub-limits and an aggregate pipeline limit. Neither is sufficient on its own. Re-profile whenever you add a tool, switch model tiers, or change workflow branching. A new tool that fetches external documents can push P95 up by 3-5x and break every threshold calibrated without it.
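Steps 02 and 03 can be sketched in a few lines once per-session token totals exist. The sketch below assumes totals are recorded in tokens, uses the nearest-rank percentile definition (one of several in common use), and treats the `growth` threshold for flagging accelerating runs as an arbitrary illustrative value. All function names are hypothetical.

```python
def percentile(sorted_values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    rank = -(-pct * len(sorted_values) // 100)  # ceiling division on integers
    return sorted_values[max(rank, 1) - 1]


def calibrate(session_totals):
    """Compute the distribution summary and the limits derived from P95."""
    ordered = sorted(session_totals)
    summary = {f"p{q}": percentile(ordered, q) for q in (50, 90, 95, 99)}
    summary["soft_limit"] = 2 * summary["p95"]
    summary["hard_limit"] = 3 * summary["p95"]
    return summary


def is_accelerating(step_tokens, growth=1.5):
    """Flag a run whose per-step tokens rise every step and end well above the start."""
    if len(step_tokens) < 3:
        return False
    rising = all(b > a for a, b in zip(step_tokens, step_tokens[1:]))
    return rising and step_tokens[-1] >= growth * step_tokens[0]


# Synthetic profiling data: 100 sessions totalling 100, 200, ..., 10,000 tokens.
totals = list(range(100, 10_100, 100))
limits = calibrate(totals)
print(limits)

print(is_accelerating([900, 1_400, 2_100, 3_300]))  # True: a loop candidate
print(is_accelerating([900, 950, 920, 910]))        # False: roughly stable
```

On this synthetic data P95 is 9,500 tokens, so the soft limit lands at 19,000 and the hard limit at 28,500.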
Pre-Deployment Cost Modeling Checklist
- Decision graph mapped: every node, its average token cost, its retry probability, and the context it inherits from upstream
- 200+ production-shaped inputs profiled in staging with full per-step token traces
- P50, P90, P95, P99 session cost computed from profiling data, not from happy-path estimates
- Each profiled run assigned a tier: happy path, iterative search, or edge-case recovery
- Runs flagged where per-step cost rose across iterations (termination guards likely missing)
- Hard session limit set at 3x P95 in agentbudget or tokencap
- Soft-limit callback at 2x P95 wired for the pre-hard-stop warning
- Per-session enforcement only: one budget instance per run, never shared across concurrent sessions
- Cumulative context size logged at every agent handoff boundary in multi-agent pipelines
- BudgetExhausted handler wired to oncall, not to log files alone
- Hard limit tested explicitly: a synthetic loop pushed through staging until it fired
- Re-profiling scheduled as a required step after any tool, model, or workflow branch change
Won't hard token limits kill legitimate complex tasks?
Yes, if the limit is calibrated too tight. That is why the profiling step is non-negotiable. A limit set at P95 itself would trip on roughly 5% of sessions; a limit at 3x P95 should trip on far fewer. Track the false-positive rate in the first two weeks after deployment. More than 1-2 false positives per hundred sessions means one of three things: the profiling sample was too small, the inputs were not representative of production, or your real P95 is meaningfully higher than what staging measured. Adjust upward incrementally. The alternative, a very loose limit to dodge calibration work, is not a budget control. It is a monitoring delay.
How does SDK-level enforcement interact with the circuit breaker pattern?
They operate at different granularities. SDK-level token limits check before every single API call — they prevent each new call from compounding a runaway context. Circuit breakers operate at the session or service level, typically with a trip threshold at several multiples of expected session cost, and they add graceful degradation modes — partial results, cached fallbacks, human handoff. The SDK limit catches problems earlier. The circuit breaker manages session-level cost envelopes and decides how a session shuts down when it has to. Production agent systems run both.
My agents are async across multiple processes. Can per-session enforcement still work?
Yes — with a shared backend instead of in-memory or SQLite. Both tokencap and agentbudget support Redis for multi-process coordination. The constraint that matters: budget checks and budget updates must be atomic. If two concurrent processes both read 'budget at 50%' before either update registers, you double-spend before enforcement fires. Redis atomic increments close that race. For single-process agent runtimes, the SQLite default holds. Switch to Redis when agents run in separate processes or containers.
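The race described above, and the shape of the fix, can be shown in a single process. The sketch below is an in-process stand-in that uses a lock to make check-and-update one step. Across processes the same role would be played by the shared store's atomic increment (for example Redis `INCRBY`). `AtomicTokenBudget` is a hypothetical class, not the API of either library.

```python
import threading


class AtomicTokenBudget:
    """Check and update the budget in one step so concurrent callers cannot both pass."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self._used = 0
        self._lock = threading.Lock()

    def reserve(self, tokens):
        """Reserve tokens before a call; return False if the budget cannot cover them."""
        with self._lock:
            if self._used + tokens > self.max_tokens:
                return False
            self._used += tokens
            return True

    @property
    def used(self):
        with self._lock:
            return self._used


budget = AtomicTokenBudget(max_tokens=10_000)
granted = []


def worker():
    granted.append(budget.reserve(1_000))


threads = [threading.Thread(target=worker) for _ in range(50)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(granted.count(True), budget.used)  # 10 10000
```

Fifty workers compete for a budget that covers ten reservations. Exactly ten succeed, and the recorded usage never exceeds the limit, which is the property a read-then-write sequence without atomicity cannot guarantee.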
We already have a monthly spend cap at the API organization level. Isn't that enough?
No. Organization-level spend caps are a last-resort backstop, not a session-level control. They limit total monthly API spend across all sessions, users, and workflows. They do not prevent a single runaway from consuming $47,000 of that budget before the cap fires. They also provide zero per-session attribution: when the org cap triggers, you do not know which workflow caused it or which input pattern triggered the loop. Use the org cap as a safety net for catastrophic failure. Use per-session enforcement, calibrated from a profiled distribution, as the primary control.
How do I profile cost distribution when agents run in parallel?
Profile the aggregate pipeline, not the individual agents. For each test run, log total input tokens across all agents that fired, total output tokens, and wall-clock elapsed time. For each agent invocation also log: which agent, input token count at that specific invocation, output token count, and cumulative context size at the moment it was invoked. That gives you both the pipeline-level distribution (for the aggregate hard limit) and per-agent distributions (for per-agent sub-limits). Parallel agents add a wrinkle: you cannot sum context sequentially. You have to track each agent's full context input separately.
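One possible log shape for this is a record per agent invocation, aggregated two ways. The sketch below assumes such records are collected during profiling. `Invocation`, its field names, and the sample values are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    """One agent invocation within a profiled pipeline run."""
    run_id: str
    agent: str
    input_tokens: int
    output_tokens: int
    context_tokens: int  # cumulative context size at the moment of invocation


def pipeline_totals(invocations):
    """Total tokens per run: the input to the aggregate pipeline limit."""
    totals = defaultdict(int)
    for inv in invocations:
        totals[inv.run_id] += inv.input_tokens + inv.output_tokens
    return dict(totals)


def per_agent_totals(invocations):
    """Token totals per invocation, grouped by agent: the input to per-agent sub-limits."""
    totals = defaultdict(list)
    for inv in invocations:
        totals[inv.agent].append(inv.input_tokens + inv.output_tokens)
    return dict(totals)


log = [
    Invocation("run-1", "research", 1_000, 500, 1_000),
    Invocation("run-1", "analysis", 1_000, 400, 1_000),  # parallel branch, same context
    Invocation("run-2", "research", 1_200, 600, 1_200),
]
print(pipeline_totals(log))   # {'run-1': 2900, 'run-2': 1800}
print(per_agent_totals(log))  # {'research': [1500, 1800], 'analysis': [1400]}
```

Each parallel branch records its own `context_tokens`, so nothing depends on summing context in invocation order.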
Hard Rules for Agent Cost Safety
Model agent cost as a distribution, not a point estimate
Build P50, P95, and P99 cost estimates from staging profiling before setting any budget limit. A budget built from the mean leaves the edge-case recovery tier — where runaway loops live — entirely unaccounted for. Averages are accurate summaries of the past. They are useless predictors of the tail.
Enforce per-session, never globally
A global token limit that trips when any session exceeds budget denies service to every concurrent session the moment one fails. Enforcement must be scoped to the individual agent run so one broken session cannot affect others. With 1,000 concurrent sessions and even a 1% runaway rate, a global limit is a reliability incident, not a cost control.
Check the budget before the API call, not after
Monitoring, dashboards, and post-call circuit breakers all observe cost that has already been incurred. The only mechanism that prevents accumulation checks the budget synchronously before each API call. By the time a cost spike shows up in monitoring, a multi-agent loop has already multiplied context across several handoffs and the damage is done.
Re-profile when tooling or models change
A new tool that fetches external content can shift P95 session cost by 3-5x. Switching model tiers changes per-token costs. Any time the agent's tool set or model selection changes, prior profiling data is stale and enforcement thresholds need recalibration. Re-profiling belongs in the same checklist as updating tests — not in the follow-up backlog.
- [1] Stop Burning Money on API Fees: $47K Infinite Loop Incident Documentation (fazm.ai, Dec 2025). fazm.ai
- [2] I Spent $0.20 Reproducing the Multi-Agent Loop That Cost Someone $47K (Medium, Feb 2026). medium.com
- [3] Analyzing and Predicting Token Consumptions in Agentic Coding Tasks (ICLR 2026 submission). openreview.net
- [4] The Hidden Economics of Token-Based LLM Pricing: Why Your AI Costs Are Unpredictable (Briefcase AI, Jan 2026). blogs.briefcasebrain.com
- [5] How LLMs Process Information: Tokens and Context Windows: The Compound Effect in Multi-Agent Systems (Agentic Academy, Feb 2026). agentic-academy.ai
- [6] Multi-Agent Orchestration: The Complexity Trap (Amit Kothari, Nov 2025). amitkoth.com
- [7] Why Lower Token Prices Didn't Reduce Our Inference Bill (Nick Vasylyna, Generative AI pub, Apr 2026). generativeai.pub
- [8] AgentBudget: Real-Time Cost Enforcement for AI Agent Sessions (GitHub, Feb 2026). github.com
- [9] Unbounded LLM Token Usage and Cost Amplification in Tool Loops: Hive Issue #393 (Jan 2026). github.com
- [10] tokencap: Token Usage Visibility and Budget Enforcement for AI Agents (GitHub, Mar 2026). github.com