Two LangChain agents burned $47K in eleven days. The model worked. The budget math didn't. Multi-agent cost is a heavy-tailed distribution, monitoring is structurally too late, and only synchronous SDK-level enforcement stops the spiral.
Why average-based cost models structurally fail for agentic workflows — and the three-tier distribution model that works
The context multiplication tax: how a 3-agent pipeline consumes 4.5× the naive per-call estimate before retry overhead
Why monitoring cannot prevent cost spirals — and the synchronous enforcement mechanism that can
A failure taxonomy drawn from 63 confirmed production budget-overrun incidents across 21 orchestration frameworks
Runnable Python for per-session enforcement with agentbudget, including loop detection and Redis for multi-process setups
A pre-deployment profiling workflow: map → profile → set thresholds → test the limit fires
Decision table for context handoff format — when full history costs less than a structured summary
Two LangChain agents burned $47,000 over eleven days in November 2025.[1] One generated queries. The other validated responses. They locked into a handshake. Neither was malfunctioning — they did exactly what their prompts instructed. The validator kept rejecting the generator's output as incomplete. The generator kept trying again. Eleven days. The burn rate became visible when someone finally opened the billing dashboard.
Monitoring was in place. It did not stop the loop. Cloud cost anomaly detection aggregates spend over 24-48 hour windows — by the time a threshold fires, a session burning $0.045 per minute has been running for days.[2] Monitoring that catches a runaway 36 hours in does not prevent $47,000 in damage. It documents it.
The real failure happened before any code ran. The team modeled expected agent cost as an average: typical tokens times model price equals expected session cost. That formula is correct for the 65-70% of runs that complete cleanly. It has no term for the 2-5% of runs where termination conditions fail and context grows without bound. In multi-agent systems, that small tail is where almost all your money goes.
More monitoring will not fix this. Tighter alerts will not fix this. The fix is upstream: model cost as a distribution, not a point estimate, and put enforcement synchronously in-process — checked before each API call — instead of asynchronously from outside it.
Both agents executed their prompts correctly. The cost model had no term for what they did.
Higher token usage correlated with lower accuracy. Spending more bought worse answers.
Budget from the average and you have already excluded the tier where the money goes.
Every incident backed by a quoted GitHub issue or maintainer statement. This is a documented failure class, not an edge case.
Standard cost forecasting assumes costs cluster around a predictable mean. Agent workflows violate that assumption at every layer.
Standard API cost modeling treats requests as deterministic: hit the backend, return a response, pay a predictable amount. Agentic workflows do not behave that way. Tool calls per session are variable. Retry cycles are variable. Context length — which sets the cost of every subsequent call — grows on every iteration.
An ICLR 2026 study of agentic coding found that for identical task specifications, some runs consumed 10 times more tokens than others. There was no quality bonus. The expensive runs were less accurate on average.[3] Same task, same agent, different runs, 10x spread in cost. No mean compresses that into a useful budget.
Briefcase AI analyzed 1.4 million production LLM conversations. P95 cost ran 3-4x the median. The tail of massive conversations — about 9% of sessions — accounted for more than half of total spend.[4] Model from the mean and you have silently excluded the expensive half of your bill.
A January 2026 empirical study ("Tokenomics") of a ChatDev multi-agent system running 30 software development tasks found that input tokens averaged 53.9% of all consumption — and iterative code review alone consumed 59.4% of the total token budget, dwarfing initial code generation.[12] The refinement loop is the cost, not the primary task. That changes where you instrument and where you cap.
Three tiers describe where agent cost actually lands. Happy path is the cheap, common case: the task completes in the tool calls and turns you predicted. Iterative search is the middle: retries, multi-hops, refinement on partial results. Edge-case recovery is the tail: a termination condition fails, context explodes, or two agents lock into the handshake that generates five-figure bills. Budget from your P95. Enforce at 3x P95. The edge-case tier is what the 3x multiple is there to catch, and it is exactly what every average-based budget pretends does not exist.
| Tier | What happens | Typical probability | Cost vs happy path | Failure modes that push here |
|---|---|---|---|---|
| Happy path | Task completes in the tool calls and turns you predicted | 65–70% | 1× | None |
| Iterative search | Retries, multi-hops, refinement on partial results | 25–30% | 3–8× | Ambiguous inputs, soft failures, partial tool results |
| Edge-case recovery | Termination fails, context explodes, retry storm fires | 2–5% | 50–200×+ | Incompatible termination conditions, missing loop guards, uncompressed context handoff |
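The table can be turned into arithmetic. A minimal sketch, assuming one probability and one cost multiple picked from inside each range above (the specific values 0.68/0.28/0.04 and 1x/5x/100x are illustrative assumptions, not measurements):

```python
# Illustrative three-tier cost model. Probabilities and multiples are
# assumed points inside the ranges in the table, not measured values.
TIERS = {
    "happy_path":         {"p": 0.68, "cost_multiple": 1.0},
    "iterative_search":   {"p": 0.28, "cost_multiple": 5.0},
    "edge_case_recovery": {"p": 0.04, "cost_multiple": 100.0},
}


def expected_cost_multiple(tiers):
    """Expected session cost, in units of one happy-path session."""
    return sum(t["p"] * t["cost_multiple"] for t in tiers.values())


def spend_share(tiers):
    """Fraction of total expected spend contributed by each tier."""
    total = expected_cost_multiple(tiers)
    return {name: t["p"] * t["cost_multiple"] / total
            for name, t in tiers.items()}


mean_multiple = expected_cost_multiple(TIERS)
shares = spend_share(TIERS)
for name, share in shares.items():
    print(f"{name:20s} {share:6.1%} of expected spend")
print(f"mean session costs {mean_multiple:.2f}x a happy-path session")
```

With these assumed values, the 4% tail contributes roughly two thirds of expected spend, and the mean lands near 6x a happy-path session, a number that describes almost no individual run.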
63 confirmed incidents. 21 orchestration frameworks. One consistent theme: the enforcement primitive was missing, not the intention.
A June 2026 arXiv paper cataloged 63 confirmed production budget-overrun incidents across 21 orchestration frameworks spanning 2023-2026, each backed by a quoted GitHub issue and (where reported) a documented dollar loss.[11] The researchers organized them into eight failure clusters — not eight separate problems, but eight expressions of the same root condition: no in-process enforcement primitive.
The clusters map cleanly onto where teams typically install controls after their first incident. They install logging; the failure that hits them next is not the same loop. It's a different cluster.
Cluster 1: Incompatible termination conditions. Two agents, each with correct exit logic, whose conditions are mutually satisfiable only after infinite attempts. The $47K loop was this cluster. Neither agent was broken. Their combined contract was.
Cluster 2: Unbounded retry amplification. A retry policy with no cumulative cap. Three retries on failure sounds safe. Three retries per tool call, per agent, per orchestration step, compounding across an eight-step pipeline, is not. The error rate does not change; the cost multiplier does.
Cluster 3: Context accumulation without summarization. No compression between agent handoffs. Full conversation history forwarded at each boundary. Cost grows quadratically with turns — O(n²) — because every agent reads the entire prior thread, not just the delta. Frameworks including LangGraph now ship summarization nodes precisely because this pattern is the default failure mode in multi-turn pipelines.
Cluster 4: Tool response amplification. A tool that returns arbitrarily large outputs — a database dump, a full API response, a scraped webpage — injected into context without truncation. One tool call can add 50,000 tokens to the context window and make every subsequent call in the session expensive.
Cluster 5: Parallel fan-out with no aggregate cap. An orchestrator spawning N sub-agents with individual session limits but no pipeline-level ceiling. Each sub-agent's limit trips independently, so the worst-case aggregate cost is N times the per-agent limit, with nothing capping the total. Parallel agents multiply the per-agent cost variance.
Cluster 6: Reasoning loop accumulation. Extended chain-of-thought or self-critique loops that generate tokens without making progress. The model keeps reconsidering. Tokens accumulate. The task does not advance. This cluster is more common with reasoning-optimized models (o-series, DeepSeek-R1) where verbose internal reasoning is priced the same as output tokens.
Cluster 7: Memory retrieval amplification. A retrieval layer that returns too many chunks per query — either because similarity thresholds are too loose or because the top-k is too high — injecting excess context before the model has a chance to reason. Each retrieved chunk is paid for at full input token price, across every turn that re-sends the retrieval result.
Cluster 8: Phantom delegation. A budget is delegated to a sub-agent but the parent retains a reference. Both spend against it. Neither sees the other's draw. This cluster is what the affine-typed Rust mitigation in the same paper[11] is designed to make a compile-time impossibility — making double-spending structurally unrepresentable rather than a runtime check.
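Cluster 2's compounding is easy to put numbers on. A worst-case sketch, assuming the three retry layers named there (tool call, agent, orchestration step) nest inside each other and that every attempt fails; the function name is illustrative:

```python
def worst_case_attempts(retries_per_layer: int, layers: int) -> int:
    """Worst-case attempts for one unit of work when every layer
    independently retries a failure `retries_per_layer` times."""
    # One original attempt plus the retries, compounded across nested layers.
    return (1 + retries_per_layer) ** layers


# Layers from Cluster 2: tool call, agent, orchestration step.
per_step = worst_case_attempts(retries_per_layer=3, layers=3)
pipeline = per_step * 8  # eight-step pipeline with every step failing
print(per_step, pipeline)
```

Three retries at three nested layers allows up to 64 attempts for a single step and 512 across eight steps. That is an upper bound rather than an expected value, but it is the bound a cumulative retry cap exists to enforce.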
Each agent boundary forwards the full accumulated context, not just the previous result. The cost math is not additive. It compounds.
Multi-agent systems carry a cost property single-agent systems do not: context multiplies at every handoff. When an orchestrator passes work to a research agent, then forwards the result to an analysis agent, the analysis agent does not receive the research output alone. It receives the accumulated context of the entire session — the original request, the orchestrator's planning steps, the research agent's full output — as its input.
A 3-agent sequential pipeline in which the request is 1,000 tokens and each agent produces a 500-token output looks, on a naive per-call estimate, like 3 × (1,000 input + 500 output) = 4,500 tokens. With full-context handoffs, the inputs alone come to roughly 1,000 + 1,500 + 2,000 = 4,500 tokens across the three agents. That is 4.5× the 1,000-token input of a single call, and the 1,500 output tokens come on top of it, for about 6,000 tokens in total before retry overhead enters the picture.[5] Group-chat patterns are worse. Five agents, ten rounds, 300 tokens per message: 15,000 tokens in shared context before any work happens, because every agent reads the full message history on every turn.
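The handoff arithmetic above can be reproduced in a few lines. This is a sketch under the same simplifying assumptions as the text (fixed output size per agent, full history forwarded at every boundary); the function names are illustrative:

```python
def sequential_input_tokens(base_input: int, output_per_agent: int,
                            agents: int) -> list:
    """Input size seen by each agent when every handoff forwards the
    original request plus all prior agents' outputs."""
    return [base_input + i * output_per_agent for i in range(agents)]


def group_chat_history_tokens(agents: int, rounds: int,
                              tokens_per_message: int) -> int:
    """Size of the shared message history after all rounds."""
    return agents * rounds * tokens_per_message


inputs = sequential_input_tokens(base_input=1_000, output_per_agent=500, agents=3)
total_input = sum(inputs)
total_output = 3 * 500
print(inputs)                         # per-agent input sizes
print(total_input, total_input + total_output)
print(total_input / 1_000)            # multiple of the single-call input
print(group_chat_history_tokens(agents=5, rounds=10, tokens_per_message=300))
```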
Chatbot cost intuitions do not transfer. Anthropic's own research finds agents burn 4× more tokens than direct chat, and multi-agent systems burn 15× more.[6] One team migrated a simple RAG chatbot to an agentic pipeline. Monthly inference spend jumped from $4,200 to $31,000 — same underlying tasks, different architecture.[7]
The computational cost of attention scales quadratically with context length — O(n²) in sequence length, with constant-factor optimizations but no change to the growth rate.[13] Double the context, quadruple the computation. That math shows up on your bill even if your per-token price dropped 80% this year.
The handoff format is a cost decision. Full conversation history is the most expensive option. A compressed structured summary — key facts, decisions made, constraints that must persist — is cheaper and often produces better downstream reasoning, because the receiving agent reads signal instead of noise. This is an interface design problem. Infrastructure cannot fix what the interface gets wrong.
| Handoff format | Token cost | Best for | Avoid when |
|---|---|---|---|
| Full conversation history | Highest — grows O(n²) with turns | Short pipelines (≤3 turns), debugging, tasks where the receiving agent needs full reasoning trace | Pipelines with more than 4-5 steps; tool outputs with large payloads; group-chat patterns |
| Compressed structured summary | Fixed — 200-400 tokens regardless of prior turn count | Long pipelines, iterative refinement loops, memory-constrained agents | Tasks requiring full causal trace; when summary quality cannot be verified before handoff |
| State-only extraction | Minimal — key-value facts only | Pipelines where downstream agents need decisions/constraints, not reasoning | Tasks requiring explanation of why a decision was made; debugging sessions |
| Tool output truncation | Bounded — top-K chunks or character cap | Any tool that can return unbounded content (web scrape, DB query, file read) | When completeness of the tool response is verified as necessary (rare) |
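The last row of the table is usually the cheapest control to add. A minimal sketch of truncating at the tool boundary, assuming a rough four-characters-per-token heuristic instead of a real tokenizer; `truncate_tool_output` and its defaults are hypothetical names chosen for illustration:

```python
def truncate_tool_output(text: str, max_tokens: int = 2_000,
                         chars_per_token: int = 4) -> str:
    """Cap a tool response before it is appended to the agent's context.

    chars_per_token is a rough heuristic (an assumption), not a tokenizer.
    """
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    # Leave a visible marker so the agent knows the payload was cut.
    return text[:max_chars] + f"\n[truncated: {omitted} characters omitted]"


scraped_page = "x" * 200_000          # stand-in for an unbounded web scrape
safe = truncate_tool_output(scraped_page)
print(len(scraped_page), "->", len(safe))
```

The marker matters: an agent that can see the payload was cut can ask for a narrower query instead of reasoning over a silently incomplete result.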
Asynchronous billing aggregation and synchronous in-flight sessions live on incompatible time scales. Monitoring records the process. It cannot stop it.
Every team that hits a cost spiral reaches for the same fix first: better monitoring. The reflex is correct for visibility — you need to know what happened. It cannot solve prevention, because monitoring is structurally asynchronous.
Cloud cost alerts aggregate spend over 24-48 hour rolling windows. A session burning $10 per hour accumulates $240 before a daily anomaly threshold even has the data to fire on. That assumes a tightly-calibrated threshold. A team running legitimate large batch jobs alongside agents will eat persistent false positives, train its engineers to dismiss cost alerts as noise, and then miss the real event when it arrives. One engineer who traced this failure mode in a production multi-agent framework put it cleanly: 'Per-session accounting without a synchronous enforcement point tends to lag behind the actual spike. By the time you observe the overage, the burst has already happened.'[9]
Prometheus-based dashboards have the same property. Even at one-minute scrape intervals, an alert rule must evaluate, match a condition, and route a notification before anyone can act. A zombie agent stuck in a reasoning loop burns $4-5 in a single query.[7] Multiply by concurrent sessions and the 30-90 minutes it takes for an alert to reach someone with permission to kill the process. Monitoring is necessary. It is not sufficient.
The dashboard sees outputs. It cannot stop the process in flight. For that you need enforcement in-process, synchronous, checked before each API call — not after the response returns.
Monitoring and alerting:
- Alert fires 24-48h after the spike starts (the billing aggregation window)
- Cannot stop an in-flight session, only narrate what happened to it
- Batch-job false positives erode alert credibility until nobody reads them
- By alert time, a $47K loop has been compounding for days
- Requires tight baseline calibration just to avoid alert fatigue
Synchronous SDK-level enforcement:
- BudgetExhausted raised synchronously before the next API call
- Halts the session mid-flight, so accumulation cannot continue
- Per-session scope: one runaway does not deny service to concurrent runs
- Trips at 3x P95 ($2.40 on a $0.80 workflow), not at $47,000
- Stable regardless of whether your monitoring baseline is calibrated
Enforcement runs synchronously, in-process, before each API call. Not after the response. Not from an external monitor. There is no other place it works.
SDK-level budget enforcement wraps the model client and raises an exception synchronously before the next API call when cumulative spend crosses its limit. This is structurally different from circuit breakers (which evaluate after calls complete) and from monitoring (which observes the process from outside it entirely).
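The pattern itself is small. What follows is a library-agnostic sketch, not the API of agentbudget or tokencap: `SessionBudget`, `BudgetExceeded`, `guarded_call`, and the `(text, input_tokens, output_tokens)` return shape of the model function are all assumptions made for illustration.

```python
class BudgetExceeded(RuntimeError):
    """Raised before a call is made once the session limit is reached."""


class SessionBudget:
    """One instance per agent run. Never share it across sessions."""

    def __init__(self, hard_limit_tokens, soft_limit_tokens=None, on_soft_limit=None):
        self.hard_limit = hard_limit_tokens
        self.soft_limit = soft_limit_tokens
        self.on_soft_limit = on_soft_limit
        self.used = 0
        self._soft_fired = False

    def check(self):
        """Call immediately before each model request."""
        if self.used >= self.hard_limit:
            raise BudgetExceeded(f"{self.used} tokens used, limit {self.hard_limit}")

    def record(self, input_tokens, output_tokens):
        """Call with the usage reported in the provider's response."""
        self.used += input_tokens + output_tokens
        if (self.soft_limit is not None and not self._soft_fired
                and self.used >= self.soft_limit):
            self._soft_fired = True
            if self.on_soft_limit:
                self.on_soft_limit(self.used)


def guarded_call(budget, call_model, *args, **kwargs):
    budget.check()                    # synchronous gate: raises before any request
    text, input_tokens, output_tokens = call_model(*args, **kwargs)
    budget.record(input_tokens, output_tokens)
    return text


def fake_model(prompt):
    """Stand-in for a real client call; returns text and token usage."""
    return "rejected: incomplete", 400, 100


budget = SessionBudget(hard_limit_tokens=1_000, soft_limit_tokens=500)
calls = 0
try:
    while True:                       # a loop that never terminates on its own
        guarded_call(budget, fake_model, "validate this")
        calls += 1
except BudgetExceeded:
    print(f"halted after {calls} calls, {budget.used} tokens")
```

Because the check runs before the call and usage is only known after it, a session can overshoot its limit by at most one call's worth of tokens. That bounded overshoot is the trade the pattern makes in exchange for never sending the next request.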
Two open-source libraries implement the pattern. agentbudget[8] patches the Anthropic and OpenAI SDKs, tracks every call in dollar terms, fires soft-limit callbacks for warning, and raises BudgetExhausted before the next request when the hard limit hits. It also detects loops — repeated-call patterns within a configurable time window — and trips the breaker before cost can accumulate. tokencap[10] wraps the client and tracks in token counts rather than dollars. Token counts stay accurate when providers reprice. Dollar equivalents go stale.
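Loop detection of the kind described here can be approximated with a sliding window over call fingerprints. This is a sketch of the general idea, and agentbudget's actual detection logic may differ; `LoopDetector`, its defaults, and the fingerprint format are assumptions.

```python
import time
from collections import deque


class LoopDetector:
    """Flags a call pattern that repeats too often inside a time window."""

    def __init__(self, max_repeats=5, window_seconds=60.0):
        self.max_repeats = max_repeats
        self.window = window_seconds
        self._seen = {}               # fingerprint -> deque of timestamps

    def observe(self, fingerprint, now=None):
        """Record one call; return True if it looks like a loop."""
        now = time.monotonic() if now is None else now
        stamps = self._seen.setdefault(fingerprint, deque())
        stamps.append(now)
        # Drop observations that have aged out of the window.
        while stamps and now - stamps[0] > self.window:
            stamps.popleft()
        return len(stamps) > self.max_repeats


detector = LoopDetector(max_repeats=3, window_seconds=10)
# Fingerprint could be agent name plus a hash of normalized arguments.
results = [detector.observe("validator:reject-incomplete", now=t) for t in range(5)]
print(results)
```

The fingerprint is the design decision. Too specific (raw prompt text) and a loop with slightly varying wording slips through; too coarse (agent name only) and legitimate repeated work trips the detector.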
The OpenAI Agents SDK exposes usage metadata on every run — input tokens, output tokens, cumulative session totals — via RunHooks and the context object passed to each hook.[14] That gives you the instrumentation data. It does not give you enforcement. You implement the pre-call check yourself, or use agentbudget/tokencap to wrap it.
Scope matters as much as mechanism. A global token limit that trips when any session exceeds budget denies service to every concurrent session the moment one runaway hits its ceiling. Per-session enforcement — one budget instance per agent run, never shared — means the 200 clean sessions running alongside one runaway never notice. The broken session trips its own limit. Everyone else keeps working.
Set the limit in tokens, not dollars. Token counts come from provider response metadata and stay correct. Dollar limits derived from per-token prices silently degrade when providers change pricing — and providers change pricing often. Translate your session budget to tokens once at configuration time, then enforce in tokens.
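The one-time translation can be a single function run at configuration time. A sketch, assuming a blended price built from an expected output share; the prices below are placeholders, not any provider's actual rates, and the function name is illustrative:

```python
def token_limit_from_dollars(budget_usd, usd_per_million_input,
                             usd_per_million_output, output_fraction):
    """Translate a dollar budget into a total-token limit.

    output_fraction is the expected share of output tokens in a session,
    taken from your own profiling (an assumption here).
    """
    blended_per_million = (
        (1 - output_fraction) * usd_per_million_input
        + output_fraction * usd_per_million_output
    )
    return round(budget_usd * 1_000_000 / blended_per_million)


# Placeholder prices: 3 USD per million input tokens, 15 USD per million output.
limit = token_limit_from_dollars(
    budget_usd=2.40,
    usd_per_million_input=3.0,
    usd_per_million_output=15.0,
    output_fraction=0.25,
)
print(limit)
```

Store the resulting token limit in configuration alongside the prices and date used to derive it, so a later repricing prompts a deliberate recalculation instead of silent drift.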
Enforcement limits without a profiled cost distribution are guesses. Three steps. The middle one is non-negotiable.
Before any deployment, map every execution path the agent can take. Export the workflow graph from LangGraph or your equivalent. For each node, record: average input token count, average output token count, tool call frequency, retry probability, and the context size inherited from upstream. For multi-agent pipelines, track cumulative context at every boundary — not just the current agent's input, but the full accumulated context it receives. Flag every node that can produce unbounded output (web scrape, DB query, file reader) — those are your tool-amplification candidates.
Run the agent against 200-500 production-shaped inputs in staging. Log full token traces per session — tokens per tool call, tokens per retry cycle, total session cost. Compute P50, P90, P95, P99. Sort runs by cost and assign each to a tier. Hunt for runs where per-step cost accelerated across iterations rather than holding roughly stable — those are your loop candidates. Also watch the ratio of iterative refinement tokens to initial generation tokens: if refinement is more than 3x the generation cost, your termination conditions need work.
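The percentile and loop-candidate steps can be sketched with the standard library alone. The sample data is synthetic, and the `min_steps` and `growth` thresholds in `is_accelerating` are assumptions to tune against your own traces:

```python
import statistics


def cost_percentiles(session_costs):
    """P50/P90/P95/P99 from per-session cost samples (needs 2+ samples)."""
    cuts = statistics.quantiles(session_costs, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}


def is_accelerating(step_costs, min_steps=4, growth=1.5):
    """Flag a run whose later steps cost much more than its early steps."""
    if len(step_costs) < min_steps:
        return False
    half = len(step_costs) // 2
    early = statistics.mean(step_costs[:half])
    late = statistics.mean(step_costs[half:])
    return late > growth * early


# Stand-in for logged per-session token totals from a staging run.
session_costs = list(range(1, 102))
pcts = cost_percentiles(session_costs)
print(pcts)

stable_run = [100, 110, 105, 108]     # per-step tokens, roughly flat
runaway_run = [100, 200, 400, 800]    # per-step tokens, doubling
print(is_accelerating(stable_run), is_accelerating(runaway_run))
```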
Hard session limit: 3x P95 from your profiling data. Soft-limit callback: 2x P95 — a debugging warning before the hard stop. For multi-agent pipelines, you need per-agent sub-limits and an aggregate pipeline limit. Neither is sufficient on its own. Re-profile whenever you add a tool, switch model tiers, or change workflow branching. A new tool that fetches external documents can push P95 up by 3-5x and break every threshold calibrated without it.
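A sketch of how the two kinds of limit fit together, assuming token-denominated budgets and hypothetical names (`PipelineBudget`, `LimitReached`, the agent names, and the 40,000-token P95 are all illustrative):

```python
class LimitReached(RuntimeError):
    def __init__(self, scope, used, limit):
        super().__init__(f"{scope} limit reached: {used} >= {limit} tokens")
        self.scope = scope


def limits_from_p95(p95_tokens, soft_multiple=2.0, hard_multiple=3.0):
    """Soft warning at 2x P95 and hard stop at 3x P95."""
    return {"soft": int(p95_tokens * soft_multiple),
            "hard": int(p95_tokens * hard_multiple)}


class PipelineBudget:
    """Per-agent sub-limits plus one aggregate ceiling for the pipeline."""

    def __init__(self, aggregate_limit, agent_limits):
        self.aggregate_limit = aggregate_limit
        self.agent_limits = dict(agent_limits)
        self.agent_used = {name: 0 for name in agent_limits}
        self.total_used = 0

    def check(self, agent):
        """Call before each model request made by `agent`."""
        if self.agent_used[agent] >= self.agent_limits[agent]:
            raise LimitReached(agent, self.agent_used[agent], self.agent_limits[agent])
        if self.total_used >= self.aggregate_limit:
            raise LimitReached("pipeline", self.total_used, self.aggregate_limit)

    def record(self, agent, tokens):
        self.agent_used[agent] += tokens
        self.total_used += tokens


limits = limits_from_p95(p95_tokens=40_000)
budget = PipelineBudget(
    aggregate_limit=limits["hard"],
    agent_limits={"research": 60_000, "analysis": 60_000, "writer": 60_000},
)
for agent in ("research", "analysis", "writer"):
    budget.check(agent)
    budget.record(agent, 45_000)      # each agent stays under its own cap
try:
    budget.check("research")
    tripped = None
except LimitReached as exc:
    tripped = exc.scope
print(tripped)
```

No single agent exceeded its own 60,000-token cap in this run, yet the pipeline ceiling trips. That is the case per-agent limits alone cannot catch.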
Won't hard token limits kill legitimate complex tasks?
Yes, if the limit is calibrated too tight. That is why the profiling step is non-negotiable. A limit set at P95 itself would trip on roughly 5% of sessions; a 3x P95 limit should trip far less often, mostly on the edge-case tier it exists to catch. Track the false-positive rate in the first two weeks after deployment. More than 1-2 false positives per hundred sessions means one of three things: the profiling sample was too small, the inputs were not representative of production, or your real P95 is meaningfully higher than what staging measured. Adjust upward incrementally. The alternative, a very loose limit to dodge calibration work, is not a budget control. It is a monitoring delay.
How does SDK-level enforcement interact with the circuit breaker pattern?
They operate at different granularities. SDK-level token limits check before every single API call — they prevent each new call from compounding a runaway context. Circuit breakers operate at the session or service level, typically with a trip threshold at several multiples of expected session cost, and they add graceful degradation modes — partial results, cached fallbacks, human handoff. The SDK limit catches problems earlier. The circuit breaker manages session-level cost envelopes and decides how a session shuts down when it has to. Production agent systems run both.
My agents are async across multiple processes. Can per-session enforcement still work?
Yes — with a shared backend instead of in-memory or SQLite. Both tokencap and agentbudget support Redis for multi-process coordination. The constraint that matters: budget checks and budget updates must be atomic. If two concurrent processes both read 'budget at 50%' before either update registers, you double-spend before enforcement fires. Redis atomic increments close that race. For single-process agent runtimes, the SQLite default holds. Switch to Redis when agents run in separate processes or containers.
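The race and its fix can be shown in a single process. In this sketch a `threading.Lock` stands in for the shared backend; it is not the Redis integration of tokencap or agentbudget, and across processes an atomic Redis operation would play the role the lock plays here. `AtomicBudget` and `try_reserve` are hypothetical names.

```python
import threading


class AtomicBudget:
    """Check and update the budget inside one critical section."""

    def __init__(self, limit_tokens):
        self.limit = limit_tokens
        self.used = 0
        self._lock = threading.Lock()

    def try_reserve(self, tokens):
        """Return True and deduct if the reservation fits, else False."""
        with self._lock:
            # Read and write happen together, so two callers can never
            # both see "room left" for the same remaining tokens.
            if self.used + tokens > self.limit:
                return False
            self.used += tokens
            return True


budget = AtomicBudget(limit_tokens=1_000)
results = []


def worker():
    results.append(budget.try_reserve(100))


threads = [threading.Thread(target=worker) for _ in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

granted = sum(results)
print(granted, budget.used)
```

Twenty workers compete for a budget that fits ten reservations, and exactly ten succeed. Split the read and the write into separate steps and that guarantee disappears.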
We already have a monthly spend cap at the API organization level. Isn't that enough?
No. Organization-level spend caps are a last-resort backstop, not a session-level control. They limit total monthly API spend across all sessions, users, and workflows. They do not prevent a single runaway from consuming $47,000 of that budget before the cap fires. They also provide zero per-session attribution: when the org cap triggers, you do not know which workflow caused it or which input pattern triggered the loop. Use the org cap as a safety net for catastrophic failure. Use per-session enforcement, calibrated from profiled distribution, as the primary control.
How do I profile cost distribution when agents run in parallel?
Profile the aggregate pipeline, not the individual agents. For each test run, log total input tokens across all agents that fired, total output tokens, and wall-clock elapsed time. For each agent invocation also log: which agent, input token count at that specific invocation, output token count, and cumulative context size at the moment it was invoked. That gives you both the pipeline-level distribution (for the aggregate hard limit) and per-agent distributions (for per-agent sub-limits). Parallel agents add a wrinkle: you cannot sum context sequentially. You have to track each agent's full context input separately.
Which failure cluster from the 63-incident taxonomy is most common?
The arXiv catalog[11] groups incidents into eight clusters, but the root condition is the same across all of them: no in-process enforcement primitive. Incompatible termination conditions (the $47K loop) and unbounded retry amplification appear most frequently in the GitHub issues. Context accumulation without summarization is the most subtle — it does not look like a bug at all until the cost report arrives. The practical priority: handle clusters 1-3 first (termination, retry caps, context compression), then instrument for 4-7. Cluster 8 (phantom delegation) only surfaces in systems with budget-passing between agents, but the fix there is architectural — enforce ownership, do not share budget references.
Build P50, P95, and P99 cost estimates from staging profiling before setting any budget limit. A budget built from the mean leaves the edge-case recovery tier — where runaway loops live — entirely unaccounted for. Averages are accurate summaries of the past. They are useless predictors of the tail.
A global token limit that trips when any session exceeds budget denies service to every concurrent session the moment one fails. Enforcement must be scoped to the individual agent run so one broken session cannot affect others. With 1,000 concurrent sessions and even a 1% runaway rate, a global limit is a reliability incident, not a cost control.
Monitoring, dashboards, and post-call circuit breakers all observe cost that has already been incurred. The only mechanism that prevents accumulation checks the budget synchronously before each API call. By the time a cost spike shows up in monitoring, a multi-agent loop has already multiplied context across several handoffs and the damage is done.
A single tool that returns an uncapped database query result or web scrape can inject 50,000+ tokens into the context window, making every subsequent call in the session expensive. Truncate at the tool boundary, before the response reaches the agent. This is not an agent problem — it is a tool interface problem, and it must be fixed at the interface.
A new tool that fetches external content can shift P95 session cost by 3-5x. Switching model tiers changes per-token costs. Any time the agent's tool set or model selection changes, prior profiling data is stale and enforcement thresholds need recalibration. Re-profiling belongs in the same checklist as updating tests — not in the follow-up backlog.
The 63-incident catalog[11] documents a failure class, not a string of bad luck. Every incident had monitoring. None had synchronous enforcement. The teams were not negligent — they were using the standard observability stack, which is correctly designed for stateless services. Agent sessions are not stateless. They accumulate context, they retry, they hand off. The cost model has to match the execution model.
Fix the model first. Build the distribution in staging, set the 3× P95 limit, wire a per-session budget instance before you deploy. Then keep your monitoring — not as a cost control, but as the post-incident audit trail it is actually good at. The loop that already ran is expensive. The one you prevent is free.