You approved Copilot. Then Claude Code. The invoice is a surprise and nobody owns the line item. The window for token FinOps is open right now — proxy, attribution, routing, anomaly detection. Build it before the next quarterly review.
Why token spend blows up — the three failure modes that procurement spreadsheets miss
Proxy architecture: one ingress, full attribution, budget enforcement before the call leaves the network
LiteLLM config you can run Monday — team keys, monthly caps, use-case tagging
Model routing: the 10–20x cost gap between Haiku and Opus, and how to pull it as a lever
Anomaly detection that pages at 3x spike, not at month-end invoice
The CFO metric: cost per merged PR, how to build it, why it ends the budget conversation
Implementation checklist and hard rules for rollout sequencing
Across enterprise rollouts. 90% of users stay below $30/active day. The heaviest agentic users blow past $2,000/month with no budget gate in place.
One person. One project. Two days. The first time anyone saw the number was on the invoice.
Adoption hit 84–95% by April. Per-engineer bills hit $500–$2,000/month. The COO said publicly that the link to consumer value 'is not there yet.'
One in five large orgs spending real money against zero attribution. The invoice is the dashboard.
The invoice arrives quarterly. Finance flags it. The VP of Engineering spends two days reconstructing which teams, which agents, which workflows produced the spend. The data does not exist. There is no team line item. There is no per-use-case breakdown. There is one number with nine zeroes of context missing.
AI token spend — the cost of LLM API calls across an engineering org — has become a P&L problem nobody planned for when they handed out Claude Code licenses. Seat-based tools like GitHub Copilot and Cursor sit on the visible side: budgetable, predictable, easy to put in a spreadsheet. The API layer underneath is a different animal. It charges by token, scales non-linearly with agent autonomy, and produces no natural stopping point once an engineer discovers that fanning out parallel agents finishes a refactor in an hour.
Uber's experience is the cautionary case. The company rolled Claude Code to roughly 5,000 engineers in late 2025. By April 2026 — four months later — it had consumed the entire year's AI budget[10]. Adoption hit 95%. Per-engineer bills typically ran $150–$250/month for moderate users, with heavy agentic users hitting $500–$2,000. During a single live demo, an engineer burned $1,200 in tokens in two hours. The COO went on record saying it was hard to draw a line from rising token consumption to actual consumer features shipped.
The orgs that survived cloud bill shock in 2013 built FinOps: meter the resource, allocate it to teams, put visible budgets on business units, page someone before the invoice arrives. The same window is open for AI token spend. The teams that build the governance architecture now will own clean P&L attribution and controllable cost curves. The teams that wait will explain a surprise invoice every quarter.
One counterintuitive read: the teams spending the most on tokens often have the best unit economics, not the worst. A platform team burning $8,000/month and shipping 200 PRs at $40 each is dramatically more efficient than a team burning $1,200/month and shipping 8 PRs at $150 each. Absolute spend is the wrong signal. Cost per shipped unit of work is the right one — and you cannot calculate it without attribution infrastructure underneath every call.
Same spending pattern, same missing controls, same short window before the bill becomes structural and the org learns to live with it.
In 2013, AWS billing was chaos at most engineering orgs. Teams spun up infrastructure with no central visibility. Costs scaled with usage. Invoices arrived 30 days late. Finance had one line item — "cloud infrastructure" — and no way to ask which product, which team, or which architectural decision produced a spike.
The response was FinOps. Meter every resource. Tag it to a cost center. Surface it in near-real-time. Build chargeback so business units own what they consume. By 2018, mature engineering orgs had cost allocation tags on every AWS resource, per-team budget alerts, and anomaly detection that paged on unexpected spikes. The discipline made cloud spend ownable.
AI token spend is in the 2013 moment. The State of FinOps 2026 report shows 98% of FinOps respondents now manage AI spend — up from 63% the previous year and 31% in 2024, the fastest adoption curve the FinOps Foundation has ever recorded[5]. That is the leading edge. The median engineering org still treats AI API costs as one homogeneous line with zero team-level attribution.
The FinOps Foundation's AI working group names the structural difference: "the unit economics of generative AI are fundamentally different from cloud infrastructure — variable by model, prompt complexity, and agent autonomy, not just by hours provisioned."[8] That variability is why the old controls break. A monthly seat license is predictable. A per-token API call from an autonomous agent that retries on failure is not.
The showback vs. chargeback distinction matters here. Showback means your team can see a number — 200 million tokens consumed last month — but nothing changes in the budget. Chargeback means that number flows back as a real cost to the business unit that incurred it, changing how teams build and how aggressively they scale AI experiments. Most orgs start with showback and treat it as progress. It is not. Showback with no consequence produces the same behavior as no visibility at all.
One invoice line: 'AI tooling — $XX,XXX'
Zero team-level attribution
No budget by use case — copilot, agent, batch job all collapsed
Surprises surface at quarterly review
Engineers optimize for speed; cost is invisible
CFO cannot connect spend to shipped work
Model selection is individual preference, not policy
Per-team, per-use-case attribution in real time
Monthly budgets enforced at the proxy, not by memo
Automated alerts when daily spend crosses 3x baseline
Model routing by task complexity — Haiku for autocomplete, Opus for agents
Chargeback to business unit cost centers, not absorbed by engineering
ROI dashboard correlating token spend to shipped features
Finance has a line item they understand and can plan against
Seat-license thinking misses every one of them. The cost curve does not behave like a SaaS subscription, and pretending it does is how the surprise invoice happens.
Token spend explodes in three distinct patterns. Each demands a different control. None of them is solvable with a procurement spreadsheet.
Agentic loops are the dominant cost vector. An engineer runs Claude Code in autonomous mode against a large repo. The agent reads hundreds of files, writes code, runs tests, reads failures, retries. A two-hour session that produces solid output might burn $50–80 in API costs. Nothing alarming on its own. But that engineer runs three sessions a day across four parallel tasks, and the math turns into $200/day or $4,000+/month from one person[1]. Multiply by a 40-engineer org with a fresh agentic-workflow mandate from the CTO and the monthly number becomes structural — and load-bearing on the next funding conversation.
The official Claude Code documentation is explicit about the multiplier: agent teams spawn multiple Claude instances, each with its own context window, using approximately 7x more tokens than standard single-agent sessions[11]. A team of four Claude Code agent-teammates on a complex refactor can consume a week's budget in an afternoon.
Unoptimized model selection compounds the loop problem. Engineers route everything to the most capable model because it produces better results, and there is no force pulling them toward the cheaper one. The cost gap between Haiku and Opus on the same task runs 10–20x. At mid-2026 API pricing — roughly $1/M input tokens for Haiku vs. $5/M for Opus — a team routing 70% of calls to Haiku, 20% to Sonnet at $3/M, and 10% to Opus achieves a weighted input cost of $1.60/M instead of $5/M baseline[12]. That is a 68% cost reduction from routing alone, with no quality loss on the tasks that fit the lower tier.
Tokenmaxxing is the behavioral pattern that emerges from invisible budgets. Practitioners coined the term for engineers who maximize token consumption — running agent swarms, keeping large contexts pinned, retrying aggressively — because in the absence of a personal cost signal it is rational career play[6]. More tokens, faster output, better review cycle. The behavior is not abuse. It is a correct response to the incentive structure. Make the cost invisible and tokenmaxxing is the equilibrium.
Route every LLM call through one cost-attribution layer. No exceptions. The proxy is the foundation that makes budgets, routing, and anomaly detection possible at all.
The core of token governance is a proxy that sits between engineering tools and model provider APIs. Every LLM call — from Claude Code, from internal agents, from CI/CD pipelines, from product features — passes through this single ingress. The proxy tags each call with team, use case, and outcome metadata, records the cost in a database, and enforces budget limits before the call leaves the network.
LiteLLM is the most widely-deployed open source option. It runs a hierarchical multi-tenant model: organization → team → user → key[9]. Budgets cascade down the hierarchy. Every API call carries the full attribution chain. Portkey offers a managed alternative with workspaces, roles, and budget controls baked in. For teams already on Bedrock or Vertex, both platforms also work — Claude Code on those clouds doesn't send metrics automatically, which is exactly why large enterprises running on Bedrock have specifically turned to LiteLLM to get per-key spend tracking[11].
The diagram below traces the path from engineer tooling to model provider. Cost attribution is captured at the proxy and surfaced in a dashboard. The single-ingress rule is what makes the rest of the architecture work. Any source that bypasses the proxy is a hole in the attribution model — and one hole is enough to make the entire cost dataset untrustworthy.
With the proxy live, mint a virtual key per team and bind it to a monthly budget. Teams embed the key in their tooling — Claude Code, LangChain, custom agents — and every call routes through the proxy without further engineering effort[3].
The proxy records the full attribution chain: which team, which user, which model, input and output tokens, and any metadata the caller passed. The metadata field is where use-case attribution lives. Tag calls with use_case: code-review or use_case: autonomous-agent and the dashboard breaks spend down by workflow, not just by team. Without the tag the data still arrives — but answering "what did we spend on agents last month" turns into a guess.
LiteLLM also supports tag-based budgets that cross team boundaries[13]. If you want to cap total org-wide spend on autonomous-agent tasks regardless of which team is running them, a tag budget handles it. This is the governance layer for use-case cost control — teams have their own budgets, and specific high-risk task categories have an additional ceiling on top.
Budget caps stop catastrophe. Model routing reduces baseline spend. The 10–20x gap between Haiku and Opus is the largest cost lever you have, and most orgs are not pulling it.
Budget enforcement prevents catastrophic overruns. Model routing reduces baseline spend. The two controls operate at different layers and compound — one is the circuit breaker, the other is the rate at which the meter spins.
The routing principle is mechanical: match model capability to task requirement. IDE autocomplete needs fast response and basic completion quality. Haiku handles it at a fraction of the Opus price. A multi-step architecture review over a complex codebase actually benefits from Opus depth. Routing every call to the same tier because that is what the API key defaults to wastes money on the easy work and offers no extra fidelity on the hard work.
At mid-2026 API pricing, the gap is concrete: Haiku runs at roughly $1/M input tokens, Sonnet at $3/M, Opus at $5/M[12]. A team with a typical distribution — 70% simple/autocomplete tasks, 20% standard review and generation, 10% complex agentic or architectural work — pays a weighted average of $1.60/M by routing correctly, vs. $5/M by defaulting everything to Opus. That is a 68% reduction in base token costs before any other optimization.
| Task Type | Recommended Model | Rationale | Approx. Cost vs. Opus Baseline |
|---|---|---|---|
| IDE autocomplete / inline suggestion | Claude Haiku | Speed matters more than depth; context is small | ~5% of Opus cost |
| Code explanation / docstring generation | Claude Haiku | Well-defined, bounded task; little ambiguity | ~5% of Opus cost |
| Code review — single PR | Claude Sonnet | Needs judgment on patterns, security, style | ~60% of Opus cost |
| Test generation for existing function | Claude Sonnet | Moderate complexity; clear success criteria | ~60% of Opus cost |
| Multi-file refactor with dependencies | Claude Sonnet | Context-heavy but structured; Sonnet sufficient | ~60% of Opus cost |
| Architecture review / system design | Claude Opus | Requires deep reasoning over ambiguous tradeoffs | 100% (baseline) |
| Autonomous multi-step agent (planning loop) | Claude Opus | Agent orchestration quality significantly affects outcome | 100% (baseline) |
| Batch summarization / classification jobs | Claude Haiku | High volume, low complexity; cost savings compound | ~5% of Opus cost |
The technical work is straightforward once the proxy is live. Assigning the budgets is the political fight, and that fight is the actual project.
Before setting budgets, build a complete map of where calls originate. Engineering tools (Claude Code, Cursor, Copilot), internal agents, CI/CD pipelines, application features — everything must route through the proxy. Any uncovered source is a budget hole, and one hole is enough to break the attribution model.
The attribution schema chosen now decides what questions the org can answer in six months. Team attribution is the floor. Use-case attribution — copilot vs. agent vs. batch — is what lets you have a real conversation about value per dollar instead of a frustrated one about totals.
Do not set budgets before the data exists. Run the proxy in observation mode for 60 days — log everything, enforce nothing. The baseline tells you what normal spend looks like. Set soft limits at roughly 130–150% of the 60-day baseline, hard cutoffs at 200%. A budget set from a guess will either block legitimate work or constrain nothing at all.
Internal governance is for the VP Engineering and the platform team. The finance report is for the CFO and business unit leads. Two audiences, two outputs. The proxy database holds everything you need — the work is the export and the schedule. The team that walks into the QBR with a cost-per-PR number owns the narrative. The team that scrambles for the data on Friday produces a number nobody trusts.
A budget limit is a hard stop. Anomaly detection is the page that fires while there is still time to investigate, not just block.
Budget enforcement is a hard stop. Anomaly detection is an early warning. You want both. A hard stop at month-end tells you the budget is gone. It tells you nothing about which workflow consumed it. An anomaly alert at 3x daily baseline gives a live signal — investigate now, while the cause is still on someone's screen.
The detection model is simple. Compute the rolling 7-day daily-spend average per team. Compare today's spend to that average at 4 PM. Page the team lead when the ratio crosses your threshold. A 3x spike on a Tuesday afternoon almost always has a specific cause — a new agent workflow shipped that morning, a CI pipeline accidentally triggering agents on every push, a developer manually running a large batch job.
LiteLLM exports spend data via its API and a Prometheus metrics endpoint. Grafana with a Prometheus data source is sufficient for the alerting layer. No need for Datadog unless it is already in the stack. The flow from spend data to alert is in the diagram below.
One operational mistake from our own rollout: we set the anomaly threshold at 2x instead of 3x. The false positive rate was high enough that team leads started ignoring alerts inside two weeks — alert fatigue, the standard failure mode. At 3x, alerts fire roughly twice a month per team and are almost always actionable. Calibrate the threshold to what team leads will actually investigate, not to what catches every minor fluctuation. The alert that gets ignored is worse than no alert.
Finance does not need to understand tokens. They need a cost-per-outcome number and a trend line that fits in a board deck. Anything else is noise.
The line that wins the CFO conversation: cost per merged PR is the number that makes AI spend legible to finance. A team spending $3,200 last month and shipping 85 PRs lands at $37.65/PR. A comparable team spending $1,100 and shipping 20 PRs lands at $55/PR. The first team is more efficient even though absolute spend is higher. The framing flips the conversation from "AI is expensive" to "here is the ROI on the AI investment, and here is the trend."
Build this report before finance asks for it. The org that walks into the quarterly review with a cost-per-outcome dashboard owns the narrative. The org that gets asked for the data scrambles for three days and produces a number nobody trusts. Trust on the cost story is built in advance or not at all.
Uber's experience makes this concrete. The COO's public statement — "that link is not there yet" between token spend and consumer value — is what happens when no one built the cost-per-outcome report before the budget ran out[10]. The technical infrastructure to produce that number was absent, so leadership was left with a raw spend figure and no way to defend it. That is the conversation you are trying to prevent.
The DX 2026 survey found 86% of engineering leaders unsure which AI tools deliver the most value, and 40% lacking the data to demonstrate ROI[2]. The token governance architecture solves both problems with one piece of infrastructure — the cost data falls out of the proxy, and the ROI data falls out of the proxy correlated with output metrics.
Not every engineering org needs a full proxy-based governance stack today. Here's the honest decision boundary.
| Situation | Recommended Action | Why |
|---|---|---|
| All tooling on fixed-seat licenses (Copilot, Cursor) — no direct API access | Wait. Monitor. | Seat licenses are predictable. No token exposure. Revisit when any team requests API access. |
| 3+ engineers using Claude Code or any agent framework with direct API keys | Build now. | Three engineers in agentic mode can generate $5K–$10K/month. The proxy pays for itself in one caught incident. |
| CI/CD pipelines calling LLM APIs (code review bots, test generators) | Build now. | Pipelines run on every push. A misconfigured job can burn a month's budget in hours. Hard to detect without the proxy. |
| Product features calling LLM APIs (customer-facing, or internal ops tools) | Build now. | Product call volumes scale with users, not with engineering headcount. Cost surprises here are larger and faster. |
| Fewer than 5 engineers, early exploration phase | Showback only. | Set up the proxy in logging mode. Get baseline data. Set hard budgets once you have 60 days of spend history. |
| Enterprise rollout (50+ engineers, multiple teams, agentic workflows mandated) | Build now. Yesterday. | This is Uber's situation. By the time the invoice arrives, the pattern is structural and the conversation is defensive. |
The same five objections come up in every token governance conversation. Here are the answers.
Won't hard limits block engineers at the worst possible moment?
Soft limits handle this. Alerts at 80%, amber at 110%, hard stop at 150%. The engineers most likely to hit a hard limit are running unbounded agent loops the soft limit would have caught hours earlier. Design the tiers correctly and almost all legitimate work stays in the green. The hard stop exists for the runaway loop, not for the daily commit.
What about teams that need to run large batch jobs that temporarily spike spend?
Budget exemptions work the same way they do in cloud FinOps. Team lead requests a temporary increase for a specific window and use case. VP Engineering approves. The proxy gets one API call. The process creates an audit trail — who asked, why, what they ran — which feeds future budget planning. The alternative is unlimited spend with no record of who consumed it. That alternative is what you have now.
Do we need this if we only use seat-licensed tools?
If every engineer is on a fixed-seat Copilot plan and nothing touches the API layer, the seat license is a predictable line item and the proxy is overkill. The moment any team starts using Claude Code, LangChain agents, or hitting Anthropic/OpenAI APIs directly, you need the attribution layer. The trend is one-directional toward API access as engineers move from copilots to agents. Build the proxy before the API surface area expands, not after.
How do we handle engineers who feel surveilled by the dashboards?
Frame it correctly from day one. Individual-level dashboards are for the engineer's benefit — they can see whether they are on track for the month before a hard limit hits. Team-level dashboards are for planning. The goal is not to catch anyone doing something wrong; it is to give every team a predictable budget to plan against. Engineers at well-run orgs do not feel surveilled by CloudWatch cost alerts. Token budgets are no different when the framing is right and the data is shared, not hidden.
We already run Datadog. Do we need Grafana too?
No. LiteLLM exports spend via Prometheus metrics, and Datadog ingests those directly through its Prometheus integration. Build the dashboards alongside existing observability. The Grafana reference in the architecture is illustrative — any metrics layer works. The proxy is the load-bearing component. The visualization tool is interchangeable.
What's the fastest path to a working proxy if we have zero infrastructure today?
LiteLLM has a one-command Docker deployment and a Railway template that runs in minutes. The minimal viable stack: LiteLLM proxy + Postgres for the spend ledger + one Grafana dashboard. Teams embed their virtual key in Claude Code and point ANTHROPICBASEURL at the proxy. You can have logging and basic alerting in a day. Full budget enforcement with model routing takes a week of platform work, not months.
A budget set from intuition either blocks legitimate work or constrains nothing. Baseline data gives the right number. Two months of observation costs almost nothing. A misset limit that blocks a deploy at 4 AM costs trust and incident time, and the trust is harder to recover than the time.
One team with a hardcoded API key bypassing the proxy breaks the entire attribution model. Partial coverage is not coverage. A spike from an untracked source is indistinguishable from a tracking failure, and the longer it persists the more the cost data becomes a fiction. Treat unproxied keys the same way you treat unapproved cloud credentials.
Total spend rises as the team does more. That is growth, not a problem. The metric finance cares about is efficiency: more output per dollar, over time. Cost-per-PR captures it. Present that number in QBRs unless finance explicitly asks for the breakdown underneath.
If model selection sits with the individual engineer, every call lands on the most capable model regardless of the task. Routing rules embedded in the proxy are non-negotiable. Document the policy, name the reasoning, enforce it programmatically. A policy doc that nobody reads does not change behavior. A pre-call hook that overrides the model does.
Showing a team their token consumption with no budget consequence produces the same behavior as having no dashboard at all. The business unit must feel the cost — either through a real chargeback to their budget or through a hard proxy limit that stops new calls. Visibility alone doesn't change incentives. Financial consequence does.
Why production inference bills always exceed estimates — and the Finance-Engineering governance framework for per-agent budgets, model routing, context compression, and cost forecasting without capability degradation.
46% of AI proofs of concept never ship. The gap is not technical. It is structural: PoC culture rewards experimentation and punishes shipping. A 90-day decision gate, an operational owner, and an incentive rewrite — or pilot purgatory wins again.
Launches get conference talks. Retirements get archived repos and live credentials. Five sequential phases — audit, extract, shadow, communicate, shut down — and the security blast radius when you skip any of them.