Top of the range is engineers running multi-step agents against large repos. No budget gate. No personal cost signal.
One person. One project. Two days. The first time anyone saw the number was on the invoice.
Seat licenses are the visible part. API spend rides underneath, usually untracked.
One in five large orgs spending real money against zero attribution. The invoice is the dashboard.
The invoice arrives quarterly. Finance flags it. The VP of Engineering spends two days reconstructing which teams, which agents, which workflows produced the spend. The data does not exist. There is no team line item. There is no per-use-case breakdown. There is one number with nine zeroes of context missing.
AI token spend — the cost of LLM API calls across an engineering org — has become a P&L problem nobody planned for when they handed out Claude Code licenses. Seat-based tools like GitHub Copilot and Cursor sit on the visible side: budgetable, predictable, easy to put in a spreadsheet. The API layer underneath is a different animal. It charges by token, scales non-linearly with agent autonomy, and produces no natural stopping point once an engineer discovers that fanning out parallel agents finishes a refactor in an hour.
The orgs that survived cloud bill shock in 2013 built FinOps: meter the resource, allocate it to teams, put visible budgets on business units, page someone before the invoice arrives. The same window is open for AI token spend. The teams that build the governance architecture now will own clean P&L attribution and controllable cost curves. The teams that wait will explain a surprise invoice every quarter for the rest of the decade.
One counterintuitive read: the teams spending the most on tokens often have the best unit economics, not the worst. A platform team burning $8,000/month and shipping 200 PRs at $40 each is dramatically more efficient than a team burning $1,200/month and shipping 8 PRs at $150 each. Absolute spend is the wrong signal. Cost per shipped unit of work is the right one — and you cannot calculate it without attribution infrastructure underneath every call.
AI Token Spend Is AWS in 2013. The Window Is Open Again.
Same spending pattern, same missing controls, same short window before the bill becomes structural and the org learns to live with it.
In 2013, AWS billing was chaos at most engineering orgs. Teams spun up infrastructure with no central visibility. Costs scaled with usage. Invoices arrived 30 days late. Finance had one line item — "cloud infrastructure" — and no way to ask which product, which team, or which architectural decision produced a spike.
The response was FinOps. Meter every resource. Tag it to a cost center. Surface it in near-real-time. Build chargeback so business units own what they consume. By 2018, mature engineering orgs had cost allocation tags on every AWS resource, per-team budget alerts, and anomaly detection that paged on unexpected spikes. The discipline made cloud spend ownable.
AI token spend is in the 2013 moment. The State of FinOps 2026 report shows 98% of FinOps respondents now manage AI spend — up from 63% the previous year — and 58% have implemented showback or chargeback for it[5]. That is the leading edge. The median engineering org still treats AI API costs as one homogeneous line with zero team-level attribution.
The FinOps Foundation's AI working group names the structural difference: "the unit economics of generative AI are fundamentally different from cloud infrastructure — variable by model, prompt complexity, and agent autonomy, not just by hours provisioned."[8] That variability is why the old controls break. A monthly seat license is predictable. A per-token API call from an autonomous agent that retries on failure is not.
One invoice line: 'AI tooling — $XX,XXX'
Zero team-level attribution
No budget by use case — copilot, agent, batch job all collapsed
Surprises surface at quarterly review
Engineers optimize for speed; cost is invisible
CFO cannot connect spend to shipped work
Model selection is individual preference, not policy
Per-team, per-use-case attribution in real time
Monthly budgets enforced at the proxy, not by memo
Automated alerts when daily spend crosses 3x baseline
Model routing by task complexity — Haiku for autocomplete, Opus for agents
Chargeback to business unit cost centers, not absorbed by engineering
ROI dashboard correlating token spend to shipped features
Finance has a line item they understand and can plan against
Three Patterns That Blow the Budget
Seat-license thinking misses every one of them. The cost curve does not behave like a SaaS subscription, and pretending it does is how the surprise invoice happens.
Token spend explodes in three distinct patterns. Each demands a different control. None of them is solvable with a procurement spreadsheet.
Agentic loops are the dominant cost vector. An engineer runs Claude Code in autonomous mode against a large repo. The agent reads hundreds of files, writes code, runs tests, reads failures, retries. A two-hour session that produces solid output might burn $50–80 in API costs. Nothing alarming on its own. But an engineer who runs three such sessions a day, spread across four parallel tasks, reaches roughly $200/day, or $4,000+ over a 20-working-day month, from one person[1]. Multiply by a 40-engineer org with a fresh agentic-workflow mandate from the CTO and the monthly number becomes structural, and load-bearing on the next funding conversation.
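The arithmetic behind that projection can be sketched in a few lines. This is a back-of-the-envelope model, not a billing formula: the function name is made up for illustration, and the 20 working days per month is an assumption chosen because it matches the $200/day to $4,000/month step in the text.

```python
def project_monthly_spend(cost_per_session, sessions_per_day,
                          working_days=20, engineers=1):
    """Project monthly API spend in USD from a per-session cost.

    working_days=20 is an assumption, not a figure from any invoice.
    """
    return cost_per_session * sessions_per_day * working_days * engineers


# One engineer, three sessions a day, at the low and high end of $50-80.
one_low = project_monthly_spend(50, 3)
one_high = project_monthly_spend(80, 3)

# The same habit across a 40-engineer org.
org_low = project_monthly_spend(50, 3, engineers=40)
org_high = project_monthly_spend(80, 3, engineers=40)

print(f"one engineer: ${one_low:,} to ${one_high:,} per month")
print(f"40 engineers: ${org_low:,} to ${org_high:,} per month")
```

The single-engineer range brackets the $4,000 figure above; the org-wide range is what happens if every engineer adopts the same pattern, which is an upper bound rather than a forecast.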
Unoptimized model selection compounds the loop problem. Engineers route everything to the most capable model because it produces better results, and there is no force pulling them toward the cheaper one. The cost gap between Haiku and Opus on the same task runs 10–20x. When every call hits the most expensive model regardless of difficulty, you are paying premium prices for autocomplete and not saving anything for the workflows that genuinely need depth.
Tokenmaxxing is the behavioral pattern that emerges from invisible budgets. Practitioners coined the term for engineers who maximize token consumption — running agent swarms, keeping large contexts pinned, retrying aggressively — because in the absence of a personal cost signal it is rational career play[6]. More tokens, faster output, better review cycle. The behavior is not abuse. It is a correct response to the incentive structure. Make the cost invisible and tokenmaxxing is the equilibrium.
The Proxy Is the Control Plane. Everything Else Bolts to It.
Route every LLM call through one cost-attribution layer. No exceptions. The proxy is the foundation that makes budgets, routing, and anomaly detection possible at all.
The core of token governance is a proxy that sits between engineering tools and model provider APIs. Every LLM call — from Claude Code, from internal agents, from CI/CD pipelines, from product features — passes through this single ingress. The proxy tags each call with team, use case, and outcome metadata, records the cost in a database, and enforces budget limits before the call leaves the network.
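To make the three responsibilities concrete (tag, record, enforce before the call leaves), here is a toy in-memory sketch. It is not how LiteLLM or Portkey implement enforcement; `BudgetGate`, `BudgetExceeded`, and the `send` callable are hypothetical names used only to show the order of operations.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a team has already used up its budget."""


@dataclass
class BudgetGate:
    budgets: dict                                # team -> budget in USD for the period
    spent: dict = field(default_factory=dict)    # team -> USD spent so far
    ledger: list = field(default_factory=list)   # one attribution row per call

    def forward(self, team, use_case, model, send):
        """Check the budget, forward the call, then record what it cost.

        `send` is a hypothetical zero-argument callable that performs the
        upstream request and returns its cost in USD.
        """
        if self.spent.get(team, 0.0) >= self.budgets[team]:
            raise BudgetExceeded(f"{team} is over budget")
        cost_usd = send()
        self.spent[team] = self.spent.get(team, 0.0) + cost_usd
        self.ledger.append({"team": team, "use_case": use_case,
                            "model": model, "cost_usd": cost_usd})
        return cost_usd


gate = BudgetGate(budgets={"payments-squad": 1.00})
gate.forward("payments-squad", "code-review", "claude-sonnet", lambda: 0.60)
gate.forward("payments-squad", "code-review", "claude-sonnet", lambda: 0.60)
try:
    gate.forward("payments-squad", "code-review", "claude-sonnet", lambda: 0.60)
    blocked = False
except BudgetExceeded:
    blocked = True
print("third call blocked:", blocked, "| ledger rows:", len(gate.ledger))
```

Because the check runs before the call and the cost is only known afterwards, a team can overshoot its budget by at most one call. That trade-off is deliberate in this sketch: it keeps the gate simple and never blocks a call that started under budget.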
LiteLLM is the most widely deployed open-source option. It runs a hierarchical multi-tenant model: organization → team → user → key[9]. Budgets cascade down the hierarchy. Every API call carries the full attribution chain. Portkey offers a managed alternative with workspaces, roles, and budget controls baked in.
The diagram below traces the path from engineer tooling to model provider. Cost attribution is captured at the proxy and surfaced in a dashboard. The single-ingress rule is what makes the rest of the architecture work. Any source that bypasses the proxy is a hole in the attribution model.
litellm_config.yaml
# One proxy, every call. Budgets cascade org -> team -> user -> key.
model_list:
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY
litellm_settings:
  # Per-team spend recorded in real time. The proxy is the source of truth.
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]
general_settings:
  database_url: os.environ/DATABASE_URL
  master_key: os.environ/LITELLM_MASTER_KEY
  store_model_in_db: true
  # Default budget for every new team. Tune after 60 days of baseline.
  default_team_settings:
    max_budget: 500 # $500/month per team
    budget_duration: 30d
    tpm_limit: 2000000 # 2M tokens/minute hard ceiling

With the proxy live, mint a virtual key per team and bind it to a monthly budget. Teams embed the key in their tooling (Claude Code, LangChain, custom agents) and every call routes through the proxy without further engineering effort[3].
The proxy records the full attribution chain: which team, which user, which model, input and output tokens, and any metadata the caller passed. The metadata field is where use-case attribution lives. Tag calls with use_case: code-review or use_case: autonomous-agent and the dashboard breaks spend down by workflow, not just by team. Without the tag the data still arrives — but answering "what did we spend on agents last month" turns into a guess.
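A minimal sketch of what the tag buys you at reporting time. The record shape here (`team`, `cost_usd`, `metadata`) is a hypothetical simplification, not the proxy's actual log schema; the point is that untagged calls still get counted but fall into a bucket nobody can interpret.

```python
from collections import defaultdict


def spend_by(records, *keys):
    """Sum cost_usd over records, grouped by the given keys.

    A key is looked up on the record first, then in its metadata dict.
    Calls missing the key land in an explicit "untagged" bucket.
    """
    totals = defaultdict(float)
    for record in records:
        metadata = record.get("metadata") or {}
        group = tuple(record.get(k) or metadata.get(k) or "untagged"
                      for k in keys)
        totals[group] += record["cost_usd"]
    return dict(totals)


# Hypothetical spend rows for one team.
records = [
    {"team": "payments-squad", "cost_usd": 12.5,
     "metadata": {"use_case": "autonomous-agent"}},
    {"team": "payments-squad", "cost_usd": 3.0,
     "metadata": {"use_case": "code-review"}},
    {"team": "payments-squad", "cost_usd": 7.5,
     "metadata": {"use_case": "autonomous-agent"}},
    {"team": "payments-squad", "cost_usd": 4.0, "metadata": {}},
]

by_use_case = spend_by(records, "use_case")
for group, total in sorted(by_use_case.items()):
    print(group, total)
```

Watching the size of the "untagged" bucket over time is a cheap way to measure how well the tagging convention is actually being followed.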
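create_team_budget.sh
# Create a team with a hard monthly budget. One API call, no ceremony.
# The Authorization header uses double quotes so the shell expands $LITELLM_MASTER_KEY.
curl -X POST 'http://your-litellm-proxy:4000/team/new' \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "team_alias": "payments-squad",
    "max_budget": 800,
    "budget_duration": "30d",
    "tpm_limit": 1000000,
    "metadata": {
      "cost_center": "ENGG-PAYMENTS",
      "team_lead": "sarah@company.com",
      "budget_owner": "vp-engineering"
    }
  }'
# The response includes a team_id. Mint the team's virtual key against it:
curl -X POST 'http://your-litellm-proxy:4000/key/generate' \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{"team_id": "<team_id from the response above>"}'
# Engineers point their tools at the proxy with that key:
# ANTHROPIC_API_KEY=sk-litellm-payments-squad-xxxx
# ANTHROPIC_BASE_URL=http://your-litellm-proxy:4000

Routing Is the Lever, Not the Limit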
Budget caps stop catastrophe. Model routing reduces baseline spend. The 10–20x gap between Haiku and Opus is the largest cost lever you have, and most orgs are not pulling it.
The two controls operate at different layers, and they compound. The budget cap is the circuit breaker; routing sets the rate at which the meter spins.
The routing principle is mechanical: match model capability to task requirement. IDE autocomplete needs fast response and basic completion quality. Haiku handles it at a fraction of the Opus price. A multi-step architecture review over a complex codebase actually benefits from Opus depth. Routing every call to the same tier because that is what the API key defaults to wastes money on the easy work and offers no extra fidelity on the hard work.
| Task Type | Recommended Model | Rationale | Approx. Relative Cost |
|---|---|---|---|
| IDE autocomplete / inline suggestion | Claude Haiku | Speed matters more than depth; context is small | 1x baseline |
| Code explanation / docstring generation | Claude Haiku | Well-defined, bounded task; little ambiguity | 1x baseline |
| Code review — single PR | Claude Sonnet | Needs judgment on patterns, security, style | ~5x baseline |
| Test generation for existing function | Claude Sonnet | Moderate complexity; clear success criteria | ~5x baseline |
| Multi-file refactor with dependencies | Claude Sonnet | Context-heavy but structured; Sonnet sufficient | ~5x baseline |
| Architecture review / system design | Claude Opus | Requires deep reasoning over ambiguous tradeoffs | ~20x baseline |
| Autonomous multi-step agent (planning loop) | Claude Opus | Agent orchestration quality significantly affects outcome | ~20x baseline |
| Batch summarization / classification jobs | Claude Haiku | High volume, low complexity; cost savings compound | 1x baseline |
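To get a feel for the size of the lever, the relative costs in the table can be applied to a call mix. This is a rough sketch: the mix below is invented for illustration, and it assumes every call uses a similar number of tokens, which real workloads do not. The multipliers (1x, 5x, 20x) are the approximate figures from the table.

```python
# Approximate relative cost per call, taken from the table above.
RELATIVE_COST = {"haiku": 1, "sonnet": 5, "opus": 20}

# Task-to-tier mapping for the task types used in the example mix.
TIER_BY_TASK = {
    "autocomplete": "haiku",
    "batch-classify": "haiku",
    "code-review": "sonnet",
    "test-generation": "sonnet",
    "refactor": "sonnet",
    "architecture-review": "opus",
    "agent": "opus",
}


def relative_spend(call_mix, tier_for):
    """Total spend in baseline units for a {task: call_count} mix."""
    return sum(count * RELATIVE_COST[tier_for(task)]
               for task, count in call_mix.items())


# Hypothetical mix of 1,000 calls.
mix = {"autocomplete": 700, "code-review": 200, "agent": 100}

all_opus = relative_spend(mix, lambda task: "opus")
routed = relative_spend(mix, lambda task: TIER_BY_TASK[task])
savings = 1 - routed / all_opus

print(f"everything on Opus: {all_opus} units")
print(f"routed by task:     {routed} units")
print(f"reduction:          {savings:.1%}")
```

The exact percentage depends entirely on the mix. The structural point is that high-volume, low-complexity calls dominate the count, so moving them down a tier moves the total far more than trimming the expensive calls.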
routing_config.yaml
# Task-complexity routing. Model selection is enforced at the proxy, not asked of the engineer.
router_settings:
  routing_strategy: usage-based-routing
  # Fallback chain when the primary tier is unavailable.
  fallbacks:
    - {"claude-opus": ["claude-sonnet"]}
    - {"claude-sonnet": ["claude-haiku"]}
# Engineers tag calls with use_case metadata. The proxy reads the tag and overrides the model.
#
# client.messages.create(
#     model="claude-sonnet", # caller's request
#     metadata={"use_case": "code-review", "team": "payments-squad"},
#     ...
# )
#
# Routing hook overrides the requested model based on the tag.
litellm_settings:
  # <module>.<instance>: points at the hook object created at the bottom of routing_hook.py
  callbacks: routing_hook.proxy_handler_instance
# routing_hook.py: pin model selection to use_case. Policy in code, not in a wiki page.
# from litellm.integrations.custom_logger import CustomLogger
#
# class RoutingHook(CustomLogger):
#     async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
#         use_case = (data.get("metadata") or {}).get("use_case", "")
#         if use_case in ("autocomplete", "inline-suggestion", "batch-classify"):
#             data["model"] = "claude-haiku"
#         elif use_case in ("code-review", "test-generation", "refactor"):
#             data["model"] = "claude-sonnet"
#         # Any other use_case keeps the model the caller requested.
#         return data
#
# proxy_handler_instance = RoutingHook()

Per-Team Budgets in Four Steps
The technical work is straightforward once the proxy is live. Assigning the budgets is the political fight, and that fight is the actual project.
- [01]
Inventory Every LLM Call Source
Before setting budgets, build a complete map of where calls originate. Engineering tools (Claude Code, Cursor, Copilot), internal agents, CI/CD pipelines, application features — everything must route through the proxy. Any uncovered source is a budget hole, and one hole is enough to break the attribution model.
- [02]
Define the Team and Use-Case Taxonomy
The attribution schema chosen now decides what questions the org can answer in six months. Team attribution is the floor. Use-case attribution — copilot vs. agent vs. batch — is what lets you have a real conversation about value per dollar instead of a frustrated one about totals.
- [03]
Set Budgets From Baseline, Not From Intuition
Do not set budgets before the data exists. Run the proxy in observation mode for 60 days — log everything, enforce nothing. The baseline tells you what normal spend looks like. Set soft limits at roughly 130–150% of the 60-day baseline, hard cutoffs at 200%. A budget set from a guess will either block legitimate work or constrain nothing at all.
- [04]
Build the Finance Reporting Export
Internal governance is for the VP Engineering and the platform team. The finance report is for the CFO and business unit leads. Two audiences, two outputs. The proxy database holds everything you need — the work is the export and the schedule. The team that walks into the QBR with a cost-per-PR number owns the narrative. The team that scrambles for the data on Friday produces a number nobody trusts.
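The baseline rule from step 03 reduces to a small calculation. The sketch below assumes you can export one spend total per day per team from the proxy database; the function name, the 30-day month, and the 140% soft limit (the midpoint of the 130–150% range) are illustrative choices, not values prescribed by any tool.

```python
from statistics import mean


def limits_from_baseline(daily_spend, soft_pct=140, hard_pct=200,
                         days_per_month=30):
    """Derive monthly soft and hard limits from observed daily spend.

    daily_spend: one USD total per day, covering the observation window.
    Refuses to produce limits from fewer than 60 days of data.
    """
    if len(daily_spend) < 60:
        raise ValueError("need at least 60 days of baseline data")
    baseline_monthly = mean(daily_spend) * days_per_month
    return {
        "baseline_monthly": baseline_monthly,
        "soft_limit": baseline_monthly * soft_pct / 100,
        "hard_limit": baseline_monthly * hard_pct / 100,
    }


# Hypothetical team averaging $10/day across the observation window.
limits = limits_from_baseline([10.0] * 60)
print(limits)
```

Raising an error on a short window is a deliberate design choice: it makes "no limit without a baseline" a property of the tooling rather than a convention people have to remember.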
Anomaly Detection: The Signal That Arrives Before the Cap
A budget limit is a hard stop. Anomaly detection is the page that fires while there is still time to investigate, not just block.
Budget enforcement is a hard stop. Anomaly detection is an early warning. You want both. A hard stop at month-end tells you the budget is gone. It tells you nothing about which workflow consumed it. An anomaly alert at 3x daily baseline gives a live signal — investigate now, while the cause is still on someone's screen.
The detection model is simple. Compute the rolling 7-day daily-spend average per team. Compare today's spend to that average at 4 PM. Page the team lead when the ratio crosses your threshold. A 3x spike on a Tuesday afternoon almost always has a specific cause — a new agent workflow shipped that morning, a CI pipeline accidentally triggering agents on every push, a developer manually running a large batch job.
LiteLLM exports spend data via its API and a Prometheus metrics endpoint. Grafana with a Prometheus data source is sufficient for the alerting layer. No need for Datadog unless it is already in the stack. The flow from spend data to alert is in the diagram below.
One operational mistake from our own rollout: we set the anomaly threshold at 2x instead of 3x. The false positive rate was high enough that team leads started ignoring alerts inside two weeks — alert fatigue, the standard failure mode. At 3x, alerts fire roughly twice a month per team and are almost always actionable. Calibrate the threshold to what team leads will actually investigate, not to what catches every minor fluctuation. The alert that gets ignored is worse than no alert.
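The detection rule itself fits in a few lines. This is a sketch of the logic only, with hypothetical function names; it assumes you already have one spend total per team per day and does no scheduling or paging.

```python
from statistics import mean


def spike_ratio(history, today):
    """Ratio of today's spend to the average of the previous 7 days.

    history: daily USD totals for earlier days, oldest first.
    Returns None when there is not enough history or the baseline is zero.
    """
    window = history[-7:]
    if len(window) < 7:
        return None
    baseline = mean(window)
    if baseline == 0:
        return None
    return today / baseline


def should_page(history, today, threshold=3.0):
    """True when today's spend exceeds threshold x the 7-day daily average."""
    ratio = spike_ratio(history, today)
    return ratio is not None and ratio > threshold


# Hypothetical team that normally spends about $100/day.
history = [100, 100, 100, 100, 100, 100, 100]
print(should_page(history, 350))                 # above 3x
print(should_page(history, 250))                 # below 3x
print(should_page(history, 250, threshold=2.0))  # the noisier 2x setting
```

Keeping the threshold as a parameter makes the calibration described above a one-line change, and returning None for thin history avoids paging on a team's first week of data.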
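grafana_alert.yaml
# Token-spend spike detection. Fires when a team's trailing-24-hour spend exceeds 3x its prior 7-day daily average.
apiVersion: 1
groups:
  - orgId: 1
    name: ai-token-governance
    folder: engineering-costs
    interval: 1h
    rules:
      - uid: token-spike-alert
        title: AI Token Spend Spike
        condition: D
        data:
          - refId: A
            # Spend per team over the trailing 24 hours
            relativeTimeRange:
              from: 86400
              to: 0
            datasourceUid: prometheus # placeholder: UID of your Prometheus data source
            model:
              # Spend is a cumulative counter, so take increase() over the window
              # instead of summing the raw counter value.
              # Metric name is assumed from LiteLLM's Prometheus exporter; confirm it on the proxy's /metrics page.
              expr: sum by (team_alias) (increase(litellm_spend_metric_total[1d]))
              instant: true
          - refId: B
            # Daily average per team over the 7 days before that window
            relativeTimeRange:
              from: 86400
              to: 0
            datasourceUid: prometheus # placeholder, as above
            model:
              expr: sum by (team_alias) (increase(litellm_spend_metric_total[7d] offset 1d)) / 7
              instant: true
          - refId: C
            datasourceUid: __expr__
            model:
              type: math
              expression: $A / $B # ratio: last 24h vs. daily avg
          - refId: D
            datasourceUid: __expr__
            model:
              type: threshold
              expression: C
              conditions:
                - evaluator:
                    type: gt
                    params: [3] # page at 3x daily baseline
        noDataState: NoData
        for: 30m
        annotations:
          summary: >-
            Token spike: {{ $labels.team_alias }} is at
            {{ printf "%.1f" $values.C.Value }}x daily average
        labels:
          severity: warning

Four Numbers That End the Finance Conversation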
Finance does not need to understand tokens. They need a cost-per-outcome number and a trend line that fits in a board deck. Anything else is noise.
The line that wins the CFO conversation: cost per merged PR is the number that makes AI spend legible to finance. A team spending $3,200 last month and shipping 85 PRs lands at $37.65/PR. A comparable team spending $1,100 and shipping 20 PRs lands at $55/PR. The first team is more efficient even though absolute spend is higher. The framing flips the conversation from "AI is expensive" to "here is the ROI on the AI investment, and here is the trend."
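The calculation is trivial once spend and merged-PR counts are attributed to the same team. A minimal sketch using the two teams from the paragraph above; the team names and the function are illustrative, and in practice the PR counts would come from your source-control system rather than a hardcoded dict.

```python
def cost_per_pr(spend_usd, merged_prs):
    """Cost per merged PR in USD, or None when nothing was merged."""
    if merged_prs == 0:
        return None
    return round(spend_usd / merged_prs, 2)


# (monthly spend in USD, merged PRs) for the two teams in the example.
teams = {
    "team-a": (3200, 85),
    "team-b": (1100, 20),
}

report = {name: cost_per_pr(spend, prs)
          for name, (spend, prs) in teams.items()}

# Most efficient team first.
ranked = sorted(report, key=report.get)
for name in ranked:
    print(f"{name}: ${report[name]:.2f} per merged PR")
```

Returning None for a team with zero merged PRs keeps a division error out of the report and flags the case for a conversation instead of a number.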
Build this report before finance asks for it. The org that walks into the quarterly review with a cost-per-outcome dashboard owns the narrative. The org that gets asked for the data scrambles for three days and produces a number nobody trusts. Trust on the cost story is built in advance or not at all.
The DX 2026 survey found 86% of engineering leaders unsure which AI tools deliver the most value, and 40% lacking the data to demonstrate ROI[2]. The token governance architecture solves both problems with one piece of infrastructure: the cost data falls out of the proxy, and the ROI data comes from correlating that cost data with output metrics such as merged PRs.
What VPs and CFOs Will Push Back On
The same five objections come up in every token governance conversation. Here are the answers.
Won't hard limits block engineers at the worst possible moment?
Soft limits handle this. Alert at 80% of the monthly budget, amber at 110%, hard stop at 150%. The engineers most likely to hit a hard limit are running unbounded agent loops the soft limit would have caught hours earlier. Design the tiers correctly and almost all legitimate work stays in the green. The hard stop exists for the runaway loop, not for the daily commit.
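The tiers can be expressed as a small classifier. This sketch assumes the percentages are measured against the team's monthly budget; the tier names and the function are illustrative, and the comparison is done with multiplication rather than division so that exact boundaries are not lost to floating-point rounding.

```python
def budget_tier(spent_usd, budget_usd):
    """Classify month-to-date spend against the 80 / 110 / 150 percent tiers."""
    if budget_usd <= 0:
        raise ValueError("budget must be positive")
    if spent_usd * 100 >= 150 * budget_usd:
        return "hard-stop"   # block further calls
    if spent_usd * 100 >= 110 * budget_usd:
        return "amber"       # over budget, still running
    if spent_usd * 100 >= 80 * budget_usd:
        return "alert"       # approaching budget, notify the team lead
    return "green"


# Hypothetical team with an $800 monthly budget.
for spent in (400, 640, 900, 1200):
    print(spent, budget_tier(spent, 800))
```

Only the last tier blocks anything. The first two exist to start a conversation early enough that the hard stop is rarely reached.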
What about teams that need to run large batch jobs that temporarily spike spend?
Budget exemptions work the same way they do in cloud FinOps. Team lead requests a temporary increase for a specific window and use case. VP Engineering approves. The proxy gets one API call. The process creates an audit trail — who asked, why, what they ran — which feeds future budget planning. The alternative is unlimited spend with no record of who consumed it. That alternative is what you have now.
Do we need this if we only use seat-licensed tools?
If every engineer is on a fixed-seat Copilot plan and nothing touches the API layer, the seat license is a predictable line item and the proxy is overkill. The moment any team starts using Claude Code, LangChain agents, or hitting Anthropic/OpenAI APIs directly, you need the attribution layer. The trend is one-directional toward API access as engineers move from copilots to agents. Build the proxy before the API surface area expands, not after.
How do we handle engineers who feel surveilled by the dashboards?
Frame it correctly from day one. Individual-level dashboards are for the engineer's benefit — they can see whether they are on track for the month before a hard limit hits. Team-level dashboards are for planning. The goal is not to catch anyone doing something wrong; it is to give every team a predictable budget to plan against. Engineers at well-run orgs do not feel surveilled by CloudWatch cost alerts. Token budgets are no different when the framing is right and the data is shared, not hidden.
We already run Datadog. Do we need Grafana too?
No. LiteLLM exports spend via Prometheus metrics, and Datadog ingests those directly through its Prometheus integration. Build the dashboards alongside existing observability. The Grafana reference in the architecture is illustrative — any metrics layer works. The proxy is the load-bearing component. The visualization tool is interchangeable.
Token Governance Implementation Checklist
Every LLM call source inventoried — engineering tools, agents, CI/CD, product features
LiteLLM (or Portkey) proxy deployed with team-key authentication
60-day baseline observation completed before any hard limit was set
Team virtual keys minted with monthly budgets and cost-center metadata
use_case tagging convention enforced across every LLM call site
Model routing hook live — Haiku for autocomplete, Sonnet for review, Opus for agents
Anomaly alert wired for 3x daily-spend spike per team
Weekly cost-per-PR report shipping to finance and business unit leads
Budget exemption process documented for legitimate spike workloads
Per-engineer spend dashboard shared so engineers see their own number
Hard Rules for AI Token Governance
No hard limit without 60 days of baseline data underneath it
A budget set from intuition either blocks legitimate work or constrains nothing. Baseline data gives the right number. Two months of observation costs almost nothing. A misset limit that blocks a deploy at 4 AM costs trust and incident time, and the trust is harder to recover than the time.
Every LLM call routes through the proxy. No exceptions, no exemptions, no 'just for now.'
One team with a hardcoded API key bypassing the proxy breaks the entire attribution model. Partial coverage is not coverage. A spike from an untracked source is indistinguishable from a tracking failure, and the longer it persists the more the cost data becomes a fiction. Treat unproxied keys the same way you treat unapproved cloud credentials.
Cost-per-PR is the CFO metric. Total spend and tokens consumed are not.
Total spend rises as the team does more. That is growth, not a problem. The metric finance cares about is efficiency: more output per dollar, over time. Cost-per-PR captures it. Present that number in QBRs unless finance explicitly asks for the breakdown underneath.
Routing is a control, not a suggestion
If model selection sits with the individual engineer, every call lands on the most capable model regardless of the task. Routing rules embedded in the proxy are non-negotiable. Document the policy, name the reasoning, enforce it programmatically. A policy doc that nobody reads does not change behavior. A pre-call hook that overrides the model does.
- [1] Engineers Should Spend $250K on AI Tokens — Mid-Size Repo Hit $150 in 48 Hours (Medium, 2026) (medium.com)
- [2] How Are Engineering Leaders Approaching 2026 AI Tooling Budgets? (DX, 2026) (getdx.com)
- [3] Setting Team Budgets — LiteLLM Documentation (docs.litellm.ai)
- [4] Spend Tracking — LiteLLM Documentation (docs.litellm.ai)
- [5] State of FinOps 2026 Report — FinOps Foundation (data.finops.org)
- [6] Tokenmaxxing: The Costly Mistake in AI Engineering Metrics (2026) (itsmeduncan.com)
- [7] How Token-Based AI Coding Tools Impact Engineering Budgets (Exceeds AI) (blog.exceeds.ai)
- [8] FinOps for AI Overview — FinOps Foundation (finops.org)
- [9] Multi-Tenant Architecture with LiteLLM — LiteLLM Documentation (docs.litellm.ai)