The invoice arrives quarterly. Finance flags it. The VP of Engineering spends two days trying to reconstruct which teams, which agents, which workflows generated the spend. Nobody has the data. The CFO does not have a line item. The VP Eng does not have a dashboard.
AI token spend — the cost of LLM API calls across your engineering team — has become the engineering P&L problem that nobody was planning for when they handed out Claude Code licenses. Seat-based tools like GitHub Copilot and Cursor are the visible part: budgetable, predictable, easy to put in a spreadsheet. The API layer underneath is a different animal. It charges by token, scales non-linearly with agent autonomy, and produces no natural stopping point once an engineer discovers that running a swarm of parallel agents finishes the work in an hour.
The organizations that survived cloud bill shock in 2013 built FinOps practices: meter it, allocate it to teams, put visible budgets on business units, and build anomaly detection before the invoice arrives. That window is open again for AI token spend. The engineering teams that build the governance architecture now will have clean P&L attribution and controllable cost curves. The teams that wait will be explaining a surprise invoice every quarter.
The Window Is Open Again
AI token spend in 2026 is AWS in 2013 — the same spending patterns, the same missing governance, the same window to build FinOps practices before the bill becomes structural
In 2013, AWS billing was chaos at most engineering organizations. Individual teams spun up infrastructure without central visibility. Costs scaled with usage but the invoices arrived 30 days late. Finance had one line item: "cloud infrastructure." Nobody could tell which product, which team, or which architectural decision was responsible for a cost spike.
The response was FinOps: a discipline built around metering every resource, tagging it to a cost center, surfacing it in near-real-time, and building chargeback models so business units owned their cloud costs. By 2018, most mature engineering organizations had cost allocation tags on every AWS resource, per-team budget alerts, and anomaly detection that paged when a team's spend spiked unexpectedly.
AI token spend is in the 2013 moment. The State of FinOps 2026 report shows 98% of FinOps respondents now manage AI spend — up from 63% the previous year — and 58% have implemented showback or chargeback models for it[5]. But that's the leading edge of mature FinOps organizations. The median engineering org is still treating AI API costs as a homogeneous line item with no team-level attribution.
The FinOps Foundation's AI working group frames the problem precisely: "the unit economics of generative AI are fundamentally different from cloud infrastructure — variable by model, prompt complexity, and agent autonomy, not just by hours provisioned."[8] That variability is why the usual controls break. A monthly seat license is predictable. A per-token API cost with an autonomous agent that retries failed tasks is not.
Without token governance:

- One invoice line: "AI tooling — $XX,XXX"
- No team-level attribution
- No budget by use case (copilot vs. agent vs. batch job)
- Surprises surface at quarterly review
- Engineers optimize for speed, not cost
- The CFO cannot connect spend to business outcomes
- Model selection left to individual preference

With token governance:

- Per-team, per-use-case cost attribution in real time
- Monthly budgets enforced at the proxy layer
- Automated alerts when daily spend exceeds 3x baseline
- Model routing by task complexity (Haiku for autocomplete, Opus for agents)
- Chargeback to business unit cost centers
- ROI dashboard correlating token spend to shipped features
- A line item finance understands and can plan around
Anatomy of Runaway Token Spend
Three patterns that blow engineering AI budgets — and why seat-license thinking completely misses them
Token spend explodes in three distinct patterns, each requiring different controls.
Agentic loops are the highest-cost pattern. An engineer uses Claude Code in autonomous mode to refactor a large codebase. The agent reads hundreds of files, generates code, runs tests, interprets failures, and retries. A session that takes two hours and produces solid output might consume $50–80 in API costs — nothing alarming. But the same engineer runs three sessions per day across four simultaneous tasks, and you have $200/day or over $4,000/month from a single person[1]. Multiply by a 40-person engineering team with a new agentic workflow mandate and the monthly number becomes structural.
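The arithmetic behind that number is worth making explicit. The sketch below projects monthly spend from per-session cost; the $65 per-session figure, three sessions per day, and 21 workdays are illustrative assumptions consistent with the ranges above, not measured values.

```python
# Illustrative projection of agentic-loop spend. All inputs are assumptions.
def monthly_agent_spend(cost_per_session: float, sessions_per_day: int,
                        engineers: int, workdays: int = 21) -> float:
    """Project monthly API spend for engineers running autonomous agent sessions."""
    daily_per_engineer = cost_per_session * sessions_per_day
    return daily_per_engineer * workdays * engineers

# One engineer at ~$65/session, 3 sessions/day: ~$200/day, ~$4,100/month
solo = monthly_agent_spend(65.0, 3, engineers=1)
# A 40-person team under an agentic workflow mandate
team = monthly_agent_spend(65.0, 3, engineers=40)
print(f"solo: ${solo:,.0f}/mo, team: ${team:,.0f}/mo")
```

The point of the exercise is the multiplier: per-session costs look harmless until session counts and headcount are applied.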
Unoptimized model selection compounds the problem. Engineers default to the most capable model because it gives better results. Nobody is routing autocomplete calls to a lighter model when the org's API key gives access to Opus. The cost difference between Haiku and Opus for the same task can be 10–20x. When every call goes to the most expensive model regardless of the task, you are paying premium prices for work that a cheaper model handles adequately.
Tokenmaxxing is the behavioral pattern that emerges from invisible budgets. Practitioners have started using this term for engineers who maximize token consumption — running agent swarms, keeping large contexts loaded, retrying aggressively — because it's a rational career move when spend has no personal consequence[6]. If completing a task faster via heavier AI use improves your output, and there's no visible cost signal, the rational behavior is to spend as much as the task can absorb. Invisible budgets make tokenmaxxing structurally inevitable.
The Attribution Architecture
Route every LLM call through a cost-attribution proxy. This is the foundation that makes everything else possible.
The core of AI token governance is a proxy layer that sits between your engineers' tools and the model provider APIs. Every LLM call — from Claude Code, from your internal agents, from CI/CD pipelines — passes through this proxy. The proxy tags each call with team, use case, and outcome metadata, records the cost, and enforces budget limits.
LiteLLM is the most widely deployed open source option. It implements a hierarchical multi-tenant architecture: organization → team → user → key[9]. Budgets cascade down the hierarchy, and every API call is tracked with the full attribution chain. Portkey offers a managed alternative with a governance layer built around workspaces, roles, and budget controls.
The flow runs from engineer tooling through the proxy to the model provider, with cost attribution captured at the proxy layer and surfaced in a dashboard. A minimal LiteLLM configuration looks like this:
litellm_config.yaml:

```yaml
model_list:
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  # Track spend per team in real time
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

general_settings:
  database_url: os.environ/DATABASE_URL
  master_key: os.environ/LITELLM_MASTER_KEY
  store_model_in_db: true
  # Default budget applied to every new team
  default_team_settings:
    max_budget: 500       # $500/month per team
    budget_duration: 30d
    tpm_limit: 2000000    # 2M tokens/minute ceiling
```

With the proxy running, create a virtual key per team and assign it a monthly budget. Teams embed this key in their tooling config — Claude Code, LangChain, custom agents — and every call routes through the proxy automatically[3].
The proxy records cost with the full attribution chain: which team, which user, which model, how many input and output tokens, and what metadata the caller passed. That metadata is where use case attribution lives — you tag calls with use_case: code-review or use_case: autonomous-agent to break down spend by workflow, not just by team.
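As a concrete sketch of the tagging convention, a small helper can stamp attribution metadata onto every request payload before it goes through the proxy. The helper and its field names follow this article's convention; how the proxy ingests the metadata field depends on your client and proxy versions, so treat this as an illustration rather than LiteLLM's canonical API.

```python
# Hypothetical helper: stamp attribution metadata onto a request payload.
# Field names ("team", "use_case") are this article's convention.
def tag_request(payload: dict, team: str, use_case: str) -> dict:
    tagged = dict(payload)
    tagged["metadata"] = {**payload.get("metadata", {}),
                          "team": team, "use_case": use_case}
    return tagged

req = tag_request(
    {"model": "claude-sonnet", "max_tokens": 1024,
     "messages": [{"role": "user", "content": "Review this diff"}]},
    team="payments-squad",
    use_case="code-review",
)
print(req["metadata"])  # {'team': 'payments-squad', 'use_case': 'code-review'}
```

Centralizing the tagging in one helper keeps call sites consistent, which is what makes the use-case breakdown trustworthy later.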
create_team_budget.sh:

```shell
# Create a team with a hard monthly budget
curl -X POST 'http://your-litellm-proxy:4000/team/new' \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "team_alias": "payments-squad",
    "max_budget": 800,
    "budget_duration": "30d",
    "tpm_limit": 1000000,
    "metadata": {
      "cost_center": "ENGG-PAYMENTS",
      "team_lead": "sarah@company.com",
      "budget_owner": "vp-engineering"
    }
  }'

# Response includes the team's API key.
# Engineers add this to their .env files:
#   ANTHROPIC_API_KEY=sk-litellm-payments-squad-xxxx
#   ANTHROPIC_BASE_URL=http://your-litellm-proxy:4000
```

Model Routing by Task Complexity
The biggest lever on token cost isn't budget limits — it's routing each task to the right model. A 10–20x cost difference exists between Haiku and Opus for the same request.
Budget enforcement prevents catastrophic overruns. Model routing reduces baseline spend. The two controls operate at different layers and compound.
The routing principle is straightforward: match model capability to task requirements. A code autocomplete in an IDE needs fast response and basic completion quality — Claude Haiku handles this well at a fraction of the Opus cost. A multi-step architecture review that needs deep reasoning over a complex codebase genuinely benefits from Opus. Routing every request to the same model because it's what the API key defaults to wastes money on the former and potentially shortchanges the latter.
| Task Type | Recommended Model | Rationale | Approx. Relative Cost |
|---|---|---|---|
| IDE autocomplete / inline suggestion | Claude Haiku | Speed matters more than depth; context is small | 1x baseline |
| Code explanation / docstring generation | Claude Haiku | Well-defined, bounded task; little ambiguity | 1x baseline |
| Code review — single PR | Claude Sonnet | Needs judgment on patterns, security, style | ~5x baseline |
| Test generation for existing function | Claude Sonnet | Moderate complexity; clear success criteria | ~5x baseline |
| Multi-file refactor with dependencies | Claude Sonnet | Context-heavy but structured; Sonnet sufficient | ~5x baseline |
| Architecture review / system design | Claude Opus | Requires deep reasoning over ambiguous tradeoffs | ~20x baseline |
| Autonomous multi-step agent (planning loop) | Claude Opus | Agent orchestration quality significantly affects outcome | ~20x baseline |
| Batch summarization / classification jobs | Claude Haiku | High volume, low complexity; cost savings compound | 1x baseline |
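To see how much routing moves the baseline, price a month of calls at the table's relative cost units under two policies: everything on the most expensive model versus routed by task type. The call mix below is an illustrative assumption, not measured data.

```python
# Relative cost units from the table above: Haiku 1x, Sonnet ~5x, Opus ~20x.
RELATIVE_COST = {"haiku": 1, "sonnet": 5, "opus": 20}

# Assumed monthly call mix (illustrative): mostly autocomplete, some reviews,
# a small number of genuinely Opus-grade tasks.
call_mix = {"haiku": 50_000, "sonnet": 8_000, "opus": 500}

all_opus = sum(call_mix.values()) * RELATIVE_COST["opus"]
routed = sum(n * RELATIVE_COST[m] for m, n in call_mix.items())

print(f"all-Opus: {all_opus:,} units, routed: {routed:,} units")
```

Under this mix the all-Opus policy costs roughly an order of magnitude more in relative units, which is why routing is the biggest single lever on baseline spend.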
routing_config.yaml:

```yaml
# LiteLLM router: task-complexity routing via metadata tags
router_settings:
  routing_strategy: usage-based-routing
  # Fallback chain if the primary model is unavailable
  fallbacks:
    - {"claude-opus": ["claude-sonnet"]}
    - {"claude-sonnet": ["claude-haiku"]}

# Engineers tag calls with use_case metadata.
# The proxy reads the tag and enforces model selection:
#
#   client.messages.create(
#       model="claude-sonnet",  # requested model
#       metadata={"use_case": "code-review", "team": "payments-squad"},
#       ...
#   )
#
# To enforce routing rules, add a LiteLLM hook:
litellm_settings:
  callbacks: ["my_routing_hook"]
```

routing_hook.py:

```python
# Override the requested model based on the use_case tag
from litellm.integrations.custom_logger import CustomLogger

class RoutingHook(CustomLogger):
    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        use_case = data.get("metadata", {}).get("use_case", "")
        if use_case in ("autocomplete", "inline-suggestion", "batch-classify"):
            data["model"] = "claude-haiku"
        elif use_case in ("code-review", "test-generation", "refactor"):
            data["model"] = "claude-sonnet"
        return data
```

Per-Team Budget Enforcement in Four Steps
The governance architecture is straightforward to stand up once the proxy is running. The organizational work of assigning budgets is harder than the technical implementation.
1. Inventory all LLM call sources. Before setting budgets, you need a complete map of where API calls originate. Engineering tools (Claude Code, Cursor, Copilot), internal agents, CI/CD pipelines, and application-layer features all need to route through the proxy. Any uncovered call source is a budget hole.
2. Define your team and use-case taxonomy. The attribution schema you choose now determines what questions you can answer in six months. Team attribution is the minimum viable structure. Use-case attribution (copilot vs. agent vs. batch) enables more nuanced conversations about value per dollar.
3. Set budgets based on 60 days of baseline data. Do not set budgets before you have baseline data. Run the proxy in observation mode for two months: log everything, enforce nothing. The baseline tells you what "normal" spend looks like. Set limits at roughly 130–150% of the 60-day baseline, not from a guess.
4. Build the finance reporting export. The technical governance is for the VP Engineering and platform team; the finance report is for the CFO and business unit leads. These are different audiences requiring different outputs. The proxy database has everything you need; the work is building the export and scheduling it.
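Step 3's sizing rule reduces to a few lines. A sketch, with the 60-day spend figure as a placeholder:

```python
# Derive a monthly budget limit from observed baseline spend.
# The 1.4 default lands in the article's 130-150% headroom band.
def monthly_budget(sixty_day_spend: float, headroom: float = 1.4) -> float:
    """Set a monthly limit at roughly 130-150% of the observed monthly baseline."""
    monthly_baseline = sixty_day_spend / 2  # 60 days is roughly two months
    return round(monthly_baseline * headroom, 2)

# A team that spent $1,600 over the 60-day observation window
print(monthly_budget(1600.0))  # 1120.0
```

The headroom parameter is the policy knob: tighter for mature workflows with stable spend, looser for teams still experimenting with agents.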
Building the Anomaly Detection Layer
A budget limit stops runaway spend. Anomaly detection surfaces the cause before the limit is hit — and gives you time to investigate rather than just block.
Budget enforcement is a hard stop. Anomaly detection is an early warning system. You want both, because a hard stop at month-end budget tells you nothing about which workflow caused the spike, whereas an anomaly alert at 3x daily baseline gives you a live signal to investigate.
The anomaly detection model is simple: calculate the rolling 7-day average spend per team, compare today's spend to that average by 4 PM, and page the team lead when the ratio exceeds your threshold. A 3x spike on a Tuesday afternoon almost always has a specific cause — a new agent workflow a team shipped, a CI pipeline accidentally running agents on every push, or a developer manually running a large batch job.
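That detection model fits in a few lines of code. A sketch, assuming you can pull per-team daily spend totals from the proxy's spend API:

```python
# Rolling-average spike detection, as described above.
def spend_ratio(today_spend: float, last_7_days: list[float]) -> float:
    """Ratio of today's spend to the rolling 7-day daily average."""
    baseline = sum(last_7_days) / len(last_7_days)
    return today_spend / baseline if baseline > 0 else float("inf")

def should_page(today_spend: float, last_7_days: list[float],
                threshold: float = 3.0) -> bool:
    """Page the team lead when today's spend exceeds the threshold multiple."""
    return spend_ratio(today_spend, last_7_days) >= threshold

history = [40.0, 55.0, 48.0, 60.0, 52.0, 45.0, 50.0]  # daily avg = 50.0
print(should_page(165.0, history))  # True: 3.3x the daily baseline
print(should_page(90.0, history))   # False: 1.8x
```

The same logic is what the Grafana rule below this section expresses declaratively; running it as code first is a cheap way to validate the threshold against your own history.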
LiteLLM exports spend data via its API and Prometheus metrics endpoint. Grafana with a simple Prometheus data source is sufficient for the alerting layer — no need for Datadog unless you already have it. Spend data flows from the proxy's metrics endpoint into the alert rule below.
grafana_alert.yaml:

```yaml
# Grafana alerting rule — token spend spike detection
# Fires when a team's same-day spend exceeds 3x their 7-day daily average
apiVersion: 1
groups:
  - orgId: 1
    name: ai-token-governance
    folder: engineering-costs
    interval: 1h
    rules:
      - uid: token-spike-alert
        title: AI Token Spend Spike
        condition: C
        data:
          - refId: A
            # Today's cumulative spend per team
            queryType: range
            relativeTimeRange:
              from: 86400
              to: 0
            model:
              expr: sum(litellm_spend_usd_total) by (team_alias)
          - refId: B
            # 7-day rolling daily average per team
            queryType: range
            relativeTimeRange:
              from: 604800
              to: 86400
            model:
              expr: sum(litellm_spend_usd_total) by (team_alias) / 7
          - refId: C
            queryType: expression
            model:
              type: math
              expression: $A / $B  # ratio of today vs. daily average
        noDataState: NoData
        for: 30m
        annotations:
          summary: >-
            Token spike: {{ $labels.team_alias }} is at
            {{ $values.C | printf "%.1f" }}x daily average
        labels:
          severity: warning
        condition:
          evaluator:
            type: gt
            params: [3]  # alert at 3x daily baseline
```

The Four Numbers That Make Finance Stop Asking Questions
Finance does not need to understand tokens. They need a cost-per-outcome number and a trend line they can put in a board deck.
The critical insight for the CFO conversation: cost per merged PR is the number that makes AI spend legible to finance. If a team spent $3,200 last month on AI APIs and shipped 85 pull requests, their cost per PR is $37.65. If a comparable team spending $1,100 shipped 20 pull requests, their cost per PR is $55. The first team is getting more value per dollar, even though their absolute spend is higher. That framing converts the conversation from "AI is expensive" to "here's the ROI on your AI investment."
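The metric itself is a single division, worked through with the numbers above:

```python
# Cost per merged PR: the one AI-spend metric finance can act on.
def cost_per_pr(monthly_ai_spend: float, merged_prs: int) -> float:
    return round(monthly_ai_spend / merged_prs, 2)

print(cost_per_pr(3200.0, 85))  # 37.65 -- higher spend, better value per PR
print(cost_per_pr(1100.0, 20))  # 55.0
```

The spend figure comes straight from the proxy's per-team totals; merged-PR counts come from your source control system, so the join is trivial once both are exported weekly.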
Start building this report before finance asks for it. The engineering org that walks into the quarterly business review with a cost-per-outcome dashboard owns the narrative. The engineering org that gets asked for this data scrambles for three days and produces a number nobody trusts.
The DX 2026 survey found that 86% of engineering leaders feel uncertain about which AI tools provide the most benefit, and 40% lack sufficient data to demonstrate ROI[2]. The token governance architecture solves both problems simultaneously — it gives you the cost data and, when correlated with output metrics, the ROI data.
Questions VPs and CFOs Ask
The objections that come up in every token governance conversation
Won't hard budget limits block engineers at critical moments?
Soft limits — alerts at 80% of budget — handle this. Hard limits at 150% are a circuit breaker for runaway processes, not a daily constraint for normal work. In practice, the engineers most likely to hit a hard limit are running unbounded agent loops that a soft limit would have caught hours earlier. Design the limit tiers correctly: green (0–80%), amber alert (80–110%), hard stop (150%). Almost all legitimate engineering work stays in the green.
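The tier boundaries can be encoded as a simple threshold function. This sketch uses the cut-points above and treats everything between the 80% alert and the 150% circuit breaker as amber; the exact band edges are a policy choice, not a fixed rule.

```python
# Classify a team's month-to-date spend against its budget.
# Bands follow the article's tiers: green below 80%, hard stop at 150%.
def budget_tier(spend: float, monthly_budget: float) -> str:
    pct = spend / monthly_budget
    if pct < 0.8:
        return "green"
    if pct < 1.5:
        return "amber"      # alert the team lead, keep serving requests
    return "hard-stop"      # circuit breaker for runaway processes

print(budget_tier(400.0, 800.0))   # green (50%)
print(budget_tier(760.0, 800.0))   # amber (95%)
print(budget_tier(1300.0, 800.0))  # hard-stop (162%)
```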
What if teams need to run large batch jobs that temporarily spike spend?
Budget exemptions work exactly like cloud FinOps exemptions: the team lead requests a temporary limit increase for a specific time window and use case, the VP Engineering approves it, the proxy gets updated. This is one API call to LiteLLM. The process creates an audit trail — who requested it, why, what they ran — which is itself valuable data for future budget planning. The alternative (no limits, unlimited spend) is what you have now.
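A sketch of what the exemption could look like as a request body. LiteLLM exposes team-management endpoints for updating budgets; the audit-trail field names under metadata here are this article's convention, not a LiteLLM schema.

```python
import json

# Build a temporary-limit-increase request body; sending it is one HTTP POST
# to the proxy's team-update endpoint. The "exemption_*" metadata fields are
# illustrative audit-trail conventions, not a LiteLLM schema.
def exemption_payload(team_id: str, new_budget: float,
                      requested_by: str, reason: str) -> str:
    return json.dumps({
        "team_id": team_id,
        "max_budget": new_budget,
        "metadata": {
            "exemption_requested_by": requested_by,
            "exemption_reason": reason,
        },
    })

body = exemption_payload("payments-squad", 2400.0,
                         "sarah@company.com", "one-off embedding backfill")
print(json.loads(body)["max_budget"])  # 2400.0
```

Keeping the reason and requester in metadata is what turns each exemption into the audit-trail data point the answer above describes.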
Do we need all this infrastructure if we only use seat-licensed tools?
If every engineer is on a fixed-seat GitHub Copilot plan and nothing touches the API layer, the seat license is a predictable line item and you don't need token governance. The moment any team starts using Claude Code, LangChain agents, or calling Anthropic/OpenAI APIs directly, you need the attribution proxy. The trend is strongly toward API-layer access as engineers move from copilots to autonomous agents. Build the proxy before the API footprint grows, not after.
How do we handle the cultural pushback from engineers who feel monitored?
Frame it correctly from the start: individual-level spend dashboards are for engineers' benefit — they can see if they're on track for the month before a hard limit hits. Team-level dashboards are for planning, not surveillance. The goal is not to catch engineers doing something wrong; it's to give every team a predictable budget they can plan around. Engineers working at well-run organizations with clear cloud budgets don't feel surveilled by CloudWatch cost alerts. Token budgets are no different when the framing is right.
We're already using Datadog for observability — do we need Grafana too?
No. LiteLLM exports spend data via Prometheus metrics, which Datadog can ingest directly through its Prometheus integration. Build your token spend dashboards in Datadog alongside your existing observability. The Grafana reference in the architecture is illustrative — any metrics visualization layer works. The proxy is the critical component, not the dashboard tool.
Token Governance Implementation Checklist
- Inventoried all LLM API call sources across engineering tooling, agents, and CI/CD
- Deployed LiteLLM (or Portkey) proxy with team-key authentication
- Ran 60-day baseline observation before setting hard budget limits
- Created team virtual keys with monthly budgets and cost center metadata
- Implemented use_case tagging convention across all LLM call sites
- Configured model routing hook (Haiku for autocomplete, Sonnet for reviews, Opus for agents)
- Built Grafana/Datadog alert for 3x daily spend spike per team
- Built weekly cost-per-PR report for finance and business unit leads
- Defined budget exemption request process for legitimate spike workloads
- Shared per-engineer spend dashboard so engineers have personal visibility
Hard Rules for AI Token Governance
Never set budget limits before collecting 60 days of baseline data
A budget set from intuition will either block legitimate work or provide no real constraint. Baseline data gives you the right number. Two months of observation costs almost nothing; a misset limit that blocks a critical deployment at 4 AM costs trust and incident time.
Route every LLM call through the attribution proxy — no exceptions
One team with a hardcoded API key bypassing the proxy breaks the entire attribution model. You cannot have partial coverage; a spend spike from an untracked source is indistinguishable from a tracking failure. Treat unproxied API keys the same way you treat unapproved cloud credentials.
Cost-per-PR is the CFO metric — not total spend, not tokens consumed
Total spend increases as the team does more. That's not a problem — it's growth. The metric that matters to finance is efficiency: are you getting more output per dollar over time? Cost-per-PR captures this. Present only this metric in QBRs unless finance asks for more detail.
Model routing is a governance control, not an engineering suggestion
If model selection is left to individual preference, engineers default to the best model for every task regardless of cost. Routing rules embedded in the proxy are non-negotiable guardrails. Document the routing policy, explain the reasoning, and enforce it programmatically. A policy doc nobody reads does not change model selection behavior.
- [1] Engineers Should Spend $250K on AI Tokens — Mid-Size Repo Hit $150 in 48 Hours (Medium, 2026), medium.com
- [2] How Are Engineering Leaders Approaching 2026 AI Tooling Budgets? (DX, 2026), getdx.com
- [3] Setting Team Budgets — LiteLLM Documentation, docs.litellm.ai
- [4] Spend Tracking — LiteLLM Documentation, docs.litellm.ai
- [5] State of FinOps 2026 Report — FinOps Foundation, data.finops.org
- [6] Tokenmaxxing: The Costly Mistake in AI Engineering Metrics (2026), itsmeduncan.com
- [7] How Token-Based AI Coding Tools Impact Engineering Budgets (Exceeds AI), blog.exceeds.ai
- [8] FinOps for AI Overview — FinOps Foundation, finops.org
- [9] Multi-Tenant Architecture with LiteLLM — LiteLLM Documentation, docs.litellm.ai