Token Budget Engineering: AI FinOps Before the Invoice Lands

The $2,000 Engineer: Build a Token Budget Before AI Tooling Eats Your P&L

You approved Copilot. Then Claude Code. The invoice is a surprise and nobody owns the line item. The window for token FinOps is open right now — proxy, attribution, routing, anomaly detection. Build it before the next quarterly review.

Governance & Adoption · advanced · Mar 11, 2026 · 8 min read

By Viktor Bezdek · VP Engineering, Groupon

$500–$2K

Monthly API spend per engineer running Claude Code as an autonomous agent^[1]

Top of the range is engineers running multi-step agents against large repos. No budget gate. No personal cost signal.

$150

Burned by one developer on a mid-size repo in 48 hours^[1]

One person. One project. Two days. The first time anyone saw the number was on the invoice.

38%

Engineering leaders spending $101–500/dev/year on AI tools — 10.5% already past $1K^[2]

Seat licenses are the visible part. API spend rides underneath, usually untracked.

21%

Large engineering orgs operating with no formal AI cost-tracking system^[2]

One in five large orgs spending real money against zero attribution. The invoice is the dashboard.

The invoice arrives quarterly. Finance flags it. The VP of Engineering spends two days reconstructing which teams, which agents, which workflows produced the spend. The data does not exist. There is no team line item. There is no per-use-case breakdown. There is one number with nine zeroes of context missing.

AI token spend — the cost of LLM API calls across an engineering org — has become a P&L problem nobody planned for when they handed out Claude Code licenses. Seat-based tools like GitHub Copilot and Cursor sit on the visible side: budgetable, predictable, easy to put in a spreadsheet. The API layer underneath is a different animal. It charges by token, scales non-linearly with agent autonomy, and produces no natural stopping point once an engineer discovers that fanning out parallel agents finishes a refactor in an hour.

The orgs that survived cloud bill shock in 2013 built FinOps: meter the resource, allocate it to teams, put visible budgets on business units, page someone before the invoice arrives. The same window is open for AI token spend. The teams that build the governance architecture now will own clean P&L attribution and controllable cost curves. The teams that wait will explain a surprise invoice every quarter for the rest of the decade.

One counterintuitive read: the teams spending the most on tokens often have the best unit economics, not the worst. A platform team burning $8,000/month and shipping 200 PRs at $40 each is dramatically more efficient than a team burning $1,200/month and shipping 8 PRs at $150 each. Absolute spend is the wrong signal. Cost per shipped unit of work is the right one — and you cannot calculate it without attribution infrastructure underneath every call.

AI Token Spend Is AWS in 2013. The Window Is Open Again.

Same spending pattern, same missing controls, same short window before the bill becomes structural and the org learns to live with it.

In 2013, AWS billing was chaos at most engineering orgs. Teams spun up infrastructure with no central visibility. Costs scaled with usage. Invoices arrived 30 days late. Finance had one line item — "cloud infrastructure" — and no way to ask which product, which team, or which architectural decision produced a spike.

The response was FinOps. Meter every resource. Tag it to a cost center. Surface it in near-real-time. Build chargeback so business units own what they consume. By 2018, mature engineering orgs had cost allocation tags on every AWS resource, per-team budget alerts, and anomaly detection that paged on unexpected spikes. The discipline made cloud spend ownable.

AI token spend is in the 2013 moment. The State of FinOps 2026 report shows 98% of FinOps respondents now manage AI spend — up from 63% the previous year — and 58% have implemented showback or chargeback for it^[5]. That is the leading edge. The median engineering org still treats AI API costs as one homogeneous line with zero team-level attribution.

The FinOps Foundation's AI working group names the structural difference: "the unit economics of generative AI are fundamentally different from cloud infrastructure — variable by model, prompt complexity, and agent autonomy, not just by hours provisioned."^[8] That variability is why the old controls break. A monthly seat license is predictable. A per-token API call from an autonomous agent that retries on failure is not.

Untracked Spend

One invoice line: 'AI tooling — $XX,XXX'
Zero team-level attribution
No budget by use case — copilot, agent, batch job all collapsed
Surprises surface at quarterly review
Engineers optimize for speed; cost is invisible
CFO cannot connect spend to shipped work
Model selection is individual preference, not policy

Attributed Spend

Per-team, per-use-case attribution in real time
Monthly budgets enforced at the proxy, not by memo
Automated alerts when daily spend crosses 3x baseline
Model routing by task complexity — Haiku for autocomplete, Opus for agents
Chargeback to business unit cost centers, not absorbed by engineering
ROI dashboard correlating token spend to shipped features
Finance has a line item they understand and can plan against

Three Patterns That Blow the Budget

Seat-license thinking misses every one of them. The cost curve does not behave like a SaaS subscription, and pretending it does is how the surprise invoice happens.

Token spend explodes in three distinct patterns. Each demands a different control. None of them is solvable with a procurement spreadsheet.

Agentic loops are the dominant cost vector. An engineer runs Claude Code in autonomous mode against a large repo. The agent reads hundreds of files, writes code, runs tests, reads failures, retries. A two-hour session that produces solid output might burn $50–80 in API costs. Nothing alarming on its own. But that engineer runs three sessions a day across four parallel tasks, and at $50–80 a session the math turns into roughly $200/day, or $4,000+/month over about 20 working days, from one person^[1]. Multiply by a 40-engineer org with a fresh agentic-workflow mandate from the CTO and the monthly number becomes structural, and load-bearing in the next funding conversation.
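The per-engineer projection is simple multiplication, and it helps to see the range rather than a single figure. A minimal sketch, assuming about 20 working days per month (an assumption, not a number from the cited source); the function name is illustrative:

```python
def monthly_agent_spend(cost_per_session: float, sessions_per_day: int,
                        working_days: int = 20) -> float:
    """Projected monthly API spend for one engineer running agent sessions."""
    return cost_per_session * sessions_per_day * working_days


# Session cost range quoted in the text: $50-80 for a two-hour session.
low = monthly_agent_spend(50, sessions_per_day=3)
high = monthly_agent_spend(80, sessions_per_day=3)
print(f"${low:,.0f} to ${high:,.0f} per month")
```

Swap in your own session cost and cadence; the point is that the monthly number is driven by session count, which nobody sees without metering.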

Unoptimized model selection compounds the loop problem. Engineers route everything to the most capable model because it produces better results, and there is no force pulling them toward the cheaper one. The cost gap between Haiku and Opus on the same task runs 10–20x. When every call hits the most expensive model regardless of difficulty, you are paying premium prices for autocomplete and not saving anything for the workflows that genuinely need depth.

Tokenmaxxing is the behavioral pattern that emerges from invisible budgets. Practitioners coined the term for engineers who maximize token consumption (running agent swarms, keeping large contexts pinned, retrying aggressively) because, in the absence of a personal cost signal, it is a rational career play^[6]. More tokens, faster output, better review cycle. The behavior is not abuse. It is a correct response to the incentive structure. Make the cost invisible and tokenmaxxing is the equilibrium.

The Proxy Is the Control Plane. Everything Else Bolts to It.

Route every LLM call through one cost-attribution layer. No exceptions. The proxy is the foundation that makes budgets, routing, and anomaly detection possible at all.

The core of token governance is a proxy that sits between engineering tools and model provider APIs. Every LLM call — from Claude Code, from internal agents, from CI/CD pipelines, from product features — passes through this single ingress. The proxy tags each call with team, use case, and outcome metadata, records the cost in a database, and enforces budget limits before the call leaves the network.
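To make the proxy's job concrete, here is a minimal in-memory sketch of the two things it does on every call: check the team's budget before forwarding, then log the attributed cost. The class and field names are hypothetical, and a real proxy persists this in a database rather than a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class TeamBudget:
    max_budget_usd: float
    spent_usd: float = 0.0


@dataclass
class BudgetGate:
    """In-memory stand-in for the proxy's budget check and cost log."""
    budgets: dict[str, TeamBudget]
    ledger: list[dict] = field(default_factory=list)

    def allow(self, team: str) -> bool:
        # Unknown teams are rejected: unattributed calls never leave the network.
        budget = self.budgets.get(team)
        return budget is not None and budget.spent_usd < budget.max_budget_usd

    def record(self, team: str, use_case: str, model: str, cost_usd: float) -> None:
        # Charge the team and keep the full attribution chain for reporting.
        self.budgets[team].spent_usd += cost_usd
        self.ledger.append(
            {"team": team, "use_case": use_case, "model": model, "cost_usd": cost_usd}
        )


gate = BudgetGate({"payments-squad": TeamBudget(max_budget_usd=800)})
if gate.allow("payments-squad"):
    # ... forward the request to the provider here, then log what it cost ...
    gate.record("payments-squad", "code-review", "claude-sonnet", cost_usd=0.42)
```

Rejecting unknown teams by default is the design choice that matters: it is what turns "every call is attributed" from a convention into a property of the system.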

LiteLLM is the most widely deployed open-source option. It runs a hierarchical multi-tenant model: organization → team → user → key^[9]. Budgets cascade down the hierarchy. Every API call carries the full attribution chain. Portkey offers a managed alternative with workspaces, roles, and budget controls baked in.

The diagram below traces the path from engineer tooling to model provider. Cost attribution is captured at the proxy and surfaced in a dashboard. The single-ingress rule is what makes the rest of the architecture work. Any source that bypasses the proxy is a hole in the attribution model.

Single Ingress: Every LLM Call Tagged Before It Leaves

Every call hits the proxy first. Budget check, model routing, cost log — then provider. Anything bypassing this path is invisible to finance.

litellm_config.yaml

# One proxy, every call. Budgets cascade org -> team -> user -> key.
model_list:
  - model_name: claude-opus
    litellm_params:
      model: anthropic/claude-opus-4-5
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
  - model_name: claude-haiku
    litellm_params:
      model: anthropic/claude-haiku-4-5-20251001
      api_key: os.environ/ANTHROPIC_API_KEY

litellm_settings:
  # Per-team spend recorded in real time. The proxy is the source of truth.
  success_callback: ["langfuse"]
  failure_callback: ["langfuse"]

general_settings:
  database_url: os.environ/DATABASE_URL
  master_key: os.environ/LITELLM_MASTER_KEY
  store_model_in_db: true

  # Default budget for every new team. Tune after 60 days of baseline.
  # Verify this key name and its placement against the LiteLLM config reference for your version.
  default_team_settings:
    max_budget: 500        # $500/month per team
    budget_duration: 30d
    tpm_limit: 2000000     # 2M tokens/minute hard ceiling

With the proxy live, mint a virtual key per team and bind it to a monthly budget. Teams embed the key in their tooling — Claude Code, LangChain, custom agents — and every call routes through the proxy without further engineering effort^[3].

The proxy records the full attribution chain: which team, which user, which model, input and output tokens, and any metadata the caller passed. The metadata field is where use-case attribution lives. Tag calls with use_case: code-review or use_case: autonomous-agent and the dashboard breaks spend down by workflow, not just by team. Without the tag the data still arrives — but answering "what did we spend on agents last month" turns into a guess.
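The difference the tag makes shows up as soon as you try to group the spend log. A small sketch, assuming a hypothetical row shape with `team`, `use_case`, and `cost_usd` fields (a real proxy stores more per call):

```python
from collections import defaultdict

# Hypothetical spend-log rows; field names are illustrative.
calls = [
    {"team": "payments-squad", "use_case": "code-review", "cost_usd": 1.20},
    {"team": "payments-squad", "use_case": "autonomous-agent", "cost_usd": 14.50},
    {"team": "payments-squad", "cost_usd": 3.10},  # caller forgot the tag
    {"team": "search", "use_case": "autonomous-agent", "cost_usd": 9.00},
]


def spend_by(rows, *keys):
    """Sum cost_usd grouped by the given keys; missing tags become 'untagged'."""
    totals = defaultdict(float)
    for row in rows:
        group = tuple(row.get(k, "untagged") for k in keys)
        totals[group] += row["cost_usd"]
    return dict(totals)


by_workflow = spend_by(calls, "use_case")
by_team_and_workflow = spend_by(calls, "team", "use_case")
```

The size of the `untagged` bucket is itself a useful metric: it measures how much of last month's spend you can only guess at.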

create_team_budget.sh

# Create a team key with a hard monthly budget. One API call, no ceremony.
curl -X POST 'http://your-litellm-proxy:4000/team/new' \
  -H "Authorization: Bearer $LITELLM_MASTER_KEY" \
  -H 'Content-Type: application/json' \
  -d '{
    "team_alias": "payments-squad",
    "max_budget": 800,
    "budget_duration": "30d",
    "tpm_limit": 1000000,
    "metadata": {
      "cost_center": "ENGG-PAYMENTS",
      "team_lead": "sarah@company.com",
      "budget_owner": "vp-engineering"
    }
  }'

# The response includes the new team_id. Mint a virtual key for that team via /key/generate, then engineers point their tools at it:
# ANTHROPIC_API_KEY=sk-litellm-payments-squad-xxxx
# ANTHROPIC_BASE_URL=http://your-litellm-proxy:4000

Routing Is the Lever, Not the Limit

Budget caps stop catastrophe. Model routing reduces baseline spend. The 10–20x gap between Haiku and Opus is the largest cost lever you have, and most orgs are not pulling it.

Budget enforcement prevents catastrophic overruns. Model routing reduces baseline spend. The two controls operate at different layers and compound — one is the circuit breaker, the other is the rate at which the meter spins.

The routing principle is mechanical: match model capability to task requirement. IDE autocomplete needs fast response and basic completion quality. Haiku handles it at a fraction of the Opus price. A multi-step architecture review over a complex codebase actually benefits from Opus depth. Routing every call to the same tier because that is what the API key defaults to wastes money on the easy work and offers no extra fidelity on the hard work.

Task Type	Recommended Model	Rationale	Approx. Relative Cost
IDE autocomplete / inline suggestion	Claude Haiku	Speed matters more than depth; context is small	1x baseline
Code explanation / docstring generation	Claude Haiku	Well-defined, bounded task; little ambiguity	1x baseline
Code review — single PR	Claude Sonnet	Needs judgment on patterns, security, style	~5x baseline
Test generation for existing function	Claude Sonnet	Moderate complexity; clear success criteria	~5x baseline
Multi-file refactor with dependencies	Claude Sonnet	Context-heavy but structured; Sonnet sufficient	~5x baseline
Architecture review / system design	Claude Opus	Requires deep reasoning over ambiguous tradeoffs	~20x baseline
Autonomous multi-step agent (planning loop)	Claude Opus	Agent orchestration quality significantly affects outcome	~20x baseline
Batch summarization / classification jobs	Claude Haiku	High volume, low complexity; cost savings compound	1x baseline
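To see how much the table is worth, compare the blended cost of a routed workload against sending everything to the top tier. This sketch uses the relative cost column above; the workload mix is a hypothetical assumption, so substitute your own shares of token volume.

```python
# Relative cost tiers from the table above (Haiku = 1x baseline).
RELATIVE_COST = {"claude-haiku": 1, "claude-sonnet": 5, "claude-opus": 20}

# Task-to-tier policy, following the table.
ROUTING = {
    "autocomplete": "claude-haiku",
    "code-review": "claude-sonnet",
    "autonomous-agent": "claude-opus",
}

# Hypothetical share of token volume per task type (sums to 1.0).
WORKLOAD_MIX = {"autocomplete": 0.6, "code-review": 0.3, "autonomous-agent": 0.1}


def blended_cost(mix, route):
    """Average relative cost per unit of token volume under a routing policy."""
    return sum(share * RELATIVE_COST[route(task)] for task, share in mix.items())


routed = blended_cost(WORKLOAD_MIX, lambda task: ROUTING[task])
all_opus = blended_cost(WORKLOAD_MIX, lambda task: "claude-opus")
print(f"routed: {routed:.1f}x, all-Opus: {all_opus:.1f}x")
```

Under this assumed mix, routing lands near 4x baseline against 20x for the everything-on-Opus default. The gap narrows as the agent share of the mix grows.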

routing_config.yaml

# Task-complexity routing. Model selection is enforced at the proxy, not asked of the engineer.
router_settings:
  routing_strategy: usage-based-routing

  # Fallback chain when the primary tier is unavailable.
  fallbacks:
    - {"claude-opus": ["claude-sonnet"]}
    - {"claude-sonnet": ["claude-haiku"]}

# Engineers tag calls with use_case metadata. The proxy reads the tag and overrides the model.
#
# client.messages.create(
#   model="claude-sonnet",  # caller's request
#   metadata={"use_case": "code-review", "team": "payments-squad"},
#   ...
# )
#
# Routing hook overrides the requested model based on the tag.

litellm_settings:
  # Reference the handler instance as <module>.<instance name>.
  callbacks: routing_hook.routing_hook_instance

# routing_hook.py: pin model selection to use_case. Policy in code, not in a wiki page.
# from litellm.integrations.custom_logger import CustomLogger
# class RoutingHook(CustomLogger):
#   async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
#     # metadata may be absent or None, so guard before reading the tag.
#     use_case = (data.get("metadata") or {}).get("use_case", "")
#     if use_case in ("autocomplete", "inline-suggestion", "batch-classify"):
#       data["model"] = "claude-haiku"
#     elif use_case in ("code-review", "test-generation", "refactor"):
#       data["model"] = "claude-sonnet"
#     # Any other use_case keeps the model the caller requested.
#     return data
#
# routing_hook_instance = RoutingHook()

Per-Team Budgets in Four Steps

The technical work is straightforward once the proxy is live. Assigning the budgets is the political fight, and that fight is the actual project.

[01]
Inventory Every LLM Call Source
Before setting budgets, build a complete map of where calls originate. Engineering tools (Claude Code, Cursor, Copilot), internal agents, CI/CD pipelines, application features — everything must route through the proxy. Any uncovered source is a budget hole, and one hole is enough to break the attribution model.
[02]
Define the Team and Use-Case Taxonomy
The attribution schema chosen now decides what questions the org can answer in six months. Team attribution is the floor. Use-case attribution — copilot vs. agent vs. batch — is what lets you have a real conversation about value per dollar instead of a frustrated one about totals.
[03]
Set Budgets From Baseline, Not From Intuition
Do not set budgets before the data exists. Run the proxy in observation mode for 60 days — log everything, enforce nothing. The baseline tells you what normal spend looks like. Set soft limits at roughly 130–150% of the 60-day baseline, hard cutoffs at 200%. A budget set from a guess will either block legitimate work or constrain nothing at all.
[04]
Build the Finance Reporting Export
Internal governance is for the VP Engineering and the platform team. The finance report is for the CFO and business unit leads. Two audiences, two outputs. The proxy database holds everything you need — the work is the export and the schedule. The team that walks into the QBR with a cost-per-PR number owns the narrative. The team that scrambles for the data on Friday produces a number nobody trusts.
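Step 03 above reduces to a small calculation once the observation window has data in it. A minimal sketch, assuming daily spend totals per team are available from the proxy; the function name and the 30-day month are illustrative choices:

```python
from statistics import mean


def limits_from_baseline(daily_spend_usd, soft_pct=1.4, hard_pct=2.0, days_in_month=30):
    """Derive monthly soft and hard limits from observed daily spend.

    soft_pct=1.4 sits inside the 130-150% band suggested above; hard_pct=2.0
    is the 200% cutoff. Both are knobs, not fixed values.
    """
    if len(daily_spend_usd) < 60:
        raise ValueError("need at least 60 days of baseline data")
    monthly_baseline = mean(daily_spend_usd) * days_in_month
    return {
        "baseline": round(monthly_baseline, 2),
        "soft_limit": round(monthly_baseline * soft_pct, 2),
        "hard_limit": round(monthly_baseline * hard_pct, 2),
    }


# Hypothetical team averaging $20/day across the observation window.
limits = limits_from_baseline([20.0] * 60)
print(limits)
```

Refusing to compute limits from fewer than 60 days is deliberate: it keeps a budget from being set on a partial baseline just because the data happened to be there.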

Anomaly Detection: The Signal That Arrives Before the Cap

A budget limit is a hard stop. Anomaly detection is the page that fires while there is still time to investigate, not just block.

Budget enforcement is a hard stop. Anomaly detection is an early warning. You want both. A hard stop at month-end tells you the budget is gone. It tells you nothing about which workflow consumed it. An anomaly alert at 3x daily baseline gives a live signal — investigate now, while the cause is still on someone's screen.

The detection model is simple. Compute the rolling 7-day daily-spend average per team. Compare today's spend to that average at 4 PM. Page the team lead when the ratio crosses your threshold. A 3x spike on a Tuesday afternoon almost always has a specific cause — a new agent workflow shipped that morning, a CI pipeline accidentally triggering agents on every push, a developer manually running a large batch job.
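The detection logic described above fits in a few lines. This is a sketch under stated assumptions: daily totals per team are already available, the names are illustrative, and the 3x default mirrors the threshold used in this piece.

```python
from statistics import mean


def spike_ratio(prior_daily_spend, today_spend, window=7):
    """Ratio of today's spend to the mean of the last `window` full days."""
    recent = prior_daily_spend[-window:]
    if len(recent) < window:
        return None  # not enough history to judge
    baseline = mean(recent)
    if baseline == 0:
        return None  # no baseline spend; avoid dividing by zero
    return today_spend / baseline


def should_page(prior_daily_spend, today_spend, threshold=3.0):
    """True when today's spend exceeds `threshold` times the rolling average."""
    ratio = spike_ratio(prior_daily_spend, today_spend)
    return ratio is not None and ratio > threshold


# Hypothetical team: about $40/day for a week, then $130 by 4 PM today.
history = [38, 42, 40, 41, 39, 40, 40]
print(should_page(history, 130))  # ratio is 130 / 40 = 3.25
```

Returning `None` for a short or empty history keeps brand-new teams from paging on their first real day of usage.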

LiteLLM exports spend data via its API and a Prometheus metrics endpoint. Grafana with a Prometheus data source is sufficient for the alerting layer. No need for Datadog unless it is already in the stack. The flow from spend data to alert is in the diagram below.

One operational mistake from our own rollout: we set the anomaly threshold at 2x instead of 3x. The false positive rate was high enough that team leads started ignoring alerts inside two weeks — alert fatigue, the standard failure mode. At 3x, alerts fire roughly twice a month per team and are almost always actionable. Calibrate the threshold to what team leads will actually investigate, not to what catches every minor fluctuation. The alert that gets ignored is worse than no alert.

Spend Spike Detected at 4 PM, Not at Month-End

Today's spend vs. the rolling 7-day average. 3x triggers a Slack page to the team lead with model and use-case breakdown attached.

grafana_alert.yaml

# Token-spend spike detection. Fires when a team's trailing-24h spend exceeds 3x its prior 7-day daily average.
# Sketch only: datasourceUid fields are omitted, and the metric name should be checked against your proxy's /metrics output.
apiVersion: 1
groups:
  - orgId: 1
    name: ai-token-governance
    folder: engineering-costs
    interval: 1h
    rules:
      - uid: token-spike-alert
        title: AI Token Spend Spike
        condition: D
        data:
          - refId: A
            # Spend per team over the trailing 24 hours (the metric is a cumulative counter, so use increase())
            relativeTimeRange:
              from: 86400
              to: 0
            model:
              expr: sum by (team_alias) (increase(litellm_spend_usd_total[1d]))
              instant: true
          - refId: B
            # Daily average per team over the 7 days before that window
            relativeTimeRange:
              from: 86400
              to: 0
            model:
              expr: sum by (team_alias) (increase(litellm_spend_usd_total[7d] offset 1d)) / 7
              instant: true
          - refId: C
            queryType: expression
            model:
              type: math
              expression: $A / $B   # ratio: today vs. daily avg
          - refId: D
            queryType: expression
            model:
              type: threshold
              expression: C
              conditions:
                - evaluator:
                    type: gt
                    params: [3]   # page at 3x daily baseline
        noDataState: NoData
        for: 30m
        annotations:
          summary: >-
            Token spike: {{ $labels.team_alias }} is at
            {{ printf "%.1f" $values.C.Value }}x daily average
        labels:
          severity: warning

Four Numbers That End the Finance Conversation

Finance does not need to understand tokens. They need a cost-per-outcome number and a trend line that fits in a board deck. Anything else is noise.

Cost per merged PR

Total token spend ÷ merged PRs per team per month. The number that translates AI cost into shipped work.

Spend trend vs. output trend

Cost-per-PR over time. Falling = the investment is compounding. Rising = something needs a routing audit.

Budget utilization by team

Which teams sit at 40%, which at 95%. Allocation accuracy is the governance story finance actually wants.

Model mix by use case

Share of spend by model tier. A high Opus share on autocomplete is a routing failure, not a budget failure.

The line that wins the CFO conversation: cost per merged PR is the number that makes AI spend legible to finance. A team spending $3,200 last month and shipping 85 PRs lands at $37.65/PR. A comparable team spending $1,100 and shipping 20 PRs lands at $55/PR. The first team is more efficient even though absolute spend is higher. The framing flips the conversation from "AI is expensive" to "here is the ROI on the AI investment, and here is the trend."
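The metric is a single division, which is part of why finance trusts it. A minimal sketch using the two teams from the example above; the function name is illustrative:

```python
def cost_per_pr(token_spend_usd: float, merged_prs: int) -> float:
    """Monthly token spend divided by merged PRs for the same team and month."""
    if merged_prs == 0:
        raise ValueError("no merged PRs this period; cost per PR is undefined")
    return round(token_spend_usd / merged_prs, 2)


# The two teams from the example above.
team_a = cost_per_pr(3200, 85)
team_b = cost_per_pr(1100, 20)
print(team_a, team_b)
```

The hard part is not the arithmetic but the join: spend has to be attributed to the same team and month as the PR count, which is what the proxy metadata provides.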

Build this report before finance asks for it. The org that walks into the quarterly review with a cost-per-outcome dashboard owns the narrative. The org that gets asked for the data scrambles for three days and produces a number nobody trusts. Trust on the cost story is built in advance or not at all.

The DX 2026 survey found 86% of engineering leaders unsure which AI tools deliver the most value, and 40% lacking the data to demonstrate ROI^[2]. The token governance architecture solves both problems with one piece of infrastructure — the cost data falls out of the proxy, and the ROI data falls out of the proxy correlated with output metrics.

What VPs and CFOs Will Push Back On

The same five objections come up in every token governance conversation. Here are the answers.

Won't hard limits block engineers at the worst possible moment?

Soft limits handle this. Alert at 80% of the team's budget, amber at 110%, hard stop at 150%. The engineers most likely to hit a hard limit are running unbounded agent loops the soft limit would have caught hours earlier. Design the tiers correctly and almost all legitimate work stays in the green. The hard stop exists for the runaway loop, not for the daily commit.
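The tiers can be expressed as a single classification function. This sketch assumes the percentages are measured against the team's monthly budget, and the action attached to each tier in the comments is an illustrative assumption rather than a prescribed policy.

```python
def budget_status(spent_usd: float, budget_usd: float) -> str:
    """Map spend to a tier using the 80% / 110% / 150% thresholds above."""
    utilization = spent_usd / budget_usd
    if utilization >= 1.5:
        return "hard-stop"   # block further calls
    if utilization >= 1.1:
        return "amber"       # over budget; escalate to the budget owner
    if utilization >= 0.8:
        return "alert"       # notify the team lead
    return "green"


# Hypothetical team with an $800 monthly budget.
for spent in (400, 700, 900, 1200):
    print(spent, budget_status(spent, 800))
```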

What about teams that need to run large batch jobs that temporarily spike spend?

Budget exemptions work the same way they do in cloud FinOps. Team lead requests a temporary increase for a specific window and use case. VP Engineering approves. The proxy gets one API call. The process creates an audit trail — who asked, why, what they ran — which feeds future budget planning. The alternative is unlimited spend with no record of who consumed it. That alternative is what you have now.
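One way to keep the exemption process auditable is to model each exemption as a record with a requester, an approver, a reason, and a time window, and to compute the effective budget from those records. The class, fields, and example values below are hypothetical; they sketch the shape of the audit trail rather than any proxy's actual API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BudgetExemption:
    team: str
    requested_by: str
    approved_by: str
    reason: str
    extra_budget_usd: float
    starts: date
    ends: date

    def active_on(self, day: date) -> bool:
        return self.starts <= day <= self.ends


def effective_budget(base_usd: float, exemptions, team: str, day: date) -> float:
    """Base budget plus any exemptions active for this team on this day."""
    extra = sum(e.extra_budget_usd for e in exemptions
                if e.team == team and e.active_on(day))
    return base_usd + extra


# Hypothetical one-week exemption for a batch job.
backfill = BudgetExemption(
    team="payments-squad", requested_by="team-lead", approved_by="vp-engineering",
    reason="one-off batch re-classification", extra_budget_usd=400,
    starts=date(2026, 3, 16), ends=date(2026, 3, 20),
)
print(effective_budget(800, [backfill], "payments-squad", date(2026, 3, 18)))
```

Because the exemption expires on its own end date, nobody has to remember to lower the budget again after the batch job finishes.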

Do we need this if we only use seat-licensed tools?

If every engineer is on a fixed-seat Copilot plan and nothing touches the API layer, the seat license is a predictable line item and the proxy is overkill. The moment any team starts using Claude Code, LangChain agents, or hitting Anthropic/OpenAI APIs directly, you need the attribution layer. The trend is one-directional toward API access as engineers move from copilots to agents. Build the proxy before the API surface area expands, not after.

How do we handle engineers who feel surveilled by the dashboards?

Frame it correctly from day one. Individual-level dashboards are for the engineer's benefit — they can see whether they are on track for the month before a hard limit hits. Team-level dashboards are for planning. The goal is not to catch anyone doing something wrong; it is to give every team a predictable budget to plan against. Engineers at well-run orgs do not feel surveilled by CloudWatch cost alerts. Token budgets are no different when the framing is right and the data is shared, not hidden.

We already run Datadog. Do we need Grafana too?

No. LiteLLM exports spend via Prometheus metrics, and Datadog ingests those directly through its Prometheus integration. Build the dashboards alongside existing observability. The Grafana reference in the architecture is illustrative — any metrics layer works. The proxy is the load-bearing component. The visualization tool is interchangeable.

Token Governance Implementation Checklist

Every LLM call source inventoried — engineering tools, agents, CI/CD, product features
LiteLLM (or Portkey) proxy deployed with team-key authentication
60-day baseline observation completed before any hard limit was set
Team virtual keys minted with monthly budgets and cost-center metadata
use_case tagging convention enforced across every LLM call site
Model routing hook live — Haiku for autocomplete, Sonnet for review, Opus for agents
Anomaly alert wired for 3x daily-spend spike per team
Weekly cost-per-PR report shipping to finance and business unit leads
Budget exemption process documented for legitimate spike workloads
Per-engineer spend dashboard shared so engineers see their own number

Hard Rules for AI Token Governance

[01]

No hard limit without 60 days of baseline data underneath it

A budget set from intuition either blocks legitimate work or constrains nothing. Baseline data gives the right number. Two months of observation costs almost nothing. A misset limit that blocks a deploy at 4 AM costs trust and incident time, and the trust is harder to recover than the time.

[02]

Every LLM call routes through the proxy. No exceptions, no exemptions, no 'just for now.'

One team with a hardcoded API key bypassing the proxy breaks the entire attribution model. Partial coverage is not coverage. A spike from an untracked source is indistinguishable from a tracking failure, and the longer it persists the more the cost data becomes a fiction. Treat unproxied keys the same way you treat unapproved cloud credentials.

[03]

Cost-per-PR is the CFO metric. Total spend and tokens consumed are not.

Total spend rises as the team does more. That is growth, not a problem. The metric finance cares about is efficiency: more output per dollar, over time. Cost-per-PR captures it. Present that number in QBRs unless finance explicitly asks for the breakdown underneath.

[04]

Routing is a control, not a suggestion

If model selection sits with the individual engineer, every call lands on the most capable model regardless of the task. Routing rules embedded in the proxy are non-negotiable. Document the policy, name the reasoning, enforce it programmatically. A policy doc that nobody reads does not change behavior. A pre-call hook that overrides the model does.

Key terms in this piece

AI token budget · LLM cost governance · token spend attribution · AI FinOps engineering · LiteLLM team budgets · AI tooling P&L

Sources

[1] Engineers Should Spend $250K on AI Tokens — Mid-Size Repo Hit $150 in 48 Hours (Medium, 2026). medium.com
[2] How Are Engineering Leaders Approaching 2026 AI Tooling Budgets? (DX, 2026). getdx.com
[3] Setting Team Budgets — LiteLLM Documentation. docs.litellm.ai
[4] Spend Tracking — LiteLLM Documentation. docs.litellm.ai
[5] State of FinOps 2026 Report — FinOps Foundation. data.finops.org
[6] Tokenmaxxing: The Costly Mistake in AI Engineering Metrics (2026). itsmeduncan.com
[7] How Token-Based AI Coding Tools Impact Engineering Budgets (Exceeds AI). blog.exceeds.ai
[8] FinOps for AI Overview — FinOps Foundation. finops.org
[9] Multi-Tenant Architecture with LiteLLM — LiteLLM Documentation. docs.litellm.ai
