The bill arrives from Anthropic or OpenAI, and the number is wrong. Not wrong as in the math does not add up; the arithmetic is perfectly accurate. Wrong as in nobody budgeted for this during architecture review. A team running 50,000 requests per day on Claude Sonnet is spending roughly $22,500 per month on input tokens alone, assuming an average 5,000 input tokens per request at $3/1M. Route 65% of that traffic to Claude Haiku at $0.80/1M and the same volume costs closer to $11,800 per month. Nobody designed that routing. Nobody decided to pay the premium. It just happened, because the default is always the capable model.
The pattern is structural, not accidental. Teams pick a frontier model for development — sensibly, because frontier models surface edge cases faster and fail more expressively. Then the feature ships. Traffic grows. Six months later, the model choice is load-bearing infrastructure, the billing alarm fires, and the engineers who built it are in a different org. Retrofitting routing now means a migration that often costs more in engineering time than the accumulated overspend.
A heterogeneous LLM stack routes each request to the cheapest model that meets its quality requirement, rather than sending all traffic to a single default model. The mechanism is a task profile audit: an explicit map of which task types live in the system, what quality bar each requires, and which model tier meets that bar at minimum cost. Teams that build this before launch consistently report 40–85% lower inference spend from day one [1][2]. That is not a post-launch optimization — it is a design decision made before writing the first prompt.
- Claude Opus 4 costs $15/1M input tokens vs. GPT-4o-mini at $0.15/1M, a 100x price gap; the capability gap on most real tasks is far smaller [5]
- Classification, extraction, formatting, and consistent-domain Q&A make up the majority of real workloads, all of it small-model territory [3]
- Below roughly 10,000 queries per month, a classifier router's token overhead erases the savings; rule-based routing is cheaper [4]
The Billing Surprise Is a Design Failure
Three structural reasons teams arrive at budget overruns — none of them fixable after launch without migration cost.
Development economics favor capability over cost. During prototyping, frontier models produce better outputs faster, surface problems earlier, and fail expressively — the error is legible, not silent. Those properties have high value when you are figuring out the right behavior. They have zero value in production when the task is a JSON extraction you have run ten thousand times.
The cost signal arrives too late. API costs aggregate to monthly bills. A model choice made on day one generates no visible feedback until month two or three, by which point the architecture has calcified. Every new feature built on top of the expensive default extends the blast radius of the original decision. The compounding is slow enough to be invisible until it is not.
Nobody owns the routing decision. Whoever writes the first prompt picks the model. That choice does not go through architecture review. It has no owner. It defaults to whatever the documentation examples show — typically the highest-capability tier — and becomes permanent by inertia. Cost-first architecture requires treating routing ownership the way you treat database schema ownership: explicitly assigned, reviewed before merge, not implicit.
The retrofit path:
- Pick the model that works during development
- Ship without a cost model
- Wait for the billing alarm six months later
- Retrofit routing into working production code
- Migration cost: 4–8 weeks of refactoring [2]

The cost-first path:
- Audit task types and quality requirements before coding
- Run synthetic profiling to validate model tier thresholds
- Ship with routing configured and cost model validated
- Per-task cost attribution from the first deploy
- Routing updates: configuration change, no code deploy needed
Task Profile Audit: What You Build Before Writing Prompts
The routing decision lives in a document before it lives in code. Three questions per task determine which model tier handles it.
The task profile audit is a structured list of every discrete operation your LLM system performs. For each task, you answer three questions: What is the input distribution? What constitutes an acceptable output? What is the downstream consequence of a wrong answer?
Those three answers determine the model tier — and the third question is the one teams consistently skip. Task complexity and downstream consequence are not the same variable. A simple classification that feeds an automated refund decision needs more careful model selection than a complex synthesis that a human reviews before acting. Model tier selection is not only a cost decision — it is a risk calibration. The blast radius column in your audit is as important as the complexity column.
A short prompt does not equal a low-stakes task. A long, nuanced prompt does not always require a frontier model. Map them separately. Teams that classify only by input complexity and skip the consequence column build routing configs that look cheap on paper and expensive in production incident reports.
| Task Type | Default Tier | Routing Signal | Escalation Trigger |
|---|---|---|---|
| Classification / intent detection | 1 — Small ($0.15–$0.80/1M) | Closed label set, short input | Label outside known set |
| Entity extraction / structured output | 1 — Small ($0.15–$0.80/1M) | Fixed schema, schema-validatable | Schema validation failure rate > 5% |
| Summarization (consistent domain) | 2 — Mid ($3–$10/1M) | Variable length, quality check needed | Quality score below threshold |
| RAG answer generation | 2 — Mid ($3–$10/1M) | Context-dependent, domain-specific | Multi-hop or open-domain queries |
| Multi-step reasoning | 3 — Frontier ($15–$75/1M) | Ambiguous inputs, multi-hop logic | — |
| Novel code generation | 3 — Frontier ($15–$75/1M) | High blast radius on errors | — |
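As a sketch of what the audit artifact can look like before any routing code exists, the profile can live as a small, versioned data structure rather than a spreadsheet. The field names and example entries below are illustrative assumptions, not a standard schema; note how the refund task routes up on blast radius despite low complexity, per the point above.

```python
# task_profiles.py -- illustrative sketch of a task profile audit as versioned data.
# Field names and example entries are assumptions, not a standard schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskProfile:
    task_type: str     # discrete operation the LLM system performs
    input_shape: str   # what the input distribution looks like
    quality_bar: str   # what constitutes an acceptable output
    blast_radius: str  # downstream consequence of a wrong answer
    tier: str          # cheapest model tier that meets the bar

AUDIT = [
    TaskProfile("intent_classification", "short text, closed label set",
                "exact match against label set", "low: user retries next turn", "tier1"),
    TaskProfile("refund_extraction", "structured ticket text",
                "schema-valid, field-accurate", "high: triggers automated refund", "tier2"),
    TaskProfile("weekly_report_synthesis", "long multi-source context",
                "rubric-graded", "low: human reviews before acting", "tier2"),
]
```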
Routing Architecture: How the Three Tiers Work and What Breaks Between Them
Three routing strategies with different tradeoffs. The right one depends on how deterministic your task types are.
The three-tier model is a practical approximation covering most production workloads. Small models (GPT-4o-mini, Claude Haiku, Gemini Flash) run at $0.15–$0.80 per million input tokens. Mid-tier (Claude Sonnet, GPT-4o) at $3–$10. Frontier (Claude Opus, o-series reasoning models) at $15–$75 [5]. The price gap between frontier and small is roughly 100x. The capability gap on most real-world tasks is not 100x — it is closer to 5–15%, and often zero.
Rule-based routing uses application-layer metadata — task type tag, input length, structured-vs-open flag — to assign tiers without an additional LLM call. Zero added latency, zero added cost, roughly 60–75% classification accuracy [4]. The right default for systems where task type is deterministic (the same application path always produces the same task type). Most systems have more deterministic task distribution than engineers assume — the classification already lives in your route handlers, it just has not been wired to the model selector.
Content-based classification analyzes the actual prompt text using a small classifier — an embedding model, a BERT-scale model, or a cheap LLM call — to assign complexity. 75–93% accuracy, $0.001–$0.003 per query overhead [4]. Use this when your application layer cannot reliably tag task type, or when a single endpoint receives genuinely mixed complexity.
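A minimal sketch of the cheap-LLM-call variant, using litellm's provider-agnostic `completion` API. The classifier prompt, label set, and model choice are assumptions made to illustrate the shape, not a prescribed setup:

```python
# complexity_classifier.py -- content-based routing via a cheap LLM call.
# Prompt wording, label set, and classifier model are illustrative assumptions.
from litellm import completion

LABELS = {"simple": "tier1", "moderate": "tier2", "complex": "tier3"}

def classify_complexity(prompt: str) -> str:
    """Return a tier name for a raw prompt, defaulting to tier2 on ambiguity."""
    resp = completion(
        model="gpt-4o-mini",  # small model: the ~$0.001-0.003/query overhead cited above
        messages=[
            {"role": "system", "content": (
                "Label the user's request as exactly one of: simple, moderate, complex. "
                "simple = classification or extraction with a fixed answer shape; "
                "complex = multi-step reasoning or novel generation. Reply with the label only."
            )},
            {"role": "user", "content": prompt},
        ],
        max_tokens=4,
        temperature=0,
    )
    label = resp.choices[0].message.content.strip().lower()
    return LABELS.get(label, "tier2")  # unknown label -> safe middle tier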
Cascade routing sends requests to the cheapest tier first, then escalates on quality failure. Highest accuracy (because failure is empirically measured, not predicted), highest latency, most complex failure modes. Reserve it for workloads with reliable programmatic quality signals and users who can tolerate 2–3 second escalation delays. Customer-facing interactive features are usually the wrong fit for cascade routing. Background processing pipelines are the right one.
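A cascade sketch for a background pipeline with a programmatic quality signal. The schema-validation check stands in for whatever oracle your task defines, and the model IDs are illustrative:

```python
# cascade.py -- cheapest-first cascade with escalation on a quality failure.
# validate_output is a stand-in for your task's quality oracle; model IDs illustrative.
import json
from litellm import completion

TIERS = ["claude-haiku-4-5", "claude-sonnet-4-5"]  # cheapest first

def validate_output(text: str) -> bool:
    """Example oracle: output must parse as JSON with the required keys."""
    try:
        data = json.loads(text)
        return {"customer_id", "amount"} <= data.keys()
    except (json.JSONDecodeError, AttributeError):
        return False

def cascade(messages: list) -> str:
    last = ""
    for model in TIERS:
        resp = completion(model=model, messages=messages, temperature=0)
        last = resp.choices[0].message.content
        if validate_output(last):
            return last  # cheapest tier that passed the quality gate
    return last  # top tier's answer; flag upstream if it also failed
```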
Profile Without Production Data: Running Synthetic Benchmarks Before Launch
You need traffic data to configure routing, but you need routing configured before traffic arrives. Synthetic profiling resolves this.
The standard routing advice assumes you have production traffic logs to analyze. You do not, on day one. Synthetic profiling fills that gap — and most competing articles on LLM routing skip it entirely.
The method: collect 200–500 representative prompts from analogous systems, internal dogfooding, or hand-constructed edge-case scenarios. Run them through each model tier in a staging environment. Apply your quality criteria to each output. The result is an empirical routing threshold — what percentage of your task corpus does Tier 1 handle to acceptable quality? — measured on your actual prompt distribution before a single user touches the system.
This is harder than it sounds for one reason: you must define "acceptable" before you start. Teams that skip this definition end up calibrating routing thresholds to a gestalt of "looks good," which does not survive engineer turnover or model upgrades. The quality oracle — the acceptance criteria applied to each output in the profiling run — becomes the canonical definition of correctness for the task. Write it down. Version it. Treat it like a spec.
1. Extract a representative task sample. Pull 200–500 prompts from analogous systems or synthesize inputs covering your edge cases. Weight the sample to reflect expected production distribution: if 70% of real traffic will be short classification requests, 70% of your sample should be too. If you have no prior data, create 50–100 prompts per task category and include adversarial inputs.
2. Define the quality oracle before running anything. For structured outputs: schema validity plus field accuracy against labeled answers. For open-ended outputs: a rubric with a grading model (a cheap LLM call) or human review of a stratified sample. The oracle becomes the production quality gate. Write it as code, not prose.
3. Run the profiling battery across all tiers. Submit each sample through Tier 1, Tier 2, and Tier 3 in isolation. Log output, latency, and token counts per tier per prompt. Run tiers in parallel to minimize wall-clock time. Expect 30–90 minutes for a 500-prompt sample. (A minimal sketch of steps 2–4 follows this list.)
4. Measure quality per tier and set routing thresholds. Apply the oracle to each output. Record pass rate by tier and task type. Where Tier 1 achieves a 95%+ pass rate, route there by default; at 85–95%, consider Tier 1 with quality monitoring; below 85%, route to Tier 2. These thresholds are starting points; calibrate them against your acceptable quality floor, not industry benchmarks.
5. Build the pre-launch cost projection. Apply your expected production traffic distribution to the measured pass rates and token counts. This is your cost model: grounded in real quality measurements on your actual prompts. Share it with stakeholders before the first deploy. If the number is wrong, better to know before the system is live.
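The sketch promised above, covering steps 2 through 4 under simplifying assumptions: a schema-validity oracle, a corpus of prompts paired with labeled answers, and litellm as the client. Model IDs and the corpus format are illustrative; a real run would parallelize the tier calls as step 3 suggests.

```python
# profiling_battery.py -- run a prompt corpus across tiers, measure pass rate per tier.
# Oracle, corpus format, and model IDs are illustrative assumptions.
import json
from collections import defaultdict
from litellm import completion

TIERS = {"tier1": "gpt-4o-mini", "tier2": "gpt-4o", "tier3": "o3"}

def oracle(output: str, expected: dict) -> bool:
    """Step 2: executable acceptance criteria. Here: valid JSON, fields match labels."""
    try:
        got = json.loads(output)
    except json.JSONDecodeError:
        return False
    return all(got.get(k) == v for k, v in expected.items())

def run_battery(corpus: list[dict]) -> dict:
    """Steps 3-4. Corpus items: {"task_type": ..., "messages": [...], "expected": {...}}."""
    passes, totals = defaultdict(int), defaultdict(int)
    for item in corpus:
        for tier, model in TIERS.items():
            resp = completion(model=model, messages=item["messages"], temperature=0)
            key = (item["task_type"], tier)
            totals[key] += 1
            passes[key] += oracle(resp.choices[0].message.content, item["expected"])
    # Pass rate per (task_type, tier); >=95% on tier1 -> route there by default.
    return {key: passes[key] / totals[key] for key in totals}
```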
Context window as a feasibility ceiling: Small models have smaller effective context windows. GPT-4o-mini supports 128K tokens; Claude Haiku supports 200K. For most tasks this does not matter. For tasks involving long documents, multi-turn conversation history, or large retrieval contexts, routing to a small model is not just a quality question; it is a feasibility question. The request that works on Sonnet fails on GPT-4o-mini because the context simply does not fit.
Teams that discover this mid-production end up with two bad options: limit context to the smallest model in the fleet (wasting capability on every high-tier task) or add a second routing dimension — complexity plus context length — that was not in the original design. Build a context-length guard into the classifier from the start. Any request exceeding 80% of the small model's context window should bypass Tier 1 automatically.
Output tokens compound faster than input tokens at scale: Development prompts tend to produce short outputs. Production traffic does not. A frontier model generating 2,000 output tokens per request costs roughly $0.15 per call at Opus pricing ($75/1M output). The same request on GPT-4o-mini costs $0.0012. The task profile audit should capture expected output length explicitly — not just input complexity. Generative tasks with open-ended outputs need output-length-aware routing, not just input-length-aware routing.
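To make the projection output-length-aware, step 5's cost model can price input and output tokens separately. A minimal sketch, with all volumes illustrative and per-1M-token prices as quoted in this article:

```python
# cost_projection.py -- projected monthly spend from profiling measurements.
# Traffic volumes are illustrative; prices are the per-1M-token figures quoted above.
PRICE = {  # tier -> (input $/1M, output $/1M)
    "tier1": (0.15, 0.60),
    "tier2": (3.00, 10.00),
    "tier3": (15.00, 75.00),
}

def monthly_cost(traffic: dict) -> float:
    """traffic: tier -> (requests_per_month, avg_input_tokens, avg_output_tokens)."""
    total = 0.0
    for tier, (reqs, tokens_in, tokens_out) in traffic.items():
        price_in, price_out = PRICE[tier]
        total += reqs * (tokens_in * price_in + tokens_out * price_out) / 1_000_000
    return total

# Example: 65% of 1.5M monthly requests on tier1, the rest on tier2,
# averaging 5,000 input / 500 output tokens per request.
print(round(monthly_cost({
    "tier1": (975_000, 5_000, 500),
    "tier2": (525_000, 5_000, 500),
})))  # -> 11524 (dollars per month, projected)
```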
One more failure mode: model upgrade cycles invalidate thresholds. Routing configurations calibrated when Claude Haiku cost $0.80/1M become economically wrong if the same capability tier reaches $0.20/1M in the next model generation. Build thresholds as configuration values, not hardcoded constants. You will update them quarterly.
```python
# router_config.py
from litellm import Router

# task_type drives routing -- no LLM classifier needed for deterministic task paths
TIER_MAP = {
    "classification": "tier1",
    "extraction": "tier1",
    "summarization": "tier2",
    "rag_answer": "tier2",
    "reasoning": "tier3",
    "code_generation": "tier3",
}

router = Router(
    model_list=[
        {"model_name": "tier1", "litellm_params": {"model": "claude-haiku-4-5-20251001"}},
        {"model_name": "tier2", "litellm_params": {"model": "claude-sonnet-4-6"}},
        {"model_name": "tier3", "litellm_params": {"model": "claude-opus-4-7"}},
    ],
    fallbacks=[{"tier1": ["tier2"]}],
    num_retries=1,
)

def log_cost_attribution(task_type: str, model: str, usage) -> None:
    # Stub: wire this to your metrics pipeline. Logging the served model on
    # every request is what makes per-task cost attribution possible.
    print(f"{task_type=} {model=} in={usage.prompt_tokens} out={usage.completion_tokens}")

def route_request(
    task_type: str,
    messages: list,
    context_tokens: int = 0,
) -> str:
    model = TIER_MAP.get(task_type, "tier2")  # tier2 as safe default
    # Context-length guard: bypass Tier 1 if the prompt exceeds 80% of the
    # smallest small-model window (128K tokens -> 102,400)
    if model == "tier1" and context_tokens > 102_400:
        model = "tier2"
    response = router.completion(
        model=model,
        messages=messages,
        metadata={"task_type": task_type},
    )
    # Log the actual model served -- required for per-task cost attribution
    log_cost_attribution(task_type, response.model, response.usage)
    return response.choices[0].message.content
```

Build vs. Buy the Routing Layer
The build-from-scratch router is defensible in a narrow set of cases. Every other case has a cheaper answer.
Building it yourself:
- Full control over routing logic and quality evaluation
- Custom cost attribution and reporting structure
- No per-request proxy overhead added
- 4–8 weeks to production quality for a senior engineer [2]
- Ongoing maintenance: model API changes, failover logic, provider updates
- Justified when: specialized routing logic, regulatory audit-trail requirements, or existing internal proxy infrastructure

Buying a routing gateway:
- Routing, fallback, and attribution ready in hours, not weeks
- Multi-provider failover included; one provider outage does not take the system down
- Cost dashboards and per-endpoint attribution out of the box
- Routing logic is the vendor's maintenance problem, not yours
- Small per-request overhead (negligible above 10K queries/month)
- Justified for: most product teams, most workloads, and all teams without a dedicated platform team
What Good Looks Like Before the First Request Hits Production
Routing configured post-launch is a retrofit. Routing configured pre-launch is infrastructure.
Cost-Aware LLM Stack: Pre-Launch Readiness
- Task types enumerated in the profile audit before any routing code is written
- Quality oracle defined for each task type, written as executable code, not prose
- Synthetic profiling run completed; thresholds set from measured quality, not vendor benchmarks
- Cost model built from profiling results and shared with stakeholders before launch
- response.model logged on every request: not sampled, not only on errors
- Per-task-type cost attribution active from day one, not aggregate billing only
- Routing thresholds stored as configuration, not hardcoded in application logic
- Context-length guard in place before routing to small models
- Quality monitoring active in production; escalation rate tracked per task type
- Alert configured when the Tier 1 escalation rate exceeds the profiling baseline by more than 15%
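The last item reduces to a comparison against the baseline measured during profiling. A minimal sketch, assuming per-task baselines from the synthetic run and leaving the alerting plumbing to your monitoring stack:

```python
# escalation_alert.py -- flag drift from the profiling-run escalation baseline.
# Baseline values and the 15% margin are illustrative; wire the True case to your alerting.
BASELINE_ESCALATION = {"classification": 0.04, "extraction": 0.05}  # from profiling run

def escalation_alert(task_type: str, escalated: int, total: int, margin: float = 0.15) -> bool:
    """True when the observed escalation rate exceeds baseline by more than `margin`."""
    rate = escalated / max(total, 1)
    baseline = BASELINE_ESCALATION.get(task_type, 0.05)
    return rate > baseline * (1 + margin)
```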
How do I know which tasks are simple before I have production data?
You do not know with certainty — which is why synthetic profiling exists. Start with the task profile audit: classify by schema structure (well-defined vs. open-ended output), output length, and downstream consequence. For tasks in the gray zone, run synthetic benchmarks against your own prompt samples before launch. Rule of thumb: if you can write an evaluation script that checks output correctness mechanically — schema validation, field-level accuracy, exact-match classification — the task likely belongs in Tier 1.
What is the minimum query volume where routing makes economic sense?
For rule-based routing with zero classifier overhead: any volume. The only cost is engineering time to build the tier map, which is worth it at any scale. For content-based classifier routing: roughly 10,000 queries per month, where the savings from cheaper models exceed the classifier token overhead [4]. Below that threshold, a flat two-model strategy or rule-based routing is cheaper than a sophisticated classifier.
What happens when a Tier 1 model makes a mistake with downstream consequences?
The blast radius of a routing mistake is a function of task design, not just model selection. High-consequence tasks — those triggering financial transactions, customer-facing communications, or irreversible operations — need a quality gate regardless of model tier. Route them to Tier 1 if task complexity permits, but run the output through validation before the downstream action fires. Routing tier reduces inference cost. Quality gate reduces blast radius. These are separate responsibilities and must be designed separately.
Do I need to rebuild routing configuration when models update?
Yes, approximately quarterly. Model pricing shifts, capability improves, and new tiers appear. A routing threshold calibrated when Haiku cost $0.80/1M becomes economically wrong if the same capability tier costs $0.20/1M in the next generation. This is precisely why thresholds belong in configuration files, not hardcoded in application logic — so you can update them without a deployment when pricing or capability changes.
The bill you are looking at is a record of design decisions that had no cost dimension when they were made. The model was picked for capability during development, traffic grew, nobody updated the routing, and the accounting arrived six months late.
Cost-first architecture does not require a sophisticated ML router. It requires a task profile audit, a synthetic profiling run, and a configuration layer that routes by task type. Three weeks of work done before launch compounds into substantial savings as traffic scales. The measurement layer — logging response.model on every request, attributing cost per task type, alerting on escalation rate changes — is what keeps the system honest as models and pricing evolve.
Teams surprised by their quarterly bill skipped the audit. Teams that ran it do not get surprised.
- [1] Zylos Research — AI Agent Model Routing and Dynamic Model Selection Strategies (zylos.ai)
- [2] LLM Model Routing: The Complete Guide for Engineering Teams (promptunit.ai)
- [3] LLM Routing — Smart Model Selection for Cost and Quality (myengineeringpath.dev)
- [4] Kenny Tan — LLM Cost-Per-Query Optimization (uatgpt.com)
- [5] How Intelligent Model Routing Cuts LLM Costs by 30-60% (trovald.com)
- [6] Zylos Research — AI Agent Cost Optimization: Token Budgets, Model Routing, and Production FinOps (zylos.ai)
- [7] Architecting Cost-Aware LLM Workloads with Model Router in Microsoft Foundry (techcommunity.microsoft.com)
- [8] LLM Routing: How to Stop Paying Frontier Model Prices for Simple Queries (tianpan.co)