Cost-Aware LLM Architecture: Heterogeneous Model Routing from Day One

Your LLM Bill Is a Design Decision You Made Six Months Ago

Most teams architect for capability and optimize for cost after the invoice lands. Here is the playbook for building cost constraints in from day one: task profile audits, three-tier routing, and synthetic benchmarking before your first deploy.

AI Engineering Platform · advanced · May 12, 2026 · 6 min read

By Viktor Bezdek · VP Engineering, Groupon

The bill arrives from Anthropic or OpenAI, and the number is wrong. Not wrong as in the math does not add up; the math is perfectly accurate. Wrong as in nobody budgeted for this during architecture review. A team running 50,000 requests per day on Claude Sonnet is spending roughly $22,500 per month on input tokens alone, assuming an average of 5,000 input tokens per request at $3/1M. Route 65% of that traffic to Claude Haiku at $0.80/1M and the same volume costs closer to $11,800/month. Nobody designed that routing. Nobody decided to pay the premium. It just happened, because the default is always the capable model.

The pattern is structural, not accidental. Teams pick a frontier model for development — sensibly, because frontier models surface edge cases faster and fail more expressively. Then the feature ships. Traffic grows. Six months later, the model choice is load-bearing infrastructure, the billing alarm fires, and the engineers who built it are in a different org. Retrofitting routing now means a migration that often costs more in engineering time than the accumulated overspend.

A heterogeneous LLM stack routes each request to the cheapest model that meets its quality requirement, rather than sending all traffic to a single default model. The mechanism is a task profile audit: an explicit map of which task types live in the system, what quality bar each requires, and which model tier meets that bar at minimum cost. Teams that build this before launch consistently report 40–85% lower inference spend from day one ^[1]^[2]. That is not a post-launch optimization — it is a design decision made before writing the first prompt.

100x

Price spread between frontier and small models

Claude Opus 4 at $15/1M input vs. GPT-4o-mini at $0.15/1M — the capability gap on most real tasks is far smaller than this price gap ^[5]

60–70%

Share of production requests that do not require a frontier model

Classification, extraction, formatting, and consistent-domain Q&A make up the majority of real workloads — all small-model territory ^[3]

40–85%

Cost reduction teams report after routing implementation

Teams using systematic three-tier routing report this range without measurable quality degradation ^[1]^[6]

~10K/mo

Query volume where content-based routing pays for itself

Below this threshold, a classifier router's token overhead erases the savings — rule-based routing is cheaper ^[4]

The Billing Surprise Is a Design Failure

Three structural reasons teams arrive at budget overruns — none of them fixable after launch without migration cost.

Development economics favor capability over cost. During prototyping, frontier models produce better outputs faster, surface problems earlier, and fail expressively — the error is legible, not silent. Those properties have high value when you are figuring out the right behavior. They have zero value in production when the task is a JSON extraction you have run ten thousand times.

The cost signal arrives too late. API costs aggregate to monthly bills. A model choice made on day one generates no visible feedback until month two or three, by which point the architecture has calcified. Every new feature built on top of the expensive default extends the blast radius of the original decision. The compounding is slow enough to be invisible until it is not.

Nobody owns the routing decision. Whoever writes the first prompt picks the model. That choice does not go through architecture review. It has no owner. It defaults to whatever the documentation examples show — typically the highest-capability tier — and becomes permanent by inertia. Cost-first architecture requires treating routing ownership the way you treat database schema ownership: explicitly assigned, reviewed before merge, not implicit.

Default Architecture

Pick the model that works during development
Ship without a cost model
Wait for the billing alarm six months later
Retrofit routing into working production code
Migration cost: 4–8 weeks of refactoring ^[2]

Cost-First Design

Audit task types and quality requirements before coding
Run synthetic profiling to validate model tier thresholds
Ship with routing configured and cost model validated
Per-task cost attribution from the first deploy
Routing updates: configuration change, no code deploy needed

Task Profile Audit: What You Build Before Writing Prompts

The routing decision lives in a document before it lives in code. Three questions per task determine which model tier handles it.

The task profile audit is a structured list of every discrete operation your LLM system performs. For each task, you answer three questions: What is the input distribution? What constitutes an acceptable output? What is the downstream consequence of a wrong answer?

Those three answers determine the model tier — and the third question is the one teams consistently skip. Task complexity and downstream consequence are not the same variable. A simple classification that feeds an automated refund decision needs more careful model selection than a complex synthesis that a human reviews before acting. Model tier selection is not only a cost decision — it is a risk calibration. The blast radius column in your audit is as important as the complexity column.

A short prompt does not equal a low-stakes task. A long, nuanced prompt does not always require a frontier model. Map them separately. Teams that classify only by input complexity and skip the consequence column build routing configs that look cheap on paper and expensive in production incident reports.

Task Type	Default Tier	Routing Signal	Escalation Trigger
Classification / intent detection	1 — Small ($0.15–$0.80/1M)	Closed label set, short input	Label outside known set
Entity extraction / structured output	1 — Small ($0.15–$0.80/1M)	Fixed schema, schema-validatable	Schema validation failure rate > 5%
Summarization (consistent domain)	2 — Mid ($3–$10/1M)	Variable length, quality check needed	Quality score below threshold
RAG answer generation	2 — Mid ($3–$10/1M)	Context-dependent, domain-specific	Multi-hop or open-domain queries
Multi-step reasoning	3 — Frontier ($15–$75/1M)	Ambiguous inputs, multi-hop logic	—
Novel code generation	3 — Frontier ($15–$75/1M)	High blast radius on errors	—
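
The audit can live as data next to the code rather than in a spreadsheet. The sketch below is one possible shape, not a prescribed schema: `TaskProfile`, `Consequence`, and `needs_quality_gate` are hypothetical names, and the consequence values assigned to each task are illustrative assumptions. The point it demonstrates is that complexity (the tier) and downstream consequence are stored as separate fields.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Consequence(Enum):
    LOW = "low"    # a human reviews the output before anything happens
    HIGH = "high"  # the output triggers an automated or irreversible action


@dataclass(frozen=True)
class TaskProfile:
    name: str
    default_tier: int                  # 1 = small, 2 = mid, 3 = frontier
    routing_signal: str
    escalation_trigger: Optional[str]  # None where the table shows no trigger
    consequence: Consequence           # assumption: assigned per product, not per model

    @property
    def needs_quality_gate(self) -> bool:
        # Consequence is tracked separately from complexity: a Tier 1 task
        # can still need validation before its output is acted on.
        return self.consequence is Consequence.HIGH


AUDIT = [
    TaskProfile("classification", 1, "closed label set, short input",
                "label outside known set", Consequence.HIGH),
    TaskProfile("summarization", 2, "variable length, quality check needed",
                "quality score below threshold", Consequence.LOW),
    TaskProfile("code_generation", 3, "high blast radius on errors",
                None, Consequence.HIGH),
]

for task in AUDIT:
    print(task.name, task.default_tier, task.needs_quality_gate)
```

Here the classification task stays in Tier 1 but is flagged for a quality gate, which is the refund-decision case described above.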

Routing Architecture: How the Three Tiers Work and What Breaks Between Them

Three routing strategies with different tradeoffs. The right one depends on how deterministic your task types are.

The three-tier model is a practical approximation covering most production workloads. Small models (GPT-4o-mini, Claude Haiku, Gemini Flash) run at $0.15–$0.80 per million input tokens. Mid-tier (Claude Sonnet, GPT-4o) at $3–$10. Frontier (Claude Opus, o-series reasoning models) at $15–$75 ^[5]. The price gap between frontier and small is roughly 100x. The capability gap on most real-world tasks is not 100x — it is closer to 5–15%, and often zero.

Rule-based routing uses application-layer metadata — task type tag, input length, structured-vs-open flag — to assign tiers without an additional LLM call. Zero added latency, zero added cost, roughly 60–75% classification accuracy ^[4]. The right default for systems where task type is deterministic (the same application path always produces the same task type). Most systems have more deterministic task distribution than engineers assume — the classification already lives in your route handlers, it just has not been wired to the model selector.

Content-based classification analyzes the actual prompt text using a small classifier — an embedding model, a BERT-scale model, or a cheap LLM call — to assign complexity. 75–93% accuracy, $0.001–$0.003 per query overhead ^[4]. Use this when your application layer cannot reliably tag task type, or when a single endpoint receives genuinely mixed complexity.

Cascade routing sends requests to the cheapest tier first, then escalates on quality failure. Highest accuracy (because failure is empirically measured, not predicted), highest latency, most complex failure modes. Reserve it for workloads with reliable programmatic quality signals and users who can tolerate 2–3 second escalation delays. Customer-facing interactive features are usually the wrong fit for cascade routing. Background processing pipelines are the right one.
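
A minimal sketch of the cascade pattern, assuming each tier is wrapped in a plain Python callable and that a programmatic quality check exists. The `cascade` function and the stub models are hypothetical stand-ins for real API calls; each failed attempt in a real system adds a full round trip of latency.

```python
from typing import Callable, Sequence, Tuple

ModelFn = Callable[[str], str]


def cascade(
    prompt: str,
    tiers: Sequence[Tuple[str, ModelFn]],
    passes_quality: Callable[[str], bool],
) -> Tuple[str, str, int]:
    """Try tiers cheapest-first; return (tier_name, output, attempts)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    for attempts, (name, call_model) in enumerate(tiers, start=1):
        output = call_model(prompt)
        if passes_quality(output):
            break  # cheapest tier that meets the quality bar
    # If no tier passed, this is the answer from the last (most capable) tier.
    return name, output, attempts


# Stub models standing in for real API calls (assumption for illustration).
def small_model(prompt: str) -> str:
    return "unknown"


def mid_model(prompt: str) -> str:
    return "refund_request"


LABELS = {"refund_request", "order_status"}

tier, answer, attempts = cascade(
    "Where is my money?",
    [("tier1", small_model), ("tier2", mid_model)],
    lambda out: out in LABELS,  # escalate when the label is outside the known set
)
print(tier, answer, attempts)  # tier2 refund_request 2
```

Logging `attempts` per request gives the escalation rate, which is the number that tells you whether Tier 1 is actually carrying the load.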

Three-Tier LLM Routing Flow

Request classification routes to the cheapest tier that meets quality requirements. Every response logs model and token counts for cost attribution.

Profile Without Production Data: Running Synthetic Benchmarks Before Launch

You need traffic data to configure routing, but you need routing configured before traffic arrives. Synthetic profiling resolves this.

The standard routing advice assumes you have production traffic logs to analyze. You do not, on day one. Synthetic profiling fills that gap — and most competing articles on LLM routing skip it entirely.

The method: collect 200–500 representative prompts from analogous systems, internal dogfooding, or hand-constructed edge-case scenarios. Run them through each model tier in a staging environment. Apply your quality criteria to each output. The result is an empirical routing threshold — what percentage of your task corpus does Tier 1 handle to acceptable quality? — measured on your actual prompt distribution before a single user touches the system.

This is harder than it sounds for one reason: you must define "acceptable" before you start. Teams that skip this definition end up calibrating routing thresholds to a gestalt of "looks good," which does not survive engineer turnover or model upgrades. The quality oracle — the acceptance criteria applied to each output in the profiling run — becomes the canonical definition of correctness for the task. Write it down. Version it. Treat it like a spec.
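
For a structured-output task, one way to write the oracle as code is schema validity plus field-level accuracy against labeled answers. The sketch below assumes a hypothetical three-field extraction schema; `REQUIRED_FIELDS`, `ORACLE_VERSION`, and the `oracle` function are illustrative names, not part of any library.

```python
import json

ORACLE_VERSION = "v1"  # hypothetical tag; bump it whenever the criteria change

# Hypothetical extraction schema: field name -> accepted Python type(s)
REQUIRED_FIELDS = {"order_id": str, "amount": (int, float), "currency": str}


def oracle(raw_output: str, expected: dict, min_field_accuracy: float = 1.0) -> bool:
    """Return True if the output is schema-valid and matches the labeled answer."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    # Schema validity: every required field is present with an accepted type.
    for field, field_type in REQUIRED_FIELDS.items():
        if not isinstance(parsed.get(field), field_type):
            return False
    # Field accuracy: share of fields that equal the labeled value.
    correct = sum(parsed[field] == expected[field] for field in REQUIRED_FIELDS)
    return correct / len(REQUIRED_FIELDS) >= min_field_accuracy


label = {"order_id": "A1", "amount": 12.5, "currency": "EUR"}
print(oracle('{"order_id": "A1", "amount": 12.5, "currency": "EUR"}', label))
print(oracle('{"order_id": "A1", "amount": 12.5, "currency": "USD"}', label))
print(oracle("Sure! Here is the JSON you asked for", label))
```

Because the function returns a plain boolean, the same code can score the profiling run and later serve as the production quality gate.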

1. Extract a representative task sample
Pull 200–500 prompts from analogous systems or synthesize inputs covering your edge cases. Weight the sample to reflect expected production distribution: if 70% of real traffic will be short classification requests, 70% of your sample should be too. If you have no prior data, create 50–100 prompts per task category and include adversarial inputs.

2. Define the quality oracle before running anything
For structured outputs: schema validity plus field accuracy against labeled answers. For open-ended outputs: a rubric with a grading model (a cheap LLM call) or human review of a stratified sample. The oracle becomes the production quality gate. Write it as code, not prose.

3. Run the profiling battery across all tiers
Submit each sample through Tier 1, Tier 2, and Tier 3 in isolation. Log output, latency, and token counts per tier per prompt. Run tiers in parallel to minimize wall-clock time. Expect 30–90 minutes for a 500-prompt sample.

4. Measure quality per tier and set routing thresholds
Apply the oracle to each output. Record pass rate by tier and task type. Where Tier 1 achieves 95%+ pass rate: route there by default. 85–95%: consider Tier 1 with quality monitoring. Below 85%: route to Tier 2. These thresholds are starting points. Calibrate them against your acceptable quality floor, not industry benchmarks.

5. Build the pre-launch cost projection
Apply your expected production traffic distribution to the measured pass rates and token counts. This is your cost model: grounded in real quality measurements on your actual prompts. Share it with stakeholders before the first deploy. If the number is wrong, better to know before the system is live.
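
Steps 4 and 5 reduce to a small amount of arithmetic. The sketch below applies the 95% and 85% starting thresholds to measured pass rates and projects monthly input spend. The traffic volumes, pass rates, and token counts in `PROFILING` are hypothetical; the per-million prices are the Haiku and Sonnet input prices quoted earlier. It ignores output tokens and escalations, so treat the result as a lower bound.

```python
def recommend_tier(tier1_pass_rate: float) -> str:
    """Map a measured Tier 1 pass rate to a starting routing decision."""
    if tier1_pass_rate >= 0.95:
        return "tier1"
    if tier1_pass_rate >= 0.85:
        return "tier1_with_monitoring"
    return "tier2"


def monthly_input_cost(requests: int, avg_input_tokens: int, price_per_million: float) -> float:
    """Input-token spend in dollars for one task type over a month."""
    return requests * avg_input_tokens * price_per_million / 1_000_000


PRICE_PER_MILLION = {"tier1": 0.80, "tier2": 3.00}  # $/1M input tokens

# task_type: (requests per month, Tier 1 pass rate, avg input tokens), all hypothetical
PROFILING = {
    "classification": (700_000, 0.97, 300),
    "summarization": (300_000, 0.80, 2_000),
}


def project(profiling: dict) -> list:
    rows = []
    for task, (requests, pass_rate, tokens) in profiling.items():
        decision = recommend_tier(pass_rate)
        billing_tier = "tier2" if decision == "tier2" else "tier1"
        cost = monthly_input_cost(requests, tokens, PRICE_PER_MILLION[billing_tier])
        rows.append((task, decision, cost))
    return rows


rows = project(PROFILING)
for task, decision, cost in rows:
    print(f"{task}: {decision}, ${cost:,.2f}/month")
print(f"projected input spend: ${sum(cost for _, _, cost in rows):,.2f}/month")
```

Keeping the thresholds inside one function makes them easy to move into configuration later.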

The Hidden Cost Multipliers That Break Routing Models

Two variables that make routing harder than the benchmarks suggest — and that compound in production when unaccounted for.

Context window as a feasibility ceiling: Small models have smaller effective context windows. GPT-4o-mini supports 128K tokens; Claude Haiku supports 200K. For most tasks this does not matter. For tasks involving long documents, multi-turn conversation history, or large retrieval contexts, routing to a small model is not just a quality question — it is a feasibility question. The request that works on Sonnet fails on Haiku because the context does not fit the configuration.

Teams that discover this mid-production end up with two bad options: limit context to the smallest model in the fleet (wasting capability on every high-tier task) or add a second routing dimension — complexity plus context length — that was not in the original design. Build a context-length guard into the classifier from the start. Any request exceeding 80% of the small model's context window should bypass Tier 1 automatically.

Output tokens compound faster than input tokens at scale: Development prompts tend to produce short outputs. Production traffic does not. A frontier model generating 2,000 output tokens per request costs roughly $0.15 per call at Opus pricing ($75/1M output). The same request on GPT-4o-mini costs $0.0012. The task profile audit should capture expected output length explicitly — not just input complexity. Generative tasks with open-ended outputs need output-length-aware routing, not just input-length-aware routing.
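
The per-call figures above follow from a one-line formula. A small sketch for checking them, where the $0.60/1M GPT-4o-mini output price is an assumption implied by the $0.0012 figure rather than a number quoted in this piece:

```python
def output_cost(output_tokens: int, price_per_million: float) -> float:
    """Dollar cost of the output tokens for a single call."""
    return output_tokens * price_per_million / 1_000_000


OPUS_OUTPUT_PRICE = 75.00  # $/1M output tokens, as quoted above
MINI_OUTPUT_PRICE = 0.60   # $/1M output tokens, implied by the $0.0012 figure

frontier = output_cost(2_000, OPUS_OUTPUT_PRICE)
small = output_cost(2_000, MINI_OUTPUT_PRICE)
print(f"frontier: ${frontier:.4f}  small: ${small:.4f}  ratio: {frontier / small:.0f}x")
```

On output tokens the spread between these two models is 125x, wider than the 100x input-price spread, which is why expected output length belongs in the audit.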

One more failure mode: model upgrade cycles invalidate thresholds. Routing configurations calibrated when Claude Haiku cost $0.80/1M become economically wrong if the same capability tier reaches $0.20/1M in the next model generation. Build thresholds as configuration values, not hardcoded constants. You will update them quarterly.

router_config.py

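import logging

from litellm import Router

logger = logging.getLogger(__name__)


def log_cost_attribution(task_type, model, usage):
    # Minimal stand-in so the module runs; replace with your metrics or billing sink.
    logger.info("llm_cost task_type=%s model=%s usage=%s", task_type, model, usage)
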

# task_type drives routing — no LLM classifier needed for deterministic task paths
TIER_MAP = {
    "classification": "tier1",
    "extraction": "tier1",
    "summarization": "tier2",
    "rag_answer": "tier2",
    "reasoning": "tier3",
    "code_generation": "tier3",
}

router = Router(
    model_list=[
        {"model_name": "tier1", "litellm_params": {"model": "claude-haiku-4-5-20251001"}},
        {"model_name": "tier2", "litellm_params": {"model": "claude-sonnet-4-6"}},
        {"model_name": "tier3", "litellm_params": {"model": "claude-opus-4-7"}},
    ],
    fallbacks=[{"tier1": ["tier2"]}],
    num_retries=1,
)


def route_request(
    task_type: str,
    messages: list,
    context_tokens: int = 0,
) -> str:
    model = TIER_MAP.get(task_type, "tier2")  # tier2 as safe default

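    # Context-length guard: bypass Tier 1 above 80% of a 128K window (102,400 tokens).
    # 128K is the smallest small-model window cited above (GPT-4o-mini), so this is
    # conservative for Claude Haiku's 200K. Keep the value in configuration.
    if model == "tier1" and context_tokens > 102_400:
        model = "tier2"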

    response = router.completion(
        model=model,
        messages=messages,
        metadata={"task_type": task_type},
    )

    # Log actual model served — required for per-task cost attribution
    log_cost_attribution(task_type, response.model, response.usage)
    return response.choices[0].message.content
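
The `log_cost_attribution` call above is where per-task attribution starts. One possible shape for what sits behind it is a small ledger that converts token counts into dollars per task type. This is a sketch under assumptions: `CostLedger` and the model keys are hypothetical, the output prices are illustrative, and it assumes the usage object exposes OpenAI-style `prompt_tokens` and `completion_tokens` counts.

```python
from collections import defaultdict

# $ per 1M tokens as (input, output). Illustrative values; keep real ones in config.
PRICES = {
    "tier1-model": (0.80, 4.00),
    "tier2-model": (3.00, 15.00),
}


class CostLedger:
    """Accumulates spend per task type from per-request token counts."""

    def __init__(self, prices: dict):
        self.prices = prices
        self.by_task = defaultdict(float)

    def record(self, task_type: str, model: str,
               prompt_tokens: int, completion_tokens: int) -> float:
        price_in, price_out = self.prices[model]
        cost = (prompt_tokens * price_in + completion_tokens * price_out) / 1_000_000
        self.by_task[task_type] += cost
        return cost


ledger = CostLedger(PRICES)
ledger.record("classification", "tier1-model", prompt_tokens=300, completion_tokens=20)
ledger.record("summarization", "tier2-model", prompt_tokens=2_000, completion_tokens=400)
for task, dollars in ledger.by_task.items():
    print(f"{task}: ${dollars:.6f}")
```

Recording the model that actually served the request, rather than the one requested, keeps the ledger accurate when a fallback fires.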

Build vs. Buy the Routing Layer

The build-from-scratch router is defensible in exactly one case. Every other case has a cheaper answer.

Build from Scratch

Full control over routing logic and quality evaluation
Custom cost attribution and reporting structure
No per-request proxy overhead added
4–8 weeks to production-quality for a senior engineer ^[2]
Ongoing maintenance: model API changes, failover logic, provider updates
Justified when: specialized routing logic, regulatory audit trail requirements, or existing internal proxy infrastructure

Routing Proxy (LiteLLM / Portkey / OpenRouter)

Routing, fallback, and attribution ready in hours, not weeks
Multi-provider failover included — one provider outage does not take the system down
Cost dashboards and per-endpoint attribution out of the box
Routing logic is the vendor's maintenance problem, not yours
Small per-request overhead (negligible above 10K queries/month)
Justified for: most product teams, most workloads, and all teams without a dedicated platform team

What Good Looks Like Before the First Request Hits Production

Routing configured post-launch is a retrofit. Routing configured pre-launch is infrastructure.

Cost-Aware LLM Stack: Pre-Launch Readiness

Task types enumerated in the profile audit before any routing code is written
Quality oracle defined for each task type — written as executable code, not prose
Synthetic profiling run completed; thresholds set from measured quality, not vendor benchmarks
Cost model built from profiling results and shared with stakeholders before launch
response.model logged on every request — not sampled, not only on errors
Per-task-type cost attribution active from day one, not aggregate billing only
Routing thresholds stored as configuration, not hardcoded in application logic
Context-length guard in place before routing to small models
Quality monitoring active in production — escalation rate tracked per task type
Alert configured when Tier 1 escalation rate exceeds profiling baseline by more than 15%
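
The last checklist item can be expressed as a single comparison. The sketch below reads "more than 15%" as a relative increase over the profiling baseline, which is an assumption; if you mean 15 percentage points, change the comparison accordingly. `escalation_alert` is a hypothetical helper name.

```python
def escalation_alert(baseline_rate: float, escalated: int, total: int,
                     tolerance: float = 0.15) -> bool:
    """True when the observed escalation rate exceeds the profiling baseline
    by more than `tolerance`, measured relative to the baseline."""
    if total == 0:
        return False  # no traffic yet, nothing to compare
    observed = escalated / total
    return observed > baseline_rate * (1 + tolerance)


# Hypothetical numbers: profiling showed 4% of Tier 1 requests escalating.
print(escalation_alert(0.04, escalated=45, total=1_000))  # False
print(escalation_alert(0.04, escalated=50, total=1_000))  # True
```

Run the check per task type, not on the aggregate, so that a regression in one task is not hidden by volume in another.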

How do I know which tasks are simple before I have production data?

You do not know with certainty — which is why synthetic profiling exists. Start with the task profile audit: classify by schema structure (well-defined vs. open-ended output), output length, and downstream consequence. For tasks in the gray zone, run synthetic benchmarks against your own prompt samples before launch. Rule of thumb: if you can write an evaluation script that checks output correctness mechanically — schema validation, field-level accuracy, exact-match classification — the task likely belongs in Tier 1.

What is the minimum query volume where routing makes economic sense?

For rule-based routing with zero classifier overhead: any volume. The only cost is engineering time to build the tier map, which is worth it at any scale. For content-based classifier routing: roughly 10,000 queries per month, where the savings from cheaper models exceed the classifier token overhead ^[4]. Below that threshold, a flat two-model strategy or rule-based routing is cheaper than a sophisticated classifier.

What happens when a Tier 1 model makes a mistake with downstream consequences?

The blast radius of a routing mistake is a function of task design, not just model selection. High-consequence tasks — those triggering financial transactions, customer-facing communications, or irreversible operations — need a quality gate regardless of model tier. Route them to Tier 1 if task complexity permits, but run the output through validation before the downstream action fires. Routing tier reduces inference cost. Quality gate reduces blast radius. These are separate responsibilities and must be designed separately.
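
A minimal sketch of that separation, assuming a hypothetical refund workflow: the model output passes through a validator before any action fires, and anything that fails goes to a review queue. The `MAX_AUTO_REFUND` limit, the field names, and both function names are illustrative.

```python
import json
from typing import Callable

MAX_AUTO_REFUND = 50.00  # hypothetical policy limit for automated refunds


def validate_refund(raw: str) -> bool:
    """Accept only well-formed decisions within the automated limit."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(decision, dict):
        return False
    amount = decision.get("amount")
    return (
        decision.get("action") in {"refund", "deny"}
        and isinstance(amount, (int, float))
        and 0 <= amount <= MAX_AUTO_REFUND
    )


def gated_action(output: str, validate: Callable[[str], bool],
                 act: Callable[[str], None], escalate: Callable[[str], None]) -> bool:
    """Run `act` only if the output passes validation; otherwise escalate."""
    if validate(output):
        act(output)
        return True
    escalate(output)
    return False


issued, review_queue = [], []
gated_action('{"action": "refund", "amount": 20}', validate_refund,
             issued.append, review_queue.append)
gated_action('{"action": "refund", "amount": 900}', validate_refund,
             issued.append, review_queue.append)
print(len(issued), len(review_queue))  # 1 1
```

The gate does not care which tier produced the output, which is what lets routing change without touching the risk controls.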

Do I need to rebuild routing configuration when models update?

Yes, approximately quarterly. Model pricing shifts, capability improves, and new tiers appear. A routing threshold calibrated when Haiku cost $0.80/1M becomes economically wrong if the same capability tier costs $0.20/1M in the next generation. This is precisely why thresholds belong in configuration files, not hardcoded in application logic — so you can update them without a deployment when pricing or capability changes.
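
One way to keep thresholds out of application logic is a small JSON document loaded at startup. The sketch below is an assumption about layout, not a standard: the key names (`tier_map`, `default_tier`, `tier1_max_context_tokens`) and both helper functions are hypothetical. In practice the string would come from a file or a config service.

```python
import json

# Stand-in for the contents of a routing config file (hypothetical layout).
ROUTING_CONFIG_JSON = """
{
  "tier_map": {"classification": "tier1", "summarization": "tier2", "reasoning": "tier3"},
  "default_tier": "tier2",
  "tier1_max_context_tokens": 102400
}
"""

REQUIRED_KEYS = ("tier_map", "default_tier", "tier1_max_context_tokens")


def load_routing_config(raw: str) -> dict:
    """Parse the config and fail fast if a required key is missing."""
    config = json.loads(raw)
    for key in REQUIRED_KEYS:
        if key not in config:
            raise KeyError(f"routing config missing {key!r}")
    return config


def pick_tier(config: dict, task_type: str, context_tokens: int = 0) -> str:
    tier = config["tier_map"].get(task_type, config["default_tier"])
    # Context-length guard: oversized prompts skip the small model.
    if tier == "tier1" and context_tokens > config["tier1_max_context_tokens"]:
        tier = config["default_tier"]
    return tier


config = load_routing_config(ROUTING_CONFIG_JSON)
print(pick_tier(config, "classification"))
print(pick_tier(config, "classification", context_tokens=150_000))
print(pick_tier(config, "unknown_task"))
```

A quarterly pricing review then becomes an edit to this document and a reload, with no change to the code that calls `pick_tier`.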

The bill you are looking at is a record of design decisions that had no cost dimension when they were made. The model was picked for capability during development, traffic grew, nobody updated the routing, and the accounting arrived six months late.

Cost-first architecture does not require a sophisticated ML router. It requires a task profile audit, a synthetic profiling run, and a configuration layer that routes by task type. Three weeks of work done before launch compounds into substantial savings as traffic scales. The measurement layer — logging response.model on every request, attributing cost per task type, alerting on escalation rate changes — is what keeps the system honest as models and pricing evolve.

Teams surprised by their quarterly bill skipped the audit. Teams that ran it do not get surprised.

Key terms in this piece

LLM cost optimization · heterogeneous LLM stack · model routing architecture · multi-tier LLM routing · AI inference cost · cost-aware AI architecture

Sources

[1] Zylos Research, AI Agent Model Routing and Dynamic Model Selection Strategies (zylos.ai)
[2] LLM Model Routing: The Complete Guide for Engineering Teams (promptunit.ai)
[3] LLM Routing: Smart Model Selection for Cost and Quality (myengineeringpath.dev)
[4] Kenny Tan, LLM Cost-Per-Query Optimization (uatgpt.com)
[5] How Intelligent Model Routing Cuts LLM Costs by 30-60% (trovald.com)
[6] Zylos Research, AI Agent Cost Optimization: Token Budgets, Model Routing, and Production FinOps (zylos.ai)
[7] Architecting Cost-Aware LLM Workloads with Model Router in Microsoft Foundry (techcommunity.microsoft.com)
[8] LLM Routing: How to Stop Paying Frontier Model Prices for Simple Queries (tianpan.co)
