Model Selection as an Architectural Decision: Per-Task LLM Framework

Building a smarter LLM router won't fix your inference bill. The problem lives upstream — in the absence of any written policy defining which tasks require which model tier, and who owns that decision.

The pattern is consistent: prototypes ran on the most capable available model because that was the fastest path to a working demo. The prototypes shipped to production. The model defaulted with them. Six months later, the billing breakdown shows intent classification, format conversion, and short-document summarization — all pattern-matching work — running at frontier prices. Nobody drew the line.

A 2026 multicriteria routing study found that a structured per-task routing policy achieved 94.4% response sufficiency — essentially indistinguishable from always using the strongest model — at 37.4% lower total cost. ^[1] The quality gap was 0.2 percentage points. The cost gap was more than a third of the total bill.

The fix is a per-task quality contract: a decision table that maps each task type to the quality threshold that defines "good enough," the model tier authorized to serve it, and the conditions under which escalation is warranted. Without that contract, model selection is not an engineering decision. It's a prototype artifact.

Three Task Classes, One Billing Regime

Pattern-matching, compositional, and reasoning-intensive tasks have different quality ceilings and different cost profiles. Most production systems treat them identically.

Production LLM workloads break into three categories, and the distribution matters for cost architecture.

Pattern-matching tasks — intent classification, entity extraction, format conversion, routing decisions, short summarization, structured output generation from well-formatted inputs — typically account for 40–60% of total request volume. ^[6] These tasks have near-deterministic expected outputs. Smaller models handle them at equivalent quality to frontier models.

Compositional tasks — document drafting, multi-step code generation, structured explanation, research synthesis, template completion with judgment — make up 25–40% of volume. These require some reasoning but rarely the deepest available. Mid-tier models handle the majority of this work with quality that holds under rubric-based evaluation.

Reasoning-intensive tasks — architectural review, complex multi-step planning, high-stakes decision support, long-context analysis, novel problem framing — are typically 5–20% of volume. These benefit from frontier-tier capability. They also bear the cost of it.

The billing problem is not that reasoning-intensive tasks are expensive. They should be. The billing problem is that pattern-matching traffic is priced at reasoning-intensive rates because nobody specified the task classes and their authorized tiers.

Task Class	Typical Volume	Examples	Quality Signal	Authorized Tier
Pattern-matching	40–60%	Intent classification, entity extraction, format conversion, routing, short summarization	Accuracy on eval set (threshold ≥ 0.93)	Lite (~$1/M tokens)
Compositional	25–40%	Document drafting, code generation, structured explanation, synthesis	Rubric score via judge model (threshold ≥ 7.0/10)	Mid (~$3–15/M tokens)
Reasoning-intensive	5–20%	Architecture review, complex planning, high-stakes decisions, long-context analysis	Task completion rate + human review sample (5%)	Frontier (~$15–75/M tokens)
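One way to make the table enforceable is to keep it as data that services look up at call time. The sketch below is illustrative only: the `Tier`, `TaskPolicy`, and `POLICY` names and the dictionary keys are assumptions, not part of any existing library, and the thresholds are copied from the table above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    LITE = "lite"
    MID = "mid"
    FRONTIER = "frontier"


@dataclass(frozen=True)
class TaskPolicy:
    quality_signal: str          # how quality is measured for this class
    threshold: Optional[float]   # SLO threshold; None where the table gives no single number
    authorized_tier: Tier


# Rows mirror the table above. Keys and field names are hypothetical.
POLICY = {
    "pattern_matching": TaskPolicy("eval_accuracy", 0.93, Tier.LITE),
    "compositional": TaskPolicy("judge_rubric_score", 7.0, Tier.MID),
    "reasoning_intensive": TaskPolicy(
        "completion_rate_plus_human_review", None, Tier.FRONTIER
    ),
}


def authorized_tier(task_class: str) -> Tier:
    """Return the tier a task class may use; unknown classes fail loudly."""
    policy = POLICY.get(task_class)
    if policy is None:
        raise ValueError(f"no tier policy for task class {task_class!r}")
    return policy.authorized_tier
```

Failing on an unknown task class, rather than falling back to the frontier tier, keeps an unclassified task from silently inheriting the most expensive default.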

What Does 'Good Enough' Mean for This Task?

You cannot choose a model tier before answering this question. Without a quality threshold, the tier assignment has no ground to stand on.

Every task class needs a quality SLO before a model tier can be assigned. Without it, "this task needs the frontier model" is not an engineering decision — it's an assumption with a monthly cost consequence.

A quality SLO for a pattern-matching task is typically deterministic: accuracy on a held-out eval set, F1 score on entity extraction, exact match on format conversion. The threshold is explicit — "intent classifier must score ≥ 0.93 accuracy on the production-distribution sample." You run the smaller model against that eval. If it passes, it ships on the smaller model. If it fails, you have evidence for upgrading.
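A deterministic SLO of this kind reduces to a few lines of code. The sketch below assumes predictions and labels are already collected as parallel lists; the function names are hypothetical, and the 0.93 default is the example threshold from the paragraph above.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the held-out labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def passes_slo(
    predictions: Sequence[str], labels: Sequence[str], threshold: float = 0.93
) -> bool:
    """True when the candidate model meets the accuracy SLO on the eval set."""
    return accuracy(predictions, labels) >= threshold
```

The same gate runs unchanged against any tier's predictions, so the eval result itself becomes the record that justifies the assignment.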

Compositional tasks need rubric-based evaluation: a scoring criteria document and a judge model (or human reviewer) that applies it consistently. The SLO might be "average judge score ≥ 7.0/10 on a 50-sample draw." The eval is not expensive to run. What's expensive is skipping it.

The critical principle is burden-of-proof inversion. The default is the smallest model tier, and it holds until an eval says otherwise. Upgrading to a higher tier requires a failing eval on the smaller model, not a gut instinct that the larger model "will do better." This inversion changes the organizational dynamic: teams stop asking "is it safe to move this off the frontier tier?" and start asking "does the evidence show we need it?" The first question has no floor. The second has an answer.
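Burden-of-proof inversion can be expressed as a search that starts at the smallest tier and stops at the first one that passes. This is a minimal sketch: `run_eval` is a hypothetical callable that returns the eval score for a tier name, and the tier names are placeholders.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

TIERS_SMALLEST_FIRST = ("lite", "mid", "frontier")


def select_tier(
    run_eval: Callable[[str], float],
    threshold: float,
    tiers: Sequence[str] = TIERS_SMALLEST_FIRST,
) -> Tuple[Optional[str], Dict[str, float]]:
    """Return the smallest tier whose eval score meets the threshold.

    The second element records every score observed on the way up, which
    is the evidence that justifies (or rules out) each upgrade.
    """
    evidence: Dict[str, float] = {}
    for tier in tiers:
        score = run_eval(tier)
        evidence[tier] = score
        if score >= threshold:
            return tier, evidence
    # No tier met the SLO: the threshold or the task definition needs review.
    return None, evidence
```

Larger tiers are never evaluated once a smaller one passes, so the recorded evidence always contains the failing scores that motivated an upgrade.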

The True Cost of a Model Selection Decision Has Four Terms

Direct inference is one of four cost terms. Teams that optimize on per-token price alone are missing three others.

37.4% cost reduction: multicriteria routing policy vs. always-frontier, with only a 0.2pp quality gap (MDPI, 2026) ^[1]

4× cost reduction: SELECT-THEN-ROUTE improved accuracy from 91.7% to 94.3% while cost fell from $16.29 to $5.21 per 1,000 samples (EMNLP, 2025) ^[3]

24% of frontier cost: cascaded routing systems achieve 97% of frontier quality at this cost share (RouteLLM, 2024) ^[2]

Per-token pricing is the input to a cost model, not the cost model. Teams that route based on listed price alone typically underestimate the cost of wrong choices and overestimate the savings from downgrading.

The true cost of a model selection decision has four terms:

Direct inference cost: input tokens × input rate + output tokens × output rate, including thinking tokens for reasoning models. This is the only number most teams track.

Retry cost: probability of quality failure × cost of the retry. A 5% failure rate on a task routed to the cheaper model, where each retry escalates to a frontier call, adds 5% of a frontier call to the expected cost of every request. The tier savings disappear once the failure rate times the frontier price approaches the price gap between the tiers. The RouteLLM cascading framework shows 97% frontier quality at 24% frontier cost is achievable, but that math depends on reliable quality detection. ^[2]

Escalation cost: probability of human handoff × loaded cost of human review. This term is invisible in inference billing but highly visible in support queue depth and CS headcount. A workload with a 5% long-tail failure rate generates substantial escalation volume at scale. Those are headcount costs, not infrastructure costs.

Trust-damage cost: a discount factor applied to downstream revenue when the task fails in a user-visible way. Fuzzy but non-zero, and it dominates the math for customer-facing workflows. The inference line item is rarely the expensive part of a trust failure.

The corollary: on reasoning-intensive tasks with high escalation cost and trust-damage exposure, a single frontier call can be cheaper than several mid-tier calls that each generate follow-up review. ^[10] The procurement question — which model is cheapest per token? — and the architecture question — which model is cheapest per successfully resolved request? — are not the same question.
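The four terms can be combined into one expected cost per request. The sketch below is an assumption-laden illustration, not a reference model: every field name is hypothetical, trust damage is simplified to a probability times a dollar estimate, and the example numbers are invented to show the corollary rather than measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostInputs:
    inference_cost: float        # direct cost of one call on the chosen tier
    failure_rate: float          # probability the output fails the quality check
    retry_cost: float            # cost of the retry (e.g. an escalated call)
    handoff_rate: float          # probability the request ends in human review
    review_cost: float           # loaded cost of one human review
    visible_failure_rate: float  # probability of a user-visible failure
    trust_damage_cost: float     # assumed revenue impact of one visible failure


def expected_cost_per_request(c: CostInputs) -> float:
    """Sum of the four terms: inference, retry, escalation, trust damage."""
    return (
        c.inference_cost
        + c.failure_rate * c.retry_cost
        + c.handoff_rate * c.review_cost
        + c.visible_failure_rate * c.trust_damage_cost
    )


# Hypothetical numbers for one reasoning-intensive task on two tiers.
mid = CostInputs(0.02, 0.10, 0.10, 0.05, 4.00, 0.02, 10.0)
frontier = CostInputs(0.10, 0.01, 0.10, 0.005, 4.00, 0.002, 10.0)
```

With these invented inputs the mid tier is five times cheaper per call but more expensive per request, because the review and trust-damage terms dominate the inference term.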

Figure: Per-Task Model Routing Decision Flow. Every request passes through task classification and a quality gate before output is accepted. Cascade paths activate only when output fails the SLO check.
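The flow described above (classify, call the authorized tier, check the output against the SLO, escalate only on failure) can be sketched as a small function. All four callables are hypothetical hooks a team would supply; nothing here reflects a specific router's API.

```python
from typing import Callable, Sequence, Tuple


def route_with_cascade(
    request: str,
    classify: Callable[[str], str],
    tiers_for: Callable[[str], Sequence[str]],
    call_model: Callable[[str, str], str],
    passes_slo: Callable[[str, str], bool],
) -> Tuple[str, str, bool]:
    """Return (output, tier_used, passed).

    `tiers_for` gives the cascade path for a task class, cheapest first,
    ending at the tier ceiling where escalation stops.
    """
    task_class = classify(request)
    path = list(tiers_for(task_class))
    if not path:
        raise ValueError(f"no tiers configured for task class {task_class!r}")

    output, tier = "", path[0]
    for tier in path:
        output = call_model(tier, request)
        if passes_slo(task_class, output):
            return output, tier, True
    # Ceiling reached without passing: hand back the last output, flagged.
    return output, tier, False
```

Returning the `passed` flag instead of raising at the ceiling leaves the human-handoff decision to the caller, which is where the escalation cost is incurred.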

Nobody Owns Model Selection. That's Why the Default Is Always Expensive.

The organizational failure mode behind bloated inference bills: model choices made during prototyping persist because no process exists to challenge them.

Default Drift

Model assigned once during prototyping, never revisited
No quality SLO per task — 'it works' is the bar
Frontier tier justified by 'safety,' not a failing eval
No ownership — model choices live in scattered env configs
Cost visibility: total invoice only, no per-task attribution
Tier decisions survive indefinitely; models change, configs don't

Decision Governance

Per-task tier assignment with defined review cadence (quarterly)
Quality SLO documented and measurable; eval required before assignment
Frontier tier requires documented failing eval from smaller model
Platform team owns tier policy; services consume and document it
Cost attributed per task class; regressions surface before invoice
Tier assignments rechecked against new model releases each quarter
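Per-task cost attribution needs little more than a task-class tag on each logged request. The sketch below assumes request records are dictionaries with hypothetical `task_class` and `cost_usd` keys, and uses an arbitrary 20% growth tolerance to flag regressions against a baseline period.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping


def cost_by_task_class(records: Iterable[Mapping]) -> Dict[str, float]:
    """Sum logged request cost per task class."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["task_class"]] += record["cost_usd"]
    return dict(totals)


def cost_regressions(
    current: Mapping[str, float],
    baseline: Mapping[str, float],
    tolerance: float = 0.20,
) -> List[str]:
    """Task classes whose spend grew more than `tolerance` over baseline."""
    return sorted(
        task_class
        for task_class, cost in current.items()
        if task_class in baseline and cost > baseline[task_class] * (1 + tolerance)
    )
```

Run daily or weekly against the previous period, a check like this surfaces a regression in one task class before it shows up as a line on the invoice.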

The organizational pattern behind bloated inference bills is not engineering negligence. It's an absent policy. When nobody owns model selection as a standing decision, the prototype choice is the standing decision — by default, indefinitely.

Platform teams are the natural owners of the tier policy. They define the three tiers and their authorized use cases. Individual service teams slot their tasks into the taxonomy, run the required evals before shipping, and document the tier assignment. The platform team runs a quarterly review: new model releases shift the quality ceiling of each tier, and routing policies that were rational six months ago may be suboptimal in either direction.

The quarterly cadence matters because both per-token prices and model capabilities move on roughly that timescale. A pattern-matching task that required mid-tier quality three releases ago may be well within lite-tier capability today. Teams that don't revisit the assignment miss the cost reduction. Teams that do revisit it treat the review as a routine eval cycle, not a special audit triggered by a bad invoice.

Building the decision table means writing three things per task type: the quality SLO (specific threshold, not a qualitative description), the authorized tier (with the passing eval that justified it), and the escalation condition (the specific quality signal that triggers cascade, and the tier ceiling where it stops). A fourth column carries the cost note: what is the all-in cost per call at the current cascade rate, and does cascade still make economic sense compared to routing directly to the higher tier?

Leadership takeaways

Pre-Production Model Selection Checklist

01 Quality SLO defined for this task class: a specific measurable threshold, not a qualitative description
02 Smaller model eval run and documented, with failure mode and rate recorded, not assumed
03 Tier assignment grounded in a passing eval, not default or 'safety' instinct
04 Ownership assigned: team or rotation responsible for quarterly tier review
05 Cost model built: task volume × token estimate × tier price × (1 + expected retry rate)
06 Escalation condition defined with a specific quality signal, not a vague fallback
07 Production monitoring in place: quality regression alert per task class, not just total error rate
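The cost model in item 05 can be sketched as a monthly estimate. Two assumptions are built in and should be adjusted to the workload: the retry rate is read as a multiplier of (1 + rate) on the base cost, and retries run on the same tier at the same token count. The function and parameter names are hypothetical.

```python
def monthly_cost(
    requests_per_month: int,
    tokens_per_request: float,
    price_per_million_tokens: float,
    retry_rate: float,
) -> float:
    """Estimated monthly spend for one task class on one tier.

    Assumes a blended price per million tokens and same-tier retries.
    """
    if retry_rate < 0:
        raise ValueError("retry_rate must be non-negative")
    base = requests_per_month * tokens_per_request * price_per_million_tokens / 1_000_000
    return base * (1 + retry_rate)


# Example with invented inputs: 1M requests, 500 tokens each,
# $1 per million tokens, 5% retries.
estimate = monthly_cost(1_000_000, 500, 1.0, 0.05)
```

If retries escalate to a higher tier instead, the retry term should use that tier's price, which is the case the four-term cost model covers.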

How do I know if a task actually needs the frontier tier?

Run it through the mid-tier model against your quality eval. If it passes the SLO, it doesn't need frontier. If it fails, you have the evidence for an upgrade. 'It needs frontier' without a failing eval is an assumption — and it costs you every time that assumption is wrong. The burden of proof runs upward: justify the expensive model, not the cheap one.

What if I can't define a clear quality signal for a task?

That inability is the problem to fix, not a justification for defaulting upward. If quality is unmeasurable, you can't improve it, can't defend the model choice, and can't detect a regression in production. Start with a proxy: user re-triggers, downstream failure rates, reviewer override rates. Rough signal beats no signal, and rough signal is almost always available.

When does cascading make sense versus directly assigning to a higher tier?

Cascade when: the task distribution is bimodal (mostly simple, occasionally hard), the hard cases are detectable at output time, and the quality check is cheap relative to the cost saved. Skip cascade when: detection is unreliable and you'd be escalating 30%+ of calls anyway, or when the cascade adds more latency than the tier upgrade saves. Run the numbers on your actual traffic; the answer varies by workload.
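"Run the numbers" has a simple form when latency is set aside. Under the assumption that every request pays for the lower-tier call and the quality check, and escalated requests also pay for the higher-tier call, cascade beats direct assignment only below a break-even escalation rate. The names and example costs below are hypothetical.

```python
def cascade_cost(
    low_cost: float, check_cost: float, high_cost: float, escalation_rate: float
) -> float:
    """Expected cost per request when cascading from the low to the high tier."""
    return low_cost + check_cost + escalation_rate * high_cost


def break_even_escalation_rate(
    low_cost: float, check_cost: float, high_cost: float
) -> float:
    """Escalation rate at which cascade costs the same as calling the high tier directly.

    Derived from: low + check + e * high = high  =>  e = 1 - (low + check) / high.
    A result at or below zero means cascade can never pay off.
    """
    return 1 - (low_cost + check_cost) / high_cost


# Invented per-call costs in arbitrary units, with a narrow price gap.
LOW, CHECK, HIGH = 10.0, 1.0, 15.0
threshold = break_even_escalation_rate(LOW, CHECK, HIGH)
```

With this narrow gap the break-even rate is about 27%, so a workload escalating 30% of calls is cheaper on the higher tier directly; a wider price gap pushes the break-even rate up.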

How often should tier assignments be reviewed?

Quarterly. Model releases shift capability ceilings on roughly that timescale. The task that required mid-tier quality two model generations ago may be well within lite-tier capability today — and the reverse: a task that barely passed mid-tier may now have a better-suited model option. Per-token prices also shift. Routing policies built on last year's benchmarks are frequently suboptimal in both directions.

Platform teams that have brought inference costs under control didn't build smarter routers. They wrote per-task quality contracts, inverted the burden of proof, and assigned ownership of the tier policy as a standing function — not a one-time optimization sprint after a bad invoice.

That inversion changes everything downstream. When the default is the smallest model that meets the SLO, every upgrade requires evidence. When ownership is explicit, the quarterly review happens on schedule rather than under budget pressure. When cost attribution runs per task class, regressions surface before they compound across a billing cycle.

Model selection is not a configuration you set once. It's a contract between your task taxonomy and your infrastructure, and it needs maintenance as both sides evolve. The models change every quarter. The product traffic distribution shifts as features ship. The quality bars move as customer expectations increase.

Without that maintenance loop, you're not running a model selection policy. You're inheriting the decisions your engineers made during the prototype phase — and paying frontier prices for every task nobody revisited.


Sources

[1] MDPI Information Journal. A Multi-Criteria Decision Framework for Enterprise LLM Routing (mdpi.com)
[2] Ong et al. RouteLLM: Learning to Route LLMs with Preference Data (arxiv.org)
[3] EMNLP 2025 Industry Track. SELECT-THEN-ROUTE: Taxonomy guided Routing for LLMs (aclanthology.org)
[4] Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey (arxiv.org)
[5] OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline–Online Learning (arxiv.org)
[6] Tian Pan. The Good Enough Model Selection Trap: Why Your Team Is Overpaying for AI (tianpan.co)
[7] Prompt20 Editorial. AI Inference Cost Economics: The Complete Guide (blog.prompt20.com)
[8] DeepInfra. Inference Economics: True AI Costs at Scale (deepinfra.com)
[9] Microsoft. How model router works in Microsoft Foundry (learn.microsoft.com)
[10] Tian Pan. Reasoning-Model Arbitrage: The Slow Expensive Model Is Cheaper on the Hard Prompts (tianpan.co)
