Why frontier model defaults bloat inference bills, and the per-task quality SLO framework that makes model selection explicit, testable, and owned — instead of inherited from prototype defaults.
Building a smarter LLM router won't fix your inference bill. The problem lives upstream — in the absence of any written policy defining which tasks require which model tier, and who owns that decision.
The pattern is consistent: prototypes ran on the most capable available model because that was the fastest path to a working demo. The prototypes shipped to production. The model defaulted with them. Six months later, the billing breakdown shows intent classification, format conversion, and short-document summarization — all pattern-matching work — running at frontier prices. Nobody drew the line.
A 2026 multicriteria routing study found that a structured per-task routing policy achieved 94.4% response sufficiency — essentially indistinguishable from always using the strongest model — at 37.4% lower total cost. [1] The quality gap was 0.2 percentage points. The cost gap was more than a third of the total bill.
The fix is a per-task quality contract: a decision table that maps each task type to the quality threshold that defines "good enough," the model tier authorized to serve it, and the conditions under which escalation is warranted. Without that contract, model selection is not an engineering decision. It's a prototype artifact.
Pattern-matching, compositional, and reasoning-intensive tasks have different quality ceilings and different cost profiles. Most production systems treat them identically.
Production LLM workloads break into three categories, and the distribution matters for cost architecture.
Pattern-matching tasks — intent classification, entity extraction, format conversion, routing decisions, short summarization, structured output generation from well-formatted inputs — typically account for 40–60% of total request volume. [6] These tasks have near-deterministic expected outputs. Smaller models handle them at equivalent quality to frontier models.
Compositional tasks — document drafting, multi-step code generation, structured explanation, research synthesis, template completion with judgment — make up 25–40% of volume. These require some reasoning but rarely the deepest available. Mid-tier models handle the majority of this work with quality that holds under rubric-based evaluation.
Reasoning-intensive tasks — architectural review, complex multi-step planning, high-stakes decision support, long-context analysis, novel problem framing — are typically 5–20% of volume. These benefit from frontier-tier capability. They also bear the cost of it.
The billing problem is not that reasoning-intensive tasks are expensive. They should be. The billing problem is that pattern-matching traffic is priced at reasoning-intensive rates because nobody specified the task classes and their authorized tiers.
| Task Class | Typical Volume | Examples | Quality Signal | Authorized Tier |
|---|---|---|---|---|
| Pattern-matching | 40–60% | Intent classification, entity extraction, format conversion, routing, short summarization | Accuracy on eval set (threshold ≥ 0.93) | Lite (~$1/M tokens) |
| Compositional | 25–40% | Document drafting, code generation, structured explanation, synthesis | Rubric score via judge model (threshold ≥ 7.0/10) | Mid (~$3–15/M tokens) |
| Reasoning-intensive | 5–20% | Architecture review, complex planning, high-stakes decisions, long-context analysis | Task completion rate + human review sample (5%) | Frontier (~$15–75/M tokens) |
You cannot choose a model tier before answering this question. Without a quality threshold, the tier assignment has no ground to stand on.
Every task class needs a quality SLO before a model tier can be assigned. Without it, "this task needs the frontier model" is not an engineering decision — it's an assumption with a monthly cost consequence.
A quality SLO for a pattern-matching task is typically deterministic: accuracy on a held-out eval set, F1 score on entity extraction, exact match on format conversion. The threshold is explicit — "intent classifier must score ≥ 0.93 accuracy on the production-distribution sample." You run the smaller model against that eval. If it passes, it ships on the smaller model. If it fails, you have evidence for upgrading.
Compositional tasks need rubric-based evaluation: a scoring criteria document and a judge model (or human reviewer) that applies it consistently. The SLO might be "average judge score ≥ 7.0/10 on a 50-sample draw." The eval is not expensive to run. What's expensive is skipping it.
The critical principle is burden-of-proof inversion. The default assumption is the smallest model that exists. Upgrading to a higher tier requires a failing eval, not a gut instinct that the larger model "will do better." This inversion changes the organizational dynamic: teams stop asking "is this safe on the frontier tier?" and start asking "does the evidence show we need it?" The first question has no floor. The second has an answer.
Direct inference is one of four cost terms. Teams that optimize on per-token price alone are missing three others.
Multicriteria routing policy vs. always-frontier, with only 0.2pp quality gap (MDPI, 2026) [1]
SELECT-THEN-ROUTE: accuracy improved 91.7% → 94.3% while cost fell from $16.29 → $5.21 per 1,000 samples (EMNLP, 2025) [3]
Cascaded routing systems achieve 97% of frontier quality at this cost share (RouteLLM, 2024) [2]
Per-token pricing is the input to a cost model, not the cost model. Teams that route based on listed price alone typically underestimate the cost of wrong choices and overestimate the savings from downgrading.
The true cost of a model selection decision has four terms:
Direct inference cost — input tokens × input rate + output tokens × output rate, including thinking tokens for reasoning models. This is the only number most teams track.
Retry cost — probability of quality failure × cost of the retry. A 5% failure rate on a task routed to the cheaper model, where each retry escalates to a frontier call, can wipe out the entire tier savings. The RouteLLM cascading framework shows 97% frontier quality at 24% frontier cost is achievable — but that math depends on reliable quality detection. [2]
Escalation cost — probability of human handoff × loaded cost of human review. This term is invisible in inference billing but highly visible in support queue depth and CS headcount. A workload with a 5% long-tail failure rate generates substantial escalation volume at scale. Those are headcount costs, not infrastructure costs.
Trust-damage cost — a discount factor applied to downstream revenue when the task fails in a user-visible way. Fuzzy but non-zero, and it dominates the math for customer-facing workflows. The inference line item is rarely the expensive part of a trust failure.
The corollary: on reasoning-intensive tasks with high escalation cost and trust-damage exposure, a single frontier call can be cheaper than several mid-tier calls that each generate follow-up review. [10] The procurement question — which model is cheapest per token? — and the architecture question — which model is cheapest per successfully resolved request? — are not the same question.
The organizational failure mode behind bloated inference bills: model choices made during prototyping persist because no process exists to challenge them.
Model assigned once during prototyping, never revisited
No quality SLO per task — 'it works' is the bar
Frontier tier justified by 'safety,' not a failing eval
No ownership — model choices live in scattered env configs
Cost visibility: total invoice only, no per-task attribution
Tier decisions survive indefinitely; models change, configs don't
Per-task tier assignment with defined review cadence (quarterly)
Quality SLO documented and measurable; eval required before assignment
Frontier tier requires documented failing eval from smaller model
Platform team owns tier policy; services consume and document it
Cost attributed per task class; regressions surface before invoice
Tier assignments rechecked against new model releases each quarter
The organizational pattern behind bloated inference bills is not engineering negligence. It's an absent policy. When nobody owns model selection as a standing decision, the prototype choice is the standing decision — by default, indefinitely.
Platform teams are the natural owners of the tier policy. They define the three tiers and their authorized use cases. Individual service teams slot their tasks into the taxonomy, run the required evals before shipping, and document the tier assignment. The platform team runs a quarterly review: new model releases shift the quality ceiling of each tier, and routing policies that were rational six months ago may be suboptimal in either direction.
The quarterly cadence matters because both per-token prices and model capabilities move on roughly that timescale. A pattern-matching task that required mid-tier quality three releases ago may be well within lite-tier capability today. Teams that don't revisit the assignment miss the cost reduction. Teams that do run it as a routine eval cycle — not a special audit triggered by a bad invoice.
Building the decision table means writing three things per task type: the quality SLO (specific threshold, not a qualitative description), the authorized tier (with the passing eval that justified it), and the escalation condition (the specific quality signal that triggers cascade, and the tier ceiling where it stops). A fourth column carries the cost note: what is the all-in cost per call at the current cascade rate, and does cascade still make economic sense compared to routing directly to the higher tier?
How do I know if a task actually needs the frontier tier?
Run it through the mid-tier model against your quality eval. If it passes the SLO, it doesn't need frontier. If it fails, you have the evidence for an upgrade. 'It needs frontier' without a failing eval is an assumption — and it costs you every time that assumption is wrong. The burden of proof runs upward: justify the expensive model, not the cheap one.
What if I can't define a clear quality signal for a task?
That inability is the problem to fix, not a justification for defaulting upward. If quality is unmeasurable, you can't improve it, can't defend the model choice, and can't detect a regression in production. Start with a proxy: user re-triggers, downstream failure rates, reviewer override rates. Rough signal beats no signal, and rough signal is almost always available.
When does cascading make sense versus directly assigning to a higher tier?
Cascade when: the task distribution is bimodal (mostly simple, occasionally hard), the hard cases are detectable at output time, and the quality check is cheap relative to the cost saved. Skip cascade when: detection is unreliable and you'd be escalating 30%+ of calls anyway, or when the cascade adds more latency than the tier upgrade saves. Run the numbers on your actual traffic; the answer varies by workload.
How often should tier assignments be reviewed?
Quarterly. Model releases shift capability ceilings on roughly that timescale. The task that required mid-tier quality two model generations ago may be well within lite-tier capability today — and the reverse: a task that barely passed mid-tier may now have a better-suited model option. Per-token prices also shift. Routing policies built on last year's benchmarks are frequently suboptimal in both directions.
Platform teams that have brought inference costs under control didn't build smarter routers. They wrote per-task quality contracts, inverted the burden of proof, and assigned ownership of the tier policy as a standing function — not a one-time optimization sprint after a bad invoice.
That inversion changes everything downstream. When the default is the smallest model that meets the SLO, every upgrade requires evidence. When ownership is explicit, the quarterly review happens on schedule rather than under budget pressure. When cost attribution runs per task class, regressions surface before they compound across a billing cycle.
Model selection is not a configuration you set once. It's a contract between your task taxonomy and your infrastructure, and it needs maintenance as both sides evolve. The models change every quarter. The product traffic distribution shifts as features ship. The quality bars move as customer expectations increase.
Without that maintenance loop, you're not running a model selection policy. You're inheriting the decisions your engineers made during the prototype phase — and paying frontier prices for every task nobody revisited.
MAST (NeurIPS 2025, UC Berkeley) identifies 14 MAS failure modes across 3 structural categories. This playbook maps them to 3 diagnostic questions — and tells you which layer to fix before touching the model.
89% of teams have observability tooling. 62% can map a trace to a failure cause. Seven failure modes grounded in H1 2026 incident data — each with distinct OTel trace signatures and an LLM classifier that routes the incident before the postmortem.
When your multi-agent pipeline has no cost parentage, every billing incident is a surprise you could have prevented. How to build a real-time agent cost attribution ledger using OTel baggage propagation, gateway enforcement, and rate-of-spend anomaly detection.