Routing thresholds require historical traffic data. You haven't launched yet. This is the catch-22 that turns model selection into guesswork: pick a model during development, ship it, and let the billing alarm tell you six months later whether you chose wrong.
The problem isn't that teams skip routing. They skip it because they have nothing to calibrate it against before production. Standard advice — "run your task types through each tier and measure quality" — assumes a representative traffic corpus exists. Greenfield systems don't have one. So routing gets deferred to a post-launch optimization, the architecture review gets a number pulled from the model's pricing page, and the cost model describes a system that will be retrofitted under deadline pressure.
The mechanism that resolves the catch-22 is a profiling harness built before the first production prompt. It generates a synthetic task corpus, defines acceptance criteria as executable code, runs the corpus through each model tier in parallel, and computes cost_per_success: the actual cost of getting a correct answer, not just the cost of generating tokens. The routing table emerges from measured data. Either that table exists before traffic arrives, or your model selection has been made by architectural default.
Building a profiling harness from scratch takes a senior engineer two to three days [1]. Running it on a 200-sample corpus costs under $1 in API calls. Those numbers compare favorably to the post-launch retrofit, which typically runs four to eight weeks of engineering time.
- TierBench's full T0–T3 benchmark suite costs roughly $0.15. A 500-sample custom corpus runs $0.50–$2 depending on tier mix and output length [1].
- HyDRA matches Sonnet quality at 54% lower cost than always-Sonnet routing, roughly a 6× improvement over the binary predecessor at the same quality floor [2].
- UCCI achieved a 31% cost reduction at micro-F1 = 0.91 by calibrating routing thresholds against measured error probability, not raw confidence scores [3].
- Building the harness takes one senior engineer two to three days, which is cheaper than the 4–8 week retrofit that follows vibe-based routing at production scale [1].
Token Price Is Not the Unit You're Routing On
The architecture review wants cost per request. The right number is cost per correct answer — and those two figures diverge significantly across task types.
Token price tells you the cost of output. It doesn't tell you the cost of a correct answer.
A model that fails 40% of the time on your specific task is not 40% cheaper than a model that fails 5% of the time — it costs more per successful outcome, and it degrades every downstream system that depends on those answers.
The formula: cost_per_success = total_cost / count_passed. You need both terms. Most pre-launch cost models have only the first. And this isn't an academic concern: TierBench, an open benchmarking harness built precisely for this problem, makes cost_per_success the primary routing metric rather than token cost alone [1]. The reasoning is that cheap models handle many tasks well and expensive models are necessary for only a few, but almost nobody measures success rate per task type. Without a success probability attached, token cost says nothing about which tier is actually cheaper for a given task.
For tasks with deterministic validation — schema extraction, classification against a closed label set, code that must pass a test suite — this metric is computable before launch. Construct test cases, define an acceptance oracle, run each tier, read the number. The routing decision follows mechanically: use the cheapest tier whose cost_per_success falls within an acceptable range.
| Task Type | Tier 1 Pass Rate | Tier 1 Cost/Success | Tier 2 Pass Rate | Tier 2 Cost/Success | Default Routing |
|---|---|---|---|---|---|
| Intent classification (closed labels) | ~95% | ~$0.0003 | ~99% | ~$0.0035 | Tier 1 |
| Structured entity extraction | ~88% | ~$0.0007 | ~97% | ~$0.0034 | Tier 1 + quality monitoring |
| Domain-specific summarization | ~72% | ~$0.0021 | ~95% | ~$0.0040 | Tier 2 |
| Open-domain RAG answer | ~60% | ~$0.0031 | ~91% | ~$0.0044 | Tier 2 |
| Multi-hop reasoning | ~42% | ~$0.0091 | ~82% | ~$0.0055 | Tier 2 or Tier 3 |
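The cost_per_success formula is simple enough to check by hand. The sketch below is a minimal illustration, not TierBench's implementation; the per-request prices and pass counts are hypothetical and chosen only to show how a 4× gap in sticker price shrinks once failures are counted.

```python
def cost_per_success(total_cost: float, count_passed: int) -> float:
    """Total spend divided by the number of outputs that passed the oracle."""
    if count_passed == 0:
        return float("inf")  # no successes: the tier is unusable for this task
    return total_cost / count_passed


# Hypothetical figures: 1,000 requests sent to each tier.
REQUESTS = 1_000
cheap = cost_per_success(total_cost=REQUESTS * 0.0010, count_passed=600)
strong = cost_per_success(total_cost=REQUESTS * 0.0040, count_passed=950)

sticker_ratio = 0.0040 / 0.0010   # price gap per request
success_ratio = strong / cheap    # price gap per correct answer

print(f"cheap tier:  ${cheap:.5f} per success")
print(f"strong tier: ${strong:.5f} per success")
print(f"sticker gap {sticker_ratio:.1f}x, per-success gap {success_ratio:.2f}x")
```

Under these assumed numbers the cheaper tier still wins per success, but by much less than the price list suggests. With a low enough pass rate the ordering flips, as in the multi-hop reasoning row of the table.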
One Complexity Score Breaks When Your Workload Has Multiple Capability Axes
Most routing literature treats complexity as a single dimension. Real workloads mix reasoning, code, tool orchestration, and domain knowledge — each with different model affinity.
The standard routing framing — send "easy" queries to cheap models, "hard" queries to expensive ones — assumes complexity is scalar. It's not, and this assumption fails whenever your workload spans multiple capability types.
A query requiring deep multi-step reasoning but trivial code output differs fundamentally from one needing sophisticated code generation but no reasoning. A scalar router collapses this distinction and either overspends (both go to frontier) or underpowers (both go to small).
GitHub's production router — HyDRA — solved this by decomposing query requirements into four independent dimensions: reasoning, code generation, debugging, and tool use [2]. Each query gets a vector of scores rather than a single complexity number. Model capabilities are expressed in the same space. Routing selects the cheapest model that covers the query's requirements across all dimensions. Adding a new model to the catalog is a YAML edit — zero retraining.
At the conservative operating point, HyDRA achieves 54% cost savings against always-Sonnet routing while matching Sonnet's resolution rate [2]. The binary predecessor saved only 9.1% at the same quality target. That performance delta comes entirely from exploiting per-dimension model affinity that a scalar router misses.
The practical implication for profiling: your task corpus needs to tag each sample across capability dimensions, not just by a single complexity score. Even a coarse three-flag taxonomy — reasoning-heavy, code-heavy, domain-knowledge-heavy — gives the harness enough signal to detect which tier dominates on which dimension. Tasks that cluster on only one dimension are your best early routing candidates; tasks with mixed profiles are where you need the most data before trusting a threshold.
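To make the vector idea concrete, here is a minimal sketch of coverage-based routing over the coarse three-flag taxonomy. It is an illustration of the general technique, not HyDRA's implementation: the model names, capability scores, cost ranks, and the `route` function are all hypothetical.

```python
from dataclasses import dataclass

# Coarse three-flag taxonomy from the text; scores are in [0, 1].
DIMENSIONS = ("reasoning", "code", "domain")


@dataclass(frozen=True)
class ModelProfile:
    name: str
    cost_rank: int                 # lower means cheaper (hypothetical ordering)
    capability: dict[str, float]   # measured or estimated score per dimension


def route(requirement: dict[str, float], catalog: list[ModelProfile]) -> str:
    """Return the cheapest model whose capability covers every required dimension."""
    for model in sorted(catalog, key=lambda m: m.cost_rank):
        if all(
            model.capability.get(dim, 0.0) >= requirement.get(dim, 0.0)
            for dim in DIMENSIONS
        ):
            return model.name
    # Nothing covers the query: fall back to the most expensive model.
    return max(catalog, key=lambda m: m.cost_rank).name


# Hypothetical catalog; in practice the scores would come from profiling runs.
CATALOG = [
    ModelProfile("small", 1, {"reasoning": 0.40, "code": 0.50, "domain": 0.50}),
    ModelProfile("mid", 2, {"reasoning": 0.70, "code": 0.80, "domain": 0.70}),
    ModelProfile("large", 3, {"reasoning": 0.95, "code": 0.95, "domain": 0.90}),
]

print(route({"reasoning": 0.9, "code": 0.1}, CATALOG))  # reasoning-heavy query
print(route({"reasoning": 0.2, "code": 0.7}, CATALOG))  # code-heavy query
```

A scalar router would have to send both example queries to the same tier. Here the reasoning-heavy query needs the largest model while the code-heavy one is covered by the middle tier.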
Corpus Construction Without Production Logs
Three sources that work for greenfield systems. The corpus is a version-controlled artifact, not a one-time experiment.
Without production logs, teams have three practical corpus sources.
Analogous system exports: If you're building a customer-support classification feature, your company already handles support tickets somewhere. Pull 200 de-identified examples from that channel. They won't perfectly match your new system's prompt format, but they'll match the domain vocabulary and input distribution — which matters more for tier selection than prompt format.
Internal dogfooding: Run the prototype against real-work tasks from your own team. An engineering team building a code-review agent can generate its own corpus in a few hours by passing open PRs through the prototype and capturing the prompt–response pairs. High-fidelity corpus, no data-collection infrastructure required.
Adversarial construction: For task types where you can't source real examples, build inputs that probe the boundaries. If your oracle is "extracts the correct legal entity from a contract clause," write 50 variants with increasing clause complexity, ambiguous pronouns, and unusual entity formats. Adversarial inputs surface the tier boundary more efficiently than uniformly distributed random examples — you're not trying to characterize average performance, you're trying to find where tiers diverge.
Sample size: 100+ per task type gives usable pass-rate estimates for binary oracles. At a 90% pass rate, the 95% Wilson confidence interval is roughly ±6 percentage points. For rare task types, 50 samples is workable (roughly ±9 points); widen your routing threshold conservatively to account for the uncertainty. Fewer than 30 per type produces estimates too noisy to distinguish tier performance reliably.
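The sample-size guidance can be checked with a short Wilson score calculation. This is a minimal sketch using only the standard library; the function name `wilson_interval` is an assumption for illustration, and z = 1.96 corresponds to 95% confidence.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


for n in (30, 50, 100, 200):
    low, high = wilson_interval(passed=round(0.9 * n), total=n)
    print(f"n={n:>3}: pass rate 0.90, interval [{low:.3f}, {high:.3f}]")
```

If the intervals for two tiers overlap heavily at your corpus size, the harness cannot yet tell them apart on that task type, and the conservative choice is the higher tier or more samples.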
- [01] List task types from the architecture design document

  Not from intuition, but from the literal list of LLM calls your system makes. Every distinct call type should be a row in the profiling matrix. If a call isn't in the design doc yet, the profiling process surfaces that gap before it becomes a production surprise.

- [02] Collect or construct 100–200 samples per task type

  Use analogous system exports, internal dogfooding, or adversarial construction. Weight the sample to match your expected production distribution. If 60% of your traffic will be classification requests, 60% of your corpus should be classification samples.

- [03] Write the quality oracle as executable code

  A Python function that takes (output: str, expected: dict) → bool. For structured outputs: JSON schema validation plus field-level accuracy. For open-ended outputs: an LLM-as-judge call against an explicit scoring rubric stored in source control. The oracle becomes the production quality gate. If it lives in a doc, it will drift from the code.

- [04] Run the tier battery in parallel

  Submit each corpus sample to Tier 1, 2, and 3 concurrently. Log output, input tokens, output tokens, and oracle result per call. Using litellm's async interface, a 500-sample corpus across three tiers typically completes in 5–15 minutes depending on rate limits.

- [05] Compute cost_per_success and generate the routing config

  For each task type, identify the cheapest tier with pass_rate ≥ 0.95 on your corpus. Adjust this threshold based on task stakes: 0.85 is defensible for human-reviewed outputs, 0.95 is the floor for automated actions. Record the full metrics table. This is your cost model for the architecture review.
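Step 02 asks for two things that can pull against each other: a corpus weighted like production traffic, and enough samples per task type to estimate a pass rate. One way to reconcile them, sketched below, is to allocate by expected traffic share and then apply a per-type minimum. The function name, the traffic shares, and the default floor of 100 are illustrative assumptions.

```python
def allocate_samples(
    traffic_share: dict[str, float],
    corpus_size: int,
    min_per_type: int = 100,
) -> dict[str, int]:
    """Split a target corpus size by expected traffic share, with a per-type floor."""
    total_share = sum(traffic_share.values())
    return {
        task_type: max(min_per_type, round(corpus_size * share / total_share))
        for task_type, share in traffic_share.items()
    }


# Hypothetical expected production mix.
plan = allocate_samples(
    {"classification": 0.6, "extraction": 0.3, "summarization": 0.1},
    corpus_size=500,
)
print(plan)
```

Because of the floor, the final corpus can be larger than the requested size. Rare task types are oversampled relative to traffic, so weight per-type results by traffic share when building the cost model rather than averaging over the raw corpus.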
The Oracle Is the Hard Part. Most Teams Write the Wrong One.
An oracle measuring proxy quality instead of the actual requirement produces routing thresholds that look correct and fail in production.
The profiling harness is mechanically straightforward. The oracle is the judgment call.
The failure mode: writing an oracle that measures something correlated with quality but not identical to it. A fluency oracle (does the output read well?) is not an accuracy oracle (does the output contain the correct entity name?). A schema-validity oracle (is the JSON parseable?) is not a field-accuracy oracle (are the extracted values correct?). Teams writing fluency or schema-only oracles consistently overestimate Tier 1 performance on extraction tasks, because small models produce fluent, well-formed JSON containing wrong values.
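The gap between a schema-validity oracle and a field-accuracy oracle is easy to demonstrate. The two functions below are a minimal sketch using the (output, expected) → bool signature described earlier; the function names and the sample contract entity are hypothetical.

```python
import json


def schema_only_oracle(output: str, expected: dict) -> bool:
    """Proxy oracle: passes if the output is a JSON object with the expected keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and set(expected) <= set(parsed)


def field_accuracy_oracle(output: str, expected: dict) -> bool:
    """Stricter oracle: every expected field must be present with the expected value."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(parsed.get(key) == value for key, value in expected.items())


expected = {"entity": "Acme Holdings LLC"}
well_formed_but_wrong = '{"entity": "Acme Corp"}'

print(schema_only_oracle(well_formed_but_wrong, expected))     # proxy oracle accepts it
print(field_accuracy_oracle(well_formed_but_wrong, expected))  # accuracy oracle rejects it
```

Exact equality is itself a design choice. If your requirement tolerates case or whitespace differences, normalize both sides inside the oracle rather than loosening the check to key presence.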
Calibration matters here. Run your oracle against a human-labeled validation set — 50 samples minimum — before trusting it to drive routing decisions. A well-calibrated oracle gives you a direct estimate of its false-positive rate: how often it passes outputs a human would reject. Factor that rate into your routing threshold.
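Measuring the oracle's false-positive rate needs nothing more than paired verdicts on the validation set. The sketch below assumes two parallel lists of booleans, one from the oracle and one from human labels; the function name and sample data are hypothetical.

```python
def oracle_false_positive_rate(
    oracle_passed: list[bool],
    human_passed: list[bool],
) -> float:
    """Among outputs a human rejected, the fraction the oracle passed anyway."""
    if len(oracle_passed) != len(human_passed):
        raise ValueError("verdict lists must be the same length")
    oracle_on_rejected = [
        oracle for oracle, human in zip(oracle_passed, human_passed) if not human
    ]
    if not oracle_on_rejected:
        return 0.0  # no human rejections in the sample, so nothing to measure
    return sum(oracle_on_rejected) / len(oracle_on_rejected)


# Hypothetical five-item validation sample.
oracle_verdicts = [True, True, True, False, False]
human_verdicts = [True, True, False, False, False]
print(oracle_false_positive_rate(oracle_verdicts, human_verdicts))
```

With only 50 labeled samples and a handful of human rejections, this estimate is itself noisy, so treat it as a rough correction rather than a precise one.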
The UCCI research (May 2026) demonstrated this dynamic on a 75,000-query production named entity recognition workload [3]. Routing on uncalibrated confidence scores produced expected calibration error (ECE) of 0.12. Calibrating the routing signal via isotonic regression dropped ECE to 0.03 and cut inference cost by 31% at the same quality floor. The mechanism is identical to oracle calibration: a systematic bias in the routing signal gets corrected, letting you tighten the threshold without sacrificing quality.
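For readers unfamiliar with the metric, expected calibration error is commonly computed by bucketing predictions by confidence and comparing each bucket's average confidence to its observed accuracy. The sketch below uses equal-width bins and is a generic illustration, not the UCCI paper's code; the isotonic-regression calibration step is not shown.

```python
def expected_calibration_error(
    confidences: list[float],
    correct: list[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical signal: claims 90% confidence but is right 3 times out of 4.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

A routing signal with high ECE forces a wider safety margin on the threshold, which is where the cost goes.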
For subjective tasks where no deterministic oracle exists: accept that your threshold will be approximate. Document the uncertainty explicitly in the routing config. A routing assignment with a labeled confidence: low is more useful for future recalibration than one with false precision.
profiling_harness.py

```python
"""Profiling harness: run task corpus through model tiers, compute cost_per_success."""
import asyncio
from dataclasses import dataclass
from typing import Callable

import litellm

# Tiers listed from cheapest to most expensive.
TIERS = {
    "tier1": "claude-haiku-4-5-20251001",
    "tier2": "claude-sonnet-4-6",
    "tier3": "claude-opus-4-7",
}

# USD per 1M tokens (input, output).
# Check these against the provider's current price list before running.
TIER_PRICES: dict[str, tuple[float, float]] = {
    "tier1": (0.80, 4.00),
    "tier2": (3.00, 15.00),
    "tier3": (15.00, 75.00),
}


@dataclass
class ProfilingResult:
    task_id: str
    tier: str
    passed: bool
    cost: float


def compute_cost_per_success(results: list[ProfilingResult]) -> dict:
    """Aggregate per-call results into pass rate and cost_per_success per tier."""
    by_tier: dict[str, dict] = {}
    for r in results:
        d = by_tier.setdefault(r.tier, {"total_cost": 0.0, "passed": 0, "total": 0})
        d["total_cost"] += r.cost
        d["passed"] += int(r.passed)
        d["total"] += 1
    return {
        tier: {
            "pass_rate": d["passed"] / d["total"],
            # A tier with zero passes has no finite cost per success.
            "cost_per_success": (
                d["total_cost"] / d["passed"] if d["passed"] else float("inf")
            ),
        }
        for tier, d in by_tier.items()
    }


async def _run_sample(
    tier_name: str,
    model: str,
    item: dict,
    oracle: Callable[[str, dict], bool],
) -> ProfilingResult:
    """Send one corpus item to one tier, price the call, and apply the oracle."""
    response = await litellm.acompletion(model=model, messages=item["messages"])
    # content can be None for some responses; the oracle expects a string.
    output = response.choices[0].message.content or ""
    in_price, out_price = TIER_PRICES[tier_name]
    cost = (
        response.usage.prompt_tokens * in_price / 1_000_000
        + response.usage.completion_tokens * out_price / 1_000_000
    )
    return ProfilingResult(
        task_id=item["id"],
        tier=tier_name,
        passed=oracle(output, item["expected"]),
        cost=cost,
    )


async def profile_task_type(
    corpus: list[dict],
    oracle: Callable[[str, dict], bool],
    task_type: str,
    pass_rate_floor: float = 0.95,
) -> dict:
    """Run corpus through all tiers. Return cheapest tier meeting pass_rate_floor."""
    tasks = [
        _run_sample(tier, model, item, oracle)
        for tier, model in TIERS.items()
        for item in corpus
    ]
    # All calls are launched at once; provider rate limits bound the throughput.
    results = await asyncio.gather(*tasks)
    metrics = compute_cost_per_success(list(results))

    # Cheapest tier meeting the quality floor, Tier 3 as safe default.
    routing_tier = "tier3"
    for tier in ["tier1", "tier2", "tier3"]:
        # Guard against an empty corpus, which yields no metrics for any tier.
        if tier in metrics and metrics[tier]["pass_rate"] >= pass_rate_floor:
            routing_tier = tier
            break

    return {
        "task_type": task_type,
        "routing_tier": routing_tier,
        "metrics": metrics,
        "corpus_size": len(corpus),
    }
```

The Profiling Results Are Not Insights. They Are Configuration.
The routing table should be a machine-readable artifact generated from profiling data — not a spreadsheet that informs a human decision.
The profiling run produces cost_per_success per tier per task type, plus pass rates. The decision rule is mechanical: use the cheapest tier whose pass rate meets your floor. There is one judgment call: setting the floor.
For tasks feeding automated irreversible actions — financial transactions, email sends, configuration changes — 0.95 is the minimum. For tasks reviewed by a human before acting, 0.85 is defensible. For informational tasks with no downstream action, 0.80 may be acceptable. These are risk calibrations, not quality preferences. Write them down with the reason before setting the threshold.
Store the routing config as a YAML file in source control. It updates when: model pricing shifts (quarterly), the profiling harness re-runs after oracle changes, or production monitoring shows that escalation rates have diverged from profiling predictions. The YAML is the single source of truth for which model handles which task.
If the routing logic lives in application code instead of configuration, you've created an ownership problem: someone has to ship a code change every time a price shift changes the routing economics. That is the wrong cost.
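Keeping routing in configuration means the application only performs a lookup. The sketch below is one possible shape, not a prescribed one: `ROUTING_CONFIG` is a plain dict standing in for the parsed YAML file, and the task types, field names, and `resolve_model` function are hypothetical. The tier-to-model mapping reuses the identifiers from the harness.

```python
# Stand-in for the parsed routing YAML; in a real service this dict would be
# loaded from the version-controlled config file at startup.
ROUTING_CONFIG = {
    "intent_classification": {
        "tier": "tier1",
        "pass_rate_floor": 0.95,
        "reason": "feeds automated ticket assignment",
        "confidence": "high",
    },
    "domain_summarization": {
        "tier": "tier2",
        "pass_rate_floor": 0.85,
        "reason": "human reviews every summary before it is sent",
        "confidence": "low",  # subjective oracle; threshold is approximate
    },
}

TIER_MODELS = {
    "tier1": "claude-haiku-4-5-20251001",
    "tier2": "claude-sonnet-4-6",
    "tier3": "claude-opus-4-7",
}

DEFAULT_TIER = "tier3"  # unprofiled task types get the safe default


def resolve_model(task_type: str, config: dict = ROUTING_CONFIG) -> str:
    """Look up the model for a task type; unknown types fall back to DEFAULT_TIER."""
    entry = config.get(task_type)
    tier = entry["tier"] if entry else DEFAULT_TIER
    return TIER_MODELS[tier]


print(resolve_model("intent_classification"))
print(resolve_model("not_yet_profiled"))
```

Storing the floor and its reason next to each tier assignment keeps the risk calibration reviewable in the same diff as the routing change.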
| Selection by default | Selection by measurement |
|---|---|
| Model chosen during prototype, unchanged at launch | Routing tier assigned from cost_per_success measurements |
| Cost estimate: frontier price × expected request count | Cost estimate: weighted sum across tiers from profiling data |
| No acceptance criteria defined before shipping | Quality oracle exists as executable code before launch |
| Routing retrofit begins when billing alarm fires | Routing table ships with the first production deploy |
| Migration cost: 4–8 weeks of engineering time | Threshold updates: YAML edit, no code deploy needed |
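The "weighted sum across tiers" cost estimate is a one-line calculation once profiling data exists. In the sketch below the cost_per_success values are taken from the earlier table, while the traffic shares, monthly volume, and function name are hypothetical. It treats the monthly volume as the number of correct answers required.

```python
def blended_monthly_cost(monthly_successes: int, profile: dict[str, dict]) -> float:
    """Sum of volume × traffic share × cost_per_success over all task types."""
    return sum(
        monthly_successes * entry["traffic_share"] * entry["cost_per_success"]
        for entry in profile.values()
    )


# cost_per_success from the routed tier of each task type; shares are assumed.
PROFILE = {
    "intent_classification": {"traffic_share": 0.6, "cost_per_success": 0.0003},
    "domain_summarization": {"traffic_share": 0.4, "cost_per_success": 0.0040},
}

estimate = blended_monthly_cost(1_000_000, PROFILE)
print(f"Estimated monthly model spend: ${estimate:,.2f}")
```

This is the number to bring to the architecture review: it changes when the routing config changes, and each term can be traced back to a profiling run.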
Pre-Launch Profiling Readiness
- Task types listed from architecture design doc: one row per distinct LLM call
- Task corpus exists: 100+ samples per type, version-controlled alongside the routing config
- Quality oracle implemented as executable code, not prose and not a rubric in a doc
- Oracle false-positive rate measured against a human-labeled sample (50+ items)
- Tier battery completed across all task types; cost_per_success table exists
- Routing thresholds set with documented quality floor, with the reason recorded alongside each threshold
- Routing config stored as YAML in source control, not hardcoded in application logic
- Cost model shared with stakeholders before launch, built from profiling data rather than pricing pages
- Production monitoring plan exists for escalation rate and cost_per_success drift
How many corpus samples do I need for reliable thresholds?
100+ per task type for binary oracles. At 100 samples, the Wilson confidence interval for a 90% pass rate spans roughly ±6%. At 50 samples, it spans ±9%. Fewer than 30 per type produces estimates too noisy to distinguish tier performance. For rare task types you can't sample past 30, set a conservative threshold — use 0.97 instead of 0.95 — to account for the uncertainty in the estimate.
My task types are too unique to find analogous examples. What then?
Build them adversarially. Write 50–100 inputs yourself, weighted toward the edge cases your system will encounter under real load: unusual input lengths, domain-specific vocabulary, ambiguous requests. Adversarial construction is slower than data collection but produces a higher-signal corpus for tier discrimination, because you're probing the boundary rather than sampling the bulk. The corpus doesn't have to characterize all possible inputs — it has to expose the cases where tiers diverge.
What if 'correct' is subjective for my task type?
Use LLM-as-judge with an explicit scoring rubric stored in source control. Grade each output property independently: completeness, factual accuracy relative to the provided context, format compliance. Combine into a pass/fail score. Calibrate the judge against a human-labeled sample to measure its disagreement rate. A judge with 15% disagreement is still useful — just document the limitation in the routing config and widen the threshold margin accordingly.
When do I need to re-run the profiling harness?
When the oracle definition changes, when a new model tier becomes available that might shift the cost_per_success ranking, or when production monitoring shows escalation rates have diverged from profiling predictions by more than 10–15 percentage points. Don't run it on a fixed schedule; run it when the inputs to the routing decision change. Quarterly model pricing changes alone rarely justify a full re-run unless the price shift is large enough to change the cheapest-qualifying-tier for a given task type.
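These triggers can be encoded as a small check that runs alongside production monitoring. The sketch below is illustrative: the function name and parameters are hypothetical, and it assumes the predicted escalation rate is simply one minus the profiled pass rate, which only holds if every oracle failure escalates.

```python
def needs_reprofile(
    profiled_pass_rate: float,
    observed_escalation_rate: float,
    oracle_changed: bool = False,
    new_tier_available: bool = False,
    drift_threshold_pp: float = 10.0,
) -> bool:
    """True when any input to the routing decision has changed enough to re-run."""
    # Assumption: every oracle failure escalates, so predicted escalation = 1 - pass rate.
    predicted_escalation_rate = 1.0 - profiled_pass_rate
    drift_pp = abs(observed_escalation_rate - predicted_escalation_rate) * 100
    return oracle_changed or new_tier_available or drift_pp > drift_threshold_pp


print(needs_reprofile(profiled_pass_rate=0.95, observed_escalation_rate=0.08))
print(needs_reprofile(profiled_pass_rate=0.95, observed_escalation_rate=0.20))
```

The first call reports about 3 points of drift and does not trigger a re-run; the second reports about 15 points and does.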
The routing table is a deliverable. Its existence or absence is an architectural decision, made before the architecture review or after the billing alarm, by measurement or by default.
Teams that run the profiling harness before launch arrive at that review with a cost model reflecting actual task-type performance on their specific prompt distribution — not a pricing page multiplied by a volume guess. They know which tier handles which task, what the cost_per_success looks like across the fleet, and where the quality floor sits on tasks that feed automated actions.
Two to three days to build the harness. One hour to run it. The routing config that comes out is grounded in empirical data. Everything else is a price list with a confidence interval of infinity.
- [1] TierBench: Deterministic benchmark harness and tier-based router for LLM cost optimization (github.com)
- [2] HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools (arxiv.org)
- [3] UCCI: Calibrated Uncertainty for Cost-Optimal LLM Cascade Routing (arxiv.org)
- [4] RouteLLM: Learning to Route LLMs with Preference Data (arxiv.org)
- [5] GuideLLM: Platform for evaluating language model performance under real workloads (github.com)