Teams default to vibe-based model selection because they lack production data pre-launch. A profiling harness resolves the catch-22: build a synthetic task corpus, define quality oracles as code, run every tier, and read cost_per_success from data before the first request.
Why token price is the wrong routing metric, and what cost_per_success measures instead
How to build a synthetic task corpus without production logs (three methods)
Writing quality oracles as executable code, calibrating them, and measuring false-positive rates
Running the tier battery and generating a YAML routing config from measured data
When multi-dimensional scoring beats scalar complexity (HyDRA's production numbers)
Production monitoring: escalation rate drift, re-run triggers, and threshold update cadence
Routing thresholds require historical traffic data. You haven't launched yet. This is the catch-22 that turns model selection into guesswork: pick a model during development, ship it, and let the billing alarm tell you six months later whether you chose wrong.
The problem isn't that teams skip routing. They skip it because they have nothing to calibrate it against before production. Standard advice — "run your task types through each tier and measure quality" — assumes a representative traffic corpus exists. Greenfield systems don't have one. So routing gets deferred to a post-launch optimization, the architecture review gets a number pulled from the model's pricing page, and the cost model describes a system that will be retrofitted under deadline pressure.
The mechanism that resolves the catch-22: a profiling harness built before the first production prompt. It generates a synthetic task corpus, defines acceptance criteria as executable code, runs the corpus through each model tier in parallel, and computes cost_per_success: the actual cost of getting a correct answer, not just the cost of generating tokens. The routing table emerges from measured data. Either it exists before traffic arrives, or you've made your model selection by architectural default.
Building a profiling harness from scratch takes a senior engineer two to three days [1]. Running it on a 200-sample corpus costs under $1 in API calls. Those numbers compare favorably to the post-launch retrofit, which typically runs four to eight weeks of engineering time.
TierBench's full T0–T3 benchmark suite costs roughly $0.15. A 500-sample custom corpus runs $0.50–$2 depending on tier mix and output length [1]
HyDRA matches Sonnet quality at 54.1% lower cost vs. always-Sonnet routing — a 6× improvement over the binary predecessor at the same quality floor [2]
Published at ICLR 2025: RouteLLM's matrix-factorization router routes 85% of queries to Mixtral 8x7B while maintaining 95% of GPT-4 performance on MT-Bench [6]
UCCI achieved 31% cost reduction at micro-F1 = 0.91 by calibrating routing thresholds against measured error probability, not raw confidence scores [3]
The architecture review wants cost per request. The right number is cost per correct answer — and those two figures diverge significantly across task types.
Token price tells you the cost of output. It doesn't tell you the cost of a correct answer.
A model that fails 40% of the time on your specific task is not simply the cheaper option next to one that fails 5% of the time. Failed calls still get billed, so it can cost more per successful outcome, and it degrades every downstream system that depends on those answers.
The formula: cost_per_success = total_cost / count_passed. You need both terms, and most pre-launch cost models have only the first. This isn't an academic concern: TierBench, an open benchmarking harness built for precisely this problem, makes cost_per_success the primary routing metric rather than token cost alone [1]. The reasoning: cheap models handle many tasks well and expensive models are necessary for only a few, but almost nobody measures success rate per task type, and token cost means little without that success probability.
For tasks with deterministic validation — schema extraction, classification against a closed label set, code that must pass a test suite — this metric is computable before launch. Construct test cases, define an acceptance oracle, run each tier, read the number. The routing decision follows mechanically: use the cheapest tier whose cost_per_success falls within an acceptable range.
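As a minimal sketch of the formula above, the snippet below computes cost_per_success for two tiers. The `TierResult` type and the dollar totals are illustrative assumptions (the totals are back-calculated from the approximate multi-hop reasoning figures in this article's table), not measured output from any harness.

```python
from dataclasses import dataclass


@dataclass
class TierResult:
    """Aggregate profiling result for one tier on one task type (hypothetical type)."""
    tier: str
    total_cost: float   # USD spent on every call, failed ones included
    count_passed: int   # calls whose output passed the oracle
    count_total: int    # all calls made


def cost_per_success(result: TierResult) -> float:
    """total_cost / count_passed; infinite when nothing passed."""
    if result.count_passed == 0:
        return float("inf")
    return result.total_cost / result.count_passed


def cost_per_call(result: TierResult) -> float:
    """Plain average spend per call, ignoring whether the answer was correct."""
    return result.total_cost / result.count_total


# Illustrative numbers only: 100 multi-hop samples per tier.
tier1 = TierResult("tier1", total_cost=0.3822, count_passed=42, count_total=100)
tier2 = TierResult("tier2", total_cost=0.4510, count_passed=82, count_total=100)
```

With these numbers, Tier 1 is cheaper per call but more expensive per correct answer, which is the divergence that a token-price estimate hides.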
| Task Type | Tier 1 Pass Rate | Tier 1 Cost/Success | Tier 2 Pass Rate | Tier 2 Cost/Success | Default Routing |
|---|---|---|---|---|---|
| Intent classification (closed labels) | ~95% | ~$0.0003 | ~99% | ~$0.0035 | Tier 1 |
| Structured entity extraction | ~88% | ~$0.0007 | ~97% | ~$0.0034 | Tier 1 + quality monitoring |
| Domain-specific summarization | ~72% | ~$0.0021 | ~95% | ~$0.0040 | Tier 2 |
| Open-domain RAG answer | ~60% | ~$0.0031 | ~91% | ~$0.0044 | Tier 2 |
| Multi-hop reasoning | ~42% | ~$0.0091 | ~82% | ~$0.0055 | Tier 2 or Tier 3 |
Most routing literature treats complexity as a single dimension. Real workloads mix reasoning, code, tool orchestration, and domain knowledge — each with different model affinity.
The standard routing framing — send "easy" queries to cheap models, "hard" queries to expensive ones — assumes complexity is scalar. It's not, and this assumption fails whenever your workload spans multiple capability types.
A query requiring deep multi-step reasoning but trivial code output differs fundamentally from one needing sophisticated code generation but no reasoning. A scalar router collapses this distinction and either overspends (both go to frontier) or underpowers (both go to small).
GitHub's production router — HyDRA — solved this by decomposing query requirements into four independent dimensions: reasoning, code generation, debugging, and tool use [2]. A ModernBERT encoder with four independent sigmoid heads scores each query along those axes. A shortfall-matching algorithm then selects the cheapest model whose capability profile meets or exceeds the query's predicted requirements. Adding a new model to the catalog is a YAML edit — zero retraining. The deployed predictor runs at 86 ms median CPU inference latency in production [2].
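The selection step can be sketched as follows. This is one plausible reading of "cheapest model whose capability profile meets or exceeds the predicted requirements", not HyDRA's actual implementation: the `ModelProfile` type, the catalog entries, the scores, and the least-shortfall fallback are all assumptions made for illustration.

```python
from dataclasses import dataclass

# The four dimensions named in the text; the snake_case keys are an assumption.
DIMENSIONS = ("reasoning", "code_generation", "debugging", "tool_use")


@dataclass(frozen=True)
class ModelProfile:
    name: str
    relative_cost: float   # any consistent cost unit
    capability: dict       # dimension -> score in [0, 1]


def shortfall(required: dict, model: ModelProfile) -> float:
    """Sum of how far the model falls below the requirement on each dimension."""
    return sum(max(0.0, required[d] - model.capability[d]) for d in DIMENSIONS)


def select_model(required: dict, catalog: list) -> str:
    """Cheapest model with zero shortfall; otherwise the model with least shortfall."""
    qualifying = [m for m in catalog if shortfall(required, m) == 0.0]
    if qualifying:
        return min(qualifying, key=lambda m: m.relative_cost).name
    return min(catalog, key=lambda m: shortfall(required, m)).name


# Hypothetical catalog; in the text's framing this would live in a YAML file.
CATALOG = [
    ModelProfile("small", 0.2, dict(reasoning=0.4, code_generation=0.5, debugging=0.4, tool_use=0.5)),
    ModelProfile("mid", 1.0, dict(reasoning=0.7, code_generation=0.8, debugging=0.7, tool_use=0.7)),
    ModelProfile("frontier", 5.0, dict(reasoning=0.95, code_generation=0.95, debugging=0.95, tool_use=0.95)),
]

code_heavy = dict(reasoning=0.2, code_generation=0.75, debugging=0.1, tool_use=0.1)
reasoning_heavy = dict(reasoning=0.9, code_generation=0.1, debugging=0.1, tool_use=0.1)
```

In this sketch the code-heavy query lands on the mid-priced model and the reasoning-heavy query on the frontier model, a distinction that a single complexity score cannot express.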
At the conservative operating point, HyDRA achieves 54.1% cost savings against always-Sonnet routing while matching Sonnet's resolution rate — a 6× improvement over the binary predecessor [2]. Push to aggressive mode and savings climb to 72.5%, trading 3.2 quality points. The performance delta comes entirely from exploiting per-dimension model affinity that a scalar router misses.
The practical implication for profiling: your task corpus needs to tag each sample across capability dimensions, not just by a single complexity score. Even a coarse three-flag taxonomy — reasoning-heavy, code-heavy, domain-knowledge-heavy — gives the harness enough signal to detect which tier dominates on which dimension. Tasks that cluster on only one dimension are your best early routing candidates; tasks with mixed profiles are where you need the most data before trusting a threshold.
Three sources that work for greenfield systems. The corpus is a version-controlled artifact, not a one-time experiment.
Without production logs, teams have three practical corpus sources.
Analogous system exports: If you're building a customer-support classification feature, your company already handles support tickets somewhere. Pull 200 de-identified examples from that channel. They won't perfectly match your new system's prompt format, but they'll match the domain vocabulary and input distribution — which matters more for tier selection than prompt format.
Internal dogfooding: Run the prototype against real-work tasks from your own team. An engineering team building a code-review agent can generate its own corpus in a few hours by passing open PRs through the prototype and capturing the prompt–response pairs. High-fidelity corpus, no data-collection infrastructure required.
Adversarial construction: For task types where you can't source real examples, build inputs that probe the boundaries. If your oracle is "extracts the correct legal entity from a contract clause," write 50 variants with increasing clause complexity, ambiguous pronouns, and unusual entity formats. Adversarial inputs surface the tier boundary more efficiently than uniformly distributed random examples — you're not trying to characterize average performance, you're trying to find where tiers diverge.
Sample size: 100+ per task type gives statistically reliable pass-rate estimates for binary oracles. A 2025 analysis of LLM evaluation statistics recommends against using the central limit theorem approximation for sample sizes under a few hundred, and endorses Wilson confidence intervals instead [9]. At 100 samples, the Wilson interval for a 90% pass rate spans roughly ±6%. At 50 samples, it spans ±9%. Fewer than 30 per type produces estimates too noisy to distinguish tier performance. For rare task types, 50 samples is workable — widen your routing threshold conservatively to account for the uncertainty.
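The interval widths quoted above can be checked with a short Wilson score computation. This is a standard-formula sketch using only the standard library; the function name and the default z = 1.96 (a 95% interval) are choices made here, not something the cited analysis prescribes.

```python
import math


def wilson_interval(passed: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial pass rate (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passed / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low_100, high_100 = wilson_interval(90, 100)   # 90% pass rate, 100 samples
low_50, high_50 = wilson_interval(45, 50)      # 90% pass rate, 50 samples
```

At 100 samples the half-width comes out near 6 percentage points and at 50 samples near 8.5, consistent with the rounded figures above. Note that the Wilson interval is not centered on the observed rate, so "±" is shorthand for half the interval width.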
Enumerate task types from the literal list of LLM calls your system makes, not from intuition. Every distinct call type should be a row in the profiling matrix. If a call isn't in the design doc yet, the profiling process surfaces that gap before it becomes a production surprise.
Use analogous system exports, internal dogfooding, or adversarial construction. Weight the sample to match your expected production distribution. If 60% of your traffic will be classification requests, 60% of your corpus should be classification samples.
Write each oracle as a Python function that takes (output: str, expected: dict) → bool. For structured outputs: JSON schema validation plus field-level accuracy. For open-ended outputs: an LLM-as-judge call against an explicit scoring rubric stored in source control. The oracle becomes the production quality gate.
Submit each corpus sample to Tier 1, 2, and 3 concurrently. Log output, input tokens, output tokens, and oracle result per call. Using litellm's async interface, a 500-sample corpus across three tiers typically completes in 5–15 minutes depending on rate limits.
For each task type, identify the cheapest tier with pass_rate ≥ 0.95 on your corpus. Adjust this threshold based on task stakes. Record the full metrics table — this is your cost model for the architecture review.
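The concurrent run described in the steps above might look like the sketch below. To keep it runnable without network access, the model call is an injected coroutine and the demo uses a stub; in a real harness `call_model` would wrap an async client such as litellm's `acompletion`, and that wiring is assumed rather than shown. The `CallRecord` fields, the sample layout, and the concurrency cap are all illustrative.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class CallRecord:
    task_type: str
    tier: str
    output: str
    input_tokens: int
    output_tokens: int
    passed: bool


async def run_battery(corpus, tiers, call_model, oracle, max_concurrency=8):
    """Send every sample to every tier concurrently and log one record per call."""
    sem = asyncio.Semaphore(max_concurrency)  # crude guard against rate limits

    async def one(sample, tier):
        async with sem:
            output, in_tok, out_tok = await call_model(tier, sample["prompt"])
        passed = oracle(output, sample["expected"])
        return CallRecord(sample["task_type"], tier, output, in_tok, out_tok, passed)

    return await asyncio.gather(*(one(s, t) for s in corpus for t in tiers))


async def fake_call(tier, prompt):
    """Stub standing in for a real API call: tier1 guesses, tier2 is 'smarter'."""
    await asyncio.sleep(0)
    if tier == "tier1":
        label = prompt.split()[0]
    else:
        label = "refund" if "refund" in prompt else "account"
    return label, len(prompt.split()), 1


def label_oracle(output: str, expected: dict) -> bool:
    return output.strip() == expected["label"]


corpus = [
    {"task_type": "intent", "prompt": "refund please", "expected": {"label": "refund"}},
    {"task_type": "intent", "prompt": "reset my password", "expected": {"label": "account"}},
]
records = asyncio.run(run_battery(corpus, ["tier1", "tier2"], fake_call, label_oracle))
```

Injecting `call_model` keeps the harness testable offline and makes it easy to add a spend cap or retry policy around the real client later.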
An oracle measuring proxy quality instead of the actual requirement produces routing thresholds that look correct and fail in production.
The profiling harness is mechanically straightforward. The oracle is the judgment call.
The failure mode: writing an oracle that measures something correlated with quality but not identical to it. A fluency oracle (does the output read well?) is not an accuracy oracle (does the output contain the correct entity name?). A schema-validity oracle (is the JSON parseable?) is not a field-accuracy oracle (are the extracted values correct?). Teams writing fluency or schema-only oracles consistently overestimate Tier 1 performance on extraction tasks, because small models produce fluent, well-formed JSON containing wrong values.
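The gap between the two oracle types is easy to see in code. The sketch below assumes a hypothetical two-field extraction schema (`entity_name`, `entity_type`) and a simple case and whitespace normalization; a real oracle would use the project's actual schema and matching rules.

```python
import json

REQUIRED_FIELDS = ("entity_name", "entity_type")  # hypothetical schema


def _norm(value) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences pass."""
    return " ".join(str(value).lower().split())


def schema_only_oracle(output: str, expected: dict) -> bool:
    """Proxy check: parseable JSON object with the required keys present."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(f in data for f in REQUIRED_FIELDS)


def field_accuracy_oracle(output: str, expected: dict) -> bool:
    """Actual requirement: well-formed AND every extracted value is correct."""
    if not schema_only_oracle(output, expected):
        return False
    data = json.loads(output)
    return all(_norm(data[f]) == _norm(expected[f]) for f in REQUIRED_FIELDS)


expected = {"entity_name": "Acme Holdings LLC", "entity_type": "company"}
fluent_but_wrong = '{"entity_name": "Acme Corp", "entity_type": "company"}'
```

Here `fluent_but_wrong` passes the schema-only check and fails the field-accuracy check, which is exactly the case that inflates small-model pass rates.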
Calibration matters here. Run your oracle against a human-labeled validation set — 50 samples minimum — before trusting it to drive routing decisions. A well-calibrated oracle gives you a direct estimate of its false-positive rate: how often it passes outputs a human would reject. Factor that rate into your routing threshold.
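A small sketch of that calibration check, assuming two parallel lists of verdicts over the same validation samples. The function names are hypothetical, and "false-positive rate" is taken here in the standard sense: among outputs a human rejected, the share the oracle passed.

```python
def false_positive_rate(oracle_passed, human_passed):
    """Share of human-rejected outputs that the oracle nevertheless passed."""
    rejected_by_human = [o for o, h in zip(oracle_passed, human_passed) if not h]
    if not rejected_by_human:
        return None  # no rejected samples: the rate cannot be estimated
    return sum(rejected_by_human) / len(rejected_by_human)


def pass_precision(oracle_passed, human_passed):
    """Share of oracle-passed outputs that a human also accepted."""
    accepted_by_oracle = [h for o, h in zip(oracle_passed, human_passed) if o]
    if not accepted_by_oracle:
        return None
    return sum(accepted_by_oracle) / len(accepted_by_oracle)


# Illustrative verdicts for ten validation samples (real sets should be 50+).
oracle = [True, True, True, False, True, False, True, True, False, True]
human = [True, True, False, False, True, False, True, False, True, True]
```

A low `pass_precision` suggests the observed pass rate overstates real quality, so the routing floor for that task type needs extra margin.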
The UCCI research (May 2026) demonstrated this dynamic on a 75,000-query production named entity recognition workload [3]. Routing on uncalibrated confidence scores produced expected calibration error (ECE) of 0.12. Calibrating the routing signal via isotonic regression dropped ECE to 0.03 and cut inference cost by 31% at the same quality floor. The mechanism is identical to oracle calibration: a systematic bias in the routing signal gets corrected, letting you tighten the threshold without sacrificing quality.
RouteLLM (ICLR 2025) showed the same principle from the other direction [4] [6]: its matrix-factorization router uses preference data to calibrate the threshold between strong and weak models. The calibration step is explicit — you run calibrate_threshold against a dataset that resembles your production distribution. Without calibration, the threshold is wrong by default. With it, RouteLLM routes 85% of queries to the cheaper model while maintaining 95% of frontier-model performance on MT-Bench [7].
For subjective tasks where no deterministic oracle exists: accept that your threshold will be approximate. Document the uncertainty explicitly in the routing config. A routing assignment with a labeled confidence: low is more useful for future recalibration than one with false precision.
The routing table should be a machine-readable artifact generated from profiling data — not a spreadsheet that informs a human decision.
The profiling run produces cost_per_success per tier per task type, plus pass rates. The decision rule is mechanical: use the cheapest tier whose pass rate meets your floor. There is one judgment call: setting the floor.
For tasks feeding automated irreversible actions — financial transactions, email sends, configuration changes — 0.95 is the minimum. For tasks reviewed by a human before acting, 0.85 is defensible. For informational tasks with no downstream action, 0.80 may be acceptable. These are risk calibrations, not quality preferences. Write them down with the reason before setting the threshold.
Store the routing config as a YAML file in source control. A minimal structure captures the three fields every threshold entry needs: which model handles the task, what pass rate was observed at profiling time, and what floor was accepted.
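One hedged way to produce such a file directly from profiling output is sketched below. The key names (`routes`, `model`, `observed_pass_rate`, `accepted_floor`), the tier names, and the sample numbers are assumptions for illustration. The YAML is rendered by hand to avoid a third-party dependency; a YAML library would be the usual choice in practice.

```python
def build_routing_config(profile, floors, tier_order):
    """Pick, per task type, the cheapest tier whose pass rate meets the floor.

    profile:    {task_type: {tier: pass_rate}}
    floors:     {task_type: accepted pass-rate floor}
    tier_order: tier names sorted cheapest first
    """
    config = {}
    for task, rates in profile.items():
        floor = floors[task]
        qualifying = [t for t in tier_order if rates.get(t, 0.0) >= floor]
        if not qualifying:
            # Surface the gap instead of silently routing below the floor.
            raise ValueError(f"no profiled tier meets floor {floor} for {task}")
        chosen = qualifying[0]
        config[task] = {
            "model": chosen,
            "observed_pass_rate": rates[chosen],
            "accepted_floor": floor,
        }
    return config


def to_yaml(config) -> str:
    """Render the two-level config as YAML text."""
    lines = ["routes:"]
    for task, entry in config.items():
        lines.append(f"  {task}:")
        for key, value in entry.items():
            lines.append(f"    {key}: {value}")
    return "\n".join(lines) + "\n"


# Illustrative pass rates, loosely following the table earlier in the article.
PROFILE = {
    "intent_classification": {"tier1": 0.95, "tier2": 0.99},
    "domain_summarization": {"tier1": 0.72, "tier2": 0.95},
}
FLOORS = {"intent_classification": 0.95, "domain_summarization": 0.85}

routing_yaml = to_yaml(build_routing_config(PROFILE, FLOORS, ["tier1", "tier2"]))
```

Raising when no tier qualifies is a deliberate choice in this sketch: it forces a human decision about the floor rather than shipping a route that profiling data does not support.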
The YAML becomes the single source of truth for which model handles which task. It updates when: model pricing shifts significantly, the profiling harness re-runs after oracle changes, or production monitoring shows escalation rates have diverged from profiling predictions.
If the routing logic lives in application code instead of configuration, you've created an ownership problem: someone has to ship a code change every time a price shift changes the routing economics. YAML edits don't require a code deploy. That matters at quarterly pricing cycles.
Model chosen during prototype, unchanged at launch
Cost estimate: frontier price × expected request count
No acceptance criteria defined before shipping
Routing retrofit begins when billing alarm fires
Migration cost: 4–8 weeks of engineering time
Routing tier assigned from cost_per_success measurements
Cost estimate: weighted sum across tiers from profiling data
Quality oracle exists as executable code before launch
Routing table ships with the first production deploy
Threshold updates: YAML edit, no code deploy needed
A routing config is calibrated at a point in time. Models improve, prompts change, product requirements shift. Two metrics tell you when re-profiling is warranted.
Profiling gives you a routing table calibrated against your synthetic corpus at a point in time. In production, two numbers tell you whether that calibration is still valid.
Escalation rate: the fraction of requests that your routing logic escalates from a lower tier to a higher one because the lower tier's output failed the quality check at inference time. If you profiled intent classification at 96% pass rate for Tier 1, you'd expect roughly 4% escalation. If production shows 20%, something has drifted — your prompt, your domain distribution, or the model itself.
cost_per_success delta: compare your profiling baseline against the rolling production average for each task type. A widening gap indicates either oracle calibration drift (the oracle is passing things it shouldn't) or model capability regression.
For RouteLLM-style learned routers, the recommendation is re-calibration monthly, because production traffic distributions shift as products evolve [7]. For threshold-based harness routing, the trigger is event-driven rather than scheduled: re-run the harness when the escalation rate deviates more than 10–15 percentage points from the profiling baseline, or when a new model tier enters the catalog that might shift the cheapest-qualifying-tier assignment.
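The event-driven trigger can be expressed as a small check. This sketch assumes the expected escalation rate is simply one minus the profiled pass rate, and uses 10 percentage points as the default tolerance; both the function name and the default are illustrative.

```python
def should_reprofile(profiled_pass_rate: float, escalated: int, total: int,
                     tolerance_pts: float = 10.0) -> bool:
    """True when observed escalation drifts beyond tolerance from the baseline.

    Rates are compared in percentage points, matching the 10-15 point guidance.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    expected_pct = (1.0 - profiled_pass_rate) * 100.0
    observed_pct = escalated / total * 100.0
    return abs(observed_pct - expected_pct) > tolerance_pts


# Profiled at 96% pass rate, so roughly 4% escalation is expected.
drifted = should_reprofile(0.96, escalated=200, total=1000)   # 20% observed
healthy = should_reprofile(0.96, escalated=80, total=1000)    # 8% observed
```

In the intent-classification example above, 20% observed escalation trips the trigger while 8% does not.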
Don't run it on a fixed schedule. Run it when the inputs to the routing decision change. Quarterly model pricing changes alone rarely justify a full re-run unless the price shift is large enough to change which tier is cheapest at your quality floor.
Operational constraints that keep the harness trustworthy over multiple model generations and pricing cycles.
Token price without pass rate is a cost estimate, not a routing decision. A model that fails 40% of requests can cost more per success than a frontier model with 2% failure, even though its per-token price is lower.
If the corpus drifts from the oracle definition, the profiling results are meaningless. Co-locate them so a PR that changes the oracle must update the corpus and re-run results before merging.
An oracle with 15% false positives produces a routing table calibrated against the wrong signal. The false-positive rate directly affects how much margin you need above your quality floor.
500 samples across 3 tiers is 1,500 API calls. An unbounded run on a large corpus at frontier pricing can generate a surprising bill. Cap first, run second.
Future engineers inheriting the routing config need to know why the threshold was set, not just what it is. A number without context will be copied and pasted into the wrong task type.
Routing tables go stale silently. Without a baseline from day one, you have nothing to compare when drift is suspected six months later.
A fixed quarterly schedule wastes time if nothing has changed and misses problems that occur between cycles. Trigger on change events, not on calendar dates.
How many corpus samples do I need for reliable thresholds?
100+ per task type for binary oracles. A 2025 statistical analysis of LLM evaluation methodology recommends Wilson confidence intervals over the normal approximation, and at 100 samples the Wilson interval for a 90% pass rate spans roughly ±6% [9]. At 50 samples it spans ±9%. Fewer than 30 per type produces estimates too noisy to distinguish tier performance. For rare task types you can't sample past 30, set a conservative threshold — use 0.97 instead of 0.95 — to account for the estimation uncertainty.
My task types are too unique to find analogous examples. What then?
Build them adversarially. Write 50–100 inputs yourself, weighted toward the edge cases your system will encounter under real load: unusual input lengths, domain-specific vocabulary, ambiguous requests. Adversarial construction is slower than data collection but produces a higher-signal corpus for tier discrimination, because you're probing the boundary rather than sampling the bulk. The corpus doesn't have to characterize all possible inputs — it has to expose the cases where tiers diverge.
What if 'correct' is subjective for my task type?
Use LLM-as-judge with an explicit scoring rubric stored in source control. Grade each output property independently: completeness, factual accuracy relative to the provided context, format compliance. Combine into a pass/fail score. Calibrate the judge against a human-labeled sample to measure its disagreement rate. A judge with 15% disagreement is still useful — just document the limitation in the routing config and widen the threshold margin accordingly.
When do I need to re-run the profiling harness?
When the oracle definition changes, when a new model tier becomes available that might shift the cost_per_success ranking, or when production monitoring shows escalation rates have diverged from profiling predictions by more than 10–15 percentage points. RouteLLM's production guidance recommends monthly re-calibration for learned routers because traffic distributions shift as products evolve [7]. For threshold-based harnesses, event-driven is better: run it when the inputs to the routing decision change, not on a fixed schedule.
Should I use a learned router (like RouteLLM) or a threshold-based harness?
Threshold-based harness first. A learned router requires training data — which you don't have pre-launch. RouteLLM's matrix-factorization router delivers 85% cost reduction on MT-Bench at 95% of GPT-4 quality [6] [7], but only after calibration against production-representative data. Build the profiling harness to ship a working routing table on day one. Migrate to a learned router once you have six to eight weeks of production traffic and can calibrate the threshold against real query distributions.
What's the difference between a cascade and a routing harness?
A cascade tries a cheap model first and escalates to a more capable one if quality is below threshold — at inference time, per request. A routing harness runs offline before launch, assigns task types to tiers statically, and stores those assignments as configuration. Cascades add per-request latency (you always call the cheap model, even when you know it will fail). Harnesses add engineering time upfront. Use a harness for known, repeatable task types; use a cascade when task complexity is genuinely unpredictable at request time.
The routing table is a deliverable. Its existence or absence is an architectural decision, made before the architecture review or after the billing alarm, by measurement or by default.
Teams that run the profiling harness before launch arrive at that review with a cost model reflecting actual task-type performance on their specific prompt distribution — not a pricing page multiplied by a volume guess. They know which tier handles which task, what cost_per_success looks like across the fleet, and where the quality floor sits on tasks that feed automated actions.
Two to three days to build the harness. One hour to run it. The routing config that comes out is grounded in empirical data. Everything else is a price list with a confidence interval of infinity.