Latency, error rate, and token cost stay green while LLM output quality degrades for weeks. The infrastructure layer cannot see semantic failure. Sampled evals, prompt hash drift, and distribution alerts are the signals that catch it before users do.
Three weeks into production. Median latency 380ms. Error rate 0.3%. Token spend on budget. SLO review signed off. The team moved on.
User satisfaction had been dropping every day for two weeks. It took a VP escalating before anyone correlated the curve with a prompt cleanup eleven days earlier — nine words removed from the system prompt that had told the model to format responses as structured JSON. Latency did not move. Errors did not move. Cost did not move. The only signal was user complaints, and nobody had wired user complaints into the engineering timeline.
This is the gap. The signals teams instrument first — latency, error rate, token cost — all sit healthy while semantic quality collapses underneath them. Infrastructure metrics tell you the request completed. They have no opinion on whether it worked. Logging that records 'request returned 200' is not observability for an LLM. It is an alibi.
Latency, error rate, and token cost measure completion, not usefulness — three healthy dashboards can sit next to a collapsing product experience for weeks.
Sampled LLM-as-judge on 5–10% of live traffic is the cheapest mechanism that catches quality drift before users escalate it.
Prompt hash drift catches the most common silent regression: a template change that nobody announced and nothing logged.
OpenTelemetry GenAI semantic conventions cover infra signals natively; quality signals require custom span attributes you add yourself.
Cost tracking comes before sampling. Your eval sample rate is bounded by what each evaluation costs — without that number you're guessing.
Alert on distribution shift across a rolling window. A 5% drop over 7 days is a signal. A single score moving is noise.
Judge calibration is not optional. Cohen's kappa below 0.6 means you're scaling confident-looking noise, not measurement.
The fight: a tracing model designed for deterministic services applied to a non-deterministic one. The infrastructure layer cannot see semantic failure.
Distributed tracing was built on a contract: a function returns 200, the request succeeded. SLOs hold. Alerts fire. Dashboards mean something. LLMs break that contract.
A response can be syntactically valid, arrive under 500ms, return 200, consume the budgeted tokens, and still be wrong, malformed, or off-intent. The span is clean. The service is healthy. The user experience is degrading. Nothing in your tracing stack was built to notice the difference.
The gap between 'request completed' and 'request was useful' is where every LLM quality failure hides. Teams find out the same way every time — monitoring fires on model API outages and stays silent through multi-week quality regressions that users notice long before engineers do.[3] By the time someone escalates, the curve has been bending for two weeks.
The failure taxonomy is wider than most teams expect. Research on production LLM failure modes identifies at least five distinct classes that infrastructure metrics are blind to: output format deviation, semantic drift, factual incoherence, context handling failures (long context truncation, retrieval miss), and safety degradation (refusal rate changes from provider-side model updates).[10] Each class requires a different signal to detect.
RAG compounds it. Retrieval scores that drift a few percent will not trigger any infrastructure alert. The model does its best with worse context. Outputs degrade subtly. Aggregating retrieval and generation into one health number is how the root cause stays hidden — track them as separate signals or don't bother tracking them at all.
p50/p95/p99 latency SLOs
HTTP error rate
Token spend alerts
Model API availability
Rate limit proximity
Sampled LLM-as-judge scores against rolling baselines
Prompt hash drift attached to every span
Per-intent quality tracked as independent series
Retrieval and generation scored separately — never aggregated
Distribution-shift alerts, not point-variance noise
The GenAI semantic conventions are a solid infra foundation. Quality signals are a separate layer you build on top.
The OpenTelemetry GenAI semantic conventions standardize how LLM calls are recorded across providers.[7] They're the right foundation. But understanding exactly what they cover — and what they don't — determines where you have to build.
What the conventions give you out of the box:
gen_ai.system: the provider (e.g., anthropic, openai, aws.bedrock)
gen_ai.request.model: the exact model name requested
gen_ai.usage.input_tokens / gen_ai.usage.output_tokens: token counts
gen_ai.response.finish_reasons: why generation stopped (stop, length, tool_calls)
gen_ai.client.operation.duration: a histogram of LLM call latencies[8]
As of early 2026, most GenAI semantic conventions are in experimental status: the API isn't fully stabilized, and major vendors like Datadog and Grafana are beginning native support but with version-specific caveats. Pin your instrumentation library version and test collector compatibility before assuming pass-through.
What the conventions do not cover: quality signals such as prompt template hashes, prompt versions, and evaluation scores.
Those are yours to instrument. The convention doesn't stop you — OpenTelemetry collectors pass custom span attributes through without modification. You add gen_ai.prompt.hash, gen_ai.prompt.version, quality.score.format, quality.score.relevance as custom attributes on the same span. They flow through the same pipeline your infra metrics use. No separate pipeline, no separate backend.
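A minimal sketch of attaching those custom attributes, assuming a span object that exposes `set_attribute(key, value)` as OpenTelemetry spans do. The helper `annotate_span` and the `RecordingSpan` stand-in are hypothetical names used only so the example runs without OpenTelemetry installed; the hash and version values are placeholders.

```python
def annotate_span(span, prompt_hash, prompt_version, scores):
    """Attach prompt fingerprint and quality scores to an LLM call span.

    `span` is anything with a set_attribute(key, value) method.
    `scores` maps a quality dimension name to a numeric score.
    """
    span.set_attribute("gen_ai.prompt.hash", prompt_hash)
    span.set_attribute("gen_ai.prompt.version", prompt_version)
    for dimension, score in scores.items():
        span.set_attribute(f"quality.score.{dimension}", float(score))


class RecordingSpan:
    """Stand-in for a real span (assumption: used here for illustration only)."""

    def __init__(self):
        self.attributes = {}

    def set_attribute(self, key, value):
        self.attributes[key] = value


span = RecordingSpan()
annotate_span(span, "3f9a1c0e5b7d2a46", "v12", {"format": 1.0, "relevance": 0.8})
```

With the real OpenTelemetry Python API, the same helper could be called on the span yielded by `tracer.start_as_current_span(...)`.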
Each one targets a specific failure class infrastructure metrics ignore. Ordered by build cost and leverage.
The fight: signal coverage versus inference cost. Sample too low, miss the regression. Sample too high, double the bill.
The mechanism is simple. After a request completes, route a sample of it through an evaluator that scores the response against the criteria you actually care about — format compliance, relevance to intent, factual coherence. Aggregate the scores. Alert when the trend breaks.
Sampling rate is a cost decision dressed up as a statistics decision. Adaline's 2026 guide puts the range at 5–10% of production traffic, enough for confidence intervals without doubling inference spend.[1] Don't start there. Start at 1–2% and validate that the judge agrees with human raters on the same samples. Scaling to 5–10% before you trust the judge means collecting misleading data faster.
Two rules hold regardless of sampling rate. Evaluate 100% of requests that errored or got escalated. Stratify by intent so each category gets proportional coverage. Sampling only from the most recent traffic gives you recency bias and hides slow degradations that compound across intent categories unevenly.
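The two rules can be sketched as a single sampling decision. This is one possible shape, not a prescribed implementation: the function name, the per-intent `rates` mapping, and the choice to sample deterministically from a hash of the trace ID are all assumptions made for illustration.

```python
import hashlib


def should_evaluate(trace_id, intent, errored, escalated, rates, default_rate=0.01):
    """Decide whether a completed request goes to the evaluator.

    Errored or escalated requests are always evaluated. Everything else is
    sampled at a per-intent rate, so each intent category gets its own coverage.
    """
    if errored or escalated:
        return True
    rate = rates.get(intent, default_rate)
    # Hash the trace ID into [0, 1) so the decision is stable across retries
    # and services, unlike a fresh random draw.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical per-intent rates; a low-volume intent can be given a higher rate.
rates = {"billing": 0.02, "troubleshooting": 0.02, "account_deletion": 0.10}
decision = should_evaluate("trace-0001", "billing", False, False, rates)
```

Keeping the rates per intent makes it possible to raise coverage for low-volume categories that a single global rate would rarely sample.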
The metric that matters is trend, not absolute score. 3.8/5.0 today is meaningless. 4.2 → 3.5 over two weeks is a structural signal. Continuous quality monitoring guidance puts the alert threshold at a 5% drop in any single dimension over a 7-day rolling window, which surfaces real degradation before users escalate.[2]
One honest constraint: non-determinism means individual scores will move even on identical inputs. Don't alert on single-request variance. Alert on distribution shift — when the population of scores changes shape, not when one number fluctuates.
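A minimal sketch of the rolling-window check, assuming scores are already grouped per quality dimension into a baseline window and the most recent 7-day window. Comparing window means is the simplest proxy for a distribution shift; the function names and the `min_samples` guard are assumptions, and a statistical test on the full distributions would be a stricter alternative.

```python
from statistics import mean


def relative_drop(baseline_scores, window_scores):
    """Fractional drop of the window mean relative to the baseline mean."""
    base = mean(baseline_scores)
    if base <= 0:
        raise ValueError("baseline mean must be positive")
    return (base - mean(window_scores)) / base


def quality_alerts(baseline_by_dim, window_by_dim, threshold=0.05, min_samples=30):
    """Return {dimension: drop} for every dimension that fell by `threshold` or more.

    Dimensions with too few samples are skipped so that a handful of noisy
    scores cannot trigger an alert on their own.
    """
    alerts = {}
    for dimension, baseline in baseline_by_dim.items():
        window = window_by_dim.get(dimension, [])
        if len(baseline) < min_samples or len(window) < min_samples:
            continue
        drop = relative_drop(baseline, window)
        if drop >= threshold:
            alerts[dimension] = drop
    return alerts
```

Because the comparison runs on aggregates over many scored requests, a single request scoring differently on a retry does not move the result.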
5–10%: sample rate that gives confidence intervals you can trust without doubling inference cost (Adaline, 2026)
5%: 7-day rolling drop in any single quality dimension, the threshold worth waking up for
100%: errored and escalated requests always evaluated. Sample rate does not apply to known failures.
κ ≥ 0.6: judge agreement with human raters must hit this bar before you trust scores at scale
A miscalibrated judge running at 7% sample rate produces confident-looking noise. Cohen's kappa is the gate.
Most teams skip calibration. They pick a judge model, write a rubric, and start collecting scores. The scores look plausible. The trend lines look smooth. What they've actually built is an expensive random number generator with a good aesthetic.
Calibration is the step that separates measurement from theater. Rate 50–200 production samples by hand, have your judge score the same samples, then compute Cohen's kappa between the two. For balanced criteria on a binary pass/fail rubric, 50 stratified traces will pin kappa to within ±0.10–0.15 at a 95% bootstrap confidence interval. That's enough to know whether your judge is tracking human judgment.[9]
When your criteria involve rare-but-expensive failure classes — safety violations appearing in 6% of traces, for example — 50 samples isn't enough. The variance of kappa is dominated by the count of minority-class examples, not the total sample size. Plan for 200+ when minority classes matter.[9]
The threshold that matters: kappa ≥ 0.6 before scaling to 5–10% of production traffic. Below that, the judge isn't tracking human quality judgment consistently enough to detect real regressions. Between 0.6 and 0.8 is workable for most quality dimensions. Above 0.8 is the bar for high-stakes criteria like safety detection.
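For reference, Cohen's kappa compares observed agreement with the agreement expected by chance from each rater's label frequencies. A small self-contained sketch follows; the function name and the pass/fail labels are illustrative, and a library implementation such as scikit-learn's `cohen_kappa_score` would serve equally well.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Cohen's kappa between two raters over the same items."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human_labels)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # Chance agreement: product of each rater's marginal frequency per label.
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1 - expected)


human = ["pass"] * 5 + ["fail"] * 5
judge = ["pass"] * 4 + ["fail"] + ["fail"] * 4 + ["pass"]
kappa = cohens_kappa(human, judge)  # 8 of 10 agree, chance agreement is 0.5
```

In this toy example the raters agree on 80% of items, yet kappa lands at the 0.6 gate because half of that agreement would be expected by chance.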
Calibration also has a shelf life. Judge drift is real: without a monthly recalibration cadence, agreement degrades over 60–90 days as production distribution shifts. Run your gold set monthly. Alert if kappa drops below threshold. Treat the judge as a system that also needs monitoring.
Nobody announces a prompt change. Without a fingerprint on every span, a nine-word deletion is indistinguishable from no change at all.
Prompt drift is invisible by design. A developer refactors a helper. A template variable shifts scope. A conditional branch that used to include an instruction now skips it. The effective prompt the model receives changes, and unless you're hashing the template, no deployment log will mention it.
The fix is one function and one span attribute. Hash the template before rendering it with user data. Attach the hash to every LLM call span. When the hash moves, the template moved — whether or not anyone wrote a changelog entry.
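A minimal sketch of that function, assuming the template is available as a string before user data is interpolated. The name `template_hash`, the 16-character truncation, and the example template are assumptions chosen for illustration.

```python
import hashlib


def template_hash(template):
    """Fingerprint a prompt template before user data is rendered into it."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:16]


# Hypothetical template; {question} is filled in per request.
SYSTEM_TEMPLATE = (
    "You are a support assistant. Format responses as structured JSON.\n"
    "Question: {question}"
)

prompt_hash = template_hash(SYSTEM_TEMPLATE)  # hash the template, not the rendered prompt
rendered = SYSTEM_TEMPLATE.format(question="How do I reset my password?")
```

Hashing the unrendered template keeps the fingerprint identical across requests, so it changes only when the template itself changes.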
The leverage is in correlation. A hash change followed by a quality score drop within 24–48 hours is almost always the root cause. A hash change with no quality impact is the safe case — note it and move on. The combination tells you what a deployment log alone cannot: did the change actually matter.
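The correlation step can be sketched as a lookup over a timeline of hash changes. The data shape (a list of timestamp and hash pairs) and the function name are assumptions; in practice the timeline would come from whatever backend stores the span attributes.

```python
from datetime import datetime, timedelta


def suspect_hash_changes(hash_changes, drop_detected_at, lookback=timedelta(hours=48)):
    """Return the hash changes that happened within `lookback` before a quality drop.

    `hash_changes` is a list of (timestamp, new_hash) tuples.
    """
    return [
        (changed_at, new_hash)
        for changed_at, new_hash in hash_changes
        if timedelta(0) <= drop_detected_at - changed_at <= lookback
    ]


# Hypothetical timeline: three template changes, one quality drop.
changes = [
    (datetime(2026, 1, 1, 9, 0), "a1b2c3d4e5f60718"),
    (datetime(2026, 1, 3, 9, 0), "0f1e2d3c4b5a6978"),
    (datetime(2026, 1, 6, 9, 0), "1122334455667788"),
]
suspects = suspect_hash_changes(changes, datetime(2026, 1, 4, 12, 0))
```

An empty result means no template change preceded the drop, which points the investigation away from the prompt.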
Model provider updates are a second, subtler form of drift. Providers update model weights, safety policies, and serving infrastructure without changing API endpoints. An August 2025 postmortem documented bugs affecting Claude Sonnet 4's performance without any API modifications, including routing errors affecting up to 16% of requests. Prompt hashing doesn't catch this — the template didn't change. What catches provider-side drift is a continuous baseline: quality scores should be stable between template changes, and any shift that isn't correlated with a hash change needs a different root cause hypothesis.
One constraint: hashing catches changes to the template itself, not to the data interpolated into it. When user history, retrieved documents, or business rules drift, you need embedding-based drift detection on the rendered input distribution.[6] Hashing is a zero-cost signal for the most common silent regression. It's not the whole story, and a team that claims otherwise is selling something.
Hashing catches template changes. This catches when the world stops sending you the queries you built for.
There's a failure class that prompt hashing misses entirely: the input distribution shifts while your prompt stays constant. Users start asking questions your system wasn't built for. A seasonal event changes query intent. A product update attracts a different user segment. The model handles these new queries worse than the ones you designed and tested against, but every span shows the same template hash.
Embedding-based drift detection addresses this. Embed a rolling sample of incoming queries using the same encoder your retrieval system uses. Cluster them over time. When the centroid of the production distribution moves beyond a set distance from the region your golden dataset covers, you're receiving queries meaningfully different from what you built for, and you know it before failure rates climb.[2]
The mechanics: compute cosine distance between the production query embedding centroid (rolling 7-day window) and the golden dataset centroid. A 15–20% shift in centroid distance is a reasonable threshold for investigation. It's not a reason to page; it's a reason to sample manually and decide whether the new distribution requires prompt updates or retrieval index refreshes.
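A minimal sketch of the centroid comparison in plain Python, assuming query embeddings are already available as equal-length lists of floats. The function names are hypothetical, and treating a cosine distance of 0.15 as the investigation threshold is one reading of the 15–20% figure, not something the text pins down.

```python
import math


def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 minus cosine similarity; 0 means same direction, 1 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cannot compare a zero vector")
    return 1.0 - dot / norm


def query_drift(production_embeddings, golden_embeddings, threshold=0.15):
    """Return (distance, flagged) for a rolling production sample vs. the golden set."""
    distance = cosine_distance(centroid(production_embeddings), centroid(golden_embeddings))
    return distance, distance >= threshold
```

A flagged result is a prompt to pull a manual sample of recent queries, in line with the guidance above that this signal should not page anyone.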
For RAG pipelines, a variant of this applies at the retrieval layer. Track the distribution of retrieval scores across a rolling window. A 10% drop in the median retrieval score over 3 days means the knowledge base is drifting relative to the queries — not a model problem, a data freshness problem. Track it as a separate signal from generation quality, or you'll spend two days debugging the wrong layer.
| Failure Class | Caught by Latency/Error? | Caught by Prompt Hash? | Caught by Sampled Eval? | Caught by Embedding Drift? |
|---|---|---|---|---|
| Template edit removed instruction | No | Yes — hash changes | Yes — scores drop | No |
| Provider model update changed behavior | No | No — template unchanged | Yes — if baseline is continuous | No |
| Input distribution shifted (new query types) | No | No | Partially — if sampled queries reflect new dist. | Yes |
| RAG knowledge base staleness | No | No | Partially — coherence score drop | Yes — retrieval score variant |
| Format regression (JSON → prose) | No | Yes (if instruction removed) | Yes — format score | No |
| Response truncation / length collapse | Token count alert fires | No | Yes — coherence/relevance drop | No |
How a request becomes a score, becomes a baseline, becomes an alert that lands on the right person.
Deterministic checks cost nothing per request and catch format failures before the judge ever runs.
Before you spend inference dollars on an LLM judge, run deterministic checks on everything. These execute in microseconds, cover 100% of requests, and catch the class of failures that are easiest to produce and hardest to notice: structural regressions.
Format validation is the highest-ROI heuristic. If your application expects JSON, validate the schema on every response. Track the JSON-parse failure rate as a metric. A spike from 0.1% to 3% failure rate is an earlier signal than any quality score — and it doesn't cost you an LLM call to detect. Similar logic applies to response length: if your summarization workflow should produce 100–300 words, a response histogram that collapses to 20-word outputs is detectable without a judge.
Other high-value heuristics: scan for refusal patterns (I'm sorry, I can't, As an AI) to track safety calibration drift from model updates; check for unexpected language or character sets in international deployments; validate that expected structure markers appear (section headers, numbered lists, code fences) when the task requires them.
The pass/fail rate of each heuristic check is itself a monitoring signal. Logging pass/fail rates by intent type over time gives you the earliest warning of model drift or provider-side API changes — visible within hours, not days.
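The checks described above can be sketched with the standard library alone. This is an illustrative shape rather than a complete validator: the function name, the required-key check standing in for full schema validation, and the exact refusal phrases are assumptions.

```python
import json
import re

# Phrases taken from the examples above; a real list would be longer.
REFUSAL_PATTERN = re.compile(r"\b(?:I'm sorry|I can't|As an AI)\b", re.IGNORECASE)


def heuristic_checks(response, required_keys=(), min_words=0, max_words=None):
    """Run deterministic checks on one response and return pass/fail per check."""
    try:
        parsed = json.loads(response)
        # Simplified stand-in for schema validation: a JSON object with required keys.
        json_valid = isinstance(parsed, dict) and all(key in parsed for key in required_keys)
    except json.JSONDecodeError:
        json_valid = False

    word_count = len(response.split())
    length_ok = word_count >= min_words and (max_words is None or word_count <= max_words)
    no_refusal = REFUSAL_PATTERN.search(response) is None

    return {"json_valid": json_valid, "length_ok": length_ok, "no_refusal": no_refusal}
```

Counting the `True` and `False` values per check and per intent over time gives the pass/fail rate series described above.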
Alerts that fire on noise stop firing entirely. Each threshold below names a specific failure class and the first move that catches the cause.
| Signal | Threshold | Window | First Action |
|---|---|---|---|
| Quality score drop | Any dimension falls 5%+ below baseline | 7-day rolling | Pull traces from degradation period; check prompt hash timeline |
| Prompt hash change | Any unexpected hash change outside a deployment window | Per-request | Verify change was intentional; correlate with next 48h quality trend |
| Retrieval relevance drop (RAG) | Retrieval score falls 10%+ below baseline | 3-day rolling | Check knowledge base freshness; inspect retrieval pipeline config |
| Hallucination / refusal rate spike | Flagged rate exceeds 2× historical baseline | Daily | Immediate escalation; roll back most recent prompt or model changes |
| Format validation failure rate | Rises above 2× baseline (e.g., 0.1% → 0.2%) | 1-hour rolling | Check for provider-side model update; inspect template for regressions |
| Judge kappa drop (monthly calibration) | Kappa falls below 0.6 on gold set | Monthly re-run | Suspend scoring trust; re-rate gold set and update judge rubric |
| Cost-per-quality ratio | Spend per quality point rises 30%+ week-over-week | Weekly | Review model selection and prompt efficiency; check for context bloat |
The counterintuitive prerequisite to quality monitoring
The most common mistake is starting with quality evaluators before having cost tracking in place. This seems backwards until you realize: your sampling rate for LLM-as-judge is directly constrained by how much that evaluation costs. Without knowing your baseline inference cost per request type, you have no principled basis for choosing a sampling rate. You'll either undersample (missing real degradation) or oversample (running quality evals that cost more than the system they're monitoring).
Cost tracking also delivers fast value on its own. Adding trace IDs and cost-per-span takes a few hours. It immediately shows which request types are expensive and whether any single workflow is responsible for disproportionate spend. That data shapes every downstream sampling decision.
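The link between cost and sample rate is simple arithmetic, sketched below. All names are hypothetical and the prices are placeholder parameters, not real provider pricing; substitute the per-token prices and traffic numbers from your own traces.

```python
def call_cost(input_tokens, output_tokens, input_price_per_mtok, output_price_per_mtok):
    """Cost of one LLM call, given prices per million tokens."""
    return (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000


def max_sample_rate(daily_eval_budget, daily_requests, cost_per_eval):
    """Highest sample rate the evaluation budget can sustain, capped at 100%."""
    if daily_requests <= 0 or cost_per_eval <= 0:
        raise ValueError("daily_requests and cost_per_eval must be positive")
    return min(1.0, daily_eval_budget / (daily_requests * cost_per_eval))


# Placeholder numbers for illustration only.
eval_cost = call_cost(1500, 200, input_price_per_mtok=1.0, output_price_per_mtok=5.0)
rate = max_sample_rate(daily_eval_budget=10.0, daily_requests=100_000, cost_per_eval=eval_cost)
```

The value returned by `call_cost` is also what would be attached to each span as the cost attribute.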
There's also the uncomfortable truth about observability comprehensiveness: teams that try to build the full quality monitoring stack in the first sprint typically ship nothing usable in the first month. Phased delivery — something working at each stage — consistently outperforms a complete design that's half-implemented.
Attach a unique trace ID to every request and a cost attribute to every LLM call span. Immediately surfaces expensive outliers and establishes the budget for subsequent sampling decisions. This is the foundation everything else depends on.
Add gen_ai.prompt.hash and gen_ai.prompt.version attributes to every LLM call span. Zero inference cost, immediate audit trail. Catches the most common cause of silent quality regressions, unannounced prompt changes, before they compound.
Run lightweight rule-based checks on 100% of traffic: format validation, response length bounds, regex patterns for expected structure. Track pass/fail rates as OTel metrics. Detects structural failures cheaply and calibrates what 'normal' looks like before investing in semantic evaluation.
Before any LLM-as-judge goes to production, manually rate 50–200 production samples. Compute Cohen's kappa between your judge and human raters. Don't proceed until kappa ≥ 0.6 on balanced criteria. This is not optional.
Start at 1–2% sample rate with a validated judge. Establish a 2-week rolling baseline before configuring drift alerts. Scale to 5–10% once you trust the scores and know the cost profile. Schedule monthly kappa recalibration against the gold set.
The checklist before you claim production quality observability
LLM-as-judge is not always the right tool. Know the cases where it wastes money and misleads.
Semantic judges add value when human judgment is genuinely required to assess quality. These are the tasks where heuristics can't cover the full quality surface.
Run the SQL. Parse the JSON. Execute the code. A judge's opinion on whether the SQL is correct is slower, more expensive, and less reliable than just executing it and checking the result.
An uncalibrated judge with a Cohen's kappa of 0.3 is not a measurement instrument. It's a random number generator with a clean dashboard.
The cost of missing a safety failure is not comparable to the cost of an extra LLM call. Sample rate is a cost optimization for the happy path, not for known-bad or flagged requests.
Route those requests through a self-hosted or on-premise evaluation model, or restrict to heuristic-only checks. Compliance constraints override eval coverage goals.
How is LLM quality monitoring different from traditional APM?
Traditional APM measures whether code executed correctly — a function ran, the database responded, the request completed within latency bounds. LLM quality monitoring measures whether outputs were useful — whether the response answered the question, matched the expected format, and was factually coherent. These are orthogonal concerns. A request that completes in 200ms with a 200 OK can still produce an output that's wrong, incoherent, or misformatted. Infrastructure metrics alone cannot capture that distinction.
Can I use OpenTelemetry for quality signals?
Yes, with a clear understanding of what the conventions cover. The OTel GenAI semantic conventions standardize gen_ai.system, gen_ai.request.model, token counts, finish reasons, and latency histograms. As of early 2026, most GenAI attributes are in experimental status. For quality signals, you extend with custom span attributes: quality.score.format, quality.score.relevance, gen_ai.prompt.hash, gen_ai.prompt.version. Collectors pass custom attributes through without modification, so no pipeline changes are needed beyond adding the attributes at instrumentation time.
How do I handle non-determinism when scoring quality?
Don't try to get consistent scores on individual requests — LLM non-determinism means individual scores will vary even for identical inputs. Track score distributions over time and use statistical methods to detect when the distribution shifts. A score of 4.1 today versus 4.0 yesterday is noise. A distribution that was centered at 4.2 last week and is now centered at 3.6 is a meaningful signal. Alert on distribution shift, not point-in-time variance.
At what traffic volume does quality monitoring make sense?
Start immediately, even at low traffic volumes. The value is baseline establishment, not absolute numbers. Two weeks of quality scores at 10 requests per day gives you a baseline to alert against when traffic scales or a prompt change causes degradation. Two weeks of no data gives you nothing to compare to when you actually need it. The cost at low traffic is negligible; the cost of missing a regression while operating blind is not.
Should I store prompt and response content in spans?
PII and data residency requirements usually prevent storing full content in telemetry. Store the hash of the prompt template (not the rendered prompt with user data), response length, a category label, and quality scores. For quality evaluation, pass content through your evaluation pipeline separately — but don't attach raw prompt/response content to spans. Structure your evaluation pipeline so it reads from a secure, access-controlled store, not from your distributed trace backend.
What's the right judge model to use for production evaluation?
Use a smaller, faster model than your production model — Claude Haiku or GPT-4o-mini at current pricing. The judge doesn't need to be smarter than the system being evaluated; it needs to be consistent. A smaller model with a well-designed rubric consistently outperforms a larger model with an underspecified prompt. Cost matters: at 7% sample rate with a Haiku-class judge, eval cost is roughly 1–3% of production inference cost. Verify this estimate with your actual token counts before committing to a sample rate.
How do I detect model provider updates affecting quality?
Prompt hashing won't catch this — the template didn't change. A continuous quality baseline catches it: if scores shift without a corresponding hash change, the most likely cause is a provider-side model update. Track your quality baseline as a continuous series. Any discontinuity that isn't correlated with a deployment or prompt hash change is a signal to check provider changelogs and consider pinning to a specific model version if the API supports it.
The teams that catch quality degradation before it escalates to executive complaints share one characteristic: they treated output quality as an engineering concern from the start, not a customer support problem to triage after the fact.
The infrastructure for this isn't complicated. It's a hash function, a sampling decision, a judge model, and the discipline to build baselines before you need them. What's hard is the organizational habit of treating semantic quality as a first-class signal alongside latency and error rate — not as a soft metric that belongs on a different team's dashboard.
Your traces are already recording that requests completed. The question is whether you're recording whether they worked. That gap — between completion and usefulness — is where production LLM quality lives and dies.