Three weeks into production, every dashboard was green. Latency median at 380ms, error rate at 0.3%, token spend exactly on budget. The engineering team had signed off on the SLO review and moved to the next sprint. User satisfaction scores had been dropping every single day for two weeks.
It took an escalation from a VP of Product before anyone traced the degradation back to a prompt regression. A developer had cleaned up the system prompt eleven days earlier, removing nine words that told the model to format responses as structured JSON. Latency didn't change. Error rate didn't budge. Token cost was unaffected. The only signal was user feedback that nobody had connected to the engineering timeline.
This is the quality gap in LLM observability: the signals most teams instrument first — latency, error rate, token cost — can all sit healthy while output quality collapses. Infrastructure metrics measure whether your system functioned. They say nothing about whether it worked.
Key Takeaways
- ✓ Latency, error rate, and token cost can all look healthy while LLM output quality silently degrades — these metrics measure completion, not usefulness.
- ✓ Sampled production evaluation (5–10% of traffic via LLM-as-judge) is the core technique for catching quality drift before users escalate it.
- ✓ Prompt hash drift detection catches the most common cause of silent regressions: unannounced prompt template changes that alter model behavior.
- ✓ Cost tracking must come before quality sampling — your sampling rate is directly constrained by how much evaluation costs per request type.
- ✓ Alert on distribution shift over rolling windows, not point-in-time score variance. A 5% drop over 7 days is actionable; individual score fluctuation is noise.
The Quality Gap in LLM Observability
Why infrastructure metrics are necessary but not sufficient
Traditional distributed tracing assumes determinism. If a function returns 200, it succeeded. You can write an SLO against it, alert on deviations, build dashboards. LLMs break this assumption completely.
A response can be syntactically valid, arrive in under 500ms, return a 200 OK, consume exactly the expected tokens, and still be factually wrong, poorly formatted, or entirely off-topic for the user's intent. The span looks clean. The service looks healthy. The user experience is quietly deteriorating.
The gap between 'request completed' and 'request was useful' is where most LLM quality failures hide. Teams commonly discover this the hard way: their monitoring alerts on model API outages but goes silent during slow, multi-week quality degradation events that users notice long before engineers do.[3]
For RAG systems, this compounds. A retrieval step that returns documents with slightly lower relevance scores won't trigger any infrastructure alert. The model does its best with worse context. Outputs degrade subtly. Tracking retrieval and generation quality separately is the only way to diagnose which layer is failing — aggregating them into a single health metric hides the root cause.
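One way to make that split concrete is to score each layer against its own baseline and let the larger drop name the failing layer. A minimal sketch, where the `RagSample` shape, the baselines, and the 10% tolerance are illustrative assumptions, not prescriptions from any source:

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RagSample:
    retrieval_relevance: float  # judge score for the retrieved context, 1-5
    generation_quality: float   # judge score for the final answer, 1-5


def diagnose(samples: list[RagSample], baseline_retrieval: float,
             baseline_generation: float, tolerance: float = 0.10) -> str:
    """Name the layer whose score fell furthest below its own baseline."""
    retrieval_drop = 1 - mean(s.retrieval_relevance for s in samples) / baseline_retrieval
    generation_drop = 1 - mean(s.generation_quality for s in samples) / baseline_generation
    if retrieval_drop > tolerance and retrieval_drop >= generation_drop:
        return "retrieval"   # fix the index or embeddings before touching prompts
    if generation_drop > tolerance:
        return "generation"  # a prompt or model change is the likelier culprit
    return "healthy"
```

The point of the separation: "retrieval" sends you to the knowledge base and embedding pipeline, "generation" to the prompt hash timeline, without guessing.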
What most teams instrument first:
- p50/p95/p99 latency SLOs
- HTTP error rate
- Token spend alerts
- Model API availability
- Rate limit proximity

What actually catches quality failures:
- Sampled LLM-as-judge quality scores
- Prompt hash drift detection
- Per-intent quality trend tracking
- Retrieval vs. generation split (RAG)
- Statistical drift alerts on rolling baselines
Four Quality Signals Worth Building
Ordered by implementation complexity and return on investment
Sampled Production Evaluation
Running quality judges on live traffic without burning your inference budget
The core technique: after a request completes, route a sample of it through an evaluator that scores the response against criteria you define — format compliance, relevance, factual coherence. Aggregate those scores over time and alert when trends break.
The sampling rate is a cost decision. Adaline's 2026 observability guide recommends 5–10% of production traffic for LLM-as-judge evaluation — enough for statistical confidence without doubling inference costs.[1] Start lower, at 1–2%, until you've validated that your judge model produces scores consistent with human judgment on the same samples. Scaling to 5–10% before you trust the scores means collecting misleading data at scale.
Two rules regardless of sampling rate: evaluate 100% of requests that triggered an error or a user complaint, and stratify by intent type so each request category gets proportional coverage. Sampling only from the most recent requests will give you recency bias — you'll miss gradual degradation that accumulates across intent categories unevenly.
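That sampling policy fits in a single decision function. A sketch, where the per-intent rate mapping and the default rate are illustrative assumptions:

```python
import random


def should_evaluate(intent: str, had_error: bool, user_flagged: bool,
                    rates: dict[str, float], default_rate: float = 0.05) -> bool:
    """Stratified sampling with forced evaluation of problem requests.

    Errors and user complaints are always evaluated; everything else is
    sampled at a per-intent rate so no category is starved of coverage.
    """
    if had_error or user_flagged:
        return True  # rule one: 100% coverage for errors and complaints
    return random.random() < rates.get(intent, default_rate)
```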
The metric that matters is trend, not absolute score. A quality dimension sitting at 3.8/5.0 today is meaningless without history. That same dimension dropping from 4.2 to 3.5 over two weeks is a meaningful signal. Teams running continuous quality monitoring report that a 5% drop in any single dimension over a 7-day rolling window is the threshold that surfaces real degradation before users escalate.[2]
One honest caveat: non-determinism means you can't expect consistent scores on individual requests. Don't alert on single-request variance. Alert on distribution shift — when the population of scores changes shape, not when one score fluctuates.
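Distribution-shift alerting can be as simple as comparing rolling-window means instead of individual scores. A minimal sketch using the 5% threshold discussed above; comparing full distributions (e.g. with a statistical test) is a natural next step:

```python
from statistics import mean


def drift_alert(scores_7d: list[float], baseline_7d: list[float],
                threshold: float = 0.05) -> bool:
    """Fire when the current window's mean falls >5% below the baseline mean.

    Compares populations, not single requests: one noisy score cannot
    trip the alert, but a shifted distribution will.
    """
    if not scores_7d or not baseline_7d:
        return False  # no baseline yet, nothing to compare against
    drop = (mean(baseline_7d) - mean(scores_7d)) / mean(baseline_7d)
    return drop > threshold
```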
```python
# quality_evaluator.py
import json
import random

from anthropic import Anthropic
from opentelemetry import trace

tracer = trace.get_tracer("quality.evaluator")
client = Anthropic()

SAMPLE_RATE = 0.07  # 7% of production traffic


def sample_and_evaluate(
    trace_id: str, query: str, response: str
) -> dict | None:
    if random.random() > SAMPLE_RATE:
        return None  # not sampled
    with tracer.start_as_current_span("quality.judge") as span:
        span.set_attribute("quality.trace_ref", trace_id)
        result = client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=64,
            messages=[{"role": "user", "content": judge_prompt(query, response)}],
        )
        scores = json.loads(result.content[0].text)
        for dim, score in scores.items():
            span.set_attribute(f"quality.score.{dim}", score)
        return scores


def judge_prompt(query: str, response: str) -> str:
    return (
        "Rate this response: format (1-5), relevance (1-5), coherence (1-5).\n"
        f"Query: {query}\nResponse: {response}\n"
        "Reply JSON only."
    )
```

Prompt Hash Drift: Catching Invisible Changes
How a nine-word deletion goes undetected without fingerprinting
Prompt drift is invisible by design. Nobody announces it. A developer refactors a helper function, a template variable changes scope, a conditional branch that used to include an instruction now skips it. The effective prompt the model receives changes — and unless you're hashing your templates, you have no record it happened.
The fix requires one function and one span attribute. Hash your prompt template before rendering it with user data, and attach that hash to every LLM call span. When the hash changes, you know the template changed — regardless of whether any deployment log mentions it.
Correlate hash changes with quality score trends in the following 24–48 hours. A hash change accompanied by a quality score drop is almost always the root cause. A hash change with no quality impact is usually safe to let stand. The combination tells you what a deployment log alone cannot: whether the change actually mattered to output quality.
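The before/after correlation can be sketched as a window comparison. The 48-hour window and 5% drop follow the thresholds above; the `(timestamp, score)` input shape is an assumption about how you store sampled scores:

```python
from statistics import mean


def hash_change_caused_drop(change_ts: float,
                            scored: list[tuple[float, float]],
                            window_s: float = 48 * 3600,
                            drop_threshold: float = 0.05) -> bool:
    """True if mean quality in the window after a prompt hash change
    fell more than drop_threshold below the window before it."""
    before = [s for ts, s in scored if change_ts - window_s <= ts < change_ts]
    after = [s for ts, s in scored if change_ts <= ts < change_ts + window_s]
    if not before or not after:
        return False  # not enough data on one side to attribute anything
    return (mean(before) - mean(after)) / mean(before) > drop_threshold
```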
One thing to note: this only catches changes to the template itself, not changes to the data interpolated into it. For cases where changing context (like user history, retrieved documents, or business rules) is the source of quality drift, you need embedding-based drift detection on the rendered input distribution.[6] Prompt hashing is a fast, zero-cost signal for the most common cause of silent regressions — it's not a complete solution.
```python
# prompt_tracing.py
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("llm.client")


def prompt_hash(template: str) -> str:
    return hashlib.sha256(template.encode()).hexdigest()[:12]


def call_llm_with_tracing(
    template: str, template_version: str, rendered: str
) -> str:
    with tracer.start_as_current_span("gen_ai.request") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.prompt.hash", prompt_hash(template))
        span.set_attribute("gen_ai.prompt.version", template_version)
        # Make the actual LLM call with the rendered prompt
        return _do_llm_call(rendered)
```

The Quality Signal Pipeline
How traces flow from production traffic to actionable alerts
Alert Thresholds That Actually Matter
What to fire on, what window to use, and what to do first
| Signal | Threshold | Window | First Action |
|---|---|---|---|
| Quality score drop | Any dimension falls 5%+ below baseline | 7-day rolling | Pull traces from degradation period; check prompt hash timeline |
| Prompt hash change | Any unexpected hash change outside a deployment | Per-request | Verify change was intentional; correlate with next 48h quality trend |
| Retrieval relevance drop (RAG) | Retrieval score falls 10%+ below baseline | 3-day rolling | Check knowledge base freshness; inspect retrieval pipeline configuration |
| Hallucination rate spike | Flagged rate exceeds 2× historical baseline | Daily | Immediate escalation; roll back most recent prompt or model changes |
| Cost-per-quality ratio | Spend per quality point rises 30%+ week-over-week | Weekly | Review model selection and prompt efficiency; check for context bloat |
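As one example, the hallucination row of the table reduces to a one-line daily check. The flagged-count inputs are an assumed shape for however your judge or heuristics flag suspect responses:

```python
def hallucination_spike(flagged_today: int, total_today: int,
                        baseline_rate: float) -> bool:
    """Daily check: fire when the flagged rate exceeds 2x the
    historical baseline rate."""
    if total_today == 0 or baseline_rate <= 0:
        return False  # no traffic or no baseline yet
    return (flagged_today / total_today) > 2 * baseline_rate
```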
Build Order: Why Cost Tracking Comes First
The counterintuitive prerequisite to quality monitoring
The most common mistake is starting with quality evaluators before having cost tracking in place. This seems backwards until you realize: your sampling rate for LLM-as-judge is directly constrained by how much that evaluation costs. Without knowing your baseline inference cost per request type, you have no principled basis for choosing a sampling rate. You'll either undersample (missing real degradation) or oversample (running quality evals that cost more than the system they're monitoring).
Cost tracking also delivers fast value on its own. Adding trace IDs and cost-per-span takes a few hours. It immediately shows which request types are expensive and whether any single workflow is responsible for disproportionate spend. That data shapes every downstream sampling decision.
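A cost-per-span helper can be as small as a pricing lookup plus arithmetic. The per-million-token prices here are placeholders, not real rates; substitute your provider's current pricing:

```python
# Placeholder prices in USD per million tokens -- NOT real rates.
PRICE_PER_MTOK = {
    "claude-haiku-4-5-20251001": {"input": 1.00, "output": 5.00},
}


def span_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one LLM call, to attach as a span attribute."""
    p = PRICE_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000
```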
There's also the uncomfortable truth about observability comprehensiveness: teams that try to build the full quality monitoring stack in the first sprint typically ship nothing usable in the first month. Phased delivery — something working at each stage — consistently outperforms a complete design that's half-implemented.
1. Trace IDs + Cost Tracking (Weeks 1–2): Attach a unique trace ID to every request and a cost attribute to every LLM call span. Immediately surfaces expensive outliers and establishes the budget for subsequent sampling decisions. This is the foundation everything else depends on.
2. Prompt Hash Instrumentation (Weeks 2–3): Add gen_ai.prompt.hash and gen_ai.prompt.version attributes to every LLM call span. Zero inference cost, immediate audit trail. Catches the most common cause of silent quality regressions — unannounced prompt changes — before they compound.
3. Heuristic Quality Checks (Weeks 3–5): Run lightweight rule-based checks on 100% of traffic: format validation, response length bounds, regex patterns for expected structure. Detects structural failures cheaply and helps calibrate what 'normal' looks like before investing in semantic evaluation.
4. Sampled LLM-as-Judge (Weeks 5–8): Start at a 1–2% sample rate. Validate scores against 50–100 human-rated samples before scaling. Establish a 2-week rolling baseline before configuring drift alerts. Scale to 5–10% once you trust the scores and know the cost profile.
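The heuristic checks in step 3 might look like the sketch below; the JSON-object contract and the length bounds are illustrative assumptions about your response format:

```python
import json
import re


def heuristic_check(response: str,
                    min_len: int = 20, max_len: int = 4000) -> list[str]:
    """Cheap rule-based checks safe to run on 100% of traffic.

    Returns the names of failed checks; an empty list means pass.
    """
    failures = []
    if not (min_len <= len(response) <= max_len):
        failures.append("length_bounds")
    # Assumed contract for this example: responses must be a JSON object.
    if not re.match(r"\s*\{", response):
        failures.append("json_object_expected")
    else:
        try:
            json.loads(response)
        except ValueError:
            failures.append("json_parse")
    return failures
```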
Minimum Viable LLM Observability Checklist
The checklist before you claim production quality observability
- Every request has a unique trace ID propagated across all service boundaries
- Every LLM call span records model, token counts, and estimated cost
- Every LLM call span includes gen_ai.prompt.hash and gen_ai.prompt.version
- Heuristic checks run on 100% of production traffic (format, length, keyword patterns)
- LLM-as-judge running on at least 1% of production traffic with validated scores
- 2 weeks of quality score history collected before any drift alerts are configured
- Alerts configured on a 7-day rolling window with a 5% drop threshold per quality dimension
- Retrieval quality tracked separately from generation quality in RAG pipelines
- Escalated and flagged requests evaluated at 100% regardless of sample rate
- Prompt hash change + quality correlation runbook documented and accessible
How is LLM quality monitoring different from traditional APM?
Traditional APM measures whether code executed correctly — a function ran, the database responded, the request completed within latency bounds. LLM quality monitoring measures whether outputs were useful — whether the response answered the question, matched the expected format, and was factually coherent. These are orthogonal concerns. A request that completes in 200ms with a 200 OK can still produce an output that's wrong, incoherent, or misformatted. Infrastructure metrics alone cannot capture that distinction.
Can I use OpenTelemetry for quality signals?
Yes. The GenAI semantic conventions cover model, token counts, and latency out of the box as of 2025. For quality signals, you extend with custom span attributes: quality.score.format, quality.score.relevance, gen_ai.prompt.hash, gen_ai.prompt.version. OpenTelemetry collectors pass custom attributes through without modification — no changes to your pipeline, backend, or dashboard configuration are needed beyond adding the attributes at instrumentation time.
How do I handle non-determinism when scoring quality?
Don't try to get consistent scores on individual requests — LLM non-determinism means individual scores will vary even for identical inputs. Track score distributions over time instead, and use statistical methods to detect when the distribution shifts. A score of 4.1 today versus 4.0 yesterday is noise. A distribution that was centered at 4.2 last week and is now centered at 3.6 is a meaningful signal. Alert on distribution shift, not point-in-time variance.
At what traffic volume does quality monitoring make sense?
Start immediately, even at low traffic volumes. The value is baseline establishment, not absolute numbers. Two weeks of quality scores at 10 requests per day gives you a baseline to alert against when traffic scales or a prompt change causes degradation. Two weeks of no data gives you nothing to compare to when you actually need it. The cost at low traffic is negligible; the cost of missing a regression while operating blind is not.
Should I store prompt and response content in spans?
PII and data residency requirements usually prevent storing full content in telemetry. Store the hash of the prompt template (not the rendered prompt with user data), response length, a category label, and quality scores. For quality evaluation, pass content through your evaluation pipeline separately — but don't attach raw prompt/response content to spans. Structure your evaluation pipeline so it reads from a secure, access-controlled store, not from your distributed trace backend.
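That storage discipline can be encoded in a single helper that builds span attributes from content without ever attaching the content itself. A sketch; the attribute names follow the gen_ai.* style used earlier, and the category field is an assumption:

```python
import hashlib


def safe_span_attrs(template: str, response: str, category: str,
                    scores: dict[str, float]) -> dict:
    """Telemetry-safe span attributes: hashes, lengths, and scores only.

    Raw prompt and response content never enters the trace backend.
    """
    attrs = {
        "gen_ai.prompt.hash": hashlib.sha256(template.encode()).hexdigest()[:12],
        "gen_ai.response.length": len(response),
        "gen_ai.request.category": category,
    }
    attrs.update({f"quality.score.{k}": v for k, v in scores.items()})
    return attrs
```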
The teams that catch quality degradation before it escalates to executive complaints share one characteristic: they treated output quality as an engineering concern from the start, not a customer support problem to triage after the fact.
The infrastructure for this isn't complicated. It's a hash function, a sampling decision, a judge model, and the discipline to build baselines before you need them. What's hard is the organizational habit of treating semantic quality as a first-class signal alongside latency and error rate — not as a soft metric that belongs on a different team's dashboard.
Your traces are already recording that requests completed. Start recording whether they worked.
- [1] The Complete Guide to LLM Observability & Monitoring in 2026 — Adaline, Feb 2026 (adaline.ai)
- [2] AI Production Monitoring: Quality Drift, Hallucinations, Costs — Particula Tech, Feb 2026 (particula.tech)
- [3] Production Monitoring Alerts for LLM Quality Drops (Braintrust 24-hour blindspot case) — Technivorz, Mar 2026 (technivorz.com)
- [4] AI Agent Observability: Evolving Standards and Best Practices — OpenTelemetry, 2025 (opentelemetry.io)
- [5] Quality Monitoring: Drift Detection and Regression Alerts for LLMs — Brenndoerfer, Feb 2026 (mbrenndoerfer.com)
- [6] Detecting Drift in Production Generative AI Applications — AWS Prescriptive Guidance (docs.aws.amazon.com)