Three weeks into production. Median latency 380ms. Error rate 0.3%. Token spend on budget. SLO review signed off. The team moved on.
User satisfaction had been dropping every day for two weeks. It took a VP escalating before anyone correlated the curve with a prompt cleanup eleven days earlier — nine words removed from the system prompt that had told the model to format responses as structured JSON. Latency did not move. Errors did not move. Cost did not move. The only signal was user complaints, and nobody had wired user complaints into the engineering timeline.
This is the gap. The signals teams instrument first — latency, error rate, token cost — all sit healthy while semantic quality collapses underneath them. Infrastructure metrics tell you the request completed. They have no opinion on whether it worked. Logging that records 'request returned 200' is not observability for an LLM. It is an alibi.
What This Article Covers
- ✓ Latency, error rate, and token cost measure completion, not usefulness. Three healthy dashboards can sit next to a collapsing product experience for weeks.
- ✓ Sampled LLM-as-judge on 5–10% of live traffic is the cheapest mechanism that catches quality drift before users escalate it.
- ✓ Prompt hash drift catches the most common silent regression: a template change that nobody announced and nothing logged.
- ✓ OpenTelemetry GenAI semantic conventions cover infra signals natively; quality signals require custom span attributes you add yourself.
- ✓ Cost tracking comes before sampling. Your eval sample rate is bounded by what each evaluation costs. Without that number you're guessing.
- ✓ Alert on distribution shift across a rolling window. A 5% drop over 7 days is a signal. A single score moving is noise.
- ✓ Judge calibration is not optional. Cohen's kappa below 0.6 means you're scaling confident-looking noise, not measurement.
Distributed Tracing Assumed Determinism. LLMs Broke That Assumption.
The fight: a tracing model designed for deterministic services applied to a non-deterministic one. The infrastructure layer cannot see semantic failure.
Distributed tracing was built on a contract: a function returns 200, the request succeeded. SLOs hold. Alerts fire. Dashboards mean something. LLMs break that contract.
A response can be syntactically valid, arrive under 500ms, return 200, consume the budgeted tokens, and still be wrong, malformed, or off-intent. The span is clean. The service is healthy. The user experience is degrading. Nothing in your tracing stack was built to notice the difference.
The gap between 'request completed' and 'request was useful' is where every LLM quality failure hides. Teams find out the same way every time — monitoring fires on model API outages and stays silent through multi-week quality regressions that users notice long before engineers do.[3] By the time someone escalates, the curve has been bending for two weeks.
The failure taxonomy is wider than most teams expect. Research on production LLM failure modes identifies at least five distinct classes that infrastructure metrics are blind to: output format deviation, semantic drift, factual incoherence, context handling failures (long context truncation, retrieval miss), and safety degradation (refusal rate changes from provider-side model updates).[10] Each class requires a different signal to detect.
RAG compounds it. Retrieval scores that drift a few percent will not trigger any infrastructure alert. The model does its best with worse context. Outputs degrade subtly. Aggregating retrieval and generation into one health number is how the root cause stays hidden — track them as separate signals or don't bother tracking them at all.
What most teams instrument first:
- p50/p95/p99 latency SLOs
- HTTP error rate
- Token spend alerts
- Model API availability
- Rate limit proximity
What catches semantic failure:
- Sampled LLM-as-judge scores against rolling baselines
- Prompt hash drift attached to every span
- Per-intent quality tracked as independent series
- Retrieval and generation scored separately, never aggregated
- Distribution-shift alerts, not point-variance noise
What OpenTelemetry Actually Covers — and Where You're On Your Own
The GenAI semantic conventions are a solid infra foundation. Quality signals are a separate layer you build on top.
The OpenTelemetry GenAI semantic conventions standardize how LLM calls are recorded across providers.[7] They're the right foundation. But understanding exactly what they cover — and what they don't — determines where you have to build.
What the conventions give you out of the box:
- gen_ai.system: the provider (e.g., anthropic, openai, aws.bedrock)
- gen_ai.request.model: the exact model name requested
- gen_ai.usage.input_tokens / gen_ai.usage.output_tokens: token counts
- gen_ai.response.finish_reasons: why generation stopped (stop, length, tool_calls)
- gen_ai.client.operation.duration: a histogram of LLM call latencies[8]
As of early 2026, most GenAI semantic conventions are in experimental status — the API isn't fully stabilized, and major vendors like Datadog and Grafana are beginning native support but with version-specific caveats. Pin your instrumentation library version and test collector compatibility before assuming pass-through.
What the conventions do not cover:
- Output evaluation or quality scoring
- Safety and hallucination detection
- Prompt template identity or versioning
- Intent-level quality tracking
Those are yours to instrument. The convention doesn't stop you — OpenTelemetry collectors pass custom span attributes through without modification. You add gen_ai.prompt.hash, gen_ai.prompt.version, quality.score.format, quality.score.relevance as custom attributes on the same span. They flow through the same pipeline your infra metrics use. No separate pipeline, no separate backend.
otel_llm_span.py

    # Full span: OTel GenAI conventions + custom quality attributes on the same span.
    # Collectors pass custom attributes through without modification.
    import hashlib
    import time

    from opentelemetry import trace
    from opentelemetry.semconv._incubating.attributes import gen_ai_attributes as GenAI

    tracer = trace.get_tracer("llm.client", "1.0.0")


    def call_with_full_instrumentation(
        template: str,
        rendered: str,
        model: str = "claude-sonnet-4-5",
    ) -> str:
        # Hash the template, not the rendered prompt, so user data never enters the fingerprint.
        template_hash = hashlib.sha256(template.encode()).hexdigest()[:12]
        with tracer.start_as_current_span("gen_ai.client.chat") as span:
            # --- OTel GenAI conventions (standard) ---
            span.set_attribute(GenAI.GEN_AI_SYSTEM, "anthropic")
            span.set_attribute(GenAI.GEN_AI_REQUEST_MODEL, model)
            t0 = time.perf_counter()
            # _do_llm_call is your provider wrapper (not shown); here it returns (text, usage).
            response, usage = _do_llm_call(rendered, model)
            elapsed = time.perf_counter() - t0
            span.set_attribute(GenAI.GEN_AI_USAGE_INPUT_TOKENS, usage.input_tokens)
            span.set_attribute(GenAI.GEN_AI_USAGE_OUTPUT_TOKENS, usage.output_tokens)
            # The conventions define this name as a histogram metric; it is copied onto
            # the span here (in seconds) so latency sits next to the quality attributes.
            span.set_attribute("gen_ai.client.operation.duration", elapsed)
            # --- Custom quality attributes (not in conventions; add yourself) ---
            span.set_attribute("gen_ai.prompt.hash", template_hash)
            span.set_attribute("gen_ai.prompt.version", "v2.4.1")
            span.set_attribute("gen_ai.request.intent", "summarization")
            # quality.score.* populated by async evaluator after sampling gate
        return response

Four Signals That Catch What Latency Cannot
Each one targets a specific failure class infrastructure metrics ignore. Ordered by build cost and leverage.
Sampled Evals: A Judge on Your Live Traffic
The fight: signal coverage versus inference cost. Sample too low, miss the regression. Sample too high, double the bill.
The mechanism is simple. After a request completes, route a sample of it through an evaluator that scores the response against the criteria you actually care about — format compliance, relevance to intent, factual coherence. Aggregate the scores. Alert when the trend breaks.
Sampling rate is a cost decision before it is a statistics decision. Adaline's 2026 guide puts the range at 5–10% of production traffic, enough for confidence intervals without doubling inference spend.[1] Don't start there. Start at 1–2% and validate that the judge agrees with human raters on the same samples. Scaling to 5–10% before you trust the judge means collecting misleading data faster.
Two rules hold regardless of sampling rate. Evaluate 100% of requests that errored or got escalated. Stratify by intent so each category gets proportional coverage. Sampling only from the most recent traffic gives you recency bias and hides slow degradations that compound across intent categories unevenly.
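The two rules above can be sketched as a small sampling gate. This is a minimal illustration under assumptions, not a reference implementation: `should_evaluate`, `BASE_RATE`, and the per-intent overrides in `INTENT_RATES` are hypothetical names and values. It hashes the trace ID instead of calling a random number generator, so the same request gets the same decision on retries and across services.

```python
import hashlib

BASE_RATE = 0.02  # assumed starting rate, in the 1-2% range the text recommends
# Hypothetical per-intent overrides: a rare intent can be given a higher rate
# so it still collects enough samples to track as its own series.
INTENT_RATES = {"summarization": 0.02, "qa": 0.02, "extraction": 0.05}


def should_evaluate(
    trace_id: str, intent: str, is_error: bool = False, escalated: bool = False
) -> bool:
    """Return True when a completed request should be sent to the judge."""
    if is_error or escalated:
        return True  # known failures bypass sampling entirely
    rate = INTENT_RATES.get(intent, BASE_RATE)
    # Map the trace ID to a stable number in [0, 1) and compare it to the rate.
    digest = hashlib.sha256(f"{intent}:{trace_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Applying the rate per intent, not to the whole stream, is what keeps coverage proportional inside each category even when one intent dominates traffic.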
The metric that matters is trend, not absolute score. 3.8/5.0 today is meaningless. 4.2 → 3.5 over two weeks is a structural signal. Published monitoring guidance puts the alert threshold at a 5% drop in any single dimension over a 7-day rolling window, the level that surfaces real degradation before users escalate.[2]
One honest constraint: non-determinism means individual scores will move even on identical inputs. Don't alert on single-request variance. Alert on distribution shift — when the population of scores changes shape, not when one number fluctuates.
- 5–10% of traffic: confidence intervals you can trust without doubling inference cost (Adaline, 2026)
- 5% drop: 7-day rolling drop in any single quality dimension, the threshold worth waking up for
- 100% of failures: errored and escalated requests are always evaluated. Sample rate does not apply to known failures.
- Kappa ≥ 0.6: judge agreement with human raters must hit this bar before you trust scores at scale
Calibrate the Judge Before You Trust the Scores
A miscalibrated judge running at 7% sample rate produces confident-looking noise. Cohen's kappa is the gate.
Most teams skip calibration. They pick a judge model, write a rubric, and start collecting scores. The scores look plausible. The trend lines look smooth. What they've actually built is an expensive random number generator with a good aesthetic.
Calibration is the step that separates measurement from theater. Rate 50–200 production samples by hand, have your judge score the same samples, then compute Cohen's kappa between the two. For balanced criteria on a binary pass/fail rubric, 50 stratified traces will pin kappa to within ±0.10–0.15 at a 95% bootstrap confidence interval. That's enough to know whether your judge is tracking human judgment.[9]
When your criteria involve rare-but-expensive failure classes — safety violations appearing in 6% of traces, for example — 50 samples isn't enough. The variance of kappa is dominated by the count of minority-class examples, not the total sample size. Plan for 200+ when minority classes matter.[9]
The threshold that matters: kappa ≥ 0.6 before scaling to 5–10% of production traffic. Below that, the judge isn't tracking human quality judgment consistently enough to detect real regressions. Between 0.6 and 0.8 is workable for most quality dimensions. Above 0.8 is the bar for high-stakes criteria like safety detection.
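For readers who want to see the gate computed, here is a standard-library sketch of Cohen's kappa with a percentile bootstrap interval. The function names are hypothetical, the labels are assumed to be categorical (for example "pass"/"fail"), and the bootstrap settings are illustrative defaults, not values from the cited guidance.

```python
import random
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = h_counts.keys() | j_counts.keys()
    expected = sum(h_counts[c] * j_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # kappa is undefined (0/0) here; treated as full agreement in this sketch
    return (observed - expected) / (1 - expected)


def bootstrap_kappa_ci(
    human: list[str], judge: list[str], n_boot: int = 1000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile bootstrap interval for kappa, resampling rated pairs with replacement."""
    rng = random.Random(seed)
    n = len(human)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(cohens_kappa([human[i] for i in idx], [judge[i] for i in idx]))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]
```

If the lower end of the interval sits below 0.6, the honest reading is that you do not yet know whether the judge clears the bar, and more rated samples are the fix.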
Calibration also has a shelf life. Judge drift is real: without a monthly recalibration cadence, agreement degrades over 60–90 days as production distribution shifts. Run your gold set monthly. Alert if kappa drops below threshold. Treat the judge as a system that also needs monitoring.
quality_evaluator.py

    # Judge runs on 7% of traffic. Errored and escalated requests bypass the gate entirely.
    # Calibration: run this against 50-200 hand-rated samples and check Cohen's kappa >= 0.6.
    import json
    import random

    from anthropic import Anthropic
    from opentelemetry import trace

    tracer = trace.get_tracer("quality.evaluator")
    client = Anthropic()
    SAMPLE_RATE = 0.07


    def sample_and_evaluate(
        trace_id: str, query: str, response: str, intent: str, is_error: bool = False
    ) -> dict | None:
        # Errors always evaluated; sample rate doesn't apply
        if not is_error and random.random() > SAMPLE_RATE:
            return None
        with tracer.start_as_current_span("quality.judge") as span:
            span.set_attribute("quality.trace_ref", trace_id)
            span.set_attribute("quality.intent", intent)
            span.set_attribute("quality.forced", is_error)  # distinguish forced evals
            result = client.messages.create(
                model="claude-haiku-4-5",
                max_tokens=64,
                messages=[{"role": "user", "content": judge_prompt(query, response, intent)}],
            )
            try:
                scores = json.loads(result.content[0].text)
            except (json.JSONDecodeError, IndexError, AttributeError):
                # A judge reply that is not valid JSON is recorded on the span, not raised.
                span.set_attribute("quality.judge.parse_error", True)
                return None
            for dim, score in scores.items():
                span.set_attribute(f"quality.score.{dim}", score)
            return scores


    def judge_prompt(query: str, response: str, intent: str) -> str:
        return (
            f"Task type: {intent}\n"
            "Rate this response on three dimensions (1-5 each):\n"
            "- format: does the response match the expected structure?\n"
            "- relevance: does it address the query accurately?\n"
            "- coherence: is it internally consistent and factually plausible?\n"
            f"Query: {query}\nResponse: {response}\n"
            "Reply JSON only: {\"format\": N, \"relevance\": N, \"coherence\": N}"
        )

Prompt Drift Is the Default. Hash It or Stop Pretending.
Nobody announces a prompt change. Without a fingerprint on every span, a nine-word deletion is indistinguishable from no change at all.
Prompt drift is invisible by design. A developer refactors a helper. A template variable shifts scope. A conditional branch that used to include an instruction now skips it. The effective prompt the model receives changes, and unless you're hashing the template, no deployment log will mention it.
The fix is one function and one span attribute. Hash the template before rendering it with user data. Attach the hash to every LLM call span. When the hash moves, the template moved — whether or not anyone wrote a changelog entry.
The leverage is in correlation. A hash change followed by a quality score drop within 24–48 hours is almost always the root cause. A hash change with no quality impact is the safe case — note it and move on. The combination tells you what a deployment log alone cannot: did the change actually matter.
Model provider updates are a second, subtler form of drift. Providers update model weights, safety policies, and serving infrastructure without changing API endpoints. A postmortem covering incidents in August 2025 documented bugs that degraded Claude Sonnet 4 responses without any API modifications, including routing errors affecting up to 16% of requests. Prompt hashing doesn't catch this, because the template didn't change. What catches provider-side drift is a continuous baseline: quality scores should be stable between template changes, and any shift that isn't correlated with a hash change needs a different root cause hypothesis.
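The triage logic in the last two paragraphs fits in one function: given the time a quality drop was detected and a log of template hash changes, decide which hypothesis to chase first. This is a sketch under assumptions. `attribute_drop` and the shape of `hash_changes` are hypothetical, and the 48-hour lookback is the upper end of the window mentioned above.

```python
from datetime import datetime, timedelta


def attribute_drop(
    drop_detected_at: datetime,
    hash_changes: list[tuple[datetime, str, str]],
    lookback: timedelta = timedelta(hours=48),
) -> dict:
    """Pick a first root-cause hypothesis for a quality drop.

    hash_changes holds (changed_at, old_hash, new_hash) tuples, e.g. built by
    watching the gen_ai.prompt.hash span attribute for new values.
    """
    candidates = [
        change
        for change in hash_changes
        if timedelta(0) <= drop_detected_at - change[0] <= lookback
    ]
    if candidates:
        changed_at, old_hash, new_hash = max(candidates, key=lambda c: c[0])
        return {
            "suspect": "prompt_template",
            "changed_at": changed_at,
            "old_hash": old_hash,
            "new_hash": new_hash,
        }
    # No template change in the window: look at provider updates or input drift.
    return {"suspect": "provider_or_input", "changed_at": None, "old_hash": None, "new_hash": None}
```

The result is a starting hypothesis, not a verdict. A template change and a provider update can land in the same 48 hours.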
One constraint: hashing catches changes to the template itself, not to the data interpolated into it. When user history, retrieved documents, or business rules drift, you need embedding-based drift detection on the rendered input distribution.[6] Hashing is a zero-cost signal for the most common silent regression. It's not the whole story, and a team that claims otherwise is selling something.
prompt_tracing.py

    # Hash the template before rendering. The hash is the signal; the rendered prompt is the user's data.
    import hashlib

    from opentelemetry import trace

    tracer = trace.get_tracer("llm.client")


    def prompt_hash(template: str) -> str:
        return hashlib.sha256(template.encode()).hexdigest()[:12]


    def call_llm_with_tracing(
        template: str, template_version: str, rendered: str
    ) -> str:
        with tracer.start_as_current_span("gen_ai.request") as span:
            span.set_attribute("gen_ai.system", "anthropic")
            span.set_attribute("gen_ai.prompt.hash", prompt_hash(template))
            span.set_attribute("gen_ai.prompt.version", template_version)
            # Send the rendered prompt to the model. Hash stays on the span.
            # _do_llm_call is your provider wrapper (not shown); here it returns the response text.
            return _do_llm_call(rendered)

When Input Distribution Shifts: Embedding Drift Detection
Hashing catches template changes. This catches when the world stops sending you the queries you built for.
There's a failure class that prompt hashing misses entirely: the input distribution shifts while your prompt stays constant. Users start asking questions your system wasn't built for. A seasonal event changes query intent. A product update attracts a different user segment. The model handles these new queries worse than the ones you built and tested for, but every span shows the same template hash.
Embedding-based drift detection addresses this. Embed a rolling sample of incoming queries using the same encoder your retrieval system uses. Cluster them over time. When the centroid of the production distribution moves outside a threshold of your golden dataset coverage, you're receiving queries meaningfully different from what you built for — before failure rates climb.[2]
The mechanics: compute cosine distance between the production query embedding centroid (rolling 7-day window) and the golden dataset centroid. A 15–20% shift in centroid distance is a reasonable threshold for investigation. It's not a reason to page; it's a reason to sample manually and decide whether the new distribution requires prompt updates or retrieval index refreshes.
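The centroid comparison is a few lines of arithmetic once the embeddings exist. The sketch below assumes the embeddings are already computed and passed in as plain lists of floats. It also reads the 15–20% figure as a cosine distance of 0.15 to 0.20 between centroids, which is one interpretation. `drift_check` and its parameters are hypothetical names.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def drift_check(
    golden: list[list[float]], production: list[list[float]], threshold: float = 0.15
) -> dict:
    """Compare the rolling production centroid against the golden-set centroid."""
    distance = cosine_distance(centroid(golden), centroid(production))
    # Crossing the threshold is a prompt to sample manually, not a reason to page.
    return {"distance": distance, "investigate": distance >= threshold}
```

A single centroid hides multi-modal shifts, for example a new cluster of queries appearing while the bulk stays put, so treat a quiet result as weak evidence and keep the clustering view alongside it.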
For RAG pipelines, a variant of this applies at the retrieval layer. Track the distribution of retrieval scores across a rolling window. A 10% drop in the median retrieval score over 3 days means the knowledge base is drifting relative to the queries — not a model problem, a data freshness problem. Track it as a separate signal from generation quality, or you'll spend two days debugging the wrong layer.
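Keeping the two layers apart can be as simple as computing each drop on its own series and deciding which layer to debug first. The sketch below assumes retrieval scores are positive similarity values and that the generation drop is computed elsewhere. `relative_median_drop` and `triage` are hypothetical names; the thresholds are the 10% and 5% figures from the text.

```python
from statistics import median


def relative_median_drop(baseline: list[float], recent: list[float]) -> float:
    """Fractional drop of the recent median relative to the baseline median."""
    base = median(baseline)
    if base <= 0:
        raise ValueError("baseline median must be positive")
    return (base - median(recent)) / base


def triage(
    retrieval_drop: float,
    generation_drop: float,
    retrieval_threshold: float = 0.10,
    generation_threshold: float = 0.05,
) -> str:
    """Name the layer to debug first, checking retrieval before generation."""
    if retrieval_drop >= retrieval_threshold:
        return "retrieval"  # data freshness or index problem; generation is downstream of it
    if generation_drop >= generation_threshold:
        return "generation"  # retrieval looks healthy, so suspect prompt or model
    return "healthy"
```

Checking retrieval first encodes the point of the paragraph: when context gets worse, generation scores fall too, and the generation layer is the wrong place to start.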
| Failure Class | Caught by Latency/Error? | Caught by Prompt Hash? | Caught by Sampled Eval? | Caught by Embedding Drift? |
|---|---|---|---|---|
| Template edit removed instruction | No | Yes — hash changes | Yes — scores drop | No |
| Provider model update changed behavior | No | No — template unchanged | Yes — if baseline is continuous | No |
| Input distribution shifted (new query types) | No | No | Partially — if sampled queries reflect new dist. | Yes |
| RAG knowledge base staleness | No | No | Partially — coherence score drop | Yes — retrieval score variant |
| Format regression (JSON → prose) | No | Yes (if instruction removed) | Yes — format score | No |
| Response truncation / length collapse | Token count alert fires | No | Yes — coherence/relevance drop | No |
The Pipeline From Live Traffic to Page
How a request becomes a score, becomes a baseline, becomes an alert that lands on the right person.
Heuristic Checks: Free Signal on 100% of Traffic
Deterministic checks cost nothing per request and catch format failures before the judge ever runs.
Before you spend inference dollars on an LLM judge, run deterministic checks on everything. These execute in microseconds, cover 100% of requests, and catch the class of failures that are easiest to produce and hardest to notice: structural regressions.
Format validation is the highest-ROI heuristic. If your application expects JSON, validate the schema on every response. Track the JSON-parse failure rate as a metric. A spike from 0.1% to 3% failure rate is an earlier signal than any quality score — and it doesn't cost you an LLM call to detect. Similar logic applies to response length: if your summarization workflow should produce 100–300 words, a response histogram that collapses to 20-word outputs is detectable without a judge.
Other high-value heuristics: scan for refusal patterns (I'm sorry, I can't, As an AI) to track safety calibration drift from model updates; check for unexpected language or character sets in international deployments; validate that expected structure markers appear (section headers, numbered lists, code fences) when the task requires them.
The pass/fail rate of each heuristic check is itself a monitoring signal. Logging pass/fail rates by intent type over time gives you the earliest warning of model drift or provider-side API changes — visible within hours, not days.
heuristic_checks.py

    # Deterministic checks on 100% of traffic. No inference cost.
    # Pass/fail rates tracked as metrics; spikes are early drift signals.
    import json
    import re

    from opentelemetry import metrics

    meter = metrics.get_meter("quality.heuristics")
    format_fail_counter = meter.create_counter(
        "quality.heuristic.format_fail",
        description="Count of responses failing format validation",
    )
    refusal_counter = meter.create_counter(
        "quality.heuristic.refusal_detected",
        description="Count of responses with refusal patterns",
    )
    REFUSAL_PATTERNS = re.compile(
        r"(I'm sorry|I can't|As an AI|I cannot|I am not able)",
        re.IGNORECASE,
    )
    # Word-count bounds per intent: (min, max)
    LENGTH_BOUNDS = {"summarization": (80, 400), "qa": (20, 600), "extraction": (10, 200)}


    def heuristic_check(response: str, intent: str, expected_schema: dict | None = None) -> dict:
        results = {}
        # JSON format validation. This only checks that the response parses;
        # validating the parsed value against expected_schema is left to your schema library.
        if expected_schema:
            try:
                json.loads(response)
                results["format_ok"] = True
            except json.JSONDecodeError:
                results["format_ok"] = False
                format_fail_counter.add(1, {"intent": intent})
        # Refusal detection
        results["refusal"] = bool(REFUSAL_PATTERNS.search(response))
        if results["refusal"]:
            refusal_counter.add(1, {"intent": intent})
        # Length bounds
        word_count = len(response.split())
        lo, hi = LENGTH_BOUNDS.get(intent, (0, 10_000))
        results["length_ok"] = lo <= word_count <= hi
        results["word_count"] = word_count
        return results

Thresholds That Page the Right Person, Not the Whole Team
Alerts that fire on noise stop firing entirely. Each threshold below names a specific failure class and the first move that catches the cause.
| Signal | Threshold | Window | First Action |
|---|---|---|---|
| Quality score drop | Any dimension falls 5%+ below baseline | 7-day rolling | Pull traces from degradation period; check prompt hash timeline |
| Prompt hash change | Any unexpected hash change outside a deployment window | Per-request | Verify change was intentional; correlate with next 48h quality trend |
| Retrieval relevance drop (RAG) | Retrieval score falls 10%+ below baseline | 3-day rolling | Check knowledge base freshness; inspect retrieval pipeline config |
| Hallucination / refusal rate spike | Flagged rate exceeds 2× historical baseline | Daily | Immediate escalation; roll back most recent prompt or model changes |
| Format validation failure rate | Rises above 2× baseline (e.g., 0.1% → 0.2%) | 1-hour rolling | Check for provider-side model update; inspect template for regressions |
| Judge kappa drop (monthly calibration) | Kappa falls below 0.6 on gold set | Monthly re-run | Suspend scoring trust; re-rate gold set and update judge rubric |
| Cost-per-quality ratio | Spend per quality point rises 30%+ week-over-week | Weekly | Review model selection and prompt efficiency; check for context bloat |
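The numeric rows of the table can live in code as data, so thresholds are reviewed in one place and not scattered across dashboard queries. This is a sketch with hypothetical names (`AlertRule`, `RULES`, `breached`). The prompt hash row is left out because it is an event, not a threshold on a number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    kind: str  # "relative_drop", "relative_rise", "ratio_above", or "below"
    limit: float
    window: str


RULES = {
    "quality_score": AlertRule("relative_drop", 0.05, "7-day rolling"),
    "retrieval_relevance": AlertRule("relative_drop", 0.10, "3-day rolling"),
    "hallucination_or_refusal_rate": AlertRule("ratio_above", 2.0, "daily"),
    "format_failure_rate": AlertRule("ratio_above", 2.0, "1-hour rolling"),
    "judge_kappa": AlertRule("below", 0.6, "monthly"),
    "cost_per_quality_point": AlertRule("relative_rise", 0.30, "weekly"),
}


def breached(signal: str, baseline: float, current: float) -> bool:
    """Return True when the current value violates the rule for this signal."""
    rule = RULES[signal]
    if rule.kind == "below":
        return current < rule.limit  # absolute floor; baseline is ignored
    if baseline <= 0:
        return False  # no usable baseline yet, so nothing to compare against
    if rule.kind == "relative_drop":
        return (baseline - current) / baseline >= rule.limit
    if rule.kind == "relative_rise":
        return (current - baseline) / baseline >= rule.limit
    if rule.kind == "ratio_above":
        return current > rule.limit * baseline
    raise ValueError(f"unknown rule kind: {rule.kind}")
```

For example, `breached("format_failure_rate", 0.001, 0.003)` is true because 0.3% is more than twice the 0.1% baseline.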
Build Order: Why Cost Tracking Comes First
The counterintuitive prerequisite to quality monitoring
The most common mistake is starting with quality evaluators before having cost tracking in place. This seems backwards until you realize: your sampling rate for LLM-as-judge is directly constrained by how much that evaluation costs. Without knowing your baseline inference cost per request type, you have no principled basis for choosing a sampling rate. You'll either undersample (missing real degradation) or oversample (running quality evals that cost more than the system they're monitoring).
Cost tracking also delivers fast value on its own. Adding trace IDs and cost-per-span takes a few hours. It immediately shows which request types are expensive and whether any single workflow is responsible for disproportionate spend. That data shapes every downstream sampling decision.
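The dependency between cost tracking and sampling is plain arithmetic: per-call cost comes from token counts, and the affordable sample rate comes from per-call cost and a budget. In the sketch below, every number is a placeholder. The model names and prices in `PRICES_PER_MTOK` are hypothetical, as are the function names, so substitute your provider's published rates and your own token counts.

```python
# Hypothetical prices in USD per million tokens: (input, output).
PRICES_PER_MTOK = {
    "example-prod-model": (3.00, 15.00),
    "example-judge-model": (1.00, 5.00),
}


def call_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one LLM call, suitable for a custom cost span attribute."""
    in_price, out_price = PRICES_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def affordable_sample_rate(
    daily_requests: int,
    eval_cost_usd: float,
    daily_eval_budget_usd: float,
    forced_evals: int = 0,
) -> float:
    """Largest sample rate the budget supports after paying for forced evaluations.

    forced_evals is the expected daily count of errored or escalated requests,
    which are evaluated regardless of the sample rate.
    """
    remaining = daily_eval_budget_usd - forced_evals * eval_cost_usd
    pool = daily_requests - forced_evals
    if pool <= 0 or eval_cost_usd <= 0 or remaining <= 0:
        return 0.0
    return min(1.0, remaining / (pool * eval_cost_usd))
```

With a judge call at 1,500 input and 100 output tokens under these placeholder prices, `call_cost_usd` gives $0.002 per evaluation. A $20 daily budget on 100,000 requests then supports a rate of about 10%, and less once forced evaluations are subtracted.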
There's also an uncomfortable truth about comprehensive observability: teams that try to build the full quality monitoring stack in the first sprint typically ship nothing usable in the first month. Phased delivery, with something working at each stage, consistently outperforms a complete design that's half-implemented.
- [01] Trace IDs + Cost Tracking (Weeks 1–2)
  Attach a unique trace ID to every request and a cost attribute to every LLM call span. Immediately surfaces expensive outliers and establishes the budget for subsequent sampling decisions. This is the foundation everything else depends on.
- [02] Prompt Hash Instrumentation (Weeks 2–3)
  Add gen_ai.prompt.hash and gen_ai.prompt.version attributes to every LLM call span. Zero inference cost, immediate audit trail. Catches the most common cause of silent quality regressions (unannounced prompt changes) before they compound.
- [03] Heuristic Quality Checks (Weeks 3–5)
  Run lightweight rule-based checks on 100% of traffic: format validation, response length bounds, regex patterns for expected structure. Track pass/fail rates as OTel metrics. Detects structural failures cheaply and calibrates what 'normal' looks like before investing in semantic evaluation.
- [04] Judge Calibration Run (Week 5)
  Before any LLM-as-judge goes to production, manually rate 50–200 production samples. Compute Cohen's kappa between your judge and human raters. Don't proceed until kappa ≥ 0.6 on balanced criteria. This is not optional.
- [05] Sampled LLM-as-Judge (Weeks 5–8)
  Start at 1–2% sample rate with a validated judge. Establish a 2-week rolling baseline before configuring drift alerts. Scale to 5–10% once you trust the scores and know the cost profile. Schedule monthly kappa recalibration against the gold set.
Minimum Viable LLM Observability Checklist
The checklist before you claim production quality observability
- Every request has a unique trace ID propagated across all service boundaries
- Every LLM call span records gen_ai.system, gen_ai.request.model, token counts, and estimated cost
- Every LLM call span includes gen_ai.prompt.hash and gen_ai.prompt.version
- Heuristic checks run on 100% of production traffic; pass/fail rates tracked as OTel metrics
- Judge calibrated: Cohen's kappa ≥ 0.6 on 50–200 hand-rated samples before production use
- LLM-as-judge running on at least 1% of production traffic with validated scores
- 2 weeks of quality score history collected before any drift alerts are configured
- Alerts configured on 7-day rolling window, 5% drop threshold per quality dimension
- Retrieval quality tracked separately from generation quality in RAG pipelines
- Escalated and flagged requests evaluated at 100% regardless of sample rate
- Prompt hash change + quality correlation runbook documented and accessible
- Monthly judge re-calibration scheduled; kappa drop below 0.6 triggers rubric review
When to Skip Semantic Evaluation (and What to Do Instead)
LLM-as-judge is not always the right tool. Know the cases where it wastes money and misleads.
LLM-as-Judge: Apply / Skip Decision Rules
- Apply when outputs are open-ended prose, structured summaries, or conversational responses where correctness isn't binary.
  Semantic judges add value when human judgment is genuinely required to assess quality. These are the tasks where heuristics can't cover the full quality surface.
- Skip when the output is deterministic or verifiable: SQL queries, code, JSON extraction against a fixed schema.
  Run the SQL. Parse the JSON. Execute the code. A judge's opinion on whether the SQL is correct is slower, more expensive, and less reliable than just executing it and checking the result.
- Skip when you haven't calibrated the judge. Uncalibrated scoring at scale produces data that feels authoritative and guides bad decisions.
  An uncalibrated judge with a Cohen's kappa of 0.3 is not a measurement instrument. It's a random number generator with a clean dashboard.
- Apply to 100% of safety-critical or escalated requests regardless of sample rate.
  The cost of missing a safety failure is not comparable to the cost of an extra LLM call. Sample rate is a cost optimization for the happy path, not for known-bad or flagged requests.
- Skip judge evaluation for requests where response content can't be safely passed to a third-party model due to data residency or PII constraints.
  Route those requests through a self-hosted or on-premise evaluation model, or restrict to heuristic-only checks. Compliance constraints override eval coverage goals.
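As one concrete case of "run the SQL", the sketch below executes a generated query against a throwaway in-memory SQLite database and compares the rows with a known answer. The helper name `sql_returns_expected` and the fixture schema are hypothetical. A real check would use a fixture that mirrors your production schema, and model-generated SQL should only ever run against a disposable database.

```python
import sqlite3


def sql_returns_expected(query: str, setup_sql: str, expected_rows: list[tuple]) -> bool:
    """Execute a generated query against a fixture and compare the result rows."""
    conn = sqlite3.connect(":memory:")  # throwaway database, discarded on close
    try:
        conn.executescript(setup_sql)
        rows = conn.execute(query).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run is a failed response
    finally:
        conn.close()
    # Order-insensitive comparison; drop the sorting if row order is part of the task.
    return sorted(rows) == sorted(expected_rows)


FIXTURE = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
INSERT INTO orders (amount) VALUES (5.0), (20.0), (35.0);
"""

ok = sql_returns_expected(
    "SELECT COUNT(*) FROM orders WHERE amount > 10", FIXTURE, [(2,)]
)
```

The pass rate of this check over a fixed set of fixtures is a quality score in its own right, and it needs no calibration against human raters.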
How is LLM quality monitoring different from traditional APM?
Traditional APM measures whether code executed correctly — a function ran, the database responded, the request completed within latency bounds. LLM quality monitoring measures whether outputs were useful — whether the response answered the question, matched the expected format, and was factually coherent. These are orthogonal concerns. A request that completes in 200ms with a 200 OK can still produce an output that's wrong, incoherent, or misformatted. Infrastructure metrics alone cannot capture that distinction.
Can I use OpenTelemetry for quality signals?
Yes, with a clear understanding of what the conventions cover. The OTel GenAI semantic conventions standardize gen_ai.system, gen_ai.request.model, token counts, finish reasons, and latency histograms. As of early 2026, most GenAI attributes are in experimental status. For quality signals, you extend with custom span attributes: quality.score.format, quality.score.relevance, gen_ai.prompt.hash, gen_ai.prompt.version. Collectors pass custom attributes through without modification, so no pipeline changes are needed beyond adding the attributes at instrumentation time.
How do I handle non-determinism when scoring quality?
Don't try to get consistent scores on individual requests — LLM non-determinism means individual scores will vary even for identical inputs. Track score distributions over time and use statistical methods to detect when the distribution shifts. A score of 4.1 today versus 4.0 yesterday is noise. A distribution that was centered at 4.2 last week and is now centered at 3.6 is a meaningful signal. Alert on distribution shift, not point-in-time variance.
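One simple statistical method for this is a permutation test: pool last week's and this week's scores, shuffle them many times, and count how often chance alone produces a drop in the mean as large as the one observed. The sketch below is one option among several, not a prescribed method. `drop_p_value` is a hypothetical name and the permutation count is an illustrative default.

```python
import random


def _mean(values: list[float]) -> float:
    return sum(values) / len(values)


def drop_p_value(
    before: list[float], after: list[float], n_perm: int = 2000, seed: int = 0
) -> float:
    """One-sided permutation test: how surprising is the observed drop in mean score?"""
    rng = random.Random(seed)
    observed = _mean(before) - _mean(after)
    pooled = list(before) + list(after)
    k = len(before)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Relabel at random and check whether the fake "drop" matches the real one.
        if _mean(pooled[:k]) - _mean(pooled[k:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p-value of zero.
    return (hits + 1) / (n_perm + 1)
```

A small p-value says the drop is unlikely to be sampling noise. It says nothing about whether the drop is large enough to matter, so pair it with a minimum effect size such as the 5% threshold.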
At what traffic volume does quality monitoring make sense?
Start immediately, even at low traffic volumes. The value is baseline establishment, not absolute numbers. Two weeks of quality scores at 10 requests per day gives you a baseline to alert against when traffic scales or a prompt change causes degradation. Two weeks of no data gives you nothing to compare to when you actually need it. The cost at low traffic is negligible; the cost of missing a regression while operating blind is not.
Should I store prompt and response content in spans?
PII and data residency requirements usually prevent storing full content in telemetry. Store the hash of the prompt template (not the rendered prompt with user data), response length, a category label, and quality scores. For quality evaluation, pass content through your evaluation pipeline separately — but don't attach raw prompt/response content to spans. Structure your evaluation pipeline so it reads from a secure, access-controlled store, not from your distributed trace backend.
What's the right judge model to use for production evaluation?
Use a smaller, faster model than your production model — Claude Haiku or GPT-4o-mini at current pricing. The judge doesn't need to be smarter than the system being evaluated; it needs to be consistent. A smaller model with a well-designed rubric consistently outperforms a larger model with an underspecified prompt. Cost matters: at 7% sample rate with a Haiku-class judge, eval cost is roughly 1–3% of production inference cost. Verify this estimate with your actual token counts before committing to a sample rate.
How do I detect model provider updates affecting quality?
Prompt hashing won't catch this — the template didn't change. A continuous quality baseline catches it: if scores shift without a corresponding hash change, the most likely cause is a provider-side model update. Track your quality baseline as a continuous series. Any discontinuity that isn't correlated with a deployment or prompt hash change is a signal to check provider changelogs and consider pinning to a specific model version if the API supports it.
The teams that catch quality degradation before it escalates to executive complaints share one characteristic: they treated output quality as an engineering concern from the start, not a customer support problem to triage after the fact.
The infrastructure for this isn't complicated. It's a hash function, a sampling decision, a judge model, and the discipline to build baselines before you need them. What's hard is the organizational habit of treating semantic quality as a first-class signal alongside latency and error rate — not as a soft metric that belongs on a different team's dashboard.
Your traces are already recording that requests completed. The question is whether you're recording whether they worked. That gap — between completion and usefulness — is where production LLM quality lives and dies.
- [1] The Complete Guide to LLM Observability & Monitoring in 2026 (Adaline, Feb 2026). adaline.ai
- [2] AI Production Monitoring: Quality Drift, Hallucinations, Costs (Particula Tech, Feb 2026). particula.tech
- [3] Production Monitoring Alerts for LLM Quality Drops: Braintrust 24-hour blindspot case (Technivorz, Mar 2026). technivorz.com
- [4] AI Agent Observability: Evolving Standards and Best Practices (OpenTelemetry, 2025). opentelemetry.io
- [5] Quality Monitoring: Drift Detection and Regression Alerts for LLMs (Brenndoerfer, Feb 2026). mbrenndoerfer.com
- [6] Detecting drift in production generative AI applications (AWS Prescriptive Guidance). docs.aws.amazon.com
- [7] Semantic conventions for generative client AI spans (OpenTelemetry, 2026). opentelemetry.io
- [8] Semantic conventions for generative AI metrics (OpenTelemetry, 2026). opentelemetry.io
- [9] LLM-as-Judge Best Practices in 2026: Calibration, Bias, and Cost (FutureAGI, 2026). futureagi.com
- [10] Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications (arXiv, Nov 2025). arxiv.org