LLM Observability: The Signals Latency Cannot See

LLM Observability: Catch Output Quality Drift Your Green Traces Can't See

Latency, error rate, and token cost stay green while LLM output quality degrades for weeks. The infrastructure layer cannot see semantic failure. Sampled evals, prompt hash drift, and distribution alerts are the signals that catch it before users do.

AI Engineering Platform · advanced · Apr 22, 2026 · 10 min read

By Viktor Bezdek · VP Engineering, Groupon

Three weeks into production. Median latency 380ms. Error rate 0.3%. Token spend on budget. SLO review signed off. The team moved on.

User satisfaction had been dropping every day for two weeks. It took a VP escalating before anyone correlated the curve with a prompt cleanup eleven days earlier — nine words removed from the system prompt that had told the model to format responses as structured JSON. Latency did not move. Errors did not move. Cost did not move. The only signal was user complaints, and nobody had wired user complaints into the engineering timeline.

This is the gap. The signals teams instrument first — latency, error rate, token cost — all sit healthy while semantic quality collapses underneath them. Infrastructure metrics tell you the request completed. They have no opinion on whether it worked. Logging that records 'request returned 200' is not observability for an LLM. It is an alibi.

What This Article Covers

✓ Latency, error rate, and token cost measure completion, not usefulness: three healthy dashboards can sit next to a collapsing product experience for weeks.
✓ Sampled LLM-as-judge on 5–10% of live traffic is the cheapest mechanism that catches quality drift before users escalate it.
✓ Prompt hash drift catches the most common silent regression: a template change that nobody announced and nothing logged.
✓ OpenTelemetry GenAI semantic conventions cover infra signals natively; quality signals require custom span attributes you add yourself.
✓ Cost tracking comes before sampling. Your eval sample rate is bounded by what each evaluation costs; without that number you're guessing.
✓ Alert on distribution shift across a rolling window. A 5% drop over 7 days is a signal. A single score moving is noise.
✓ Judge calibration is not optional. Cohen's kappa below 0.6 means you're scaling confident-looking noise, not measurement.

Distributed Tracing Assumed Determinism. LLMs Broke That Assumption.

The fight: a tracing model designed for deterministic services applied to a non-deterministic one. The infrastructure layer cannot see semantic failure.

Distributed tracing was built on a contract: a function returns 200, the request succeeded. SLOs hold. Alerts fire. Dashboards mean something. LLMs break that contract.

A response can be syntactically valid, arrive under 500ms, return 200, consume the budgeted tokens, and still be wrong, malformed, or off-intent. The span is clean. The service is healthy. The user experience is degrading. Nothing in your tracing stack was built to notice the difference.

The gap between 'request completed' and 'request was useful' is where every LLM quality failure hides. Teams find out the same way every time — monitoring fires on model API outages and stays silent through multi-week quality regressions that users notice long before engineers do.^[3] By the time someone escalates, the curve has been bending for two weeks.

The failure taxonomy is wider than most teams expect. Research on production LLM failure modes identifies at least five distinct classes that infrastructure metrics are blind to: output format deviation, semantic drift, factual incoherence, context handling failures (long context truncation, retrieval miss), and safety degradation (refusal rate changes from provider-side model updates).^[10] Each class requires a different signal to detect.

RAG compounds it. Retrieval scores that drift a few percent will not trigger any infrastructure alert. The model does its best with worse context. Outputs degrade subtly. Aggregating retrieval and generation into one health number is how the root cause stays hidden — track them as separate signals or don't bother tracking them at all.

Alibi

p50/p95/p99 latency SLOs
HTTP error rate
Token spend alerts
Model API availability
Rate limit proximity

Observability

Sampled LLM-as-judge scores against rolling baselines
Prompt hash drift attached to every span
Per-intent quality tracked as independent series
Retrieval and generation scored separately — never aggregated
Distribution-shift alerts, not point-variance noise

What OpenTelemetry Actually Covers — and Where You're On Your Own

The GenAI semantic conventions are a solid infra foundation. Quality signals are a separate layer you build on top.

The OpenTelemetry GenAI semantic conventions standardize how LLM calls are recorded across providers.^[7] They're the right foundation. But understanding exactly what they cover — and what they don't — determines where you have to build.

What the conventions give you out of the box:

gen_ai.system — the provider (e.g., anthropic, openai, aws.bedrock)
gen_ai.request.model — the exact model name requested
gen_ai.usage.input_tokens / gen_ai.usage.output_tokens — token counts
gen_ai.response.finish_reasons — why generation stopped (stop, length, tool_calls)
gen_ai.client.operation.duration — a histogram of LLM call latencies^[8]

As of early 2026, most GenAI semantic conventions are in experimental status — the API isn't fully stabilized, and major vendors like Datadog and Grafana are beginning native support but with version-specific caveats. Pin your instrumentation library version and test collector compatibility before assuming pass-through.

What the conventions do not cover:

Output evaluation or quality scoring
Safety and hallucination detection
Prompt template identity or versioning
Intent-level quality tracking

Those are yours to instrument. The convention doesn't stop you — OpenTelemetry collectors pass custom span attributes through without modification. You add gen_ai.prompt.hash, gen_ai.prompt.version, quality.score.format, quality.score.relevance as custom attributes on the same span. They flow through the same pipeline your infra metrics use. No separate pipeline, no separate backend.

otel_llm_span.py

# Full span: OTel GenAI conventions + custom quality attributes on the same span.
# Collectors pass custom attributes through without modification.
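from opentelemetry import metrics, trace
from opentelemetry.semconv._incubating.attributes import gen_ai_attributes as GenAI
import hashlib, time

tracer = trace.get_tracer("llm.client", "1.0.0")
meter = metrics.get_meter("llm.client", "1.0.0")
# The conventions define gen_ai.client.operation.duration as a histogram metric
# (unit: seconds), not a span attribute, so it is recorded through the meter.
operation_duration = meter.create_histogram(
    "gen_ai.client.operation.duration",
    unit="s",
    description="GenAI operation duration",
)


def call_with_full_instrumentation(
    template: str,
    rendered: str,
    model: str = "claude-sonnet-4-5",
) -> str:
    template_hash = hashlib.sha256(template.encode()).hexdigest()[:12]

    with tracer.start_as_current_span("gen_ai.client.chat") as span:
        # --- OTel GenAI conventions (standard) ---
        span.set_attribute(GenAI.GEN_AI_SYSTEM, "anthropic")
        span.set_attribute(GenAI.GEN_AI_REQUEST_MODEL, model)

        t0 = time.perf_counter()
        # _do_llm_call is the application's own provider wrapper (not shown here);
        # it is assumed to return (response_text, usage).
        response, usage = _do_llm_call(rendered, model)
        elapsed = time.perf_counter() - t0

        span.set_attribute(GenAI.GEN_AI_USAGE_INPUT_TOKENS, usage.input_tokens)
        span.set_attribute(GenAI.GEN_AI_USAGE_OUTPUT_TOKENS, usage.output_tokens)
        operation_duration.record(
            elapsed,
            {GenAI.GEN_AI_SYSTEM: "anthropic", GenAI.GEN_AI_REQUEST_MODEL: model},
        )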

        # --- Custom quality attributes (not in conventions — add yourself) ---
        span.set_attribute("gen_ai.prompt.hash", template_hash)
        span.set_attribute("gen_ai.prompt.version", "v2.4.1")
        span.set_attribute("gen_ai.request.intent", "summarization")
        # quality.score.* populated by async evaluator after sampling gate

        return response

Four Signals That Catch What Latency Cannot

Each one targets a specific failure class infrastructure metrics ignore. Ordered by build cost and leverage.

Sampled Evals

LLM-as-judge on 5–10% of live traffic, scoring format, relevance, coherence

Prompt Hashing

Fingerprint the template on every span. Detects silent template changes without logging user data.

Per-Intent Scoring

Quality tracked per request category. Isolates which workflow is degrading instead of averaging it away.

RAG Quality Split

Retrieval and generation as independent signals. One number hides which layer is failing.

Sampled Evals: A Judge on Your Live Traffic

The fight: signal coverage versus inference cost. Sample too low, miss the regression. Sample too high, double the bill.

The mechanism is simple. After a request completes, route a sample of it through an evaluator that scores the response against the criteria you actually care about — format compliance, relevance to intent, factual coherence. Aggregate the scores. Alert when the trend breaks.

Sampling rate is a cost decision that tends to get dressed up as a statistics decision. Adaline's 2026 guide puts the range at 5–10% of production traffic: enough for usable confidence intervals without doubling inference spend.^[1] Don't start there. Start at 1–2% and validate that the judge agrees with human raters on the same samples. Scaling to 5–10% before you trust the judge means collecting misleading data faster.

Two rules hold regardless of sampling rate. Evaluate 100% of requests that errored or got escalated. Stratify by intent so each category gets proportional coverage. Sampling only from the most recent traffic gives you recency bias and hides slow degradations that compound across intent categories unevenly.

The metric that matters is trend, not absolute score. 3.8/5.0 today is meaningless. 4.2 → 3.5 over two weeks is a structural signal. Guidance on continuous quality monitoring treats a 5% drop in any single dimension over a 7-day rolling window as the threshold that surfaces real degradation before users escalate.^[2]

One honest constraint: non-determinism means individual scores will move even on identical inputs. Don't alert on single-request variance. Alert on distribution shift — when the population of scores changes shape, not when one number fluctuates.
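A rolling-window check along these lines can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names, the use of daily mean scores, and the 14-day baseline are assumptions made for the example; the 7-day window and 5% threshold come from the text above.

```python
from statistics import mean


def relative_drop(baseline_scores, recent_scores):
    """Fractional drop of the recent mean relative to the baseline mean."""
    base = mean(baseline_scores)
    return (base - mean(recent_scores)) / base


def should_alert(daily_means, baseline_days=14, window_days=7, threshold=0.05):
    """daily_means: oldest-first list of per-day mean judge scores for ONE dimension.

    Compares the most recent window against the days immediately before it.
    Returns False until enough history exists to form both windows.
    """
    needed = baseline_days + window_days
    if len(daily_means) < needed:
        return False
    baseline = daily_means[-needed:-window_days]
    recent = daily_means[-window_days:]
    return relative_drop(baseline, recent) >= threshold
```

Because the comparison is between window means, one bad day inside an otherwise stable week does not trip the alert, while a sustained shift does. A production version would likely compare full distributions (for example with a two-sample test) and keep a separate series per intent.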

5–10%

The sampling rate that pays for itself

Confidence intervals you can trust without doubling inference cost (Adaline, 2026)

5%

The drop that means something happened

7-day rolling drop in any single quality dimension — the threshold worth waking up for

100%

Coverage that is non-negotiable

Errored and escalated requests always evaluated. Sample rate does not apply to known failures.

0.6

Minimum Cohen's kappa before scaling

Judge agreement with human raters must hit this bar before you trust scores at scale

Calibrate the Judge Before You Trust the Scores

A miscalibrated judge running at 7% sample rate produces confident-looking noise. Cohen's kappa is the gate.

Most teams skip calibration. They pick a judge model, write a rubric, and start collecting scores. The scores look plausible. The trend lines look smooth. What they've actually built is an expensive random number generator with a good aesthetic.

Calibration is the step that separates measurement from theater. Rate 50–200 production samples by hand, have your judge score the same samples, then compute Cohen's kappa between the two. For balanced criteria on a binary pass/fail rubric, 50 stratified traces will pin kappa to within ±0.10–0.15 at a 95% bootstrap confidence interval. That's enough to know whether your judge is tracking human judgment.^[9]

When your criteria involve rare-but-expensive failure classes — safety violations appearing in 6% of traces, for example — 50 samples isn't enough. The variance of kappa is dominated by the count of minority-class examples, not the total sample size. Plan for 200+ when minority classes matter.^[9]

The threshold that matters: kappa ≥ 0.6 before scaling to 5–10% of production traffic. Below that, the judge isn't tracking human quality judgment consistently enough to detect real regressions. Between 0.6 and 0.8 is workable for most quality dimensions. Above 0.8 is the bar for high-stakes criteria like safety detection.

Calibration also has a shelf life. Judge drift is real: without a monthly recalibration cadence, agreement degrades over 60–90 days as production distribution shifts. Run your gold set monthly. Alert if kappa drops below threshold. Treat the judge as a system that also needs monitoring.
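The calibration gate itself is a small computation. The sketch below implements the standard Cohen's kappa formula for two raters over the same samples; the function name and the choice to raise on single-label data are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels):
    """Cohen's kappa between two raters who labelled the same samples."""
    if not human_labels or len(human_labels) != len(judge_labels):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human_labels)
    # Observed agreement: fraction of samples where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    # Chance agreement: product of each rater's marginal frequency, summed over labels.
    h_counts = Counter(human_labels)
    j_counts = Counter(judge_labels)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (observed - expected) / (1.0 - expected)
```

With a binary pass/fail rubric, a judge that matches the human rater on 6 of 8 balanced samples scores 0.5, which is below the 0.6 gate even though raw agreement is 75%. That gap between raw agreement and kappa is the reason to use kappa.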

quality_evaluator.py

# Judge runs on 7% of traffic. Errored and escalated requests bypass the gate entirely.
# Calibration: run this against 50-200 hand-rated samples and check Cohen's kappa >= 0.6.
import json, random
from anthropic import Anthropic
from opentelemetry import trace

tracer = trace.get_tracer("quality.evaluator")
client = Anthropic()
SAMPLE_RATE = 0.07


def sample_and_evaluate(
    trace_id: str, query: str, response: str, intent: str, is_error: bool = False
) -> dict | None:
    # Errored (or escalated) requests are always evaluated; the sample rate does not apply.
    if not is_error and random.random() >= SAMPLE_RATE:
        return None

    with tracer.start_as_current_span("quality.judge") as span:
        span.set_attribute("quality.trace_ref", trace_id)
        span.set_attribute("quality.intent", intent)
        span.set_attribute("quality.forced", is_error)  # distinguish forced evals

        result = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=64,
            messages=[{"role": "user", "content": judge_prompt(query, response, intent)}],
        )

        try:
            scores = json.loads(result.content[0].text)
        except (json.JSONDecodeError, IndexError, AttributeError):
            # The judge can also return malformed output; record it instead of crashing.
            span.set_attribute("quality.judge.parse_error", True)
            return None

        for dim, score in scores.items():
            span.set_attribute(f"quality.score.{dim}", score)

        return scores


def judge_prompt(query: str, response: str, intent: str) -> str:
    return (
        f"Task type: {intent}\n"
        "Rate this response on three dimensions (1-5 each):\n"
        "- format: does the response match the expected structure?\n"
        "- relevance: does it address the query accurately?\n"
        "- coherence: is it internally consistent and factually plausible?\n"
        f"Query: {query}\nResponse: {response}\n"
        "Reply JSON only: {\"format\": N, \"relevance\": N, \"coherence\": N}"
    )

Prompt Drift Is the Default. Hash It or Stop Pretending.

Nobody announces a prompt change. Without a fingerprint on every span, a nine-word deletion is indistinguishable from no change at all.

Prompt drift is invisible by design. A developer refactors a helper. A template variable shifts scope. A conditional branch that used to include an instruction now skips it. The effective prompt the model receives changes, and unless you're hashing the template, no deployment log will mention it.

The fix is one function and one span attribute. Hash the template before rendering it with user data. Attach the hash to every LLM call span. When the hash moves, the template moved — whether or not anyone wrote a changelog entry.

The leverage is in correlation. A hash change followed by a quality score drop within 24–48 hours is almost always the root cause. A hash change with no quality impact is the safe case — note it and move on. The combination tells you what a deployment log alone cannot: did the change actually matter.
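One way to make that correlation mechanical is sketched below. It assumes you can export time-ordered (timestamp, prompt hash) pairs from your spans and (timestamp, score) pairs from your score store; both function names and the data shapes are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def hash_change_points(spans):
    """spans: time-ordered (timestamp, prompt_hash) pairs.

    Returns one (timestamp, old_hash, new_hash) tuple per observed change.
    """
    changes = []
    for (_, prev_hash), (ts, cur_hash) in zip(spans, spans[1:]):
        if cur_hash != prev_hash:
            changes.append((ts, prev_hash, cur_hash))
    return changes


def score_delta_after(change_ts, scores, window=timedelta(hours=48)):
    """Mean score in the window after the change minus the mean in the window before.

    scores: (timestamp, score) pairs. Returns None if either side has no data.
    """
    before = [s for ts, s in scores if change_ts - window <= ts < change_ts]
    after = [s for ts, s in scores if change_ts <= ts < change_ts + window]
    if not before or not after:
        return None
    return mean(after) - mean(before)
```

A clearly negative delta after a change point is the case worth investigating first. A delta near zero is the safe case described above.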

Model provider updates are a second, subtler form of drift. Providers update model weights, safety policies, and serving infrastructure without changing API endpoints. A provider postmortem covering August 2025 incidents documented bugs that degraded Claude Sonnet 4's responses without any API modification, including routing errors affecting up to 16% of requests. Prompt hashing doesn't catch this, because the template didn't change. What catches provider-side drift is a continuous baseline: quality scores should be stable between template changes, and any shift that isn't correlated with a hash change needs a different root cause hypothesis.

One constraint: hashing catches changes to the template itself, not to the data interpolated into it. When user history, retrieved documents, or business rules drift, you need embedding-based drift detection on the rendered input distribution.^[6] Hashing is a zero-cost signal for the most common silent regression. It's not the whole story, and a team that claims otherwise is selling something.

prompt_tracing.py

# Hash the template before rendering. The hash is the signal; the rendered prompt is the user's data.
import hashlib
from opentelemetry import trace

tracer = trace.get_tracer("llm.client")


def prompt_hash(template: str) -> str:
    return hashlib.sha256(template.encode()).hexdigest()[:12]


def call_llm_with_tracing(
    template: str, template_version: str, rendered: str
) -> str:
    with tracer.start_as_current_span("gen_ai.request") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.prompt.hash", prompt_hash(template))
        span.set_attribute("gen_ai.prompt.version", template_version)
        # Send the rendered prompt to the model. Hash stays on the span.
        return _do_llm_call(rendered)

When Input Distribution Shifts: Embedding Drift Detection

Hashing catches template changes. This catches when the world stops sending you the queries you built for.

There's a failure class that prompt hashing misses entirely: the input distribution shifts while your prompt stays constant. Users start asking questions your system wasn't built for. A seasonal event changes query intent. A product update attracts a different user segment. The model handles these new queries worse than the queries you designed and tested against, but every span shows the same template hash.

Embedding-based drift detection addresses this. Embed a rolling sample of incoming queries using the same encoder your retrieval system uses. Cluster them over time. When the centroid of the production distribution moves outside a threshold of your golden dataset coverage, you're receiving queries meaningfully different from what you built for — before failure rates climb.^[2]

The mechanics: compute cosine distance between the production query embedding centroid (rolling 7-day window) and the golden dataset centroid. A 15–20% shift in centroid distance is a reasonable threshold for investigation. It's not a reason to page; it's a reason to sample manually and decide whether the new distribution requires prompt updates or retrieval index refreshes.
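The centroid comparison can be sketched with plain Python lists standing in for embedding vectors. The helper names are hypothetical, and reading the 15–20% figure as an absolute cosine distance of 0.15 is an assumption made for this example; calibrate the threshold against your own encoder before relying on it.

```python
import math


def centroid(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means same direction, 1.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm


def drift_exceeds(golden_embeddings, production_embeddings, threshold=0.15):
    """True when the production centroid has moved past the threshold
    relative to the golden-set centroid (threshold is an assumed value)."""
    distance = cosine_distance(
        centroid(golden_embeddings), centroid(production_embeddings)
    )
    return distance >= threshold
```

In practice the production side would be the rolling 7-day sample of query embeddings, recomputed on a schedule. A single centroid hides multi-modal shifts, which is why the text also suggests clustering.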

For RAG pipelines, a variant of this applies at the retrieval layer. Track the distribution of retrieval scores across a rolling window. A 10% drop in the median retrieval score over 3 days means the knowledge base is drifting relative to the queries — not a model problem, a data freshness problem. Track it as a separate signal from generation quality, or you'll spend two days debugging the wrong layer.

Failure Class	Caught by Latency/Error?	Caught by Prompt Hash?	Caught by Sampled Eval?	Caught by Embedding Drift?
Template edit removed instruction	No	Yes — hash changes	Yes — scores drop	No
Provider model update changed behavior	No	No — template unchanged	Yes — if baseline is continuous	No
Input distribution shifted (new query types)	No	No	Partially — if sampled queries reflect new dist.	Yes
RAG knowledge base staleness	No	No	Partially — coherence score drop	Yes — retrieval score variant
Format regression (JSON → prose)	No	Yes (if instruction removed)	Yes — format score	No
Response truncation / length collapse	Token count alert fires	No	Yes — coherence/relevance drop	No

The Pipeline From Live Traffic to Page

How a request becomes a score, becomes a baseline, becomes an alert that lands on the right person.

Quality Signal Pipeline: From Live Traffic to On-Call Page

Traffic splits at the sampler. Heuristics cover 100%; the judge covers 5–10%. Both feed the score store. The drift detector reads rolling baselines and pages on shift, not on point-variance.

Heuristic Checks: Free Signal on 100% of Traffic

Deterministic checks cost nothing per request and catch format failures before the judge ever runs.

Before you spend inference dollars on an LLM judge, run deterministic checks on everything. These execute in microseconds, cover 100% of requests, and catch the class of failures that are easiest to produce and hardest to notice: structural regressions.

Format validation is the highest-ROI heuristic. If your application expects JSON, validate the schema on every response. Track the JSON-parse failure rate as a metric. A spike from 0.1% to 3% failure rate is an earlier signal than any quality score — and it doesn't cost you an LLM call to detect. Similar logic applies to response length: if your summarization workflow should produce 100–300 words, a response histogram that collapses to 20-word outputs is detectable without a judge.

Other high-value heuristics: scan for refusal patterns (I'm sorry, I can't, As an AI) to track safety calibration drift from model updates; check for unexpected language or character sets in international deployments; validate that expected structure markers appear (section headers, numbered lists, code fences) when the task requires them.

The pass/fail rate of each heuristic check is itself a monitoring signal. Logging pass/fail rates by intent type over time gives you the earliest warning of model drift or provider-side API changes — visible within hours, not days.

heuristic_checks.py

# Deterministic checks on 100% of traffic. No inference cost.
# Pass/fail rates tracked as metrics — spikes are early drift signals.
import json, re
from opentelemetry import metrics

meter = metrics.get_meter("quality.heuristics")
format_fail_counter = meter.create_counter(
    "quality.heuristic.format_fail",
    description="Count of responses failing format validation",
)
refusal_counter = meter.create_counter(
    "quality.heuristic.refusal_detected",
    description="Count of responses with refusal patterns",
)

REFUSAL_PATTERNS = re.compile(
    r"(I'm sorry|I can't|As an AI|I cannot|I am not able)",
    re.IGNORECASE,
)
LENGTH_BOUNDS = {"summarization": (80, 400), "qa": (20, 600), "extraction": (10, 200)}


def heuristic_check(response: str, intent: str, expected_schema: dict | None = None) -> dict:
    results = {}

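    # JSON format validation: parse check plus required top-level keys.
    # Assumes expected_schema follows the JSON Schema "required" convention;
    # full schema validation needs a validator library.
    if expected_schema:
        try:
            parsed = json.loads(response)
            required = expected_schema.get("required", [])
            results["format_ok"] = isinstance(parsed, dict) and all(
                key in parsed for key in required
            )
        except json.JSONDecodeError:
            results["format_ok"] = False
        if not results["format_ok"]:
            format_fail_counter.add(1, {"intent": intent})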

    # Refusal detection
    results["refusal"] = bool(REFUSAL_PATTERNS.search(response))
    if results["refusal"]:
        refusal_counter.add(1, {"intent": intent})

    # Length bounds
    word_count = len(response.split())
    lo, hi = LENGTH_BOUNDS.get(intent, (0, 10_000))
    results["length_ok"] = lo <= word_count <= hi
    results["word_count"] = word_count

    return results

Thresholds That Page the Right Person, Not the Whole Team

Alerts that fire on noise stop firing entirely. Each threshold below names a specific failure class and the first move that catches the cause.

Signal	Threshold	Window	First Action
Quality score drop	Any dimension falls 5%+ below baseline	7-day rolling	Pull traces from degradation period; check prompt hash timeline
Prompt hash change	Any unexpected hash change outside a deployment window	Per-request	Verify change was intentional; correlate with next 48h quality trend
Retrieval relevance drop (RAG)	Retrieval score falls 10%+ below baseline	3-day rolling	Check knowledge base freshness; inspect retrieval pipeline config
Hallucination / refusal rate spike	Flagged rate exceeds 2× historical baseline	Daily	Immediate escalation; roll back most recent prompt or model changes
Format validation failure rate	Rises above 2× baseline (e.g., 0.1% → 0.2%)	1-hour rolling	Check for provider-side model update; inspect template for regressions
Judge kappa drop (monthly calibration)	Kappa falls below 0.6 on gold set	Monthly re-run	Suspend scoring trust; re-rate gold set and update judge rubric
Cost-per-quality ratio	Spend per quality point rises 30%+ week-over-week	Weekly	Review model selection and prompt efficiency; check for context bloat

Build Order: Why Cost Tracking Comes First

The counterintuitive prerequisite to quality monitoring

The most common mistake is starting with quality evaluators before having cost tracking in place. This seems backwards until you realize: your sampling rate for LLM-as-judge is directly constrained by how much that evaluation costs. Without knowing your baseline inference cost per request type, you have no principled basis for choosing a sampling rate. You'll either undersample (missing real degradation) or oversample (running quality evals that cost more than the system they're monitoring).

Cost tracking also delivers fast value on its own. Adding trace IDs and cost-per-span takes a few hours. It immediately shows which request types are expensive and whether any single workflow is responsible for disproportionate spend. That data shapes every downstream sampling decision.
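The arithmetic behind that constraint is short enough to write down. The sketch below is illustrative only: the function name and all dollar figures are hypothetical, and real costs should come from your own per-span cost data.

```python
def max_sample_rate(prod_cost_per_request, judge_cost_per_eval, eval_budget_fraction):
    """Largest sample rate that keeps judge spend within a fraction of production spend.

    Eval spend per production request = sample_rate * judge_cost_per_eval,
    so the budget constraint is
        sample_rate * judge_cost_per_eval <= eval_budget_fraction * prod_cost_per_request.
    """
    if prod_cost_per_request <= 0 or judge_cost_per_eval <= 0:
        raise ValueError("costs must be positive")
    rate = eval_budget_fraction * prod_cost_per_request / judge_cost_per_eval
    return min(1.0, rate)


# Hypothetical figures: $0.010 per production request, $0.003 per judge call,
# and a willingness to spend 2% of production inference cost on evaluation.
rate = max_sample_rate(0.010, 0.003, 0.02)
print(f"max sample rate: {rate:.1%}")
```

Running this per request type, not once globally, shows which intents can afford a higher sample rate and which need a cheaper judge or heuristic-only coverage.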

There's also the uncomfortable truth about observability comprehensiveness: teams that try to build the full quality monitoring stack in the first sprint typically ship nothing usable in the first month. Phased delivery — something working at each stage — consistently outperforms a complete design that's half-implemented.

[01]
Trace IDs + Cost Tracking (Weeks 1–2)
Attach a unique trace ID to every request and a cost attribute to every LLM call span. Immediately surfaces expensive outliers and establishes the budget for subsequent sampling decisions. This is the foundation everything else depends on.
[02]
Prompt Hash Instrumentation (Weeks 2–3)
Add gen_ai.prompt.hash and gen_ai.prompt.version attributes to every LLM call span. Zero inference cost, immediate audit trail. Catches the most common cause of silent quality regressions, unannounced prompt changes, before they compound.
[03]
Heuristic Quality Checks (Weeks 3–5)
Run lightweight rule-based checks on 100% of traffic: format validation, response length bounds, regex patterns for expected structure. Track pass/fail rates as OTel metrics. Detects structural failures cheaply and calibrates what 'normal' looks like before investing in semantic evaluation.
[04]
Judge Calibration Run (Week 5)
Before any LLM-as-judge goes to production, manually rate 50–200 production samples. Compute Cohen's kappa between your judge and human raters. Don't proceed until kappa ≥ 0.6 on balanced criteria. This is not optional.
[05]
Sampled LLM-as-Judge (Weeks 5–8)
Start at 1–2% sample rate with a validated judge. Establish a 2-week rolling baseline before configuring drift alerts. Scale to 5–10% once you trust the scores and know the cost profile. Schedule monthly kappa recalibration against the gold set.

Minimum Viable LLM Observability Checklist

The checklist before you claim production quality observability

Production Quality Observability Checklist

Every request has a unique trace ID propagated across all service boundaries
Every LLM call span records gen_ai.system, gen_ai.request.model, token counts, and estimated cost
Every LLM call span includes gen_ai.prompt.hash and gen_ai.prompt.version
Heuristic checks run on 100% of production traffic; pass/fail rates tracked as OTel metrics
Judge calibrated: Cohen's kappa ≥ 0.6 on 50–200 hand-rated samples before production use
LLM-as-judge running on at least 1% of production traffic with validated scores
2 weeks of quality score history collected before any drift alerts are configured
Alerts configured on 7-day rolling window, 5% drop threshold per quality dimension
Retrieval quality tracked separately from generation quality in RAG pipelines
Escalated and flagged requests evaluated at 100% regardless of sample rate
Prompt hash change + quality correlation runbook documented and accessible
Monthly judge re-calibration scheduled; kappa drop below 0.6 triggers rubric review

When to Skip Semantic Evaluation (and What to Do Instead)

LLM-as-judge is not always the right tool. Know the cases where it wastes money and misleads.

LLM-as-Judge: Apply / Skip Decision Rules

[01]

Apply when outputs are open-ended prose, structured summaries, or conversational responses where correctness isn't binary.

Semantic judges add value when human judgment is genuinely required to assess quality. These are the tasks where heuristics can't cover the full quality surface.

[02]

Skip when the output is deterministic or verifiable — SQL queries, code, JSON extraction against a fixed schema.

Run the SQL. Parse the JSON. Execute the code. A judge's opinion on whether the SQL is correct is slower, more expensive, and less reliable than just executing it and checking the result.

[03]

Skip when you haven't calibrated the judge. Uncalibrated scoring at scale produces data that feels authoritative and guides bad decisions.

An uncalibrated judge with a Cohen's kappa of 0.3 is not a measurement instrument. It's a random number generator with a clean dashboard.

[04]

Apply to 100% of safety-critical or escalated requests regardless of sample rate.

The cost of missing a safety failure is not comparable to the cost of an extra LLM call. Sample rate is a cost optimization for the happy path, not for known-bad or flagged requests.

[05]

Skip judge evaluation for requests where response content can't be safely passed to a third-party model due to data residency or PII constraints.

Route those requests through a self-hosted or on-premise evaluation model, or restrict to heuristic-only checks. Compliance constraints override eval coverage goals.
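For the verifiable-output case in rule [02], the check can be a direct execution against a disposable fixture. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the function name, the fixture schema, and the order-insensitive comparison are assumptions for illustration. Only ever run model-generated SQL against a throwaway database like this one.

```python
import sqlite3


def sql_matches_expected(generated_sql, setup_sql, expected_rows):
    """Run model-generated SQL against an in-memory fixture and compare results.

    Returns False if the query fails to execute or returns different rows.
    Row order is ignored.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)  # build the throwaway fixture
        try:
            rows = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return False  # invalid SQL is a deterministic failure, no judge needed
        return sorted(rows) == sorted(expected_rows)
    finally:
        conn.close()


FIXTURE = """
CREATE TABLE orders (id INTEGER, total REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 25.0);
"""
```

The pass rate of a check like this can be tracked as a metric in the same way as the heuristic checks, with no judge model involved.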

How is LLM quality monitoring different from traditional APM?

Traditional APM measures whether code executed correctly — a function ran, the database responded, the request completed within latency bounds. LLM quality monitoring measures whether outputs were useful — whether the response answered the question, matched the expected format, and was factually coherent. These are orthogonal concerns. A request that completes in 200ms with a 200 OK can still produce an output that's wrong, incoherent, or misformatted. Infrastructure metrics alone cannot capture that distinction.

Can I use OpenTelemetry for quality signals?

Yes, with a clear understanding of what the conventions cover. The OTel GenAI semantic conventions standardize gen_ai.system, gen_ai.request.model, token counts, finish reasons, and latency histograms. As of early 2026, most GenAI attributes are in experimental status. For quality signals, you extend with custom span attributes: quality.score.format, quality.score.relevance, gen_ai.prompt.hash, gen_ai.prompt.version. Collectors pass custom attributes through without modification, so no pipeline changes are needed beyond adding the attributes at instrumentation time.

How do I handle non-determinism when scoring quality?

Don't try to get consistent scores on individual requests — LLM non-determinism means individual scores will vary even for identical inputs. Track score distributions over time and use statistical methods to detect when the distribution shifts. A score of 4.1 today versus 4.0 yesterday is noise. A distribution that was centered at 4.2 last week and is now centered at 3.6 is a meaningful signal. Alert on distribution shift, not point-in-time variance.

At what traffic volume does quality monitoring make sense?

Start immediately, even at low traffic volumes. The value is baseline establishment, not absolute numbers. Two weeks of quality scores at 10 requests per day gives you a baseline to alert against when traffic scales or a prompt change causes degradation. Two weeks of no data gives you nothing to compare to when you actually need it. The cost at low traffic is negligible; the cost of missing a regression while operating blind is not.

Should I store prompt and response content in spans?

PII and data residency requirements usually prevent storing full content in telemetry. Store the hash of the prompt template (not the rendered prompt with user data), response length, a category label, and quality scores. For quality evaluation, pass content through your evaluation pipeline separately — but don't attach raw prompt/response content to spans. Structure your evaluation pipeline so it reads from a secure, access-controlled store, not from your distributed trace backend.

What's the right judge model to use for production evaluation?

Use a smaller, faster model than your production model — Claude Haiku or GPT-4o-mini at current pricing. The judge doesn't need to be smarter than the system being evaluated; it needs to be consistent. A smaller model with a well-designed rubric consistently outperforms a larger model with an underspecified prompt. Cost matters: at 7% sample rate with a Haiku-class judge, eval cost is roughly 1–3% of production inference cost. Verify this estimate with your actual token counts before committing to a sample rate.
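The verification step is simple arithmetic. The sketch below works it through with placeholder numbers: the per-million-token prices and token counts are assumptions chosen for illustration, not quoted vendor pricing, so substitute your own.

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost of one call in USD, given per-million-token prices."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def eval_cost_ratio(sample_rate: float, prod_cost: float, judge_cost: float) -> float:
    """Judge spend as a fraction of production inference spend."""
    return sample_rate * judge_cost / prod_cost


def max_sample_rate(budget_fraction: float, prod_cost: float, judge_cost: float) -> float:
    """Highest sample rate that keeps judge spend within budget_fraction of production spend."""
    return min(1.0, budget_fraction * prod_cost / judge_cost)


# Placeholder inputs: replace with your measured token counts and real prices.
prod = request_cost(1500, 400, usd_per_m_input=3.0, usd_per_m_output=15.0)
judge = request_cost(2000, 30, usd_per_m_input=1.0, usd_per_m_output=5.0)
print(f"{eval_cost_ratio(0.07, prod, judge):.1%}")  # prints 1.4%
```

The judge call reads the query, the response, and the rubric, so its input is often longer than the production prompt even though its output is tiny. That is why the ratio depends on measured token counts and not on the sample rate alone.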

How do I detect model provider updates affecting quality?

Prompt hashing won't catch this — the template didn't change. A continuous quality baseline catches it: if scores shift without a corresponding hash change, the most likely cause is a provider-side model update. Track your quality baseline as a continuous series. Any discontinuity that isn't correlated with a deployment or prompt hash change is a signal to check provider changelogs and consider pinning to a specific model version if the API supports it.
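The attribution logic can be sketched as a lookup against your own change history. The function name and return labels are hypothetical, and the 48-hour lookback is an assumed correlation window borrowed from the prompt-hash alert guidance in this article.

```python
from datetime import datetime, timedelta


def likely_cause(
    shift_at: datetime,
    hash_changes: list[datetime],
    deployments: list[datetime],
    lookback: timedelta = timedelta(hours=48),  # assumed correlation window
) -> str:
    """Attribute a quality discontinuity to the most plausible known change, if any."""

    def preceded_by(events: list[datetime]) -> bool:
        # An event counts only if it happened before the shift and within the window.
        return any(timedelta(0) <= shift_at - event <= lookback for event in events)

    if preceded_by(hash_changes):
        return "prompt_change"
    if preceded_by(deployments):
        return "deployment"
    # Nothing changed on our side: check the provider changelog and model pinning.
    return "suspect_provider_update"
```

This only works if the quality series has no gaps. A baseline that starts after the provider update has nothing to be discontinuous against.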

The teams that catch quality degradation before it escalates to executive complaints share one characteristic: they treated output quality as an engineering concern from the start, not a customer support problem to triage after the fact.

The infrastructure for this isn't complicated. It's a hash function, a sampling decision, a judge model, and the discipline to build baselines before you need them. What's hard is the organizational habit of treating semantic quality as a first-class signal alongside latency and error rate — not as a soft metric that belongs on a different team's dashboard.

Your traces are already recording that requests completed. The question is whether you're recording whether they worked. That gap — between completion and usefulness — is where production LLM quality lives and dies.

Key terms in this piece

LLM observability, LLM quality monitoring, prompt drift detection, sampled production evaluation, LLM-as-judge production, production AI monitoring

Sources

[1] The Complete Guide to LLM Observability & Monitoring in 2026 (Adaline, Feb 2026) (adaline.ai)
[2] AI Production Monitoring: Quality Drift, Hallucinations, Costs (Particula Tech, Feb 2026) (particula.tech)
[3] Production Monitoring Alerts for LLM Quality Drops - Braintrust 24-hour blindspot case (Technivorz, Mar 2026) (technivorz.com)
[4] AI Agent Observability - Evolving Standards and Best Practices (OpenTelemetry, 2025) (opentelemetry.io)
[5] Quality Monitoring: Drift Detection and Regression Alerts for LLMs (Brenndoerfer, Feb 2026) (mbrenndoerfer.com)
[6] Detecting drift in production generative AI applications (AWS Prescriptive Guidance) (docs.aws.amazon.com)
[7] Semantic conventions for generative client AI spans (OpenTelemetry, 2026) (opentelemetry.io)
[8] Semantic conventions for generative AI metrics (OpenTelemetry, 2026) (opentelemetry.io)
[9] LLM-as-Judge Best Practices in 2026: Calibration, Bias, and Cost (FutureAGI, 2026) (futureagi.com)
[10] Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications (arXiv, Nov 2025) (arxiv.org)


What the conventions give you out of the box:

gen_ai.system: the provider (e.g., anthropic, openai, aws.bedrock)
gen_ai.request.model: the exact model name requested
gen_ai.usage.input_tokens / gen_ai.usage.output_tokens: token counts
gen_ai.response.finish_reasons: why generation stopped (stop, length, tool_calls)
gen_ai.client.operation.duration: a histogram of LLM call latencies [8]

What the conventions do not cover:

Output evaluation or quality scoring
Safety and hallucination detection
Prompt template identity or versioning
Intent-level quality tracking

```python
# Full span: OTel GenAI conventions + custom quality attributes on the same span.
# Collectors pass custom attributes through without modification.
import hashlib
import time

from opentelemetry import trace
from opentelemetry.semconv._incubating.attributes import gen_ai_attributes as GenAI

tracer = trace.get_tracer("llm.client", "1.0.0")


def call_with_full_instrumentation(
    template: str,
    rendered: str,
    model: str = "claude-sonnet-4-5",
) -> str:
    template_hash = hashlib.sha256(template.encode()).hexdigest()[:12]
    with tracer.start_as_current_span("gen_ai.client.chat") as span:
        # --- OTel GenAI conventions (standard) ---
        span.set_attribute(GenAI.GEN_AI_SYSTEM, "anthropic")
        span.set_attribute(GenAI.GEN_AI_REQUEST_MODEL, model)

        t0 = time.perf_counter()
        # _do_llm_call is your own provider wrapper (not shown);
        # here it is expected to return (response_text, usage).
        response, usage = _do_llm_call(rendered, model)
        elapsed = time.perf_counter() - t0

        span.set_attribute(GenAI.GEN_AI_USAGE_INPUT_TOKENS, usage.input_tokens)
        span.set_attribute(GenAI.GEN_AI_USAGE_OUTPUT_TOKENS, usage.output_tokens)
        # The conventions define this name as a histogram metric; copying the
        # value onto the span is a convenience for trace-level queries.
        span.set_attribute("gen_ai.client.operation.duration", elapsed)

        # --- Custom quality attributes (not in conventions; add yourself) ---
        span.set_attribute("gen_ai.prompt.hash", template_hash)
        span.set_attribute("gen_ai.prompt.version", "v2.4.1")
        span.set_attribute("gen_ai.request.intent", "summarization")
        # quality.score.* populated by async evaluator after sampling gate
        return response
```

```python
# Judge runs on 7% of traffic. Errored and escalated requests bypass the gate entirely.
# Calibration: run this against 50-200 hand-rated samples and check Cohen's kappa >= 0.6.
import json
import random

from anthropic import Anthropic
from opentelemetry import trace

tracer = trace.get_tracer("quality.evaluator")
client = Anthropic()
SAMPLE_RATE = 0.07


def judge_prompt(query: str, response: str, intent: str) -> str:
    return (
        f"Task type: {intent}\n"
        "Rate this response on three dimensions (1-5 each):\n"
        "- format: does the response match the expected structure?\n"
        "- relevance: does it address the query accurately?\n"
        "- coherence: is it internally consistent and factually plausible?\n"
        f"Query: {query}\nResponse: {response}\n"
        "Reply JSON only: {\"format\": N, \"relevance\": N, \"coherence\": N}"
    )


def sample_and_evaluate(
    trace_id: str, query: str, response: str, intent: str, is_error: bool = False
) -> dict | None:
    # Errors are always evaluated; the sample rate doesn't apply to them.
    if not is_error and random.random() > SAMPLE_RATE:
        return None
    with tracer.start_as_current_span("quality.judge") as span:
        span.set_attribute("quality.trace_ref", trace_id)
        span.set_attribute("quality.intent", intent)
        span.set_attribute("quality.forced", is_error)  # distinguish forced evals
        result = client.messages.create(
            model="claude-haiku-4-5",
            max_tokens=64,
            messages=[{"role": "user", "content": judge_prompt(query, response, intent)}],
        )
        try:
            scores = json.loads(result.content[0].text)
        except json.JSONDecodeError:
            # The judge can ignore "JSON only"; record the miss instead of crashing.
            span.set_attribute("quality.judge.parse_error", True)
            return None
        for dim, score in scores.items():
            span.set_attribute(f"quality.score.{dim}", score)
        return scores
```
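The calibration check named in the judge's header comment (Cohen's kappa >= 0.6 against hand-rated samples) can be computed without extra dependencies. This is a sketch of the unweighted statistic; the function name is an assumption. It treats a 4-versus-5 disagreement the same as a 1-versus-5 disagreement, so a weighted variant may suit a 1-5 ordinal scale better.

```python
from collections import Counter


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Unweighted Cohen's kappa between two equal-length lists of ratings."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(human)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Run it once per scoring dimension on the gold set. A judge can agree well with humans on format and poorly on coherence, and a single pooled number hides that.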

```python
# Hash the template before rendering. The hash is the signal; the rendered prompt is the user's data.
import hashlib

from opentelemetry import trace

tracer = trace.get_tracer("llm.client")


def prompt_hash(template: str) -> str:
    return hashlib.sha256(template.encode()).hexdigest()[:12]


def call_llm_with_tracing(
    template: str, template_version: str, rendered: str
) -> str:
    with tracer.start_as_current_span("gen_ai.request") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.prompt.hash", prompt_hash(template))
        span.set_attribute("gen_ai.prompt.version", template_version)
        # Send the rendered prompt to the model. Hash stays on the span.
        return _do_llm_call(rendered)
```

| Failure Class | Caught by Latency/Error? | Caught by Prompt Hash? | Caught by Sampled Eval? | Caught by Embedding Drift? |
| --- | --- | --- | --- | --- |
| Template edit removed instruction | | Yes: hash changes | Yes: scores drop | |
| Provider model update changed behavior | | No: template unchanged | Yes: if baseline is continuous | |
| Input distribution shifted (new query types) | | | Partially: if sampled queries reflect new distribution | Yes |
| RAG knowledge base staleness | | | Partially: coherence score drop | Yes: retrieval score variant |
| Format regression (JSON → prose) | | Yes (if instruction removed) | Yes: format score | |
| Response truncation / length collapse | Token count alert fires | | Yes: coherence/relevance drop | |

```python
# Deterministic checks on 100% of traffic. No inference cost.
# Pass/fail rates tracked as metrics; spikes are early drift signals.
import json
import re

from opentelemetry import metrics

meter = metrics.get_meter("quality.heuristics")
format_fail_counter = meter.create_counter(
    "quality.heuristic.format_fail",
    description="Count of responses failing format validation",
)
refusal_counter = meter.create_counter(
    "quality.heuristic.refusal_detected",
    description="Count of responses with refusal patterns",
)

REFUSAL_PATTERNS = re.compile(
    r"(I'm sorry|I can't|As an AI|I cannot|I am not able)",
    re.IGNORECASE,
)
# Word-count bounds per intent.
LENGTH_BOUNDS = {"summarization": (80, 400), "qa": (20, 600), "extraction": (10, 200)}


def heuristic_check(response: str, intent: str, expected_schema: dict | None = None) -> dict:
    results = {}

    # JSON format validation. This only checks that the response parses;
    # expected_schema acts as a flag and is not validated against here.
    if expected_schema:
        try:
            json.loads(response)
            results["format_ok"] = True
        except json.JSONDecodeError:
            results["format_ok"] = False
            format_fail_counter.add(1, {"intent": intent})

    # Refusal detection
    results["refusal"] = bool(REFUSAL_PATTERNS.search(response))
    if results["refusal"]:
        refusal_counter.add(1, {"intent": intent})

    # Length bounds
    word_count = len(response.split())
    lo, hi = LENGTH_BOUNDS.get(intent, (0, 10_000))
    results["length_ok"] = lo <= word_count <= hi
    results["word_count"] = word_count
    return results
```

| Signal | Threshold | Window | First Action |
| --- | --- | --- | --- |
| Quality score drop | Any dimension falls 5%+ below baseline | 7-day rolling | Pull traces from degradation period; check prompt hash timeline |
| Prompt hash change | Any unexpected hash change outside a deployment window | Per-request | Verify change was intentional; correlate with next 48h quality trend |
| Retrieval relevance drop (RAG) | Retrieval score falls 10%+ below baseline | 3-day rolling | Check knowledge base freshness; inspect retrieval pipeline config |
| Hallucination / refusal rate spike | Flagged rate exceeds 2× historical baseline | Daily | Immediate escalation; roll back most recent prompt or model changes |
| Format validation failure rate | Rises above 2× baseline (e.g., 0.1% → 0.2%) | 1-hour rolling | Check for provider-side model update; inspect template for regressions |
| Judge kappa drop (monthly calibration) | Kappa falls below 0.6 on gold set | Monthly re-run | Suspend scoring trust; re-rate gold set and update judge rubric |
| Cost-per-quality ratio | Spend per quality point rises 30%+ week-over-week | Weekly | Review model selection and prompt efficiency; check for context bloat |
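Several of these thresholds reduce to two comparisons against a baseline: a relative drop or a multiplicative spike. The sketch below encodes a few rows that way. The metric names and the RULES layout are hypothetical; the threshold values come from the table, and the windowing (7-day, 3-day, hourly) is assumed to happen upstream when the baseline and current values are aggregated.

```python
# metric name -> (rule kind, threshold). Names are illustrative placeholders.
RULES: dict[str, tuple[str, float]] = {
    "quality.score.format": ("drop", 0.05),      # 5%+ below baseline
    "quality.score.relevance": ("drop", 0.05),
    "retrieval.relevance": ("drop", 0.10),       # 10%+ below baseline
    "refusal_rate": ("spike", 2.0),              # exceeds 2x baseline
    "format_fail_rate": ("spike", 2.0),
}


def fired_alerts(baseline: dict[str, float], current: dict[str, float]) -> list[str]:
    """Return the metrics whose current value breaches its rule against the baseline."""
    alerts = []
    for metric, (kind, threshold) in RULES.items():
        if metric not in baseline or metric not in current:
            continue  # no data for this metric in one of the windows
        base, now = baseline[metric], current[metric]
        if kind == "drop" and base > 0 and (base - now) / base >= threshold:
            alerts.append(metric)
        elif kind == "spike" and now > threshold * base:
            alerts.append(metric)
    return alerts
```

Keeping the rules in data rather than in branches makes it easier to route each metric to a different owner, which is what stops a format regression from paging the whole team.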

What This Article Covers

Distributed Tracing Assumed Determinism. LLMs Broke That Assumption.

What OpenTelemetry Actually Covers — and Where You're On Your Own

Four Signals That Catch What Latency Cannot

Sampled Evals: A Judge on Your Live Traffic

Calibrate the Judge Before You Trust the Scores

Prompt Drift Is the Default. Hash It or Stop Pretending.

When Input Distribution Shifts: Embedding Drift Detection

The Pipeline From Live Traffic to Page

Heuristic Checks: Free Signal on 100% of Traffic

Thresholds That Page the Right Person, Not the Whole Team

Build Order: Why Cost Tracking Comes First

Trace IDs + Cost Tracking (Weeks 1–2)

Prompt Hash Instrumentation (Weeks 2–3)

Heuristic Quality Checks (Weeks 3–5)

Judge Calibration Run (Week 5)

Sampled LLM-as-Judge (Weeks 5–8)

Minimum Viable LLM Observability Checklist

Production Quality Observability Checklist

When to Skip Semantic Evaluation (and What to Do Instead)

LLM-as-Judge: Apply / Skip Decision Rules

Apply when outputs are open-ended prose, structured summaries, or conversational responses where correctness isn't binary.

Skip when the output is deterministic or verifiable — SQL queries, code, JSON extraction against a fixed schema.

Skip when you haven't calibrated the judge. Uncalibrated scoring at scale produces data that feels authoritative and guides bad decisions.

Apply to 100% of safety-critical or escalated requests regardless of sample rate.

Skip judge evaluation for requests where response content can't be safely passed to a third-party model due to data residency or PII constraints.
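These rules can be written as a single gate in front of the judge. The sketch below is one possible encoding: the class, field names, and especially the precedence between rules are assumptions, since the rules above do not say which wins when two conflict. Here the data-boundary and calibration rules are checked first, then verifiability, then the forced-evaluation rule.

```python
from dataclasses import dataclass


@dataclass
class JudgeContext:
    output_is_verifiable: bool          # SQL, code, JSON against a fixed schema
    judge_calibrated: bool              # kappa check passed on the gold set
    safety_critical_or_escalated: bool
    content_may_leave_boundary: bool    # False under PII or residency constraints
    passed_sampling_gate: bool          # outcome of the random sample decision


def should_run_judge(ctx: JudgeContext) -> bool:
    """Apply the apply/skip rules in an assumed order of precedence."""
    if not ctx.content_may_leave_boundary:
        return False  # content can't be sent to a third-party model
    if not ctx.judge_calibrated:
        return False  # uncalibrated scores mislead more than they inform
    if ctx.output_is_verifiable:
        return False  # use a deterministic check instead
    if ctx.safety_critical_or_escalated:
        return True   # always evaluated, regardless of sample rate
    return ctx.passed_sampling_gate
```

If your policy is that safety-critical requests must be judged even when the output is verifiable, move that check above the verifiability rule.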

