LLM Evals: The Pipeline Your APM Cannot See

The 200 OK Is Not the Output. Your APM Cannot See the Failure.

The dashboard goes green while the model invents a refund policy. Status codes are not a quality signal for generative output. The fix is an eval stack: CI gates, judge models, sampled production scoring, and a dataset that compounds with every failure.

AI Engineering Platform · advanced · Nov 15, 2025 · 7 min read

By Viktor Bezdek · VP Engineering, Groupon

200 OK in 1.2 seconds. The dashboard is green. The model invented a refund policy that does not exist, quoted the wrong deadline, and shipped code that compiled cleanly while corrupting a column in production.

The HTTP layer never saw any of it. To latency monitoring, a fabricated refund and a correct one are identical bytes. Status codes are not a quality signal for generative output. The output is the system. Evaluating that output — continuously, before users become your QA team — is the only layer that closes the gap.

This is the shape of that layer when it is built to ship.

Determinism Is the Property You Are Trading Away

The property that makes the model useful is the same property that breaks the test contract you used to ship behind.

A unit test asserts equality. Input X returns Y, every run, or the test fails. The contract is binary. The contract is stable.

Model output is neither. It drifts between runs at fixed temperature. It reacts to phrasing the test author never anticipated. It is frequently correct in three different ways at once. A response can be factually right and uselessly long. Concise and dangerously incomplete. Authoritative in tone and entirely fabricated in substance.

The binary contract does not survive that surface area. What replaces it is graded scoring across quality dimensions — measured continuously, calibrated against human labels, owned as infrastructure rather than checked once before launch.

Deterministic Contract

Same input maps to the same output, every run
Assertions resolve binary — pass or fail
Tests stay valid until the function changes
APM and error rates surface most real failures
Coverage tracks lines and branches

Graded Output

Outputs drift between runs at fixed temperature and seed
Scores on faithfulness, safety, completeness — graded, not binary
The dataset has to grow every time production reveals a new failure mode
Wrong answers return clean 200s; the dashboard is no longer the signal
Coverage tracks failure-mode surface area, not code paths
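The contrast between the two contracts can be sketched in a few lines. This is a minimal illustration, assuming some upstream scorer already produces per-dimension scores between 0 and 1. The dimension names, the `FLOORS` values, and the `GradedResult` type are hypothetical, not recommended thresholds.

```python
from dataclasses import dataclass

# Hypothetical per-dimension floors; tune them against human-labeled samples.
FLOORS = {"faithfulness": 0.85, "safety": 0.95, "completeness": 0.75}


@dataclass(frozen=True)
class GradedResult:
    scores: dict[str, float]  # dimension name -> score in [0, 1]

    def failures(self, floors: dict[str, float]) -> list[str]:
        """Return the dimensions that fall below their floor, sorted by name."""
        # A missing dimension counts as a failure rather than a silent pass.
        return sorted(
            name for name, floor in floors.items()
            if self.scores.get(name, 0.0) < floor
        )


def deterministic_check(actual: str, expected: str) -> bool:
    """The classic unit-test contract: exact equality or nothing."""
    return actual == expected


# Two differently worded answers can both be acceptable.
a = "Refunds are accepted within 30 days of purchase."
b = "You can return the item for a refund up to 30 days after buying it."
print(deterministic_check(a, b))  # False, although both answers may be correct

result = GradedResult({"faithfulness": 0.91, "safety": 0.99, "completeness": 0.62})
print(result.failures(FLOORS))  # ['completeness']
```

The graded result does not say "pass" or "fail" on its own. It says which dimension slipped, which is the part a binary assertion cannot express.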

Most Teams Are at Level 1. The Field Is Operating at Level 3.

Where the maturity curve actually inflects — and why the gap between rungs is widening.

Roughly 57% of organizations report agents in production, and quality is the cited blocker for further rollout — 32% of respondents in LangChain's 2026 State of AI Agents report name it as the top barrier^[2]. The same survey shows most teams stuck at Level 0 or Level 1: manual review, no automation, no systematic coverage of the failure surface.

The ladder, and what each rung actually buys you:

Level	What it actually does	Typical coverage	What it cannot catch
0 — YOLO	Ship. Watch user complaints in support channels.	0%	Everything
1 — Spot checks	Engineer eyeballs a handful of outputs before the deploy.	~10%	Regressions, anything not sampled, anything not noticed
2 — Deterministic gates	Rule-based CI checks: schema, banned phrases, length bounds.	~30%	Semantic quality, faithfulness, anything a regex misses
3 — LLM-as-judge in CI	Automated scoring on the golden dataset before every deploy.	~70%	Drift in production, novel failures, distribution shift
4 — Continuous eval	Production traffic sampled and scored; alerts on degradation.	~90%	Adversarial inputs, red-team coverage, intentional jailbreaks
5 — Eval flywheel	Production failures route back into the dataset on contact.	95%+	Mostly tuning thresholds and watching for judge drift

Most teams reading this sit at Level 1 or 2. Reliability starts to meaningfully improve at Level 3 — LLM-as-judge in CI. The jump from 3 to 4 is where production failures stop surprising you.

The path is not complicated. It requires treating the eval dataset with the same review discipline as production code.

A caution before the climb. One team spent three months on a Level 5 flywheel before discovering 80% of their quality problems traced to a single ambiguous system prompt. The pipeline faithfully measured a broken baseline. Fifty golden examples and two weeks of manual output review will teach you more than instrumentation pointed at the wrong target. Build the dataset before the harness.

The Pipeline Is a Loop, Not a Gate

What the full eval surface looks like, from pre-commit to production traffic and back into the dataset.

Eval Pipeline: CI Gate, Production Sampling, Dataset Flywheel

Failures route back into the dataset on contact. The flywheel is the difference between coverage that compounds and coverage that stagnates.

The structural point: the pipeline is a loop, not a linear gate. Production failures route back into the eval dataset. The dataset grows sharper every time the system fails in a new way. This is the flywheel. Without it, eval coverage stagnates the day you stop adding examples by hand.

LLM-as-Judge Is the Only Thing That Scales — Until Bias Eats the Score

Using a model to grade a model is the workhorse of semantic eval. Its failure modes are systematic, not random.

LLM-as-judge is the practice of using a capable model — usually stronger or separately tuned — to score the outputs of your application model. It is the only semantic eval mechanism that scales to production traffic without crippling cost.

The economics are decisive. Roughly 500x to 5000x cheaper than full human review, with about 80% agreement against human preferences in calibrated settings^[6] — comparable to how often two humans agree with each other on the same task. After calibration against human labels, precision and recall can approach 0.9 in some domains, though numbers vary by task complexity and domain specificity^[1].
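What "agreement", "precision", and "recall" mean for a judge can be made concrete with a small calculation. The sketch below assumes both the human reviewer and the judge emit a boolean per output, where True means "flagged as a failure". The function name and the sample labels are illustrative only.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Compare judge verdicts with human labels on the same outputs.

    True means "flagged as a failure" in both lists (an assumption of this sketch).
    """
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(human, judge))
    tp = sum(h and j for h, j in pairs)          # both flagged
    fp = sum((not h) and j for h, j in pairs)    # judge flagged, human did not
    fn = sum(h and (not j) for h, j in pairs)    # human flagged, judge missed
    agree = sum(h == j for h, j in pairs)
    return {
        "agreement": agree / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


human_labels = [True, True, False, False, True, False, False, False, True, False]
judge_labels = [True, False, False, False, True, True, False, False, True, False]
metrics = judge_agreement(human_labels, judge_labels)
print(metrics)  # agreement 0.8, precision 0.75, recall 0.75
```

Agreement alone can look healthy while recall on real failures is poor, so it is worth tracking all three on the calibration set.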

The failure modes are systematic, not random. Plan for them or the score becomes theater.

Known biases that distort the score

Position bias: models including GPT-4 show roughly 40% inconsistency depending on which response appears first in a pairwise comparison
Verbosity bias: longer responses can receive approximately 15% score inflation regardless of underlying quality
Self-enhancement bias: models rate their own outputs 5–7% higher than equivalent outputs from other models in controlled studies
Domain gaps: agreement with human annotation drops 10–15% in specialized technical domains the judge was not trained against

Mitigations that hold up under load

✓ Chain-of-thought in the judge prompt: reason first, then score. Improves reliability 10–15% and gives you a debuggable trace
✓ Run 3–5 judges and majority-vote on critical evaluations. Cuts bias 30–40% at a cost you can budget
✓ Calibrate against human-labeled samples before any threshold becomes a deployment gate
✓ Maintain regression tests on the judge itself. Judge drift is a failure mode, not a quirk
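Two of these ideas, countering position bias and voting across judges, can be sketched without any model calls. The example assumes a pairwise judge is a callable that returns "first" or "second". The stub judges stand in for real model clients, and order swapping is one common way to handle position bias rather than something the list above prescribes.

```python
from collections import Counter
from typing import Callable

# A pairwise judge takes (prompt, first, second) and returns "first" or "second".
PairwiseJudge = Callable[[str, str, str], str]


def debiased_preference(judge: PairwiseJudge, prompt: str, a: str, b: str) -> str:
    """Ask twice with the order swapped; only a consistent winner counts."""
    forward = judge(prompt, a, b)    # a is shown first
    backward = judge(prompt, b, a)   # b is shown first
    winner_forward = "a" if forward == "first" else "b"
    winner_backward = "b" if backward == "first" else "a"
    return winner_forward if winner_forward == winner_backward else "tie"


def majority_vote(judges: list[PairwiseJudge], prompt: str, a: str, b: str) -> str:
    """Majority over several judges; anything short of a strict majority is a tie."""
    votes = Counter(debiased_preference(j, prompt, a, b) for j in judges)
    label, count = votes.most_common(1)[0]
    return label if count > len(judges) / 2 else "tie"


# Stub judges standing in for real model calls (hypothetical behavior).
def always_first(prompt: str, first: str, second: str) -> str:
    return "first"  # pure position bias


def prefers_longer(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


def prefers_shorter(prompt: str, first: str, second: str) -> str:
    return "first" if len(first) <= len(second) else "second"


print(debiased_preference(always_first, "q", "x", "yy"))                           # tie
print(majority_vote([prefers_longer, prefers_longer, always_first], "q", "x", "yy"))  # b
```

A judge driven purely by position collapses to "tie" once the order is swapped, which keeps it from tilting the vote.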

Six Dimensions. Anything Less Hides a Failure Class.

Output quality, behavioral safety, operational efficiency. Skip a layer and a failure mode walks straight through the gate.

Which metrics depend on what the application does. Every production LLM system needs at least three layers covered: output quality, behavioral safety, operational efficiency. The dimensions below are the minimum surface.

Faithfulness

Does the output stay grounded in the source context? The first thing RAG systems lose.

Safety

Does the output avoid harmful, biased, or policy-violating content under adversarial input?

Relevance

Does the output address what was asked? Tone-deaf answers score low here regardless of accuracy.

Efficiency

Step count, token count, wall-clock. Agents can succeed expensively. Cost is observability.

Completeness

Does the output cover every sub-goal? Partial answers are silent failures the user never reports.

Consistency

Does quality hold across phrasing and context drift? If not, the dataset is too narrow.

Agent systems need one more dimension: trajectory quality. Not just whether the agent landed on the right answer, but whether it took a defensible path. An agent that completes a five-step task in fifty steps is not a win. It is a reliability risk wearing a green check mark.
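One simple way to put a number on path quality is to compare the agent's step count with a reference path. This is a sketch under stated assumptions: the reference step count comes from a human-written solution, and the 0.5 efficiency floor is an arbitrary illustration.

```python
def step_efficiency(actual_steps: int, reference_steps: int) -> float:
    """Ratio of the reference path length to the path taken, capped at 1.0."""
    if actual_steps <= 0 or reference_steps <= 0:
        raise ValueError("step counts must be positive")
    return min(1.0, reference_steps / actual_steps)


def trajectory_ok(reached_goal: bool, actual_steps: int, reference_steps: int,
                  min_efficiency: float = 0.5) -> bool:
    """A run passes only if it reached the goal on a reasonably direct path."""
    return reached_goal and step_efficiency(actual_steps, reference_steps) >= min_efficiency


print(step_efficiency(50, 5))      # 0.1
print(trajectory_ok(True, 50, 5))  # False: right answer, indefensible path
print(trajectory_ok(True, 6, 5))   # True
```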

The Golden Dataset Is Production Code. Treat It That Way.

Every eval score is bounded by the dataset behind it. The dataset is the most undervalued asset in the eval stack.

The golden dataset is the set of (input, expected behavior) pairs CI runs before every deploy. It is the highest-leverage asset in the eval stack and the most commonly neglected.

Golden datasets fail in two ways. They are too small to catch regressions reliably. Or they go stale as the application evolves and nobody owns the curation. Both are expensive. Both are silent until production breaks.

[01]
Seed with adversarial inputs, not happy paths
The first 50 examples should be hostile to the model: ambiguous inputs, boundary conditions, multi-step instructions, intent collisions. Happy paths do not catch regressions. They confirm the model still works on inputs you already shipped against.
[02]
Specify success criteria a judge can verify
For each example, define 'correct' in terms a judge model can score. 'Good response' is not a criterion. 'Response cites at least one source, stays within scope, and avoids speculative claims' is. Vague criteria propagate vague scores.
[03]
Version it like production code
The dataset lives in version control. Additions and removals route through PR review. Each entry carries the reasoning for why it is there — without it, future maintainers cannot judge whether to retire the row.
[04]
Route every production failure back into the dataset within 24 hours
Every meaningful production failure is a row addition. Triage the failure, confirm the expected behavior, write the (input, criteria) pair, ship it. The 24-hour ceiling is the discipline that makes coverage compound. Skip it twice and the flywheel stops turning.
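The four practices above suggest a row shape and an append path. The sketch below is one possible layout, assuming the dataset is a JSONL file in the repository. The `GoldenExample` fields, the id format, and the sample row are hypothetical.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import date
from pathlib import Path


@dataclass
class GoldenExample:
    id: str
    input: str
    criteria: list[str]   # each item should be something a judge can verify
    rationale: str        # why the row exists, so it can be retired later
    source: str = "production"
    added: str = field(default_factory=lambda: date.today().isoformat())


def append_example(path: Path, example: GoldenExample) -> None:
    """Append one row to a JSONL dataset, refusing vague or duplicate entries."""
    if not example.criteria or not example.rationale.strip():
        raise ValueError("every row needs criteria and a rationale")
    existing_ids = set()
    if path.exists():
        with path.open(encoding="utf-8") as fh:
            existing_ids = {json.loads(line)["id"] for line in fh if line.strip()}
    if example.id in existing_ids:
        raise ValueError(f"duplicate id: {example.id}")
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(example), ensure_ascii=False) + "\n")


# Usage against a throwaway file; a real dataset would live in version control.
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp) / "golden.jsonl"
    row = GoldenExample(
        id="refund-window-001",
        input="Can I return shoes I bought 45 days ago?",
        criteria=[
            "States the refund window given in the provided policy text",
            "Does not promise a refund outside that window",
        ],
        rationale="Model invented a refund policy in production.",
    )
    append_example(dataset, row)
    try:
        append_example(dataset, row)
        duplicate_rejected = False
    except ValueError:
        duplicate_rejected = True
    rows = [json.loads(line) for line in dataset.read_text(encoding="utf-8").splitlines()]
```

Keeping the rationale mandatory at write time is what makes later PR review of removals possible.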

Wire the Eval Into CI/CD as a Hard Gate, Not a Suggestion

If quality is advisory, it loses every fight against deadline pressure. Make regression harder than fix.

The goal is to make shipping a regression harder than fixing it. CI fails the build when scores drop below threshold. Not a comment on the PR. A red X. The engineer cannot merge through it without an explicit override.

A practical CI eval job:

.github/workflows/eval.yml

name: LLM Quality Gate

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/agents/**'
      - 'src/rag/**'

jobs:
  eval:
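    runs-on: ubuntu-latest
    # The PR comment step needs write access to pull requests;
    # checkout needs read access to repository contents.
    permissions:
      contents: read
      pull-requests: write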
    steps:
      - uses: actions/checkout@v4

      - name: Run golden dataset eval
        run: |
          python scripts/run_eval.py \
            --dataset evals/golden.jsonl \
            --model ${{ vars.APP_MODEL }} \
            --judge ${{ vars.JUDGE_MODEL }} \
            --threshold 0.82
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Upload eval report
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval_results/
        if: always()

      - name: Comment scores on PR
        # Run even when the eval step fails, so the failing scores reach the PR.
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const summary = JSON.parse(fs.readFileSync('eval_results/summary.json', 'utf8'));
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Eval Results\n\n` +
                    `| Metric | Score | Threshold | Status |\n` +
                    `|--------|-------|-----------|--------|\n` +
                    summary.metrics.map(m =>
                      `| ${m.name} | ${m.score.toFixed(2)} | ${m.threshold} | ${m.pass ? '✅' : '❌'} |`
                    ).join('\n')
            });
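The workflow calls `scripts/run_eval.py`, which the article does not show. The sketch below covers only the gating half of such a script, assuming the per-example scores have already been computed. The function names are hypothetical, and the summary layout is inferred from the fields the comment step reads (`name`, `score`, `threshold`, `pass`).

```python
import json
import tempfile
from pathlib import Path
from statistics import mean


def build_summary(per_metric_scores: dict[str, list[float]], threshold: float) -> dict:
    """Average each metric and mark it pass or fail against one shared threshold."""
    metrics = []
    for name, scores in sorted(per_metric_scores.items()):
        score = mean(scores)
        metrics.append(
            {"name": name, "score": score, "threshold": threshold, "pass": score >= threshold}
        )
    # An empty metric list is treated as a failure, not a silent pass.
    return {"metrics": metrics, "passed": bool(metrics) and all(m["pass"] for m in metrics)}


def write_summary(summary: dict, out_dir: Path) -> int:
    """Write summary.json and return the process exit code for CI."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "summary.json").write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return 0 if summary["passed"] else 1


summary = build_summary(
    {"faithfulness": [0.9, 0.8], "completeness": [0.7, 0.8]}, threshold=0.82
)
with tempfile.TemporaryDirectory() as tmp:
    exit_code = write_summary(summary, Path(tmp))
    written = json.loads((Path(tmp) / "summary.json").read_text(encoding="utf-8"))
print(exit_code)  # 1, because completeness averages 0.75
```

In a real script the return value would be passed to `sys.exit`, since a non-zero exit code is what turns the check red.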

Production Monitoring Is the Layer Most Teams Skip — and Where Real Failures Live

CI gates are necessary, not sufficient. Real users find failure modes the dataset never anticipated.

Pre-deploy evals catch regressions against known behavior. They cannot catch what breaks when real users hit the system in ways the dataset never modeled — which is, predictably, every day.

Production monitoring for LLM quality is not latency monitoring with extra steps. There is no health endpoint. The pattern is asynchronous sampling, off the critical path, scored by a judge, alerted on sustained score degradation.

The load-bearing word is sustained. Individual outputs vary. A single low score is noise. A 3% drop in faithfulness across 500 sampled requests over 48 hours is signal.
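The difference between noise and signal can be written as a small rule. This sketch assumes the 3% figure is a relative drop in the mean score and that the baseline comes from an earlier healthy window. Both are interpretations, and the function name and defaults are hypothetical.

```python
from statistics import mean


def sustained_drop(baseline: list[float], window: list[float],
                   min_samples: int = 500, max_relative_drop: float = 0.03) -> bool:
    """True when the window mean sits more than max_relative_drop below baseline."""
    if len(window) < min_samples or not baseline:
        return False  # too little evidence; treat it as noise
    base = mean(baseline)
    if base <= 0:
        return False
    return (base - mean(window)) / base > max_relative_drop


healthy = [0.90] * 500
print(sustained_drop(healthy, [0.86] * 500))  # True: about a 4.4% relative drop
print(sustained_drop(healthy, [0.89] * 500))  # False: about 1.1%
print(sustained_drop(healthy, [0.50] * 10))   # False: ten samples are not a trend
```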

5–10%

Sampling rate for async production eval — a practical starting point; tune to volume and judge cost

48h

Rolling window for trend-based alerting — calibrate against deploy cadence and traffic volume

~3%

Score degradation worth investigating — set the floor against the application's quality tolerance, not a default

~0.9

Human-agreement precision achievable after judge calibration in favorable conditions — domain-specific results vary

Async Eval: Scoring Off the Critical Path

Sampling happens after the response ships. Zero latency added to the user. Trend analysis absorbs single-output noise.
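One way to pick the sample is to hash the request id, so the decision is stable across retries and needs no shared state. This is a sketch, assuming every request carries a unique string id. The function name and the 10% default are illustrative.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.10) -> bool:
    """Deterministically select roughly `rate` of requests by hashing the id."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


sampled = sum(should_sample(f"req-{i}") for i in range(10_000))
print(sampled)  # close to 1,000 of 10,000
```

The selected requests would then be pushed to a queue and scored by the judge after the response has been returned, so the user never waits on the eval.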

Two Roles Automated Judges Cannot Take From Humans

Automation handles volume. Humans calibrate the judge and discover the failure modes the dataset never modeled.

The temptation is to automate everything and never read the outputs again. Resist it.

Human review carries two roles automated judges cannot. The first is calibration: judge scores are only meaningful relative to human judgment. Stop sampling real outputs and the calibration drifts, the scores become self-referential noise, and the dashboard turns into theater. The second is novel failure discovery: automated judges only catch failure modes you have already defined. Humans catch the weird ones — outputs technically within spec, clearly wrong in context.

A lightweight weekly review workflow:

Weekly Operator Review — Six Verifiable States

20–30 randomly sampled production outputs reviewed across every quality dimension
Every near-threshold judge score reviewed — borderline cases reveal calibration drift first
5–10 outputs labeled to re-calibrate the judge against current human preference
Novel failure patterns surfaced — outputs that pass current metrics but fail context
Confirmed failures added to the golden dataset with explicit, judge-verifiable criteria
Judge consistency check: 10 previously scored examples re-run, drift logged
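The last item, the judge consistency check, reduces to comparing old and new scores for the same examples. The sketch below assumes scores are stored per example id. The 0.05 tolerance and the function name are hypothetical.

```python
def judge_drift(previous: dict[str, float], current: dict[str, float],
                tolerance: float = 0.05) -> dict:
    """Compare re-run judge scores with the scores logged earlier."""
    shared = sorted(previous.keys() & current.keys())
    if not shared:
        raise ValueError("no overlapping example ids to compare")
    deltas = {k: current[k] - previous[k] for k in shared}
    return {
        "mean_abs_drift": sum(abs(d) for d in deltas.values()) / len(shared),
        "drifted_ids": [k for k in shared if abs(deltas[k]) > tolerance],
    }


logged = {"ex-01": 0.80, "ex-02": 0.90, "ex-03": 0.50}
rerun = {"ex-01": 0.80, "ex-02": 0.70, "ex-03": 0.52}
report = judge_drift(logged, rerun)
print(report["drifted_ids"])  # ['ex-02']
```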

Tools at Each Layer — and Where They Stop Being Interchangeable

The ecosystem matured. The harder problem is choosing the right tool per layer and connecting them coherently.

The ecosystem matured. Almost no team needs to build eval infrastructure from scratch. The harder problem is selecting the right tool per layer and wiring them so the dataset, judge, and production sampler share a single source of truth.

Tool	Primary strength	Where it fits
DeepEval	Pre-built metric library, CI integration, unit-test-style API	Teams starting from zero that need fast time-to-coverage
RAGAS	RAG-specific metrics — faithfulness, context relevance, answer relevance	Any system with retrieval. The de facto standard for RAG evals
W&B Weave	Eval playground, golden dataset management, experiment tracking	Teams already on W&B for training that want unified observability
LangSmith	Trace-level debugging, production traffic replay, human annotation	LangChain users that need end-to-end trace visibility
Confident AI	Hosted regression suites, A/B testing, dataset versioning	Teams that want managed eval infrastructure rather than building it
Datadog LLM Observability	Eval scoring next to APM, unified alerting surface	Teams on Datadog that need AI quality alongside infra metrics

From Level 0 to Meaningful CI Coverage in Two Weeks

What two engineers can ship in a sprint. The gap between no evals and useful evals is smaller than it feels.

The gap between no evals and useful evals is smaller than it looks from outside. Two engineers in two weeks can land Level 2 and meaningful CI coverage. The plan below is the minimum viable bootstrap.

[01]

Audit the worst failures already on record (Day 1–2)

bash

# Pull 30 days of logged requests where users flagged the response.
# No log? Start logging today. The dataset starts the moment you do.
# jq -c writes one compact object per line, so the output is valid JSONL.
grep 'user_feedback=negative' logs/app.jsonl | \
  jq -c '{input: .prompt, output: .completion, issue: .feedback_note}' \
  > evals/failure_review.jsonl

[02]
Write 50 golden examples drawn from those failures (Day 3–5)

[03]

Land three deterministic checks in CI (Day 6–7)

python

def check_output(output: str, criteria: dict) -> dict[str, bool]:
    """Run three deterministic checks; each value is True when the check passes."""
    checks = {}
    text = output.lower()
    # Length bound: silent verbosity and silent truncation both cost trust.
    checks['length'] = criteria['min_words'] <= len(output.split()) <= criteria['max_words']
    # Banned content: enforced here, not advised in the prompt.
    # Both sides are lowercased so mixed-case phrases in the criteria still match.
    checks['safe'] = not any(phrase.lower() in text for phrase in criteria.get('banned_phrases', []))
    # Required elements: partial answers are silent failures. Matching is case-sensitive.
    checks['complete'] = all(req in output for req in criteria.get('required_elements', []))
    return checks

[04]
Add LLM-as-judge scoring for semantic quality (Day 8–10)
[05]
Wire both into CI behind a fail threshold (Day 11–12)
[06]
Stand up async production sampling (Day 13–14)
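For the LLM-as-judge step, the moving parts are a prompt that asks for reasoning before the score and a parser that survives free-form text. The sketch below assumes a 1 to 5 scale and a final `SCORE:` line. The template wording and function names are hypothetical, and the call to the judge model itself is left out.

```python
import re
from typing import Optional

JUDGE_TEMPLATE = """You are grading an assistant's response.

Criteria:
{criteria}

Question:
{question}

Response:
{response}

First explain your reasoning step by step.
Then end with a final line of the form: SCORE: <integer from 1 to 5>"""


def build_judge_prompt(question: str, response: str, criteria: list[str]) -> str:
    """Fill the template; criteria are rendered as a bullet list."""
    bullet_list = "\n".join(f"- {c}" for c in criteria)
    return JUDGE_TEMPLATE.format(criteria=bullet_list, question=question, response=response)


def parse_score(judge_output: str) -> Optional[float]:
    """Take the last 'SCORE: n' line and map 1..5 onto 0..1; None if absent."""
    matches = re.findall(r"^SCORE:\s*([1-5])\s*$", judge_output, flags=re.MULTILINE)
    if not matches:
        return None
    return (int(matches[-1]) - 1) / 4


prompt = build_judge_prompt(
    "What is the refund window?",
    "30 days from delivery, per the policy page.",
    ["Cites at least one source", "Stays within scope"],
)
print(parse_score("The answer cites the policy page.\nSCORE: 4"))  # 0.75
print(parse_score("I cannot decide."))                             # None
```

Returning None for an unparseable reply keeps a malformed judge output from being counted as a low score.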

The Model Is Replaceable. The Eval Stack Compounds.

The teams pulling ahead are not on the newest model. They are on the eval stack they started building eighteen months ago.

LangChain's coding agent moved from 52.8% to 66.5% on Terminal Bench 2.0 — outside the top 30 to the top 5 — without changing the model^[3]. They changed the harness. Same model. Different eval and correction infrastructure. Dramatically different results.

The insight most teams internalize too slowly: the underlying model matters less than the system around it. The model is a replaceable component. The eval infrastructure, the golden dataset, the feedback loops — those compound. They get sharper every time production fails and the team handles it correctly.

Teams that built eval infrastructure in 2024 now hold datasets of hundreds or thousands of real failure cases. Teams starting today start from zero but inherit better tooling and a clearer playbook.

The right time to start was eighteen months ago. The second-best time is today.

The uncomfortable counterpoint: eval infrastructure can become its own theater. A system that scores 0.90 on the golden dataset and 0.91 next week looks healthy. If the dataset has not grown in three months and production traffic has drifted, those scores measure yesterday's failure modes, not today's. High eval scores are not evidence of quality. They are evidence that the system handles the cases you remembered to test. Distinguishing between the two is a discipline, not an output of the tooling.

How many examples does the golden dataset need before CI evals are meaningful?

Fifty catches obvious regressions. Two hundred supports reliable trend detection. Example quality matters more than count — 50 adversarial, well-specified rows beat 500 happy-path samples every time. Grow by 10–20 examples per sprint, prioritized off real production failures, not synthetic edge cases.

Can the same model serve as both generator and judge?

No. Models rate their own outputs 5–7% higher than equivalent outputs from other models. That bias makes the score meaningless as a deployment gate. Use a stronger model (if the app runs GPT-4o-mini, judge with GPT-4o or Claude 3.5 Sonnet) or a separately fine-tuned evaluator. The structural rule: generator and judge are different models.

How do you evaluate multi-turn conversations?

Two levels. Turn-level: is each individual response appropriate? Conversation-level: did the session achieve its goal, did the agent handle topic shifts, did context survive the full trajectory? Golden examples for multi-turn carry full conversation histories. Isolated prompts cannot represent the failure modes that emerge across turns.
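A multi-turn golden example therefore needs a different shape from a single prompt. One possible layout is sketched below, with criteria attached at both levels. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class ConversationExample:
    id: str
    history: list[Turn]
    turn_criteria: dict[int, list[str]]   # index of an assistant turn -> criteria
    conversation_criteria: list[str]      # judged over the whole session

    def validate(self) -> None:
        """Turn-level criteria may only point at assistant turns that exist."""
        for idx in self.turn_criteria:
            if not 0 <= idx < len(self.history) or self.history[idx].role != "assistant":
                raise ValueError(f"turn {idx} is not an assistant turn")


example = ConversationExample(
    id="topic-shift-001",
    history=[
        Turn("user", "I want to return my order."),
        Turn("assistant", "Sure. What is the order number?"),
        Turn("user", "Actually, can I exchange it instead?"),
        Turn("assistant", "Yes. Exchanges use the same order number."),
    ],
    turn_criteria={3: ["Acknowledges the switch from return to exchange"]},
    conversation_criteria=["The order number request survives the topic shift"],
)
example.validate()  # raises ValueError if a criterion points at a user turn
```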

What is the minimum viable production monitoring setup?

Log every request and response. Async-score a 10% sample using the judge model. Track average faithfulness and safety in a rolling 48-hour window. Alert when scores drop more than 3% sustained. That is the floor. Everything else is refinement against the failures you actually see.

Can evals compare prompt versions before deploy?

Yes, and they should be the deciding signal. Run both prompt versions against the golden dataset and compare scores directly. This is the only reliable mechanism that distinguishes 'feels better' from 'is better' on prompt changes. LangSmith and W&B Weave wire this comparison directly into the dataset they already manage.
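The comparison itself is a paired one: the same golden examples, scored once per prompt version. A minimal sketch follows, assuming scores are keyed by example id. The function name and the 0.1 regression cutoff are hypothetical.

```python
from statistics import mean


def compare_versions(scores_a: dict[str, float], scores_b: dict[str, float],
                     regression_cutoff: float = 0.1) -> dict:
    """Paired comparison of two prompt versions on the examples both were scored on."""
    shared = sorted(scores_a.keys() & scores_b.keys())
    if not shared:
        raise ValueError("the two runs share no example ids")
    deltas = [scores_b[k] - scores_a[k] for k in shared]
    return {
        "n": len(shared),
        "mean_delta": mean(deltas),                    # positive favors version B
        "b_wins": sum(d > 0 for d in deltas),
        "a_wins": sum(d < 0 for d in deltas),
        # Examples where B is clearly worse, even if the average improved.
        "regressions": [k for k, d in zip(shared, deltas) if d < -regression_cutoff],
    }


version_a = {"ex-01": 0.6, "ex-02": 0.9, "ex-03": 0.7}
version_b = {"ex-01": 0.8, "ex-02": 0.7, "ex-03": 0.7}
comparison = compare_versions(version_a, version_b)
print(comparison["b_wins"], comparison["a_wins"], comparison["regressions"])  # 1 1 ['ex-02']
```

Listing per-example regressions matters because an unchanged average can hide one example that got much worse.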

Eval Infrastructure: Five Non-Negotiable Rules

[01]

Generator and judge are different models — always

Self-evaluation is biased 5–7% in the model's favor. Use a separate, ideally stronger model as the judge. No exceptions.

[02]

The eval dataset is production code — version-controlled, PR-reviewed

Additions, removals, and criteria changes route through the same review discipline as application code.

[03]

Every significant production failure becomes a dataset row within 24 hours

This is how coverage compounds. Skip it twice and the flywheel stops turning.

[04]

Judge scores are calibrated against human labels before they gate anything

Uncalibrated scores are directional at best. Calibrated scores are deployment gates.

[05]

Alert on trends, not individual scores

Single-output variation is noise. A 3% drop across 500 samples over 48 hours is signal.

The Ceiling of Automated Evaluation

LLM-as-judge agrees with human preference roughly 80% of the time in calibrated settings. That means about 1 in 5 judgments would land differently from a human reviewer's. For most teams, the trade is straightforward: a judge that matches human judgment four times out of five beats running no evals at all. For high-stakes domains (medical, legal, financial), tighten the calibration and raise the human review ratio before the score becomes a deployment gate.

Key terms in this piece

LLM evaluation pipeline, production AI evals, LLM-as-judge, eval infrastructure, AI quality assurance, hallucination detection, AI agent evaluation

Sources

[1] Confident AI: LLM Evaluation Metrics — Everything You Need for LLM Evaluation (confident-ai.com)
[2] The Pragmatic Engineer: LLM Evals — LangChain State of AI Agents 2026 (newsletter.pragmaticengineer.com)
[3] Anthropic Engineering: Demystifying Evals for AI Agents (anthropic.com)
[4] AWS: Evaluating AI Agents — Real-World Lessons from Amazon (aws.amazon.com)
[5] Datadog: LLM Evaluation Framework Best Practices (datadoghq.com)
[6] LabelYourData: LLM-as-a-Judge — Human Agreement Rates and Bias Analysis (labelyourdata.com)
[7] Maxim AI: LLM Hallucinations in Production — Monitoring Strategies That Work (getmaxim.ai)
