AI Native Builders

The LLM Evaluation Pipeline Your Production System Actually Needs

Most teams treat evals as an afterthought. Here's how to build the evaluation infrastructure that catches failures before your users do — from CI gates to production monitoring.

AI Engineering Platform · Advanced · Mar 26, 2026 · 6 min read
[Image: a judge's scale balanced between a robot and a human, on a plain white background with mint accents]
Evaluation infrastructure: the unglamorous discipline that separates reliable AI from expensive chaos.

Your APM dashboard shows a 200 response in 1.2 seconds. Everything looks healthy. Meanwhile, your model just hallucinated a refund policy that doesn't exist, confidently told a customer the wrong deadline, or wrote code that compiles but silently corrupts data.

Traditional monitoring is blind to this class of failure. LLM outputs look like successful HTTP responses even when they're completely wrong. The only way to know if your AI system is doing what it should is to evaluate the outputs — systematically, continuously, and before users become your QA team.

This is what a production LLM evaluation pipeline looks like when it's done properly.

Why AI evaluation isn't like software testing

The properties that make LLMs powerful also make them hard to test.

Traditional software is tested against deterministic behavior. Given input X, the function always returns Y. You write a test, it passes or fails, done.

LLM outputs are non-deterministic, context-sensitive, and often subjectively correct. A response can be factually accurate but uselessly verbose. It can be helpfully concise while being dangerously incomplete. A response can sound authoritative while being entirely fabricated.

This ambiguity means you need a fundamentally different testing philosophy — one that evaluates quality, not just correctness.

Traditional software testing
  • Deterministic: same input always produces same output

  • Binary pass/fail assertions

  • Tests are stable once written

  • APM and error rates catch most failures

  • Coverage measured by code paths

LLM evaluation
  • Non-deterministic: outputs vary across runs

  • Graded quality scores on multiple dimensions

  • Test datasets must grow with production failures

  • Outputs can be wrong while appearing healthy

  • Coverage measured by failure mode surface area
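The contrast above can be made concrete in a few lines. A traditional test is a binary assertion; an LLM eval aggregates graded scores across dimensions into a gate decision. A minimal sketch in Python — the dimension names and thresholds are illustrative, not a standard:

```python
# Traditional testing: a binary, deterministic assertion.
def test_tax_calculation():
    assert round(100 * 0.21, 2) == 21.0  # passes or fails, nothing in between

# LLM evaluation: graded scores on multiple quality dimensions,
# aggregated into a gate decision. Names and thresholds are illustrative.
def eval_gate(scores: dict, mean_threshold: float = 0.8, floor: float = 0.5) -> bool:
    """Pass if mean quality is high AND no single dimension collapses."""
    mean = sum(scores.values()) / len(scores)
    return mean >= mean_threshold and min(scores.values()) >= floor

# A response can be strong on average yet fail the gate because one
# dimension (e.g. safety) falls through the floor.
eval_gate({"faithfulness": 0.9, "relevance": 0.85, "safety": 1.0})
```

The floor check matters: averaging alone lets one catastrophic dimension hide behind two good ones.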

Where most teams actually are

An honest map of eval maturity in 2026.

According to LangChain's 2026 State of AI Agents report, approximately 57% of organizations surveyed now have agents in production. Quality is the top barrier to further deployment, cited by roughly 32% of respondents[2]. Yet most teams are operating at what the field calls Level 0 or Level 1 — manual testing, if anything, with no systematic coverage.

The maturity spectrum breaks down roughly like this:

| Level | Description | Typical coverage | What's missing |
|-------|-------------|------------------|----------------|
| 0 — YOLO | Ship and monitor user complaints | 0% | Everything |
| 1 — Spot checks | Manual review of sample outputs before deploys | ~10% | Regression coverage, automation |
| 2 — Deterministic gates | Rule-based checks in CI: schema validation, banned phrases, length constraints | ~30% | Semantic quality, LLM-as-judge |
| 3 — LLM-as-judge in CI | Automated quality scoring on golden dataset before each deploy | ~70% | Production monitoring, drift detection |
| 4 — Continuous eval | Production traffic sampled and evaluated; alerts on score degradation | ~90% | Red-teaming, adversarial coverage |
| 5 — Eval flywheel | Production failures auto-feed back into the eval dataset; coverage compounds | 95%+ | Mostly tuning |

Most teams reading this are at Level 1 or 2. The jump to Level 3 — LLM-as-judge in CI — is where reliability starts to meaningfully improve. The jump from 3 to 4 is where you stop getting surprised by production failures.

The path isn't complicated, but it requires treating your eval dataset with the same care as your production code.

The full evaluation pipeline

What a complete eval system looks like, from pre-commit to production.

[Diagram: LLM Evaluation Pipeline Architecture]
A complete eval pipeline spans from pre-deploy CI gates to production monitoring and back into the eval dataset.

The key architectural insight is that the pipeline is a loop, not a linear gate. Production failures feed back into the eval dataset. The dataset grows smarter with every real-world failure you encounter. This is the flywheel that makes eval coverage compound over time rather than stagnate.

LLM-as-judge: the workhorse of semantic evaluation

How to use a model to evaluate a model — and where it breaks down.

LLM-as-judge is the practice of using a capable language model (often a stronger or separately-tuned one) to score the outputs of your application model. It's the most scalable approach to semantic quality evaluation available today.

The numbers are compelling: LLM-as-judge can offer roughly 500x to 5000x cost savings over full human review while achieving approximately 80% agreement with human preferences in calibrated settings[6] — a figure that compares favorably to human-to-human consistency on similar tasks. After careful calibration against human labels, precision and recall can approach 0.9 in some domains, though results vary by task complexity and domain specificity[1].

But it comes with systematic failure modes you need to plan for.

Known biases in LLM judges

  • Position bias: models including GPT-4 show roughly 40% inconsistency depending on which response appears first in a comparison, per published research

  • Verbosity bias: longer responses can receive approximately 15% score inflation regardless of quality in some evaluations

  • Self-enhancement bias: models tend to rate their own outputs somewhat higher than equivalent outputs from other models — estimates range from 5–7% in controlled studies

  • Domain gaps: agreement with human annotation often drops 10–15% in specialized technical domains

Mitigation strategies that work

  • Use chain-of-thought prompting in your judge: ask it to reason before scoring. This improves reliability by 10–15% and gives you debuggable reasoning

  • Run multiple judges (3–5 models) and majority-vote for critical evaluations — reduces bias by 30–40%

  • Calibrate your judge against human-labeled samples before trusting the scores

  • Maintain regression tests on the judge itself to detect when it starts drifting
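Two of those mitigations are cheap to implement. A sketch, assuming `judge` is a stand-in for whatever function calls your judge model on a pairwise comparison and returns "A" or "B":

```python
import statistics

def position_debiased_verdict(judge, prompt, resp_a, resp_b):
    """Mitigate position bias: judge both orderings and only trust a
    verdict the judge gives consistently; otherwise call it a tie."""
    first = judge(prompt, resp_a, resp_b)                      # "A" or "B"
    # Swap positions, then translate the verdict back to the original labels.
    swapped = {"A": "B", "B": "A"}[judge(prompt, resp_b, resp_a)]
    return first if first == swapped else "tie"

def ensemble_score(judge_scores):
    """Aggregate scores from several judge models; the median
    resists a single biased or drifting judge."""
    return statistics.median(judge_scores)
```

A judge that always prefers whichever answer appears first now produces "tie" instead of a silently biased verdict, which is exactly what you want surfaced.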

What to actually measure

The dimensions that matter for production quality.

The metrics you choose depend on what your application does, but most production LLM systems need to measure across at least three layers: output quality, behavioral safety, and operational efficiency.

Faithfulness
Does the output stay grounded in source context? Critical for RAG systems.
Safety
Does the output avoid harmful, biased, or policy-violating content?
Relevance
Does the output actually address what was asked? Tone-deaf answers score poorly here.
Efficiency
How many steps and tokens did it take? Agents can succeed expensively.
Completeness
Does the output cover all required sub-goals? Partial answers are silent failures.
Consistency
Does the system produce similar quality across variation in phrasing and context?

For agent systems specifically, you also need to track trajectory quality — not just whether the agent arrived at the right answer, but whether it took a sensible path to get there. An agent that completes a five-step task in fifty steps is not a success worth celebrating. It's a reliability risk wearing a different mask.
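One simple way to put a number on trajectory quality is to score path length against a reference trajectory. The metric below is an illustrative sketch, not a standard:

```python
def trajectory_efficiency(steps_taken: int, reference_steps: int) -> float:
    """Score 1.0 when the agent matches (or beats) the reference path
    length, decaying toward 0 as the trajectory bloats. Illustrative."""
    if steps_taken <= 0:
        return 0.0
    return min(1.0, reference_steps / steps_taken)
```

The fifty-step agent above scores 5/50 = 0.1 on this metric — flagged long before any latency dashboard would notice.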

Building and maintaining your golden dataset

The eval dataset is infrastructure. Treat it like code.

Your golden dataset is the set of (input, expected behavior) pairs that your CI pipeline runs against before every deploy. It's the single most important asset in your eval infrastructure — and the most commonly neglected.

Golden datasets fail in two ways: they're too small to catch regressions reliably, or they go stale as the application evolves. Both are expensive.

  1. Start with edge cases, not happy paths

     Your first 50 examples should be adversarial: ambiguous inputs, boundary conditions, multi-step instructions. Happy path examples don't catch regressions.

  2. Define machine-verifiable success criteria

     For each example, specify what 'correct' means in terms a judge can evaluate. 'Good response' is not a criterion. 'Response cites at least one source, stays within scope, and avoids speculative claims' is.

  3. Version and review it like production code

     Store your eval dataset in version control. Require PR review for additions and removals. Log the reasoning for why each example is there.

  4. Feed production failures back in

     Every significant production failure is a dataset addition. Triage the failure, confirm the expected behavior, and add the (input, criteria) pair within 24 hours. This is what makes coverage compound.
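Concretely, a golden-dataset record can be one JSONL line with machine-checkable fields. The schema below is one possible shape — field names are illustrative — along with a validator you might run as part of PR review:

```python
import json

# One record in a golden JSONL file. Field names are illustrative.
EXAMPLE = {
    "id": "refund-policy-gift-001",
    "input": "Can I still get a refund after 45 days if the item was a gift?",
    "criteria": [
        "cites the refund window from the provided policy context",
        "does not invent gift-specific exceptions",
        "avoids speculative claims",
    ],
    "rationale": "added from a production failure: model hallucinated an exception",
}

REQUIRED = {"id", "input", "criteria", "rationale"}

def validate_records(jsonl_lines):
    """Return ids of records missing required fields (for PR review)."""
    bad = []
    for line in jsonl_lines:
        rec = json.loads(line)
        if not REQUIRED <= rec.keys():
            bad.append(rec.get("id", "<missing id>"))
    return bad
```

Requiring a `rationale` field enforces step 3 above mechanically: nobody can add an example without recording why it exists.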

Wiring evals into CI/CD

Making quality a hard gate, not a suggestion.

The goal is to make deploying a regression harder than fixing it. That means your CI pipeline needs to fail the build when quality scores drop below a threshold — not just report the score and let the engineer decide.

Here's a practical structure for a CI eval job:

`.github/workflows/eval.yml`:

```yaml
name: LLM Quality Gate

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/agents/**'
      - 'src/rag/**'

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run golden dataset eval
        run: |
          python scripts/run_eval.py \
            --dataset evals/golden.jsonl \
            --model ${{ vars.APP_MODEL }} \
            --judge ${{ vars.JUDGE_MODEL }} \
            --threshold 0.82
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

      - name: Upload eval report
        uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval_results/
        if: always()

      - name: Comment scores on PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const summary = JSON.parse(fs.readFileSync('eval_results/summary.json'));
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Eval Results\n\n` +
                    `| Metric | Score | Threshold | Status |\n` +
                    `|--------|-------|-----------|--------|\n` +
                    summary.metrics.map(m =>
                      `| ${m.name} | ${m.score.toFixed(2)} | ${m.threshold} | ${m.pass ? '✅' : '❌'} |`
                    ).join('\n')
            });
```

Production monitoring: the layer most teams skip

CI gates are necessary but not sufficient. Production is where the real failures live.

Pre-deployment evals tell you whether a change broke known behavior. They can't tell you what breaks when real users interact with the system in ways you didn't anticipate — which is, predictably, all the time.

Production monitoring for LLM quality works differently from latency monitoring. You're not polling a health endpoint; you're asynchronously sampling live requests, running them through a judge, and alerting on sustained score degradation.

The key word is sustained. Individual outputs vary. A single low score doesn't mean anything. A 3% drop in faithfulness scores over 48 hours across a sample of 500 requests means something is wrong.
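A minimal rolling-window detector captures this logic. The window size, baseline, and 3% threshold below are the starting points from the text, not universal constants:

```python
from collections import deque

class ScoreTrendAlert:
    """Alert on sustained degradation of sampled eval scores,
    never on a single low score."""

    def __init__(self, window=500, baseline=0.90, max_drop=0.03):
        self.scores = deque(maxlen=window)  # rolling sample of judge scores
        self.baseline = baseline
        self.max_drop = max_drop

    def record(self, score):
        """Add one async eval score; return True only once a full
        window's rolling mean has dropped past the threshold."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough evidence to call a trend
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.max_drop
```

Because the deque is bounded, a single outlier score shifts the mean by at most 1/window — noise stays noise, and only a sustained shift trips the alert.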

  • 5–10% — share of production traffic to sample for async eval; a practical starting point, adjust based on volume and cost

  • 48h — rolling window commonly used for trend-based alerting; calibrate to your deployment cadence

  • ~3% — score degradation threshold worth investigating, as a starting point; set based on your application's quality tolerance

  • ~0.9 — human-agreement precision achievable after judge calibration in favorable conditions; domain-specific results may vary

[Diagram: Production Eval — Async Monitoring Loop]
Production traffic is sampled asynchronously — eval scoring happens off the critical path to avoid adding latency.

Where humans still need to be in the loop

Automation handles scale. Humans handle the hard calls.

There's a temptation to fully automate evaluation and never look at outputs again. Resist it.

Human review serves two roles that automated judges can't replace. First, calibration: your judge scores are only meaningful relative to human judgment. If you stop sampling and reviewing real outputs, the calibration drifts and your scores become self-referential noise. Second, novel failure discovery: automated judges can only catch failure modes you've defined. Humans catch the weird ones — the outputs that are technically within spec but clearly wrong in context.

A lightweight human review workflow looks like this:

Weekly human eval review checklist

  • Review 20–30 randomly sampled production outputs across quality dimensions

  • Review all outputs flagged by the judge as near-threshold (borderline scores)

  • Label 5–10 outputs to re-calibrate judge against human preference

  • Identify any novel failure patterns not covered by current metrics

  • Add confirmed failures to the golden dataset with documented criteria

  • Verify judge consistency: re-run 10 previously scored examples and check for drift
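The calibration step in that checklist reduces to comparing judge verdicts against human labels on the same outputs. A sketch, treating "human flagged a failure" as the positive class:

```python
def calibration_report(judge_pass, human_pass):
    """Compare judge pass/fail verdicts with human labels on the same
    sampled outputs. A human-flagged failure is the positive class."""
    pairs = list(zip(judge_pass, human_pass))
    agreement = sum(j == h for j, h in pairs) / len(pairs)
    true_flags = sum((not j) and (not h) for j, h in pairs)  # both flagged
    judge_flags = sum(not j for j, _ in pairs)
    human_flags = sum(not h for _, h in pairs)
    return {
        "agreement": agreement,
        "precision": true_flags / judge_flags if judge_flags else None,
        "recall": true_flags / human_flags if human_flags else None,
    }
```

If precision or recall on this report sits well below your tolerance, fix the judge prompt before trusting its scores as a deploy gate.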

Tools worth considering

What to reach for at each stage of the pipeline.

The ecosystem has matured considerably. Most teams don't need to build eval infrastructure from scratch — the harder problem is choosing the right tools for each layer and connecting them coherently.

| Tool | Primary strength | Best for |
|------|------------------|----------|
| DeepEval | Pre-built metric library, CI integration, unit test style API | Teams starting from scratch who want fast time-to-coverage |
| RAGAS | RAG-specific metrics: faithfulness, context relevance, answer relevance | Any system with retrieval — the de facto standard for RAG evals |
| W&B Weave | Evaluations playground, golden dataset management, experiment tracking | Teams already on W&B for model training who want unified observability |
| LangSmith | Trace-level debugging, production traffic replay, human annotation | LangChain users who need end-to-end trace visibility |
| Confident AI | Hosted regression suites, A/B testing, dataset versioning | Teams who want managed eval infrastructure without building it |
| Datadog LLM Observability | Eval scoring integrated with APM, unified alerting | Teams already on Datadog who need AI quality alongside infra metrics |

Where to start if you're at Level 0

A concrete sequence that actually fits into a sprint.

The gap between no evals and useful evals is smaller than it feels. Here's what a team of two engineers can realistically accomplish in two weeks — enough to get to Level 2 and have meaningful CI coverage.

  1. Audit your worst failures (Day 1–2)

     ```bash
     # Pull the last 30 days of logged requests where users reported issues.
     # If you don't have this, start logging everything today.
     # -c emits compact one-line JSON so the output is valid JSONL.
     grep 'user_feedback=negative' logs/app.jsonl | \
       jq -c '{input: .prompt, output: .completion, issue: .feedback_note}' \
       > evals/failure_review.jsonl
     ```

  2. Write 50 golden examples from those failures (Day 3–5)

  3. Add three deterministic checks to CI (Day 6–7)

     ```python
     def check_output(output: str, criteria: dict) -> dict:
         checks = {}
         # Length bounds
         checks['length'] = criteria['min_words'] <= len(output.split()) <= criteria['max_words']
         # Banned content
         checks['safe'] = not any(phrase in output.lower()
                                  for phrase in criteria.get('banned_phrases', []))
         # Required elements present
         checks['complete'] = all(req in output
                                  for req in criteria.get('required_elements', []))
         return checks
     ```

  4. Add LLM-as-judge scoring for semantic quality (Day 8–10)

  5. Wire both into CI with a fail threshold (Day 11–12)

  6. Set up async production sampling (Day 13–14)

The compounding case for investing now

Why the teams pulling ahead built eval infrastructure early.

LangChain's coding agent went from 52.8% to 66.5% on Terminal Bench 2.0 — jumping from outside the top 30 to the top 5 — by changing nothing about the model[3]. They only changed the harness around it. Same model. Different evaluation and correction infrastructure. Dramatically better results.

This is the fundamental insight most teams are slow to internalize: the underlying model matters less than the system around it. The model is a replaceable component. The eval infrastructure, the golden dataset, the feedback loops — those compound. They get better every time you have a production failure and handle it correctly.

Teams that built eval infrastructure in 2024 now have datasets of hundreds or thousands of real failure cases. Teams starting today are starting from zero but have the benefit of better tooling and a clearer playbook.

The right time to start was eighteen months ago. The second right time is now.

How many examples do I need in my golden dataset before CI evals are meaningful?

Fifty is enough to catch obvious regressions. Two hundred is enough for reliable trend detection. The quality of examples matters more than quantity — 50 adversarial, well-specified examples beat 500 happy-path samples. Grow it by 10–20 examples per sprint, prioritizing real production failures.

Should I use the same model as my judge that I use for generation?

Never use the same model to judge its own outputs. Use a stronger model (if your app uses GPT-4o-mini, judge with GPT-4o or Claude 3.5 Sonnet) or a separately fine-tuned evaluator. Models systematically rate their own outputs higher — a 5–7% bias that makes your scores meaningless.

How do I handle evals for multi-turn conversations?

Evaluate at two levels: turn-level quality (is each individual response appropriate?) and conversation-level quality (did the session achieve its goal? did the agent handle topic shifts correctly?). Golden examples for multi-turn should include full conversation histories, not just isolated prompts.

What's the minimum viable production monitoring setup?

Log every request and response. Set up async scoring on a 10% sample using your judge model. Track average faithfulness and safety scores in a rolling 48-hour window. Alert when scores drop more than 3%. That's it — the rest is refinement.

Can I use evals to compare prompt versions before deploying?

Yes, and you should. Run both prompt versions against your golden dataset and compare scores directly. This is the only reliable way to know if a prompt change is an improvement or just feels like one. Tools like LangSmith and W&B Weave make this comparison straightforward.
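The comparison itself is a small loop over the golden dataset. In this sketch, `generate` and `judge` are hypothetical stand-ins for your model call and judge scoring function:

```python
def compare_prompts(dataset, generate, judge, prompt_a, prompt_b):
    """Score two prompt versions on the same golden dataset and
    report mean judge scores side by side."""
    def mean_score(prompt):
        scores = [judge(ex["input"], generate(prompt, ex["input"]), ex["criteria"])
                  for ex in dataset]
        return sum(scores) / len(scores)

    score_a, score_b = mean_score(prompt_a), mean_score(prompt_b)
    return {"a": score_a, "b": score_b,
            "winner": "a" if score_a >= score_b else "b"}
```

Holding the dataset and judge fixed is what makes the comparison meaningful — the only variable is the prompt.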

Eval infrastructure ground rules

Never use the same model as both generator and judge

Models rate their own outputs higher. Always use a separate, ideally stronger model as your evaluator.

Your eval dataset is production code — version control it

Treat additions, removals, and criteria changes with the same review process as application code changes.

Every significant production failure becomes a dataset addition within 24 hours

This is how eval coverage compounds. Skip this step and your dataset stagnates.

Calibrate judge scores against human labels before trusting them

Uncalibrated scores are directionally useful at best. Calibrated scores are actionable gates.

Alert on trends, not individual scores

Single output variation is noise. A 3% drop across 500 samples over 48 hours is a signal worth acting on.

On the limits of automated evaluation

LLM-as-judge achieves roughly 80% agreement with human preferences in calibrated settings — which means approximately 1 in 5 judgments would differ from what a human would decide. For most teams, this is acceptable at scale: the 80% it catches correctly far exceeds the 0% caught by having no evals. For high-stakes applications (medical, legal, financial), calibrate tighter and increase the proportion of human review before trusting automated scores as deployment gates.

Key terms in this piece: LLM evaluation pipeline, production AI evals, LLM-as-judge, eval infrastructure, AI quality assurance, hallucination detection, AI agent evaluation
Sources
  1. Confident AI: LLM Evaluation Metrics — Everything You Need for LLM Evaluation (confident-ai.com)
  2. The Pragmatic Engineer: LLM Evals — LangChain State of AI Agents 2026 (newsletter.pragmaticengineer.com)
  3. Anthropic Engineering: Demystifying Evals for AI Agents (anthropic.com)
  4. AWS: Evaluating AI Agents — Real-World Lessons from Amazon (aws.amazon.com)
  5. Datadog: LLM Evaluation Framework Best Practices (datadoghq.com)
  6. LabelYourData: LLM-as-a-Judge — Human Agreement Rates and Bias Analysis (labelyourdata.com)
  7. Maxim AI: LLM Hallucinations in Production — Monitoring Strategies That Work (getmaxim.ai)