A practical path for adding the first useful evaluation set to an AI app without waiting for a full evaluation platform.
The first eval usually fails because it tries to become a platform. The team picks a framework, debates scorers, names ten dimensions, and ends the day with no release gate. Meanwhile the next prompt change ships on taste.
An afternoon eval should be smaller and more annoying. Pick one task. Write ten to twenty cases that represent real user intent, edge cases, and known mistakes. Decide what counts as pass, fail, and needs human review. Run the current prompt against those cases. Save the examples that failed. That is enough to stop guessing.
Evals are the bridge between an AI prototype and production behavior. LangSmith's docs separate offline evaluation before shipping from online evaluation on live traces. Braintrust frames evals as systematic quality measurement that catches regressions before users see them. OpenAI's current evaluation guidance emphasizes typical, edge, and adversarial cases. None of that requires a large team to begin.
Do not evaluate the whole app. Evaluate the behavior that would hurt if it regressed.
The fastest eval target is the task that already made you nervous. A support summarizer hallucinated next steps. A document agent ignored a deadline. A support bot answered policy questions from stale context. A code helper returned a command that worked on the author's machine and broke CI. Start there.
A useful eval case has four parts: input, expected behavior, failure labels, and reviewer notes. The expected behavior does not always need a single exact answer. For extraction, it might. For support drafting, it may be a checklist of facts that must be present and facts that must not appear. For tool use, it may be the selected tool, arguments, and refusal behavior when required data is missing.
The first set should include ordinary cases, edge cases, and adversarial cases. OpenAI's eval best-practice guidance makes that mix explicit. Ordinary cases prevent the prompt from getting weird while chasing edge performance. Edge cases protect the product promise. Adversarial cases expose prompt injection, ambiguous instructions, missing context, and bad refusals. Ten cases can teach more than a dashboard with no sharp examples.
Ten to twenty cases are enough to expose known failures without turning the first eval into a data-labeling project.
The review bucket keeps uncertain outputs visible instead of forcing fake precision.
A narrow eval becomes a gate faster than a broad quality score nobody trusts.
| Field | What to write | Why it matters |
|---|---|---|
| Input | The exact user message, retrieved context, or tool state | Reproducibility starts with the full prompt surface |
| Expected behavior | Required facts, forbidden claims, tool choice, or refusal rule | The judge needs a product contract, not a vibe |
| Failure label | Hallucination, missing context, wrong tool, unsafe action, bad format | Labels turn failures into fixable clusters |
| Reviewer note | Why this case exists and what changed last time | Future reviewers know the history behind the case |
| Release decision | Pass, fail, or needs human review | The eval blocks or clears releases instead of producing trivia |
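
The fields in the table above can be captured in a small record type and stored as JSON Lines, which keeps each case on one line and makes pull-request diffs readable. This is a minimal sketch: `EvalCase`, its field names, and the example values are illustrative assumptions, not a schema from LangSmith, Braintrust, or OpenAI.

```python
import json
from dataclasses import dataclass, asdict, field


# Hypothetical schema: the field names mirror the table, nothing more.
@dataclass
class EvalCase:
    case_id: str
    input: str                    # exact user message, retrieved context, or tool state
    expected_behavior: list[str]  # required facts, forbidden claims, tool choice, refusal rule
    failure_labels: list[str] = field(default_factory=list)
    reviewer_note: str = ""
    release_decision: str = "needs_review"  # "pass", "fail", or "needs_review"


def to_jsonl(cases: list[EvalCase]) -> str:
    """Serialize one case per line so a changed case shows up as a one-line diff."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def from_jsonl(text: str) -> list[EvalCase]:
    """Load cases back, skipping blank lines."""
    return [EvalCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Made-up case for illustration only.
example = EvalCase(
    case_id="refund-deadline-01",
    input="Customer asks whether they can still return an order placed 20 days ago.",
    expected_behavior=["must mention the refund deadline", "must not invent policy"],
    reviewer_note="Illustrative case; replace with a real trace.",
)
round_tripped = from_jsonl(to_jsonl([example]))
```

A plain file in the repository is a deliberate choice here: the cases get reviewed like code, and no platform is needed to start.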
Generic quality ratings do not block the failures users actually notice.
The weakest first eval asks, 'Was the answer good?' That question creates a meeting, not a gate. A stronger eval asks whether the answer completed the task under the product contract. Did the support summary include the refund deadline? Did the agent refuse when the account ID was missing? Did the tool call use the user's workspace, not the default workspace? Did the answer avoid inventing policy?
You can score many of those checks deterministically. Output format, required fields, missing citations, selected tool, argument shape, and refusal phrases can be checked with code. Other checks need human review or an LLM judge. Use a judge only when the judgment is genuinely semantic. If a simple assertion can catch the failure, write the assertion.
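A few of the deterministic checks named above might look like the following. Treat it as a sketch under assumptions: the function names, the shape of the tool-call dictionary, and the refusal phrases are all hypothetical, and substring matching for refusals is deliberately naive.

```python
import json

# Assumed refusal wording; a real app would list the phrases its prompt actually produces.
REFUSAL_PHRASES = ("i can't", "i cannot", "i need your account id")


def check_required_fields(output_json: str, required: list[str]) -> list[str]:
    """Return failure labels; an empty list means the check passed."""
    try:
        data = json.loads(output_json)
    except json.JSONDecodeError:
        return ["bad format"]
    if not isinstance(data, dict):
        return ["bad format"]
    return ["bad format"] if any(key not in data for key in required) else []


def check_tool_call(call: dict, expected_tool: str, required_args: list[str]) -> list[str]:
    """Check the selected tool first, then the argument shape."""
    if call.get("tool") != expected_tool:
        return ["wrong tool"]
    args = call.get("arguments", {})
    return ["bad format"] if any(a not in args for a in required_args) else []


def check_refusal(text: str, must_refuse: bool) -> list[str]:
    """Flag both a missing refusal and an unnecessary one."""
    refused = any(phrase in text.lower() for phrase in REFUSAL_PHRASES)
    return [] if refused == must_refuse else ["bad refusal"]
```

Each check returns labels rather than a boolean, so a failed case already carries the cluster it belongs to.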
The first release threshold should be blunt. For example: no P0 failure can regress, pass rate must not drop below the baseline, and every needs-review case must have a human decision attached to the release. The point is not to make the model perfect. The point is to make behavior changes visible before the change reaches users.
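The blunt threshold described above can be written as code before the run. The sketch below is one possible reading, not a standard: `CaseResult`, the `priority` field, and the interpretation of "P0 regression" as a P0 case that passed at baseline and fails now are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    case_id: str
    status: str                           # "pass", "fail", or "needs_review"
    priority: str = "P1"                  # "P0" marks a protected case
    human_decision: Optional[str] = None  # filled in by a reviewer


def pass_rate(results: list[CaseResult]) -> float:
    """Share of cases with status "pass"; 0.0 for an empty run."""
    return sum(r.status == "pass" for r in results) / len(results) if results else 0.0


def release_gate(baseline: list[CaseResult], candidate: list[CaseResult]) -> list[str]:
    """Return blocking reasons; an empty list means the release is clear."""
    reasons = []
    before = {r.case_id: r for r in baseline}
    for r in candidate:
        prior = before.get(r.case_id)
        # Rule 1: a P0 case that passed at baseline must not fail now.
        if r.priority == "P0" and r.status == "fail" and prior and prior.status == "pass":
            reasons.append(f"P0 regression: {r.case_id}")
        # Rule 3: every needs-review case needs a human decision attached.
        if r.status == "needs_review" and not r.human_decision:
            reasons.append(f"needs-review case without a human decision: {r.case_id}")
    # Rule 2: overall pass rate must not drop below the baseline.
    if pass_rate(candidate) < pass_rate(baseline):
        reasons.append("pass rate dropped below baseline")
    return reasons
```

Returning reasons instead of a single boolean keeps the failed cases attached to the release decision.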
Without a saved eval:
Prompt change looks better in a few manual runs
Averages hide which examples got worse
Failures are discussed without persistent labels
Production incidents do not become future tests
With a saved eval:
Prompt change runs against saved cases before merge
Failed examples and labels travel with the release decision
Regression threshold is written before the run
Bad traces become new cases for the next iteration
Choose one task whose regression would create user harm.
Write ten to twenty cases from real or realistic inputs.
Include ordinary, edge, and adversarial examples.
Define pass, fail, and needs-review before running the model.
Use deterministic checks for format, fields, tool choice, and refusals where possible.
Save failed outputs with labels and reviewer notes.
Set a release threshold that blocks P0 regressions.
Add one production trace failure to the eval set each week.
Put the input, expected behavior, and label in a format the team can review in a pull request.
Run the current production prompt and model before changing anything. The baseline is your comparison point.
Run the same cases against the candidate prompt, model, or retrieval change. Block the release when protected cases regress.
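The baseline-then-candidate loop in the steps above can be sketched in a few lines. Everything here is a stand-in: `baseline_prompt` and `candidate_prompt` are fake functions in place of real model calls, and the `must_include` substring check is a deliberately simple assumption about how a case is scored.

```python
def run_cases(cases, generate):
    """Run every case through `generate` and record pass/fail per case id."""
    results = {}
    for case in cases:
        output = generate(case["input"]).lower()
        results[case["id"]] = all(fact.lower() in output for fact in case["must_include"])
    return results


def diff_runs(baseline, candidate):
    """List the case ids that got worse or better, so an average cannot hide them."""
    regressed = sorted(cid for cid, ok in baseline.items() if ok and not candidate.get(cid, False))
    fixed = sorted(cid for cid, ok in candidate.items() if ok and not baseline.get(cid, False))
    return {"regressed": regressed, "fixed": fixed}


# Fake generators standing in for "current prompt + model" and "candidate prompt + model".
def baseline_prompt(text):
    return "Refunds are accepted within 30 days. " + text


def candidate_prompt(text):
    return "Thanks for reaching out! " + text


# Made-up cases for illustration.
cases = [
    {"id": "refund-deadline", "input": "Can I return my order?", "must_include": ["30 days"]},
    {"id": "greeting", "input": "Hello", "must_include": ["hello"]},
]

baseline = run_cases(cases, baseline_prompt)
candidate = run_cases(cases, candidate_prompt)
report = diff_runs(baseline, candidate)
# report == {"regressed": ["refund-deadline"], "fixed": []}
```

The per-case diff is the part worth keeping even if the scoring changes: it names the examples that got worse instead of reporting a single pass rate.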
A first eval is a net, not a proof.
The limitation is obvious: ten to twenty cases do not cover the product. They cover the first set of risks you can name. That is still a major improvement over manual vibes because the same cases run every time behavior changes. Repetition is the value.
The next improvement is not always more cases. Sometimes it is better labels. If every failure is called 'bad answer,' the eval cannot guide fixes. Split the labels into retrieval miss, hallucinated fact, unsafe action, wrong tool, bad refusal, wrong tone, and invalid format only when those labels change what you do next.
Thresholds should stay close to the labels. A hallucinated policy answer may be a release blocker even if the total pass rate is high. A tone miss may be acceptable for one release if the core task improved and the team has a follow-up. The reviewer needs that distinction in the release note. Otherwise the eval becomes a single number with no operational meaning.
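One way to keep thresholds tied to labels is a small policy table that the release note is generated from. The sketch below assumes a hypothetical split into blocking and waivable labels; which labels belong in which set is a product decision, and the follow-up text is invented for the example.

```python
# Hypothetical policy: these sets are illustrative, not a recommendation.
BLOCKING_LABELS = {"hallucinated fact", "unsafe action"}
WAIVABLE_LABELS = {"wrong tone"}  # acceptable for one release if a follow-up exists


def triage(failures, follow_ups):
    """Sort failures into blockers, waived, and unresolved.

    failures maps case_id -> label; follow_ups maps label -> follow-up note.
    """
    decision = {"blockers": [], "waived": [], "unresolved": []}
    for case_id, label in sorted(failures.items()):
        if label in BLOCKING_LABELS:
            decision["blockers"].append((case_id, label))
        elif label in WAIVABLE_LABELS and follow_ups.get(label):
            decision["waived"].append((case_id, label, follow_ups[label]))
        else:
            decision["unresolved"].append((case_id, label))
    return decision


# Made-up failures for illustration.
failures = {
    "policy-03": "hallucinated fact",
    "summary-07": "wrong tone",
    "format-02": "invalid format",
}
note = triage(failures, {"wrong tone": "tone pass scheduled for the next release"})
```

In this example the hallucinated policy answer blocks the release regardless of the overall pass rate, while the tone miss is waived only because a follow-up is recorded.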
The most useful afternoon result is often a failure cluster. If four of the first fifteen cases fail because retrieved context is missing, the fix is not prompt polish. It is retrieval coverage, source selection, or a product fallback when the answer cannot be grounded.
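Finding a cluster like that takes little more than counting labels. A minimal sketch, with made-up case ids and labels that mirror the four-failure example above:

```python
from collections import Counter


def cluster_failures(failures):
    """Count failures per label, most frequent first.

    failures is a list of (case_id, label) pairs.
    """
    return Counter(label for _, label in failures).most_common()


# Illustrative data: four retrieval misses and one tone miss.
failures = [
    ("c01", "retrieval miss"),
    ("c04", "retrieval miss"),
    ("c07", "retrieval miss"),
    ("c11", "retrieval miss"),
    ("c09", "wrong tone"),
]
clusters = cluster_failures(failures)
# clusters == [("retrieval miss", 4), ("wrong tone", 1)]
```

When one label dominates the list, that label points at the layer to fix first.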
The goal is to give builders the smallest eval practice that can survive contact with a real release: evidence before production claims.
How many cases should the first eval include?
Ten to twenty cases are enough to start if they represent known failures, edge inputs, and one or two adversarial examples. Add cases from production traces over time.
Should the first eval use an LLM judge?
Only for semantic judgments that code cannot check. Use deterministic checks for schema, required fields, tool choice, citations, and refusal behavior whenever possible.
What makes an eval release-ready?
It has a baseline, a written threshold, saved failed examples, and a release decision. Without those, it is an experiment, not a gate.