Most teams shipping AI features in production have the same uncomfortable moment about three months in. The model is doing something it was not doing last week. Quality has drifted. A customer has complained about an answer that the team would have caught if anyone had been looking. Nobody can tell when the drift started, why it happened, or whether the change last Tuesday made it worse.
This is what AI products look like without an evaluation harness. The model is a black box, the prompt is a string the team is afraid to touch, and every change feels like an act of faith. The discipline that turns this into a manageable engineering practice is evaluation, and the artefact that makes evaluation work is the eval harness.
Why evaluation is not the same as testing
Traditional software testing is binary. A function either returns the expected output or it does not. The expected output is known in advance, and the test passes or fails accordingly.
AI output is not binary. A model can return an answer that is technically correct but tonally wrong. Or factually right but missing a key nuance. Or accurate today but slightly worse than the version it returned last week. None of these failures show up in a binary test. They show up only when someone is looking carefully at the actual outputs, and they accumulate silently in the gaps between releases.
Evaluation, properly done, is the systematic practice of looking carefully. An eval harness is the infrastructure that makes the looking repeatable, automatable, and impossible to skip.
If you cannot tell whether your model is getting better or worse week to week, you are not running a model. You are praying.
What a real eval harness contains
An evaluation harness is not a single tool. It is a small collection of components that together let your team answer one question with confidence: did this change make things better, worse, or the same? The components are:
A test set drawn from real usage
The foundation of any useful eval harness is a corpus of real inputs that the model has seen in production. Not synthetic data, not handcrafted examples — real user queries, real customer documents, real workflows. The test set should be large enough to be statistically meaningful (typically a few hundred examples at minimum) and curated to cover the failure modes you actually care about.
For one of our retrieval-heavy engagements, the test set started as twenty examples collected by hand. It grew to two hundred over the first month and three hundred by the second. Every customer complaint that surfaced a new failure mode was added to the test set. The test set is now the most valuable asset of that project — more valuable than the prompts, more valuable than the model choice, more valuable than the orchestration code.
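As a sketch, the workflow above can be captured in a tiny schema plus a helper for folding complaints back into the corpus. The field names and the `add_from_complaint` helper are illustrative, not from any particular framework:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical schema for one eval case; field names are illustrative.
@dataclass
class EvalCase:
    case_id: str
    input: str                    # real user query or document, captured from production
    expected: Optional[str] = None  # golden answer, when one exists
    tags: list[str] = field(default_factory=list)  # failure modes / segments this case covers

def add_from_complaint(test_set: list[EvalCase], query: str, tag: str) -> EvalCase:
    """Every customer complaint that surfaces a new failure mode becomes a case."""
    case = EvalCase(case_id=f"case-{len(test_set) + 1:04d}", input=query, tags=[tag])
    test_set.append(case)
    return case
```

The tags matter as much as the inputs: they are what later lets the harness report scores per failure mode or customer segment rather than one blended average.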
Graders that match what you actually care about
Each test case needs a way to score the model's output. The naive approach is to compare against a "golden" answer with exact match or similarity scoring. This works for narrow tasks (classification, extraction) and fails for everything else.
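For those narrow tasks, the graders really are this simple. A minimal sketch, with a deliberately crude token-overlap similarity standing in for whatever similarity metric a team actually picks:

```python
def exact_match(output: str, golden: str) -> float:
    """Binary grade: suitable for classification or extraction tasks."""
    return 1.0 if output.strip().lower() == golden.strip().lower() else 0.0

def token_overlap(output: str, golden: str) -> float:
    """Crude similarity: fraction of golden tokens that appear in the output."""
    golden_tokens = set(golden.lower().split())
    if not golden_tokens:
        return 0.0
    output_tokens = set(output.lower().split())
    return len(golden_tokens & output_tokens) / len(golden_tokens)
```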
For more complex outputs, the grader is usually one of three things:
- A rule-based check — does the output contain the required entity? Does it cite a source? Does it stay within length limits? These are cheap, deterministic, and useful for the dimensions you can encode mechanically.
- An LLM-as-judge — using a separate model call to rate the output against criteria you define. This is more expensive and slightly noisy, but it is the only practical way to evaluate quality dimensions like tone, helpfulness, or coherence at scale.
- Human review — for the highest-stakes dimensions, real human judgement is the gold standard. The eval harness should make it easy to spot-check a random sample of outputs each week.
Most production eval harnesses use a combination of all three. Rule-based checks catch the obvious failures fast. LLM-as-judge handles the nuanced quality dimensions at scale. Human review catches what the others miss.
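The combination can be sketched as a single grading function. The rules and the citation format here are assumptions, and the judge is a stub; in practice it wraps a separate model call, whose shape depends on the provider:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GradeReport:
    rule_scores: dict[str, bool]  # deterministic pass/fail per rule
    judge_score: float            # 1-5 rating from the LLM judge

# Illustrative rule-based checks; "[source:" is an assumed citation marker.
RULES: dict[str, Callable[[str], bool]] = {
    "cites_source": lambda out: "[source:" in out,
    "within_length": lambda out: len(out) <= 2000,
}

def judge(output: str, criteria: str) -> float:
    # Stub: replace with a separate model call that rates the output
    # against your criteria (tone, helpfulness, coherence) on a 1-5 scale.
    return 3.0

def grade(output: str) -> GradeReport:
    return GradeReport(
        rule_scores={name: check(output) for name, check in RULES.items()},
        judge_score=judge(output, "helpful, on-topic, correct tone"),
    )
```

Human review then sits on top: sample a handful of `GradeReport`s each week and read the underlying outputs, especially the ones where the rules pass but the judge score is low.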
A regression detector
The eval harness should run automatically on every meaningful change — every new prompt, every model version bump, every retrieval pipeline tweak — and produce a clear delta. If the new version improves the average score by two points, that is visible. If it regresses by two points on a specific subset of cases, that is also visible. The team should never have to wonder whether a change made things better.
This is the discipline that distinguishes AI engineering from AI tinkering. Tinkering is changing prompts and hoping. Engineering is changing prompts and measuring.
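The measuring step can be as plain as a per-subset delta between two eval runs. A sketch, assuming each case carries tags from the test set and scores are keyed by case id:

```python
from collections import defaultdict

def subset_deltas(baseline: dict[str, float], candidate: dict[str, float],
                  tags_by_case: dict[str, list[str]]) -> dict[str, float]:
    """Average score delta (candidate minus baseline) for each tagged subset."""
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for case_id, base_score in baseline.items():
        delta = candidate[case_id] - base_score
        for tag in tags_by_case.get(case_id, ["untagged"]):
            sums[tag] += delta
            counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}

def regressions(deltas: dict[str, float], threshold: float = -0.02) -> list[str]:
    """Subsets whose average score dropped past the threshold (arbitrary default)."""
    return sorted(tag for tag, d in deltas.items() if d <= threshold)
```

This is what surfaces the failure mode where the overall average goes up while one subset quietly goes down: the headline delta is positive, but `regressions` still names the segment that got worse.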
The surprising things eval harnesses catch
Once an eval harness is in place, the regressions it catches are rarely the ones the team expected. From our engagements, a non-exhaustive list of real findings:
- A new model version that scored higher overall but performed worse on a specific high-value customer segment
- A prompt refinement intended to improve clarity that quietly reduced the rate at which the model cited sources
- A retrieval pipeline upgrade that improved precision in English but regressed sharply in non-English queries
- A temperature change that made outputs more confident but materially less accurate on edge cases
- A new tool integration that worked perfectly on test inputs but failed silently on the long tail of real production queries
None of these would have been caught by spot-checking or by unit tests. They were caught because someone had built the harness that made them visible.
When the harness becomes the product
The deeper insight, after running this discipline for a while, is that the eval harness itself becomes the most defensible asset of an AI product. The model can change. The prompts can change. The retrieval pipeline can change. The eval harness, and the curated test set behind it, is what tells you whether each of those changes is a step forward or a step backward.
This is why we describe evaluation harnesses as "the deliverable" on our generative AI engagements. Not the model, not the prompts, not the integration. The harness — because it is the artefact that lets the team operate the product confidently for years after we have handed it off.
If you are running an AI feature in production without an evaluation harness, you are flying blind. Build one before you ship the next prompt change. It is the single highest-leverage piece of engineering you can do for the long-term reliability of the product.
Work with us
Have a project that needs senior engineering attention?
We work with founders and enterprise teams across Dubai, the US, and India. If something here resonates with what you're building, we'd be glad to talk.
Start a conversation →