Most LLM applications are deployed without a meaningful evaluation system. The developer prompts the model a few times, the outputs look reasonable, and it ships. Then users start complaining about specific failure cases, the developer adjusts the prompt, checks a few examples again, and ships again. This cycle is not engineering. It is guessing.
Evals are what turns LLM development from guessing into engineering. They let you measure whether a change actually improved things, catch regressions when you update your prompt or switch models, and understand the failure modes of your application before users do. This guide covers how to build an eval system that is actually useful, not just theoretically correct.
Why Most Teams Skip Evals and Why That Is a Mistake
The honest reason teams skip evals is that they are not free to build. A good eval suite requires curated test cases, a clear definition of what “correct” looks like for your task, and the infrastructure to run evaluations systematically. For a team moving fast, this feels like overhead.
Skipping evals looks like a sound cost calculation until you have been burned once. The burn looks like this: you improve your prompt for one class of inputs, you ship it, and then you find out three days later that you accidentally broke a different class of inputs that was working fine. Without evals, you have no way to know this until users tell you. With evals, you know before shipping.
The other hidden cost of not having evals is the inability to change models confidently. When a better, cheaper model becomes available (and in 2026, this happens regularly), you have no way to know if it behaves equivalently on your specific task without running it against real examples. Teams without evals either miss model improvements or make the switch without adequate validation.
Types of Evals
Different tasks require different evaluation approaches. Understanding the trade-offs helps you choose the right type for each part of your application.
Exact Match
Exact match evals compare the model’s output to a known correct answer. If the output equals the expected output (or matches a pattern), the eval passes.
Best for: classification tasks, structured data extraction, simple question-answering with factual answers, code generation where functional correctness can be tested by running the code.
How to implement:
```python
def eval_classification(model_output: str, expected_label: str) -> dict:
    """Simple exact match for classification tasks."""
    # Normalize both to handle minor formatting differences
    predicted = model_output.strip().lower()
    expected = expected_label.strip().lower()
    return {
        "passed": predicted == expected,
        "predicted": predicted,
        "expected": expected
    }

def run_classification_eval(test_cases: list[dict]) -> dict:
    results = []
    for case in test_cases:
        output = call_llm(case["input"])  # your application's model call
        result = eval_classification(output, case["expected_label"])
        results.append(result)
    passed = sum(1 for r in results if r["passed"])
    return {
        "accuracy": passed / len(results),
        "passed": passed,
        "total": len(results),
        "failures": [r for r in results if not r["passed"]]
    }
```
Limitations: Exact match is brittle for free-text outputs. “New York” and “NYC” are both correct answers to “what city is this?” but exact match would fail one of them. Use exact match only when the output space is well-defined and small.
Fuzzy Match and Semantic Similarity
For tasks where the correct answer is a phrase or short text but multiple phrasings are acceptable, fuzzy matching or embedding similarity gives you more useful signal than exact match.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity_eval(predicted: str, expected: str, threshold: float = 0.85) -> dict:
    embeddings = model.encode([predicted, expected])
    similarity = float(np.dot(embeddings[0], embeddings[1]) /
                       (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])))
    return {
        "passed": similarity >= threshold,
        "similarity": similarity,
        "predicted": predicted,
        "expected": expected
    }
```
The threshold requires calibration: too high and you fail acceptable outputs; too low and you pass bad ones. Run your test cases manually against different thresholds to find the right value for your task.
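One way to make that calibration systematic: collect a small set of output pairs that a human has labeled correct or incorrect, compute the similarity for each, and check which candidate threshold agrees with the human labels most often. A minimal sketch, assuming the similarity scores have already been computed by the eval above:

```python
def sweep_thresholds(scored_pairs: list[dict], thresholds: list[float]) -> list[dict]:
    """For each candidate threshold, report how often the threshold's
    pass/fail verdict agrees with a human label. Each item in scored_pairs
    looks like {"similarity": 0.91, "human_says_correct": True}."""
    results = []
    for t in thresholds:
        agree = sum(
            1 for p in scored_pairs
            if (p["similarity"] >= t) == p["human_says_correct"]
        )
        results.append({"threshold": t, "agreement": agree / len(scored_pairs)})
    return results
```

Pick the threshold with the highest agreement, and re-run the sweep whenever you change the embedding model, since similarity scales are not comparable across models.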
LLM-as-Judge
LLM-as-judge evals use an LLM to evaluate the output of another LLM (or the same model on a different input). This is the most flexible approach and works for tasks where “correct” cannot be fully specified programmatically.
Best for: open-ended generation tasks (summaries, explanations, responses), tasks where quality matters but multiple correct outputs exist, evaluating dimensions like tone, helpfulness, or accuracy that resist simple rules.
A basic LLM-as-judge implementation:
```python
import json

JUDGE_PROMPT = """You are evaluating the quality of an AI assistant's response.

Task: {task_description}
User input: {user_input}
AI response: {ai_response}

Evaluate the response on the following criteria:
1. Accuracy: Is the information correct? (1-5)
2. Completeness: Does it address all parts of the question? (1-5)
3. Clarity: Is it easy to understand? (1-5)

Return a JSON object with this exact structure:
{{
    "accuracy": <score>,
    "completeness": <score>,
    "clarity": <score>,
    "overall": <average of the three>,
    "reasoning": "<one sentence explaining the scores>"
}}"""

def llm_judge_eval(task_description: str, user_input: str, ai_response: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        task_description=task_description,
        user_input=user_input,
        ai_response=ai_response
    )
    judge_response = call_judge_llm(prompt)  # Use a capable model here
    try:
        scores = json.loads(judge_response)
        return {"passed": scores["overall"] >= 3.5, **scores}
    except json.JSONDecodeError:
        return {"passed": False, "error": "Failed to parse judge response", "raw": judge_response}
```
Important caveats about LLM-as-judge:
The judge can be biased toward outputs that look good superficially (well-formatted, confident) even when they are wrong. Counter this by including factual reference material in the judge prompt when available.
The judge is biased toward its own preferences. If you judge Claude outputs using Claude, you will get systematically higher scores than if you use GPT-4o as the judge for Claude outputs. This is not a reason to avoid LLM-as-judge, but it means you should be consistent in your judge choice and be cautious about comparing scores across different evaluation setups.
Use a capable model as the judge. Using a small, cheap model as a judge for complex reasoning tasks gives noisy results. Spending more on the judge pays off in eval reliability.
Human Evaluation
Human eval is the ground truth, but it is slow and expensive. Reserve it for:
- Building your initial test set (humans label the correct answers)
- Spot-checking LLM-as-judge reliability (do humans agree with the judge?)
- Evaluating when the stakes are high and automated evals are not reliable enough
The practical approach: build a lightweight internal annotation tool (a simple web form or even a Google Sheet) that shows an evaluator the input, the AI output, and asks for a score or pass/fail judgment. Sample 50-100 outputs per week across your production traffic for ongoing quality monitoring.
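The weekly sampling step can be a few lines of code: draw a random sample from your production logs and write it to a CSV that an annotator can score in any spreadsheet. A sketch, assuming each log entry has "input" and "output" keys (adapt the keys to your logging schema):

```python
import csv
import random

def export_annotation_sample(production_logs: list[dict], n: int = 75,
                             path: str = "to_annotate.csv", seed: int = 0) -> int:
    """Randomly sample production interactions into a CSV for human scoring.
    A fixed seed makes the sample reproducible for a given log snapshot."""
    rng = random.Random(seed)
    sample = rng.sample(production_logs, min(n, len(production_logs)))
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "ai_output", "score_1_to_5", "notes"])
        for entry in sample:
            writer.writerow([entry["input"], entry["output"], "", ""])
    return len(sample)
```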
Building an Eval Harness
An eval harness is the infrastructure that makes running evals practical: loading test cases, running the model, collecting results, and reporting.
Minimal Eval Harness Structure
```python
import json
import time
from pathlib import Path
from datetime import datetime

class EvalHarness:
    def __init__(self, eval_name: str, test_cases_path: str):
        self.eval_name = eval_name
        self.test_cases = self._load_test_cases(test_cases_path)
        self.results = []

    def _load_test_cases(self, path: str) -> list[dict]:
        with open(path) as f:
            return json.load(f)

    def run(self, model_fn, eval_fn) -> dict:
        """Run all test cases through the model and eval function."""
        start_time = time.time()
        for case in self.test_cases:
            case_start = time.time()
            try:
                model_output = model_fn(case["input"])
                eval_result = eval_fn(model_output, case)
                self.results.append({
                    "case_id": case.get("id", "unknown"),
                    "input": case["input"],
                    "expected": case.get("expected"),
                    "output": model_output,
                    "passed": eval_result["passed"],
                    "latency_ms": int((time.time() - case_start) * 1000),
                    **{k: v for k, v in eval_result.items() if k != "passed"}
                })
            except Exception as e:
                self.results.append({
                    "case_id": case.get("id", "unknown"),
                    "passed": False,
                    "error": str(e)
                })
        return self._summarize(time.time() - start_time)

    def _summarize(self, total_time: float) -> dict:
        passed = sum(1 for r in self.results if r["passed"])
        return {
            "eval_name": self.eval_name,
            "timestamp": datetime.now().isoformat(),
            "total": len(self.results),
            "passed": passed,
            "failed": len(self.results) - passed,
            "pass_rate": passed / len(self.results),
            "total_time_s": round(total_time, 2),
            "avg_latency_ms": sum(r.get("latency_ms", 0) for r in self.results) / len(self.results),
            "failures": [r for r in self.results if not r["passed"]]
        }

    def save_results(self, output_dir: str = "eval_results"):
        Path(output_dir).mkdir(exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = f"{output_dir}/{self.eval_name}_{timestamp}.json"
        with open(path, "w") as f:
            json.dump({"summary": self._summarize(0), "results": self.results}, f, indent=2)
        return path
```
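Wiring it together: `model_fn` takes a test case's input and returns the model output, and `eval_fn` takes that output plus the full case and returns a dict with at least a `"passed"` key. A sketch with a hypothetical keyword classifier standing in for the real LLM call:

```python
import json
import tempfile

# Hypothetical stand-in for the real model call.
def model_fn(text: str) -> str:
    return "complaint" if "unacceptable" in text.lower() else "inquiry"

# eval_fn receives the model output and the full test case, and must
# return a dict with at least a "passed" key -- the contract run() expects.
def eval_fn(model_output: str, case: dict) -> dict:
    return {"passed": model_output.strip().lower() == case["expected_label"]}

cases = [
    {"id": "c1", "input": "This is unacceptable.", "expected_label": "complaint"},
    {"id": "c2", "input": "Checking on my order.", "expected_label": "inquiry"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(cases, f)
    cases_path = f.name

# summary = EvalHarness("intent_classification", cases_path).run(model_fn, eval_fn)
# print(summary["pass_rate"])
```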
Test Case Format
Store test cases as JSON files in your repository:
```json
[
    {
        "id": "classification_001",
        "input": "My order hasn't arrived after 3 weeks. This is unacceptable.",
        "expected_label": "complaint",
        "expected_urgency": "high",
        "notes": "Clear complaint, time-sensitive"
    },
    {
        "id": "classification_002",
        "input": "Just checking in on my order status from last Tuesday.",
        "expected_label": "inquiry",
        "expected_urgency": "low",
        "notes": "Neutral inquiry, no distress signals"
    }
]
```
Version these files in your repository alongside your prompts. When you change a prompt, run the evals and commit the results together. This gives you a history of how your changes affected quality.
What to Measure
Not all metrics matter equally for every application. Here is what to prioritize.
For any LLM application:
- Pass rate on your test suite (absolute number, tracked over time)
- Latency percentiles (p50, p95, p99) — LLM applications have long tail latency
- Error rate (API errors, parsing failures, timeouts)
For classification and extraction tasks:
- Accuracy on labeled test cases
- Precision and recall per class (not just overall accuracy)
- False positive and false negative rates on critical classes
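Per-class precision and recall fall out of the (expected, predicted) pairs your eval already produces, with no extra dependency. A minimal sketch:

```python
from collections import defaultdict

def per_class_metrics(pairs: list[tuple[str, str]]) -> dict:
    """pairs is a list of (expected_label, predicted_label) tuples.
    Returns {label: {"precision": ..., "recall": ...}}."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for expected, predicted in pairs:
        if expected == predicted:
            tp[expected] += 1
        else:
            fp[predicted] += 1  # predicted this label when it was something else
            fn[expected] += 1   # missed this label
    labels = set(tp) | set(fp) | set(fn)
    return {
        label: {
            "precision": tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0,
            "recall": tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0,
        }
        for label in labels
    }
```

A class that looks fine in overall accuracy can still have terrible recall if it is rare, which is exactly the case this breakdown catches.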
For generation tasks (summaries, responses, drafts):
- LLM-as-judge scores on your quality dimensions
- Human agreement rate with your automated eval
- Specific failure mode rates (hallucination, off-topic, wrong format)
For agentic applications:
- Task completion rate (does the agent complete the task it was given?)
- Step efficiency (how many LLM calls does it take on average?)
- Error recovery rate (when a tool fails, does it recover correctly?)
Common Pitfalls
Building a test set from the same distribution as your training prompts. If you design your prompts with certain examples in mind, and then evaluate against those same examples, you will overestimate performance on real user inputs. Include examples from actual user interactions as soon as you have them.
Ignoring the failure analysis. Aggregate pass rates are a lagging indicator. The value of evals is in the failure cases: read every failure, understand why it failed, and fix the specific issue. A 90% pass rate with 10% mysterious failures is worse than an 85% pass rate with fully understood failure modes.
Running evals manually and rarely. Evals only work if they are run continuously and automatically. Integrate them into your CI/CD pipeline so they run on every prompt change. A pull request that includes a prompt change should show the eval results.
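A minimal CI gate can compare the latest harness summary against a recorded baseline and fail the build on regression. A sketch; it assumes the `{"summary": {...}}` file layout that the harness's save_results writes, and a small tolerance so normal run-to-run noise does not block merges:

```python
import json
import sys

def gate(summary_path: str, baseline_pass_rate: float, tolerance: float = 0.02) -> int:
    """Exit-code style gate for CI: fail the build if the pass rate dropped
    more than `tolerance` below the recorded baseline."""
    with open(summary_path) as f:
        summary = json.load(f)["summary"]
    if summary["pass_rate"] < baseline_pass_rate - tolerance:
        print(f"FAIL: pass rate {summary['pass_rate']:.2%} below baseline {baseline_pass_rate:.2%}")
        return 1
    print(f"OK: pass rate {summary['pass_rate']:.2%}")
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], float(sys.argv[2])))
```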
Only measuring what is easy to measure. Exact match accuracy is easy to measure. Helpfulness is not. If helpfulness matters for your application, invest in an eval approach that measures it even if it is harder to automate. Measuring the wrong thing with high precision is worse than measuring the right thing approximately.
Starting Simple
You do not need a perfect eval system before you deploy. You need a minimum viable eval system that is better than nothing.
Start here: collect 20 examples from your domain (the harder and more diverse, the better). Label the correct output for each. Write a script that runs your current prompt against all 20 examples and reports how many pass. Run it before and after every prompt change.
That is it. A 20-case test suite run manually is infinitely better than nothing. It catches obvious regressions and forces you to define what “correct” means for your task. Build from there.
The teams with the best LLM applications are not the ones with the cleverest models or the most sophisticated prompts. They are the ones who have built the tightest feedback loop between production behavior and product improvement. Evals are that feedback loop.