Most LLM applications are deployed without a meaningful evaluation system. The developer prompts the model a few times, the outputs look reasonable, and it ships. Then users start complaining about specific failure cases, the developer adjusts the prompt, checks a few examples again, and ships again. This cycle is not engineering. It is guessing.
Evals are what turns LLM development from guessing into engineering. They let you measure whether a change actually improved things, catch regressions when you update your prompt or switch models, and understand the failure modes of your application before users do. This guide covers how to build an eval system that is actually useful, not just theoretically correct.
Why Most Teams Skip Evals and Why That Is a Mistake
The honest reason teams skip evals is that they are not free to build. A good eval suite requires curated test cases, a clear definition of what “correct” looks like for your task, and the infrastructure to run evaluations systematically. For a team moving fast, this feels like overhead.
Skipping evals looks like a sound cost calculation until you have been burned once. The burn looks like this: you improve your prompt for one class of inputs, you ship it, and then you find out three days later that you accidentally broke a different class of inputs that was working fine. Without evals, you have no way to know this until users tell you. With evals, you know before shipping.
The other hidden cost of not having evals is the inability to change models confidently. When a better, cheaper model becomes available (and in 2026, this happens regularly), you have no way to know if it behaves equivalently on your specific task without running it against real examples. Teams without evals either miss model improvements or make the switch without adequate validation.
Types of Evals
Different tasks require different evaluation approaches. Understanding the trade-offs helps you choose the right type for each part of your application.
Exact Match
Exact match evals compare the model’s output to a known correct answer. If the output equals the expected output (or matches a pattern), the eval passes.
Best for: classification tasks, structured data extraction, simple question-answering with factual answers, code generation where functional correctness can be tested by running the code.
How to implement:
```python
def eval_classification(model_output: str, expected_label: str) -> dict:
    """Simple exact match for classification tasks."""
    # Normalize both to handle minor formatting differences
    predicted = model_output.strip().lower()
    expected = expected_label.strip().lower()
    return {
        "passed": predicted == expected,
        "predicted": predicted,
        "expected": expected
    }

def run_classification_eval(test_cases: list[dict]) -> dict:
    results = []
    for case in test_cases:
        output = call_llm(case["input"])  # your application's model call
        result = eval_classification(output, case["expected_label"])
        results.append(result)
    passed = sum(1 for r in results if r["passed"])
    return {
        "accuracy": passed / len(results),
        "passed": passed,
        "total": len(results),
        "failures": [r for r in results if not r["passed"]]
    }
```
Limitations: Exact match is brittle for free-text outputs. “New York” and “NYC” are both correct answers to “what city is this?” but exact match would fail one of them. Use exact match only when the output space is well-defined and small.
Fuzzy Match and Semantic Similarity
For tasks where the correct answer is a phrase or short text but multiple phrasings are acceptable, fuzzy matching or embedding similarity gives you more useful signal than exact match.
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')

def semantic_similarity_eval(predicted: str, expected: str, threshold: float = 0.85) -> dict:
    embeddings = model.encode([predicted, expected])
    similarity = float(np.dot(embeddings[0], embeddings[1]) /
                       (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])))
    return {
        "passed": similarity >= threshold,
        "similarity": similarity,
        "predicted": predicted,
        "expected": expected
    }
```
The threshold requires calibration: too high and you fail acceptable outputs; too low and you pass bad ones. Run your test cases manually against different thresholds to find the right value for your task.
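One way to make that calibration systematic: collect a small set of output pairs that a human has labeled correct or incorrect, compute the similarity for each, and check which candidate threshold agrees with the human labels most often. A minimal sketch, assuming the similarity scores have already been computed by the eval above:

```python
def sweep_thresholds(scored_pairs: list[dict], thresholds: list[float]) -> list[dict]:
    """For each candidate threshold, report how often the threshold's
    pass/fail verdict agrees with a human label. Each item in scored_pairs
    looks like {"similarity": 0.91, "human_says_correct": True}."""
    results = []
    for t in thresholds:
        agree = sum(
            1 for p in scored_pairs
            if (p["similarity"] >= t) == p["human_says_correct"]
        )
        results.append({"threshold": t, "agreement": agree / len(scored_pairs)})
    return results
```

Pick the threshold with the highest agreement, and re-run the sweep whenever you change the embedding model, since similarity scales are not comparable across models.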
LLM-as-Judge
LLM-as-judge evals use an LLM to evaluate the output of another LLM (or the same model on a different input). This is the most flexible approach and works for tasks where “correct” cannot be fully specified programmatically.
Best for: open-ended generation tasks (summaries, explanations, responses), tasks where quality matters but multiple correct outputs exist, evaluating dimensions like tone, helpfulness, or accuracy that resist simple rules.
A basic LLM-as-judge implementation:
```python
import json

JUDGE_PROMPT = """You are evaluating the quality of an AI assistant's response.

Task: {task_description}
User input: {user_input}
AI response: {ai_response}

Evaluate the response on the following criteria:
1. Accuracy: Is the information correct? (1-5)
2. Completeness: Does it address all parts of the question? (1-5)
3. Clarity: Is it easy to understand? (1-5)

Return a JSON object with this exact structure:
{{
    "accuracy": <score>,
    "completeness": <score>,
    "clarity": <score>,
    "overall": <average of the three>,
    "reasoning": "<one sentence explaining the scores>"
}}"""

def llm_judge_eval(task_description: str, user_input: str, ai_response: str) -> dict:
    prompt = JUDGE_PROMPT.format(
        task_description=task_description,
        user_input=user_input,
        ai_response=ai_response
    )
    judge_response = call_judge_llm(prompt)  # Use a capable model here
    try:
        scores = json.loads(judge_response)
        return {"passed": scores["overall"] >= 3.5, **scores}
    except json.JSONDecodeError:
        return {"passed": False, "error": "Failed to parse judge response", "raw": judge_response}
```
Important caveats about LLM-as-judge:
The judge can be biased toward outputs that look good superficially (well-formatted, confident) even when they are wrong. Counter this by including factual reference material in the judge prompt when available.
The judge is biased toward its own preferences. If you judge Claude outputs using Claude, you will get systematically higher scores than if you use GPT-4o as the judge for Claude outputs. This is not a reason to avoid LLM-as-judge, but it means you should be consistent in your judge choice and be cautious about comparing scores across different evaluation setups.
Use a capable model as the judge. Using a small, cheap model as a judge for complex reasoning tasks gives noisy results. Spending more on the judge pays off in eval reliability.
Human Evaluation
Human eval is the ground truth, but it is slow and expensive. Reserve it for:
- Building your initial test set (humans label the correct answers)
- Spot-checking LLM-as-judge reliability (do humans agree with the judge?)
- Evaluating when the stakes are high and automated evals are not reliable enough
The practical approach: build a lightweight internal annotation tool (a simple web form or even a Google Sheet) that shows an evaluator the input, the AI output, and asks for a score or pass/fail judgment. Sample 50-100 outputs per week across your production traffic for ongoing quality monitoring.
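The weekly sampling step can be a few lines of code: draw a random sample from your production logs and write it to a CSV that an annotator can score in any spreadsheet. A sketch, assuming each log entry has "input" and "output" keys (adapt the keys to your logging schema):

```python
import csv
import random

def export_annotation_sample(production_logs: list[dict], n: int = 75,
                             path: str = "to_annotate.csv", seed: int = 0) -> int:
    """Randomly sample production interactions into a CSV for human scoring.
    A fixed seed makes the sample reproducible for a given log snapshot."""
    rng = random.Random(seed)
    sample = rng.sample(production_logs, min(n, len(production_logs)))
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["input", "ai_output", "score_1_to_5", "notes"])
        for entry in sample:
            writer.writerow([entry["input"], entry["output"], "", ""])
    return len(sample)
```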
Building an Eval Harness
An eval harness is the infrastructure that makes running evals practical: loading test cases, running the model, collecting results, and reporting.
Minimal Eval Harness Structure
```python
import json
import time
from pathlib import Path
from datetime import datetime

class EvalHarness:
    def __init__(self, eval_name: str, test_cases_path: str):
        self.eval_name = eval_name
        self.test_cases = self._load_test_cases(test_cases_path)
        self.results = []

    def _load_test_cases(self, path: str) -> list[dict]:
        with open(path) as f:
            return json.load(f)

    def run(self, model_fn, eval_fn) -> dict:
        """Run all test cases through the model and eval function."""
        start_time = time.time()
        for case in self.test_cases:
            case_start = time.time()
            try:
                model_output = model_fn(case["input"])
                eval_result = eval_fn(model_output, case)
                self.results.append({
                    "case_id": case.get("id", "unknown"),
                    "input": case["input"],
                    "expected": case.get("expected"),
                    "output": model_output,
                    "passed": eval_result["passed"],
                    "latency_ms": int((time.time() - case_start) * 1000),
                    **{k: v for k, v in eval_result.items() if k != "passed"}
                })
            except Exception as e:
                self.results.append({
                    "case_id": case.get("id", "unknown"),
                    "passed": False,
                    "error": str(e)
                })
        return self._summarize(time.time() - start_time)

    def _summarize(self, total_time: float) -> dict:
        passed = sum(1 for r in self.results if r["passed"])
        return {
            "eval_name": self.eval_name,
            "timestamp": datetime.now().isoformat(),
            "total": len(self.results),
            "passed": passed,
            "failed": len(self.results) - passed,
            "pass_rate": passed / len(self.results),
            "total_time_s": round(total_time, 2),
            "avg_latency_ms": sum(r.get("latency_ms", 0) for r in self.results) / len(self.results),
            "failures": [r for r in self.results if not r["passed"]]
        }

    def save_results(self, output_dir: str = "eval_results"):
        Path(output_dir).mkdir(exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        path = f"{output_dir}/{self.eval_name}_{timestamp}.json"
        with open(path, "w") as f:
            json.dump({"summary": self._summarize(0), "results": self.results}, f, indent=2)
        return path
```
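Wiring it together: `model_fn` takes a test case's input and returns the model output, and `eval_fn` takes that output plus the full case and returns a dict with at least a `"passed"` key. A sketch with a hypothetical keyword classifier standing in for the real LLM call:

```python
import json
import tempfile

# Hypothetical stand-in for the real model call.
def model_fn(text: str) -> str:
    return "complaint" if "unacceptable" in text.lower() else "inquiry"

# eval_fn receives the model output and the full test case, and must
# return a dict with at least a "passed" key -- the contract run() expects.
def eval_fn(model_output: str, case: dict) -> dict:
    return {"passed": model_output.strip().lower() == case["expected_label"]}

cases = [
    {"id": "c1", "input": "This is unacceptable.", "expected_label": "complaint"},
    {"id": "c2", "input": "Checking on my order.", "expected_label": "inquiry"},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(cases, f)
    cases_path = f.name

# summary = EvalHarness("intent_classification", cases_path).run(model_fn, eval_fn)
# print(summary["pass_rate"])
```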
Test Case Format
Store test cases as JSON files in your repository:
```json
[
    {
        "id": "classification_001",
        "input": "My order hasn't arrived after 3 weeks. This is unacceptable.",
        "expected_label": "complaint",
        "expected_urgency": "high",
        "notes": "Clear complaint, time-sensitive"
    },
    {
        "id": "classification_002",
        "input": "Just checking in on my order status from last Tuesday.",
        "expected_label": "inquiry",
        "expected_urgency": "low",
        "notes": "Neutral inquiry, no distress signals"
    }
]
```
Version these files in your repository alongside your prompts. When you change a prompt, run the evals and commit the results together. This gives you a history of how your changes affected quality.
What to Measure
Not all metrics matter equally for every application. Here is what to prioritize.
For any LLM application:
- Pass rate on your test suite (absolute number, tracked over time)
- Latency percentiles (p50, p95, p99) — LLM applications have long tail latency
- Error rate (API errors, parsing failures, timeouts)
For classification and extraction tasks:
- Accuracy on labeled test cases
- Precision and recall per class (not just overall accuracy)
- False positive and false negative rates on critical classes
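Per-class precision and recall fall out of the (expected, predicted) pairs your eval already produces, with no extra dependency. A minimal sketch:

```python
from collections import defaultdict

def per_class_metrics(pairs: list[tuple[str, str]]) -> dict:
    """pairs is a list of (expected_label, predicted_label) tuples.
    Returns {label: {"precision": ..., "recall": ...}}."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for expected, predicted in pairs:
        if expected == predicted:
            tp[expected] += 1
        else:
            fp[predicted] += 1  # predicted this label when it was something else
            fn[expected] += 1   # missed this label
    labels = set(tp) | set(fp) | set(fn)
    return {
        label: {
            "precision": tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0,
            "recall": tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0,
        }
        for label in labels
    }
```

A class that looks fine in overall accuracy can still have terrible recall if it is rare, which is exactly the case this breakdown catches.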
For generation tasks (summaries, responses, drafts):
- LLM-as-judge scores on your quality dimensions
- Human agreement rate with your automated eval
- Specific failure mode rates (hallucination, off-topic, wrong format)
For agentic applications:
- Task completion rate (does the agent complete the task it was given?)
- Step efficiency (how many LLM calls does it take on average?)
- Error recovery rate (when a tool fails, does it recover correctly?)
Common Pitfalls
Building a test set from the same distribution as your training prompts. If you design your prompts with certain examples in mind, and then evaluate against those same examples, you will overestimate performance on real user inputs. Include examples from actual user interactions as soon as you have them.
Ignoring the failure analysis. Aggregate pass rates are a lagging indicator. The value of evals is in the failure cases: read every failure, understand why it failed, and fix the specific issue. A 90% pass rate with 10% mysterious failures is worse than an 85% pass rate with fully understood failure modes.
Running evals manually and rarely. Evals only work if they are run continuously and automatically. Integrate them into your CI/CD pipeline so they run on every prompt change. A pull request that includes a prompt change should show the eval results.
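A minimal CI gate can compare the latest harness summary against a recorded baseline and fail the build on regression. A sketch; it assumes the `{"summary": {...}}` file layout that the harness's save_results writes, and a small tolerance so normal run-to-run noise does not block merges:

```python
import json
import sys

def gate(summary_path: str, baseline_pass_rate: float, tolerance: float = 0.02) -> int:
    """Exit-code style gate for CI: fail the build if the pass rate dropped
    more than `tolerance` below the recorded baseline."""
    with open(summary_path) as f:
        summary = json.load(f)["summary"]
    if summary["pass_rate"] < baseline_pass_rate - tolerance:
        print(f"FAIL: pass rate {summary['pass_rate']:.2%} below baseline {baseline_pass_rate:.2%}")
        return 1
    print(f"OK: pass rate {summary['pass_rate']:.2%}")
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1], float(sys.argv[2])))
```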
Only measuring what is easy to measure. Exact match accuracy is easy to measure. Helpfulness is not. If helpfulness matters for your application, invest in an eval approach that measures it even if it is harder to automate. Measuring the wrong thing with high precision is worse than measuring the right thing approximately.
Starting Simple
You do not need a perfect eval system before you deploy. You need a minimum viable eval system that is better than nothing.
Start here: collect 20 examples from your domain (the harder and more diverse, the better). Label the correct output for each. Write a script that runs your current prompt against all 20 examples and reports how many pass. Run it before and after every prompt change.
That is it. A 20-case test suite run manually is infinitely better than nothing. It catches obvious regressions and forces you to define what “correct” means for your task. Build from there.
The teams with the best LLM applications are not the ones with the cleverest models or the most sophisticated prompts. They are the ones who have built the tightest feedback loop between production behavior and product improvement. Evals are that feedback loop.