Everyone knows to tell the model to “think step by step.” That was 2022. In 2026, the basics are table stakes, and the developers extracting the most value from LLMs are using techniques that go well beyond the starter guides. This article covers what actually works: the patterns that experienced LLM engineers use in production, the failure modes they have learned to avoid, and why these techniques work, rather than just a list of tips to copy.

The framing matters: prompt engineering is not magic. It is communication design. The better you understand what the model is doing when it processes your prompt, the better your prompts get.


Chain-of-Thought for Complex Reasoning Tasks

Chain-of-thought (CoT) prompting is well-known, but most developers underuse it. The pattern is powerful but its power is specific: it helps when the correct answer requires intermediate steps that the model would otherwise try to skip.

When CoT Actually Helps

CoT is most valuable for:

  • Multi-step mathematical or logical reasoning
  • Tasks where the answer depends on intermediate conclusions
  • Classification tasks where the correct category requires evaluating multiple criteria
  • Code debugging where identifying the root cause requires tracing execution

CoT does not help much for:

  • Factual retrieval (adding CoT to “what year was Paris founded” does nothing useful)
  • Simple classification with clear patterns
  • Tasks where the model already reliably gets the answer right

Zero-Shot CoT vs. Instructed Reasoning

“Think step by step” (zero-shot CoT) works. But for production use, instructed reasoning often works better because you control the structure:

Analyze this customer support ticket and classify its urgency.

Reasoning process:
1. First identify the specific problem the customer is describing
2. Assess whether there is data loss or system unavailability involved
3. Check whether the customer is describing a workaround or is completely blocked
4. Determine the business impact based on the account tier mentioned

After completing this analysis, provide your urgency classification: Critical, High, Medium, or Low.

The difference: you are not just asking the model to reason, you are specifying the dimensions of reasoning that matter. This produces more consistent results because you have defined what good reasoning looks like for your task.

Scratchpad Pattern

For complex tasks, give the model explicit permission to use a scratchpad before producing its final answer:

You may think through this problem in a <scratchpad> section before giving your final answer.
Your scratchpad is for working through the problem; only the content after </scratchpad> will be shown to the user.

This is particularly useful when you want the model to reason thoroughly but produce a clean, concise final output.
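On the application side, you then strip the scratchpad before displaying the response. A minimal sketch, assuming the model follows the tag convention from the prompt above:

```python
def strip_scratchpad(response_text: str) -> str:
    """Return only the content after the closing </scratchpad> tag.
    If no scratchpad is present, return the response unchanged."""
    _, sep, after = response_text.partition("</scratchpad>")
    return after.strip() if sep else response_text.strip()
```

Handling the no-scratchpad case matters: models occasionally skip the scratchpad on easy inputs, and you do not want to return an empty string when they do.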


Structured Output Extraction That Does Not Fail

Extracting structured data from unstructured text is one of the most common LLM use cases and one of the most common sources of production failures. Here is how to make it reliable.

The Schema Definition Pattern

Define the output schema explicitly in the prompt, with examples of valid and invalid values:

Extract the following fields from the job posting below. Return valid JSON only, with no additional text.

Schema:
{
  "title": string,           // Job title, e.g., "Senior Software Engineer"
  "company": string,         // Company name
  "location": string | null, // City, State or "Remote" or null if not specified
  "salary_min": number | null, // Minimum salary in USD, null if not specified
  "salary_max": number | null, // Maximum salary in USD, null if not specified
  "required_years": number | null, // Minimum years of experience, null if not specified
  "remote_friendly": boolean  // true if remote work is mentioned as an option
}

Job posting:
{job_posting_text}

The inline comments serve a purpose: they define the type, give an example, and specify the null condition. Models follow this pattern reliably.

Validate Outputs Programmatically

Never trust that LLM output will be valid JSON, even if the model usually produces it. Always wrap parsing in a try-except and handle failures explicitly:

import json
from pydantic import BaseModel, ValidationError

class JobPosting(BaseModel):
    title: str
    company: str
    location: str | None
    salary_min: float | None
    salary_max: float | None
    required_years: int | None
    remote_friendly: bool

def extract_job_posting(text: str) -> JobPosting | None:
    response = llm.invoke(extraction_prompt.format(job_posting_text=text))

    try:
        # Strip any markdown code fences if present
        content = response.content.strip()
        if content.startswith("```"):
            content = content.split("```")[1]
            if content.startswith("json"):
                content = content[4:]

        data = json.loads(content)
        return JobPosting(**data)
    except (json.JSONDecodeError, ValidationError) as e:
        # Log the error and the raw response for debugging
        logger.error(f"Failed to parse job posting: {e}, raw: {response.content}")
        return None

If you are on OpenAI’s API, its structured outputs feature accepts a Pydantic model directly and makes this cleaner. On other APIs, this validation pattern is essential.

Two-Pass Extraction for Difficult Documents

For documents where extraction reliability is critical, use a two-pass approach:

  1. First pass: extract the raw data
  2. Second pass: validate and correct the extraction using a verification prompt

I extracted the following data from a job posting.
Please verify this extraction is correct and fix any obvious errors.

Original text: {original_text}
Extracted data: {extracted_json}

Return the corrected JSON, or the same JSON unchanged if it is correct.

This adds latency and cost but substantially improves accuracy on messy real-world documents.


Few-Shot Examples for Consistent Formatting

Few-shot examples are the most underused tool in the prompt engineering toolkit. They are particularly powerful for establishing consistent formatting and tone across outputs.

Why Few-Shot Works

The model is looking for patterns. When you provide examples that follow a specific format, you are telling the model: “this is the format I want, replicate it.” This is often more effective than describing the format in words, especially for nuanced formatting requirements.

Designing Good Few-Shot Examples

A few principles that matter:

Diversity over quantity. Three diverse examples covering different cases are more valuable than ten similar examples. The model learns the general pattern, not just how to handle one specific input type.

Real examples beat synthetic ones. Use actual inputs from your domain. The more realistic the examples, the better the pattern transfer.

Show your failure modes. If the model tends to add preamble text you do not want, include an example where the input might tempt it to add preamble, and show the correct output without it.

Consistent formatting in examples. The model will replicate inconsistencies in your examples. If your three examples have three different output formats, the model will pick one arbitrarily.

Example: Consistent Changelog Generation

Generate a changelog entry for this code diff.

Example 1:
Diff: Added null check before accessing user.email
Changelog: fix: prevent crash when user email is null

Example 2:
Diff: Updated payment processing to use Stripe API v3, removed legacy PayPal integration
Changelog: feat: migrate payment processing to Stripe v3, remove PayPal integration

Example 3:
Diff: Changed button color from #333 to #6c63ff
Changelog: style: update primary button color to brand purple

Now generate a changelog entry for this diff:
{diff_content}

The model learns: conventional commit format, concise phrasing, verb tense, level of detail. None of this was described explicitly.
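Keeping the examples in data rather than hard-coded prompt strings makes them easy to swap as you discover failure modes. A sketch of assembling the prompt above programmatically; the function name and tuple layout are illustrative:

```python
# (diff summary, expected changelog line) pairs, drawn from real diffs
FEW_SHOT_EXAMPLES = [
    ("Added null check before accessing user.email",
     "fix: prevent crash when user email is null"),
    ("Updated payment processing to use Stripe API v3, removed legacy PayPal integration",
     "feat: migrate payment processing to Stripe v3, remove PayPal integration"),
    ("Changed button color from #333 to #6c63ff",
     "style: update primary button color to brand purple"),
]

def build_changelog_prompt(diff_content: str) -> str:
    parts = ["Generate a changelog entry for this code diff.\n"]
    for i, (diff, changelog) in enumerate(FEW_SHOT_EXAMPLES, start=1):
        parts.append(f"Example {i}:\nDiff: {diff}\nChangelog: {changelog}\n")
    parts.append(f"Now generate a changelog entry for this diff:\n{diff_content}")
    return "\n".join(parts)
```

Because the examples live in one list, adding a fourth example that demonstrates a failure mode is a one-line change, and every example is guaranteed to use identical formatting.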


System Prompt Design for Agents

Agent system prompts are fundamentally different from regular completion prompts. An agent will use the system prompt as its operating context for many turns, many tool calls, and potentially unexpected situations. Here is what a production agent system prompt needs.

The Core Components

A well-structured agent system prompt contains:

Identity and role: What the agent is, not just what it does. This shapes how it handles ambiguous situations.

Capabilities inventory: What tools and capabilities the agent has. Be explicit. If it does not appear in this list, do not assume the model will remember it.

Behavioral constraints: What the agent should not do, phrased as clear rules. Models follow explicit constraints better than they infer constraints from general guidance.

Output format defaults: How the agent should format responses when there is no specific instruction.

Uncertainty handling: Explicit instruction on what to do when the agent does not know something or cannot complete a task. Without this, models often hallucinate a response rather than saying they cannot help.

Example Structure

You are a customer support agent for Acme Corp. You help customers with billing questions, account management, and product usage.

## Your capabilities
- Look up account information using the lookup_account tool
- Create and update support tickets using the ticket_management tool
- Search the product documentation using the docs_search tool
- Escalate issues to human agents using the escalate tool

## Rules
- Never share one customer's account information with another customer
- Do not make commitments about refunds or credits without verifying eligibility first
- If you cannot resolve an issue with your available tools, escalate to a human agent
- Do not discuss competitor products

## When you do not know something
Say "I don't have information about that" and offer to search the documentation or escalate to a human agent. Do not guess or fabricate information.

## Response format
Be concise. Use the customer's name when you have it. Use bullet points for lists of steps or options.

What Not to Put in a System Prompt

Avoid putting very long examples in the system prompt for agents. Every token in the system prompt counts against your context window and is processed on every call. Keep examples in a few-shot block attached to specific tool calls, not in the global system prompt.

Avoid over-specifying behavior for edge cases you have not thought through. A system prompt with 200 rules is harder for the model to follow than one with 20 clear rules. Start minimal and add constraints as you discover the specific failure modes you need to address.


Avoiding Common Failure Modes

The Instruction-Following Decay Problem

In long conversations and complex workflows, models drift from their original instructions. They start taking shortcuts, softening constraints, or forgetting specified formats. This is especially pronounced after many turns.

The fix: re-anchor the model periodically by including key constraints in the user turn as well as the system prompt. For critical constraints, add them at the bottom of the system prompt (“Always remember: [rule]”) and consider a periodic reminder injection in long conversations.
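Periodic reminder injection can be sketched as below, assuming an OpenAI-style list of `{"role", "content"}` message dicts; the cadence and reminder text are illustrative, not a recommendation:

```python
# Hypothetical reminder text; in practice, restate your critical constraints.
REMINDER = ("Reminder of key constraints: respond in the specified format "
            "and never fabricate account data.")

def with_reanchoring(messages: list[dict], every_n_user_turns: int = 5) -> list[dict]:
    """Insert a system reminder after every N user turns so long
    conversations do not drift from the original instructions."""
    out, user_turns = [], 0
    for msg in messages:
        out.append(msg)
        if msg["role"] == "user":
            user_turns += 1
            if user_turns % every_n_user_turns == 0:
                out.append({"role": "system", "content": REMINDER})
    return out
```

Run the conversation history through this before each model call; the original messages are untouched, so the reminders never accumulate in your stored transcript.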

The Sycophancy Trap

Models are trained to be helpful, which can manifest as telling users what they want to hear rather than what is accurate. This is particularly dangerous for evaluation tasks (having the model grade work), research tasks (having it critique an argument you authored), or decision-support tasks.

Counter-sycophancy with explicit instructions:

Evaluate this business plan critically. I want you to identify real weaknesses and risks, not just summarize the strengths.
Be honest even if the feedback is discouraging. I will make a better decision with accurate information than with false encouragement.

Also: ask the model to “steelman the opposing view” before concluding. This forces engagement with counterarguments.

The Verbose-When-Concise-Is-Needed Problem

Models default to comprehensive answers. For production use cases where concise output matters (summaries, labels, classifications), be specific about length:

Not: “Summarize this article briefly.” Better: “Summarize this article in exactly two sentences.” Even better: “Summarize this article in exactly two sentences. Do not include introductory phrases like ‘This article’ or ‘The piece.’”

The Ambiguous Instruction Problem

When a prompt can be interpreted multiple ways, the model will pick one interpretation, often not the one you intended. If you are seeing inconsistent outputs on similar inputs, the problem is usually ambiguity in the prompt.

The fix: identify the ambiguity (what question could a reasonable person read differently?) and resolve it explicitly. Add a clarifying sentence or an example that demonstrates the correct interpretation.


Putting It Together

The most effective prompt engineering practice is not about memorizing techniques. It is about building a feedback loop: write a prompt, test it against 20-30 real examples from your domain, identify failure cases, diagnose why they fail, and fix the specific issue.
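That feedback loop is worth automating even at small scale. A minimal sketch, assuming an `llm` callable and a list of `(input, expected)` pairs drawn from your domain; exact-match scoring is a simplification you would replace for free-form outputs:

```python
def run_eval(llm, prompt_template: str, cases: list[tuple[str, str]]):
    """Run the prompt over labeled cases and collect failures for diagnosis."""
    failures = []
    for input_text, expected in cases:
        output = llm(prompt_template.format(input=input_text)).strip()
        if output != expected:
            failures.append({"input": input_text, "expected": expected, "got": output})
    print(f"{len(cases) - len(failures)}/{len(cases)} passed")
    return failures
```

The returned failures are the point: reading them is how you diagnose which category each failure falls into and what the prompt fix should be.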

Most prompt failures fall into a small number of categories: ambiguous instructions, missing constraints for edge cases, no examples for a non-obvious format requirement, or the task requiring reasoning that needs to be made explicit. When you identify which category your failure falls into, the fix becomes obvious.

The techniques in this article are the tools. Your domain knowledge and testing discipline are what make them work.