- Chain-of-thought helps most for multi-step reasoning tasks; specifying the exact reasoning dimensions (not just 'think step by step') produces more consistent results
- Always validate LLM structured outputs programmatically with Pydantic; even reliable models produce invalid JSON in edge cases that will break production systems
- Three diverse few-shot examples covering different cases outperform ten similar examples — diversity teaches the model the general pattern, not one specific case
- Agent system prompts need explicit uncertainty handling instructions; without them, models hallucinate answers rather than admitting they cannot complete a task
- Sycophancy is a real production risk: counter it with explicit instructions to identify weaknesses and by asking the model to steelman opposing views before concluding
Everyone knows to tell the model to “think step by step.” That was 2022. In 2026, the basics are table stakes, and the developers extracting the most value from LLMs are using techniques that go well beyond the starter guides. This article covers what actually works: the patterns that experienced LLM engineers use in production, the failure modes they have learned to avoid, and the reasoning behind why these techniques work rather than just a list of tips to copy.
The framing matters: prompt engineering is not magic. It is communication design. The better you understand what the model is doing when it processes your prompt, the better your prompts get.
## Chain-of-Thought for Complex Reasoning Tasks
Chain-of-thought (CoT) prompting is well-known, but most developers underuse it. The pattern is powerful but its power is specific: it helps when the correct answer requires intermediate steps that the model would otherwise try to skip.
### When CoT Actually Helps
CoT is most valuable for:
- Multi-step mathematical or logical reasoning
- Tasks where the answer depends on intermediate conclusions
- Classification tasks where the correct category requires evaluating multiple criteria
- Code debugging where identifying the root cause requires tracing execution
CoT does not help much for:
- Factual retrieval (adding CoT to “what year was Paris founded” does nothing useful)
- Simple classification with clear patterns
- Tasks where the model already reliably gets the answer right
### Zero-Shot CoT vs. Instructed Reasoning
“Think step by step” (zero-shot CoT) works. But for production use, instructed reasoning often works better because you control the structure:
```
Analyze this customer support ticket and classify its urgency.

Reasoning process:
1. First identify the specific problem the customer is describing
2. Assess whether there is data loss or system unavailability involved
3. Check whether the customer is describing a workaround or is completely blocked
4. Determine the business impact based on the account tier mentioned

After completing this analysis, provide your urgency classification: Critical, High, Medium, or Low.
```
The difference: you are not just asking the model to reason, you are specifying the dimensions of reasoning that matter. This produces more consistent results because you have defined what good reasoning looks like for your task.
### Scratchpad Pattern
For complex tasks, give the model explicit permission to use a scratchpad before producing its final answer:
```
You may think through this problem in a <scratchpad> section before giving your final answer.
Your scratchpad is for working through the problem; only the content after </scratchpad> will be shown to the user.
```
This is particularly useful when you want the model to reason thoroughly but produce a clean, concise final output.
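The scratchpad only stays hidden if your application actually strips it before displaying the response. A small sketch of that post-processing step (function name is illustrative):

```python
import re

def strip_scratchpad(response_text: str) -> str:
    """Remove <scratchpad>...</scratchpad> sections so only the
    final answer reaches the user.

    re.DOTALL lets the pattern match scratchpads that span
    multiple lines.
    """
    cleaned = re.sub(
        r"<scratchpad>.*?</scratchpad>", "", response_text, flags=re.DOTALL
    )
    return cleaned.strip()
```

The non-greedy `.*?` matters: a greedy match would swallow everything between the first opening tag and the last closing tag if the model ever emits two scratchpad sections.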
## Structured Output Extraction That Does Not Fail
Extracting structured data from unstructured text is one of the most common LLM use cases and one of the most common sources of production failures. Here is how to make it reliable.
### The Schema Definition Pattern
Define the output schema explicitly in the prompt, with examples of valid and invalid values:
```
Extract the following fields from the job posting below. Return valid JSON only, with no additional text.

Schema:
{
  "title": string,             // Job title, e.g., "Senior Software Engineer"
  "company": string,           // Company name
  "location": string | null,   // City, State or "Remote" or null if not specified
  "salary_min": number | null, // Minimum salary in USD, null if not specified
  "salary_max": number | null, // Maximum salary in USD, null if not specified
  "required_years": number | null, // Minimum years of experience, null if not specified
  "remote_friendly": boolean   // true if remote work is mentioned as an option
}

Job posting:
{job_posting_text}
```
The inline comments serve a purpose: they define the type, give an example, and specify the null condition. Models follow this pattern reliably.
### Validate Outputs Programmatically
Never trust that LLM output will be valid JSON, even if the model usually produces it. Always wrap parsing in a try-except and handle failures explicitly:
```python
import json
import logging

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

class JobPosting(BaseModel):
    title: str
    company: str
    location: str | None
    salary_min: float | None
    salary_max: float | None
    required_years: int | None
    remote_friendly: bool

def extract_job_posting(text: str) -> JobPosting | None:
    # `llm` and `extraction_prompt` are assumed to be defined elsewhere
    response = llm.invoke(extraction_prompt.format(job_posting_text=text))
    try:
        # Strip any markdown code fences if present
        content = response.content.strip()
        if content.startswith("```"):
            content = content.split("```")[1]
            if content.startswith("json"):
                content = content[4:]
        data = json.loads(content)
        return JobPosting(**data)
    except (json.JSONDecodeError, ValidationError) as e:
        # Log the error and the raw response for debugging
        logger.error(f"Failed to parse job posting: {e}, raw: {response.content}")
        return None
```
OpenAI’s structured outputs feature, which accepts a Pydantic model directly, makes this cleaner if you are on that API. For other providers, this validation pattern is essential.
### Two-Pass Extraction for Difficult Documents
For documents where extraction reliability is critical, use a two-pass approach:
- First pass: extract the raw data
- Second pass: validate and correct the extraction using a verification prompt
```
I extracted the following data from a job posting.
Please verify this extraction is correct and fix any obvious errors.

Original text: {original_text}
Extracted data: {extracted_json}

Return the corrected JSON, or the same JSON unchanged if it is correct.
```
This adds latency and cost but substantially improves accuracy on messy real-world documents.
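One way to wire the second pass, sketched in provider-agnostic form: `verify_extraction` and the injected `call_llm` callable are illustrative names, and the fallback-to-first-pass behavior is a design choice, not a requirement.

```python
import json
from typing import Callable

# Mirrors the verification prompt shown above.
VERIFY_PROMPT = (
    "I extracted the following data from a job posting.\n"
    "Please verify this extraction is correct and fix any obvious errors.\n\n"
    "Original text: {original_text}\n"
    "Extracted data: {extracted_json}\n\n"
    "Return the corrected JSON, or the same JSON unchanged if it is correct."
)

def verify_extraction(
    original_text: str,
    extracted: dict,
    call_llm: Callable[[str], str],
) -> dict:
    """Second pass: ask the model to check the first-pass extraction.

    `call_llm` is any function that takes a prompt string and returns
    the model's text response, which also makes this easy to unit-test
    with a stub. If the verification output cannot be parsed, fall
    back to the first-pass result rather than losing data.
    """
    prompt = VERIFY_PROMPT.format(
        original_text=original_text,
        extracted_json=json.dumps(extracted),
    )
    try:
        return json.loads(call_llm(prompt).strip())
    except (json.JSONDecodeError, ValueError):
        return extracted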
## Few-Shot Examples for Consistent Formatting
Few-shot examples are the most underused tool in the prompt engineering toolkit. They are particularly powerful for establishing consistent formatting and tone across outputs.
### Why Few-Shot Works
The model is looking for patterns. When you provide examples that follow a specific format, you are telling the model: “this is the format I want, replicate it.” This is often more effective than describing the format in words, especially for nuanced formatting requirements.
### Designing Good Few-Shot Examples
A few principles that matter:
**Diversity over quantity.** Three diverse examples covering different cases are more valuable than ten similar examples. The model learns the general pattern, not just how to handle one specific input type.

**Real examples beat synthetic ones.** Use actual inputs from your domain. The more realistic the examples, the better the pattern transfer.

**Show your failure modes.** If the model tends to add preamble text you do not want, include an example where the input might tempt it to add preamble, and show the correct output without it.

**Consistent formatting in examples.** The model will replicate inconsistencies in your examples. If your three examples have three different output formats, the model will pick one arbitrarily.
### Example: Consistent Changelog Generation
```
Generate a changelog entry for this code diff.

Example 1:
Diff: Added null check before accessing user.email
Changelog: fix: prevent crash when user email is null

Example 2:
Diff: Updated payment processing to use Stripe API v3, removed legacy PayPal integration
Changelog: feat: migrate payment processing to Stripe v3, remove PayPal integration

Example 3:
Diff: Changed button color from #333 to #6c63ff
Changelog: style: update primary button color to brand purple

Now generate a changelog entry for this diff:
{diff_content}
```
The model learns: conventional commit format, concise phrasing, verb tense, level of detail. None of this was described explicitly.
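In production it helps to keep the examples as data rather than baked into a template string, so you can add, swap, or A/B-test them without editing prose. A minimal sketch (the `build_changelog_prompt` helper is a hypothetical name):

```python
# Few-shot examples as (diff summary, changelog entry) pairs,
# taken from the prompt above.
FEW_SHOT_EXAMPLES = [
    ("Added null check before accessing user.email",
     "fix: prevent crash when user email is null"),
    ("Updated payment processing to use Stripe API v3, removed legacy PayPal integration",
     "feat: migrate payment processing to Stripe v3, remove PayPal integration"),
    ("Changed button color from #333 to #6c63ff",
     "style: update primary button color to brand purple"),
]

def build_changelog_prompt(diff_content: str) -> str:
    """Assemble the few-shot prompt from the examples list, keeping
    the Example N / Diff / Changelog format identical across entries."""
    parts = ["Generate a changelog entry for this code diff.\n"]
    for i, (diff, changelog) in enumerate(FEW_SHOT_EXAMPLES, start=1):
        parts.append(f"Example {i}:\nDiff: {diff}\nChangelog: {changelog}\n")
    parts.append(f"Now generate a changelog entry for this diff:\n{diff_content}")
    return "\n".join(parts)
```

Because every example goes through the same f-string, the formatting consistency the section calls for is enforced mechanically rather than by hand-editing.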
## System Prompt Design for Agents
Agent system prompts are fundamentally different from regular completion prompts. An agent will use the system prompt as its operating context for many turns, many tool calls, and potentially unexpected situations. Here is what a production agent system prompt needs.
### The Core Components
A well-structured agent system prompt contains:
**Identity and role:** What the agent is, not just what it does. This shapes how it handles ambiguous situations.

**Capabilities inventory:** What tools and capabilities the agent has. Be explicit. If it does not appear in this list, do not assume the model will remember it.

**Behavioral constraints:** What the agent should not do, phrased as clear rules. Models follow explicit constraints better than they infer constraints from general guidance.

**Output format defaults:** How the agent should format responses when there is no specific instruction.

**Uncertainty handling:** Explicit instruction on what to do when the agent does not know something or cannot complete a task. Without this, models often hallucinate a response rather than saying they cannot help.
### Example Structure
```
You are a customer support agent for Acme Corp. You help customers with billing questions, account management, and product usage.

## Your capabilities
- Look up account information using the lookup_account tool
- Create and update support tickets using the ticket_management tool
- Search the product documentation using the docs_search tool
- Escalate issues to human agents using the escalate tool

## Rules
- Never share one customer's account information with another customer
- Do not make commitments about refunds or credits without verifying eligibility first
- If you cannot resolve an issue with your available tools, escalate to a human agent
- Do not discuss competitor products

## When you do not know something
Say "I don't have information about that" and offer to search the documentation or escalate to a human agent. Do not guess or fabricate information.

## Response format
Be concise. Use the customer's name when you have it. Use bullet points for lists of steps or options.
```
### What Not to Put in a System Prompt
Avoid putting very long examples in the system prompt for agents. Every token in the system prompt counts against your context window and is processed on every call. Keep examples in a few-shot block attached to specific tool calls, not in the global system prompt.
Avoid over-specifying behavior for edge cases you have not thought through. A system prompt with 200 rules is harder for the model to follow than one with 20 clear rules. Start minimal and add constraints as you discover the specific failure modes you need to address.
## Avoiding Common Failure Modes
### The Instruction-Following Decay Problem
In long conversations and complex workflows, models drift from their original instructions. They start taking shortcuts, softening constraints, or forgetting specified formats. This is especially pronounced after many turns.
The fix: re-anchor the model periodically by including key constraints in the user turn as well as the system prompt. For critical constraints, add them at the bottom of the system prompt (“Always remember: [rule]”) and consider a periodic reminder injection in long conversations.
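A periodic reminder injection can be a simple pre-processing step on the message history. A minimal sketch, assuming a chat-style list of `{"role", "content"}` dicts; note that whether a mid-conversation system message or an addendum to the user turn works better varies by provider, and the turn threshold is an arbitrary choice:

```python
def inject_reminder(
    messages: list[dict],
    reminder: str,
    every_n_turns: int = 10,
) -> list[dict]:
    """Append a reminder of key constraints once every N user turns,
    counteracting instruction drift in long conversations.

    Returns a new list; the original history is not mutated.
    """
    user_turns = sum(1 for m in messages if m["role"] == "user")
    if user_turns > 0 and user_turns % every_n_turns == 0:
        return messages + [{"role": "system", "content": reminder}]
    return messages
```

Call this on the history right before each model invocation; because it counts user turns rather than raw messages, tool-call messages do not skew the cadence.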
### The Sycophancy Trap
Models are trained to be helpful, which can manifest as telling users what they want to hear rather than what is accurate. This is particularly dangerous for evaluation tasks (having the model grade work), research tasks (having it critique an argument you authored), or decision-support tasks.
Counter-sycophancy with explicit instructions:
```
Evaluate this business plan critically. I want you to identify real weaknesses and risks, not just summarize the strengths.
Be honest even if the feedback is discouraging. I will make a better decision with accurate information than with false encouragement.
```
Also: ask the model to “steelman the opposing view” before concluding. This forces engagement with counterarguments.
### The Verbose-When-Concise-Is-Needed Problem
Models default to comprehensive answers. For production use cases where concise output matters (summaries, labels, classifications), be specific about length:
- Not: “Summarize this article briefly.”
- Better: “Summarize this article in exactly two sentences.”
- Even better: “Summarize this article in exactly two sentences. Do not include introductory phrases like ‘This article’ or ‘The piece.’”
### The Ambiguous Instruction Problem
When a prompt can be interpreted multiple ways, the model will pick one interpretation, often not the one you intended. If you are seeing inconsistent outputs on similar inputs, the problem is usually ambiguity in the prompt.
The fix: identify the ambiguity (what question could a reasonable person read differently?) and resolve it explicitly. Add a clarifying sentence or an example that demonstrates the correct interpretation.
## Putting It Together
The most effective prompt engineering practice is not about memorizing techniques. It is about building a feedback loop: write a prompt, test it against 20-30 real examples from your domain, identify failure cases, diagnose why they fail, and fix the specific issue.
Most prompt failures fall into a small number of categories: ambiguous instructions, missing constraints for edge cases, no examples for a non-obvious format requirement, or the task requiring reasoning that needs to be made explicit. When you identify which category your failure falls into, the fix becomes obvious.
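The test-and-diagnose loop is easy to automate. A minimal sketch of an evaluation harness (`evaluate_prompt` and the injected `run_prompt` callable are illustrative names; exact-match comparison is a simplification that fits labels and classifications, not free-form summaries):

```python
from typing import Callable

def evaluate_prompt(
    examples: list[tuple[str, str]],
    run_prompt: Callable[[str], str],
) -> tuple[float, list[tuple[str, str, str]]]:
    """Run a prompt over (input, expected_output) pairs and collect
    failures for diagnosis.

    Returns (accuracy, failures), where each failure is a
    (input, expected, actual) triple you can inspect to decide which
    failure category it falls into.
    """
    failures = []
    for input_text, expected in examples:
        actual = run_prompt(input_text)
        if actual.strip() != expected.strip():
            failures.append((input_text, expected, actual))
    accuracy = 1 - len(failures) / len(examples) if examples else 0.0
    return accuracy, failures
```

Run this after every prompt change against the same 20-30 examples: the failure list, not the accuracy number, is where the diagnosis happens.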
The techniques in this article are the tools. Your domain knowledge and testing discipline are what make them work.