Most prompt engineering advice is written for demos. It optimises for the impressive output screenshot, not the system that needs to work reliably 50,000 times a day with real user inputs that were never anticipated in the notebook.

This is what we have learned shipping LLM features to production across 13 client products -- the patterns that hold up and the ones that do not.

Pattern 1: Structured Output Over Freeform Text

If your application needs to parse the LLM's response, make the model return structured data, not prose you then try to extract it from. Freeform output is non-deterministic in structure. JSON output constrained by a schema is not.

Use the response_format parameter (OpenAI) or structured output features (Claude) to enforce a schema at the API level. Define the schema with Pydantic or Zod so your application code receives typed data, not strings.

from pydantic import BaseModel
from openai import OpenAI

class ClassificationResult(BaseModel):
    category: str
    confidence: float
    reasoning: str

ticket_text = "My invoice from March was charged twice."  # example input

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-11-20",
    messages=[
        {"role": "system", "content": "Classify the support ticket."},
        {"role": "user", "content": ticket_text},
    ],
    response_format=ClassificationResult,
)
result = completion.choices[0].message.parsed  # a typed ClassificationResult

This pattern eliminates an entire category of production bugs: response parsing failures that crash your pipeline at 2am.

Pattern 2: Chain-of-Thought for Complex Reasoning

For tasks that require multi-step reasoning -- contract analysis, medical coding, complex classification -- asking the model to reason before answering consistently improves accuracy. The mechanism is simple: instruct the model to think through the problem step-by-step before producing the final output.

system_prompt = """
You are a contract analysis assistant.

When analysing a contract clause, follow this process:
1. Identify what obligation or right the clause establishes
2. Identify which party bears the obligation or holds the right
3. Identify any conditions or exceptions
4. Assess the risk level for our client (low / medium / high)
5. Produce your final structured assessment

Always show your reasoning before your final assessment.
"""

The reasoning steps are not just for accuracy -- they are an audit trail. In regulated industries, being able to show why a model made a decision is as important as the decision itself.
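One way to make that audit trail usable in application code is to have the prompt mark where the reasoning ends, then split on that marker. A minimal sketch, assuming the prompt instructs the model to emit a FINAL ASSESSMENT: marker (the marker is our assumption, not part of the prompt above) before its conclusion:

```python
def split_reasoning(output: str, marker: str = "FINAL ASSESSMENT:") -> tuple[str, str]:
    """Split a chain-of-thought response into (reasoning, assessment).

    Assumes the system prompt tells the model to emit `marker` before its
    final assessment. The reasoning half can go to an audit log while only
    the assessment is shown to the user.
    """
    reasoning, _, assessment = output.partition(marker)
    return reasoning.strip(), assessment.strip()
```

If the marker is missing, `partition` leaves the whole output in the reasoning half, which is itself a useful signal that the model ignored the instructed format.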

Pattern 3: Self-Critique Loops

For high-stakes outputs (medical summaries, legal drafts, customer-facing content), a single LLM pass is insufficient. A self-critique loop runs the output through a second prompt that reviews it against a checklist before it reaches the user.

def generate_with_critique(draft_prompt, critique_criteria):
    # `llm` is your model client wrapper: generate() returns text,
    # generate_structured() parses the response into a typed object.

    # First pass: generate
    draft = llm.generate(draft_prompt)

    # Second pass: critique the draft against an explicit checklist
    critique_prompt = f"""
Review the following output against these criteria:
{critique_criteria}

Output to review:
{draft}

Return: {{ "passes": true/false, "issues": [...], "revised_output": "..." }}
"""
    result = llm.generate_structured(critique_prompt)

    if result.passes:
        return draft
    # The draft failed review -- ship the critic's revision instead
    return result.revised_output

This adds latency -- typically 1-3 seconds -- but for outputs where a single error damages user trust, the trade-off is worth it.

Pattern 4: Explicit Failure Modes in System Prompts

Every system prompt should specify what to do when the model cannot or should not answer. Leaving this implicit means the model will improvise -- and LLM improvisation is not what you want in a production system.

If the user's request falls outside your defined scope:
- Do NOT attempt to answer
- Say exactly: "This is outside what I can help with here."
- Offer the escalation path: "For this, please contact [email protected]"

If you are uncertain about a fact:
- Say so explicitly: "I'm not certain about this -- please verify with [source]."
- Do NOT guess or extrapolate
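Pinning the exact refusal wording pays off on the application side: code can match the sentinel string instead of guessing the model's intent. A minimal sketch, assuming the refusal phrasing above (the function name and routing labels are hypothetical):

```python
# Sentinel string pinned by the system prompt's out-of-scope instruction.
OUT_OF_SCOPE_SENTINEL = "This is outside what I can help with here."

def route_response(model_output: str) -> str:
    """Decide what to do with a model response.

    Because the prompt mandates the exact refusal wording, a substring
    match is enough to trigger the escalation path deterministically.
    """
    if OUT_OF_SCOPE_SENTINEL in model_output:
        return "escalate"
    return "deliver"
```

This is the same idea as structured output applied to refusals: a fixed string is parseable; an improvised apology is not.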

Pattern 5: Versioned Prompts as Code

Prompts are code. They need version control, change review, and rollback capability. We store all production system prompts in the application repository alongside the code that calls them. Prompt changes go through the same pull request process as code changes.

This sounds obvious. Most teams do not do it. The result is prompts that drift over time, regressions that cannot be attributed to a specific change, and no way to roll back a prompt that started producing bad outputs after a model update.
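A small loader makes this concrete: read the prompt from a file in the repo and attach a content hash to every call, so a regression can be attributed to an exact prompt version. A sketch (the function name and file layout are assumptions, not a prescribed convention):

```python
import hashlib
from pathlib import Path

def load_prompt(path: str) -> tuple[str, str]:
    """Load a version-controlled prompt file.

    Returns (prompt_text, short_hash). Logging the hash alongside each
    LLM call ties every output to the exact prompt revision that
    produced it -- which is what makes rollback and attribution possible.
    """
    text = Path(path).read_text(encoding="utf-8")
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    return text, digest
```

Because the prompt lives next to the code that calls it, a `git blame` on the file answers "what changed?" the same way it does for any other regression.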

The Pattern Most Teams Miss

Evaluation. You cannot improve what you do not measure. Build a golden dataset of 50-100 representative inputs and expected outputs before you ship. Run every prompt change against this dataset. This is the difference between "we think this is better" and "this is 12% more accurate than the previous version."
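The harness itself can be tiny. A sketch of the core loop, assuming the golden dataset is a list of input/expected records and your LLM call is wrapped as a plain function (all names here are hypothetical):

```python
def run_eval(golden: list[dict], classify) -> float:
    """Score a classifier against a golden dataset and return accuracy.

    `golden` holds {"input": ..., "expected": ...} records; `classify`
    is the LLM call wrapped as a plain function, so the harness stays
    model- and prompt-agnostic. Run this on every prompt change.
    """
    correct = sum(
        1 for case in golden if classify(case["input"]) == case["expected"]
    )
    return correct / len(golden)
```

Comparing this number before and after a prompt change is what turns "we think this is better" into a measured claim.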

Prompt engineering without evaluation is intuition. Prompt engineering with evaluation is engineering.