Prompt Engineering for LLMs: Best Practices
Prompt engineering is the practice of writing instructions that get reliable, high-quality outputs from language models. A well-crafted prompt is often the cheapest lever to try before fine-tuning or switching models. Research from 2025–2026 is reported to show expert prompts improving task accuracy by 20–40% over naive instructions, though gains of this kind vary widely by task and model. This guide teaches you to write prompts that elicit specific outputs, reduce hallucination, and behave consistently across thousands of API calls.
The Anatomy of an Effective Prompt
An effective prompt consists of four elements: role assignment, task definition, context, and output specification. The system message (the message whose role is "system") sets the assistant's persona and constraints; the user message contains the actual request. A comparison of weak vs. strong prompts illustrates why structure matters:
| Weak Prompt | Strong Prompt |
|---|---|
| "Explain Python." | "You are a Python educator teaching absolute beginners. Explain list comprehensions in exactly 100 words, using one concrete example. Output only the explanation, no preamble." |
| "Summarize this text." | "Summarize the following customer support transcript in 2–3 bullet points, extracting only factual issues and resolutions. Use active voice and omit opinions." |
Notice the strong prompts define the audience, task boundary, output format, and constraints. Vague prompts invite vague, verbose, or off-target responses.
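The four elements can be kept separate in code and joined only when the request is built. The following is a minimal sketch: `build_messages` and its parameter names are hypothetical, not part of any SDK, and the output is the plain list-of-dicts message format used by the chat examples in this guide.

```python
def build_messages(role, task, context, output_spec):
    """Assemble chat messages from the four prompt elements.

    Hypothetical helper for illustration; not part of any SDK.
    """
    # Standing instructions (who the assistant is, how to answer) go in the system message.
    system_content = f"{role.strip()}\n{output_spec.strip()}"
    # The request itself, plus the material to work on, goes in the user message.
    user_content = f"{task.strip()}\n\n{context.strip()}"
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    role="You are a support analyst.",
    task="Summarize the following customer support transcript in 2-3 bullet points.",
    context="Customer: My order arrived damaged.\nAgent: A replacement ships today.",
    output_spec="Use active voice. Output only the bullet points.",
)
for message in messages:
    print(f"[{message['role']}]\n{message['content']}\n")
```

Keeping the elements as separate arguments makes it easy to vary one of them (for example, the output specification) while holding the others fixed during testing.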
System Prompts: Define the Assistant's Role
The system message is one of the most direct levers for controlling model behavior. Use it to define who the assistant is, what constraints apply, and how to handle edge cases:
```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            # Standing instructions: persona, tone, length limit, edge case.
            "role": "system",
            "content": """You are a Python code reviewer for junior developers.
Your role is to identify bugs, suggest improvements, and explain the 'why' behind each suggestion.
Be encouraging. Keep explanations concise (max 3 sentences per suggestion).
If the code is correct, say so explicitly.""",
        },
        {
            # The code under review, passed as the user message.
            "role": "user",
            "content": """
def calculate_average(numbers):
    total = 0
    for n in numbers:
        total += n
    return total / len(numbers)
""",
        },
    ],
)
print(response.choices[0].message.content)
```
This system prompt shapes the tone (encouraging), depth (concise), and focus (bugs and improvements) of the response. Without it, the model might produce an overly verbose analysis, adopt an unhelpful tone, or miss subtle issues such as the division by zero that occurs when the list is empty.
Techniques: Few-Shot Prompting and Examples
Models pick up patterns from examples in the prompt itself. Few-shot prompting, which means providing 2–5 labeled examples of the task, often improves output consistency, especially for fixed label formats. Here is a sentiment analysis example:
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a sentiment classifier. Classify user input as POSITIVE, NEGATIVE, or NEUTRAL.",
        },
        {
            # Three labeled examples establish the pattern; the last line is the real input.
            "role": "user",
            "content": """Example 1: "This coffee is amazing!" → POSITIVE
Example 2: "The service was slow and the food was cold." → NEGATIVE
Example 3: "The weather is cloudy." → NEUTRAL
Now classify this:
"I love the new feature but the UI is confusing."
""",
        },
    ],
)
print(response.choices[0].message.content)
```
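The model sees the pattern (input, arrow, label) and applies it to the new case. Few-shot examples mainly improve format and label consistency; how much they help accuracy depends on the task and the model, so measure rather than assume. The input here mixes praise and a complaint, so the three-label scheme forces a judgment call; add a MIXED label or a tie-break rule if such inputs matter to you.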
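Few-shot examples can also be supplied as earlier conversation turns instead of one long user message. The sketch below builds that layout; `build_few_shot_messages` is a hypothetical helper name, and whether this layout works better than the single-message layout depends on the model, so compare both on your own test set.

```python
def build_few_shot_messages(system_prompt, examples, new_input):
    """Turn (text, label) pairs into alternating user/assistant messages."""
    messages = [{"role": "system", "content": system_prompt}]
    for text, label in examples:
        # Each example becomes a user turn followed by the "correct" assistant turn.
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    # The unlabeled input goes last, so the next assistant turn is the prediction.
    messages.append({"role": "user", "content": new_input})
    return messages


examples = [
    ("This coffee is amazing!", "POSITIVE"),
    ("The service was slow and the food was cold.", "NEGATIVE"),
    ("The weather is cloudy.", "NEUTRAL"),
]
messages = build_few_shot_messages(
    "Classify user input as POSITIVE, NEGATIVE, or NEUTRAL. Output only the label.",
    examples,
    "I love the new feature but the UI is confusing.",
)
print([m["role"] for m in messages])
```

The resulting list can be passed as the `messages` argument of the chat call used elsewhere in this guide.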
Controlling Tone and Output Format
Explicitly specify how the model should present information. Vague tone requests ("write something nice") produce variable results; precise ones ("explain as if to a five-year-old, using animal analogies") yield consistent outputs:
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            # Audience, style, format, and length are each stated explicitly.
            "role": "system",
            "content": """You are a technical writer for senior engineers.
Your style: direct, precise, no hand-holding.
Your output format: structured as numbered steps or bullet points.
Your target: explain concepts in 2–4 sentences per point.""",
        },
        {
            "role": "user",
            "content": "How does async/await work in Python?",
        },
    ],
)
print(response.choices[0].message.content)
```
Specifying "numbered steps", "bullet points", or "JSON format" controls structure. Specifying the audience ("senior engineers" vs. "children") controls depth and jargon. Models generally follow such constraints when they are stated clearly, although numeric limits such as exact word or sentence counts tend to be met only approximately.
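When the requested format is JSON, the reply can be checked mechanically before anything downstream uses it. The following is a minimal sketch using only the standard library; the two required keys are a made-up schema for illustration, and `parse_structured_reply` is a hypothetical helper name.

```python
import json

# Hypothetical schema: the prompt is assumed to have asked for these two keys.
REQUIRED_KEYS = {"summary", "steps"}


def parse_structured_reply(reply_text):
    """Return the parsed dict if the reply matches the requested shape, else None."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        # The model added prose around the JSON or produced invalid JSON.
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return None
    if not isinstance(data["steps"], list):
        return None
    return data


good = '{"summary": "async/await schedules coroutines", "steps": ["define", "await"]}'
bad = "Sure! Here is the JSON you asked for: {...}"
print(parse_structured_reply(good) is not None)
print(parse_structured_reply(bad) is None)
```

A `None` result is a signal to retry the call or tighten the format instruction, for example by adding "Output only the JSON object."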
Chaining: Breaking Complex Tasks into Steps
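For complex tasks, two related techniques help. Chain-of-thought prompting asks the model to reason through the problem step by step before answering. Prompt chaining splits the work across several calls, feeding each call's output into the next. Both tend to improve accuracy in math, logic, and code generation; the 15–30% improvement sometimes cited depends heavily on the task and the model. The example below combines the two: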
```python
from openai import OpenAI

client = OpenAI()

# Step 1: Analyze the problem (chain-of-thought: ask for step-by-step reasoning).
# The prompt states the starting point and direction so the question is well-posed.
analysis = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": "A train departs a station at 10 AM traveling 60 mph. A second train leaves the same station at 11 AM, heading in the same direction at 80 mph. When does the second train catch up? Think through this step by step.",
        }
    ],
)
analysis_text = analysis.choices[0].message.content
print("Analysis:", analysis_text)

# Step 2: Solve and verify (prompt chaining: feed step 1's output into a new call).
solution = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": f"Based on this analysis:\n{analysis_text}\n\nProvide the final answer and verify it.",
        }
    ],
)
print("Solution:", solution.choices[0].message.content)
```
Breaking a complex request into smaller steps makes the model's reasoning explicit, which gives both the model and you a chance to catch errors before the final answer.
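A verify step is easier to trust when there is an independent answer to compare against. For the train question, the worked arithmetic below assumes both trains leave the same station and travel in the same direction; the variable names are illustrative.

```python
# Assumption: both trains leave the same station and travel in the same direction.
first_speed_mph = 60   # departs at 10 AM
second_speed_mph = 80  # departs at 11 AM
head_start_hours = 1

# Distance the first train covers before the second one departs.
head_start_miles = first_speed_mph * head_start_hours

# The gap closes at the difference between the two speeds.
closing_speed_mph = second_speed_mph - first_speed_mph
hours_after_second_departure = head_start_miles / closing_speed_mph

meeting_hour_24h = 11 + hours_after_second_departure
miles_from_station = second_speed_mph * hours_after_second_departure

print(f"Trains meet at {meeting_hour_24h:.0f}:00, {miles_from_station:.0f} miles from the station")
```

Under these assumptions the second train catches up at 2 PM, 240 miles out, which is the value a model's final answer should match.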
Avoiding Common Pitfalls
Negative instructions often backfire. "Do not be verbose" is weaker than "Keep your answer to 2 sentences" because it names no target. "Do not mention cost" can be over-applied, so the model avoids pricing even when the user asks about it directly. Prefer affirmative constraints: "Limit your response to 100 words" or "Focus on technical accuracy; skip marketing language."
Ambiguous pronouns ("It works well") leave the model guessing at the referent. Replace them with explicit nouns: "The async pattern works well." Expand acronyms on first mention: "REST (Representational State Transfer) is a standard for..." Long, convoluted instructions are also error-prone; break them into bullet points. Finally, test each prompt in several variations and measure success with concrete metrics (accuracy, format compliance, tone match) rather than subjective feel.
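Format compliance is the easiest of these metrics to automate. The sketch below checks a reply against a word limit and a sentence count; the function names are hypothetical, and the sentence counter is a rough heuristic that will miscount text containing abbreviations such as "e.g. this".

```python
import re


def word_count(text):
    return len(text.split())


def sentence_count(text):
    # Rough heuristic: count runs of ., !, or ? followed by whitespace or end of text.
    return len(re.findall(r"[.!?]+(?=\s|$)", text.strip()))


def check_format(text, max_words=None, sentences=None):
    """Return a list of violated constraints; an empty list means the reply complies."""
    failures = []
    if max_words is not None and word_count(text) > max_words:
        failures.append(f"{word_count(text)} words, limit is {max_words}")
    if sentences is not None and sentence_count(text) != sentences:
        failures.append(f"{sentence_count(text)} sentences, expected {sentences}")
    return failures


reply = "The async pattern works well. It avoids blocking the event loop."
print(check_format(reply, max_words=100, sentences=2))
print(check_format(reply, max_words=5))
```

Running a check like this over every test reply turns "format compliance" into a pass rate that can be compared across prompt variations.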
Prompt Testing and Iteration
Effective prompt engineering requires measurement. Create a small test set of expected inputs and outputs, then rate responses on accuracy, tone, and format:
```python
from openai import OpenAI

client = OpenAI()

# Small labeled test set: inputs paired with the label we expect back.
test_cases = [
    {"input": "I really enjoyed the book.", "expected_sentiment": "POSITIVE"},
    {"input": "The movie was boring.", "expected_sentiment": "NEGATIVE"},
]

system_prompt = "Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL. Output only the label."

correct = 0
for case in test_cases:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": case["input"]},
        ],
    )
    # Normalize whitespace and case so "positive\n" still matches "POSITIVE".
    actual = (response.choices[0].message.content or "").strip().upper()
    if actual == case["expected_sentiment"]:
        correct += 1

accuracy = correct / len(test_cases) * 100
print(f"Accuracy: {accuracy:.0f}%")
```
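Start by running 5–10 test cases through each candidate prompt as a smoke test, measure accuracy, and iterate. Small prompt changes can shift results substantially, so grow the test set before trusting small differences between prompts.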
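Comparing candidate prompts is simpler when the model call is passed in as a function, because the harness can then be exercised offline. In the sketch below, `accuracy` and `stub_classify` are hypothetical names, and the stub's behavior is invented purely so the code runs without an API key; its scores say nothing about real models.

```python
def accuracy(classify, prompt, test_cases):
    """Fraction of test cases where the normalized prediction matches the expected label."""
    correct = 0
    for case in test_cases:
        predicted = classify(prompt, case["input"]).strip().upper()
        if predicted == case["expected_sentiment"]:
            correct += 1
    return correct / len(test_cases)


def stub_classify(prompt, text):
    """Stand-in for a real model call; replace with a function that calls your model."""
    if "Output only the label" in prompt:
        return "negative" if "boring" in text else "positive"
    # Without a format instruction, the stub answers in a full sentence.
    return "The sentiment is positive."


test_cases = [
    {"input": "I really enjoyed the book.", "expected_sentiment": "POSITIVE"},
    {"input": "The movie was boring.", "expected_sentiment": "NEGATIVE"},
]
candidates = {
    "terse": "Classify sentiment as POSITIVE, NEGATIVE, or NEUTRAL. Output only the label.",
    "loose": "What is the sentiment of this text?",
}
results = {name: accuracy(stub_classify, prompt, test_cases) for name, prompt in candidates.items()}
print(results)
```

Swapping `stub_classify` for a function that wraps the real chat call leaves the rest of the harness unchanged.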
Key Takeaways
- Use system prompts to define the assistant's role, constraints, and output format.
- Provide 2–5 labeled examples (few-shot prompting) to establish patterns the model will follow.
- Be affirmative and specific: "Keep to 2 sentences" beats "Do not be verbose."
- Use chain-of-thought for complex reasoning: ask the model to think step by step, then synthesize.
- Test prompts on a small dataset and measure accuracy, tone, and format compliance.
- Replace vague instructions with precise constraints: audience, tone, word count, output structure.
Frequently Asked Questions
How many examples should I include in few-shot prompts?
Start with 2–3 examples; most tasks see diminishing returns beyond 5–7. More examples consume more tokens, increasing cost and latency. For very clear tasks, zero-shot (no examples) may suffice; for ambiguous ones, around 5 examples usually improve consistency.
Does prompt engineering work with all LLMs?
The techniques described here apply to GPT-4, GPT-4o, Claude, Gemini, and other frontier models. Smaller open-source models (around 7B parameters) tend to follow long or nuanced instructions less reliably than larger ones (70B+), so they may need shorter prompts and more examples. Test your prompts on the specific model you deploy.
Can I use prompts to prevent hallucination entirely?
No prompt eliminates hallucination completely, but several techniques reduce it: asking the model to cite sources (and checking that the sources exist), specifying "If you are unsure, say so," limiting domain scope, and validating outputs programmatically (e.g., checking that generated code compiles). Combining strategies (prompt plus output validation) is more reliable than any single approach.
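For generated Python, the "does it compile" check can be done with the standard library's `ast` module. This is a minimal sketch; `is_valid_python` is a hypothetical helper name, and it verifies syntax only, not that the code does what was asked.

```python
import ast


def is_valid_python(source):
    """Return True if the source parses as Python; this checks syntax only, not behavior."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


generated_ok = "def add(a, b):\n    return a + b\n"
generated_broken = "def add(a, b)\n    return a + b\n"  # missing colon
print(is_valid_python(generated_ok))
print(is_valid_python(generated_broken))
```

A failed check is a cheap trigger for a retry; checking behavior still requires running tests against the generated code.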
What is the difference between a system prompt and a user message?
The system prompt carries instructions and context for the entire conversation, while user messages carry the actual requests. Both count toward the model's context window and token usage like any other message. Use system prompts for standing instructions (persona, rules); use user messages for task-specific input. A conversation typically has one system message followed by many user/assistant message pairs.
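The layout described above (one system message, then alternating user and assistant turns) can be checked before a history is sent. The sketch below is a convention check for that layout, not a statement of what any particular API requires; `validate_history` is a hypothetical helper name.

```python
def validate_history(messages):
    """Return a list of problems found in a chat history; an empty list means none."""
    problems = []
    if not messages or messages[0]["role"] != "system":
        problems.append("first message should be the system prompt")
    if sum(1 for m in messages if m["role"] == "system") > 1:
        problems.append("more than one system message")
    # After the system message, turns are expected to alternate user, assistant, user, ...
    turns = [m["role"] for m in messages if m["role"] != "system"]
    for index, role in enumerate(turns):
        expected = "user" if index % 2 == 0 else "assistant"
        if role != expected:
            problems.append(f"turn {index} is '{role}', expected '{expected}'")
    return problems


history = [
    {"role": "system", "content": "You are a Python code reviewer for junior developers."},
    {"role": "user", "content": "Review this function."},
    {"role": "assistant", "content": "It divides by zero on an empty list."},
    {"role": "user", "content": "How should I fix that?"},
]
print(validate_history(history))
print(validate_history(history[1:]))
```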