Chat Messages API: Complete Tutorial

The OpenAI Chat Completions API, built around a list of chat messages, is the foundation of most conversational applications that use OpenAI models. The API itself is stateless: unlike simple question-answer patterns, a chat application maintains conversation context by passing the entire message history with each request. This allows the model to see previous exchanges, build on prior statements, and keep a consistent persona across turns. Understanding how to structure messages, manage context window limits, and architect multi-turn workflows is essential for building production chatbots and conversational agents.

The Three Message Roles

Every message in a chat conversation has a role that defines who is speaking: user (human input), assistant (model response), or system (instruction for the model). The order and roles shape how the model interprets context:

from openai import OpenAI

client = OpenAI()

messages = [
    {
        "role": "system",
        "content": "You are a helpful coding assistant. Explain concepts clearly and provide working examples."
    },
    {
        "role": "user",
        "content": "What is a generator in Python?"
    },
    {
        "role": "assistant",
        "content": "A generator is a function that uses 'yield' instead of 'return'. It returns an iterator that produces values lazily, one at a time, conserving memory."
    },
    {
        "role": "user",
        "content": "Can you show me an example?"
    }
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)

print(response.choices[0].message.content)

The system message sets the assistant's behavior globally. User and assistant messages form a conversational transcript that the model reads in sequence. This transcript is the entire "memory" of the conversation; the model has no hidden state between API calls.
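Because the message list is the model's only context, it can help to sanity-check the list before sending it. The following is a minimal sketch, not part of the OpenAI SDK: the helper name validate_messages is hypothetical, it only knows the three roles described above, it assumes content is a plain string, and it treats "system message first" as this tutorial's convention rather than an API rule.

```python
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages(messages):
    """Return a list of problems found in a chat message list (empty if none).

    Hypothetical helper: checks only the conventions used in this tutorial.
    """
    problems = []
    for index, message in enumerate(messages):
        role = message.get("role")
        content = message.get("content")
        if role not in VALID_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index}: content must be a non-empty string")
        # Tutorial convention (an assumption here): the system message comes first.
        if role == "system" and index != 0:
            problems.append(f"message {index}: system message is not first")
    return problems


example = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "What is a generator in Python?"},
]
print(validate_messages(example))
```

Running a check like this before each request catches malformed history early, before it costs tokens or produces a confusing reply.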

Building and Maintaining Conversation History

In a real application, you accumulate messages as the user interacts with your bot. A simple chatbot loop collects user input, appends it to the history, sends all messages to the API, and appends the response:

from openai import OpenAI

client = OpenAI()

# Initialize conversation with system instruction
messages = [
    {
        "role": "system",
        "content": "You are a patient math tutor. Explain step by step and ask clarifying questions if needed."
    }
]

# Simulate a multi-turn conversation
user_inputs = [
    "What is 15 percent of 200?",
    "How would you calculate that?",
    "What if I wanted 25 percent instead?"
]

for user_input in user_inputs:
    # Add user message
    messages.append({"role": "user", "content": user_input})

    # Get response from API with full history
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )

    assistant_message = response.choices[0].message.content

    # Add assistant response to history
    messages.append({"role": "assistant", "content": assistant_message})

    print(f"User: {user_input}")
    print(f"Assistant: {assistant_message}\n")

Each API call includes the entire history. The model sees the progression of questions and uses prior context to inform its response. On the third turn, "I wanted 25 percent instead" is interpreted relative to the previous question about 15 percent, because the full conversation history is visible to the model.
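The append, send, append cycle can be wrapped in a small class so that application code never touches the list directly. This is one possible sketch, not an SDK feature: the Conversation class and the fake_send stand-in are hypothetical names, and the send callable is injected so the example runs without network access.

```python
class Conversation:
    """Keeps the message history and performs one append/send/append cycle per turn."""

    def __init__(self, system_prompt, send):
        # `send` is any callable that takes the message list and returns reply text.
        self.messages = [{"role": "system", "content": system_prompt}]
        self._send = send

    def ask(self, user_input):
        self.messages.append({"role": "user", "content": user_input})
        reply = self._send(self.messages)  # the full history goes out every time
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_send(messages):
    """Offline stand-in for the API call; reports how much history it received."""
    return f"(reply to message {len(messages) - 1})"


# With the real client, `send` could be something like (assumption, mirroring the loop above):
# send = lambda msgs: client.chat.completions.create(
#     model="gpt-4o-mini", messages=msgs
# ).choices[0].message.content

chat = Conversation("You are a patient math tutor.", fake_send)
print(chat.ask("What is 15 percent of 200?"))
print(chat.ask("What if I wanted 25 percent instead?"))
print(len(chat.messages), "messages in history")
```

Injecting the send function also makes the history logic easy to unit test, since no API key is needed.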

Managing Context Window Limits

Every model has a maximum context length: the number of tokens it can process in a single request, counting both the messages you send and the response it generates. The GPT-4o model supports 128,000 tokens as of 2026, and every message in your history consumes part of that budget. For long conversations, you must trim old messages to stay within the limit and manage costs.

A simple strategy is to keep a rolling window of the most recent N messages:

from openai import OpenAI

client = OpenAI()

MAX_HISTORY = 10  # Keep only the 10 most recent messages (excluding system)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Be concise."
    }
]

# Simulate a long conversation
for turn in range(20):
    user_input = f"Question {turn + 1}: Tell me a fact."
    messages.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )

    assistant_message = response.choices[0].message.content
    messages.append({"role": "assistant", "content": assistant_message})

    # Keep only the system message and the most recent MAX_HISTORY messages
    if len(messages) > MAX_HISTORY + 1:  # +1 for the system message
        messages = [messages[0]] + messages[-MAX_HISTORY:]

    print(f"Turn {turn + 1}: {len(messages)} messages in history")

MAX_HISTORY counts individual messages, so a window of 10 holds five user/assistant pairs. From turn 6 onward, the oldest pair is discarded after each exchange, keeping only the recent ones. This prevents runaway context and keeps API costs predictable. The tradeoff is that older context is lost; for applications requiring long-term memory, you would store conversation history in a database and retrieve relevant past messages explicitly (a technique called "retrieval-augmented generation," covered in Article 10).
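A fixed message count ignores how long each message is. A variation is to trim against a token budget instead. The sketch below is illustrative only: trim_to_budget and rough_count are hypothetical names, the list is assumed to start with the system message, and the whitespace-based counter is a crude stand-in for a real tokenizer such as tiktoken.

```python
def trim_to_budget(messages, max_tokens, count_tokens):
    """Drop the oldest non-system messages until the history fits max_tokens.

    Assumes messages[0] is the system message. Returns a new list and
    leaves the input list unchanged.
    """
    system, history = messages[:1], list(messages[1:])

    def total(msgs):
        return sum(count_tokens(m["content"]) for m in msgs)

    while history and total(system + history) > max_tokens:
        history.pop(0)  # drop the oldest message
        # If that leaves an assistant reply with no question before it, drop it too.
        if history and history[0]["role"] == "assistant":
            history.pop(0)
    return system + history


def rough_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


history = [
    {"role": "system", "content": "Be brief"},
    {"role": "user", "content": "one two three"},
    {"role": "assistant", "content": "four five"},
    {"role": "user", "content": "six seven eight nine"},
]
trimmed = trim_to_budget(history, max_tokens=7, count_tokens=rough_count)
print([m["content"] for m in trimmed])
```

In practice you would pass a counter based on the model's real tokenizer and leave headroom in the budget for the response.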

Token Counting: Know Your Costs

Usage is billed by token, not by request count. Understanding token usage prevents surprise bills. OpenAI's tiktoken library can estimate token counts before you send a request, and the usage field of each response reports the actual numbers afterward:

import tiktoken
from openai import OpenAI

client = OpenAI()

# Tokenizer encoding for the model used in the request below
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in 50 words."}
]

# Estimate tokens from message content only (chat formatting overhead is not counted)
total_tokens = 0
for message in messages:
    tokens = len(encoding.encode(message["content"]))
    total_tokens += tokens
    print(f"{message['role']}: {tokens} tokens")

print(f"Estimated content tokens: {total_tokens}")

# Send request and check actual usage
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)

print(f"Actual input tokens: {response.usage.prompt_tokens}")
print(f"Output tokens: {response.usage.completion_tokens}")
print(f"Total tokens: {response.usage.total_tokens}")

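Always check response.usage to monitor actual token consumption. The actual input count is typically a little higher than a content-only estimate, because the chat format adds a few tokens of overhead per message. For high-volume applications, tracking both numbers helps prevent overspending.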
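Once you have token counts, turning them into a cost estimate is simple arithmetic. The sketch below uses a hypothetical estimate_cost helper, and the two prices are placeholders chosen for the example, not real OpenAI rates; check the current pricing page for the model you use.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  input_price_per_million, output_price_per_million):
    """Convert token counts into a cost, given prices per one million tokens."""
    input_cost = prompt_tokens / 1_000_000 * input_price_per_million
    output_cost = completion_tokens / 1_000_000 * output_price_per_million
    return input_cost + output_cost


# Placeholder prices in USD per million tokens. These are NOT real rates.
EXAMPLE_INPUT_PRICE = 1.00
EXAMPLE_OUTPUT_PRICE = 4.00

# In a real application these counts would come from
# response.usage.prompt_tokens and response.usage.completion_tokens.
cost = estimate_cost(
    prompt_tokens=25_000,
    completion_tokens=5_000,
    input_price_per_million=EXAMPLE_INPUT_PRICE,
    output_price_per_million=EXAMPLE_OUTPUT_PRICE,
)
print(f"Estimated cost: ${cost:.4f}")
```

Input and output tokens are priced separately here because providers commonly charge different rates for each.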

Special Cases: Alternating Roles and Branching Conversations

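In some workflows, the assistant sends multiple consecutive messages (rare but valid), or you want to explore branches (e.g., "if the user had asked X instead"). The message list accommodates both: consecutive assistant messages are simply adjacent entries, and a branch is a copy of the list with a different message appended. The example here shows consecutive assistant messages: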

from openai import OpenAI

client = OpenAI()

# Example: Assistant sends two consecutive messages (e.g., summary then follow-up)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the French Revolution in two sentences."},
    {"role": "assistant", "content": "The French Revolution (1789–1799) was a period of radical social and political upheaval in France, marked by the overthrow of monarchy and feudalism. It introduced democracy, nationalism, and individual rights, reshaping European society."},
    {"role": "assistant", "content": "Key outcomes: the Declaration of the Rights of Man, abolition of privileges, and the Reign of Terror."},
    {"role": "user", "content": "What year did it start?"}
]

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)

print(response.choices[0].message.content)

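The API accepts consecutive assistant messages, and the model generally reads them as one continued answer. This pattern is useful when your application assembles a reply in several steps, such as a summary followed by a list of key points.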
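Branching follows from the same structure: since the message list is the whole conversation state, a branch is just a copy of the list with a different next message. The sketch below uses a hypothetical branch helper and makes a deep copy so that edits to one branch cannot leak into another.

```python
import copy


def branch(messages, user_content):
    """Return an independent copy of the history with a new user message appended."""
    forked = copy.deepcopy(messages)  # deep copy: the message dicts are not shared
    forked.append({"role": "user", "content": user_content})
    return forked


base = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize the French Revolution in two sentences."},
    {"role": "assistant", "content": "(summary from an earlier API call)"},
]

# Two alternative continuations of the same conversation.
branch_a = branch(base, "What year did it start?")
branch_b = branch(base, "Who were the key figures?")

print(len(base), len(branch_a), len(branch_b))
```

Each branch can then be sent to the API separately, and the original history stays available for further branches.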

Key Takeaways

  • Every message has a role (system, user, assistant) that defines who is speaking and how the model processes it.
  • Build conversation history by appending user input, calling the API with all prior messages, and appending the response.
  • The model has no hidden state; the entire message list is its only context.
  • Keep history size bounded using a rolling window to prevent token overflow and manage costs.
  • Use tiktoken to estimate tokens before requests and check response.usage to monitor actual consumption.
  • The system message applies to the whole conversation, but only because you resend it at the start of the message list with every request.

Frequently Asked Questions

Can I remove messages from the middle of the conversation?

Yes, you can delete or edit any message before sending the next request, but be cautious: the model sees the conversation as a continuous transcript. Removing the middle of a dialogue might create logical gaps. Deleting the oldest messages in a rolling window is usually safe; editing an earlier user message is riskier, because later assistant replies may then respond to something that is no longer in the transcript.

What happens if I exceed the context window limit?

The OpenAI API rejects the request with an HTTP 400 error whose error code is context_length_exceeded; no completion is generated. The solution is to truncate your message history or switch to a model with a larger context window. Moving from gpt-4o-mini to gpt-4o does not help here, because both have the same context length. Alternatively, use retrieval-augmented generation to fetch only relevant past messages instead of sending all of them.
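One way to recover from an oversized request is to drop the oldest messages and retry. The sketch below is deliberately SDK-free: ContextTooLong, send_with_trimming, and fake_send are hypothetical names used so the example runs offline. With the OpenAI Python SDK, the equivalent would likely be catching the SDK's bad-request exception and checking its error code, but treat that mapping as an assumption to verify against your SDK version.

```python
class ContextTooLong(Exception):
    """Hypothetical stand-in for the SDK error raised on context overflow."""


def send_with_trimming(messages, send, max_attempts=5):
    """Call send(messages); on overflow, drop the oldest non-system message and retry.

    Assumes messages[0] is the system message. Returns (reply, messages_sent).
    """
    attempt = list(messages)
    for _ in range(max_attempts):
        try:
            return send(attempt), attempt
        except ContextTooLong:
            if len(attempt) <= 2:
                raise  # only the system message and the latest message remain
            attempt = [attempt[0]] + attempt[2:]  # keep system, drop the oldest
    raise ContextTooLong("history still too long after trimming")


def fake_send(messages):
    """Offline stand-in for the API: pretends only 3 messages fit."""
    if len(messages) > 3:
        raise ContextTooLong("too many messages")
    return "ok"


history = [{"role": "system", "content": "Be concise."}]
for i in range(5):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i + 1}"})

reply, sent = send_with_trimming(history, fake_send)
print(reply, len(sent))
```

Trimming proactively with a token estimate is cheaper than retrying after a failure, so a handler like this works best as a fallback.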

Should I include the system message in every request?

Yes, always include the system message. It is not "remembered" between requests; you must resend it each time. It consumes tokens, but it is essential for consistency.

How do I handle multiple concurrent conversations?

Store each conversation's message list separately. In a web application, you might use a user ID as a key: conversations[user_id] = [messages...]. Each user gets their own list; requests are independent.
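A minimal in-memory version of this idea might look like the following. The names conversations, new_history, and add_user_message are illustrative, and a plain dictionary in process memory is an assumption that only suits a single-process demo; a production service would more likely keep histories in a database or cache.

```python
from collections import defaultdict

SYSTEM_PROMPT = "You are a helpful assistant."


def new_history():
    """Start every conversation with its own copy of the system message."""
    return [{"role": "system", "content": SYSTEM_PROMPT}]


# Maps user ID -> that user's message list; a missing key gets a fresh history.
conversations = defaultdict(new_history)


def add_user_message(user_id, text):
    """Append a user message to the right conversation and return its history."""
    history = conversations[user_id]
    history.append({"role": "user", "content": text})
    return history


add_user_message("alice", "What is a generator in Python?")
add_user_message("bob", "What is 15 percent of 200?")
add_user_message("alice", "Can you show me an example?")

for user_id, history in conversations.items():
    print(user_id, len(history))
```

The returned history is what you would pass as messages in that user's next API request.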

Can I use assistant messages without calling the API?

Yes. You can manually craft assistant messages to seed the conversation with examples, corrections, or hypothetical responses. This is useful for few-shot prompting or simulation.
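For few-shot prompting, the hand-written assistant messages act as worked examples placed before the real question. The sketch below shows one way to assemble such a list; build_few_shot_messages is a hypothetical helper and the sentiment-labeling examples are made up for illustration.

```python
def build_few_shot_messages(system_prompt, examples, user_input):
    """Build a message list where each (input, output) pair becomes a user/assistant turn."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": example_input})
        # Hand-written reply: the model never produced this, it only reads it.
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_few_shot_messages(
    system_prompt="Classify the sentiment of each review as positive or negative.",
    examples=[
        ("The battery lasts all day.", "positive"),
        ("It broke after a week.", "negative"),
    ],
    user_input="Setup was quick and painless.",
)

for message in messages:
    print(f"{message['role']}: {message['content']}")
```

The resulting list can be passed as messages to the API like any other history; the seeded turns count toward input tokens on every request.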

Further Reading