
Using Ollama for Simple Local Inference

Ollama abstracts away the complexity of model loading and inference, letting you run pre-quantized models with a simple HTTP API. Instead of managing PyTorch, VRAM, and tokenizers manually, you run ollama run mistral in one terminal and query it from Python in another.

This tutorial shows how to install Ollama, use the official Python client library to make API calls, stream responses, manage conversation context, and batch requests. Ollama is ideal for rapid prototyping or deploying models on constrained hardware.

Installing Ollama and the Python Client

Step 1: Install Ollama from https://ollama.ai/download for macOS, Linux, or Windows.

Step 2: Start the Ollama server in a terminal:

ollama serve
# Output: Listening on 127.0.0.1:11434

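The server keeps running in that terminal and listens on localhost:11434. The desktop installers typically start it for you as a background service, in which case you can skip this step.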

Step 3: Install the Python client:

pip install ollama

Verify the connection:

from ollama import Client

client = Client(host='http://localhost:11434')
response = client.list()
print(response) # Lists all downloaded models

Running Your First Inference

Pull a model and run a prompt:

from ollama import Client

client = Client(host='http://localhost:11434')

# Pull a model (one-time, ~4 GB download)
client.pull('mistral')

# Run inference
response = client.generate(
    model='mistral',
    prompt='Explain machine learning in one sentence.',
    stream=False
)

print(response['response'])
# Output: Machine learning is a type of artificial intelligence ...

The stream=False parameter blocks until the full response is generated. For long responses, set stream=True to get real-time output.

Streaming Responses

Streaming is essential for interactive applications. Each chunk is yielded as it's generated:

from ollama import Client

client = Client(host='http://localhost:11434')

# Stream output in real-time
print("Streaming response:", end="")
for chunk in client.generate(
    model='mistral',
    prompt='Tell me a story about a robot.',
    stream=True
):
    print(chunk['response'], end="", flush=True)
print()

The flush=True parameter ensures each chunk displays immediately in the terminal. Without it, output buffers and feels slow. This pattern is ideal for chatbots or web UIs where users expect real-time feedback.
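Interactive code often needs the complete text as well as the live output, for example to store it in a conversation history. The sketch below shows one way to do both. The helper name collect_stream is hypothetical, and the stand-in chunks only mimic the 'response' key used above so that the example runs without a server.

```python
from typing import Iterable, Mapping


def collect_stream(chunks: Iterable[Mapping[str, str]], echo: bool = True) -> str:
    """Print each streamed chunk as it arrives and return the joined text."""
    parts = []
    for chunk in chunks:
        text = chunk['response']
        if echo:
            # flush=True pushes each piece to the terminal immediately
            print(text, end='', flush=True)
        parts.append(text)
    if echo:
        print()
    return ''.join(parts)


if __name__ == '__main__':
    # Stand-in chunks; with a server running, pass
    # client.generate(model='mistral', prompt='...', stream=True) instead.
    fake_chunks = [{'response': 'Once '}, {'response': 'upon '}, {'response': 'a time.'}]
    story = collect_stream(fake_chunks)
```

Keeping the joined string means the caller does not have to issue a second, non-streaming request to get the full reply.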

Managing Conversation Context

To build a chatbot, maintain a conversation history and pass the full context with each new message:

from ollama import Client

client = Client(host='http://localhost:11434')

# Conversation history
messages = [
    {
        'role': 'user',
        'content': 'What is Python?'
    }
]

# First turn
response = client.chat(model='mistral', messages=messages)
assistant_message = response['message']['content']
messages.append({
    'role': 'assistant',
    'content': assistant_message
})
print(f"Assistant: {assistant_message}")

# Second turn (model sees full history)
messages.append({
    'role': 'user',
    'content': 'Is it good for machine learning?'
})

response = client.chat(model='mistral', messages=messages)
assistant_message = response['message']['content']
messages.append({
    'role': 'assistant',
    'content': assistant_message
})
print(f"Assistant: {assistant_message}")

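The messages list acts as the full conversation context. Each call includes all prior turns so the model has continuity. The maximum context length is model-dependent (Mistral-7B: 8,192 tokens; Llama-2-7B: 4,096 tokens), and the window actually used is set by the num_ctx option, whose default in Ollama is lower than many models' maximum. When the history exceeds the context size, the earliest messages are truncated, so the model loses the start of the conversation.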
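Because the history grows with every turn, a long chat eventually needs trimming before it is sent. The following is a minimal sketch of one approach: keep the newest messages that fit a token budget. Both function names are hypothetical, and the four-characters-per-token estimate is a rough assumption rather than the model's real tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Keep the most recent messages whose estimated total fits in max_tokens.

    The newest message is always kept, even if it alone exceeds the budget.
    """
    kept = []
    total = 0
    for message in reversed(messages):
        cost = estimate_tokens(message['content'])
        if kept and total + cost > max_tokens:
            break
        kept.append(message)
        total += cost
    kept.reverse()  # restore chronological order
    return kept


if __name__ == '__main__':
    history = [
        {'role': 'user', 'content': 'a' * 40},
        {'role': 'assistant', 'content': 'b' * 40},
        {'role': 'user', 'content': 'c' * 40},
    ]
    # Each message is estimated at 10 tokens, so a budget of 20 keeps the last two.
    print(len(trim_history(history, max_tokens=20)))
```

Pass the trimmed list to client.chat() in place of the full history. A real application might also pin a system message at the front, which this sketch does not handle.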

Controlling Generation Parameters

Fine-tune inference behavior with temperature, top-p, and other parameters:

from ollama import Client

client = Client(host='http://localhost:11434')

response = client.generate(
    model='mistral',
    prompt='Write a creative story about space.',
    stream=False,
    options={
        'temperature': 0.9,   # Higher = more random (Ollama default: 0.8)
        'top_p': 0.95,        # Nucleus sampling
        'top_k': 40,          # Keep top 40 tokens by probability
        'num_predict': 200,   # Max output tokens
        'num_ctx': 2048,      # Context window size
    }
)

print(response['response'])

Parameter meanings:

  • temperature: Randomness (0 = effectively deterministic; higher values are more random, and values above 1 are allowed)
  • top_p: Cumulative probability threshold for nucleus sampling
  • top_k: Keep only the top K tokens by probability
  • num_predict: Maximum output tokens (similar to max_new_tokens in Transformers)
  • num_ctx: Context window size in tokens (larger values keep more history but use more memory)
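To make the interaction between temperature, top_k, and top_p concrete, here is a toy sketch that applies them to a hand-written set of logits. It illustrates the general technique only. It is not Ollama's internal sampler, the function names are hypothetical, and real implementations may apply the filters in a different order.

```python
import math
import random


def filtered_distribution(logits, temperature=0.8, top_k=40, top_p=0.9):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # Temperature rescales logits: below 1 sharpens, above 1 flattens.
    # (temperature must be > 0 here; 0 would mean "always pick the best token".)
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    norm = sum(exps.values())
    ranked = sorted(((tok, e / norm) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    # top_k: keep only the k most likely tokens.
    ranked = ranked[:top_k]
    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    # Renormalise the survivors so they sum to 1.
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def sample(logits, rng=random, **options):
    """Draw one token from the filtered distribution."""
    dist = filtered_distribution(logits, **options)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


if __name__ == '__main__':
    toy_logits = {'cat': 2.0, 'dog': 1.0, 'fish': 0.0}
    print(filtered_distribution(toy_logits, temperature=1.0, top_p=1.0))
    print(sample(toy_logits, top_k=1))  # top_k=1 always yields the most likely token
```

Lowering top_k or top_p shrinks the candidate set, while temperature changes how evenly probability is spread across whatever candidates remain.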

Working with Different Models

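Models in the Ollama library are distributed pre-quantized. The default tag is typically a 4-bit build, and other quantization levels (such as 5-bit or 8-bit) are selected by choosing a different tag rather than being picked automatically from available VRAM. List the models you have downloaded: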

from ollama import Client

client = Client(host='http://localhost:11434')

# List all downloaded models
models = client.list()
for model in models['models']:
    # Note: newer releases of the Python client expose this field as 'model' instead of 'name'
    print(f"{model['name']}: {model['size'] / 1e9:.1f} GB")

Pull and run different models:

# Lightweight models (run on CPU)
client.pull('phi')          # 2.7B parameters, very fast
client.pull('neural-chat')  # 4 GB, good quality

# Mid-range models (need 6 GB+ VRAM)
client.pull('mistral')      # 4.1 GB, balanced
client.pull('llama2')       # 3.8 GB, good accuracy

# An explicit tag selects a variant; ':latest' is the default tag
client.pull('neural-chat:latest')  # Same model as 'neural-chat' above

The sizes shown are for quantized weights. At full 16-bit precision the same weights take roughly 3–4× as much memory (about 14 GB for a 7B model).
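The size difference follows from simple arithmetic: weight storage is roughly the parameter count times the bits per weight, divided by 8 to get bytes. A small sketch, with a hypothetical function name, that ignores the KV cache and runtime overhead:

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB: parameters x bits per weight / 8."""
    return params_billions * bits_per_weight / 8


if __name__ == '__main__':
    for bits in (4, 8, 16):
        print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
    # 7B model at 4-bit: ~3.5 GB
    # 7B model at 8-bit: ~7.0 GB
    # 7B model at 16-bit: ~14.0 GB
```

Real quantized files are somewhat larger than the 4-bit estimate because quantization formats also store scaling data alongside the weights.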

Error Handling and Debugging

Handle timeouts and connection errors gracefully:

from ollama import Client
import time

client = Client(host='http://localhost:11434')

# Retry logic for flaky connections
max_retries = 3
for attempt in range(max_retries):
    try:
        response = client.generate(
            model='mistral',
            prompt='Quick test.',
            stream=False
        )
        print(response['response'])
        break
    except Exception as e:
        print(f"Attempt {attempt + 1} failed: {e}")
        if attempt < max_retries - 1:
            time.sleep(2)
        else:
            raise

# Check if model exists before running
# Note: newer releases of the Python client expose this field as 'model' instead of 'name'
models = client.list()
available = [m['name'].split(':')[0] for m in models['models']]
if 'mistral' in available:
    # Safe to run
    response = client.generate(model='mistral', prompt='Hi')
else:
    print("Mistral not downloaded. Run: ollama pull mistral")
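The retry loop above can be factored into a reusable helper. The sketch below is one possible shape, using exponential backoff instead of a fixed delay. The name call_with_retries and its parameters are hypothetical, and the sleep function is injectable so the logic can be exercised without real waiting.

```python
import time


def call_with_retries(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt, then try again.

    Re-raises the last exception once max_retries attempts have failed.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as exc:  # narrow this to the errors you expect
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed: {exc}; retrying in {delay:.0f}s")
            sleep(delay)


if __name__ == '__main__':
    attempts = {'count': 0}

    def flaky():
        # Stand-in for: client.generate(model='mistral', prompt='Quick test.')
        attempts['count'] += 1
        if attempts['count'] < 3:
            raise ConnectionError('server not ready')
        return 'ok'

    print(call_with_retries(flaky, sleep=lambda seconds: None))
```

With a real client, wrap the request in a lambda, for example call_with_retries(lambda: client.generate(model='mistral', prompt='Quick test.')).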

Ollama vs. Transformers: When to Use Each

Factor            | Ollama                          | Transformers
------------------+---------------------------------+------------------------------
Setup complexity  | Very simple (one install)       | Moderate (install PyTorch)
Customization     | Limited (pre-set quantization)  | Full (any precision, LoRA)
Speed (7B model)  | 40 tokens/sec (GPU)             | 50 tokens/sec (GPU)
VRAM (7B model)   | 4–6 GB (quantized)              | 14 GB (full precision)
Multi-GPU         | Automatic layer splitting only  | Fully supported
Fine-tuning       | Not supported                   | Fully supported
Best for          | Prototyping, single-user        | Production, customization

Use Ollama for quick experimentation and small deployments. Use Transformers when you need fine-tuning, multi-GPU, or exact control over inference.

Key Takeaways

  • Ollama simplifies local inference with pre-quantized models and a REST API.
  • The Python client connects to the Ollama server via Client(host='http://localhost:11434').
  • Stream responses with stream=True for real-time output in interactive applications.
  • Maintain conversation context by passing the full message history to each chat() call.
  • Adjust temperature, top_p, and num_predict to control generation randomness and length.

Frequently Asked Questions

What's the difference between generate() and chat()?

generate() takes a raw prompt string. chat() takes a list of messages with roles (user/assistant), providing context management. Use chat() for conversations, generate() for one-off completions.
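The same distinction shows up in the HTTP API that the Python client wraps: /api/generate expects a prompt field, while /api/chat expects a messages list. The sketch below builds both request bodies with the standard library. The helper names are hypothetical, and post_json only works when a server is actually running.

```python
import json
import urllib.request


def generate_payload(model: str, prompt: str) -> dict:
    """Body for POST /api/generate: a single raw prompt string."""
    return {'model': model, 'prompt': prompt, 'stream': False}


def chat_payload(model: str, messages: list[dict]) -> dict:
    """Body for POST /api/chat: a list of role-tagged messages."""
    return {'model': model, 'messages': messages, 'stream': False}


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON reply (needs a running server)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(request) as reply:
        return json.loads(reply.read())


if __name__ == '__main__':
    print(json.dumps(generate_payload('mistral', 'Hi'), indent=2))
    print(json.dumps(chat_payload('mistral', [{'role': 'user', 'content': 'Hi'}]), indent=2))
    # With a server running:
    # post_json('http://localhost:11434/api/generate', generate_payload('mistral', 'Hi'))
```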

Can I run multiple Ollama servers on different ports?

Yes. The server reads its bind address from the OLLAMA_HOST environment variable, so start each one with OLLAMA_HOST=127.0.0.1:11435 ollama serve (or 11436, etc.) and connect with Client(host='http://localhost:11435').
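One reason to run several servers is to spread a batch of prompts across them. The following sketch assigns prompts to hosts round-robin and runs them concurrently. The dispatch function and the send callback are hypothetical; in real use, send would wrap a Client(host=...) call, while here a stand-in keeps the example runnable without any server.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle


def dispatch(prompts, hosts, send):
    """Assign prompts to hosts round-robin and run the requests concurrently.

    send(host, prompt) performs one request and returns its result.
    Results come back in the same order as the prompts.
    """
    assignments = list(zip(cycle(hosts), prompts))
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        return list(pool.map(lambda pair: send(*pair), assignments))


if __name__ == '__main__':
    hosts = ['http://localhost:11434', 'http://localhost:11435']

    def fake_send(host, prompt):
        # Stand-in for:
        # Client(host=host).generate(model='mistral', prompt=prompt)['response']
        return f"{host} <- {prompt}"

    for line in dispatch(['first', 'second', 'third'], hosts, fake_send):
        print(line)
```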

How do I reduce Ollama's memory usage further?

Set num_ctx to a smaller value (e.g., 512 instead of 2048) to shrink the context buffer. Use a smaller model such as phi (Phi-2, 2.7B parameters). If a model does not fit in VRAM, Ollama can run some or all of its layers on the CPU from system RAM, which is slow but workable.

Can I use Ollama with a GPU other than NVIDIA?

Ollama supports NVIDIA GPUs natively, and Apple Metal is fully supported. AMD GPUs are supported through ROCm on a limited set of cards, so check the current compatibility list for your hardware. For other GPUs, fall back to CPU inference.

What happens if I exceed the context window?

The model stops generating or the earliest messages are truncated. Monitor prompt_eval_count (prompt tokens) and eval_count (generated tokens) in the response to track usage.
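Ollama responses report token counts in the prompt_eval_count and eval_count fields, which can be compared against the configured num_ctx. A minimal sketch, with a hypothetical helper name, that treats a missing or empty count as zero:

```python
def _count(response, key):
    """Read a token count, treating a missing or empty field as 0."""
    try:
        value = response[key]
    except KeyError:
        return 0
    return value or 0


def context_usage(response, num_ctx):
    """Return (tokens_used, fraction_of_window) for one response."""
    used = _count(response, 'prompt_eval_count') + _count(response, 'eval_count')
    return used, used / num_ctx


if __name__ == '__main__':
    # Stand-in for the object returned by client.chat() or client.generate()
    fake_response = {'prompt_eval_count': 1500, 'eval_count': 300}
    used, fraction = context_usage(fake_response, num_ctx=2048)
    print(f"{used} tokens used ({fraction:.0%} of the window)")
    if fraction > 0.8:
        print("Consider trimming older messages before the next turn.")
```

The 80% warning threshold is an arbitrary choice for illustration. Pick a margin that leaves room for the next reply.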

Further Reading