Using Ollama for Simple Local Inference
Ollama abstracts away the complexity of model loading and inference, letting you run pre-quantized models with a simple HTTP API. Instead of managing PyTorch, VRAM, and tokenizers manually, you run ollama run mistral in one terminal and query it from Python in another.
This tutorial shows how to install Ollama, use the official Python client library to make API calls, stream responses, manage conversation context, and batch requests. Ollama is ideal for rapid prototyping or deploying models on constrained hardware.
Installing Ollama and the Python Client
Step 1: Install Ollama from https://ollama.ai/download for macOS, Linux, or Windows.
Step 2: Start the Ollama server in a terminal:
ollama serve
# Output: Listening on 127.0.0.1:11434
The server runs as a background process, listening on localhost:11434.
Step 3: Install the Python client:
pip install ollama
Verify the connection:
from ollama import Client
client = Client(host='http://localhost:11434')
response = client.list()
print(response) # Lists all downloaded models
Running Your First Inference
Pull a model and run a prompt:
from ollama import Client
client = Client(host='http://localhost:11434')
# Pull a model (one-time, ~4 GB download)
client.pull('mistral')
# Run inference
response = client.generate(
model='mistral',
prompt='Explain machine learning in one sentence.',
stream=False
)
print(response['response'])
# Output: Machine learning is a type of artificial intelligence ...
The stream=False parameter blocks until the full response is generated. For long responses, set stream=True to get real-time output.
Streaming Responses
Streaming is essential for interactive applications. Each chunk is yielded as it's generated:
from ollama import Client
client = Client(host='http://localhost:11434')
# Stream output in real-time
print("Streaming response:", end="")
for chunk in client.generate(
model='mistral',
prompt='Tell me a story about a robot.',
stream=True
):
print(chunk['response'], end="", flush=True)
print()
The flush=True parameter ensures each chunk displays immediately in the terminal. Without it, output buffers and feels slow. This pattern is ideal for chatbots or web UIs where users expect real-time feedback.
Managing Conversation Context
To build a chatbot, maintain a conversation history and pass the full context with each new message:
from ollama import Client
client = Client(host='http://localhost:11434')
# Conversation history
messages = [
{
'role': 'user',
'content': 'What is Python?'
}
]
# First turn
response = client.chat(model='mistral', messages=messages)
assistant_message = response['message']['content']
messages.append({
'role': 'assistant',
'content': assistant_message
})
print(f"Assistant: {assistant_message}")
# Second turn (model sees full history)
messages.append({
'role': 'user',
'content': 'Is it good for machine learning?'
})
response = client.chat(model='mistral', messages=messages)
assistant_message = response['message']['content']
messages.append({
'role': 'assistant',
'content': assistant_message
})
print(f"Assistant: {assistant_message}")
The messages list acts as the full conversation context. Each call includes all prior turns so the model has continuity. Context length is model-dependent (Mistral-7B: 8,192 tokens; Llama-2-7B: 4,096 tokens). Exceeding context size truncates earlier messages.
Controlling Generation Parameters
Fine-tune inference behavior with temperature, top-p, and other parameters:
from ollama import Client
client = Client(host='http://localhost:11434')
response = client.generate(
model='mistral',
prompt='Write a creative story about space.',
stream=False,
options={
'temperature': 0.9, # High randomness (0–1, default 0.7)
'top_p': 0.95, # Nucleus sampling
'top_k': 40, # Keep top 40 tokens by probability
'num_predict': 200, # Max output tokens
'num_ctx': 2048, # Context window size
}
)
print(response['response'])
Parameter meanings:
temperature— Randomness (0 = deterministic, 1 = very random)top_p— Cumulative probability threshold for nucleus samplingtop_k— Keep only top K tokens by probabilitynum_predict— Maximum output tokens (similar tomax_length)num_ctx— Context size (use model's maximum for best results)
Working with Different Models
Ollama automatically quantizes models to 4-bit, 5-bit, or 8-bit depending on VRAM availability. List available models:
from ollama import Client
client = Client(host='http://localhost:11434')
# List all downloaded models
models = client.list()
for model in models['models']:
print(f"{model['name']} — {model['size'] / 1e9:.1f} GB")
Pull and run different models:
# Lightweight models (runs on CPU)
client.pull('phi') # 2.7 GB, very fast
client.pull('neural-chat') # 4 GB, good quality
# Mid-range models (needs 6 GB+ VRAM)
client.pull('mistral') # 4.1 GB, balanced
client.pull('llama2') # 3.8 GB, good accuracy
# Larger models (needs 24+ GB VRAM)
client.pull('neural-chat:latest') # Explicitly pull latest version
Model sizes are compressed (quantized); uncompressed they're 2–3 larger in VRAM.
Error Handling and Debugging
Handle timeouts and connection errors gracefully:
from ollama import Client
import time
client = Client(host='http://localhost:11434')
# Retry logic for flaky connections
max_retries = 3
for attempt in range(max_retries):
try:
response = client.generate(
model='mistral',
prompt='Quick test.',
stream=False
)
print(response['response'])
break
except Exception as e:
print(f"Attempt {attempt + 1} failed: {e}")
if attempt < max_retries - 1:
time.sleep(2)
else:
raise
# Check if model exists before running
models = client.list()
available = [m['name'].split(':')[0] for m in models['models']]
if 'mistral' in available:
# Safe to run
response = client.generate(model='mistral', prompt='Hi')
else:
print("Mistral not downloaded. Run: ollama pull mistral")
Ollama vs. Transformers: When to Use Each
| Factor | Ollama | Transformers |
|---|---|---|
| Setup complexity | Very simple (one install) | Moderate (install PyTorch) |
| Customization | Limited (pre-set quantization) | Full (any precision, LoRA) |
| Speed (7B model) | 40 tokens/sec (GPU) | 50 tokens/sec (GPU) |
| VRAM (7B model) | 4–6 GB (quantized) | 14 GB (full precision) |
| Multi-GPU | Not supported | Fully supported |
| Fine-tuning | Not supported | Fully supported |
| Best for | Prototyping, single-user | Production, customization |
Use Ollama for quick experimentation and small deployments. Use Transformers when you need fine-tuning, multi-GPU, or exact control over inference.
Key Takeaways
- Ollama simplifies local inference with pre-quantized models and a REST API.
- The Python client connects to the Ollama server via
Client(host='http://localhost:11434'). - Stream responses with
stream=Truefor real-time output in interactive applications. - Maintain conversation context by passing the full message history to each
chat()call. - Adjust temperature, top_p, and num_predict to control generation randomness and length.
Frequently Asked Questions
What's the difference between generate() and chat()?
generate() takes a raw prompt string. chat() takes a list of messages with roles (user/assistant), providing context management. Use chat() for conversations, generate() for one-off completions.
Can I run multiple Ollama servers on different ports?
Yes. Start each server with OLLAMA_PORT=11435 ollama serve (or 11436, etc.) and connect with Client(host='http://localhost:11435').
How do I reduce Ollama's memory usage further?
Set num_ctx to a smaller value (e.g., 512 instead of 2048) to reduce context buffer size. Use smaller models like Phi-3 (2.7 GB). Run on CPU, which spills to disk if VRAM is exceeded (slow but possible).
Can I use Ollama with a GPU other than NVIDIA?
Ollama supports NVIDIA GPUs natively. AMD GPU support is experimental; Apple Metal is fully supported. For other GPUs, fall back to CPU inference.
What happens if I exceed the context window?
The model stops generating or truncates earlier messages. Monitor tokens_evaluated in the response to track usage.
Further Reading
- Ollama GitHub Repository — Source code and issues.
- Ollama Models Library — Browse 50+ quantized models.
- Ollama Python Client Documentation — Full API reference.
- Quantization Paper: GGML — Technical details of Ollama's underlying format.