
Serving Python LLMs via REST API

Serving a local LLM as a REST API lets multiple clients query the model over HTTP. FastAPI is a popular choice for this: it's fast, handles async I/O, supports streaming responses, and auto-generates API documentation.

This tutorial covers building a minimal LLM API, streaming responses for long outputs, managing concurrent requests, containerizing with Docker, and error handling. By the end, you'll deploy a local LLM accessible to any client via HTTP.

Building a Minimal FastAPI LLM Service

Install FastAPI, Uvicorn, PyTorch, and Transformers:

pip install fastapi uvicorn torch transformers

Create main.py:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import logging

logging.basicConfig(level=logging.INFO)

app = FastAPI(title="Local LLM API")

# Load model once at startup
model = None
tokenizer = None

@app.on_event("startup")
def load_model():
    global model, tokenizer
    logging.info("Loading model...")
    model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    model = model.to("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    logging.info("Model loaded")

# Request schema
class GenerateRequest(BaseModel):
    prompt: str
    max_length: int = 100  # total length in tokens: prompt plus generated tokens
    temperature: float = 0.7
    top_p: float = 0.95

# Response schema
class GenerateResponse(BaseModel):
    prompt: str
    response: str
    tokens_generated: int

@app.post("/generate", response_model=GenerateResponse)
def generate(request: GenerateRequest):
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")

    try:
        inputs = tokenizer(request.prompt, return_tensors="pt").to(model.device)

        with torch.no_grad():
            output_ids = model.generate(
                **inputs,
                max_length=request.max_length,
                temperature=request.temperature,
                top_p=request.top_p,
                do_sample=True
            )

        # output_ids holds the prompt tokens followed by the generated tokens,
        # so the decoded text begins with the prompt
        response_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        tokens_generated = output_ids.shape[1] - inputs.input_ids.shape[1]

        return GenerateResponse(
            prompt=request.prompt,
            response=response_text,
            tokens_generated=tokens_generated
        )
    except Exception as e:
        logging.error(f"Generation failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/health")
def health():
    return {"status": "ok", "model_loaded": model is not None}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Run the server:

python main.py
# Output: Uvicorn running on http://0.0.0.0:8000

Test the API:

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is Python?", "max_length": 100}'

Or from Python:

import requests

response = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "What is Python?", "max_length": 100}
)
print(response.json()["response"])

Streaming Responses for Long Outputs

For long generations, stream response chunks to avoid timeouts:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread
import torch
import json

app = FastAPI()
model = None
tokenizer = None

@app.on_event("startup")
def load_model():
    global model, tokenizer
    model_name = "mistral-community/Mistral-7B-Instruct-v0.3"
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    model = model.cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name)

def generate_stream(prompt: str, max_tokens: int = 100):
    """Generator that yields text chunks as they are decoded"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # model.generate returns only when generation is finished, so iterating over
    # its result does not stream. TextIteratorStreamer exposes decoded text
    # while generate runs in a background thread.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=max_tokens,
        do_sample=True,
        temperature=0.7
    )
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()

    for text in streamer:
        # Stream as JSON lines (newline-delimited JSON)
        yield json.dumps({"token": text}) + "\n"

    thread.join()

@app.post("/generate-stream")
async def generate_stream_endpoint(prompt: str):
    """Stream text to the client as it is generated"""
    return StreamingResponse(
        generate_stream(prompt),
        media_type="application/x-ndjson"  # Newline-delimited JSON
    )

Client receives tokens in real-time:

import requests
import json

response = requests.post(
    "http://localhost:8000/generate-stream?prompt=What+is+Python%3F",
    stream=True
)

for line in response.iter_lines():
    if line:
        token = json.loads(line)["token"]
        print(token, end="", flush=True)

Streaming matters for UX: without it, users can wait 10–30 seconds for a long response to finish before they see anything. With streaming, text appears as it is generated.

Managing Concurrent Requests

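FastAPI does not queue requests for you: synchronous (def) endpoints run in a thread pool, so several requests can call model.generate at the same time and compete for the same memory. To run one generation at a time, serialize inference with a lock: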

from fastapi import FastAPI
from threading import Lock
import logging
import torch

app = FastAPI()
model = None      # loaded at startup, as in main.py
tokenizer = None
inference_lock = Lock()  # Serialize inference

@app.post("/generate")
def generate(prompt: str):
    """Only one request runs inference at a time"""
    with inference_lock:
        logging.info(f"Generating for: {prompt}")
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_length=100)
        response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"response": response}

With a single model copy on one GPU, the lock serializes requests, so they run one after another. Running requests in parallel requires either enough VRAM for multiple model copies or batching several prompts into one generate call. For production, use a queue (Redis) + worker processes.
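The queue-plus-worker idea can be sketched in a single process before bringing in Redis. The following is a minimal sketch, not a production design: it uses asyncio.Queue as a stand-in for an external queue, and fake_inference is a hypothetical placeholder for the blocking model.generate call.

```python
import asyncio


def fake_inference(prompt: str) -> str:
    """Hypothetical stand-in for the blocking model.generate call."""
    return prompt.upper()


async def worker(queue: asyncio.Queue) -> None:
    """Pull jobs off the queue one at a time, so inference never overlaps."""
    while True:
        prompt, future = await queue.get()
        try:
            # Run the blocking call in a thread so the event loop stays responsive.
            result = await asyncio.to_thread(fake_inference, prompt)
        except Exception as exc:
            if not future.done():
                future.set_exception(exc)
        else:
            if not future.done():
                future.set_result(result)
        finally:
            queue.task_done()


async def submit(queue: asyncio.Queue, prompt: str) -> str:
    """Enqueue a prompt and wait until the worker resolves its future."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((prompt, future))
    return await future


async def demo() -> list:
    queue = asyncio.Queue(maxsize=16)  # bounded, so producers wait when it is full
    task = asyncio.create_task(worker(queue))
    results = await asyncio.gather(*(submit(queue, p) for p in ["a", "b", "c"]))
    task.cancel()
    return results
```

Calling asyncio.run(demo()) processes the three prompts one after another. A bounded queue gives natural back-pressure: when it is full, new submissions wait instead of piling up in memory.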

Containerizing with Docker

Create Dockerfile:

FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

WORKDIR /app

# Install Python and dependencies
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy app
COPY main.py .

# Expose port
EXPOSE 8000

# Run
CMD ["python3", "main.py"]

Create requirements.txt:

fastapi==0.104.1
uvicorn==0.24.0
torch==2.1.0
transformers==4.35.0

Build and run:

docker build -t llm-api .
docker run --gpus all -p 8000:8000 llm-api

The --gpus all flag exposes GPUs to the container; it requires the NVIDIA Container Toolkit on the host. Without the flag, main.py falls back to CPU inference.

Error Handling and Timeouts

Handle common errors gracefully:

from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse
import asyncio
import logging
import torch

app = FastAPI()
model = None      # loaded at startup, as in main.py
tokenizer = None

class ModelNotLoadedError(Exception):
    pass

@app.exception_handler(ModelNotLoadedError)
async def model_not_loaded_handler(request: Request, exc: ModelNotLoadedError):
    return JSONResponse(
        status_code=503,
        content={"error": "Model not loaded", "detail": str(exc)},
    )

@app.post("/generate")
async def generate(prompt: str):
    if model is None:
        raise ModelNotLoadedError("Model initialization failed")

    try:
        # Timeout after 30 seconds
        result = await asyncio.wait_for(
            run_inference(prompt),
            timeout=30.0
        )
        return result
    except asyncio.TimeoutError:
        logging.error(f"Inference timeout for: {prompt}")
        raise HTTPException(status_code=504, detail="Generation timeout")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")
        raise HTTPException(status_code=500, detail="Internal server error")

async def run_inference(prompt: str):
    """Run inference in the default thread pool to avoid blocking the event loop"""
    loop = asyncio.get_running_loop()
    # Use the loop's default executor. A per-request ThreadPoolExecutor inside a
    # `with` block waits for its thread on exit, which would defeat the timeout.
    # The timeout only stops the wait; the generation thread still runs to completion.
    return await loop.run_in_executor(None, _generate, prompt)

def _generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=100)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
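One caveat about this timeout is easier to see in isolation: asyncio.wait_for stops waiting, but it cannot interrupt a thread that is already running. The sketch below illustrates this with a hypothetical slow_generate function that sleeps in place of model.generate; the finished event is only there to make the behaviour observable.

```python
import asyncio
import threading
import time

finished = threading.Event()  # set when the background work actually completes


def slow_generate(seconds: float) -> str:
    """Hypothetical stand-in for a long model.generate call."""
    time.sleep(seconds)
    finished.set()
    return "done"


async def generate_with_timeout(seconds: float, timeout: float) -> str:
    """Return the result, or "timeout" if the work takes longer than `timeout`."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_generate, seconds), timeout=timeout
        )
    except asyncio.TimeoutError:
        # The caller gets control back here, but the thread keeps running.
        return "timeout"
```

In a server this means a timed-out request still occupies the GPU until its generation ends, so a timeout protects clients from waiting but does not free capacity by itself.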

Performance Monitoring

Monitor API performance with logging:

from fastapi import FastAPI, Request
from time import perf_counter
import logging

logging.basicConfig(level=logging.INFO)

app = FastAPI()

@app.middleware("http")
async def log_request_duration(request: Request, call_next):
    start_time = perf_counter()
    response = await call_next(request)
    duration = perf_counter() - start_time
    logging.info(f"{request.method} {request.url.path} {duration:.2f}s")
    return response
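Individual log lines are hard to read at a glance, so it can help to summarize recent durations as percentiles. The following is a minimal sketch using only the standard library; LatencyTracker and its window size are hypothetical names chosen for illustration, not part of FastAPI.

```python
import statistics
from collections import deque


class LatencyTracker:
    """Keep the most recent request durations and summarize them."""

    def __init__(self, window: int = 1000) -> None:
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def record(self, seconds: float) -> None:
        self.samples.append(seconds)

    def summary(self) -> dict:
        if len(self.samples) < 2:
            return {"count": len(self.samples)}
        # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
        cuts = statistics.quantiles(self.samples, n=100, method="inclusive")
        return {"count": len(self.samples), "p50": cuts[49], "p95": cuts[94]}
```

A middleware could call record(duration) on each request and expose summary() from a metrics endpoint. The p95 value is usually more informative than the mean, because a few very long generations skew an average.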

Deployment Checklist

  • Load model on startup, not per-request
  • Use batch inference for throughput
  • Stream long responses to clients
  • Set request timeouts (30–60 seconds)
  • Log all errors and request latencies
  • Monitor GPU memory (nvidia-smi)
  • Use Docker for reproducible deployments
  • Set up health checks (/health) for monitoring
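For the GPU memory item, nvidia-smi can emit machine-readable output that a script can check. The sketch below assumes the nvidia-smi query flags shown in QUERY and a driver on the host; the function names and dictionary keys are hypothetical.

```python
import subprocess

# Ask nvidia-smi for used and total memory in MiB, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list:
    """Parse 'used, total' lines into one dict per GPU."""
    gpus = []
    for index, line in enumerate(csv_text.strip().splitlines()):
        used, total = (int(field.strip()) for field in line.split(","))
        gpus.append(
            {"gpu": index, "used_mib": used, "total_mib": total, "used_fraction": used / total}
        )
    return gpus


def read_gpu_memory() -> list:
    """Run nvidia-smi and parse its output; fails if no NVIDIA driver is present."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)
```

Keeping the parser separate from the subprocess call makes it possible to test the parsing on a machine without a GPU.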

Key Takeaways

  • Build LLM APIs with FastAPI for async HTTP handling and auto-documentation.
  • Load the model once at startup, not for each request.
  • Stream responses for long outputs to improve UX.
  • Serialize inference with locks on single-GPU systems.
  • Containerize with Docker for production deployment.
  • Monitor latency and errors; set reasonable timeouts.

Frequently Asked Questions

How many concurrent requests can my API handle?

Concurrency is limited by VRAM and inference latency. With one GPU and 3-second latency per request, you handle one request every 3 seconds, or about 20 per minute. Batching requests (multiple prompts in one generate call) can increase throughput by roughly 2–3×.
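The arithmetic behind these figures can be written as a small helper. This is a back-of-the-envelope sketch: the batch size of 4 and the 4.5-second batch latency in the usage note are hypothetical numbers, not measurements.

```python
def requests_per_minute(latency_s: float, batch_size: int = 1) -> float:
    """Throughput of one serialized worker that finishes `batch_size`
    requests every `latency_s` seconds."""
    return 60.0 / latency_s * batch_size


single = requests_per_minute(3.0)       # one request every 3 seconds
batched = requests_per_minute(4.5, 4)   # assumed: a batch of 4 takes 4.5 seconds
speedup = batched / single
```

Here single is 20 requests per minute, and the assumed batch gives a speedup between 2× and 3×. The gain depends on how much a batch slows down each generate call, so measure it on your own hardware.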

Should I use FastAPI or Flask for LLMs?

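FastAPI is usually the better fit: it supports async handlers and streaming natively and validates requests automatically. Flask works but needs more manual setup for both, and in either framework model inference dominates latency. Use FastAPI for production.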

How do I scale to multiple users?

  1. Single-GPU: Use a queue (Celery + Redis) to serialize inference.
  2. Multi-GPU: Run one model replica per GPU (one worker process per device), or use device_map="auto" to shard a model that does not fit on one GPU across several.
  3. Multiple servers: Run separate API containers on each machine, load-balance with nginx.

Can I fine-tune models through the API?

Yes, but fine-tuning is resource-intensive. Add a /fine-tune endpoint that runs training asynchronously and returns a job ID. Training happens in a background worker; users poll for completion.
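The job-ID-and-polling pattern can be sketched without any web framework. The following is a minimal in-memory sketch: JobStore is a hypothetical helper, the training function is whatever callable you pass in, and a real deployment would keep job state in a database or queue backend rather than in process memory.

```python
import threading
import uuid


class JobStore:
    """In-memory job registry; state is lost when the process restarts."""

    def __init__(self) -> None:
        self._jobs = {}
        self._threads = {}
        self._lock = threading.Lock()

    def submit(self, fn, *args) -> str:
        """Start fn(*args) in a background thread and return a job ID immediately."""
        job_id = uuid.uuid4().hex
        thread = threading.Thread(target=self._run, args=(job_id, fn, args), daemon=True)
        with self._lock:
            self._jobs[job_id] = {"status": "running", "result": None}
            self._threads[job_id] = thread
        thread.start()
        return job_id

    def _run(self, job_id, fn, args) -> None:
        try:
            outcome = {"status": "done", "result": fn(*args)}
        except Exception as exc:
            outcome = {"status": "failed", "result": str(exc)}
        with self._lock:
            self._jobs[job_id] = outcome

    def status(self, job_id: str) -> dict:
        """What a polling endpoint would return for this job ID."""
        with self._lock:
            return dict(self._jobs.get(job_id, {"status": "unknown", "result": None}))

    def wait(self, job_id: str, timeout: float = None) -> None:
        """Block until the job's thread exits (handy in tests)."""
        with self._lock:
            thread = self._threads.get(job_id)
        if thread is not None:
            thread.join(timeout)
```

A /fine-tune endpoint would call submit and return the job ID, and a status endpoint would return status(job_id) for clients to poll.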

How do I add authentication to the API?

Use FastAPI's HTTPBearer or API keys:

from fastapi import Depends, HTTPException
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

@app.post("/generate")
def generate(request: GenerateRequest, credentials: HTTPAuthorizationCredentials = Depends(security)):
    if credentials.credentials != "your-secret-key":
        raise HTTPException(status_code=401, detail="Unauthorized")
    # ... continue with inference
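Hard-coding the key and comparing it with != is fine for a demo, but a deployed service would typically read the key from the environment and compare it in constant time. The sketch below shows one way to do that with the standard library; the LLM_API_KEY variable name and the is_authorized helper are assumptions made for illustration.

```python
import hmac
import os
from typing import Optional


def is_authorized(presented_key: str, expected_key: Optional[str] = None) -> bool:
    """Check a presented key against the configured one in constant time."""
    if expected_key is None:
        # Hypothetical environment variable holding the real key.
        expected_key = os.environ.get("LLM_API_KEY", "")
    if not expected_key:
        return False  # refuse every request when no key is configured
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(presented_key.encode(), expected_key.encode())
```

Inside the endpoint, the check would become `if not is_authorized(credentials.credentials): raise HTTPException(status_code=401, detail="Unauthorized")`.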

Further Reading