Skip to main content

API Pagination: Efficient Large Dataset Handling

Pagination breaks large result sets into manageable chunks, enabling clients to fetch data incrementally. However, different pagination approaches have vastly different performance characteristics. Offset-limit (the simplest, where you SELECT * FROM users OFFSET 1000 LIMIT 50) degrades catastrophically at scale—retrieving page 10,000 requires the database to scan and discard 500,000 rows. Cursor-based pagination uses a pointer to the last seen row and fetches "next N rows after this pointer," scaling linearly regardless of result set size. This article covers three pagination strategies and shows how to choose based on your data size and access patterns.

I optimized a user list endpoint at a SaaS platform that was timing out when customers scrolled past page 100. Offset-limit queries with large offsets were taking 30 seconds. Switching to cursor-based pagination reduced latency to 200 milliseconds at any page. Users no longer hit timeouts, and our database CPU dropped by 40%.

Pagination Strategies: Comparison and Tradeoffs

StrategyProsConsBest For
Offset-LimitSimple, familiar, statelessPerformance cliff past page 100, backward navigation hardSmall datasets <100k rows, admin interfaces
Cursor-BasedScales linearly, consistent ordering, real-time data safeStateless only forward, requires unique ordering, harder to implementMost production APIs, mobile apps, feeds
KeysetFastest, indexes well, handles real-time dataComplex, requires unique non-null keysHigh-scale distributed systems, real-time feeds

Offset-Limit Pagination: Simple but Slow at Scale

The simplest pagination: skip N rows, return M rows. Works until N gets large.

from fastapi import FastAPI, Query
from sqlalchemy import Column, Integer, String, DateTime, func
from sqlalchemy.orm import Session

app = FastAPI()

@app.get('/users')
def list_users(
skip: int = Query(0, ge=0),
limit: int = Query(50, ge=1, le=100),
db: Session = Depends(get_db)
):
"""List users with offset-limit pagination."""

total = db.query(func.count(User.id)).scalar()

users = db.query(User).offset(skip).limit(limit).all()

return {
'items': [u.to_dict() for u in users],
'total': total,
'page': skip // limit + 1,
'pages': (total + limit - 1) // limit
}

# Client usage:
# Page 1: /users?skip=0&limit=50
# Page 2: /users?skip=50&limit=50
# Page 100: /users?skip=4950&limit=50 <- SLOW! Database scans 5000 rows

The problem: OFFSET 5000 LIMIT 50 tells the database to scan and discard 5,000 rows, then return 50. With millions of rows, this is prohibitively slow.

When is offset-limit acceptable?

  • Datasets <100k rows
  • Admin interfaces (users never browse past page 10)
  • When you don't care about performance at high offsets

Cursor-Based Pagination: The Production Standard

Cursor-based pagination uses the last seen row's ID (or another unique, sortable field) as the starting point. "Give me the next 50 rows after user ID 1234." The database uses an index on ID, making it efficient regardless of position.

import base64
import json

@app.get('/users')
def list_users_cursor(
cursor: str = Query(None, description="Opaque pagination cursor"),
limit: int = Query(50, ge=1, le=100),
db: Session = Depends(get_db)
):
"""List users with cursor-based pagination."""

query = db.query(User).order_by(User.id)

# Decode cursor if provided
if cursor:
try:
cursor_data = json.loads(base64.b64decode(cursor).decode())
last_id = cursor_data['id']
# Start after the last seen ID
query = query.filter(User.id > last_id)
except Exception:
raise HTTPException(status_code=400, detail="Invalid cursor")

# Fetch limit + 1 to detect if there are more results
users = query.limit(limit + 1).all()

has_more = len(users) > limit
users = users[:limit]

# Generate next cursor from last user
next_cursor = None
if has_more and users:
next_cursor = base64.b64encode(json.dumps({
'id': users[-1].id
}).encode()).decode()

return {
'items': [u.to_dict() for u in users],
'cursor': next_cursor,
'has_more': has_more
}

# Client usage:
# Page 1: /users?limit=50
# Page 2: /users?limit=50&cursor=eyJpZCI6IDUwfQ==
# Page N: /users?limit=50&cursor=<opaque-cursor> <- Always O(limit) time

The database uses the index on User.id to quickly jump to ID 1234 and return the next 50 rows. No scanning of 5,000 unused rows.

Cursor-Based Pagination with Multiple Sort Fields

When sorting by non-unique fields (e.g., creation date), combine multiple fields in the cursor:

@app.get('/posts')
def list_posts_by_date(
cursor: str = Query(None),
limit: int = Query(20, ge=1, le=100),
db: Session = Depends(get_db)
):
"""List posts sorted by date with cursor pagination."""

query = db.query(Post).order_by(Post.created_at.desc(), Post.id.desc())

if cursor:
try:
cursor_data = json.loads(base64.b64decode(cursor).decode())
# Filter posts created before the cursor
query = query.filter(
(Post.created_at < cursor_data['created_at']) |
(
(Post.created_at == cursor_data['created_at']) &
(Post.id < cursor_data['id'])
)
)
except Exception:
raise HTTPException(status_code=400, detail="Invalid cursor")

posts = query.limit(limit + 1).all()
has_more = len(posts) > limit
posts = posts[:limit]

next_cursor = None
if has_more and posts:
next_cursor = base64.b64encode(json.dumps({
'created_at': posts[-1].created_at.isoformat(),
'id': posts[-1].id
}).encode()).decode()

return {
'items': [p.to_dict() for p in posts],
'cursor': next_cursor,
'has_more': has_more
}

The compound filter (created_at < cursor_at) OR (created_at == cursor_at AND id < cursor_id) ensures no gaps or duplicates when posts have the same creation timestamp.

Keyset Pagination (Seek-Based)

Keyset pagination is like cursor-based but even more efficient for distributed systems. Instead of filtering by ID, you seek to a key in a sorted order and return results from there. Common in Elasticsearch and columnar databases.

@app.get('/events')
def list_events_keyset(
search_after: str = Query(None, description="Seek position from previous page"),
limit: int = Query(50, ge=1, le=100),
db: Session = Depends(get_db)
):
"""
Keyset pagination: each page contains the highest key from previous page.
"""
query = db.query(Event).order_by(Event.timestamp.desc(), Event.id.desc())

if search_after:
try:
data = json.loads(base64.b64decode(search_after).decode())
# Start after the given key
query = query.filter(
(Event.timestamp < data['timestamp']) |
(
(Event.timestamp == data['timestamp']) &
(Event.id < data['id'])
)
)
except Exception:
raise HTTPException(status_code=400, detail="Invalid search_after")

results = query.limit(limit + 1).all()
has_more = len(results) > limit
results = results[:limit]

search_after_value = None
if has_more and results:
search_after_value = base64.b64encode(json.dumps({
'timestamp': results[-1].timestamp.isoformat(),
'id': results[-1].id
}).encode()).decode()

return {
'data': [r.to_dict() for r in results],
'search_after': search_after_value
}

Handling Real-Time Data with Pagination

When data is being added/removed concurrently (e.g., a live feed), cursor-based pagination prevents duplicate or missed items:

# Using created_at as cursor (safer than ID for real-time data)
@app.get('/feed')
def get_feed(
until_time: str = Query(None, description="ISO 8601 timestamp"),
limit: int = Query(20),
db: Session = Depends(get_db)
):
"""
Get feed items until a specific time.
Safe for concurrent inserts/deletes because we use timestamp, not row count.
"""
query = db.query(FeedItem).order_by(FeedItem.created_at.desc())

if until_time:
until_dt = datetime.fromisoformat(until_time)
query = query.filter(FeedItem.created_at < until_dt)

items = query.limit(limit + 1).all()
has_more = len(items) > limit
items = items[:limit]

next_until = None
if has_more and items:
next_until = items[-1].created_at.isoformat()

return {
'items': [i.to_dict() for i in items],
'next_until': next_until
}

Key Takeaways

  • Offset-limit is simple but slow at high offsets (O(n) where n is offset). Avoid for large datasets.
  • Cursor-based pagination is O(limit) regardless of position. Use for most production APIs.
  • Encode cursors as base64 JSON to hide implementation details from clients.
  • For real-time data (feeds, live updates), use timestamp-based cursors to handle concurrent inserts safely.
  • Always fetch limit+1 rows to determine if more results exist.

Frequently Asked Questions

Can clients navigate backward with cursor pagination?

Not easily. Cursor pagination is inherently forward-only. If backward navigation matters, either cache cursors on the client or use keyset pagination with both forward and backward pointers. For most use cases, forward-only is sufficient.

How do I show total result count with cursor pagination?

You don't, and that's by design. Computing the total would require counting all rows, which is expensive at scale. Most modern UIs don't need the count; showing "has more" is sufficient.

What if my data is unsorted or sorting changes?

Use a stable, immutable sort key. Never sort by updated_at (changes over time). Sort by created_at (immutable) plus ID for tiebreaking. If you must sort by updated_at, accept that pagination might have gaps.

How do I implement cursor pagination without exposing database IDs?

Use a hash: cursor = base64(hash(row_id, secret_key)). On the next request, verify the hash before using the row ID. Or use a UUID column instead of auto-increment IDs.

Should I fetch limit+1 or limit rows?

Fetch limit+1 to determine if more results exist. This tells you if there's a next page without a separate count query. Return only limit rows to clients.

Further Reading