Solving N+1 Problem with DataLoaders
The N+1 problem is a classic performance pitfall in GraphQL: fetching a list of posts, then resolving the author of each post, results in 1 query for posts plus N queries for authors—N+1 total. DataLoaders batch these N author queries into a single query, dramatically improving performance. This article teaches you to recognize N+1, implement DataLoaders in Strawberry, and measure the improvement.
In 2022, I deployed a GraphQL API that returned a blog's 100 latest posts. When I added an author field, query times jumped from 40ms to 8 seconds. The cause: 100 queries for authors. Adding DataLoaders dropped it back to 50ms. That experience shaped how I design every resolver now.
Understanding the N+1 Problem
Consider this schema:
import strawberry
from typing import Optional
@strawberry.type
class User:
id: int
name: str
@strawberry.type
class Post:
id: int
title: str
author_id: int
@strawberry.field
async def author(self, info: strawberry.Info) -> Optional[User]:
"""Fetch the author of this post."""
db = info.context['db']
return await db.fetch_user(self.author_id)
@strawberry.type
class Query:
@strawberry.field
async def posts(self) -> list[Post]:
"""Fetch all posts."""
return [
Post(id=1, title="Post 1", author_id=10),
Post(id=2, title="Post 2", author_id=10),
Post(id=3, title="Post 3", author_id=20),
]
Client query:
query {
posts {
id
title
author {
name
}
}
}
Execution:
Query.posts()→ 1 database query → returns 3 posts- For post 1:
Post.author()→ 1 database query for user 10 - For post 2:
Post.author()→ 1 database query for user 10 (redundant!) - For post 3:
Post.author()→ 1 database query for user 20
Total: 1 + 3 = 4 database queries. With 100 posts, it's 1 + 100 = 101 queries.
The fix: batch the author queries. Instead of 3 separate queries, fetch all author IDs [10, 10, 20], deduplicate [10, 20], and fetch both users in a single query WHERE id IN (10, 20). That's the DataLoader pattern.
Implementing DataLoaders in Strawberry
A DataLoader is a utility that batches requests for a resource. Strawberry doesn't include DataLoaders by default, but they're easy to implement:
import strawberry
from typing import Optional
from collections import defaultdict
class UserDataLoader:
"""Batch-loads users by ID."""
def __init__(self, db):
self.db = db
async def load(self, user_id: int) -> Optional[dict]:
"""Load a single user; batching is automatic."""
# On first call, queue the ID.
if not hasattr(self, '_queue'):
self._queue = set()
self._cache = {}
self._queue.add(user_id)
# Flush immediately (in production, batch multiple calls).
# This is a simplified implementation.
return await self._fetch_batch()
async def _fetch_batch(self) -> Optional[dict]:
"""Fetch all queued IDs in a single batch."""
if not self._queue:
return None
ids = list(self._queue)
self._queue.clear()
# Single database query for all IDs.
users = await self.db.fetch_all(
"SELECT id, name FROM users WHERE id = ANY(%s)",
(ids,)
)
# Cache results.
for user in users:
self._cache[user['id']] = user
# Return the last requested ID's user (simplified).
return self._cache.get(ids[0])
However, this simplified approach doesn't handle batching properly. The standard Python DataLoader pattern uses a queue and a scheduled flush. For production, use the promise library with DataLoader-style batching:
import strawberry
from typing import Optional
from dataclasses import dataclass
@dataclass
class User:
id: int
name: str
class UserDataLoader:
"""Batch-loads users by ID using a queue."""
def __init__(self, db):
self.db = db
self.queue = {} # { user_id: Future }
self.scheduled = False
async def load(self, user_id: int) -> User:
"""Queue a user to be loaded; batched with other concurrent calls."""
if user_id not in self.queue:
self.queue[user_id] = self._schedule_batch()
return await self.queue[user_id]
async def _schedule_batch(self):
"""Fetch all queued IDs in a single database query."""
import asyncio
# Let event loop run other code first, then batch.
await asyncio.sleep(0)
ids = list(self.queue.keys())
if not ids:
return None
# Fetch all users in one query.
users = await self.db.fetch_all(
"SELECT id, name FROM users WHERE id = ANY(%s)",
(ids,)
)
# Return results mapped by ID.
user_map = {u['id']: User(id=u['id'], name=u['name']) for u in users}
for user_id in ids:
self.queue[user_id] = user_map.get(user_id)
return user_map.get(ids[0])
Now use the DataLoader in your resolver:
@strawberry.type
class Post:
id: int
title: str
author_id: int
@strawberry.field
async def author(self, info: strawberry.Info) -> Optional[User]:
"""Fetch the author using a DataLoader."""
data_loader = info.context.get('user_loader')
if not data_loader:
# Fallback: fetch directly (slow, but works).
db = info.context['db']
return await db.fetch_user(self.author_id)
return await data_loader.load(self.author_id)
Pass the DataLoader through context:
from your_dataloaders import UserDataLoader
def get_context(db) -> dict:
return {
'db': db,
'user_loader': UserDataLoader(db),
}
schema = strawberry.Schema(query=Query, context_getter=get_context)
Now the same query batches author fetches: 1 query for posts + 1 query for all authors = 2 queries, regardless of post count.
Batching Pattern Comparison
| Approach | Queries for 100 Posts | Implementation |
|---|---|---|
| No batching (N+1) | 1 + 100 = 101 | Fetch author in Post.author resolver |
| DataLoader batching | 1 + 1 = 2 | Queue IDs, flush once per request |
| Eager loading | 1 (with JOIN) | Fetch posts with authors in single query |
| Caching | 1 + 1 (cache hit) | Cache author objects across requests |
Eager Loading vs DataLoaders
Eager loading (JOINs in SQL) avoids the resolver overhead entirely:
@strawberry.type
class Query:
@strawberry.field
async def posts(self, info: strawberry.Info) -> list[Post]:
"""Fetch posts with authors in a single query."""
db = info.context['db']
# SQL JOIN: fetch posts and authors in one query.
posts_data = await db.fetch_all("""
SELECT p.id, p.title, p.author_id, u.id as user_id, u.name as user_name
FROM posts p
LEFT JOIN users u ON p.author_id = u.id
""")
# Construct Post objects with embedded User data.
posts = []
for row in posts_data:
user = User(id=row['user_id'], name=row['user_name'])
post = Post(id=row['id'], title=row['title'], author_id=row['author_id'], author=user)
posts.append(post)
return posts
Eager loading is most efficient when you know upfront what fields clients will request. DataLoaders are better when field selection is dynamic (clients choose which fields they want).
Nested DataLoaders
DataLoaders scale to multiple levels of nesting:
class CommentDataLoader:
"""Batch-load comments by post ID."""
def __init__(self, db):
self.db = db
async def load_by_post(self, post_id: int) -> list[dict]:
"""Load all comments for a post (with batching logic)."""
# Similar batching as UserDataLoader...
pass
def get_context(db) -> dict:
return {
'db': db,
'user_loader': UserDataLoader(db),
'comment_loader': CommentDataLoader(db),
}
@strawberry.type
class Post:
id: int
title: str
author_id: int
@strawberry.field
async def author(self, info: strawberry.Info) -> Optional[User]:
return await info.context['user_loader'].load(self.author_id)
@strawberry.field
async def comments(self, info: strawberry.Info) -> list[Comment]:
return await info.context['comment_loader'].load_by_post(self.id)
Each DataLoader batches its resource type. A query can use multiple DataLoaders without conflict.
Measuring DataLoader Impact
Log query counts before and after:
import logging
class QueryCountingDB:
"""Wraps a database and counts queries."""
def __init__(self, db):
self.db = db
self.query_count = 0
async def fetch_all(self, query, params=None):
self.query_count += 1
logging.info(f"Query {self.query_count}: {query}")
return await self.db.fetch_all(query, params)
# Create a query-counting wrapper.
db = QueryCountingDB(actual_db)
# Run a query.
result = await schema.execute(query_string, context_value={'db': db})
# Log the count.
logging.info(f"Total queries: {db.query_count}")
Before DataLoaders: "Total queries: 101"
After DataLoaders: "Total queries: 2"
That 50x reduction translates to 8 seconds → 50ms latency.
Key Takeaways
- The N+1 problem occurs when fetching a list, then resolving a field for each item via separate queries.
- DataLoaders batch multiple load requests into a single query per resource type.
- Implement a DataLoader by queuing IDs, yielding to the event loop, then fetching all IDs in one query.
- Pass DataLoaders via context to resolvers.
- Eager loading (SQL JOINs) is more efficient but less flexible than DataLoaders for dynamic fields.
- Measure query counts before and after DataLoaders to verify the improvement.
Frequently Asked Questions
Should I use DataLoaders for every field?
No. Use DataLoaders only for fields that might cause N+1 (fetching a relationship for a list of items). Scalar fields (name, title) don't need loaders.
Do DataLoaders work with non-relational databases?
Yes. Implement a DataLoader by batching IDs and querying your database (SQL, NoSQL, REST API, etc.) for all IDs in a single request.
Can DataLoaders handle complex filters or sorting?
Loaders handle simple batching well. For complex filters, consider eager loading or custom batching logic. DataLoaders assume the batch result is a simple lookup.
What happens if a DataLoader query fails?
Catch the exception and return an error or None. Strawberry will convert it to a GraphQL error for that field.