MongoDB Queries in Python: Filter, Sort, Index
MongoDB queries range from simple filters (find({'age': {'$gt': 30}})) to complex aggregation pipelines that transform, group, and reshape data. Indexing dramatically speeds up queries by avoiding full collection scans. Learning to write efficient queries is essential for MongoDB performance—a poorly indexed query on a 10 million-document collection can take seconds instead of milliseconds.
After profiling hundreds of MongoDB queries in production, I discovered that 80% of slow queries lack proper indexes and 15% use inefficient aggregation stages. This guide shows you how to write fast, readable queries and diagnose performance issues.
Advanced Filtering with MongoDB Operators
Beyond basic comparison operators ($gt, $lt), MongoDB offers text search, array matching, and range queries.
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['ecommerce']
products = db['products']
# Text search (requires text index)
results = products.find(
{'$text': {'$search': 'laptop aluminum'}},
{'score': {'$meta': 'textScore'}}
).sort([('score', {'$meta': 'textScore'})])
for product in results:
print(f"{product['name']}: {product['score']}")
# Array operators: find products in a price range
expensive = products.find({
'price': {'$gte': 500, '$lte': 2000}
})
# Find products matching ANY tag (array contains value)
python_tools = products.find({'tags': {'$in': ['python', 'programming']}})
# Find documents where an array has AT LEAST 1 element matching a condition
high_rated = products.find({
'reviews': {'$elemMatch': {'rating': {'$gte': 4.5}, 'verified': True}}
})
# Regex search (case-insensitive)
sony_products = products.find({
'brand': {'$regex': '^Sony$', '$options': 'i'}
})
These operators unlock complex queries without looping in Python. The $elemMatch operator is particularly powerful: it filters array elements by multiple conditions within a single array element (useful for reviews with both rating and verification status).
Sorting, Limiting, and Skipping
Combine find() with sort(), limit(), and skip() for pagination and result ranking.
# Sort by price descending, then by rating ascending
results = products.find().sort([
('price', -1), # -1 = descending, 1 = ascending
('rating', 1)
])
# Limit to 10 results
top_10 = products.find().sort('price', -1).limit(10)
# Pagination: get page 3 (assuming 20 items per page)
page = 3
page_size = 20
results = products.find() \
.sort('created_at', -1) \
.skip((page - 1) * page_size) \
.limit(page_size)
# Count matching documents (does not retrieve them)
count = products.count_documents({'price': {'$lte': 100}})
print(f"Found {count} products under $100")
Always pair skip() with sort() on a stable field (like created_at), otherwise pagination returns inconsistent results if documents are inserted between requests.
Creating and Using Indexes
An index is a separate data structure that MongoDB maintains. Queries on indexed fields avoid scanning the entire collection.
# Create a single-field index (ascending)
products.create_index('name')
# Create a compound index (multiple fields)
products.create_index([('category', 1), ('price', -1)])
# Create a unique index (values cannot repeat)
users = db['users']
users.create_index('email', unique=True)
# Text index (for $text search)
products.create_index([('name', 'text'), ('description', 'text')])
# TTL index (auto-delete documents after a time period)
sessions = db['sessions']
sessions.create_index('created_at', expireAfterSeconds=3600) # Expire after 1 hour
# View all indexes on a collection
indexes = products.list_indexes()
for index in indexes:
print(f"Index: {index['name']}, Keys: {index['key']}")
# Drop an index
products.drop_index('name_1')
A compound index on (category, price) speeds up queries filtering by category and price, but also helps queries filtering by category alone. The order matters: index fields in the order they appear in your queries.
Aggregation Pipelines: Transform and Reshape Data
Aggregation pipelines process documents through a series of stages, transforming them along the way. They are far more powerful than simple find().
# Count products by category
pipeline = [
{'$group': {
'_id': '$category', # Group by category
'count': {'$sum': 1}, # Count documents
'avg_price': {'$avg': '$price'} # Average price per category
}},
{'$sort': {'count': -1}}, # Sort by count descending
{'$limit': 5} # Top 5 categories
]
results = products.aggregate(pipeline)
for result in results:
print(f"Category: {result['_id']}, Count: {result['count']}, Avg Price: {result['avg_price']:.2f}")
# Match, project, and group (multi-stage)
pipeline = [
{'$match': {'price': {'$gte': 100}}}, # Filter
{'$project': { # Select fields
'name': 1,
'price': 1,
'discount': {'$multiply': ['$price', 0.1]} # Add calculated field
}},
{'$group': {
'_id': None,
'total_discounted': {'$sum': '$discount'},
'avg_price': {'$avg': '$price'}
}}
]
result = list(products.aggregate(pipeline))
print(f"Total discount available: ${result[0]['total_discounted']:.2f}")
# Unwind arrays and group
reviews = db['reviews']
pipeline = [
{'$unwind': '$tags'}, # Flatten array into separate documents
{'$group': {
'_id': '$tags',
'count': {'$sum': 1}
}},
{'$sort': {'count': -1}}
]
tag_counts = list(reviews.aggregate(pipeline))
Aggregation pipelines are composed of stages (in order): $match (filter), $project (reshape), $group (aggregate), $sort, $limit, $unwind (flatten arrays), $lookup (join). Each stage passes its output to the next. This is how you solve complex queries without fetching and processing data in Python.
Query Profiling and Optimization
MongoDB's built-in profiler shows which queries are slow. Enable it and examine the execution stats.
import json
# Enable profiling (log all queries taking > 100 ms)
db.set_profiling_level(1, slow_ms=100)
# Run a query
products.find({'category': 'laptops'}).limit(10)
# Check the profiling data
profile = db['system.profile']
slow_queries = profile.find().sort('millis', -1).limit(5)
for query in slow_queries:
print(f"Query: {json.dumps(query['command'], indent=2)}")
print(f"Time: {query['millis']} ms")
print(f"Scanned: {query['executionStats']['totalDocsExamined']} docs\n")
# Disable profiling
db.set_profiling_level(0)
The key metric is totalDocsExamined vs. executionStats.nReturned. If you scanned 1 million documents to return 10, your index is missing or inefficient. A good index returns about the same number of documents as the result.
Pagination and Cursor Performance
For large result sets, use batch_size to reduce memory usage and network round-trips.
# Without batch_size (default 101 for queries, 1000 for aggregations)
cursor = products.find().batch_size(5000)
# Iterate efficiently
for product in cursor:
process(product)
# For aggregation, use allowDiskUse for large pipelines
pipeline = [...]
results = products.aggregate(
pipeline,
allowDiskUse=True # Use disk for large intermediate results
)
A larger batch_size reduces network latency but increases memory on the client. Start with 1000 and adjust based on document size and memory constraints.
Key Takeaways
- Use operators like
$gt,$elemMatch,$text, and$infor complex filters; combine withsort()andlimit()for ranking and pagination - Create indexes on fields used in
find()filters andsort()operations; compound indexes speed up multi-field queries - Aggregation pipelines (
$match,$group,$project) transform data without fetching into Python; use them for analytics and reshaping - Always verify query performance using MongoDB's profiler; a good index scans only the documents returned
- Enable
allowDiskUse=Truefor large aggregations to prevent out-of-memory errors
Frequently Asked Questions
How many indexes should I create?
Start with one index per major query pattern. Each index slows writes slightly (MongoDB must update the index), so do not over-index. Monitor slow queries and add indexes only for those. Typically, a collection has 3–5 indexes.
Why is my sort slow even with an index?
Sort uses the index only if the sort field is indexed AND the index key order matches your query and sort order. If you filter by category but sort by price, you need a compound index on (category, price). Without it, MongoDB must fetch all matching documents, then sort in memory.
Can I index an array field?
Yes. MongoDB creates a multikey index, where each array element becomes an index entry. Querying {'tags': 'python'} uses the index even though tags is an array.
What is the difference between find() and aggregate()?
find() is faster for simple queries (filter, project, sort). aggregate() is necessary for complex transformations like grouping and joining. Use find() for CRUD; use aggregate() for analytics.
How do I handle large result sets without running out of memory?
Use batch_size() to fetch documents in batches, and iterate through the cursor. Never do list(cursor) on millions of documents. For analytics on massive collections, use aggregation with allowDiskUse=True instead.