Production Patterns: Monitoring, Failover, Scaling NoSQL

Production databases differ from development in three ways: they must survive failures (failover), adapt to growing data (sharding), and expose their health (monitoring). A development database on your laptop is acceptable to lose; a production database serving 100,000 users cannot. This guide teaches the patterns that keep NoSQL reliable at scale.

After operating MongoDB and Redis in production for five years, managing database failures during peak traffic, I learned that preparation prevents panic. Monitoring catches problems before they cascade. Failover keeps services alive. Sharding prevents single-database bottlenecks. This article consolidates those hard-won lessons.

Monitoring MongoDB Health

Monitor key metrics: connection count, operation latency, replication lag, disk usage, and query performance.

from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError
import time

client = MongoClient('mongodb://localhost:27017/')

def check_mongodb_health():
    """Check MongoDB connection and basic metrics."""
    try:
        # Test connectivity
        client.admin.command('ping')

        # Get server status
        status = client.admin.command('serverStatus')

        metrics = {
            'version': status['version'],
            'uptime_seconds': status['uptime'],
            'connections_current': status['connections']['current'],
            'connections_available': status['connections']['available'],
            'memory_resident_mb': status['mem']['resident'],
            'memory_virtual_mb': status['mem']['virtual'],
            'network_bytes_in': status['network']['bytesIn'],
            'network_bytes_out': status['network']['bytesOut'],
        }

        return True, metrics

    except ServerSelectionTimeoutError as e:
        return False, {'error': f'Cannot connect to MongoDB: {e}'}

# Usage
healthy, metrics = check_mongodb_health()
if healthy:
    print(f"MongoDB is healthy. Current connections: {metrics['connections_current']}")
else:
    print(f"MongoDB is down: {metrics['error']}")

# In production, call this every 10 seconds and alert if unhealthy for 2+ checks
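
The "unhealthy for 2+ checks" rule can be sketched as a small counter that sits between the health check and the alert. This is a minimal illustration, not part of the original monitoring code; the `HealthTracker` class and its `threshold` parameter are hypothetical names.

```python
class HealthTracker:
    """Fire an alert only after `threshold` consecutive failed checks."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, healthy):
        """Record one check result; return True when an alert should fire."""
        if healthy:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        # Fire exactly once, when the failure streak first reaches the threshold
        return self.consecutive_failures == self.threshold


tracker = HealthTracker(threshold=2)
# Simulated check results; in practice each value would come from check_mongodb_health()
results = [tracker.record(h) for h in [True, False, False, False, True, False]]
print(results)  # [False, False, True, False, False, False]
```

A single failed check is often a network blip, so waiting for a streak trades a few seconds of detection time for fewer false pages.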

Monitoring Redis Health

Redis exposes metrics via the INFO command. Monitor key metrics: connected clients, memory usage, evictions, and command latency.

import redis
import time

r = redis.Redis(host='redis', port=6379, decode_responses=True)

def check_redis_health():
    """Check Redis connection and metrics."""
    try:
        # Test connectivity
        r.ping()

        # Get info
        info = r.info('all')

        metrics = {
            'redis_version': info['redis_version'],
            'uptime_seconds': info['uptime_in_seconds'],
            'connected_clients': info['connected_clients'],
            'used_memory_mb': info['used_memory'] / 1024 / 1024,
            'used_memory_peak_mb': info['used_memory_peak'] / 1024 / 1024,
            'evicted_keys': info.get('evicted_keys', 0),
            'total_commands_processed': info['total_commands_processed'],
            'keyspace_db0': info.get('db0', {}).get('keys', 0),
        }

        return True, metrics

    except redis.ConnectionError as e:
        return False, {'error': f'Cannot connect to Redis: {e}'}

# Usage
healthy, metrics = check_redis_health()
if healthy:
    print(f"Redis is healthy. Memory: {metrics['used_memory_mb']:.0f} MB")
else:
    print(f"Redis is down: {metrics['error']}")

Alert if used memory exceeds 80% of the configured maxmemory, if evictions spike (a sign of cache thrashing), or if command latency exceeds 100 ms.
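
The memory and eviction thresholds can be evaluated from two consecutive INFO snapshots, roughly as below. This is a sketch under assumptions: the function name `redis_alerts` and the default limit of 100 evictions per second are illustrative, and latency is left out because INFO does not report it directly. The `used_memory`, `maxmemory`, and `evicted_keys` fields are the ones INFO returns.

```python
def redis_alerts(prev_info, curr_info, interval_seconds,
                 memory_ratio_limit=0.8, evictions_per_sec_limit=100.0):
    """Compare two INFO snapshots and return a list of alert strings."""
    alerts = []

    # maxmemory == 0 means "no limit", so the ratio is undefined in that case
    maxmemory = curr_info.get('maxmemory', 0)
    if maxmemory > 0:
        ratio = curr_info['used_memory'] / maxmemory
        if ratio > memory_ratio_limit:
            alerts.append(f"memory at {ratio:.0%} of maxmemory")

    # evicted_keys is a lifetime counter, so a spike shows up in the delta
    evicted = curr_info.get('evicted_keys', 0) - prev_info.get('evicted_keys', 0)
    rate = evicted / interval_seconds
    if rate > evictions_per_sec_limit:
        alerts.append(f"evictions at {rate:.0f}/sec")

    return alerts


# Two hand-written snapshots taken 60 seconds apart
prev = {'evicted_keys': 1000}
curr = {'used_memory': 900, 'maxmemory': 1000, 'evicted_keys': 13000}
print(redis_alerts(prev, curr, interval_seconds=60))
```

Alerting on the eviction rate rather than the raw counter avoids a permanent alert once the lifetime total passes a fixed number.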

Replication and Failover: MongoDB Replica Sets

A replica set is a group of three or more MongoDB nodes. In the usual three-node layout, one primary accepts writes and two secondaries hold replicated copies. If the primary fails, the secondaries elect a new primary automatically.

MongoDB Replica Set
┌─────────────┐
│ Primary │ (accepts reads and writes)
├─────────────┤
│ Secondaries │ (read-only, replicate from primary)
├─────────────┤
│ Secondaries │
└─────────────┘

Configure replica set in production (3 nodes minimum):

# /etc/mongod.conf on all three nodes
replication:
  replSetName: "rs0"

# Start MongoDB on each node
mongod --config /etc/mongod.conf

Initialize the replica set:

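from pymongo import MongoClient
from pymongo.errors import OperationFailure

# Initialize (run once). Connect directly to one node, because the
# replica set does not exist yet and cannot be discovered.
seed = MongoClient('mongodb://mongo1:27017/?directConnection=true')
try:
    seed.admin.command('replSetInitiate', {
        '_id': 'rs0',
        'members': [
            {'_id': 0, 'host': 'mongo1:27017'},
            {'_id': 1, 'host': 'mongo2:27017'},
            {'_id': 2, 'host': 'mongo3:27017'},
        ],
    })
except OperationFailure:
    pass  # Already initialized

# Connect to the replica set through its seed list
client = MongoClient('mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0')
admin = client.admin

# Check replica set status
status = admin.command('replSetGetStatus')
print(f"Replica set: {status['set']}")
for member in status['members']:
    # stateStr is PRIMARY, SECONDARY, ARBITER, and so on
    print(f"  {member['name']}: {member['stateStr']}")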

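PyMongo automatically routes writes to the primary. Reads also go to the primary by default; they go to secondaries only when you set a read preference such as secondaryPreferred. On primary failure, operations that need a primary wait (up to serverSelectionTimeoutMS) until the secondaries elect a new one.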
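
The replSetGetStatus document used above also carries an `optimeDate` for each member, which is enough to estimate replication lag. The helper below is a minimal sketch, not the author's code; `replication_lag_seconds` is a hypothetical name, and the status document in the usage example is hand-written rather than real server output.

```python
from datetime import datetime, timedelta

def replication_lag_seconds(status):
    """Return {secondary_name: lag_in_seconds}, or None if there is no primary."""
    members = status['members']
    primary = next((m for m in members if m['stateStr'] == 'PRIMARY'), None)
    if primary is None:
        return None  # Election in progress; lag is undefined
    return {
        m['name']: (primary['optimeDate'] - m['optimeDate']).total_seconds()
        for m in members
        if m['stateStr'] == 'SECONDARY'
    }


# Hand-written stand-in for admin.command('replSetGetStatus')
now = datetime(2024, 6, 1, 12, 0, 0)
fake_status = {
    'set': 'rs0',
    'members': [
        {'name': 'mongo1:27017', 'stateStr': 'PRIMARY', 'optimeDate': now},
        {'name': 'mongo2:27017', 'stateStr': 'SECONDARY', 'optimeDate': now - timedelta(seconds=1)},
        {'name': 'mongo3:27017', 'stateStr': 'SECONDARY', 'optimeDate': now - timedelta(seconds=45)},
    ],
}
lags = replication_lag_seconds(fake_status)
print(lags)  # {'mongo2:27017': 1.0, 'mongo3:27017': 45.0}
```

Note that `optimeDate` has one-second resolution, so this estimate cannot resolve sub-second lag.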

Replication and Failover: Redis Sentinel

Redis Sentinel monitors Redis instances and promotes a replica to primary if the primary fails.

Redis Sentinel
┌─────────────────────────────────────┐
│ Sentinel Nodes (3+ required) │
│ Monitor primary and replicas │
│ Auto-promote replica on failure │
└─────────────────────────────────────┘
monitors

┌─────────────────────────────────────┐
│ Redis Primary (master) │
│ Accepts writes, replicates to slave │
└─────────────────────────────────────┘
replicates

┌─────────────────────────────────────┐
│ Redis Replica (slave) │
│ Read-only copy, can be promoted │
└─────────────────────────────────────┘

Configure Sentinel:

# /etc/sentinel.conf
port 26379
sentinel monitor mymaster 127.0.0.1 6379 2
sentinel down-after-milliseconds mymaster 5000
sentinel parallel-syncs mymaster 1
sentinel failover-timeout mymaster 10000

# Start Sentinel
redis-sentinel /etc/sentinel.conf

Connect to Sentinel from Python:

from redis.sentinel import Sentinel

# Connect to Sentinel (monitor nodes)
sentinel = Sentinel([('sentinel1', 26379), ('sentinel2', 26379), ('sentinel3', 26379)])

# Get master and slave connections
master = sentinel.master_for('mymaster', socket_timeout=0.1, db=0)
slave = sentinel.slave_for('mymaster', socket_timeout=0.1, db=0)

# Write to master
master.set('key', 'value')

# Read from slave (may be slightly stale due to replication lag)
value = slave.get('key')

Sentinel fails over automatically: if the primary goes down, a replica is promoted.

Sharding: Scaling MongoDB Horizontally

Sharding distributes data across multiple servers by a shard key. Data is split into chunks, each chunk lives on a shard (server).

MongoDB Sharded Cluster
┌────────────────────────────────────────┐
│ Mongos (router) │
│ Routes queries to correct shard │
└────────────────────────────────────────┘
routes to
↙ ↓ ↘
Shard 1 Shard 2 Shard 3
Keys Keys Keys
A–G H–M N–Z

Enable sharding in production (3+ config servers, multiple shards):

from pymongo import MongoClient
from pymongo.errors import OperationFailure

# Connect to mongos (router)
client = MongoClient('mongodb://mongos1:27017,mongos2:27017')

admin = client.admin

# Enable sharding on a database
try:
    admin.command('enableSharding', 'myapp')
except OperationFailure:
    pass  # Already enabled

# Create a shard key index
db = client['myapp']
users = db['users']
users.create_index('user_id')

# Shard the collection
admin.command('shardCollection', 'myapp.users', key={'user_id': 1})

# Query (mongos routes to correct shard)
user = users.find_one({'user_id': 12345})

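Choose shard keys carefully: use fields that distribute writes evenly and match your query patterns. Avoid monotonically increasing values such as dates or timestamps as a ranged shard key, because every new document then lands in the last chunk and one shard absorbs all inserts.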
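
The effect of a monotonically increasing key can be seen in a small simulation. This is a simplified model, not MongoDB's actual chunk or hashing logic: the boundaries, the `range_shard` and `hashed_shard` helpers, and the use of MD5 with a modulo are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 3

def range_shard(key, boundaries):
    """Return the index of the first range whose upper bound exceeds key."""
    for i, upper in enumerate(boundaries):
        if key < upper:
            return i
    return len(boundaries)  # Last range is open-ended

def hashed_shard(key):
    """Place a key by hashing it, so neighbouring keys scatter across shards."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], 'big') % NUM_SHARDS

# Existing keys 0..2999 were split into three ranges
boundaries = [1000, 2000]      # shard 0: < 1000, shard 1: < 2000, shard 2: the rest
new_keys = range(3000, 3300)   # New inserts with ever-increasing keys

range_counts = Counter(range_shard(k, boundaries) for k in new_keys)
hashed_counts = Counter(hashed_shard(k) for k in new_keys)

print("ranged:", dict(range_counts))   # All 300 inserts hit shard 2
print("hashed:", dict(hashed_counts))  # Inserts spread over all shards
```

Hashing trades away efficient range queries on the key, so it suits lookups by exact value more than scans over a key range.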

Sharding: Scaling Redis with Cluster

Redis Cluster spreads keys across multiple master nodes. Data is automatically partitioned.

Redis Cluster
┌────────────────────────────────┐
│ Redis Client Library (redis-py)│
│ Routes to correct node │
└────────────────────────────────┘
routes to
↙ ↓ ↓ ↘
M1 M2 M3 M4
Slaves accompany masters for failover

Set up a Redis Cluster (6 nodes: 3 masters + 3 replicas):

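# Start 6 Redis instances on different ports (6379–6384)
# Nodes sharing a working directory each need their own cluster config and dump file
for port in {6379..6384}; do
  redis-server --port $port --cluster-enabled yes \
    --cluster-config-file nodes-$port.conf --dbfilename dump-$port.rdb &
done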

# Create cluster
redis-cli --cluster create 127.0.0.1:6379 127.0.0.1:6380 127.0.0.1:6381 \
127.0.0.1:6382 127.0.0.1:6383 127.0.0.1:6384 --cluster-replicas 1

Connect from Python:

from redis.cluster import RedisCluster, ClusterNode

# Connect to any node; client auto-discovers cluster
# redis-py expects ClusterNode objects here, not plain dicts
rc = RedisCluster(startup_nodes=[ClusterNode('redis1', 6379)], decode_responses=True)

# Use normally; cluster handles routing
rc.set('key', 'value')
value = rc.get('key')

# Check cluster info
cluster_info = rc.cluster_info()
print(f"Cluster state: {cluster_info['cluster_state']}")
print(f"Known nodes: {cluster_info['cluster_known_nodes']}")

Redis Cluster builds failover into the cluster itself, so no separate Sentinel processes are needed, but it requires at least 6 nodes (3 masters plus 3 replicas) for high availability.
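
The automatic partitioning follows a fixed rule from the Redis Cluster specification: each key maps to one of 16384 hash slots via CRC16 (the XMODEM variant) modulo 16384, and a `{...}` hash tag restricts hashing to the text inside the braces. The sketch below reproduces that rule with the standard library; `key_slot` is an illustrative name, and in a live cluster `CLUSTER KEYSLOT` gives the authoritative answer.

```python
import binascii

def key_slot(key: str) -> int:
    """Compute the Redis Cluster hash slot (0..16383) for a key."""
    data = key.encode()
    start = data.find(b'{')
    if start != -1:
        end = data.find(b'}', start + 1)
        # Use the hash tag only if it is non-empty, e.g. "{42}" but not "{}"
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # binascii.crc_hqx with initial value 0 is CRC16/XMODEM (polynomial 0x1021)
    return binascii.crc_hqx(data, 0) % 16384


print(key_slot('123456789'))  # 12739 (CRC16 reference value 0x31C3)

# Keys sharing a hash tag land in the same slot, so multi-key commands work on them
print(key_slot('user:{42}:profile') == key_slot('user:{42}:orders'))  # True
```

Hash tags are the usual way to keep related keys on one node, at the cost of possible hot spots if one tag is far more popular than the others.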

Connection Pooling and Retry Logic

In production, always use connection pooling and handle transient failures gracefully.

from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError, ConnectionFailure
import time

# MongoDB connection pool (auto-managed)
mongo_client = MongoClient(
    'mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0',
    maxPoolSize=50,
    minPoolSize=10,
    serverSelectionTimeoutMS=5000,  # Give up if no suitable server is found within 5 s
    connectTimeoutMS=10000,
    retryWrites=True,  # Auto-retry writes
)

# Redis connection pool
import redis
redis_pool = redis.ConnectionPool(
    host='redis',
    port=6379,
    max_connections=50,
    socket_connect_timeout=5,
    socket_keepalive=True,
    health_check_interval=30,
    decode_responses=True,  # Set on the pool; Redis() ignores it when given a pool
)
r = redis.Redis(connection_pool=redis_pool)

def insert_with_retry(db, collection_name, document, max_retries=3):
    """Insert with exponential backoff on failure."""
    for attempt in range(max_retries):
        try:
            result = db[collection_name].insert_one(document)
            return result.inserted_id

        except (ServerSelectionTimeoutError, ConnectionFailure) as e:
            if attempt < max_retries - 1:
                wait_time = 2 ** attempt  # 1, 2, 4 seconds
                print(f"Attempt {attempt + 1} failed. Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                raise

# Usage
try:
    # Placeholder address for illustration
    doc_id = insert_with_retry(mongo_client.myapp, 'users', {'email': 'user@example.com'})
    print(f"Inserted: {doc_id}")
except Exception as e:
    print(f"Failed after retries: {e}")
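
When many clients retry on the same 1, 2, 4 second schedule, they tend to hit a recovering database in synchronized waves. A common variant, not used in the code above, is to add random jitter to each delay. The sketch below shows one such scheme ("full jitter"); the `backoff_delays` name and the `base`, `cap`, and `rng` parameters are assumptions for illustration.

```python
import random

def backoff_delays(max_retries, base=1.0, cap=30.0, rng=random.random):
    """Return one randomized delay per retry: uniform in [0, min(cap, base * 2**attempt))."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng() * ceiling)
    return delays


# With rng fixed at 1.0 the delays equal the ceilings, which shows the capped schedule
print(backoff_delays(6, rng=lambda: 1.0))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]

# With the default rng each delay is random but never exceeds its ceiling
print(backoff_delays(3))
```

Each value would replace `wait_time` in a retry loop; the cap keeps a long outage from producing multi-minute sleeps.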

Backup and Recovery

Backup MongoDB and Redis regularly. Test recovery quarterly to ensure backups work.

MongoDB Backup

# Backup to directory (can run against a live deployment; without --oplog the dump is not a point-in-time snapshot)
mongodump --uri 'mongodb://replica_set/myapp' --out /backup/myapp-$(date +%Y%m%d)

# Restore from backup
mongorestore --uri 'mongodb://replica_set/myapp' /backup/myapp-20240601

Redis Backup

# Backup RDB (snapshot)
redis-cli BGSAVE # Backgrounded save

# Backup AOF (append-only file, continuous)
# Configured in redis.conf: appendonly yes

# Restore by stopping Redis, copying RDB/AOF, and restarting
# SHUTDOWN NOSAVE stops Redis from writing a final snapshot over the restored file
redis-cli SHUTDOWN NOSAVE
cp /backup/dump.rdb /var/lib/redis/
redis-server

Alerting and On-Call Playbook

Define alerts for critical metrics:

import time

import requests

def check_alerts(mongo_metrics, redis_metrics):
    """Check metrics against thresholds and alert."""
    alerts = []

    # MongoDB checks
    if mongo_metrics.get('connections_current', 0) > 500:
        alerts.append(f"MongoDB connections high: {mongo_metrics['connections_current']}")

    if mongo_metrics.get('memory_resident_mb', 0) > 8000:  # 8 GB threshold
        alerts.append(f"MongoDB memory high: {mongo_metrics['memory_resident_mb']} MB")

    # Redis checks
    if redis_metrics.get('used_memory_mb', 0) > 32000:  # 32 GB threshold
        alerts.append(f"Redis memory high: {redis_metrics['used_memory_mb']} MB")

    if redis_metrics.get('evicted_keys', 0) > 1000000:  # 1M keys
        alerts.append(f"Redis evictions: {redis_metrics['evicted_keys']} keys")

    # Send alerts
    for alert in alerts:
        send_alert(alert)

def send_alert(message):
    """Send alert to Slack or email."""
    requests.post(
        'https://hooks.slack.com/services/YOUR/WEBHOOK',
        json={'text': f'Database Alert: {message}'},
        timeout=5,  # Do not let a slow webhook stall the monitoring loop
    )

# Run every minute
# Uses check_mongodb_health() and check_redis_health() from the monitoring sections
while True:
    mongo_healthy, mongo_metrics = check_mongodb_health()
    redis_healthy, redis_metrics = check_redis_health()

    if not mongo_healthy:
        send_alert('MongoDB is down!')

    if not redis_healthy:
        send_alert('Redis is down!')

    check_alerts(mongo_metrics, redis_metrics)
    time.sleep(60)

Key Takeaways

  • Monitor MongoDB and Redis health continuously: connection count, memory, replication lag, latency
  • Use replica sets for MongoDB and Sentinel for Redis to survive single-node failures
  • Shard MongoDB by a well-distributed key and use Redis Cluster for horizontal scale
  • Always use connection pooling and retry logic with exponential backoff for transient failures
  • Backup regularly and test recovery quarterly; define alerts and on-call playbooks before deploying to production
  • For small scale (< 100 GB, < 10,000 requests/sec), skip replication and sharding; add them only when needed

Frequently Asked Questions

How many replicas do I need?

Minimum 3 nodes for a replica set or Sentinel cluster to survive any single failure. For critical systems, use 5 nodes (survive 2 simultaneous failures). Always use odd numbers to avoid split-brain.
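
The numbers follow from majority voting: a new primary can only be elected while more than half of the voting members are reachable. A short worked calculation (the function names are illustrative):

```python
def majority(n):
    """Smallest number of voting members that forms a majority of n."""
    return n // 2 + 1

def tolerated_failures(n):
    """How many members can fail while a majority remains."""
    return n - majority(n)


for n in (3, 4, 5):
    print(f"{n} nodes: majority {majority(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 nodes: majority 2, tolerates 1 failure(s)
# 4 nodes: majority 3, tolerates 1 failure(s)
# 5 nodes: majority 3, tolerates 2 failure(s)
```

Four nodes tolerate no more failures than three, which is why an even member count adds cost without adding fault tolerance.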

When should I implement sharding?

Shard MongoDB when a single node cannot handle your data size or write throughput. Start unsharded; shard only when you hit limits. Redis sharding is simpler (use Redis Cluster from day 1 if you expect massive scale).

How do I upgrade MongoDB without downtime?

Rolling upgrade: one secondary at a time, then step down primary and upgrade it. Services connect via replica set, unaffected by individual node upgrades. Test upgrades in staging first.
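
The ordering can be derived from the member list that replSetGetStatus returns. The helper below is a minimal sketch with a hypothetical name (`upgrade_order`) and a hand-written member list; it only computes the order and performs no upgrade or step-down itself.

```python
def upgrade_order(members):
    """Return member names in rolling-upgrade order: secondaries first, primary last."""
    secondaries = [m['name'] for m in members if m['stateStr'] == 'SECONDARY']
    primaries = [m['name'] for m in members if m['stateStr'] == 'PRIMARY']
    return secondaries + primaries


# Hand-written stand-in for status['members']
members = [
    {'name': 'mongo1:27017', 'stateStr': 'PRIMARY'},
    {'name': 'mongo2:27017', 'stateStr': 'SECONDARY'},
    {'name': 'mongo3:27017', 'stateStr': 'SECONDARY'},
]
order = upgrade_order(members)
print(order)  # ['mongo2:27017', 'mongo3:27017', 'mongo1:27017']
```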

What is acceptable replication lag?

Ideally < 100 ms. Lags over 1 second indicate network problems or primary overload. Very large lags (minutes) risk data loss on primary failure.

How often should I backup?

Back up at least as often as your RPO (recovery point objective) requires. For critical data, take hourly backups; for less critical data, daily. Test recovery at least quarterly.
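
An RPO is only met if the newest usable backup is younger than the RPO at the moment of failure, which makes backup age a natural metric to alert on. A minimal sketch of that check (the `backup_within_rpo` name and the timestamps are illustrative):

```python
from datetime import datetime, timedelta, timezone

def backup_within_rpo(last_backup_at, rpo, now=None):
    """Return True if the most recent backup is no older than the RPO."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup_at <= rpo


now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
hourly_rpo = timedelta(hours=1)

# Backup finished 40 minutes ago: within a 1-hour RPO
print(backup_within_rpo(now - timedelta(minutes=40), hourly_rpo, now=now))  # True

# Backup finished 3 hours ago: the hourly job has been failing silently
print(backup_within_rpo(now - timedelta(hours=3), hourly_rpo, now=now))  # False
```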

Further Reading