Kubernetes Health Probes: Keep Python Apps Healthy

Health probes are Kubernetes mechanisms that continuously check whether your Python application is alive and ready to serve traffic. A liveness probe checks whether a container is still healthy; if it fails, Kubernetes restarts the container. A readiness probe checks whether the pod is ready to accept traffic; if it fails, Kubernetes temporarily removes the pod from Service endpoints. Together, they let Kubernetes recover from failures automatically and keep traffic away from unhealthy pods.

Understanding Liveness Probes and When Pods Restart

A liveness probe detects deadlocks, infinite loops, or hung processes in your Python application. If the probe fails repeatedly, Kubernetes assumes the container cannot recover on its own and restarts it. This is crucial for long-running processes that can enter a broken state even though the main process is technically still running.

I debugged a Python data pipeline that hung in a deadlock every few days due to a threading bug. Before liveness probes, the pod would hang silently, and team members would notice jobs were failing only the next morning. After adding a liveness probe that checked a timestamp endpoint, Kubernetes automatically restarted hung pods within seconds, and the impact became negligible.
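The timestamp check described above can be sketched as follows. This is a minimal illustration, not the original pipeline's code: the Heartbeat class, its method names, and the 60-second threshold are all hypothetical. The worker records a timestamp each time it makes progress, and the health check fails once that timestamp is too old.

```python
import threading
import time


class Heartbeat:
    """Tracks when the worker last made progress (hypothetical helper)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lock = threading.Lock()
        self._last_beat = clock()

    def beat(self):
        """Call from the worker loop after each unit of work."""
        with self._lock:
            self._last_beat = self._clock()

    def is_alive(self, max_age_seconds=60.0):
        """Return True if the last beat is at most max_age_seconds old."""
        with self._lock:
            return (self._clock() - self._last_beat) <= max_age_seconds
```

A health endpoint could then return 200 while is_alive() is true and a 5xx status otherwise, so a deadlocked worker eventually fails the liveness probe even though the web server thread still answers.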

Implementing HTTP-based Liveness Probes

The simplest liveness probe makes an HTTP request to a health endpoint in your Python application:

apiVersion: v1
kind: Pod
metadata:
  name: python-health-check-pod
spec:
  containers:
    - name: app
      image: my-registry/python-app:1.0.0
      ports:
        - containerPort: 8000
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
          scheme: HTTP
        initialDelaySeconds: 30
        periodSeconds: 10
        timeoutSeconds: 5
        failureThreshold: 3

This probe sends an HTTP GET request to /health every 10 seconds, starting 30 seconds after the container starts. If the probe fails 3 times in a row, Kubernetes restarts the container. Your Flask or FastAPI application should implement a lightweight /health endpoint:

from flask import Flask, jsonify
import time

app = Flask(__name__)
app_start_time = time.time()

@app.route("/health", methods=["GET"])
def health():
    """Liveness probe endpoint: check if app is responsive."""
    elapsed = time.time() - app_start_time
    return jsonify({
        "status": "healthy",
        "uptime_seconds": elapsed
    }), 200

@app.route("/api/data", methods=["GET"])
def get_data():
    return jsonify({"data": "example"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

The /health endpoint is intentionally minimal: it does not hit the database or perform expensive operations. It only verifies that the Python process can still respond to requests.

Understanding Readiness Probes and Service Traffic

A readiness probe checks if your Python application is ready to accept traffic. Unlike liveness, failing a readiness probe does not restart the pod; instead, Kubernetes removes the pod from the Service's endpoints so no new traffic is routed to it. This is useful during startup (when your app is initializing), during graceful shutdown, or if your app detects it cannot handle requests temporarily.

Here's a readiness probe that checks both the application and a dependency (like a database):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: python-app-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: python-app
  template:
    metadata:
      labels:
        app: python-app
    spec:
      containers:
        - name: app
          image: my-registry/python-app:1.0.0
          ports:
            - containerPort: 8000
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 1

Your Python application implements a separate /ready endpoint that checks readiness:

from flask import Flask, jsonify
import psycopg2

app = Flask(__name__)

def check_database_connection():
    """Check if database is reachable."""
    try:
        conn = psycopg2.connect(
            host="db-svc",
            port=5432,
            database="appdb",
            user="appuser",
            password="secret"
        )
        conn.close()
        return True
    except Exception as e:
        print(f"Database connection failed: {e}")
        return False

@app.route("/health", methods=["GET"])
def health():
    """Liveness: is the app running?"""
    return jsonify({"status": "healthy"}), 200

@app.route("/ready", methods=["GET"])
def ready():
    """Readiness: is the app ready to serve traffic?"""
    if check_database_connection():
        return jsonify({"status": "ready"}), 200
    else:
        return jsonify({"status": "not_ready", "reason": "database_unavailable"}), 503

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

If the database is down, /ready returns 503, and Kubernetes temporarily removes the pod from service. Once the database recovers, the next readiness check succeeds, and traffic flows to the pod again.

TCP Socket and Exec Probes for Non-HTTP Applications

Not all Python applications are HTTP servers. For applications using custom protocols or databases, use TCP socket probes or exec probes.

TCP Socket Probe

livenessProbe:
  tcpSocket:
    port: 5432
  initialDelaySeconds: 30
  periodSeconds: 10

This probe attempts a TCP connection to port 5432. If the connection succeeds, the pod is alive.
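To see roughly what a TCP socket probe checks, the same test can be reproduced from Python. This is only an approximation for local experiments, not Kubernetes' own implementation; the function name tcp_probe and its default timeout are assumptions.

```python
import socket


def tcp_probe(host, port, timeout_seconds=1.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

Note that a successful connection only shows that something is listening on the port; it says nothing about whether the application behind it can do useful work.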

Exec Probe

For applications where you can check health via a command, use an exec probe:

livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - python -c "import redis; r = redis.Redis(host='redis-svc', port=6379); r.ping()"
  initialDelaySeconds: 30
  periodSeconds: 10

This probe runs a command inside the container. If the command exits with code 0, the probe passes; otherwise, it fails.
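For a non-HTTP worker, one common pattern is an exec probe that checks the age of a heartbeat file the worker touches on every loop iteration. The sketch below assumes that pattern; the function names, the file path passed in, and the 60-second threshold are hypothetical.

```python
import os
import time


def heartbeat_is_fresh(path, max_age_seconds=60.0, now=None):
    """Return True if the file at path was modified within max_age_seconds."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return False  # A missing or unreadable file counts as unhealthy
    current = time.time() if now is None else now
    return (current - mtime) <= max_age_seconds


def probe_exit_code(path, max_age_seconds=60.0):
    """Map the check to an exit code: 0 passes the probe, 1 fails it."""
    return 0 if heartbeat_is_fresh(path, max_age_seconds) else 1
```

A small script run by the exec probe would pass the result of probe_exit_code to sys.exit, so the probe passes only while the worker keeps updating the file.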

Tuning Probe Parameters for Your Python Application

Probe parameters control timing and sensitivity:

Parameter            Default  Purpose
initialDelaySeconds  0        Delay before first probe (allows app startup)
periodSeconds        10       Interval between probes
timeoutSeconds       1        Timeout for each probe attempt
failureThreshold     3        Consecutive failures before action
successThreshold     1        Consecutive successes to mark healthy
For a Python Flask app that takes 20 seconds to start, set initialDelaySeconds: 30 to allow startup before the first probe. For a busy app, increase periodSeconds to reduce overhead. For critical apps, increase failureThreshold to 5 to tolerate transient failures.
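To get a feel for how these parameters interact, the helper below estimates how long a hung app can go unnoticed before Kubernetes decides to restart it. It is a rough back-of-the-envelope sketch: the function name is hypothetical, it assumes every probe against the hung app runs until it times out, and it ignores scheduling jitter and the restart itself.

```python
def liveness_detection_window(period_seconds=10, timeout_seconds=1, failure_threshold=3):
    """Estimate (best_case, worst_case) seconds from hang to restart decision.

    Best case: the app hangs just before a probe fires.
    Worst case: the app hangs just after a probe succeeded.
    """
    best = (failure_threshold - 1) * period_seconds + timeout_seconds
    worst = failure_threshold * period_seconds + timeout_seconds
    return best, worst
```

With periodSeconds: 10, timeoutSeconds: 5, and failureThreshold: 3, this gives an estimate of 25 to 35 seconds. Raising failureThreshold to 5 stretches it to roughly 45 to 55 seconds, which is the price of tolerating more transient failures.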

Graceful Shutdown: How Readiness Probes Support Deployments

During a rolling update, Kubernetes gradually replaces old pods with new ones. To minimize dropped requests, implement graceful shutdown: when Kubernetes terminates a pod, signal your app to stop accepting new requests but finish existing ones.

import signal
import sys
import time
from flask import Flask, jsonify

app = Flask(__name__)
is_shutting_down = False

def signal_handler(sig, frame):
    """On SIGTERM, start failing readiness, then exit after a short drain."""
    global is_shutting_down
    print("Shutdown signal received")
    is_shutting_down = True
    time.sleep(5)  # Wait for in-flight requests to finish
    sys.exit(0)

signal.signal(signal.SIGTERM, signal_handler)

@app.route("/api/data", methods=["GET"])
def get_data():
    return jsonify({"data": "example"}), 200

@app.route("/ready", methods=["GET"])
def ready():
    """Return 503 if shutting down; otherwise 200."""
    if is_shutting_down:
        return jsonify({"status": "not_ready"}), 503
    return jsonify({"status": "ready"}), 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

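When Kubernetes sends SIGTERM (the shutdown signal), your app sets is_shutting_down = True, so subsequent readiness probes fail. Kubernetes also removes a terminating pod from the Service's endpoints on its own, so the failing /ready response acts as an additional signal while that change propagates. In-flight requests should finish within the termination grace period (default 30 seconds); after that, Kubernetes force-kills the container with SIGKILL.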

Key Takeaways

  • Liveness probes detect crashed or hung processes and trigger restarts.
  • Readiness probes check if a pod is ready to accept traffic and guide Service routing.
  • HTTP-based probes are simplest for web applications; implement lightweight /health and /ready endpoints.
  • TCP socket and exec probes work for non-HTTP applications.
  • Tune initialDelaySeconds, periodSeconds, and failureThreshold based on your application's startup time and failure tolerance.

Frequently Asked Questions

What happens if a liveness probe is too aggressive?

If failureThreshold is too low or periodSeconds is too short, transient network hiccups cause unnecessary restarts. This leads to cascading failures and reduced availability. Start with conservative defaults (periodSeconds: 10, failureThreshold: 3) and adjust based on observed behavior.

Should I use both liveness and readiness probes?

Yes, they serve different purposes. Liveness probes restart dead pods; readiness probes prevent routing traffic to unready pods. A pod can be alive but not ready (e.g., initializing dependencies). Always use readiness; use liveness only if your app can hang (threads, async code).

Can my /health endpoint access the database?

For liveness probes, no—keep it lightweight. Database queries add latency and increase failure points. For readiness probes, yes—checking dependency health is appropriate. Separate them into /health (lightweight) and /ready (comprehensive).

What do I do if my Python app takes 5 minutes to start?

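You can set initialDelaySeconds to longer than your startup time (e.g., 360 seconds for a 5-minute startup), but a startup probe (available in Kubernetes 1.18+) handles this scenario better. Liveness and readiness probes are held off until the startup probe succeeds, and the app gets up to failureThreshold × periodSeconds to start. The example below allows 36 × 10 = 360 seconds, which leaves a margin over a 300-second startup:

startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 36
  periodSeconds: 10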
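The startup budget is failureThreshold × periodSeconds, so the threshold can be derived from a measured startup time. The helper below is a small sketch of that arithmetic; the function name and the 20% default margin are assumptions, not Kubernetes recommendations.

```python
def startup_failure_threshold(startup_seconds, period_seconds=10, margin_percent=20):
    """Smallest failureThreshold whose budget covers startup plus a margin."""
    # Work in integers (scaled by 100) to avoid floating-point rounding.
    needed = startup_seconds * (100 + margin_percent)
    budget_per_attempt = period_seconds * 100
    return -(-needed // budget_per_attempt)  # integer ceiling division
```

For a 300-second startup probed every 10 seconds, this suggests a failureThreshold of 36 with the 20% margin, or 30 with no margin at all.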

How do I test health probes locally?

Run the app in a local cluster such as Minikube, then use kubectl to inspect the probe configuration, probe results, and application logs:

kubectl get pod <pod-name> -o yaml  # See probe configuration
kubectl describe pod <pod-name> # See probe results and events
kubectl logs <pod-name> # See application logs
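Before deploying, it can also help to call the endpoints the way an HTTP probe would. The sketch below approximates the HTTP probe rule that any status from 200 to 399 counts as success; the function name http_probe is hypothetical, and unlike the kubelet this version follows redirects through urllib's defaults.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout_seconds=1.0):
    """Return True if a GET to url answers with a status from 200 to 399."""
    # Ignore proxy environment variables so local addresses are reached directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 400
    except urllib.error.HTTPError:
        return False  # 4xx or 5xx, for example 503 from /ready
    except (urllib.error.URLError, OSError):
        return False  # Connection refused, DNS failure, or timeout
```

With the app running locally, or reachable through kubectl port-forward, calling http_probe("http://localhost:8000/health") and http_probe("http://localhost:8000/ready") shows what each probe would currently report.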

Further Reading