Debugging & Monitoring Python Apps in Kubernetes

Debugging Python applications in Kubernetes is fundamentally different from debugging locally. Pods are ephemeral, logs are distributed across the cluster, and you cannot attach a debugger directly to a remote process without special tooling. This guide covers essential debugging techniques: viewing logs, executing commands in containers, port-forwarding, monitoring metrics, and analyzing performance issues in production.

Accessing Pod Logs: Where Your Python Errors Appear

Kubernetes stores pod output (stdout and stderr from your Python application) in pod logs. Access them with kubectl logs:

# View logs from a single pod
kubectl logs <pod-name>

# View logs from all pods matching a label (with a selector, only the last 10 lines per pod are shown unless you pass --tail)
kubectl logs -l app=python-app

# Stream logs in real-time (like tail -f)
kubectl logs -f <pod-name>

# View logs from a previous instance (if pod was restarted)
kubectl logs <pod-name> --previous

# View last 100 lines
kubectl logs <pod-name> --tail=100

Ensure your Python application logs to stdout or stderr; Kubernetes captures both streams. Python's logging module and Flask write to stderr by default, so set the stream explicitly if you want everything on stdout:

import logging
import sys

# Configure logging to stdout
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    stream=sys.stdout,  # Ensure logs go to stdout
)

logger = logging.getLogger(__name__)

logger.info("Application started")
logger.error("An error occurred")

If your app logs to a file inside the container instead of stdout, kubectl logs never sees those lines, and the file disappears with the container when the pod terminates. Always log to stdout for Kubernetes-native logging.
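Since this section is about where Python errors show up, it may help to see how a caught exception ends up in the pod log with its traceback. The following is a minimal sketch, not the only way to configure logging; `configure_logging` and `risky_operation` are hypothetical names introduced for illustration.

```python
import logging
import sys


def configure_logging(stream=None, level=logging.INFO):
    """Attach one stream handler to the root logger and return it."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    root = logging.getLogger()
    root.handlers.clear()  # avoid duplicate lines if this is called twice
    root.addHandler(handler)
    root.setLevel(level)
    return handler


def risky_operation():
    # Hypothetical failing call, used only to produce a traceback.
    return 1 / 0


configure_logging()
logger = logging.getLogger("python-app")
try:
    risky_operation()
except ZeroDivisionError:
    # logger.exception logs at ERROR level and appends the traceback,
    # so the full stack appears in `kubectl logs`.
    logger.exception("risky_operation failed")
```

Using `logger.exception` inside the `except` block keeps the message and the stack trace together in the log stream, which makes them easier to find with `kubectl logs --previous` after a restart.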

Executing Commands in Pods: Interactive Debugging

Use kubectl exec to run commands inside a running pod, similar to docker exec:

# Run a Python command in the pod
kubectl exec <pod-name> -- python -c "import sys; print(sys.version)"

# Start an interactive shell in the pod (use /bin/sh if the image does not ship bash)
kubectl exec -it <pod-name> -- /bin/bash

# Run a Python script
kubectl exec <pod-name> -- python /app/debug_script.py

# Check environment variables
kubectl exec <pod-name> -- printenv

I debugged a Python web application stuck in a loop by exec-ing into the pod and inspecting the process state:

kubectl exec -it python-app-abc123 -- /bin/bash

# Inside the container
ps aux # See running processes
netstat -tlnp # Check listening ports
python -c "import my_app; print(my_app.__file__)" # Verify correct module is loaded

This is invaluable for understanding why a pod is misbehaving without redeploying or changing code.
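The commands above reference a script at /app/debug_script.py without showing one. As a rough sketch of what such a script might contain (the function name, the `APP_` and `DATABASE_` prefixes, and the module list are all assumptions for illustration), it can report the interpreter, which expected environment variables are present, and where modules are loaded from:

```python
import importlib.util
import os
import platform
import sys


def collect_debug_info(env, module_names, env_prefixes=("APP_", "DATABASE_")):
    """Return a dict describing the interpreter, selected env vars, and modules."""
    info = {
        "python_version": platform.python_version(),
        "executable": sys.executable,
        # Report variable names only, so secret values are never printed.
        "env_vars_present": sorted(
            key for key in env if key.startswith(tuple(env_prefixes))
        ),
        "modules": {},
    }
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            spec = None
        # origin is the file the module would be imported from, or None if missing.
        info["modules"][name] = spec.origin if spec is not None else None
    return info


if __name__ == "__main__":
    import json

    report = collect_debug_info(os.environ, ["flask", "my_app"])
    print(json.dumps(report, indent=2))
```

Run inside the pod with kubectl exec, a missing module shows up as null, which is usually quicker to read than an ImportError buried in application logs.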

Port Forwarding: Access Services Locally

Port forwarding tunnels traffic from your local machine to a pod or Service in the cluster. This allows you to interact with your Python application as if it were running locally:

# Forward local port 8000 to pod port 8000
kubectl port-forward <pod-name> 8000:8000

# Forward to a Service (recommended)
kubectl port-forward svc/python-app-service 8000:8000

# Forward to multiple ports
kubectl port-forward <pod-name> 8000:8000 5432:5432

# Use background mode
kubectl port-forward svc/python-app-service 8000:8000 &

Once port-forwarded, access the pod from your local machine:

curl http://localhost:8000/api/data
python -c "import requests; print(requests.get('http://localhost:8000/api/data').json())"

This is essential for testing or debugging services that are not exposed externally (Services with ClusterIP type).
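When the port-forward has only just started, the first request can fail because nothing is listening yet. The sketch below uses only the standard library to retry a request and report the status code and latency; the `probe` helper and its defaults are hypothetical, and the URL matches the one used in the curl example above.

```python
import time
import urllib.error
import urllib.request


def probe(url, attempts=3, timeout=2.0, delay=0.5):
    """Return (status_code, seconds) for the first response, or (None, None)."""
    for attempt in range(attempts):
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status, time.perf_counter() - start
        except urllib.error.HTTPError as exc:
            # The server answered, just with a 4xx/5xx status.
            return exc.code, time.perf_counter() - start
        except (urllib.error.URLError, OSError):
            # Nothing is listening yet, e.g. the port-forward is still starting.
            if attempt < attempts - 1:
                time.sleep(delay)
    return None, None
```

With a port-forward running, `probe("http://localhost:8000/api/data")` returns the HTTP status and the elapsed time in seconds; a result of `(None, None)` suggests the tunnel is down rather than the application failing.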

Understanding Pod Lifecycle Events and Debugging

Kubernetes records events for every pod state change. View events to understand why a pod failed:

# View events for a specific pod
kubectl describe pod <pod-name>

# View all events in the default namespace (use -A instead of -n default for every namespace)
kubectl get events -n default --sort-by='.lastTimestamp'

# Watch events in real-time
kubectl get events -n default --watch

The describe output shows:

  • Pod status (Pending, Running, CrashLoopBackOff, etc.)
  • Recent events with timestamps and messages
  • Resource requests and limits
  • Mounted volumes

Common events and what they mean:

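| Event | Meaning |
|---|---|
| Pending | Pod accepted by the cluster but not yet running: it is waiting to be scheduled onto a node or for its images to be pulled |
| ImagePullBackOff | Image pull failed (image not found in the registry, registry unreachable, or missing credentials); Kubernetes is retrying with backoff |
| CrashLoopBackOff | Container keeps crashing and restarting; Kubernetes waits longer between each restart |
| OOMKilled | Container exceeded its memory limit and was killed |
| Evicted | Node ran out of resources; pod was forcefully terminated |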
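Reasons such as CrashLoopBackOff and OOMKilled are also available in machine-readable form under `status.containerStatuses` in the output of kubectl get pod <pod-name> -o json. The following sketch extracts them; `container_problems` is a hypothetical helper and `SAMPLE_POD` is a hand-written, heavily trimmed example rather than real cluster output.

```python
def container_problems(pod):
    """Yield (container_name, kind, reason) for waiting or terminated containers."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            yield cs["name"], "waiting", waiting.get("reason", "Unknown")
        last = cs.get("lastState", {}).get("terminated")
        if last:
            yield cs["name"], "last_terminated", last.get("reason", "Unknown")


# Hand-written sample shaped like `kubectl get pod -o json` output (trimmed).
SAMPLE_POD = {
    "metadata": {"name": "python-app-abc123"},
    "status": {
        "phase": "Running",
        "containerStatuses": [
            {
                "name": "python-app",
                "restartCount": 7,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ],
    },
}

for name, kind, reason in container_problems(SAMPLE_POD):
    print(f"{name}: {kind} -> {reason}")
```

For a real pod, the dictionary could be loaded with `json.load(sys.stdin)` after piping the kubectl output into the script. Seeing OOMKilled as the last termination reason next to CrashLoopBackOff points at the memory limit rather than at an application exception.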

Monitoring Resource Usage: CPU and Memory

Use kubectl top to see current CPU and memory usage (this requires the Metrics Server to be running in the cluster):

# CPU and memory for all pods
kubectl top pods -n default

# For a specific pod
kubectl top pod <pod-name>

# Refresh periodically (kubectl top has no --watch flag, so wrap it in the watch utility)
watch kubectl top pods -n default

# Include node resource usage
kubectl top nodes

Output shows actual usage (e.g., 50m CPU, 128Mi memory). kubectl top does not display requests or limits, so compare these numbers against the values reported by kubectl describe pod. If actual usage is near the limit, the pod risks being throttled (CPU) or killed (memory).
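To make "near the limit" concrete, the quantities can be converted to plain numbers and divided. This is a small sketch that handles the common suffixes only; the helper names are hypothetical, and the 200m and 256Mi limits are made-up values for the arithmetic.

```python
# Multipliers for common Kubernetes memory suffixes (binary and decimal).
_MEMORY_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_cpu(quantity):
    """Convert a CPU quantity such as '50m' or '0.5' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores to cores
    return float(quantity)


def parse_memory(quantity):
    """Convert a memory quantity such as '128Mi' to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    return int(quantity)  # no suffix means plain bytes


# Usage from `kubectl top`, limits from `kubectl describe pod` (hypothetical values).
cpu_ratio = parse_cpu("50m") / parse_cpu("200m")
mem_ratio = parse_memory("128Mi") / parse_memory("256Mi")
print(f"CPU: {cpu_ratio:.0%} of limit")     # CPU: 25% of limit
print(f"Memory: {mem_ratio:.0%} of limit")  # Memory: 50% of limit
```

A memory ratio that keeps climbing toward 100% across samples is the pattern that typically precedes an OOMKilled container.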

Setting Up Prometheus for Metrics Collection

For comprehensive monitoring, integrate Prometheus to collect application metrics. Add Prometheus Python client to your Flask app:

from flask import Flask, g, request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = Flask(__name__)

# Define metrics
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

@app.before_request
def before_request():
    # Store the start time on flask.g, which is scoped to the current request
    g.start_time = time.time()

@app.after_request
def after_request(response):
    duration = time.time() - g.start_time

    # Count the request, labeled by method, endpoint, and status code
    request_count.labels(
        method=request.method,
        endpoint=request.endpoint,
        status=response.status_code
    ).inc()

    # Record how long the request took
    request_duration.labels(
        method=request.method,
        endpoint=request.endpoint
    ).observe(duration)

    return response

@app.route("/metrics", methods=["GET"])
def metrics():
    """Prometheus metrics endpoint."""
    return generate_latest(), 200, {"Content-Type": CONTENT_TYPE_LATEST}

@app.route("/api/data", methods=["GET"])
def get_data():
    return {"data": "example"}, 200

Configure Prometheus to scrape your application:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'python-app'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods labeled app=python-app
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: python-app
          # Keep only the container port that serves /metrics (must match your containerPort)
          - source_labels: [__meta_kubernetes_pod_container_port_number]
            action: keep
            regex: "8001"

Once Prometheus is scraping, you can query metrics using its web UI or Grafana for visualization. Common Python application metrics to monitor:

  • Request rate (requests per second)
  • Error rate (% of requests that fail)
  • Latency (p50, p95, p99 response time)
  • Memory usage (resident memory of the Python process; on Linux, prometheus_client exports it as process_resident_memory_bytes)
  • Cache hit rate (if applicable)
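Latency percentiles such as p95 are not stored directly in a Prometheus histogram; they are estimated from cumulative bucket counts. The sketch below is a simplified version of the linear interpolation that the PromQL function histogram_quantile performs, written in plain Python to show the arithmetic. The bucket boundaries and counts are invented for the example.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by upper_bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Beyond the largest finite bucket: report that bucket's bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented data: 100 requests, 50 under 0.1s, 90 under 0.5s, all under 1s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
print(f"p95 is roughly {estimate_quantile(0.95, buckets):.2f}s")  # 0.75s
```

Because the estimate assumes an even spread inside each bucket, its accuracy depends on choosing bucket boundaries close to the latencies you care about.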

Tracing: Distributed Request Tracking

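For debugging complex interactions between microservices, use distributed tracing with Jaeger or Zipkin. Install the OpenTelemetry SDK and a Jaeger exporter (this exporter package is deprecated in recent OpenTelemetry releases in favor of OTLP, which current Jaeger versions accept directly, so pin versions accordingly):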

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-jaeger

Instrument your Flask app:

from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

app = Flask(__name__)

# Configure Jaeger
jaeger_exporter = JaegerExporter(
    agent_host_name="jaeger-collector",
    agent_port=6831,
)

trace.set_tracer_provider(TracerProvider(
    resource=Resource.create({SERVICE_NAME: "python-api"})
))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(jaeger_exporter)
)

tracer = trace.get_tracer(__name__)

@app.route("/api/data", methods=["GET"])
def get_data():
    with tracer.start_as_current_span("fetch_data"):
        # Your code here
        return {"data": "example"}, 200

Jaeger traces requests end-to-end across multiple services, showing latency at each step and identifying bottlenecks.

Common Debugging Scenarios and Solutions

Scenario 1: Pod is stuck in Pending state

# Check scheduler events
kubectl describe pod <pod-name>

# Likely causes: insufficient resources, node selector mismatch, image pull issue
# Solutions: increase node capacity, verify resource requests, check image availability

Scenario 2: Application errors but pod shows Running

# Check logs
kubectl logs <pod-name>

# Errors in logs indicate application-level issues, not Kubernetes issues
# Verify the service is responding
kubectl port-forward <pod-name> 8000:8000
curl http://localhost:8000/api/data

Scenario 3: High memory usage and OOM kills

# Check memory requests and limits
kubectl describe pod <pod-name> | grep -A 5 "Limits\|Requests"

# Monitor actual usage
kubectl top pod <pod-name>

# If approaching limit, increase it or fix memory leak in the app

Key Takeaways

  • Always log to stdout so Kubernetes captures logs. Access them with kubectl logs.
  • Use kubectl exec to interactively debug running pods and inspect state.
  • Use kubectl port-forward to access Services locally for testing.
  • Use kubectl describe to view events and understand pod state changes.
  • Use kubectl top to monitor real-time CPU and memory usage.
  • Integrate Prometheus for application-level metrics and Jaeger for distributed tracing.

Frequently Asked Questions

How do I debug a Python application that crashes immediately?

Check logs: kubectl logs <pod-name>. If logs are empty or truncated, check the previous instance: kubectl logs <pod-name> --previous. If the pod is in CrashLoopBackOff, inspect events: kubectl describe pod <pod-name>. Common causes: missing dependencies, incorrect environment variables, or database connectivity issues.

Can I attach a Python debugger (pdb) to a Kubernetes pod?

Not directly in production pods. For development, you can use kubectl port-forward to expose a remote debugging port and attach your IDE. Example: use debugpy in your Python app, port-forward the debug port, and attach VS Code. However, this is not recommended for production.
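One way to keep the debugpy approach out of production is to start the listener only when an environment variable asks for it. The sketch below assumes the debugpy package is installed in the development image; the variable names `DEBUGPY_ENABLED`, `DEBUGPY_PORT`, and `DEBUGPY_WAIT` are made up for this example.

```python
import os


def maybe_enable_debugger(env=os.environ):
    """Start a debugpy listener only when DEBUGPY_ENABLED=1; return True if started."""
    if env.get("DEBUGPY_ENABLED") != "1":
        return False
    import debugpy  # imported lazily so production images need not ship it

    port = int(env.get("DEBUGPY_PORT", "5678"))
    # Loopback is assumed to be enough because kubectl port-forward connects
    # from inside the pod's network namespace; adjust if your setup differs.
    debugpy.listen(("127.0.0.1", port))
    if env.get("DEBUGPY_WAIT") == "1":
        debugpy.wait_for_client()  # block until the IDE attaches
    return True
```

Call `maybe_enable_debugger()` early in application startup, then run kubectl port-forward <pod-name> 5678:5678 and attach the IDE to localhost:5678. With the variable unset, the function returns immediately and debugpy is never imported.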

How do I check if a pod has access to a Kubernetes Secret?

Exec into the pod and check: kubectl exec <pod-name> -- env | grep SECRET. Secrets mounted as files can be inspected: kubectl exec <pod-name> -- cat /etc/secrets/secret-file.
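Printing secret values with env or cat can leak them into terminal history or logs. As an alternative, a small script can report only whether each expected secret is present. This is a sketch; `check_secret_sources` is a hypothetical helper, and the variable name and file path in the usage note are placeholders.

```python
import os
from pathlib import Path


def check_secret_sources(env_names, file_paths, env=os.environ):
    """Report presence (never the value) of each expected secret."""
    report = {}
    for name in env_names:
        # True only if the variable is set and non-empty.
        report[f"env:{name}"] = bool(env.get(name))
    for path in file_paths:
        candidate = Path(path)
        # True only if the mounted file exists and has content.
        report[f"file:{path}"] = candidate.is_file() and candidate.stat().st_size > 0
    return report
```

For example, `check_secret_sources(["DATABASE_PASSWORD"], ["/etc/secrets/secret-file"])` returns a dictionary of booleans that is safe to print or log from inside the pod.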

What is the best way to persist logs from crashed pods?

By default, logs from terminated pods are lost once the pod is deleted (kubectl logs --previous works only while the terminated container is still retained on the node). Use a centralized logging system (ELK Stack, Splunk, Google Cloud Logging) to collect and persist logs. Configure your Python app to send logs to stdout; a node-level agent (typically run as a DaemonSet) or a sidecar container then ships them to the centralized system.

How do I identify which pod replica is causing high latency?

If you have tracing (Jaeger), traces show per-pod latency. If using Prometheus, attach a pod label to the scraped metrics (for example through relabeling) and filter on it: http_request_duration_seconds_bucket{pod="<pod-name>"}. Otherwise, port-forward to individual pods and test them directly.

Further Reading