Observability Best Practices: Logs, Metrics, Traces

Observability is not a single tool or technique; it is a system design philosophy that prioritizes making your application's internal state visible. The three pillars—logs (events), metrics (numbers), and traces (causality)—are not alternatives; they are complementary. Logs answer "what happened?", metrics answer "how often and how fast?", and traces answer "which operation caused this?" Integrating all three requires discipline: consistency in naming conventions, correlation IDs that link logs and traces, and sampling strategies that balance coverage with cost.

This article distills the practices that separate observable systems from opaque ones: the anti-patterns to avoid, the naming conventions that scale, the correlation strategies that work, and the architectural decisions that pay dividends in production. These practices apply to Python applications of all sizes, from single-file scripts to microservice platforms.

What Is the Relationship Between Logs, Metrics, and Traces?

The three pillars have distinct purposes and should not replace one another.

Logs record discrete events: "user logged in", "payment failed", "cache miss". They are high-detail (full message text, local variables) and low-volume (important events only). Use logs for debugging specific issues and auditing sensitive operations.

Metrics measure aggregate behavior: "95% of requests complete within 200ms", "error rate is 0.5%", "database connections in use: 42". They are low-detail (a single number per series) and cheap to keep at high volume, because each value is aggregated from thousands or millions of events rather than stored per event. Use metrics to understand system health and trigger alerts.

Traces record the causality of a request: which services handled it, in what order, how long each took. They are medium-detail (operations and their relationships) and medium-volume (sampled by request). Use traces to understand why a specific request was slow or failed.

Example: Payment processing fails after 5 seconds
Logs tell you: "Stripe API returned 503 Service Unavailable" (event at 14:23:45)
Metrics tell you: Error rate spiked to 5%; p99 latency went from 200ms to 5000ms
Traces tell you: Request entered API gateway -> order service -> payment service -> Stripe API (which timed out)

The best observability systems use all three to answer "What happened?", "Is it a problem?", and "Where did it fail?" respectively.
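To make the difference between events and aggregates concrete, the sketch below derives two metrics (error rate and p95 latency) from a handful of raw request events. The event fields, the sample values, and the nearest-rank percentile helper are illustrative assumptions, not the output of any particular monitoring library.

```python
import math

# Hypothetical raw events, one per request (the detail a log line would carry).
events = [
    {"path": "/orders", "status": 200, "duration_ms": 120},
    {"path": "/orders", "status": 200, "duration_ms": 95},
    {"path": "/orders", "status": 503, "duration_ms": 5000},
    {"path": "/orders", "status": 200, "duration_ms": 180},
    {"path": "/orders", "status": 200, "duration_ms": 150},
]

def error_rate(events):
    """Fraction of requests that ended with a 5xx status."""
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events)

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

durations = [e["duration_ms"] for e in events]
print(f"error rate: {error_rate(events):.0%}")
print(f"p95 latency: {percentile(durations, 95)} ms")
```

The two printed numbers say that something is wrong; only the individual event with status 503 says what went wrong, which is why the metric and the log are both needed.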

How Do You Correlate Logs and Traces With a Correlation ID?

A correlation ID (or trace ID) is a unique identifier that flows through an entire request, linking all logs and spans to a single transaction. When you grep for a correlation ID, you see every log and trace related to that request, enabling rapid incident diagnosis.

import logging
import uuid

from flask import Flask, request, g, has_app_context
from opentelemetry.propagate import extract

app = Flask(__name__)

@app.before_request
def setup_correlation():
    """Before each request, set up correlation ID."""
    # Reuse the ID from the request header if present (from upstream service)
    correlation_id = request.headers.get('X-Correlation-ID')
    if not correlation_id:
        correlation_id = str(uuid.uuid4())

    # Store in Flask's g object (request-scoped)
    g.correlation_id = correlation_id

    # Extract OpenTelemetry context (for distributed tracing)
    g.trace_context = extract(request.headers)

@app.after_request
def add_correlation_to_response(response):
    """Add correlation ID to response so client can report issues."""
    response.headers['X-Correlation-ID'] = g.correlation_id
    return response

class CorrelationIDFilter(logging.Filter):
    """Inject the current correlation ID into every log record."""
    def filter(self, record):
        # g raises RuntimeError outside an application context, so check first
        if has_app_context():
            record.correlation_id = g.get('correlation_id', 'unknown')
        else:
            record.correlation_id = 'unknown'
        return True

def get_logger():
    """Return the module logger with the correlation ID filter attached once."""
    logger = logging.getLogger(__name__)
    # Adding a new filter on every call would stack duplicates on the logger
    if not any(isinstance(f, CorrelationIDFilter) for f in logger.filters):
        logger.addFilter(CorrelationIDFilter())
    return logger

@app.route('/orders', methods=['POST'])
def create_order():
    log = get_logger()
    log.info("Order creation started")

    order_data = request.json

    try:
        # save_order and charge_card are application-specific helpers (not shown)
        order = save_order(order_data)
        log.info("Order saved", extra={'order_id': order['id']})

        charge = charge_card(order['id'], order['amount'])
        log.info("Payment succeeded", extra={'charge_id': charge['id']})
    except Exception:
        # log.exception logs at ERROR level and includes the stack trace
        log.exception("Order creation failed")
        raise

    return order

# Usage: if a customer reports an issue, grep the logs
# (assumes the log format includes correlation_id=%(correlation_id)s)
# grep 'correlation_id=abc123def456' app.log
# Output: all logs for that request

When customers report an issue, ask them for the correlation ID from the error message or API response. Then grep your logs for that ID and see the complete request lifecycle.
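Grepping by ID only works if the formatter actually writes the ID into each log line. The following is a minimal, framework-independent sketch of that wiring using only the standard library. The `correlation_id_var` context variable, the `build_logger` helper, the logger name, and the format string are assumptions made for illustration; in a Flask application the ID would come from the request rather than being set by hand.

```python
import contextvars
import io
import logging

# Holds the correlation ID for the current request or task (hypothetical name).
correlation_id_var = contextvars.ContextVar("correlation_id", default="unknown")

class CorrelationIDFilter(logging.Filter):
    """Copy the current correlation ID onto each log record."""
    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True

def build_logger(stream):
    """Return a logger whose output lines carry correlation_id=<id>."""
    handler = logging.StreamHandler(stream)
    # Attaching the filter to the handler covers every record it emits.
    handler.addFilter(CorrelationIDFilter())
    handler.setFormatter(logging.Formatter(
        "%(levelname)s correlation_id=%(correlation_id)s %(message)s"
    ))
    logger = logging.getLogger("correlation_demo")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

stream = io.StringIO()
log = build_logger(stream)

# Simulate the start of a request that arrived with this ID.
correlation_id_var.set("abc123def456")
log.info("Order creation started")
log.info("Order saved")

print(stream.getvalue(), end="")
```

Putting the filter on the handler rather than on a single logger is a design choice: records from any logger that reaches the handler get the field, so the format string never fails on a missing attribute.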

How Do You Avoid High-Cardinality Dimensions in Metrics?

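A dimension is a label or tag attached to a metric, and its cardinality is the number of distinct values it can take. Each distinct combination of label values creates a separate time series, so high cardinality (millions of distinct values) makes metrics expensive to store and query. Common mistakes: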

Using user ID as a label (millions of distinct values)
Using request ID as a label (unique per request)
Using IP address as a label (millions of distinct values)
Using full error messages as a label (infinite cardinality)

Instead, use bounded categories:

# BAD: High cardinality
request_counter.labels(
    user_id=request.user_id,  # Millions of values!
    request_id=request.id,    # Unique per request!
    error_message=str(e)      # Unbounded cardinality!
).inc()

# GOOD: Low cardinality
request_counter.labels(
    method=request.method,           # ~10 values (GET, POST, etc.)
    endpoint=request.endpoint,       # ~50 values (routes)
    status=response.status_code,     # ~20 values (2xx, 3xx, 4xx, 5xx)
    error_type=type(e).__name__      # ~10 values (ValueError, TimeoutError, etc.)
).inc()

For dimensions with unbounded cardinality (user ID, request ID, error message), emit them as log fields or span attributes, not metric labels.
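Label cardinalities multiply, so even bounded labels deserve a quick estimate. The sketch below computes the worst-case number of time series for one metric as the product of its label cardinalities, using the approximate counts from the comments in the example above. The `series_count` and `status_class` helpers and the one-million-user figure are assumptions for illustration.

```python
from math import prod

def series_count(label_cardinalities):
    """Upper bound on time series for one metric: the product of label cardinalities."""
    return prod(label_cardinalities.values())

def status_class(status_code):
    """Collapse an HTTP status code into a small set of classes, e.g. 503 -> '5xx'."""
    return f"{status_code // 100}xx"

# Approximate cardinalities of the bounded labels.
bounded = {"method": 10, "endpoint": 50, "status": 20, "error_type": 10}
print(series_count(bounded))      # 100000

# Bucketing status codes into five classes (1xx to 5xx) shrinks the bound.
bucketed = {**bounded, "status": 5}
print(series_count(bucketed))     # 25000

# Adding a user_id label with a hypothetical one million users.
with_user = {**bounded, "user_id": 1_000_000}
print(series_count(with_user))    # 100000000000

print(status_class(503))          # 5xx
```

The product is an upper bound: not every combination occurs in practice, but the estimate shows why a single unbounded label dominates everything else.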

What Is Sampling and When Should You Use It?

Sampling is the practice of keeping only a fraction of events. For high-volume systems, sampling is essential: storing every trace and every log line would be prohibitively expensive. Sampling strategies:

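Fixed-rate sampling: Sample a fixed fraction (for example, 1%) of all requests. Simple, but likely to miss rare slow or failing requests.
Error sampling: Always keep errors; sample N% of successful requests. Ensures errors are never missed.
Tail sampling: Decide after the request completes, based on its outcome (duration, error status). Can always capture slow or failed requests, at the cost of buffering spans until the decision is made.
Head sampling: Decide when the request starts, based on what is known up front (trace ID, endpoint, user ID). Cheap, and every service on the request path can reach the same decision, so sampled journeys stay complete.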
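The difference between head and tail sampling comes down to when the decision is made and what information is available at that moment. The sketch below illustrates both decisions with plain functions. The function names, the SHA-256 bucketing scheme, and the 1000 ms slow threshold are assumptions for illustration; this is not the algorithm used by OpenTelemetry's built-in samplers.

```python
import hashlib

def head_sample(trace_id: str, rate: float) -> bool:
    """Head sampling: decide from the trace ID alone, before the request runs.

    Hashing the ID makes the decision deterministic, so every service that
    sees the same trace ID reaches the same answer.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate

def tail_sample(status_code: int, duration_ms: float,
                slow_threshold_ms: float = 1000.0) -> bool:
    """Tail sampling: decide after the request finishes, keeping errors and slow requests."""
    return status_code >= 500 or duration_ms >= slow_threshold_ms

# Head decision: repeatable for a given trace ID.
print(head_sample("trace-0001", 0.1) == head_sample("trace-0001", 0.1))  # True

# Tail decisions: outcome-based.
print(tail_sample(503, 50))     # True  (error)
print(tail_sample(200, 5000))   # True  (slow)
print(tail_sample(200, 120))    # False (fast and successful)
```

A head decision needs no buffering but cannot know that a request will turn out slow; a tail decision knows the outcome but requires holding the request's spans until it completes.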

import random
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBasedTraceIdRatio

# Sample 10% of requests (head sampling, keyed on the trace ID)
sampler = ParentBasedTraceIdRatio(0.1)
tracer_provider = TracerProvider(sampler=sampler)

# Error-based sampling: always sample errors, 1% of success
def error_based_sampler(request_context):
    # The status code is only known once the request has finished,
    # so this decision runs at the tail of the request.
    if request_context.get('status_code', 0) >= 400:
        return True  # Always sample errors
    return random.random() < 0.01  # Sample 1% of success

# Application code: reduce log volume with per-module levels
import logging
def log_level_for(module_name):
    """Return DEBUG for critical modules and INFO for all others."""
    critical_modules = ['myapp.payment', 'myapp.auth']
    if module_name in critical_modules:
        return logging.DEBUG
    return logging.INFO

For a system processing 10,000 requests/second, sampling 10% yields 1,000 traces stored per second—a manageable volume. Without sampling, you would store 10,000 traces per second, which is expensive and unnecessary.
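The same arithmetic extends to error-based sampling, where errors are always kept and only successes are sampled. The sketch below is a back-of-the-envelope estimate; the helper name is hypothetical, and the 0.5% error rate and 1% success sample rate are example figures reused from earlier in this article, not measurements.

```python
def stored_traces_per_second(requests_per_second, error_rate, success_sample_rate):
    """Traces kept per second when every error is kept and successes are sampled."""
    errors = requests_per_second * error_rate
    successes = requests_per_second - errors
    return errors + successes * success_sample_rate

# Fixed-rate sampling at 10%: errors are sampled like everything else.
fixed = 10_000 * 0.10
print(round(fixed, 1))        # 1000.0

# Error-based sampling: 50 errors/s kept in full, plus 1% of 9,950 successes/s.
error_based = stored_traces_per_second(10_000, error_rate=0.005, success_sample_rate=0.01)
print(round(error_based, 1))  # 149.5
```

Under these assumptions, error-based sampling stores far fewer traces than a flat 10% while retaining every failed request, which is usually the trace you want during an incident.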

How Do You Design for Observability From the Start?

Observable systems are designed with observability in mind. Anti-patterns to avoid:

Anti-Pattern	Problem	Solution
Logging sensitive data	PII leaks; security risk	Redact passwords, tokens, API keys before logging
Logs with zero context	"Error occurred" tells you nothing	Include relevant fields: user_id, resource_id, operation type
No request correlation	Cannot tie logs across services	Propagate correlation ID in HTTP headers
Metric labels with user IDs	High cardinality; expensive storage	Use bounded segments (plan tier, region); keep user IDs in logs and span attributes
Ignoring exceptions	Silent failures; hard to debug	Always log/trace exceptions with full stack traces
No version tracking	Cannot correlate errors to releases	Tag logs, traces, metrics with release/version
One monolithic logger	Cannot enable debugging per module	Use hierarchical logger names (logging.getLogger(__name__))
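The first row of the table, redacting secrets before they are logged, can be enforced centrally with a logging filter instead of relying on every call site. The following is a minimal sketch under stated assumptions: the `RedactingFilter` class, the `key=value` message shape, and the list of sensitive keys are hypothetical and would need to match the secrets your system actually handles.

```python
import io
import logging
import re

# Hypothetical pattern: masks the value in password=..., token=..., api_key=...
SENSITIVE_PATTERN = re.compile(r"(?i)\b(password|token|api_key)=(\S+)")

class RedactingFilter(logging.Filter):
    """Mask values of sensitive key=value pairs before the record is emitted."""
    def filter(self, record):
        message = record.getMessage()  # merge %-style args into the message first
        record.msg = SENSITIVE_PATTERN.sub(r"\1=[REDACTED]", message)
        record.args = None             # args are already merged into msg
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())

logger = logging.getLogger("redaction_demo")
logger.handlers = [handler]
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("Login attempt user=%s password=%s", "alice", "hunter2")
print(stream.getvalue(), end="")
```

A pattern-based filter is a safety net rather than a guarantee: it only catches secrets that appear in the shapes it knows about, so avoiding sensitive values at the call site remains the primary defense.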

# Example: Observable payment function
import logging

import stripe  # third-party Stripe client; assumed to be installed and configured
from opentelemetry import trace

logger = logging.getLogger(__name__)
tracer = trace.get_tracer(__name__)

def process_payment(user_id, amount, currency='USD'):
    """Process a payment with full observability."""
    # Span for the entire operation
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("user_id", user_id)
        span.set_attribute("amount", amount)
        span.set_attribute("currency", currency)
        
        logger.info(
            "Payment processing started",
            extra={'user_id': user_id, 'amount': amount}
        )
        
        try:
            # Validate inputs
            with tracer.start_as_current_span("validate_payment") as val_span:
                if amount <= 0:
                    logger.warning(
                        "Invalid payment amount",
                        extra={'user_id': user_id, 'amount': amount}
                    )
                    val_span.set_attribute("valid", False)
                    raise ValueError(f"Amount must be positive, got {amount}")
                val_span.set_attribute("valid", True)
            
            # Charge card
            with tracer.start_as_current_span("stripe_charge") as charge_span:
                charge = stripe.Charge.create(amount=amount, currency=currency)
                charge_span.set_attribute("charge_id", charge['id'])
            
            logger.info(
                "Payment succeeded",
                extra={'user_id': user_id, 'charge_id': charge['id']}
            )
            return charge
        
        except stripe.CardError as e:
            logger.warning(
                "Card declined",
                extra={'user_id': user_id, 'error_type': 'CardError', 'error_message': str(e)}
            )
            span.record_exception(e)
            raise
        
        except Exception as e:
            # logger.exception logs at ERROR level and includes the stack trace
            logger.exception(
                "Unexpected error in payment processing",
                extra={'user_id': user_id, 'error_type': type(e).__name__}
            )
            span.record_exception(e)
            raise

Key Takeaways

Logs, metrics, and traces answer different questions; use all three together.
Correlation IDs link logs and traces across service boundaries.
Avoid high-cardinality metric dimensions (user IDs, request IDs, full error messages).
Sampling is essential for cost management in high-volume systems.
Design applications for observability from the start.

Frequently Asked Questions

How much should I log in production?

Log operational events (startup, shutdown, user actions) as INFO. Log failures and recoveries (retries, timeouts) as WARNING. Log errors as ERROR. Debug-level logs should be disabled in production (enable on-demand per module).

Should I store logs in a database or files?

For production, use a log aggregation service (Elasticsearch, Datadog, CloudWatch) that indexes logs for search and analysis. Local files are sufficient for development and testing.

How do I sample logs without missing important errors?

Use error-based sampling: always log errors, sample success paths. Or set different levels per logger: DEBUG for critical modules, INFO for others.
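The per-logger option relies on the logger hierarchy in Python's standard `logging` module: a logger with no level of its own inherits the effective level of its nearest configured ancestor. The sketch below shows this with hypothetical module names under a `myapp` package.

```python
import logging

# Default for the whole application: INFO and above.
logging.getLogger("myapp").setLevel(logging.INFO)

# One critical module gets DEBUG; everything beneath it inherits that level.
logging.getLogger("myapp.payment").setLevel(logging.DEBUG)

payment_log = logging.getLogger("myapp.payment.stripe")  # inherits DEBUG
catalog_log = logging.getLogger("myapp.catalog")         # inherits INFO

print(payment_log.isEnabledFor(logging.DEBUG))  # True
print(catalog_log.isEnabledFor(logging.DEBUG))  # False
print(catalog_log.isEnabledFor(logging.ERROR))  # True
```

Because errors are at or above every configured level, they are always emitted; only the verbose DEBUG output is limited to the modules where it is needed.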

Can I correlate metrics to a specific user or request?

No. Metrics are aggregate numbers, and adding per-user or per-request labels would create unbounded cardinality. If you need per-user or per-request debugging, use logs or traces, not metrics.

What is the recommended retention for logs, metrics, and traces?

As a starting point: general logs for 30 days (a shorter window keeps search cheap), error logs for 90 days (for trend analysis), metrics for 1-2 years (low storage cost), and traces for 7-30 days (high storage cost).
