OpenTelemetry Python: End-to-End Tracing

OpenTelemetry is an open standard and a vendor-neutral instrumentation framework for collecting telemetry data (traces, metrics, and logs) from your applications. You instrument your code once and send the data to any supported backend (Jaeger, Datadog, CloudWatch, New Relic) by changing exporter configuration rather than application code. A trace is a record of a request's journey through your system: it starts in service A, which calls service B, which queries a database and then calls service C. OpenTelemetry breaks this journey into spans (units of work), captures their parent-child relationships, and exports the resulting graph to a backend for visualization and analysis.

Distributed tracing answers questions that logs alone cannot: "Why was this request slow?" (see the call tree and each operation's duration), "Which service failed?" (see the span that returned an error), "Did this data flow correctly across services?" (follow the trace ID through the system). For a microservice architecture, distributed tracing is non-negotiable. For a monolith, it is still valuable for understanding performance bottlenecks.

What Is a Span and How Does It Differ from a Log?

A span is a named, timed unit of work. It has a start time, end time, a name, and optional attributes (key-value metadata). A trace is a collection of related spans that together record the path of a single request through the system. A log is a text message with a timestamp and level; multiple logs might be emitted during a span.

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# query_database() and call_stripe_api() are placeholders for your own functions.
def process_payment(user_id, amount):
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("user_id", user_id)
        span.set_attribute("amount", amount)

        # Nested span for database call
        with tracer.start_as_current_span("db_query") as db_span:
            db_span.set_attribute("query", "SELECT * FROM orders")
            result = query_database("SELECT * FROM orders")
            db_span.set_attribute("rows_returned", len(result))

        # Nested span for external API call
        with tracer.start_as_current_span("stripe_charge") as stripe_span:
            stripe_span.set_attribute("api_version", "2023-10-16")
            charge = call_stripe_api(amount)
            stripe_span.set_attribute("charge_id", charge['id'])

        return charge

# Visual output (displayed by Jaeger):
# process_payment [0ms - 150ms]
#   ├─ db_query [5ms - 35ms]
#   └─ stripe_charge [40ms - 145ms]

The span captures structure (nesting), timing, and attributes. Logs are emitted during execution and appear separately. Spans are purpose-built for tracing dependencies and performance; logs are for diagnostic detail.

How Do You Set Up OpenTelemetry in Python?

OpenTelemetry requires an SDK (to generate spans), exporters (to send spans to a backend), and instrumentation (to integrate with libraries like Flask, requests, database drivers). Install the base packages:

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-jaeger opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests

Then initialize OpenTelemetry at application startup:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.resources import SERVICE_NAME, Resource

# Define the service
resource = Resource.create({SERVICE_NAME: "my-python-app"})

# Create and configure the tracer provider
jaeger_exporter = JaegerExporter(
    agent_host_name='localhost',
    agent_port=6831
)
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(BatchSpanProcessor(jaeger_exporter))
trace.set_tracer_provider(trace_provider)

# Get a tracer
tracer = trace.get_tracer(__name__)

# Now emit spans
with tracer.start_as_current_span("startup") as span:
    span.set_attribute("config", "loaded")
    print("Application started")

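This configuration sends spans over UDP to a Jaeger agent listening on localhost:6831. In production, point the exporter at your own agent or collector address. Note that the opentelemetry-exporter-jaeger package has been deprecated upstream; recent Jaeger versions accept OTLP directly, so new projects generally use the opentelemetry-exporter-otlp package instead.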

How Do You Instrument a Flask Application?

OpenTelemetry provides auto-instrumentation for Flask: once enabled, it creates a span for every incoming HTTP request without changes to your route handlers.

import requests
from flask import Flask, request
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Auto-instrument Flask and requests library
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

# Get the tracer for custom spans
tracer = trace.get_tracer(__name__)

@app.route('/order', methods=['POST'])
def create_order():
    # Flask auto-creates a span named 'POST /order'
    # Add custom spans inside
    order_data = request.json
    
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order_id", order_data['id'])
        if not is_valid(order_data):
            span.set_attribute("valid", False)
            return {'error': 'Invalid order'}, 400
        span.set_attribute("valid", True)
    
    with tracer.start_as_current_span("save_to_database") as span:
        order = save_order(order_data)
        span.set_attribute("saved_id", order['id'])
    
    with tracer.start_as_current_span("notify_warehouse") as span:
        response = requests.post(
            'http://warehouse-api/orders',
            json=order
        )
        span.set_attribute("response_status", response.status_code)
    
    return {'order_id': order['id'], 'status': 'created'}, 201

if __name__ == '__main__':
    app.run()

When you make a POST request to /order, OpenTelemetry:

Creates a root span for the HTTP request (auto)
Captures the child spans you created (validate_order, save_to_database, notify_warehouse)
Records timing and attributes
Sends the complete trace graph to Jaeger

You view the trace in Jaeger's UI (http://localhost:16686) and see a waterfall showing all operations and their durations.

How Do You Add Context Propagation for Distributed Traces?

When your service calls another service, the trace context (the trace ID and the parent span ID) must travel with the request so that the downstream service's spans join the same trace. The context is carried in HTTP headers. Instrumented libraries such as requests add these headers automatically; with an uninstrumented client, you inject them yourself:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

def call_downstream_service(user_id):
    with tracer.start_as_current_span("call_user_service") as span:
        span.set_attribute("user_id", user_id)
        
        # Create headers with trace context
        headers = {}
        inject(headers)  # Adds the traceparent header (plus tracestate, when set)
        
        # The downstream service will receive these headers and continue the trace
        response = requests.get(
            f'http://user-service/users/{user_id}',
            headers=headers
        )
        span.set_attribute("response_status", response.status_code)
        return response.json()

When the user-service receives the request, its own instrumentation extracts the trace context from the headers and continues the same trace. The result is a unified trace tree showing both services' work.

Trace ID: abc123xyz
  └─ service-a: POST /order [0ms - 200ms]
     ├─ validate_order [5ms - 15ms]
     └─ call_user_service [20ms - 195ms]  (HTTP call to service-b)
        └─ service-b: GET /users/42 [20ms - 195ms]
           ├─ db_query [25ms - 80ms]
           └─ cache_set [85ms - 190ms]

RequestsInstrumentor auto-injects headers, so in most cases you need not manually call inject().

How Do You Record Exceptions in Spans?

When an exception occurs in a span, record it so the backend knows the span failed:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def risky_operation():
    with tracer.start_as_current_span("risky_operation") as span:
        try:
            result = do_something_risky()
            span.set_attribute("outcome", "success")
            return result
        except ValueError as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            raise
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, "Internal error"))
            raise

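The backend displays the exception in the trace and marks the span as failed, making it easy to see which operation caused a trace to fail. Note that start_as_current_span() by default already records an exception that escapes the with block and sets the span status to ERROR, so recording manually and then re-raising may produce a duplicate exception event. Explicit calls matter most when you catch and handle the exception yourself, or when you want a custom status description.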

Key Takeaways

OpenTelemetry is the standard for distributed tracing, metrics, and logs.
A span is a named unit of work; a trace is a collection of related spans.
Auto-instrumentation (Flask, requests) creates spans without changes to your handler code.
Add custom spans with tracer.start_as_current_span().
Propagate trace context across service boundaries via HTTP headers.

Frequently Asked Questions

What is the difference between OpenTelemetry and Jaeger?

OpenTelemetry is an instrumentation API and SDK. Jaeger is a backend that stores and displays traces. You use OpenTelemetry to emit spans and Jaeger to visualize them.

Can I use OpenTelemetry without a backend?

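Yes, for testing. If no SDK tracer provider is configured, the API hands out no-op spans that record nothing. To inspect spans locally, use ConsoleSpanExporter to print them or InMemorySpanExporter to collect them in tests. In production, configure an exporter to send spans to a backend.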

How do I sample traces when volume is high?

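Use a sampler to keep only a fraction of traces: TraceIdRatioBased(0.1), from opentelemetry.sdk.trace.sampling, samples roughly 10% of traces. This reduces overhead, but the decision is made when the trace starts, so rare slow traces are dropped at the same rate as any others. Keeping them preferentially requires tail-based sampling, which is typically done in a collector.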

Should I record every function call as a span?

No. Spans are for logical units of work (HTTP request, database query, cache lookup). Record functions that cross latency boundaries (I/O, network) or represent business logic.

Can I use OpenTelemetry with logging and metrics?

Yes. OpenTelemetry provides APIs for traces, metrics, and logs. Use all three together for complete observability.
