
This skill helps you implement distributed tracing with Jaeger and OpenTelemetry, enabling end-to-end visibility, context propagation, and performance debugging across services.

npx playbooks add skill eyadsibai/ltk --skill distributed-tracing

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (3.2 KB)
---
name: distributed-tracing
description: Use when implementing distributed tracing, using Jaeger or Tempo, debugging microservices latency, or asking about "tracing", "Jaeger", "OpenTelemetry", "spans", "traces", "observability"
version: 1.0.0
---

# Distributed Tracing

Implement distributed tracing with Jaeger and OpenTelemetry for request flow visibility.

## Trace Structure

```
Trace (Request ID: abc123)
  ↓
Span (frontend) [100ms]
  ↓
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]
```
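
As a rough sketch of how this hierarchy comes together in code, nested `start_as_current_span()` calls create children of whatever span is currently active (the span names below are illustrative):

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Each nested start_as_current_span() opens a child of the active span,
# producing the parent/child structure shown above.
with tracer.start_as_current_span("api-gateway"):
    with tracer.start_as_current_span("auth-service"):
        pass  # e.g. token validation
    with tracer.start_as_current_span("user-service"):
        with tracer.start_as_current_span("database"):
            pass  # e.g. SELECT query
```

In a real system each service creates its own spans in its own process; they are linked into one trace via context propagation (see below), so nesting in a single process is shown here only to illustrate the parent/child relationship.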

## Key Components

| Component | Description |
|-----------|-------------|
| **Trace** | End-to-end request journey |
| **Span** | Single operation within a trace |
| **Context** | Metadata propagated between services |
| **Tags** | Key-value pairs for filtering (called attributes in OpenTelemetry) |

## OpenTelemetry Setup (Python)

```python
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

# Initialize
provider = TracerProvider()
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

# Custom spans
@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("get_users") as span:
        span.set_attribute("user.count", 100)
        return fetch_users()
```

## OpenTelemetry Setup (Node.js)

```javascript
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

const provider = new NodeTracerProvider();
provider.addSpanProcessor(new BatchSpanProcessor(
    new JaegerExporter({ endpoint: 'http://jaeger:14268/api/traces' })
));
provider.register();
```

## Context Propagation

```python
# Inject trace context into HTTP headers
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Adds traceparent header
response = requests.get('http://downstream/api', headers=headers)
```
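
On the receiving side, the matching step is to extract the propagated context before starting a span. With auto-instrumentation (e.g. `FlaskInstrumentor` above) this happens automatically; a manual sketch, useful for custom transports such as message queues, might look like this (`handle_message` is a hypothetical handler):

```python
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)

def handle_message(headers, body):
    # Rebuild the caller's context from the incoming traceparent header
    ctx = extract(headers)
    # Start a span as a child of the remote caller's span
    with tracer.start_as_current_span("handle_message", context=ctx) as span:
        span.set_attribute("message.size", len(body))
```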

## Sampling Strategies

```yaml
# Probabilistic - sample 1%
sampler:
  type: probabilistic
  param: 0.01

# Rate limiting - max 100/sec
sampler:
  type: ratelimiting
  param: 100
```
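
The YAML above is Jaeger-style sampling configuration. If you configure sampling in the OpenTelemetry SDK instead, a roughly equivalent Python setup (the 1% ratio is illustrative) is:

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~1% of new traces; ParentBased keeps child spans consistent with
# the sampling decision already made by the calling service.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.01)))
```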

## Jaeger Queries

```
# Find slow requests (Service + Min Duration search fields)
service=my-service minDuration=1s

# Find errors (Service + Tags fields; tag filters are exact key=value matches)
service=my-service error=true http.status_code=500
```

## Correlated Logging

```python
def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    logger.info("Processing", extra={"trace_id": format(trace_id, '032x')})
```
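
To avoid passing `extra=` on every log call, one option is a `logging.Filter` that stamps the current trace ID onto each record; this is a sketch, not part of OpenTelemetry itself:

```python
import logging

from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""
    def filter(self, record):
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(logging.Formatter("%(asctime)s trace_id=%(trace_id)s %(message)s"))
logging.getLogger().addHandler(handler)
```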

## Best Practices

1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all boundaries
4. **Log exceptions** in spans (see the sketch after this list)
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Correlate with logs** using trace IDs
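
For practice 4, a minimal sketch of recording an exception on a span (`payment_gateway` and `charge_card` are hypothetical names):

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

def charge_card(order):
    with tracer.start_as_current_span("charge_card") as span:
        try:
            return payment_gateway.charge(order)  # hypothetical downstream call
        except Exception as exc:
            span.record_exception(exc)            # attaches the stack trace as a span event
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise
```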

Overview

This skill implements distributed tracing using OpenTelemetry with Jaeger or Tempo to visualize end-to-end request flows and debug microservice latency. It provides practical setup snippets for Python and Node.js, context propagation examples, sampling guidance, and query patterns for finding slow requests and errors. The focus is on actionable steps to instrument services, capture spans, and correlate traces with logs.

How this skill works

It instruments applications to generate traces composed of spans representing individual operations, then exports those spans to a tracing backend like Jaeger. Context propagation injects and extracts trace headers across HTTP or messaging boundaries so spans are linked across services. Sampling controls how many traces are recorded to balance visibility and cost. Queries and correlated logging use trace IDs and tags to find, filter, and diagnose issues.

When to use it

  • Implement end-to-end visibility for microservices architectures
  • Debug high-latency requests or cascading performance issues
  • Correlate logs and traces to root-cause failures
  • Validate downstream call sequences and timing
  • Monitor third-party or slow dependencies across service boundaries

Best practices

  • Sample conservatively in production (1–10%) and increase for critical flows
  • Add meaningful tags and attributes (user_id, request_id, http.status_code) for filtering
  • Propagate context across HTTP, gRPC, and message queues consistently
  • Name spans consistently and capture operation-level timing
  • Log exceptions and attach error details to spans
  • Measure tracing overhead and tune batch/export settings to keep impact low
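
For the last point, the batch/export knobs in the Python SDK's BatchSpanProcessor can be tuned explicitly; the values below are illustrative starting points, not recommendations:

```python
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

processor = BatchSpanProcessor(
    ConsoleSpanExporter(),        # stand-in for the Jaeger/OTLP exporter configured earlier
    max_queue_size=2048,          # spans buffered in memory before new spans are dropped
    schedule_delay_millis=5000,   # how often buffered spans are flushed (ms)
    max_export_batch_size=512,    # spans sent per export call
)
```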

Example use cases

  • Instrument a Flask service to auto-create spans and add a custom get_users span with attributes
  • Inject trace context into outgoing HTTP headers so downstream services map spans to the same trace
  • Configure probabilistic or rate-limiting samplers to control trace volume in production
  • Query Jaeger for slow requests (service=my-service duration > 1s) or errors (service=my-service error=true)
  • Correlate application logs with traces by logging the current trace_id alongside messages

FAQ

How do I propagate trace context across services?

Use OpenTelemetry propagators to inject trace headers into outgoing requests and extract them on the receiving side so spans link into a single trace.

What sampling strategy should I pick for production?

Start with probabilistic sampling around 1% and raise for critical endpoints; use rate-limiting if you need a fixed maximum throughput of traces.