
distributed-tracing skill


This skill helps you implement distributed tracing across microservices using Jaeger and Zipkin to debug, analyze performance, and track request flows.

npx playbooks add skill aj-geddes/useful-ai-prompts --skill distributed-tracing

Review the files below or copy the command above to add this skill to your agents.

---
name: distributed-tracing
description: Implement distributed tracing with Jaeger and Zipkin for tracking requests across microservices. Use when debugging distributed systems, tracking request flows, or analyzing service performance.
---

# Distributed Tracing

## Overview

Set up distributed tracing infrastructure with Jaeger or Zipkin to track requests across microservices and identify performance bottlenecks.

## When to Use

- Debugging microservice interactions
- Identifying performance bottlenecks
- Tracking request flows
- Analyzing service dependencies
- Root cause analysis

## Instructions

### 1. **Jaeger Setup**

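```yaml
# docker-compose.yml
version: '3.8'
services:
  jaeger:
    # Pin a specific image version instead of latest for reproducible environments.
    image: jaegertracing/all-in-one:latest
    ports:
      - "6831:6831/udp"   # agent: spans from jaeger-client as compact Thrift over UDP
      - "16686:16686"     # query service and web UI (http://localhost:16686)
      - "14268:14268"     # collector: spans sent directly over HTTP
    networks:
      - tracing

networks:
  tracing:
```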
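Before instrumenting anything, it helps to confirm the query service is actually answering. The sketch below assumes Jaeger's UI-facing JSON endpoint `GET /api/services` on port 16686, which returns an object whose `data` field lists service names; that API backs the web UI and is not a versioned public contract. `list_reporting_services` is a hypothetical helper that uses only the standard library.

```python
import json
import urllib.request


def list_reporting_services(base_url="http://localhost:16686", timeout=2.0):
    """Return the sorted service names Jaeger knows about, or None if unreachable.

    Assumes the UI-facing endpoint GET /api/services, answering {"data": [...]}.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/api/services", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses; ValueError covers bad JSON.
        return None
    return sorted(payload.get("data") or [])
```

After `docker compose up`, a `None` result means the container is not reachable yet; once your services send spans, their names should appear in the returned list.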

### 2. **Node.js Jaeger Instrumentation**

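```javascript
// tracing.js
// Note: jaeger-client is deprecated upstream in favor of OpenTelemetry SDKs;
// this setup targets existing OpenTracing-based services.
const { initTracer } = require('jaeger-client');
const opentracing = require('opentracing');

const initJaegerTracer = (serviceName) => {
  const config = {
    serviceName,
    sampler: {
      // 'const' with param 1 records every trace: fine for development,
      // too expensive for high-traffic production.
      type: 'const',
      param: 1
    },
    reporter: {
      logSpans: true,
      agentHost: process.env.JAEGER_AGENT_HOST || 'localhost',
      // Environment variables are strings; convert the port to a number.
      agentPort: Number(process.env.JAEGER_AGENT_PORT) || 6831
    }
  };

  // logSpans only prints when a logger with info()/error() is supplied; console has both.
  return initTracer(config, { logger: console });
};

const tracer = initJaegerTracer('api-service');
// Register as the OpenTracing global tracer so libraries calling globalTracer() share it.
opentracing.initGlobalTracer(tracer);
module.exports = { tracer };
```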
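The `const` sampler with `param: 1` keeps every trace. jaeger-client also accepts `sampler: { type: 'probabilistic', param: 0.01 }`, where `param` is the fraction of traces to keep. The Python sketch below illustrates why a trace-ID-based probabilistic decision works across services: every service derives the same answer from the same trace ID, so a trace is kept everywhere or nowhere. The 63-bit boundary and the function name are illustrative assumptions, not jaeger-client's exact internals.

```python
# Illustrative assumption: compare the low 63 bits of the trace ID with
# rate * 2**63, so the decision depends only on the trace ID.
MAX_ID = 2 ** 63


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep roughly `rate` of traces; a given trace_id always gets the same answer."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    boundary = int(rate * MAX_ID)
    # Mask off the top bit so 64-bit and 63-bit IDs behave the same way.
    return (trace_id & (MAX_ID - 1)) < boundary
```

With uniformly random trace IDs, `should_sample(trace_id, 0.01)` keeps about 1% of traces, and a downstream service that sees the same ID reaches the same verdict without any coordination.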

### 3. **Express Tracing Middleware**

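```javascript
// middleware.js
const { tracer } = require('./tracing');
const opentracing = require('opentracing');

const tracingMiddleware = (req, res, next) => {
  // Continue the caller's trace if it sent context headers; extract returns null otherwise.
  const wireCtx = tracer.extract(
    opentracing.FORMAT_HTTP_HEADERS,
    req.headers
  );

  const span = tracer.startSpan(`${req.method} ${req.path}`, {
    childOf: wireCtx,
    tags: {
      [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_SERVER,
      [opentracing.Tags.HTTP_METHOD]: req.method,
      [opentracing.Tags.HTTP_URL]: req.url
    }
  });

  // Expose the span so route handlers can create child spans.
  req.span = span;

  res.on('finish', () => {
    // After routing, prefer the route template (e.g. /api/users/:id) over the raw
    // path so operation names stay low-cardinality.
    if (req.route) {
      span.setOperationName(`${req.method} ${req.baseUrl || ''}${req.route.path}`);
    }
    span.setTag(opentracing.Tags.HTTP_STATUS_CODE, res.statusCode);
    if (res.statusCode >= 500) {
      span.setTag(opentracing.Tags.ERROR, true);
    }
    span.finish();
  });

  next();
};

module.exports = tracingMiddleware;
```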
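Naming spans after the raw `req.path` gives every user ID its own operation (`/api/users/1`, `/api/users/2`, ...), which is the unbounded cardinality the Best Practices warn against. When a route template is not available (for example, for requests that match no route), one hedge is to collapse ID-like path segments before naming the span. `normalize_operation` is a hypothetical helper, and its patterns (numeric IDs, UUIDs, long hex tokens) are assumptions to adapt to your URL scheme.

```python
import re

# Assumed ID shapes; extend these for your own URL scheme.
_ID_PATTERNS = [
    re.compile(r"^\d+$"),                                   # numeric IDs
    re.compile(r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
               r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"),         # UUIDs
    re.compile(r"^[0-9a-fA-F]{16,}$"),                      # long hex tokens
]


def normalize_operation(method: str, path: str) -> str:
    """Build a low-cardinality span name such as 'GET /api/users/{id}'."""
    path = path.split("?", 1)[0]  # drop any query string
    segments = []
    for segment in path.split("/"):
        if segment and any(p.match(segment) for p in _ID_PATTERNS):
            segments.append("{id}")
        else:
            segments.append(segment)
    return f"{method.upper()} {'/'.join(segments) or '/'}"
```

For example, `normalize_operation("get", "/api/users/42")` yields `GET /api/users/{id}`, so all user lookups aggregate under one operation in the Jaeger UI.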

### 4. **Python Jaeger Integration**

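```python
# tracing.py
from flask import Flask, g, request
from jaeger_client import Config
from opentracing.ext import tags as ext_tags
from opentracing.propagation import Format


def init_jaeger_tracer(service_name):
    config = Config(
        config={
            # 'const' with param 1 samples every trace: fine for development only.
            'sampler': {'type': 'const', 'param': 1},
            'local_agent': {
                'reporting_host': 'localhost',
                'reporting_port': 6831,
            },
            'logging': True,
        },
        service_name=service_name,
        validate=True,
    )
    return config.initialize_tracer()


# Flask integration
app = Flask(__name__)
tracer = init_jaeger_tracer('api-service')


@app.before_request
def start_request_span():
    # Continue the caller's trace if context headers are present; extract returns None otherwise.
    ctx = tracer.extract(Format.HTTP_HEADERS, dict(request.headers))
    # Name the span after the route template (e.g. /api/users/<user_id>) to keep cardinality low.
    operation = request.url_rule.rule if request.url_rule else request.path
    # Store per-request state on flask.g rather than on the request proxy.
    g.span = tracer.start_span(
        operation,
        child_of=ctx,
        tags={
            ext_tags.SPAN_KIND: ext_tags.SPAN_KIND_RPC_SERVER,
            ext_tags.HTTP_METHOD: request.method,
            ext_tags.HTTP_URL: request.url,
        },
    )


@app.after_request
def tag_response(response):
    g.span.set_tag(ext_tags.HTTP_STATUS_CODE, response.status_code)
    return response


@app.teardown_request
def finish_request_span(exc):
    # teardown_request runs even when the view raised, so the span is always finished.
    span = g.pop('span', None)
    if span is None:
        return
    if exc is not None:
        span.set_tag(ext_tags.ERROR, True)
        span.log_kv({'event': 'error', 'error.object': exc})
    span.finish()


@app.route('/api/users/<user_id>')
def get_user(user_id):
    # Child span for the database call; the context manager finishes it on exit.
    with tracer.start_span('fetch-user', child_of=g.span) as span:
        span.set_tag('user.id', user_id)
        # Fetch user from database
        return {'user': {'id': user_id}}
```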
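Whatever the framework, the property that matters is that each request's server span is finished exactly once, including when the handler raises. The sketch below shows that pattern at the WSGI layer (which Flask is built on) using only the standard library. `tracing_middleware`, the `report` callback, and the record dictionary are hypothetical stand-ins for a real tracer's start and finish calls, not a Jaeger span format.

```python
import time
from typing import Callable, Dict, List


def tracing_middleware(app: Callable, report: Callable[[Dict], None]) -> Callable:
    """Wrap a WSGI app so every request produces exactly one finished span record."""

    def wrapped(environ: Dict, start_response: Callable) -> List[bytes]:
        record = {
            "operation": f"{environ.get('REQUEST_METHOD', 'GET')} {environ.get('PATH_INFO', '/')}",
            # WSGI exposes the 'uber-trace-id' request header as HTTP_UBER_TRACE_ID.
            "parent_context": environ.get("HTTP_UBER_TRACE_ID"),
            "status": None,
            "error": False,
        }

        def capturing_start_response(status, headers, exc_info=None):
            record["status"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        started = time.perf_counter()
        try:
            result = app(environ, capturing_start_response)
            try:
                # Buffer the body so the span covers the whole response (not for streaming).
                return list(result)
            finally:
                close = getattr(result, "close", None)
                if close is not None:
                    close()
        except Exception:
            record["error"] = True
            raise
        finally:
            # Runs on success and on exceptions alike, like Flask's teardown_request.
            record["duration_s"] = time.perf_counter() - started
            report(record)

    return wrapped
```

In a real service, `report` would be replaced by setting tags on a tracer span and calling `finish()`; the `try/finally` structure is what guarantees no span is left open.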

### 5. **Distributed Context Propagation**

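```javascript
// propagation.js
const axios = require('axios');
const { tracer } = require('./tracing');
const opentracing = require('opentracing');

async function callDownstreamService(parentSpan, url, data) {
  // Give the outgoing call its own client span so its latency is visible separately.
  const span = tracer.startSpan('downstream-request', {
    childOf: parentSpan,
    tags: {
      [opentracing.Tags.SPAN_KIND]: opentracing.Tags.SPAN_KIND_RPC_CLIENT,
      [opentracing.Tags.HTTP_METHOD]: 'POST',
      [opentracing.Tags.HTTP_URL]: url
    }
  });

  // Inject the client span's context into the outgoing request headers.
  const headers = {};
  tracer.inject(span.context(), opentracing.FORMAT_HTTP_HEADERS, headers);

  try {
    const response = await axios.post(url, data, { headers });
    span.setTag(opentracing.Tags.HTTP_STATUS_CODE, response.status);
    return response.data;
  } catch (error) {
    span.setTag(opentracing.Tags.ERROR, true);
    span.log({
      event: 'error',
      message: error.message
    });
    throw error;
  } finally {
    span.finish();
  }
}

module.exports = { callDownstreamService };
```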
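With jaeger-client, `tracer.inject(..., FORMAT_HTTP_HEADERS, headers)` writes the context into an `uber-trace-id` header of the form `{trace-id}:{span-id}:{parent-span-id}:{flags}`, with IDs in hex and flag bit `1` meaning "sampled". The Python sketch below encodes and decodes that header to show what actually crosses the wire; it illustrates the format and is not a replacement for the client's own codec. The function and class names are hypothetical.

```python
from typing import NamedTuple, Optional

SAMPLED_FLAG = 0x01


class JaegerContext(NamedTuple):
    trace_id: int
    span_id: int
    parent_id: int
    flags: int

    @property
    def sampled(self) -> bool:
        return bool(self.flags & SAMPLED_FLAG)


def encode_uber_trace_id(ctx: JaegerContext) -> str:
    # Hex without zero padding; int(x, 16) on the decode side accepts padded values too.
    return f"{ctx.trace_id:x}:{ctx.span_id:x}:{ctx.parent_id:x}:{ctx.flags:x}"


def decode_uber_trace_id(value: Optional[str]) -> Optional[JaegerContext]:
    """Return the parsed context, or None if the header is absent or malformed."""
    if not value:
        return None
    parts = value.strip().split(":")
    if len(parts) != 4:
        return None
    try:
        trace_id, span_id, parent_id, flags = (int(p, 16) for p in parts)
    except ValueError:
        return None
    if trace_id == 0 or span_id == 0:
        return None
    return JaegerContext(trace_id, span_id, parent_id, flags)


def child_headers(parent: JaegerContext, new_span_id: int) -> dict:
    """Headers for a downstream call made from a new child span of `parent`."""
    child = JaegerContext(parent.trace_id, new_span_id, parent.span_id, parent.flags)
    return {"uber-trace-id": encode_uber_trace_id(child)}
```

The trace ID and sampling flag are copied unchanged to the child, which is what lets the downstream service attach its spans to the same trace.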

### 6. **Zipkin Integration**

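```javascript
// zipkin-setup.js
const CLSContext = require('zipkin-context-cls');
const { Tracer, BatchRecorder, jsonEncoder: { JSON_V2 } } = require('zipkin');
// HttpLogger is provided by the separate zipkin-transport-http package.
const { HttpLogger } = require('zipkin-transport-http');
const zipkinMiddleware = require('zipkin-instrumentation-express').expressMiddleware;

// Batch spans and POST them to Zipkin's v2 JSON endpoint.
const recorder = new BatchRecorder({
  logger: new HttpLogger({
    endpoint: 'http://localhost:9411/api/v2/spans',
    jsonEncoder: JSON_V2
  })
});

// CLS keeps the current trace context across asynchronous callbacks.
const ctxImpl = new CLSContext('zipkin');
const tracer = new Tracer({ recorder, ctxImpl, localServiceName: 'api-service' });

module.exports = {
  tracer,
  // Register with app.use(zipkinMiddleware) in an Express app.
  zipkinMiddleware: zipkinMiddleware({ tracer })
};
```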
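zipkin-js propagates context with B3 headers (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`) rather than jaeger-client's default `uber-trace-id`, so services instrumented with the two libraries only join the same trace if they agree on a format. The Python sketch below shows the B3 multi-header layout; the helper names are hypothetical, and it ignores the single-header `b3` variant and the debug flag.

```python
from typing import Dict, Mapping, Optional


def inject_b3(trace_id: str, span_id: str, parent_span_id: Optional[str],
              sampled: bool, headers: Dict[str, str]) -> None:
    """Write B3 multi-header context; IDs are lowercase hex strings."""
    headers["X-B3-TraceId"] = trace_id
    headers["X-B3-SpanId"] = span_id
    if parent_span_id:
        headers["X-B3-ParentSpanId"] = parent_span_id
    headers["X-B3-Sampled"] = "1" if sampled else "0"


def extract_b3(headers: Mapping[str, str]) -> Optional[Dict[str, object]]:
    """Read B3 context from incoming headers, matching names case-insensitively."""
    lowered = {k.lower(): v for k, v in headers.items()}
    trace_id = lowered.get("x-b3-traceid")
    span_id = lowered.get("x-b3-spanid")
    if not trace_id or not span_id:
        return None  # no usable context: the receiver starts a new trace
    sampled_raw = lowered.get("x-b3-sampled")
    return {
        "trace_id": trace_id.lower(),
        "span_id": span_id.lower(),
        "parent_span_id": lowered.get("x-b3-parentspanid"),
        # None means "no decision yet", so the receiver makes its own sampling call.
        "sampled": None if sampled_raw is None else sampled_raw in ("1", "true"),
    }
```

Accepting `"true"` alongside `"1"` tolerates older B3 senders; new headers should use `"1"` and `"0"`.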

### 7. **Trace Analysis**

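```python
# query-traces.py
import requests

JAEGER_QUERY_URL = 'http://localhost:16686'


def query_traces(service_name, operation=None, limit=20):
    # /api/traces is the JSON API behind the Jaeger UI; it is not a versioned public API.
    params = {
        'service': service_name,
        'limit': limit
    }
    if operation:
        params['operation'] = operation

    response = requests.get(f'{JAEGER_QUERY_URL}/api/traces', params=params, timeout=10)
    response.raise_for_status()
    return response.json()['data']


def trace_duration_us(trace):
    # Traces have no top-level duration; derive it from the spans (times in microseconds).
    spans = trace['spans']
    start = min(s['startTime'] for s in spans)
    end = max(s['startTime'] + s['duration'] for s in spans)
    return end - start


def find_slow_traces(service_name, min_duration_ms=1000):
    traces = query_traces(service_name, limit=100)
    slow_traces = [
        t for t in traces
        if trace_duration_us(t) > min_duration_ms * 1000
    ]
    return sorted(slow_traces, key=trace_duration_us, reverse=True)
```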
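Once `find_slow_traces` returns candidates, the next question is usually which service the time went to. In the JSON from `/api/traces`, each span carries a `processID` that maps to an entry in the trace's `processes` object, whose `serviceName` names the service; spans also have `operationName` and `duration` (microseconds). The sketch below uses those fields to rank spans and sum span time per service. `summarize_trace` is a hypothetical helper, and because child spans overlap their parents, the per-service totals count nested work more than once and are only a rough guide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def summarize_trace(trace: Dict, top_n: int = 3) -> Tuple[Dict[str, int], List[Tuple[str, str, int]]]:
    """Return (span time per service in µs, slowest spans as (service, operation, µs)).

    Expects one element of the "data" list from Jaeger's /api/traces response.
    """
    services = {pid: proc["serviceName"] for pid, proc in trace["processes"].items()}
    per_service: Dict[str, int] = defaultdict(int)
    spans = []
    for span in trace["spans"]:
        service = services.get(span["processID"], "unknown")
        per_service[service] += span["duration"]
        spans.append((service, span["operationName"], span["duration"]))
    # Slowest spans first; nested spans overlap, so totals are not exclusive time.
    spans.sort(key=lambda s: s[2], reverse=True)
    return dict(per_service), spans[:top_n]
```

A typical loop calls `summarize_trace(t)` for each result of `find_slow_traces('api-service')` and looks for the same service or operation recurring at the top.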

## Best Practices

### ✅ DO
- Sample appropriately for your traffic volume
- Propagate trace context across services
- Add meaningful span tags
- Log errors with spans
- Use consistent service naming
- Monitor trace latency
- Document trace format
- Keep instrumentation lightweight

### ❌ DON'T
- Sample 100% in production
- Skip trace context propagation
- Log sensitive data in spans
- Create excessive spans
- Ignore sampling configuration
- Use unbounded cardinality tags
- Deploy without testing collection

## Key Concepts

- **Trace**: Complete request flow across services
- **Span**: Single operation within a trace
- **Tag**: Metadata attached to spans
- **Log**: Timestamped events within spans
- **Context**: Trace information propagated between services
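To make these terms concrete, here is a toy, in-memory model in Python. It is not how Jaeger or Zipkin store data; the class and field names are assumptions chosen to mirror the concepts: spans in one trace share a trace ID, each child records its parent's span ID, tags are key/value metadata, logs are timestamped events, and the context is the small (trace ID, span ID) pair that crosses process boundaries.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class SpanContext:
    trace_id: str  # shared by every span in the trace
    span_id: str   # unique to this span


@dataclass
class Span:
    operation: str
    context: SpanContext
    parent_id: Optional[str] = None
    tags: Dict[str, Any] = field(default_factory=dict)
    logs: List[Tuple[float, Dict[str, Any]]] = field(default_factory=list)
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def set_tag(self, key: str, value: Any) -> None:
        self.tags[key] = value

    def log(self, **fields: Any) -> None:
        # A log is a timestamped event inside the span.
        self.logs.append((time.time(), fields))

    def finish(self) -> None:
        self.end = time.time()


def start_span(operation: str, parent: Optional[SpanContext] = None) -> Span:
    """Start a root span (new trace ID) or a child that inherits the parent's trace ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    ctx = SpanContext(trace_id=trace_id, span_id=secrets.token_hex(8))
    return Span(operation, ctx, parent_id=parent.span_id if parent else None)
```

Only `SpanContext` would travel to another service (in headers such as `uber-trace-id` or B3); tags and logs stay with the span that recorded them.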

Overview

This skill implements distributed tracing using Jaeger and Zipkin to track requests across microservices and surface performance bottlenecks. It provides ready-to-use setup snippets, instrumentation patterns for Node.js and Python, and examples for context propagation and trace querying. Use it to visualize request flows, correlate errors, and measure latency end-to-end.

How this skill works

The skill shows how to run a tracing backend (Jaeger or Zipkin) and wire instrumentation into request handling: initializing tracers, creating server-side spans in Express and Flask, injecting and extracting context for downstream calls, and recording spans and errors. It also includes a simple trace query example for finding slow traces.

When to use it

  • Debugging complex microservice interactions and unexpected latencies
  • Identifying service-level performance bottlenecks and hotspots
  • Correlating errors and root cause analysis across service boundaries
  • Measuring end-to-end request latency and service dependencies
  • Validating sampling and instrumentation behavior in staging or production

Best practices

  • Sample based on traffic volume; avoid 100% sampling in production
  • Propagate trace context on every downstream HTTP/RPC call
  • Add meaningful, low-cardinality tags to spans (avoid sensitive data)
  • Log errors and events to spans for easier diagnosis
  • Keep instrumentation lightweight and test collection end-to-end
  • Use consistent service and operation naming across teams

Example use cases

  • Run Jaeger via docker-compose to get an all-in-one collector and UI for development
  • Instrument an Express app with middleware that starts spans per incoming request and finishes on response
  • Instrument a Flask app to extract incoming context and create child spans for DB or cache calls
  • Inject trace headers before calling downstream services so traces stitch across processes
  • Query the Jaeger query API to find traces exceeding a duration threshold for SLA investigations

FAQ

How do I avoid sampling overhead in high-traffic systems?

Use a probabilistic or rate-limiting sampler (or Jaeger's adaptive sampling) tuned to your traffic volume and storage costs: keep the base rate low in production and raise it for specific operations with per-operation or remotely managed sampling rules.
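Besides probabilistic sampling, Jaeger clients offer a `ratelimiting` sampler whose `param` caps new traces per second, which bounds overhead during traffic spikes. The token-bucket sketch below illustrates that idea; the class name, the injectable clock, and the one-second burst capacity are assumptions, not jaeger-client's implementation.

```python
import time
from typing import Callable


class RateLimitingSampler:
    """Allow at most `traces_per_second` new traces, using a token bucket."""

    def __init__(self, traces_per_second: float, clock: Callable[[], float] = time.monotonic):
        if traces_per_second <= 0:
            raise ValueError("traces_per_second must be positive")
        self.rate = traces_per_second
        self.capacity = max(1.0, traces_per_second)  # assumed burst size: one second of budget
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def should_sample(self) -> bool:
        now = self.clock()
        # Refill tokens for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Unlike a fixed percentage, this keeps the number of stored traces roughly constant as traffic grows, at the cost of sampling a smaller fraction of requests during peaks.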

Can I mix Jaeger and Zipkin instrumentation?

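Yes, as long as every service agrees on one propagation format. Jaeger clients send an `uber-trace-id` header by default, while Zipkin instrumentation uses B3 (`X-B3-*`) headers, so a trace breaks wherever the two meet unless inject/extract is configured consistently, for example by switching the Jaeger clients to B3 or moving all services to W3C Trace Context.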
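When services must interoperate during a migration, a gateway or shim sometimes translates between formats. As an illustration, the sketch below converts B3 values into a W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`, with a 32-hex-digit trace ID and a 16-hex-digit span ID) and back. Left-padding a 64-bit B3 trace ID with zeros to 128 bits is the convention this sketch assumes, it handles only version `00`, and the function names are hypothetical.

```python
import re
from typing import Optional, Tuple

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def b3_to_traceparent(trace_id: str, span_id: str, sampled: bool) -> str:
    """Build a version-00 traceparent from B3 IDs (64-bit trace IDs are left-padded)."""
    trace_id = trace_id.lower().rjust(32, "0")
    span_id = span_id.lower().rjust(16, "0")
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def traceparent_to_b3(header: str) -> Optional[Tuple[str, str, bool]]:
    """Parse a version-00 traceparent into (trace_id, span_id, sampled), or None if invalid."""
    match = _TRACEPARENT.match(header.strip().lower())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid in W3C Trace Context
    return trace_id, span_id, bool(int(flags, 16) & 0x01)
```

Only the sampled bit survives the round trip; B3's parent span ID and debug flag have no slot in `traceparent`.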