
This skill helps you implement end-to-end distributed tracing with OpenTelemetry, Jaeger, and correlation IDs to diagnose latency across services.

npx playbooks add skill doanchienthangdev/omgkit --skill distributed-tracing

Run the command above to add this skill to your agents.

---
name: distributed-tracing
description: Comprehensive distributed tracing with Jaeger, Zipkin, OpenTelemetry, correlation IDs, and span design.
---

# Distributed Tracing

Comprehensive distributed tracing with Jaeger, Zipkin, OpenTelemetry, correlation IDs, and span design.

## Overview

Distributed tracing tracks requests as they flow through multiple services, enabling debugging and performance analysis in microservices architectures.

## Key Concepts

### Trace Model
- **Trace**: End-to-end request journey
- **Span**: Single operation within a trace
- **Span Context**: Propagated trace information
- **Baggage**: Custom key-value pairs carried across services

### Span Attributes
- **Operation Name**: What the span represents
- **Start/End Time**: Duration measurement
- **Tags**: Indexed metadata for querying
- **Logs**: Time-stamped events within span
- **Status**: Success, error, or unset

## OpenTelemetry Implementation

### Instrumentation Setup
```javascript
// Node.js OpenTelemetry setup
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');

const provider = new NodeTracerProvider();

// SimpleSpanProcessor exports each span as soon as it ends; prefer
// BatchSpanProcessor in production to batch exports and reduce overhead.
provider.addSpanProcessor(
  new SimpleSpanProcessor(
    new JaegerExporter({
      endpoint: 'http://jaeger:14268/api/traces',
    })
  )
);

provider.register();

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});
```

### Manual Span Creation
```javascript
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

async function processOrder(orderId) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);

      // Child span for the database operation; end it even if the query throws
      await tracer.startActiveSpan('db.query', async (dbSpan) => {
        try {
          dbSpan.setAttribute('db.system', 'postgresql');
          dbSpan.setAttribute('db.statement', 'SELECT * FROM orders WHERE id = $1');
          await db.query('SELECT * FROM orders WHERE id = $1', [orderId]);
        } finally {
          dbSpan.end();
        }
      });

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw error;
    } finally {
      span.end();
    }
  });
}
```

### Context Propagation
```javascript
const { context, propagation } = require('@opentelemetry/api');

// Extract context from incoming request
app.use((req, res, next) => {
  const ctx = propagation.extract(context.active(), req.headers);
  context.with(ctx, next);
});

// Inject context into outgoing request
async function callService(url) {
  const headers = {};
  propagation.inject(context.active(), headers);

  return fetch(url, { headers });
}
```

## Jaeger Configuration

### Kubernetes Deployment
```yaml
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
spec:
  strategy: production
  storage:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      resources:
        requests:
          cpu: 1
          memory: 4Gi
  collector:
    maxReplicas: 5
  query:
    replicas: 2
```

### Sampling Strategies
```yaml
# Jaeger sampling configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: jaeger-sampling
data:
  sampling: |
    {
      "service_strategies": [
        {
          "service": "order-service",
          "type": "probabilistic",
          "param": 0.5
        },
        {
          "service": "payment-service",
          "type": "ratelimiting",
          "param": 100
        }
      ],
      "default_strategy": {
        "type": "probabilistic",
        "param": 0.1
      }
    }
```

## Span Design Guidelines

### Naming Conventions
```
HTTP spans:    HTTP {METHOD} {route}
               HTTP GET /api/users/:id

Database:      {db.system}.{operation}
               postgresql.query

Message:       {messaging.system} {operation} {destination}
               kafka send orders-topic

RPC:           {rpc.system}/{service}/{method}
               grpc/UserService/GetUser
```
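As a sketch, these conventions can be encoded in small helpers so every service names its spans the same way (the function names here are illustrative):

```javascript
// Build span names following the conventions above so traces
// remain consistent and searchable across services.
function httpSpanName(method, route) {
  return `HTTP ${method} ${route}`;
}

function dbSpanName(system, operation) {
  return `${system}.${operation}`;
}

function messagingSpanName(system, operation, destination) {
  return `${system} ${operation} ${destination}`;
}

console.log(httpSpanName('GET', '/api/users/:id'));              // HTTP GET /api/users/:id
console.log(dbSpanName('postgresql', 'query'));                  // postgresql.query
console.log(messagingSpanName('kafka', 'send', 'orders-topic')); // kafka send orders-topic
```

Centralizing the naming logic keeps cardinality low: routes stay parameterized (`/api/users/:id`, not `/api/users/123`), which is what makes spans queryable.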

### Essential Attributes
```javascript
// HTTP spans
span.setAttribute('http.method', 'GET');
span.setAttribute('http.url', 'https://api.example.com/users/123');
span.setAttribute('http.status_code', 200);
span.setAttribute('http.request_content_length', 0);
span.setAttribute('http.response_content_length', 1234);

// Database spans
span.setAttribute('db.system', 'postgresql');
span.setAttribute('db.name', 'mydb');
span.setAttribute('db.statement', 'SELECT * FROM users WHERE id = $1');
span.setAttribute('db.operation', 'SELECT');

// Messaging spans
span.setAttribute('messaging.system', 'kafka');
span.setAttribute('messaging.destination', 'orders');
span.setAttribute('messaging.operation', 'send');
```

## Best Practices

1. **Consistent Naming**: Follow semantic conventions
2. **Don't Over-Trace**: Sample appropriately
3. **Meaningful Spans**: Business-relevant operations
4. **Error Recording**: Always record exceptions
5. **Context Propagation**: Ensure trace continuity

## Sampling Strategies

### Head-Based Sampling
- Decision made at trace start
- Simpler, consistent
- May miss interesting traces

### Tail-Based Sampling
- Decision made at trace end
- Keeps all errors and slow traces
- More resource intensive

### Adaptive Sampling
- Adjusts rate based on traffic
- Balances cost and coverage
- Best for variable traffic
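Head-based sampling is typically made deterministic per trace ID, which is the idea behind OpenTelemetry's trace-ID-ratio sampler: every service computing the same function on the same trace ID reaches the same decision, so sampled traces stay complete. A minimal sketch of that idea (not the library's exact algorithm):

```javascript
// Deterministic head-based sampling: treat the low 8 hex digits of the
// trace ID as a uniform 32-bit value and keep the trace if it falls
// below ratio * 2^32. Same trace ID + same ratio => same decision everywhere.
function shouldSample(traceId, ratio) {
  const slice = parseInt(traceId.slice(-8), 16);
  return slice < ratio * 0x100000000;
}

console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 1.0)); // true: ratio 1 keeps everything
console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 0.0)); // false: ratio 0 drops everything
```

Because the decision depends only on the trace ID, downstream services need no coordination; the decision is also carried in the `traceparent` sampled flag so they rarely need to recompute it.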

## Anti-Patterns

- Creating spans for every function call
- Not propagating context across service boundaries
- Ignoring span errors
- Sampling 100% in production
- Not correlating traces with logs

## When to Use

- Microservices with complex request flows
- Debugging latency issues
- Understanding service dependencies
- Capacity planning

## When NOT to Use

- Monolithic applications
- Very high-throughput systems without sampling
- When storage costs are a concern

Overview

This skill provides a complete distributed tracing blueprint for Node.js services using Jaeger, Zipkin, and OpenTelemetry. It covers instrumentation, context propagation, span design, sampling strategies, and deployment guidance to help you trace requests end-to-end across microservices. The focus is practical: set up exporters, create meaningful spans, and maintain trace continuity with correlation IDs.

How this skill works

It configures OpenTelemetry instrumentation and exporters (Jaeger, Zipkin) and shows how to register automatic HTTP/Express instrumentations and create manual spans for key operations. It explains context extraction and injection so trace context travels across incoming and outgoing requests. It also includes deployment examples (Jaeger on Kubernetes), sampling strategies, and span naming and attribute conventions to make traces searchable and actionable.

When to use it

  • Microservices where requests flow through multiple services and you need end-to-end visibility
  • When troubleshooting latency spikes, errors, or service dependency bottlenecks
  • To correlate traces with logs and metrics for incident investigation
  • During capacity planning to understand service interactions and hotspots
  • When implementing observability as part of SRE or platform engineering efforts

Best practices

  • Follow consistent span naming and semantic conventions for HTTP, DB, RPC, and messaging operations
  • Instrument only business-relevant operations; avoid tracing every function to reduce noise
  • Use appropriate sampling (head-, tail-, or adaptive) to balance cost and signal
  • Always propagate context across service boundaries and inject correlation IDs into logs
  • Record exceptions and set span status on errors to capture failure signals

Example use cases

  • Instrument an order-service to trace end-to-end order processing across API, DB, and payment services
  • Deploy Jaeger in Kubernetes with scalable storage and query replicas for production tracing
  • Implement context extraction/injection middleware to maintain trace continuity across HTTP and RPC calls
  • Create manual spans around expensive DB queries and external service calls to measure latency contributions
  • Apply tail-based sampling to reliably capture slow and error traces while reducing storage cost

FAQ

How do I ensure traces link to logs?

Inject a correlation ID (trace or span ID) into your logging context and ensure logs include that ID so traces and logs can be correlated in your observability tools.
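As a minimal sketch, a structured logger can stamp the active trace ID onto every line. Here `getActiveTraceId` stands in for something like `trace.getActiveSpan()?.spanContext().traceId` from `@opentelemetry/api`; the logger shape is illustrative:

```javascript
// Wrap logging so every line carries the active trace ID as a
// correlation field, letting observability tools join logs to traces.
function makeLogger(getActiveTraceId) {
  return {
    info(message, fields = {}) {
      const line = JSON.stringify({
        level: 'info',
        message,
        trace_id: getActiveTraceId() ?? null, // correlation ID
        ...fields,
      });
      console.log(line);
      return line;
    },
  };
}

const logger = makeLogger(() => '4bf92f3577b34da6a3ce929d0e0e4736');
logger.info('order processed', { 'order.id': 42 });
```

With the trace ID in every log line, searching logs by `trace_id` in your log backend surfaces exactly the lines emitted during one traced request.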

Which sampling strategy should I pick for production?

Start with head-based probabilistic sampling for simplicity, or adopt adaptive/tail-based sampling when you need to reliably capture errors and slow traces while controlling storage costs.