
implementing-observability skill

/skills/implementing-observability

This skill implements production observability with OpenTelemetry across metrics, logs, and traces to boost visibility and reliability.

npx playbooks add skill ancoleman/ai-design-components --skill implementing-observability

SKILL.md
---
name: implementing-observability
description: Monitoring, logging, and tracing implementation using OpenTelemetry as the unified standard. Use when building production systems requiring visibility into performance, errors, and behavior. Covers OpenTelemetry (metrics, logs, traces), Prometheus, Grafana, Loki, Jaeger, Tempo, structured logging (structlog, tracing, slog, pino), and alerting.
---

# Production Observability with OpenTelemetry

## Purpose

Implement production-grade observability using OpenTelemetry as the 2025 industry standard. Covers the three pillars (metrics, logs, traces), LGTM stack deployment, and critical log-trace correlation patterns.

## When to Use

Use when:
- Building production systems requiring visibility into performance and errors
- Debugging distributed systems with multiple services
- Setting up monitoring, logging, or tracing infrastructure
- Implementing structured logging with trace correlation
- Configuring alerting rules for production systems

Skip if:
- Building proof-of-concept without production deployment
- System has < 100 requests/day (console logging may suffice)

## The OpenTelemetry Standard (2025)

OpenTelemetry is the CNCF graduated project unifying observability:

```
┌────────────────────────────────────────────────────────┐
│          OpenTelemetry: The Unified Standard           │
├────────────────────────────────────────────────────────┤
│                                                         │
│  ONE SDK for ALL signals:                              │
│  ├── Metrics (Prometheus-compatible)                   │
│  ├── Logs (structured, correlated)                     │
│  ├── Traces (distributed, standardized)                │
│  └── Context (propagates across services)              │
│                                                         │
│  Language SDKs:                                         │
│  ├── Python: opentelemetry-api, opentelemetry-sdk      │
│  ├── Rust: opentelemetry, tracing-opentelemetry        │
│  ├── Go: go.opentelemetry.io/otel                      │
│  └── TypeScript: @opentelemetry/api                    │
│                                                         │
│  Export to ANY backend:                                │
│  ├── LGTM Stack (Loki, Grafana, Tempo, Mimir)          │
│  ├── Prometheus + Jaeger                               │
│  ├── Datadog, New Relic, Honeycomb (SaaS)              │
│  └── Custom backends via OTLP protocol                 │
│                                                         │
└────────────────────────────────────────────────────────┘
```

**Context7 Reference**: `/websites/opentelemetry_io` (Trust: High, Snippets: 5,888, Score: 85.9)

## The Three Pillars of Observability

### 1. Metrics (What is happening?)

Track system health and performance over time.

**Metric Types**: Counters (monotonically increasing), Gauges (current value, up or down), Histograms (distributions; percentiles are computed at query time). OpenTelemetry has no Summary instrument; use Histograms where Prometheus would use Summaries.

**Brief Example (Python)**:
```python
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
http_requests = meter.create_counter("http.server.requests")
http_requests.add(1, {"method": "GET", "status": 200})
```
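
Histograms follow the same pattern; a minimal sketch for request latency (the handler call is a hypothetical placeholder, the metric name follows the OTel semantic convention `http.server.request.duration`):

```python
import time
from opentelemetry import metrics

meter = metrics.get_meter(__name__)
# Histogram buckets capture the latency distribution; percentiles are
# computed at query time in the backend (e.g. Mimir/Prometheus).
request_duration = meter.create_histogram("http.server.request.duration", unit="s")

start = time.monotonic()
handle_request()  # hypothetical application handler
request_duration.record(time.monotonic() - start, {"http.route": "/users"})
```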

### 2. Logs (What happened?)

Record discrete events with context.

**CRITICAL**: Always inject trace_id/span_id for log-trace correlation.

**Brief Example (Python + structlog)**:
```python
import structlog
from opentelemetry import trace

logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()

logger.info(
    "processing_request",
    trace_id=format(ctx.trace_id, '032x'),
    span_id=format(ctx.span_id, '016x'),
    user_id=user_id
)
```

**See**: `references/structured-logging.md` for complete configuration.

### 3. Traces (Where did time go?)

Track request flow across distributed services.

**Key Concepts**: Trace (end-to-end journey), Span (individual operation), Parent-Child (nested operations).

**Brief Example (Python + FastAPI)**:
```python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-traces all HTTP requests
```
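
Auto-instrumentation only creates spans; something still has to export them. A minimal provider setup sketch, assuming the OTLP gRPC endpoint from the LGTM quick start below (`localhost:4317`) and an illustrative service name:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# service.name identifies this app in Tempo/Grafana; the value is illustrative.
provider = TracerProvider(resource=Resource.create({"service.name": "my-api"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)  # call once, before instrumenting the app
```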

**See**: `references/opentelemetry-setup.md` for SDK installation by language.

## The LGTM Stack (Self-Hosted Observability)

LGTM = **L**oki (Logs) + **G**rafana (Visualization) + **T**empo (Traces) + **M**imir (Metrics)

```
┌────────────────────────────────────────────────────────┐
│                  LGTM Architecture                      │
├────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────────────────────────────────────┐      │
│  │           Grafana Dashboard (Port 3000)      │      │
│  │  Unified UI for Logs, Metrics, Traces       │      │
│  └──────┬──────────────┬─────────────┬─────────┘      │
│         │              │             │                 │
│         ▼              ▼             ▼                 │
│  ┌──────────┐   ┌──────────┐  ┌──────────┐            │
│  │   Loki   │   │  Tempo   │  │  Mimir   │            │
│  │  (Logs)  │   │ (Traces) │  │(Metrics) │            │
│  │Port 3100 │   │Port 3200 │  │Port 9009 │            │
│  └────▲─────┘   └────▲─────┘  └────▲─────┘            │
│       │              │             │                   │
│       └──────────────┴─────────────┘                   │
│                      │                                 │
│              ┌───────▼────────┐                        │
│              │ Grafana Alloy  │                        │
│              │  (Collector)   │                        │
│              │  Port 4317/8   │ ← OTLP gRPC/HTTP       │
│              └───────▲────────┘                        │
│                      │                                 │
│         OpenTelemetry Instrumented Apps                │
│                                                         │
└────────────────────────────────────────────────────────┘
```

**Quick Start**: Bring up `examples/lgtm-docker-compose/docker-compose.yml` with `docker-compose up -d` for a complete LGTM stack.

**See**: `references/lgtm-stack.md` for production deployment guide.

## Critical Pattern: Log-Trace Correlation

**The Problem**: Logs and traces live in separate systems. You see an error log but can't find the related trace.

**The Solution**: Inject `trace_id` and `span_id` into every log record.

### Python (structlog)

```python
import structlog
from opentelemetry import trace

logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()

logger.info(
    "request_processed",
    trace_id=format(ctx.trace_id, '032x'),  # 32-char hex
    span_id=format(ctx.span_id, '016x'),    # 16-char hex
    user_id=user_id
)
```
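
Rather than formatting the IDs by hand on every call, a structlog processor can inject them automatically. A sketch (the processor name is ours, not a structlog built-in):

```python
import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    """Inject trace_id/span_id into every log record when a span is active."""
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(
    processors=[
        add_trace_context,
        structlog.processors.JSONRenderer(),
    ]
)
```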

### Rust (tracing)

```rust
use tracing::{info, instrument};

// #[instrument] records arguments as span fields; with the
// tracing-opentelemetry layer, events here inherit trace_id/span_id.
#[instrument]
async fn process_request(user_id: u64) -> Result<Response, Error> {
    info!("processing request");
    let response = fetch_response(user_id).await?; // app-specific placeholder
    Ok(response)
}
```

**See**: `references/trace-context.md` for Go and TypeScript patterns.

### Query in Grafana

```logql
{job="api-service"} |= "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
```

## Quick Setup Guide

### 1. Choose Your Stack

**Decision Tree**:
- **Greenfield**: OpenTelemetry SDK + LGTM Stack (self-hosted) or Grafana Cloud (managed)
- **Existing Prometheus**: Add Loki (logs) + Tempo (traces)
- **Kubernetes**: LGTM via Helm, Alloy DaemonSet
- **Zero-ops**: Managed SaaS (Grafana Cloud, Datadog, New Relic)

### 2. Install OpenTelemetry SDK

**Bootstrap Script**:
```bash
python scripts/setup_otel.py --language python --framework fastapi
```

**Manual (Python)**:
```bash
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-exporter-otlp
```

**See**: `references/opentelemetry-setup.md` for Rust, Go, TypeScript installation.

### 3. Deploy LGTM Stack

**Docker Compose** (development):
```bash
cd examples/lgtm-docker-compose
docker-compose up -d
# Grafana: http://localhost:3000 (admin/admin)
# OTLP: localhost:4317 (gRPC), localhost:4318 (HTTP)
```

**See**: `references/lgtm-stack.md` for production Kubernetes deployment.

### 4. Configure Structured Logging

**See**: `references/structured-logging.md` for complete setup (Python, Rust, Go, TypeScript).
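
As a starting point, a minimal JSON pipeline sketch (combine with the `add_trace_context` processor shown earlier; processor order and field names are illustrative):

```python
import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,   # request-scoped fields
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),        # one JSON object per line
    ]
)

log = structlog.get_logger()
log.info("service_started", port=8000)
```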

### 5. Set Up Alerting

**See**: `references/alerting-rules.md` for Prometheus and Loki alert patterns.

## Auto-Instrumentation

OpenTelemetry auto-instruments popular frameworks:

```python
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-trace all HTTP requests
```

**Supported**: FastAPI, Flask, Django, Express, Gin, Echo, Nest.js

**See**: `references/opentelemetry-setup.md` for framework-specific setup.
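
Outbound calls can be instrumented the same way. For example, with the `opentelemetry-instrumentation-requests` package installed:

```python
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Adds a client span (and propagates W3C trace context headers) to every
# outbound requests call, linking it to the current server span.
RequestsInstrumentor().instrument()
```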

## Common Patterns

### Custom Spans

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("fetch_user_details") as span:
    span.set_attribute("user_id", user_id)
    user = await db.fetch_user(user_id)
    span.set_attribute("user_found", user is not None)
```

### Error Tracking

```python
from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("process_payment") as span:
    try:
        result = process_payment(amount, card_token)
        span.set_status(Status(StatusCode.OK))
    except PaymentError as e:
        span.set_status(Status(StatusCode.ERROR, str(e)))
        span.record_exception(e)
        raise
```
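
The same span API combines with context propagation when work crosses a process boundary. A sketch of carrying W3C trace context through a job queue (the payload shape and helper names are hypothetical):

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

def enqueue_job(payload: dict) -> dict:
    carrier: dict = {}
    inject(carrier)  # writes the W3C "traceparent" header into the carrier
    return {"payload": payload, "otel": carrier}

def handle_job(job: dict) -> None:
    ctx = extract(job["otel"])  # restore the producer's trace context
    # The worker span becomes a child of the producer's span, so the
    # whole trace stays connected across the queue.
    with tracer.start_as_current_span("handle_job", context=ctx):
        ...  # do the work
```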

**See**: `references/trace-context.md` for background job tracing and context propagation.

## Validation and Testing

```bash
# Test log-trace correlation
# 1. Make request to your app
# 2. Copy trace_id from logs
# 3. Query in Grafana: {job="myapp"} |= "trace_id=<TRACE_ID>"

# Validate metrics
python scripts/validate_metrics.py
```
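
The correlation check can also be automated; a pytest sketch that asserts the trace_id field actually appears in the log output (test name and setup are illustrative):

```python
import structlog
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

def test_log_includes_trace_id(capsys):
    trace.set_tracer_provider(TracerProvider())  # in-memory, no exporter needed
    tracer = trace.get_tracer(__name__)
    logger = structlog.get_logger()

    with tracer.start_as_current_span("test-span"):
        ctx = trace.get_current_span().get_span_context()
        logger.info("event", trace_id=format(ctx.trace_id, "032x"))

    # structlog's default logger prints to stdout, which capsys captures
    assert format(ctx.trace_id, "032x") in capsys.readouterr().out
```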

## Integration with Other Skills

- **Dashboards**: Embed Grafana panels, query Prometheus metrics
- **Feedback**: Alert routing (Slack, PagerDuty), notification UI
- **Data-Viz**: Time-series charts, trace waterfall, latency heatmaps

**See**: `examples/fastapi-otel/` for complete integration.

## Progressive Disclosure

**Setup Guides**:
- `references/opentelemetry-setup.md` - SDK installation (Python, Rust, Go, TypeScript)
- `references/structured-logging.md` - structlog, tracing, slog, pino configuration
- `references/lgtm-stack.md` - LGTM deployment (Docker, Kubernetes)
- `references/trace-context.md` - Log-trace correlation patterns
- `references/alerting-rules.md` - Prometheus and Loki alert templates

**Examples**:
- `examples/fastapi-otel/` - FastAPI + OpenTelemetry + LGTM
- `examples/axum-tracing/` - Rust Axum + tracing + LGTM
- `examples/lgtm-docker-compose/` - Production-ready LGTM stack

**Scripts**:
- `scripts/setup_otel.py` - Bootstrap OpenTelemetry SDK
- `scripts/generate_dashboards.py` - Generate Grafana dashboards
- `scripts/validate_metrics.py` - Validate metric naming

## Key Principles

1. **OpenTelemetry is THE standard** - Use OTel SDK, not vendor-specific SDKs
2. **Auto-instrumentation first** - Prefer auto over manual spans
3. **Always correlate logs and traces** - Inject trace_id/span_id into every log
4. **Use structured logging** - JSON format, consistent field names
5. **LGTM stack for self-hosting** - Production-ready open-source stack

## Common Pitfalls

**Don't**:
- Use vendor-specific SDKs (use OpenTelemetry)
- Log without trace_id/span_id context
- Manually instrument what auto-instrumentation covers
- Mix logging libraries (pick one: structlog, tracing, slog, pino)

**Do**:
- Start with auto-instrumentation
- Add manual spans only for business-critical operations
- Use semantic conventions for span attributes
- Export to OTLP (gRPC preferred over HTTP)
- Test locally with LGTM docker-compose before production

## Success Metrics

1. 100% of logs include trace_id when in request context
2. Mean time to resolution (MTTR) decreases by >50%
3. Developers use Grafana as first debugging tool
4. 80%+ of telemetry from auto-instrumentation
5. Alert noise < 5% false positives

Overview

This skill implements production-grade observability using OpenTelemetry as the unified standard for metrics, logs, and traces. It provides patterns, examples, and deployment guidance for the LGTM stack (Loki, Grafana, Tempo, Mimir) and shows how to correlate structured logs with traces across distributed services. Use it to get fast debugging, consistent telemetry, and reliable alerting across Python, Rust, Go, and TypeScript systems.

How this skill works

The skill configures OpenTelemetry SDKs and auto-instrumentation to emit metrics, structured logs, and distributed traces via OTLP. It maps metrics to Prometheus/Mimir, logs to Loki, and traces to Tempo or Jaeger, and wires Grafana as a unified UI. It includes concrete examples for Python (FastAPI, structlog) and Rust (Axum, tracing), log-trace correlation best practices, alerting rule patterns, and quick-start deployment with an LGTM docker-compose or Helm charts.

When to use it

  • Building production services that need full visibility into performance and errors
  • Debugging distributed, microservice-based systems with cross-service requests
  • Setting up centralized monitoring, structured logging, and tracing infrastructure
  • Ensuring logs include trace_id/span_id for reliable trace lookup
  • Configuring alerting and dashboards for on-call and SRE workflows

Best practices

  • Adopt OpenTelemetry SDKs (not vendor-specific) and prefer OTLP gRPC exports
  • Start with auto-instrumentation; add manual spans only for critical business operations
  • Always inject trace_id and span_id into structured logs for correlation
  • Use semantic conventions for span attributes and consistent metric names
  • Validate locally using the LGTM docker-compose before production deploys

Example use cases

  • FastAPI service: auto-instrument HTTP requests, add business spans, forward OTLP to LGTM for local testing
  • Kubernetes deployment: install Grafana Alloy as a collector DaemonSet and use Helm charts for the LGTM components
  • Incident debugging: copy trace_id from logs and query Grafana/Loki to find the full trace waterfall
  • SLO monitoring: export latency histograms to Mimir/Prometheus and create Grafana alerts
  • Error tracking: record exceptions on spans and route high-severity logs to PagerDuty via alert rules

FAQ

Do I need all three pillars (metrics, logs, traces)?

Yes—metrics give trend visibility, logs record events, and traces show request flow; together they reduce MTTR significantly.

When is LGTM overkill?

Skip a full LGTM deploy for tiny proof-of-concept apps or services with <100 requests/day; console logging may suffice until scale increases.

Which export protocol is recommended?

Use OTLP over gRPC when possible for performance and feature parity; fall back to HTTP if gRPC is unavailable.
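
In Python, the two transports are separate exporter classes; a sketch using the conventional OTLP ports (4317 for gRPC, 4318 for HTTP):

```python
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter as GrpcSpanExporter,
)
from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
    OTLPSpanExporter as HttpSpanExporter,
)

# Preferred: gRPC (binary protobuf over a persistent connection)
grpc_exporter = GrpcSpanExporter(endpoint="localhost:4317", insecure=True)

# Fallback: HTTP/protobuf when gRPC is blocked by proxies or platform limits
http_exporter = HttpSpanExporter(endpoint="http://localhost:4318/v1/traces")
```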