
logging-observability skill


This skill helps you implement structured logging, tracing, and metrics to build observable services with OpenTelemetry and Prometheus.

npx playbooks add skill openclaw/skills --skill logging-observability

---
name: logging-observability
model: standard
description: Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.
version: 1.0.0
---

# Logging & Observability

Patterns for building observable systems across the three pillars: logs, metrics, and traces.

## Three Pillars

| Pillar | Purpose | Question It Answers | Example |
|--------|---------|---------------------|---------|
| **Logs** | What happened | Why did this request fail? | `{"level":"error","message":"payment declined","user_id":"u_82"}` |
| **Metrics** | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| **Traces** | Request flow | Where is the bottleneck? | Span: `api-gateway → auth → order-service → db` |

Each pillar is strongest when correlated. Embed `trace_id` in every log line to jump from a log entry to the full distributed trace.

---

## Structured Logging

Always emit logs as structured JSON — never free-text strings.

### Required Fields

| Field | Purpose | Required |
|-------|---------|----------|
| `timestamp` | ISO-8601 with milliseconds | Yes |
| `level` | Severity (TRACE … FATAL) | Yes |
| `service` | Originating service name | Yes |
| `message` | Human-readable description | Yes |
| `trace_id` | Distributed trace correlation | Yes |
| `span_id` | Current span within trace | Yes |
| `correlation_id` | Business-level correlation (order ID) | When applicable |
| `error` | Structured error object | On errors |
| `context` | Request-specific metadata | Recommended |
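
As a rough illustration of the schema above, the following sketch builds one log entry and serialises it. The `LogEntry` type and `buildLogEntry` helper are hypothetical names introduced here, not part of any logging library; only the field names come from the table.

```typescript
// Hypothetical types for illustration; field names follow the table above.
type Level = 'TRACE' | 'DEBUG' | 'INFO' | 'WARN' | 'ERROR' | 'FATAL';

interface LogEntry {
  timestamp: string; // ISO-8601 with milliseconds
  level: Level;
  service: string;
  message: string;
  trace_id: string;
  span_id: string;
  correlation_id?: string; // when applicable
  error?: { type: string; message: string; stack?: string }; // on errors
  context?: Record<string, unknown>; // recommended
}

// Fills in the timestamp and converts an Error into a structured object.
function buildLogEntry(
  base: Omit<LogEntry, 'timestamp' | 'error'>,
  err?: Error,
): LogEntry {
  const entry: LogEntry = {
    timestamp: new Date().toISOString(), // toISOString() includes milliseconds
    ...base,
  };
  if (err) {
    entry.error = { type: err.name, message: err.message, stack: err.stack };
  }
  return entry;
}

const line = JSON.stringify(
  buildLogEntry(
    {
      level: 'ERROR',
      service: 'order-service',
      message: 'payment declined',
      trace_id: '4bf92f3577b34da6a3ce929f0e0e4736',
      span_id: '00f067aa0ba902b7',
      correlation_id: 'ORD-1234',
    },
    new Error('CARD_DECLINED'),
  ),
);
```

In practice a logger such as Pino would produce the JSON line; the point of the sketch is only that every entry carries the same typed fields.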

### Context Enrichment

Attach context at the middleware level so downstream logs inherit automatically:

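```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

// Per-request store; loggers read it with asyncLocalStorage.getStore()
const asyncLocalStorage = new AsyncLocalStorage<Record<string, unknown>>();

app.use((req, res, next) => {
  const ctx = {
    // Reuse the caller's trace ID when present, otherwise start a new one
    trace_id: req.headers['x-trace-id'] || randomUUID(),
    request_id: randomUUID(),
    user_id: req.user?.id,
    method: req.method,
    path: req.path,
  };
  // Everything invoked from next() runs inside this context
  asyncLocalStorage.run(ctx, () => next());
});
```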
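
To show the other half of this pattern, here is a minimal sketch of a logger that reads the active context. It assumes the context is a plain object held in an `AsyncLocalStorage`; the `requestContext` store stands in for the middleware's `asyncLocalStorage` instance, and `logInfo` and `chargePayment` are hypothetical names used only for illustration.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Assumption: the middleware stores a plain object per request.
type RequestContext = Record<string, unknown>;
const requestContext = new AsyncLocalStorage<RequestContext>();

// Hypothetical logger: merges the active request context into every entry.
function logInfo(
  message: string,
  fields: Record<string, unknown> = {},
  write: (line: string) => void = (line) => process.stdout.write(line + '\n'),
): void {
  const ctx = requestContext.getStore() ?? {}; // undefined outside a request
  write(
    JSON.stringify({
      timestamp: new Date().toISOString(),
      level: 'INFO',
      message,
      ...ctx,
      ...fields,
    }),
  );
}

// Code deep in the call stack logs without receiving the context as an argument.
function chargePayment(write: (line: string) => void): void {
  logInfo('payment charged', { amount_cents: 4200 }, write);
}

// Simulate one request and capture the output instead of printing it.
const captured: string[] = [];
requestContext.run({ trace_id: 'abc123', request_id: 'req-1' }, () => {
  chargePayment((line) => captured.push(line));
});
```

The captured line contains `trace_id` and `request_id` even though `chargePayment` never saw them. `AsyncLocalStorage` also carries the store across `await` boundaries, which is what makes the middleware approach work for asynchronous handlers.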

### Library Recommendations

| Library | Language | Strengths | Perf |
|---------|----------|-----------|------|
| **Pino** | Node.js | Fastest Node logger, low overhead | Excellent |
| **structlog** | Python | Composable processors, context binding | Good |
| **zerolog** | Go | Zero-allocation JSON logging | Excellent |
| **zap** | Go | High performance, typed fields | Excellent |
| **tracing** | Rust | Spans + events, async-aware | Excellent |

Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.

---

## Log Levels

| Level | When to Use | Example |
|-------|-------------|---------|
| **FATAL** | App cannot continue, process will exit | Database connection pool exhausted |
| **ERROR** | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| **WARN** | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| **INFO** | Normal business events | Order ORD-1234 placed successfully |
| **DEBUG** | Developer troubleshooting | Cache miss for key user:82:preferences |
| **TRACE** | Very fine-grained (rarely in prod) | Entering validateAddress with payload |

**Rules:** Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.
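
The "INFO and above" rule amounts to a severity threshold. The sketch below is one way to express it, with `setMinLevel` and `shouldLog` as hypothetical helper names; established loggers ship their own equivalent, so treat this as an illustration of the rule rather than something to adopt as-is.

```typescript
// Levels in ascending severity, matching the table above.
const LEVELS = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL'] as const;
type Level = (typeof LEVELS)[number];

// Production default; kept in a variable so it can change at runtime.
let minLevel: Level = 'INFO';

function setMinLevel(level: Level): void {
  minLevel = level;
}

// An entry is emitted only if its level is at or above the threshold.
function shouldLog(level: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(minLevel);
}

const debugByDefault = shouldLog('DEBUG'); // below INFO, so suppressed
const errorByDefault = shouldLog('ERROR'); // above INFO, so emitted

// Lowering the threshold, for example from an admin endpoint or a signal
// handler, enables DEBUG without a redeploy.
setMinLevel('DEBUG');
const debugAfterChange = shouldLog('DEBUG');
```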

---

## Distributed Tracing

### OpenTelemetry Setup

Always prefer OpenTelemetry over vendor-specific SDKs:

```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  serviceName: 'order-service',
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```

### Span Creation

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.total_cents', order.totalCents);
      await validateInventory(order);
      await chargePayment(order);
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      // `err` is `unknown` in strict TypeScript; normalise before recording
      const error = err instanceof Error ? err : new Error(String(err));
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.recordException(error);
      throw err;
    } finally {
      // Always end the span, otherwise it is never exported
      span.end();
    }
  });
}
```

### Context Propagation

- Use W3C Trace Context (`traceparent` header) — default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload
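
The async-worker case can be sketched without any SDK by formatting and parsing the header by hand. The `JobEnvelope` shape and the `enqueue`/`dequeue` helpers are hypothetical; in a real service the OpenTelemetry API's `propagation.inject` and `propagation.extract` would normally do this work. The parser below only accepts version `00` headers.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

// Serialise as a version-00 W3C `traceparent` value.
function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? '01' : '00'}`;
}

// Parse a version-00 header; returns null when the value is malformed.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Hypothetical job envelope: the producer stores the header next to the data.
interface JobEnvelope<T> {
  traceparent: string;
  data: T;
}

function enqueue<T>(data: T, ctx: TraceContext): string {
  const envelope: JobEnvelope<T> = { traceparent: formatTraceparent(ctx), data };
  return JSON.stringify(envelope);
}

// The worker restores the parent context before starting its own span.
function dequeue<T>(raw: string): { data: T; parent: TraceContext | null } {
  const envelope = JSON.parse(raw) as JobEnvelope<T>;
  return { data: envelope.data, parent: parseTraceparent(envelope.traceparent) };
}

const rawJob = enqueue(
  { orderId: 'ORD-1234' },
  { traceId: '4bf92f3577b34da6a3ce929f0e0e4736', spanId: '00f067aa0ba902b7', sampled: true },
);
const job = dequeue<{ orderId: string }>(rawJob);
```

Carrying the header inside the payload keeps the trace intact even when the queue itself has no notion of message headers.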

### Trace Sampling

| Strategy | Use When |
|----------|----------|
| **Always On** | Low-traffic services, debugging |
| **Probabilistic** (N%) | General production use |
| **Rate-limited** (N/sec) | High-throughput services |
| **Tail-based** | When you need all error traces |

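Aim to keep 100% of error traces regardless of strategy. Probabilistic and rate-limited sampling decide at the start of a trace, before any error is known, so guaranteeing this requires tail-based sampling (for example, in the OTel Collector).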
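
For the probabilistic strategy, a common approach is to derive the decision from the trace ID so that every service handling the same trace decides the same way. The sketch below illustrates that idea; `shouldSample` is a hypothetical function, and it is not a copy of any SDK's sampler.

```typescript
// Deterministic probabilistic sampling: map the first 32 bits of the trace ID
// onto [0, 2^32) and keep the trace if it falls below ratio * 2^32.
function shouldSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true; // "Always On"
  if (ratio <= 0) return false;
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 2 ** 32;
}

const lowId = '0000000f' + '0'.repeat(24);
const highId = 'fffffff0' + '0'.repeat(24);

const keepLow = shouldSample(lowId, 0.1); // small bucket value, kept at 10%
const keepHigh = shouldSample(highId, 0.1); // large bucket value, dropped at 10%
```

Because the decision depends only on the trace ID and the ratio, a downstream service configured with the same ratio never drops a trace its upstream decided to keep.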

---

## Metrics Collection

### RED Method (Request-Driven)

Monitor these three for every service endpoint:

| Metric | What It Measures | Prometheus Example |
|--------|-----------------|-------------------|
| **Rate** | Requests/sec | `rate(http_requests_total[5m])` |
| **Errors** | Failed request ratio | `sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))` |
| **Duration** | Response time | `histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))` |
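
To make the three RED signals concrete, here is a small in-memory sketch that counts requests, tracks the error ratio, and estimates a latency quantile from cumulative buckets by linear interpolation. `RedRecorder` and its bucket bounds are hypothetical; a real service would use a metrics client and let Prometheus do the aggregation.

```typescript
// Hypothetical in-memory recorder, for illustration only.
class RedRecorder {
  private total = 0;
  private errors = 0;
  private readonly bounds: number[]; // bucket upper bounds in seconds
  private readonly counts: number[]; // cumulative counts per bucket

  constructor(bounds: number[] = [0.1, 0.5, 1, 5]) {
    this.bounds = [...bounds, Infinity]; // last bucket catches everything
    this.counts = this.bounds.map(() => 0);
  }

  observe(durationSeconds: number, statusCode: number): void {
    this.total += 1;
    if (statusCode >= 500) this.errors += 1;
    // Cumulative buckets: an observation counts in every bucket it fits under.
    this.bounds.forEach((le, i) => {
      if (durationSeconds <= le) this.counts[i] += 1;
    });
  }

  requestCount(): number {
    return this.total;
  }

  errorRatio(): number {
    return this.total === 0 ? 0 : this.errors / this.total;
  }

  // Estimate a quantile (0 < q <= 1) by interpolating inside the bucket
  // that contains the target rank.
  quantile(q: number): number {
    if (this.total === 0) return NaN;
    const rank = q * this.total;
    const i = this.counts.findIndex((c) => c >= rank);
    if (this.bounds[i] === Infinity) return this.bounds[i - 1];
    const lower = i === 0 ? 0 : this.bounds[i - 1];
    const below = i === 0 ? 0 : this.counts[i - 1];
    const fraction = (rank - below) / (this.counts[i] - below);
    return lower + (this.bounds[i] - lower) * fraction;
  }
}

const red = new RedRecorder();
[0.05, 0.08, 0.3, 0.7].forEach((d) => red.observe(d, 200));
red.observe(2, 503);

const ratio = red.errorRatio(); // 1 failure out of 5 requests
const median = red.quantile(0.5); // interpolated within the 0.1 to 0.5 bucket
```

The estimate is only as precise as the bucket boundaries, which is why bucket bounds should bracket the latencies you actually care about.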

### USE Method (Resource-Driven)

For infrastructure components (CPU, memory, disk, network):

| Metric | What It Measures | Example |
|--------|-----------------|---------|
| **Utilization** | % resource busy | CPU usage at 78% |
| **Saturation** | Work queued/waiting | 12 requests queued in thread pool |
| **Errors** | Error events on resource | 3 disk I/O errors in last minute |

---

## Monitoring Stack

| Tool | Category | Best For |
|------|----------|----------|
| **Prometheus** | Metrics | Pull-based metrics, alerting rules |
| **Grafana** | Visualisation | Dashboards for metrics, logs, traces |
| **Jaeger** | Tracing | Distributed trace visualisation |
| **Loki** | Logs | Log aggregation (pairs with Grafana) |
| **OpenTelemetry** | Collection | Vendor-neutral telemetry collection |

**Recommendation:** Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.

---

## Alert Design

### Severity Levels

| Severity | Response Time | Example |
|----------|---------------|---------|
| **P1** | Immediate | Service fully down, data loss |
| **P2** | < 30 min | Error rate > 5%, latency p99 > 5s |
| **P3** | Business hours | Disk > 80%, cert expiring in 7 days |
| **P4** | Best effort | Non-critical deprecation warning |

### Alert Fatigue Prevention

- **Alert on symptoms, not causes** — "error rate > 5%" not "pod restarted"
- **Multi-window, multi-burn-rate** — catch both sudden spikes and slow burns
- **Require runbook links** — every alert must link to diagnosis and remediation
- **Review monthly** — delete or tune alerts that never fire or always fire
- **Group related alerts** — use inhibition rules to suppress child alerts
- **Set appropriate thresholds** — if alert fires daily and is ignored, raise threshold or delete
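
The multi-window, multi-burn-rate idea can be sketched as follows. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and an alert fires only when both a long and a short window exceed the same threshold. The helper names, window sizes, and thresholds here are illustrative assumptions, not values prescribed by this document.

```typescript
// Burn rate: how fast the error budget is being consumed.
// 1.0 means the budget lasts exactly the SLO period.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Hypothetical window pair: both windows must exceed the threshold.
interface WindowPair {
  name: string;
  longErrorRatio: number; // e.g. measured over 1h
  shortErrorRatio: number; // e.g. measured over 5m
  threshold: number; // burn-rate threshold for this pair
}

function firingWindows(sloTarget: number, windows: WindowPair[]): string[] {
  return windows
    .filter(
      (w) =>
        burnRate(w.longErrorRatio, sloTarget) > w.threshold &&
        burnRate(w.shortErrorRatio, sloTarget) > w.threshold,
    )
    .map((w) => w.name);
}

const slo = 0.999; // assumed target: 99.9% of requests succeed

// Sudden spike: 2% errors in both windows is roughly a 20x burn rate.
const spike = firingWindows(slo, [
  { name: 'fast (1h/5m)', longErrorRatio: 0.02, shortErrorRatio: 0.02, threshold: 14.4 },
]);

// Recovered spike: the long window is still high but the short one is clean,
// so the alert stops firing instead of paging for a resolved problem.
const recovered = firingWindows(slo, [
  { name: 'fast (1h/5m)', longErrorRatio: 0.02, shortErrorRatio: 0.0005, threshold: 14.4 },
]);
```

Pairing a fast pair with a slower, lower-threshold pair is what lets one rule set catch both sudden spikes and slow burns.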

---

## Dashboard Patterns

### Overview Dashboard ("War Room")
- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs

### Service Dashboard (Per-Service)
- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen

---

## Observability Checklist

Every service must have:

- [ ] Structured JSON logging with consistent schema
- [ ] Correlation / trace IDs propagated on all requests
- [ ] RED metrics exposed for every external endpoint
- [ ] Health check endpoints (`/healthz` and `/readyz`)
- [ ] Distributed tracing with OpenTelemetry
- [ ] Dashboards for RED metrics and resource utilisation
- [ ] Alerts for error rate, latency, and saturation with runbook links
- [ ] Log level configurable at runtime without redeployment
- [ ] PII scrubbing verified and tested
- [ ] Retention policies defined for logs, metrics, and traces
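
For the health check item, one possible shape is shown below. It assumes the common convention that `/healthz` reports liveness (the process is running) and `/readyz` reports readiness (dependencies are usable); the document does not define these semantics, and `handleHealth` and the check names are hypothetical. Checks are synchronous here for brevity, whereas real dependency probes would usually be asynchronous.

```typescript
type ReadinessCheck = () => boolean;

interface HealthResult {
  status: number;
  body: Record<string, unknown>;
}

// Pure handler so it can be wired into any HTTP framework.
function handleHealth(
  path: string,
  checks: Record<string, ReadinessCheck>,
): HealthResult {
  if (path === '/healthz') {
    // Liveness: answering at all means the process is alive.
    return { status: 200, body: { status: 'ok' } };
  }
  if (path === '/readyz') {
    // Readiness: report which dependencies are failing, if any.
    const failed = Object.entries(checks)
      .filter(([, check]) => !check())
      .map(([name]) => name);
    return failed.length === 0
      ? { status: 200, body: { status: 'ready' } }
      : { status: 503, body: { status: 'not_ready', failed } };
  }
  return { status: 404, body: { status: 'not_found' } };
}

const live = handleHealth('/healthz', {});
const ready = handleHealth('/readyz', { database: () => true, cache: () => true });
const notReady = handleHealth('/readyz', { database: () => false, cache: () => true });
```

Keeping liveness free of dependency checks avoids restart loops when a downstream system is the one that is unhealthy.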

## Anti-Patterns

| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |
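
The high-cardinality fix usually comes down to normalising values before they become labels. The sketch below shows one way to do that for routes and status codes; `normaliseRoute` and `statusClass` are hypothetical helpers, and the ID patterns (numeric segments and UUIDs) are assumptions about what the paths look like.

```typescript
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Replace per-entity path segments with a placeholder so the label has a
// bounded set of values.
function normaliseRoute(path: string): string {
  return path
    .split('?')[0] // drop the query string
    .split('/')
    .map((segment) => (/^\d+$/.test(segment) || UUID.test(segment) ? ':id' : segment))
    .join('/');
}

// Collapse status codes into five classes instead of dozens of values.
function statusClass(statusCode: number): string {
  return `${Math.floor(statusCode / 100)}xx`;
}

const labels = {
  route: normaliseRoute('/api/orders/12345?expand=items'),
  status: statusClass(503),
};
```

When the framework exposes the matched route template, using that directly is more reliable than pattern matching on the raw path.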

## NEVER Do

1. **NEVER log passwords, tokens, API keys, or secrets** — even at DEBUG level
2. **NEVER use console.log / print in production** — use a structured logger
3. **NEVER use user IDs, emails, or request IDs as metric labels** — cardinality will explode
4. **NEVER create alerts without a runbook link** — unactionable alerts erode trust
5. **NEVER rely on logs alone** — you need metrics and traces for full observability
6. **NEVER log request/response bodies by default** — opt-in only, with PII redaction
7. **NEVER ignore log volume** — set budgets and alert when a service exceeds daily quota
8. **NEVER skip context propagation in async flows** — broken traces are worse than no traces
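
Rules 1 and 6 are easier to enforce when redaction runs inside the logging pipeline rather than at each call site. The following is a minimal sketch of key-based redaction; the `redact` function and the key pattern are assumptions for illustration, and key matching alone will not catch secrets embedded in free-text values.

```typescript
// Assumed pattern for sensitive field names; extend it for your own schema.
const SENSITIVE_KEYS = /pass(word)?|secret|token|api[_-]?key|authorization/i;

// Recursively replace the values of sensitive keys before serialisation.
// Handles plain objects and arrays only; other object types are not preserved.
function redact(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, inner]) =>
        SENSITIVE_KEYS.test(key) ? [key, '[REDACTED]'] : [key, redact(inner)],
      ),
    );
  }
  return value;
}

const safe = redact({
  user_id: 'u_82',
  password: 'hunter2',
  headers: { Authorization: 'Bearer abc', accept: 'application/json' },
  items: [{ sku: 'A1', api_key: 'k-123' }],
}) as Record<string, any>;
```

Running this as a logger processor (for example a structlog processor or a Pino serializer) means individual call sites cannot forget it.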

Overview

This skill provides patterns for structured logging, distributed tracing, and metrics collection to build observable systems. It focuses on practical rules: JSON logs with correlation IDs, OpenTelemetry-based tracing, RED/USE metrics methods, and operational guidance for alerts and dashboards. Use it to design, review, or implement observability for microservices and infrastructure.

How this skill works

The skill prescribes a consistent, structured JSON logging schema (timestamp, level, service, trace_id, span_id, etc.) and middleware-based context enrichment so logs inherit trace and request metadata. It recommends OpenTelemetry for traces with W3C Trace Context propagation and sampling strategies, and Prometheus/Grafana for metrics and dashboards. It also covers alert severity, runbooks, PII/secret scrubbing, and common anti-patterns to avoid.

When to use it

  • Implementing or standardizing logging across services
  • Setting up distributed tracing with OpenTelemetry
  • Designing metrics using RED for services or USE for resources
  • Configuring alerts, runbooks, and Grafana dashboards
  • Reviewing observability maturity or onboarding new services

Best practices

  • Emit structured JSON logs from the service runtime; avoid free-text logs
  • Propagate trace_id/traceparent across HTTP, gRPC, and message queues
  • Sample traces probabilistically in production, and use tail-based sampling where you need to keep 100% of error traces
  • Use RED for request metrics and USE for infrastructure; avoid high-cardinality labels
  • Scrub or tokenise PII and secrets before logging; never log credentials
  • Attach runbook links to every alert and review thresholds monthly to reduce noise

Example use cases

  • Add middleware that injects trace_id and request_id into async context so all logs include correlation fields
  • Deploy OpenTelemetry collector and export traces to Jaeger while sending metrics to Prometheus
  • Design a per-service Grafana dashboard that surfaces RED metrics, p99 latency, and active alerts
  • Create alert rules for sustained error-rate increases with linked runbooks and multi-window burn-rate checks
  • Audit an existing service for anti-patterns: PII in logs, high-cardinality metric labels, or missing trace propagation

FAQ

Which logger should I pick for Python?

Use a structured JSON-capable logger such as structlog or a logging adapter that emits JSON natively; prefer minimal overhead and context binding.

How do I avoid Prometheus cardinality problems?

Never use user identifiers, emails, or request IDs as metric labels. Limit label cardinality and aggregate on safe dimensions like route or status code.