home / skills / openclaw / skills / logging-observability
This skill helps you implement structured logging, tracing, and metrics to build observable Python services with OpenTelemetry and Prometheus.
npx playbooks add skill openclaw/skills --skill logging-observabilityReview the files below or copy the command above to add this skill to your agents.
---
name: logging-observability
model: standard
description: Structured logging, distributed tracing, and metrics collection patterns for building observable systems. Use when implementing logging infrastructure, setting up distributed tracing with OpenTelemetry, designing metrics collection (RED/USE methods), configuring alerting and dashboards, or reviewing observability practices. Covers structured JSON logging, context propagation, trace sampling, Prometheus/Grafana stack, alert design, and PII/secret scrubbing.
version: 1.0.0
---
# Logging & Observability
Patterns for building observable systems across the three pillars: logs, metrics, and traces.
## Three Pillars
| Pillar | Purpose | Question It Answers | Example |
|--------|---------|---------------------|---------|
| **Logs** | What happened | Why did this request fail? | `{"level":"error","msg":"payment declined","user_id":"u_82"}` |
| **Metrics** | How much / how fast | Is latency increasing? | `http_request_duration_seconds{route="/api/orders"} 0.342` |
| **Traces** | Request flow | Where is the bottleneck? | Span: `api-gateway → auth → order-service → db` |
Each pillar is strongest when correlated. Embed `trace_id` in every log line to jump from a log entry to the full distributed trace.
---
## Structured Logging
Always emit logs as structured JSON — never free-text strings.
### Required Fields
| Field | Purpose | Required |
|-------|---------|----------|
| `timestamp` | ISO-8601 with milliseconds | Yes |
| `level` | Severity (DEBUG … FATAL) | Yes |
| `service` | Originating service name | Yes |
| `message` | Human-readable description | Yes |
| `trace_id` | Distributed trace correlation | Yes |
| `span_id` | Current span within trace | Yes |
| `correlation_id` | Business-level correlation (order ID) | When applicable |
| `error` | Structured error object | On errors |
| `context` | Request-specific metadata | Recommended |
### Context Enrichment
Attach context at the middleware level so downstream logs inherit automatically:
```typescript
app.use((req, res, next) => {
const ctx = {
trace_id: req.headers['x-trace-id'] || crypto.randomUUID(),
request_id: crypto.randomUUID(),
user_id: req.user?.id,
method: req.method,
path: req.path,
};
asyncLocalStorage.run(ctx, () => next());
});
```
### Library Recommendations
| Library | Language | Strengths | Perf |
|---------|----------|-----------|------|
| **Pino** | Node.js | Fastest Node logger, low overhead | Excellent |
| **structlog** | Python | Composable processors, context binding | Good |
| **zerolog** | Go | Zero-allocation JSON logging | Excellent |
| **zap** | Go | High performance, typed fields | Excellent |
| **tracing** | Rust | Spans + events, async-aware | Excellent |
Choose a logger that outputs structured JSON natively. Avoid loggers requiring post-processing.
---
## Log Levels
| Level | When to Use | Example |
|-------|-------------|---------|
| **FATAL** | App cannot continue, process will exit | Database connection pool exhausted |
| **ERROR** | Operation failed, needs attention | Payment charge failed: CARD_DECLINED |
| **WARN** | Unexpected but recoverable | Retry 2/3 for upstream timeout |
| **INFO** | Normal business events | Order ORD-1234 placed successfully |
| **DEBUG** | Developer troubleshooting | Cache miss for key user:82:preferences |
| **TRACE** | Very fine-grained (rarely in prod) | Entering validateAddress with payload |
**Rules:** Production default = INFO and above. If you log an ERROR, someone should act on it. Every FATAL should trigger an alert.
---
## Distributed Tracing
### OpenTelemetry Setup
Always prefer OpenTelemetry over vendor-specific SDKs:
```typescript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
serviceName: 'order-service',
traceExporter: new OTLPTraceExporter({
url: 'http://otel-collector:4318/v1/traces',
}),
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
```
### Span Creation
```typescript
const tracer = trace.getTracer('order-service');
async function processOrder(order: Order) {
return tracer.startActiveSpan('processOrder', async (span) => {
try {
span.setAttribute('order.id', order.id);
span.setAttribute('order.total_cents', order.totalCents);
await validateInventory(order);
await chargePayment(order);
span.setStatus({ code: SpanStatusCode.OK });
} catch (err) {
span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
span.recordException(err);
throw err;
} finally {
span.end();
}
});
}
```
### Context Propagation
- Use W3C Trace Context (`traceparent` header) — default in OTel
- Propagate across HTTP, gRPC, and message queues
- For async workers: serialise `traceparent` into the job payload
### Trace Sampling
| Strategy | Use When |
|----------|----------|
| **Always On** | Low-traffic services, debugging |
| **Probabilistic** (N%) | General production use |
| **Rate-limited** (N/sec) | High-throughput services |
| **Tail-based** | When you need all error traces |
Always sample 100% of error traces regardless of strategy.
---
## Metrics Collection
### RED Method (Request-Driven)
Monitor these three for every service endpoint:
| Metric | What It Measures | Prometheus Example |
|--------|-----------------|-------------------|
| **Rate** | Requests/sec | `rate(http_requests_total[5m])` |
| **Errors** | Failed request ratio | `rate(http_requests_total{status=~"5.."}[5m])` |
| **Duration** | Response time | `histogram_quantile(0.99, http_request_duration_seconds)` |
### USE Method (Resource-Driven)
For infrastructure components (CPU, memory, disk, network):
| Metric | What It Measures | Example |
|--------|-----------------|---------|
| **Utilization** | % resource busy | CPU usage at 78% |
| **Saturation** | Work queued/waiting | 12 requests queued in thread pool |
| **Errors** | Error events on resource | 3 disk I/O errors in last minute |
---
## Monitoring Stack
| Tool | Category | Best For |
|------|----------|----------|
| **Prometheus** | Metrics | Pull-based metrics, alerting rules |
| **Grafana** | Visualisation | Dashboards for metrics, logs, traces |
| **Jaeger** | Tracing | Distributed trace visualisation |
| **Loki** | Logs | Log aggregation (pairs with Grafana) |
| **OpenTelemetry** | Collection | Vendor-neutral telemetry collection |
**Recommendation:** Start with OTel Collector → Prometheus + Grafana + Loki + Jaeger. Migrate to SaaS only when operational overhead justifies cost.
---
## Alert Design
### Severity Levels
| Severity | Response Time | Example |
|----------|---------------|---------|
| **P1** | Immediate | Service fully down, data loss |
| **P2** | < 30 min | Error rate > 5%, latency p99 > 5s |
| **P3** | Business hours | Disk > 80%, cert expiring in 7 days |
| **P4** | Best effort | Non-critical deprecation warning |
### Alert Fatigue Prevention
- **Alert on symptoms, not causes** — "error rate > 5%" not "pod restarted"
- **Multi-window, multi-burn-rate** — catch both sudden spikes and slow burns
- **Require runbook links** — every alert must link to diagnosis and remediation
- **Review monthly** — delete or tune alerts that never fire or always fire
- **Group related alerts** — use inhibition rules to suppress child alerts
- **Set appropriate thresholds** — if alert fires daily and is ignored, raise threshold or delete
---
## Dashboard Patterns
### Overview Dashboard ("War Room")
- Total requests/sec across all services
- Global error rate (%) with trendline
- p50 / p95 / p99 latency
- Active alerts count by severity
- Deployment markers overlaid on graphs
### Service Dashboard (Per-Service)
- RED metrics for each endpoint
- Dependency health (upstream/downstream success rates)
- Resource utilisation (CPU, memory, connections)
- Top errors table with count and last seen
---
## Observability Checklist
Every service must have:
- [ ] Structured JSON logging with consistent schema
- [ ] Correlation / trace IDs propagated on all requests
- [ ] RED metrics exposed for every external endpoint
- [ ] Health check endpoints (`/healthz` and `/readyz`)
- [ ] Distributed tracing with OpenTelemetry
- [ ] Dashboards for RED metrics and resource utilisation
- [ ] Alerts for error rate, latency, and saturation with runbook links
- [ ] Log level configurable at runtime without redeployment
- [ ] PII scrubbing verified and tested
- [ ] Retention policies defined for logs, metrics, and traces
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
|-------------|---------|-----|
| Logging PII | Privacy/compliance violation | Mask or exclude PII; use token references |
| Excessive logging | Storage costs balloon, signal drowns | Log business events, not data flow |
| Unstructured logs | Cannot query or alert on fields | Use structured JSON with consistent schema |
| String interpolation | Breaks structured fields, injection risk | Pass fields as metadata, not in message |
| Missing correlation IDs | Cannot trace across services | Generate and propagate trace_id everywhere |
| Alert storms | On-call fatigue, real issues buried | Use grouping, inhibition, deduplication |
| Metrics with high cardinality | Prometheus OOM, dashboard timeouts | Never use user ID or request ID as label |
## NEVER Do
1. **NEVER log passwords, tokens, API keys, or secrets** — even at DEBUG level
2. **NEVER use console.log / print in production** — use a structured logger
3. **NEVER use user IDs, emails, or request IDs as metric labels** — cardinality will explode
4. **NEVER create alerts without a runbook link** — unactionable alerts erode trust
5. **NEVER rely on logs alone** — you need metrics and traces for full observability
6. **NEVER log request/response bodies by default** — opt-in only, with PII redaction
7. **NEVER ignore log volume** — set budgets and alert when a service exceeds daily quota
8. **NEVER skip context propagation in async flows** — broken traces are worse than no traces
This skill provides battle-tested patterns for structured logging, distributed tracing, and metrics collection to build observable systems. It focuses on practical rules: JSON logs with correlation IDs, OpenTelemetry-based tracing, RED/USE metrics methods, and operational guidance for alerts and dashboards. Use it to design, review, or implement observability for microservices and infrastructure.
The skill prescribes a consistent, structured JSON logging schema (timestamp, level, service, trace_id, span_id, etc.) and middleware-based context enrichment so logs inherit trace and request metadata. It recommends OpenTelemetry for traces with W3C Trace Context propagation and sampling strategies, and Prometheus/Grafana for metrics and dashboards. It also covers alert severity, runbooks, PII/secret scrubbing, and common anti-patterns to avoid.
Which logger should I pick for Python?
Use a structured JSON-capable logger such as structlog or a logging adapter that emits JSON natively; prefer minimal overhead and context binding.
How do I avoid Prometheus cardinality problems?
Never use user identifiers, emails, or request IDs as metric labels. Limit label cardinality and aggregate on safe dimensions like route or status code.