observability-monitoring skill

This skill helps you implement structured logging, metrics, tracing, and alerting to improve system visibility and reliability.

npx playbooks add skill ariegoldkin/ai-agent-hub --skill observability-monitoring

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
---
name: Observability & Monitoring
description: Structured logging, metrics, distributed tracing, and alerting strategies
version: 1.0.0
category: Operations & Reliability
agents: [backend-system-architect, code-quality-reviewer, ai-ml-engineer]
keywords: [observability, monitoring, logging, metrics, tracing, alerts, Prometheus, OpenTelemetry]
---

# Observability & Monitoring Skill

Comprehensive frameworks for implementing observability including structured logging, metrics, distributed tracing, and alerting.

## When to Use

- Setting up application monitoring
- Implementing structured logging
- Adding metrics and dashboards
- Configuring distributed tracing
- Creating alerting rules
- Debugging production issues

## Three Pillars of Observability

```
┌─────────────────┬─────────────────┬─────────────────┐
│     LOGS        │     METRICS     │     TRACES      │
├─────────────────┼─────────────────┼─────────────────┤
│ What happened   │ How is system   │ How do requests │
│ at specific     │ performing      │ flow through    │
│ point in time   │ over time       │ services        │
└─────────────────┴─────────────────┴─────────────────┘
```

## Structured Logging

### Log Levels

| Level | Use Case |
|-------|----------|
| **ERROR** | Unhandled exceptions, failed operations |
| **WARN** | Deprecated API, retry attempts |
| **INFO** | Business events, successful operations |
| **DEBUG** | Development troubleshooting |

### Best Practice

```typescript
// Good: Structured with context
logger.info('User action completed', {
  action: 'purchase',
  userId: user.id,
  orderId: order.id,
  duration_ms: 150
});

// Bad: String interpolation
logger.info(`User ${user.id} completed purchase`);
```

> See `templates/structured-logging.ts` for Winston setup and request middleware
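To make the JSON log shape concrete, here is a minimal structured-logger sketch. It is illustrative only: the bundled template uses Winston, and the `makeLogger` helper and `checkout` service name below are hypothetical.

```typescript
// Minimal structured logger sketch (the bundled template uses Winston;
// this only illustrates the JSON entry shape and contextual fields).
type LogContext = Record<string, unknown>;

function makeLogger(service: string) {
  const emit = (level: string, message: string, context: LogContext = {}) => {
    const entry = {
      timestamp: new Date().toISOString(),
      level,
      service,
      message,
      ...context, // structured context merges into the entry, stays queryable
    };
    console.log(JSON.stringify(entry));
    return entry; // returned so callers/tests can inspect the emitted entry
  };
  return {
    info: (msg: string, ctx?: LogContext) => emit('info', msg, ctx),
    warn: (msg: string, ctx?: LogContext) => emit('warn', msg, ctx),
    error: (msg: string, ctx?: LogContext) => emit('error', msg, ctx),
  };
}

const logger = makeLogger('checkout');
logger.info('User action completed', { action: 'purchase', duration_ms: 150 });
```

Because context is merged as fields rather than interpolated into the message, log aggregators can filter on `action` or `duration_ms` directly.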

## Metrics Collection

### RED Method (Rate, Errors, Duration)

Essential metrics for any service:
- **Rate** - Requests per second
- **Errors** - Failed requests per second
- **Duration** - Request latency distribution

### Prometheus Buckets

```typescript
// HTTP request latency (seconds)
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency (seconds)
buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]
```

> See `templates/prometheus-metrics.ts` for full metrics configuration
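To build intuition for what those bucket boundaries do, here is a minimal sketch of Prometheus's cumulative-histogram semantics. The `MiniHistogram` class is a teaching aid, not part of the templates; in practice `prom-client` maintains these counters for you.

```typescript
// Minimal cumulative histogram sketch: each observation increments every
// bucket whose upper bound (`le`) is >= the value, plus the implicit +Inf
// bucket. This is how Prometheus histograms count latencies.
class MiniHistogram {
  private counts: number[];

  constructor(private buckets: number[]) {
    this.counts = new Array(buckets.length + 1).fill(0); // last slot = +Inf
  }

  observe(valueSeconds: number): void {
    this.buckets.forEach((le, i) => {
      if (valueSeconds <= le) this.counts[i]++;
    });
    this.counts[this.counts.length - 1]++; // +Inf always counts
  }

  countFor(le: number): number {
    const i = this.buckets.indexOf(le);
    return i === -1 ? this.counts[this.counts.length - 1] : this.counts[i];
  }
}

const h = new MiniHistogram([0.01, 0.05, 0.1, 0.5, 1, 2, 5]);
[0.03, 0.2, 0.7].forEach((v) => h.observe(v));
```

With those three observations, the `le=0.05` bucket holds 1, `le=0.5` holds 2, and `le=1` holds 3, which is exactly what PromQL's `histogram_quantile` consumes.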

## Distributed Tracing

### OpenTelemetry Setup

Auto-instrument common libraries:
- Express/HTTP
- PostgreSQL
- Redis
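A typical auto-instrumentation bootstrap looks like the following sketch. The package names are real OpenTelemetry packages, but the service name and exporter endpoint are illustrative and depend on your deployment; see the template for the full setup.

```typescript
// Sketch of an OpenTelemetry NodeSDK bootstrap with auto-instrumentation.
// Service name and exporter endpoint are illustrative assumptions.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: 'orders-service', // illustrative
  traceExporter: new OTLPTraceExporter(), // defaults to localhost:4318
  // Picks up Express/HTTP, pg, redis (and others) automatically
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```

Import this file before the rest of the application so the instrumentation patches libraries before they are loaded.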

### Manual Spans

```typescript
await tracer.startActiveSpan('processOrder', async (span) => {
  try {
    span.setAttribute('order.id', orderId);
    // ... work
  } finally {
    span.end(); // always end the span, even if the work throws
  }
});
```

> See `templates/opentelemetry-tracing.ts` for full setup

## Alerting Strategy

### Severity Levels

| Level | Response Time | Examples |
|-------|---------------|----------|
| **Critical (P1)** | < 15 min | Service down, data loss |
| **High (P2)** | < 1 hour | Major feature broken |
| **Medium (P3)** | < 4 hours | Increased error rate |
| **Low (P4)** | Next day | Warnings |

### Key Alerts

| Alert | Condition | Severity |
|-------|-----------|----------|
| ServiceDown | `up == 0` for 1m | Critical |
| HighErrorRate | 5xx > 5% for 5m | Critical |
| HighLatency | p95 > 2s for 5m | High |
| LowCacheHitRate | < 70% for 10m | Medium |

> See `templates/alerting-rules.yml` for Prometheus alerting rules

## Health Checks

### Kubernetes Probes

| Probe | Purpose | Endpoint |
|-------|---------|----------|
| **Liveness** | Is app running? | `/health` |
| **Readiness** | Ready for traffic? | `/ready` |
| **Startup** | Finished starting? | `/startup` |

### Readiness Response

```json
{
  "status": "healthy|degraded|unhealthy",
  "checks": {
    "database": { "status": "pass", "latency_ms": 5 },
    "redis": { "status": "pass", "latency_ms": 2 }
  },
  "version": "1.0.0",
  "uptime": 3600
}
```

> See `templates/health-checks.ts` for implementation
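The degraded-vs-unhealthy distinction above can be computed with a small aggregation function. This is a sketch; `aggregateHealth` is a hypothetical helper, and the bundled template wires it to real database and Redis pings.

```typescript
// Sketch: aggregate dependency check results into a readiness response.
// 'degraded' = some (but not all) checks failing; 'unhealthy' = all failing.
type CheckResult = { status: 'pass' | 'fail'; latency_ms: number };

function aggregateHealth(
  checks: Record<string, CheckResult>,
  uptimeSeconds = 0 // caller supplies e.g. Math.floor(process.uptime())
) {
  const results = Object.values(checks);
  const failed = results.filter((c) => c.status === 'fail').length;
  const status =
    failed === 0 ? 'healthy' : failed < results.length ? 'degraded' : 'unhealthy';
  return { status, checks, version: '1.0.0', uptime: uptimeSeconds };
}
```

A readiness endpoint would run each dependency ping, pass the results to this function, and return HTTP 200 for `healthy`/`degraded` or 503 for `unhealthy`.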

## Observability Checklist

### Implementation
- [ ] JSON structured logging
- [ ] Request correlation IDs
- [ ] RED metrics (Rate, Errors, Duration)
- [ ] Business metrics
- [ ] Distributed tracing
- [ ] Health check endpoints

### Alerting
- [ ] Service outage alerts
- [ ] Error rate thresholds
- [ ] Latency thresholds
- [ ] Resource utilization alerts

### Dashboards
- [ ] Service overview
- [ ] Error analysis
- [ ] Performance metrics

## Extended Thinking Triggers

Use Opus 4.5 extended thinking for:
- **Incident investigation** - Correlating logs, metrics, traces
- **Alert tuning** - Reducing noise, catching real issues
- **Architecture decisions** - Choosing monitoring solutions
- **Performance debugging** - Cross-service latency analysis

## Templates Reference

| Template | Purpose |
|----------|---------|
| `structured-logging.ts` | Winston logger with request middleware |
| `prometheus-metrics.ts` | HTTP, DB, cache metrics with middleware |
| `opentelemetry-tracing.ts` | Distributed tracing setup |
| `alerting-rules.yml` | Prometheus alerting rules |
| `health-checks.ts` | Liveness, readiness, startup probes |

Overview

This skill provides a practical framework for implementing observability across services using structured logging, metrics, distributed tracing, and alerting strategies. It bundles concrete patterns, telemetry templates, and checklist-driven guidance to get production monitoring working reliably. The focus is on actionable implementations in TypeScript ecosystems.

How this skill works

The skill codifies best practices for logs, metrics, and traces: structured JSON logging with context, RED metrics with Prometheus histogram buckets, OpenTelemetry tracing with auto-instrumentation and manual spans, and alerting rules for key incidents. It includes health check patterns and templates for wiring these concerns into Node/Express services and Kubernetes probes.

When to use it

  • Setting up or standardizing application monitoring for new or existing services
  • Adding structured logging and request correlation to improve diagnostics
  • Instrumenting services with RED metrics and Prometheus-compatible histograms
  • Enabling distributed tracing with OpenTelemetry for cross-service latency analysis
  • Defining alerting rules and severity levels for production incidents
  • Creating health checks and readiness/liveness probes for Kubernetes deployments

Best practices

  • Log as structured JSON with contextual fields (userId, requestId, duration_ms) rather than string interpolation
  • Capture RED metrics (Rate, Errors, Duration) plus key business metrics and use appropriate Prometheus buckets
  • Auto-instrument standard libraries and add manual spans around business-critical operations
  • Attach correlation IDs to requests and propagate them through logs, metrics, and traces
  • Tune alerts by severity to reduce noise: start with critical service-down and high error-rate rules, then iterate
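The correlation-ID practice above can be sketched as lightweight middleware. The header name and the `Req`/`Res` shapes are illustrative assumptions; an Express app would use its own `Request`/`Response` types.

```typescript
// Sketch of correlation-ID middleware using Node's stdlib UUID.
// Header name ('x-correlation-id') and request/response shapes are
// illustrative; adapt to your framework's types.
import { randomUUID } from 'node:crypto';

type Req = { headers: Record<string, string | undefined>; correlationId?: string };
type Res = { setHeader: (name: string, value: string) => void };

function correlationId(req: Req, res: Res, next: () => void): void {
  // Reuse an incoming ID so one request is traceable across services;
  // otherwise mint a fresh one at the edge.
  const id = req.headers['x-correlation-id'] ?? randomUUID();
  req.correlationId = id;
  res.setHeader('x-correlation-id', id);
  next();
}
```

Downstream, the logger and tracer read `req.correlationId` so every log line, metric label, and span for a request shares one ID.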

Example use cases

  • Implement a Winston-based structured logger and request middleware that emits JSON logs with correlation IDs
  • Configure Prometheus metrics: request counter, error counter, and latency histogram with recommended buckets
  • Integrate OpenTelemetry to auto-instrument Express, PostgreSQL, and Redis, and add manual spans for payment processing
  • Create Prometheus alerting rules for ServiceDown, HighErrorRate, and HighLatency with escalation severities
  • Add Kubernetes liveness, readiness, and startup probes and expose a health endpoint that reports dependency statuses

FAQ

What are the three pillars of observability?

Logs (what happened at a point in time), Metrics (how the system performs over time), and Traces (how requests flow through services).

Which metrics should I implement first?

Start with RED: Rate (requests/sec), Errors (failed requests/sec), and Duration (latency distribution). Add business metrics after baseline service metrics are stable.