---
name: "Metrics & Monitoring"
description: "Implement application metrics (RED, USE), alerting strategies, and monitoring dashboards"
category: "observability"
required_tools: ["Read", "Write", "Bash"]
---
# Metrics & Monitoring
## Purpose
Instrument applications with meaningful metrics, set up monitoring dashboards, and configure alerts to detect issues before users do.
## When to Use
- Deploying to production
- Performance monitoring
- Capacity planning
- Incident detection and response
- SLA/SLO tracking
- Understanding system behavior
## Key Capabilities
1. **Metric Collection** - Instrument code with RED, USE, Four Golden Signals
2. **Dashboard Creation** - Visualize system health and trends
3. **Alerting** - Detect anomalies and trigger notifications
## Approach
1. **Choose Metric Methodology**
   - **RED**: Rate, Errors, Duration (for services/requests)
   - **USE**: Utilization, Saturation, Errors (for resources)
   - **Four Golden Signals**: Latency, Traffic, Errors, Saturation
2. **Instrument Application**
   - Add counters for events (requests, errors)
   - Add gauges for current values (connections, memory)
   - Add histograms for distributions (latency)
   - Add summaries for quantiles (p95, p99)
3. **Set Up Collection**
   - Prometheus for metrics
   - StatsD for application metrics (see the sketch after this list)
   - CloudWatch for AWS
   - DataDog for full-stack
4. **Create Dashboards**
   - System overview (health at a glance)
   - Service-specific (RED metrics per endpoint)
   - Resource usage (USE metrics)
   - Business metrics (orders, revenue)
5. **Configure Alerts**
   - Error rate > threshold
   - Latency > SLO
   - Resource saturation > 80%
   - Service unavailable
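As an illustration of the StatsD option in step 3, the sketch below emits the same RED-style signals through the `statsd` client library instead of Prometheus. It assumes a StatsD-compatible agent listening on `localhost:8125`; the metric names and the `handle_request` helper are illustrative, not part of any framework.
```python
# Minimal RED-style instrumentation with the statsd client library.
# Assumes a StatsD-compatible agent listening on localhost:8125.
import time

import statsd

client = statsd.StatsClient(host='localhost', port=8125, prefix='web_api')

def handle_request(endpoint, handler):
    """Record rate, errors, and duration for a single request."""
    client.incr(f'requests.{endpoint}')                   # Rate
    start = time.time()
    try:
        return handler()
    except Exception:
        client.incr(f'errors.{endpoint}')                 # Errors
        raise
    finally:
        elapsed_ms = (time.time() - start) * 1000
        client.timing(f'latency.{endpoint}', elapsed_ms)  # Duration
```
Unlike Prometheus's pull model, StatsD pushes each datapoint over UDP, so no `/metrics` endpoint is required.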
## Example
**Context**: Monitoring a web API with Prometheus
```python
from prometheus_client import Counter, Histogram, Gauge, Summary
from flask import Flask, request
import time
import psutil

app = Flask(__name__)

# RED Metrics (Rate, Errors, Duration)

# Rate: Request count
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Errors: Error count
error_count = Counter(
    'http_errors_total',
    'Total HTTP errors',
    ['method', 'endpoint', 'error_type']
)

# Duration: Request latency
request_latency = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
)

# Alternative: Summary (note: the Python client exposes count and sum only,
# it does not calculate quantiles)
request_latency_summary = Summary(
    'http_request_duration_summary',
    'HTTP request latency summary',
    ['method', 'endpoint']
)

# USE Metrics (Utilization, Saturation, Errors)

# Utilization: Current resource usage
cpu_usage = Gauge('cpu_usage_percent', 'CPU usage percentage')
memory_usage = Gauge('memory_usage_bytes', 'Memory usage in bytes')
disk_usage = Gauge('disk_usage_percent', 'Disk usage percentage')

# Saturation: Queue depths, connection pools
db_connection_pool_usage = Gauge(
    'db_connection_pool_usage',
    'Database connections in use'
)
db_connection_pool_max = Gauge(
    'db_connection_pool_max',
    'Maximum database connections'
)

# Application-specific metrics
active_users = Gauge('active_users', 'Currently active users')
cache_hits = Counter('cache_hits_total', 'Cache hits')
cache_misses = Counter('cache_misses_total', 'Cache misses')

# Business metrics
orders_total = Counter('orders_total', 'Total orders', ['status'])
revenue_total = Counter('revenue_total', 'Total revenue in cents')
# Middleware to track requests
@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    # Track request
    method = request.method
    endpoint = request.endpoint or 'unknown'
    status = str(response.status_code)

    # Update metrics
    request_count.labels(method, endpoint, status).inc()

    # Track latency
    if hasattr(request, 'start_time'):
        duration = time.time() - request.start_time
        request_latency.labels(method, endpoint).observe(duration)
        request_latency_summary.labels(method, endpoint).observe(duration)

    return response

# Track errors
@app.errorhandler(Exception)
def handle_error(error):
    method = request.method
    endpoint = request.endpoint or 'unknown'
    error_type = type(error).__name__

    error_count.labels(method, endpoint, error_type).inc()
    # Note: after_request still runs on the 500 response it produces,
    # so the request itself is counted there (no double counting here).

    return {'error': str(error)}, 500

# Expose metrics endpoint
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

# Background job to update resource metrics
import threading

def update_system_metrics():
    while True:
        # CPU usage
        cpu_percent = psutil.cpu_percent(interval=1)
        cpu_usage.set(cpu_percent)

        # Memory usage
        memory = psutil.virtual_memory()
        memory_usage.set(memory.used)

        # Disk usage
        disk = psutil.disk_usage('/')
        disk_usage.set(disk.percent)

        time.sleep(15)  # Update every 15 seconds

# Start background metrics updater
metrics_thread = threading.Thread(target=update_system_metrics, daemon=True)
metrics_thread.start()

# Example: Tracking business metrics
@app.route('/api/orders', methods=['POST'])
def create_order():
    try:
        order_data = request.json
        # Process order (process_order is the application's own logic)
        order = process_order(order_data)

        # Track metrics
        orders_total.labels(status='success').inc()
        revenue_total.inc(order.amount_cents)

        return {'order_id': order.id}, 201
    except Exception:
        orders_total.labels(status='failed').inc()
        raise
```
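**Verifying the Instrumentation** (optional sketch): a quick local check that the middleware above actually exposes metrics, using Flask's built-in test client. It assumes the example is saved as `app.py`; the module name and test function are illustrative.
```python
# Quick local check that the instrumentation above is wired up correctly.
# Assumes the example code is saved as app.py in the same directory.
from app import app

def test_metrics_endpoint_exposes_request_counts():
    client = app.test_client()

    # Hit the app once so the labelled counters and histograms are created
    client.get('/metrics')

    response = client.get('/metrics')
    body = response.get_data(as_text=True)

    assert response.status_code == 200
    assert 'http_requests_total' in body
    assert 'http_request_duration_seconds_bucket' in body
```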
**Prometheus Configuration** (`prometheus.yml`):
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'web-api'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:5000']

# Alerting rules
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
```
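**Checking the Scrape Target** (optional sketch): once Prometheus is scraping, its HTTP API can be queried to confirm data is flowing. The sketch below uses the `requests` library against the default Prometheus port (`localhost:9090`); the `query_prometheus` helper is illustrative.
```python
# Sanity-check that Prometheus is scraping the service, via its HTTP API.
# Assumes Prometheus is running on its default port (localhost:9090).
import requests

PROMETHEUS_URL = 'http://localhost:9090'

def query_prometheus(promql):
    """Run an instant query and return the result list."""
    response = requests.get(
        f'{PROMETHEUS_URL}/api/v1/query',
        params={'query': promql},
        timeout=5,
    )
    response.raise_for_status()
    payload = response.json()
    if payload['status'] != 'success':
        raise RuntimeError(f'Query failed: {payload}')
    return payload['data']['result']

# Request rate over the last 5 minutes, per endpoint
for series in query_prometheus('sum(rate(http_requests_total[5m])) by (endpoint)'):
    endpoint = series['metric'].get('endpoint', 'unknown')
    rate = float(series['value'][1])
    print(f'{endpoint}: {rate:.2f} req/s')
```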
**Alert Rules** (`alerts.yml`):
```yaml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate ({{ $value | humanizePercentage }})"
          description: "Error rate is above 5% for 5 minutes"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint)
          ) > 1.0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.endpoint }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

      # High CPU usage
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage ({{ $value }}%)"
          description: "CPU usage above 80% for 10 minutes"

      # Database connection pool exhaustion
      - alert: DBConnectionPoolNearLimit
        expr: |
          db_connection_pool_usage / db_connection_pool_max > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool near limit"
          description: "Using {{ $value | humanizePercentage }} of connection pool"
```
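**Receiving Alert Notifications** (optional sketch): the Alertmanager instance referenced above (`localhost:9093`) can route firing alerts to a webhook. The minimal Flask receiver below assumes an Alertmanager route with `webhook_configs` pointing at `http://localhost:5001/alerts`; the port and handler are illustrative, and only commonly used payload fields are read.
```python
# Minimal webhook receiver for Alertmanager notifications.
# Assumes an Alertmanager route with webhook_configs pointing at
# http://localhost:5001/alerts (URL and port are illustrative).
from flask import Flask, request

webhook_app = Flask(__name__)

@webhook_app.route('/alerts', methods=['POST'])
def receive_alerts():
    payload = request.get_json()
    # Alertmanager sends grouped alerts; each carries labels and annotations
    for alert in payload.get('alerts', []):
        name = alert.get('labels', {}).get('alertname', 'unknown')
        severity = alert.get('labels', {}).get('severity', 'none')
        summary = alert.get('annotations', {}).get('summary', '')
        status = alert.get('status', 'firing')
        print(f'[{status}] {name} ({severity}): {summary}')
        # Hand off to paging / chat / ticketing here
    return {'received': len(payload.get('alerts', []))}, 200

if __name__ == '__main__':
    webhook_app.run(port=5001)
```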
**Grafana Dashboard** (JSON):
```json
{
  "dashboard": {
    "title": "API Monitoring",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "sum(rate(http_requests_total[5m])) by (endpoint)" }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))" }
        ],
        "type": "graph",
        "alert": {
          "conditions": [
            {
              "evaluator": {
                "params": [0.05],
                "type": "gt"
              }
            }
          ]
        }
      },
      {
        "title": "Request Latency (P95)",
        "targets": [
          { "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))" }
        ],
        "type": "graph"
      },
      {
        "title": "Active Connections",
        "targets": [
          { "expr": "db_connection_pool_usage" }
        ],
        "type": "gauge"
      }
    ]
  }
}
```
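**Importing the Dashboard** (optional sketch): the dashboard JSON can be pushed to Grafana through its HTTP API rather than pasted by hand. The sketch assumes Grafana on `localhost:3000`, an API token in the `GRAFANA_TOKEN` environment variable, and the JSON saved as `api_monitoring_dashboard.json`; all three are illustrative choices.
```python
# Push the dashboard JSON above to Grafana via its HTTP API.
# Assumes Grafana at localhost:3000 and an API token in GRAFANA_TOKEN.
import json
import os

import requests

GRAFANA_URL = 'http://localhost:3000'
TOKEN = os.environ['GRAFANA_TOKEN']

# The filename is illustrative; save the JSON above under any name you like
with open('api_monitoring_dashboard.json') as f:
    dashboard = json.load(f)['dashboard']

response = requests.post(
    f'{GRAFANA_URL}/api/dashboards/db',
    headers={'Authorization': f'Bearer {TOKEN}'},
    json={'dashboard': dashboard, 'overwrite': True},
    timeout=10,
)
response.raise_for_status()
print('Dashboard imported:', response.json().get('url'))
```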
**Custom Decorator for Automatic Instrumentation**:
```python
from functools import wraps

from prometheus_client import Counter, Histogram

def monitor(metric_name=None):
    """Decorator to automatically monitor function calls"""
    def decorator(func):
        name = metric_name or func.__name__

        # Create metrics for this function (each metric name can only be
        # registered once per process)
        calls = Counter(f'{name}_calls_total', f'Total calls to {name}')
        errors = Counter(f'{name}_errors_total', f'Errors in {name}')
        duration = Histogram(f'{name}_duration_seconds', f'Duration of {name}')

        @wraps(func)
        def wrapper(*args, **kwargs):
            calls.inc()
            with duration.time():
                try:
                    return func(*args, **kwargs)
                except Exception:
                    errors.inc()
                    raise
        return wrapper
    return decorator

# Usage
@monitor('process_payment')
def process_payment(order_id):
    # Function automatically instrumented
    pass
```
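**Testing the Decorator** (optional sketch): `prometheus_client` keeps metrics in a default registry whose `get_sample_value` helper makes it easy to assert that instrumentation fires. The test below assumes the `monitor` decorator above is importable and uses `pytest`; the `charge_card` function is a made-up example.
```python
# Unit-test sketch: assert that the monitor decorator records calls and errors,
# using the default registry's get_sample_value helper.
import pytest
from prometheus_client import REGISTRY

@monitor('charge_card')  # assumes the monitor decorator above is in scope
def charge_card(amount_cents):
    if amount_cents <= 0:
        raise ValueError('invalid amount')
    return 'ok'

def test_monitor_records_calls_and_errors():
    before = REGISTRY.get_sample_value('charge_card_calls_total') or 0.0

    charge_card(500)
    with pytest.raises(ValueError):
        charge_card(0)

    assert REGISTRY.get_sample_value('charge_card_calls_total') == before + 2
    assert REGISTRY.get_sample_value('charge_card_errors_total') >= 1.0
```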
## Best Practices
- ✅ Use RED metrics for request-driven services
- ✅ Use USE metrics for resource monitoring
- ✅ Monitor both technical and business metrics
- ✅ Set up alerts on symptoms, not causes
- ✅ Define SLOs and alert on SLO violations
- ✅ Use percentiles (p95, p99), not averages, for latency (see the sketch after this list)
- ✅ Include cardinality limits (don't track unbounded labels)
- ✅ Create runbooks for each alert
- ✅ Test alerts (trigger them intentionally)
- ✅ Review and tune alerts regularly
- ❌ Avoid: Too many alerts (alert fatigue)
- ❌ Avoid: Alerts without actionable responses
- ❌ Avoid: High-cardinality labels (user IDs, timestamps)
- ❌ Avoid: Monitoring without SLOs
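To make the percentile recommendation from the list above concrete, here is a standard-library-only sketch: a single slow request barely moves the mean but dominates p99. The numbers are illustrative.
```python
# Why percentiles instead of averages: one slow request barely moves the
# mean but is obvious at p99 (standard library only; numbers illustrative).
import statistics

# 99 fast requests (100 ms) and one very slow request (10 s)
latencies = [0.1] * 99 + [10.0]

cuts = statistics.quantiles(latencies, n=100)  # 99 cut points: p1 .. p99
mean, p95, p99 = statistics.fmean(latencies), cuts[94], cuts[98]

print(f'mean: {mean:.3f}s')  # 0.199s -- looks healthy
print(f'p95:  {p95:.3f}s')   # 0.100s -- still fine
print(f'p99:  {p99:.3f}s')   # ~9.9s  -- exposes the slow tail
```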
## FAQ
**Which metrics library and backend should I choose?**
Choose based on environment and team expertise: Prometheus with OpenMetrics works well on-prem and on Kubernetes, CloudWatch integrates tightly with AWS, and DataDog offers a managed full-stack option. The instrumentation approach (counters, gauges, histograms) stays the same across backends.
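For teams standardizing on AWS, the same request counters and latency values map onto CloudWatch custom metrics. Below is a minimal sketch using boto3's `put_metric_data`; the `WebAPI` namespace, `Endpoint` dimension, and `record_request` helper are illustrative names.
```python
# RED-style data pushed to CloudWatch as custom metrics via boto3.
# Namespace, dimension, and helper names here are illustrative.
import boto3

cloudwatch = boto3.client('cloudwatch')

def record_request(endpoint, status_code, duration_seconds):
    metric_data = [
        {
            'MetricName': 'RequestCount',
            'Dimensions': [{'Name': 'Endpoint', 'Value': endpoint}],
            'Value': 1,
            'Unit': 'Count',
        },
        {
            'MetricName': 'RequestLatency',
            'Dimensions': [{'Name': 'Endpoint', 'Value': endpoint}],
            'Value': duration_seconds * 1000,
            'Unit': 'Milliseconds',
        },
    ]
    if status_code >= 500:
        metric_data.append({
            'MetricName': 'ErrorCount',
            'Dimensions': [{'Name': 'Endpoint', 'Value': endpoint}],
            'Value': 1,
            'Unit': 'Count',
        })
    cloudwatch.put_metric_data(Namespace='WebAPI', MetricData=metric_data)
```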
**How do I avoid alert fatigue?**
Alert on symptoms and SLOs, set reasonable thresholds and `for` durations, limit noisy alerts, group related alerts, and maintain runbooks so each alert is actionable.