---
name: monitoring-setup
description: Expert guide for setting up monitoring dashboards, alerting, metrics collection, and observability. Use when implementing application monitoring, setting up alerts, or building dashboards.
---

# Monitoring Setup Skill

## Overview

This skill helps you implement comprehensive monitoring for applications: metrics collection, dashboard creation, alerting strategies, health checks, and observability best practices.

## Monitoring Philosophy

### Four Golden Signals
1. **Latency**: Time to serve a request
2. **Traffic**: Request volume
3. **Errors**: Failed request rate
4. **Saturation**: Resource utilization

### Observability Pillars
- **Metrics**: Numeric measurements over time
- **Logs**: Discrete events with context
- **Traces**: Request flow across services

## Health Check Endpoints

### Comprehensive Health Check

```typescript
// src/app/api/health/route.ts
import { NextResponse } from 'next/server';
import { createClient } from '@supabase/supabase-js';
import Redis from 'ioredis';

interface HealthCheck {
  status: 'healthy' | 'degraded' | 'unhealthy';
  timestamp: string;
  version: string;
  uptime: number;
  checks: {
    database: CheckResult;
    redis: CheckResult;
    external: CheckResult;
  };
}

interface CheckResult {
  status: 'pass' | 'fail';
  latency?: number;
  message?: string;
}

async function checkDatabase(): Promise<CheckResult> {
  const start = Date.now();
  try {
    const supabase = createClient(
      process.env.SUPABASE_URL!,
      process.env.SUPABASE_SERVICE_ROLE_KEY!
    );
    // supabase-js returns errors instead of throwing, so check explicitly
    // (assumes a small health_check table exists)
    const { error } = await supabase.from('health_check').select('id').limit(1);
    if (error) throw error;
    return {
      status: 'pass',
      latency: Date.now() - start,
    };
  } catch (error) {
    return {
      status: 'fail',
      message: error instanceof Error ? error.message : 'Unknown error',
    };
  }
}

async function checkRedis(): Promise<CheckResult> {
  const start = Date.now();
  try {
    const redis = new Redis(process.env.REDIS_URL!);
    await redis.ping();
    redis.disconnect();
    return {
      status: 'pass',
      latency: Date.now() - start,
    };
  } catch (error) {
    return {
      status: 'fail',
      message: error instanceof Error ? error.message : 'Unknown error',
    };
  }
}

async function checkExternal(): Promise<CheckResult> {
  const start = Date.now();
  try {
    // Example endpoint only — point this at whichever external dependency you rely on
    const response = await fetch('https://api.stripe.com/v1/health', {
      method: 'HEAD',
    });
    return {
      status: response.ok ? 'pass' : 'fail',
      latency: Date.now() - start,
    };
  } catch (error) {
    return {
      status: 'fail',
      message: 'External service unavailable',
    };
  }
}

// Captured at module load; on serverless platforms this resets with each cold start
const startTime = Date.now();

export async function GET() {
  const [database, redis, external] = await Promise.all([
    checkDatabase(),
    checkRedis(),
    checkExternal(),
  ]);

  const checks = { database, redis, external };

  const allPassed = Object.values(checks).every((c) => c.status === 'pass');
  const anyFailed = Object.values(checks).some((c) => c.status === 'fail');

  const health: HealthCheck = {
    status: allPassed ? 'healthy' : anyFailed ? 'unhealthy' : 'degraded',
    timestamp: new Date().toISOString(),
    version: process.env.VERCEL_GIT_COMMIT_SHA || 'local',
    uptime: Math.floor((Date.now() - startTime) / 1000),
    checks,
  };

  return NextResponse.json(health, {
    status: health.status === 'healthy' ? 200 : 503,
    headers: {
      'Cache-Control': 'no-store',
    },
  });
}
```
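A quick way to exercise the endpoint is a small script that calls it and prints each check. A minimal sketch, assuming the app is reachable at an `APP_URL` env var (or localhost in development):

```typescript
// scripts/check-health.ts — hypothetical smoke test for the health endpoint
const BASE_URL = process.env.APP_URL ?? 'http://localhost:3000';

async function main() {
  const res = await fetch(`${BASE_URL}/api/health`);
  const health = await res.json();

  // Healthy responses return 200; degraded/unhealthy return 503
  console.log(`HTTP ${res.status} - overall: ${health.status}`);
  for (const [name, check] of Object.entries<{ status: string; latency?: number }>(
    health.checks
  )) {
    console.log(`  ${name}: ${check.status}${check.latency ? ` (${check.latency}ms)` : ''}`);
  }
}

main().catch(console.error);
```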

### Kubernetes-Style Probes

```typescript
// src/app/api/health/live/route.ts
// Liveness probe - is the app running?
export async function GET() {
  return new Response('OK', { status: 200 });
}

// src/app/api/health/ready/route.ts
// Readiness probe - can the app handle traffic?
// Assumes checkDatabase is exported from a shared module (see the sketch below)
import { checkDatabase } from '@/lib/health';

export async function GET() {
  // Check critical dependencies before accepting traffic
  const db = await checkDatabase();
  if (db.status === 'pass') {
    return new Response('OK', { status: 200 });
  }
  return new Response('Not Ready', { status: 503 });
}
```
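The readiness probe imports `checkDatabase` from a shared module rather than redefining it per route. A minimal sketch of that module, mirroring the Supabase setup used in the full health check:

```typescript
// src/lib/health.ts — shared dependency checks (a sketch; mirrors the route above)
import { createClient } from '@supabase/supabase-js';

export interface CheckResult {
  status: 'pass' | 'fail';
  latency?: number;
  message?: string;
}

export async function checkDatabase(): Promise<CheckResult> {
  const start = Date.now();
  const supabase = createClient(
    process.env.SUPABASE_URL!,
    process.env.SUPABASE_SERVICE_ROLE_KEY!
  );
  // supabase-js returns errors rather than throwing
  const { error } = await supabase.from('health_check').select('id').limit(1);
  if (error) {
    return { status: 'fail', message: error.message };
  }
  return { status: 'pass', latency: Date.now() - start };
}
```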

## Metrics Collection

### Custom Metrics with Prometheus Client

```typescript
// src/lib/metrics.ts
import { Counter, Histogram, Gauge, Registry } from 'prom-client';

export const registry = new Registry();

// HTTP request metrics
export const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [registry],
});

export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
  registers: [registry],
});

// Business metrics
export const activeUsers = new Gauge({
  name: 'active_users',
  help: 'Number of currently active users',
  registers: [registry],
});

export const ordersTotal = new Counter({
  name: 'orders_total',
  help: 'Total orders processed',
  labelNames: ['status', 'payment_method'],
  registers: [registry],
});

// Database metrics
export const dbQueryDuration = new Histogram({
  name: 'db_query_duration_seconds',
  help: 'Database query duration',
  labelNames: ['operation', 'table'],
  buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1],
  registers: [registry],
});
```
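These metrics are updated from application code with `inc()`, `set()`, and `observe()`. A hedged sketch with hypothetical handlers (the function names and call sites are illustrative):

```typescript
// Example usage — hypothetical handlers; wire these into your real event paths
import { ordersTotal, activeUsers } from '@/lib/metrics';

async function handleOrderPaid(order: { paymentMethod: string }) {
  // Counters only go up; keep label values low-cardinality (no user IDs)
  ordersTotal.inc({ status: 'paid', payment_method: order.paymentMethod });
}

function onSessionCountChange(count: number) {
  // Gauges can move in both directions
  activeUsers.set(count);
}
```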

### Metrics Endpoint

```typescript
// src/app/api/metrics/route.ts
import { NextResponse } from 'next/server';
import { registry } from '@/lib/metrics';

export async function GET(request: Request) {
  // Optional: Basic auth protection
  const authHeader = request.headers.get('authorization');
  // Fail closed if METRICS_TOKEN is unset rather than matching "Bearer undefined"
  if (!process.env.METRICS_TOKEN || authHeader !== `Bearer ${process.env.METRICS_TOKEN}`) {
    return new Response('Unauthorized', { status: 401 });
  }

  const metrics = await registry.metrics();

  return new Response(metrics, {
    headers: {
      'Content-Type': registry.contentType,
    },
  });
}
```
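To feed the `dbQueryDuration` histogram defined earlier, queries can be wrapped with prom-client's `startTimer()` helper. A sketch in which `runQuery` stands in for your actual database call:

```typescript
// src/lib/db.ts — a sketch; `runQuery` is a placeholder for your real client call
import { dbQueryDuration } from '@/lib/metrics';

export async function timedQuery<T>(
  operation: string,
  table: string,
  runQuery: () => Promise<T>
): Promise<T> {
  // startTimer() returns a function that records elapsed seconds when invoked
  const end = dbQueryDuration.startTimer({ operation, table });
  try {
    return await runQuery();
  } finally {
    end();
  }
}
```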

### Middleware for Request Metrics

```typescript
// src/middleware.ts
// Caveat: prom-client requires the Node.js runtime, and serverless instances keep
// separate in-memory registries from the /metrics route. This pattern fits a single
// long-running Node server; on Edge or serverless deployments, instrument route
// handlers directly instead (see the sketch after this block).
import { NextResponse } from 'next/server';
import type { NextRequest } from 'next/server';
import { httpRequestsTotal, httpRequestDuration } from '@/lib/metrics';

export async function middleware(request: NextRequest) {
  const start = Date.now();

  const response = NextResponse.next();

  // Note: NextResponse.next() resolves before the downstream handler finishes,
  // so this duration reflects middleware overhead rather than end-to-end latency
  const route = request.nextUrl.pathname;
  const method = request.method;
  const status = response.status.toString();

  httpRequestsTotal.inc({ method, route, status });
  httpRequestDuration.observe(
    { method, route },
    (Date.now() - start) / 1000
  );

  return response;
}
```
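Given the middleware caveats above, a safer pattern on serverless platforms is to time each route handler directly. A minimal sketch for a hypothetical `/api/orders` route:

```typescript
// src/app/api/orders/route.ts — a sketch; instruments one route handler directly
import { NextResponse } from 'next/server';
import { httpRequestsTotal, httpRequestDuration } from '@/lib/metrics';

export async function GET() {
  const end = httpRequestDuration.startTimer({ method: 'GET', route: '/api/orders' });
  let status = 200;
  try {
    const orders: unknown[] = []; // placeholder for real data fetching
    return NextResponse.json(orders);
  } catch {
    status = 500;
    return NextResponse.json({ error: 'Internal error' }, { status });
  } finally {
    // Runs after the return, covering the full handler duration
    end();
    httpRequestsTotal.inc({ method: 'GET', route: '/api/orders', status: String(status) });
  }
}
```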

## Alerting Configuration

### Alert Rules (Prometheus/Grafana)

```yaml
# alerts.yml
groups:
  - name: application
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: High error rate detected
          description: Error rate is {{ $value | humanizePercentage }} over the last 5 minutes

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: High latency detected
          description: 95th percentile latency is {{ $value | humanizeDuration }}

      # Service down
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: Service is down
          description: "{{ $labels.instance }} has been down for more than 1 minute"

      # Database connection pool exhausted (threshold assumes max_connections ~100)
      - alert: DatabaseConnectionsHigh
        expr: sum(pg_stat_activity_count) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Database connection pool nearly exhausted
          description: "{{ $value }} connections in use"

  - name: infrastructure
    rules:
      # High CPU (node_cpu_seconds_total is a counter, so derive utilization with rate())
      - alert: HighCPU
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: High CPU usage
          description: "CPU usage on {{ $labels.instance }} is above 80%"

      # Low disk space
      - alert: LowDiskSpace
        expr: |
          (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Low disk space
          description: Only {{ $value | humanizePercentage }} disk space remaining
```

### Vercel/Uptime Monitoring

```typescript
// scripts/uptime-check.ts
// Run via cron or external monitoring service

const ENDPOINTS = [
  { name: 'Health', url: 'https://myapp.com/api/health' },
  { name: 'Homepage', url: 'https://myapp.com' },
  { name: 'API', url: 'https://myapp.com/api/status' },
];

const WEBHOOK_URL = process.env.SLACK_WEBHOOK_URL;

async function checkEndpoint(endpoint: typeof ENDPOINTS[0]) {
  const start = Date.now();
  try {
    const response = await fetch(endpoint.url, {
      method: 'GET',
      signal: AbortSignal.timeout(10000),
    });

    return {
      name: endpoint.name,
      url: endpoint.url,
      status: response.status,
      latency: Date.now() - start,
      healthy: response.ok,
    };
  } catch (error) {
    return {
      name: endpoint.name,
      url: endpoint.url,
      status: 0,
      latency: Date.now() - start,
      healthy: false,
      error: error instanceof Error ? error.message : 'Unknown error',
    };
  }
}

async function notifySlack(message: string) {
  if (!WEBHOOK_URL) return;

  await fetch(WEBHOOK_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: message }),
  });
}

async function runChecks() {
  const results = await Promise.all(ENDPOINTS.map(checkEndpoint));

  const unhealthy = results.filter((r) => !r.healthy);

  if (unhealthy.length > 0) {
    const message = `🚨 *Uptime Alert*\n${unhealthy
      .map((r) => `• ${r.name}: ${r.error || `Status ${r.status}`}`)
      .join('\n')}`;

    await notifySlack(message);
  }

  console.log(JSON.stringify(results, null, 2));
}

runChecks();
```
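If the script runs on a long-lived host instead of an external cron service, one option is the `node-cron` package (an assumption; any scheduler works, and `runChecks` would need to be exported from the script above):

```typescript
// scripts/uptime-cron.ts — a sketch using node-cron; swap in your preferred scheduler
import cron from 'node-cron';
import { runChecks } from './uptime-check'; // assumes runChecks is exported

// Run the uptime checks every 5 minutes
cron.schedule('*/5 * * * *', () => {
  runChecks().catch((error) => console.error('Uptime check failed:', error));
});
```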

## Dashboard Configuration

### Grafana Dashboard JSON

```json
{
  "title": "Application Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total[5m])) by (route)",
          "legendFormat": "{{ route }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "stat",
      "targets": [
        {
          "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "thresholds": {
            "steps": [
              { "value": 0, "color": "green" },
              { "value": 1, "color": "yellow" },
              { "value": 5, "color": "red" }
            ]
          }
        }
      }
    },
    {
      "title": "Response Time (p95)",
      "type": "gauge",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s",
          "thresholds": {
            "steps": [
              { "value": 0, "color": "green" },
              { "value": 0.5, "color": "yellow" },
              { "value": 2, "color": "red" }
            ]
          }
        }
      }
    },
    {
      "title": "Active Users",
      "type": "stat",
      "targets": [
        {
          "expr": "active_users"
        }
      ]
    }
  ]
}
```
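The same JSON can be provisioned programmatically through Grafana's HTTP API. A sketch, assuming `GRAFANA_URL` and a service account token in `GRAFANA_TOKEN`:

```typescript
// scripts/upload-dashboard.ts — a sketch; assumes GRAFANA_URL and GRAFANA_TOKEN are set
import { readFile } from 'node:fs/promises';

async function uploadDashboard(path: string) {
  const dashboard = JSON.parse(await readFile(path, 'utf8'));

  const response = await fetch(`${process.env.GRAFANA_URL}/api/dashboards/db`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.GRAFANA_TOKEN}`,
    },
    // overwrite: true updates the dashboard if it already exists
    body: JSON.stringify({ dashboard, overwrite: true }),
  });

  if (!response.ok) {
    throw new Error(`Upload failed: ${response.status} ${await response.text()}`);
  }
}

uploadDashboard('dashboard.json').catch(console.error);
```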

### Vercel Analytics Integration

```typescript
// src/app/layout.tsx
import { Analytics } from '@vercel/analytics/react';
import { SpeedInsights } from '@vercel/speed-insights/next';

export default function RootLayout({
  children,
}: {
  children: React.ReactNode;
}) {
  return (
    <html lang="en">
      <body>
        {children}
        <Analytics />
        <SpeedInsights />
      </body>
    </html>
  );
}
```
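Beyond page views, `@vercel/analytics` also exposes a `track()` function for custom events. A short sketch (the event name and properties are illustrative):

```typescript
// Any client component — records a custom event alongside page analytics
'use client';
import { track } from '@vercel/analytics';

export function CheckoutButton() {
  return (
    <button
      onClick={() => {
        // Event name and properties are illustrative
        track('checkout_started', { plan: 'pro' });
      }}
    >
      Checkout
    </button>
  );
}
```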

## Status Page

### Simple Status Page

```typescript
// src/app/status/page.tsx
import { Suspense } from 'react';

interface ServiceStatus {
  name: string;
  status: 'operational' | 'degraded' | 'outage';
  lastChecked: string;
}

async function getStatus(): Promise<ServiceStatus[]> {
  const response = await fetch(
    `${process.env.NEXT_PUBLIC_APP_URL}/api/health`,
    { next: { revalidate: 60 } }
  );

  if (!response.ok) {
    return [
      { name: 'API', status: 'outage', lastChecked: new Date().toISOString() },
    ];
  }

  const health = await response.json();

  return [
    {
      name: 'API',
      status:
        health.status === 'healthy'
          ? 'operational'
          : health.status === 'unhealthy'
            ? 'outage'
            : 'degraded',
      lastChecked: health.timestamp,
    },
    {
      name: 'Database',
      status: health.checks.database.status === 'pass' ? 'operational' : 'outage',
      lastChecked: health.timestamp,
    },
    {
      name: 'Cache',
      status: health.checks.redis.status === 'pass' ? 'operational' : 'degraded',
      lastChecked: health.timestamp,
    },
  ];
}

function StatusBadge({ status }: { status: ServiceStatus['status'] }) {
  const colors = {
    operational: 'bg-green-500',
    degraded: 'bg-yellow-500',
    outage: 'bg-red-500',
  };

  return (
    <span className={`inline-block w-3 h-3 rounded-full ${colors[status]}`} />
  );
}

export default async function StatusPage() {
  const services = await getStatus();
  const allOperational = services.every((s) => s.status === 'operational');

  return (
    <div className="max-w-2xl mx-auto p-8">
      <h1 className="text-2xl font-bold mb-8">System Status</h1>

      <div className={`p-4 rounded-lg mb-8 ${
        allOperational ? 'bg-green-100' : 'bg-yellow-100'
      }`}>
        <p className="font-medium">
          {allOperational
            ? 'All systems operational'
            : 'Some systems experiencing issues'}
        </p>
      </div>

      <div className="space-y-4">
        {services.map((service) => (
          <div
            key={service.name}
            className="flex items-center justify-between p-4 border rounded"
          >
            <div className="flex items-center gap-3">
              <StatusBadge status={service.status} />
              <span className="font-medium">{service.name}</span>
            </div>
            <span className="text-sm text-gray-500 capitalize">
              {service.status}
            </span>
          </div>
        ))}
      </div>

      <p className="mt-8 text-sm text-gray-500">
        Last updated: {new Date().toLocaleString()}
      </p>
    </div>
  );
}
```

## Monitoring Checklist

### Application Monitoring
- [ ] Health check endpoint
- [ ] Request latency metrics
- [ ] Error rate tracking
- [ ] Active user count
- [ ] Business metrics (orders, signups, etc.)

### Infrastructure Monitoring
- [ ] CPU/Memory utilization
- [ ] Disk space
- [ ] Network I/O
- [ ] Database connections
- [ ] Cache hit rate

### Alerting
- [ ] Error rate thresholds
- [ ] Latency thresholds
- [ ] Uptime monitoring
- [ ] Resource alerts
- [ ] On-call rotation configured

### Dashboards
- [ ] Overview dashboard
- [ ] API performance
- [ ] Database metrics
- [ ] Business KPIs
- [ ] Status page (public)

## When to Use This Skill

Invoke this skill when:
- Setting up monitoring for a new project
- Creating health check endpoints
- Implementing metrics collection
- Configuring alerting rules
- Building monitoring dashboards
- Setting up status pages
- Debugging performance issues
- Planning capacity
- Setting up Prometheus scraping and a secured metrics endpoint
- Adding uptime checks with Slack or webhook notifications

## How This Skill Works

The skill provides concrete patterns and code examples: health endpoints, custom metrics collected with a Prometheus client, and a token-protected metrics endpoint. It shows middleware instrumentation for request counts and latencies, sample Prometheus/Grafana alert rules, uptime-check scripts that notify Slack, and Grafana dashboard JSON for key panels. It also covers Kubernetes-style liveness/readiness probes and a simple public status page.


## Best Practices

- Instrument the Four Golden Signals: latency, traffic, errors, saturation
- Expose both detailed health checks and simple liveness/readiness endpoints
- Protect your `/metrics` endpoint with a token or network restrictions
- Use percentiles (p95/p99) and histograms for latency instead of averages
- Tune alert thresholds to reduce noise; require sustained violations (`for:` blocks)
- Include business metrics (active users, orders) alongside infrastructure metrics

## Example Use Cases

- Add a `/api/health` endpoint that aggregates DB, cache, and external service checks
- Use a Prometheus client to record `http_requests_total` and request duration histograms
- Create Grafana panels for request rate, error rate, and p95 response time
- Deploy Prometheus alert rules for HighErrorRate, HighLatency, and ServiceDown
- Run a cron uptime-check script that posts failures to Slack via a webhook

## FAQ

**How should I secure the metrics endpoint?**

Require a bearer token or restrict access by network/IP. Avoid exposing raw metrics publicly.

**What percentiles should I track for latency?**

Track p50, p95, and p99. p95 highlights common tail latency; p99 catches rare but critical slow requests.

**How do I avoid alert fatigue?**

Set sensible thresholds, use `for:` windows to require sustained issues, group related alerts, and provide clear runbooks in annotations.