
infrastructure-monitoring skill

This skill helps you implement comprehensive infrastructure monitoring with Prometheus, Grafana, and alerting to track metrics, health, and performance.

npx playbooks add skill aj-geddes/useful-ai-prompts --skill infrastructure-monitoring

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (10.6 KB)
---
name: infrastructure-monitoring
description: Set up comprehensive infrastructure monitoring with Prometheus, Grafana, and alerting systems for metrics, health checks, and performance tracking.
---

# Infrastructure Monitoring

## Overview

Implement comprehensive infrastructure monitoring to track system health, performance metrics, and resource utilization with alerting and visualization across your entire stack.

## When to Use

- Real-time performance monitoring
- Capacity planning and trends
- Incident detection and alerting
- Service health tracking
- Resource utilization analysis
- Performance troubleshooting
- Compliance and audit trails
- Historical data analysis

## Implementation Examples

### 1. **Prometheus Configuration**

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'infrastructure-monitor'
    environment: 'production'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

# Rule files
rule_files:
  - 'alerts.yml'
  - 'rules.yml'

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1.internal:9100'
          - 'node2.internal:9100'
          - 'node3.internal:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

  # Docker container metrics
  - job_name: 'docker'
    static_configs:
      - targets: ['localhost:9323']
    metrics_path: '/metrics'

  # Kubernetes metrics
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Application metrics
  - job_name: 'application'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'app1.internal:8080'
          - 'app2.internal:8080'
          - 'app3.internal:8080'
    scrape_interval: 10s
    scrape_timeout: 5s

  # PostgreSQL metrics
  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres-exporter.internal:9187']

  # Redis metrics
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter.internal:9121']

  # RabbitMQ metrics
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq.internal:15692']
```
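
The `rule_files` section above references `rules.yml`, which is not included in this skill. A minimal sketch of what it might contain is shown below: recording rules that precompute expensive expressions so dashboards and alerts can query cheaper, pre-aggregated series. The rule names (`instance:node_cpu_usage:percent`, `job:http_requests:rate5m`) are illustrative and not required by anything else in this skill.

```yaml
# rules.yml (illustrative sketch -- rule names are examples)
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Precompute per-instance CPU usage so panels and alerts can reuse it
      - record: instance:node_cpu_usage:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Precompute per-job request rate for capacity and SLO views
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```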

### 2. **Alert Rules**

```yaml
# alerts.yml
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
          description: "P95 latency is {{ $value }}s"

      - alert: ServiceDown
        expr: up{job="application"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "Service has been unreachable for 1 minute"

  - name: infrastructure_alerts
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|squashfs|vfat"} / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Available disk space is {{ $value }}%"

      - alert: NodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes node {{ $labels.node }} is not ready"
          description: "Node has been unready for 5 minutes"

      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod has restarted {{ $value }} times in 15 minutes"
```
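
Before reloading Prometheus, it is worth validating the configuration and rule files. A quick check with `promtool` (shipped with Prometheus) might look like this, assuming the file names used above:

```bash
# Validate the main configuration (also resolves the referenced rule files)
promtool check config prometheus.yml

# Validate alerting and recording rules on their own
promtool check rules alerts.yml rules.yml
```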

### 3. **Alertmanager Configuration**

```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'YOUR_SLACK_WEBHOOK_URL'

# Template files
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Routing tree
route:
  receiver: 'default'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Critical alerts
    - match:
        severity: critical
      receiver: 'critical-team'
      continue: true
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts
    - match:
        severity: warning
      receiver: 'warning-channel'
      group_wait: 1m

# Receivers
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-team'
    slack_configs:
      - channel: '#critical-alerts'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.mycompany.com:587'
        auth_username: '[email protected]'
        auth_password: 'CHANGE_ME'  # placeholder -- load from a secrets manager in production

  - name: 'warning-channel'
    slack_configs:
      - channel: '#warnings'
        title: 'Warning: {{ .GroupLabels.alertname }}'
```
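
Before deploying this configuration, you can validate it and preview how the routing tree will dispatch a given label set. A minimal sketch using `amtool` (shipped with Alertmanager) is below; the label values are examples only.

```bash
# Validate the Alertmanager configuration file
amtool check-config alertmanager.yml

# Preview which receiver(s) a hypothetical critical alert would be routed to
amtool config routes test --config.file=alertmanager.yml severity=critical
```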

### 4. **Grafana Dashboard**

```json
{
  "dashboard": {
    "title": "Infrastructure Overview",
    "panels": [
      {
        "title": "CPU Usage",
        "targets": [
          {
            "expr": "100 - (avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ],
        "type": "graph",
        "alert": {
          "name": "CPU Usage Alert",
          "conditions": [
            {
              "evaluator": {
                "type": "gt",
                "params": [80]
              }
            }
          ]
        }
      },
      {
        "title": "Memory Usage",
        "targets": [
          {
            "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Request Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate",
        "targets": [
          {
            "expr": "rate(http_requests_total{status=~\"5..\"}[5m])"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Response Time P95",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "type": "graph"
      },
      {
        "title": "Disk Usage",
        "targets": [
          {
            "expr": "(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```
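
The dashboard assumes Grafana already has Prometheus configured as a data source. If you provision Grafana from files, a minimal data source definition might look like the sketch below; the URL assumes the `prometheus` service in the `monitoring` namespace created in the deployment step.

```yaml
# grafana-datasource.yaml (provisioning sketch -- adjust the URL for your environment)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring.svc:9090
    isDefault: true
```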

### 5. **Monitoring Deployment**

```bash
#!/bin/bash
# deploy-monitoring.sh - Deploy Prometheus and Grafana

set -euo pipefail

NAMESPACE="monitoring"
PROMETHEUS_VERSION="v2.40.0"
GRAFANA_VERSION="9.3.2"

echo "Creating monitoring namespace..."
kubectl create namespace "$NAMESPACE" --dry-run=client -o yaml | kubectl apply -f -

# Deploy Prometheus
echo "Deploying Prometheus..."
kubectl apply -f prometheus-configmap.yaml -n "$NAMESPACE"
kubectl apply -f prometheus-deployment.yaml -n "$NAMESPACE"
kubectl apply -f prometheus-service.yaml -n "$NAMESPACE"

# Deploy Alertmanager
echo "Deploying Alertmanager..."
kubectl apply -f alertmanager-configmap.yaml -n "$NAMESPACE"
kubectl apply -f alertmanager-deployment.yaml -n "$NAMESPACE"
kubectl apply -f alertmanager-service.yaml -n "$NAMESPACE"

# Deploy Grafana
echo "Deploying Grafana..."
kubectl apply -f grafana-deployment.yaml -n "$NAMESPACE"
kubectl apply -f grafana-service.yaml -n "$NAMESPACE"

# Wait for deployments
echo "Waiting for deployments to be ready..."
kubectl rollout status deployment/prometheus -n "$NAMESPACE" --timeout=5m
kubectl rollout status deployment/alertmanager -n "$NAMESPACE" --timeout=5m
kubectl rollout status deployment/grafana -n "$NAMESPACE" --timeout=5m

# Port forward for access
echo "Port forwarding to services..."
kubectl port-forward -n "$NAMESPACE" svc/prometheus 9090:9090 &
kubectl port-forward -n "$NAMESPACE" svc/grafana 3000:3000 &

echo "Monitoring stack deployed successfully!"
echo "Prometheus: http://localhost:9090"
echo "Grafana: http://localhost:3000"
```
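
Once the script completes, a quick smoke test confirms Prometheus is healthy, scraping targets, and connected to Alertmanager. The snippet below assumes the port-forwards started by the script are still running.

```bash
# Prometheus liveness endpoint (returns a short healthy message)
curl -s http://localhost:9090/-/healthy

# Count targets currently reporting as up
curl -s http://localhost:9090/api/v1/targets | grep -c '"health":"up"'

# Confirm Prometheus has discovered Alertmanager
curl -s http://localhost:9090/api/v1/alertmanagers
```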

## Monitoring Best Practices

### ✅ DO
- Monitor key business metrics
- Set appropriate alert thresholds
- Use consistent naming conventions
- Implement dashboards for visualization
- Keep data retention reasonable
- Use labels for better querying
- Test alerting paths regularly (see the `amtool` sketch after these lists)
- Document alert meanings

### ❌ DON'T
- Alert on every metric change
- Ignore alert noise
- Store too much unnecessary data
- Set unrealistic thresholds
- Mix metrics from different sources
- Forget to test alert routing
- Alert without runbooks
- Over-instrument without purpose
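
One way to exercise the alerting path end to end is to push a synthetic alert into a running Alertmanager and confirm it reaches the expected channel. A sketch using `amtool` is below; the label and annotation values are examples, and the URL assumes Alertmanager is reachable at localhost:9093.

```bash
# Push a synthetic alert into Alertmanager to verify routing and delivery
# (label/annotation values are examples; adjust the URL to your Alertmanager)
amtool alert add TestAlert severity=warning instance=test.internal \
  --annotation=summary='Routing test -- please ignore' \
  --alertmanager.url=http://localhost:9093
```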

## Resources

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [Alertmanager Documentation](https://prometheus.io/docs/alerting/latest/overview/)
- [Prometheus Best Practices](https://prometheus.io/docs/practices/instrumentation/)

Overview

This skill sets up a complete infrastructure monitoring stack using Prometheus, Grafana, and Alertmanager to collect metrics, visualize dashboards, and deliver alerts. It provides preconfigured scrape targets, alert rules, Alertmanager routing, and sample Grafana dashboards to get production-grade monitoring online quickly. The goal is reliable visibility into system health, performance, and resource utilization across services and clusters.

How this skill works

Prometheus scrapes metrics from nodes, containers, applications, and exporters according to configured jobs and relabeling rules. Alert rules evaluate metric expressions and fire alerts, which Alertmanager groups and routes to channels like Slack or email based on severity. Grafana consumes Prometheus as a data source and renders dashboards and panel alerts for operational and executive views.

When to use it

  • Track real-time performance and error rates for services
  • Detect incidents and notify on-call teams via Slack or email
  • Perform capacity planning using historical metrics and trends
  • Monitor Kubernetes node and pod health and resource pressure
  • Investigate performance regressions and troubleshoot latency

Best practices

  • Monitor key business and system metrics, not everything; prioritize high-value signals
  • Use consistent naming and labels for metrics to simplify queries and dashboards
  • Set threshold-based alerts with appropriate 'for:' durations to avoid flapping
  • Group and route alerts by severity and service; test alert delivery paths regularly
  • Keep data retention reasonable and use recording rules for expensive queries

Example use cases

  • Alert when P95 request latency exceeds 1s for 10 minutes and notify the on-call channel
  • Trigger critical alerts for node not ready or low disk space to prevent outages
  • Create Grafana overview dashboard with CPU, memory, disk, request rate, and error rate panels
  • Scrape application /metrics endpoints and instrument business metrics for SLA tracking
  • Deploy monitoring stack to a dedicated namespace and port-forward Prometheus and Grafana for local access

FAQ

How do I avoid alert fatigue?

Tune thresholds and use 'for' windows to reduce noisy alerts, group related alerts, and implement silences for planned maintenance.

Where should I keep long-term metrics?

Keep short-term high-resolution metrics in Prometheus and offload long-term storage to a remote write compatible store or a metrics archive.
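
For example, Prometheus can ship samples to such a store via a remote_write block in prometheus.yml; the sketch below uses a placeholder URL.

```yaml
# Sketch: forward samples to a remote-write compatible long-term store
remote_write:
  - url: "https://metrics-archive.example.com/api/v1/write"  # placeholder endpoint
    queue_config:
      max_samples_per_send: 5000
```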