This skill enables production-grade observability using Prometheus, Grafana, and OpenTelemetry to collect metrics, set alerts, and build dashboards. Add it to your agents with `npx playbooks add skill manutej/luxor-claude-marketplace --skill observability-monitoring`.
---
name: observability-monitoring
description: Comprehensive observability and monitoring skill covering Prometheus, Grafana, metrics collection, alerting, exporters, PromQL, and production monitoring patterns for distributed systems and cloud-native applications
---
# Observability & Monitoring
A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.
## When to Use This Skill
Use this skill when:
- Setting up monitoring for production systems and applications
- Implementing metrics collection and observability for microservices
- Creating dashboards and visualizations for system health monitoring
- Defining alerting rules and incident response automation
- Analyzing system performance and capacity using time-series data
- Implementing SLIs, SLOs, and SLAs for service reliability
- Debugging production issues using metrics and traces
- Building custom exporters for application-specific metrics
- Setting up federation for multi-cluster monitoring
- Migrating from legacy monitoring to cloud-native solutions
- Implementing cost monitoring and optimization tracking
- Creating real-time operational dashboards for DevOps teams
## Core Concepts
### The Four Pillars of Observability
Modern observability is built on four fundamental pillars:
1. **Metrics**: Numerical measurements of system behavior over time
- Counter: Monotonically increasing values (requests served, errors)
- Gauge: Point-in-time values that go up and down (memory usage, temperature)
- Histogram: Distribution of values (request duration buckets)
   - Summary: Similar to a histogram, but quantiles are calculated client-side (see the client-library sketch after this list)
2. **Logs**: Discrete events with contextual information
- Structured logging (JSON, key-value pairs)
- Centralized log aggregation (ELK, Loki)
- Correlation with metrics and traces
3. **Traces**: Request flow through distributed systems
- Span: Single unit of work with start/end time
- Trace: Collection of spans representing end-to-end request
- OpenTelemetry for distributed tracing
4. **Events**: Significant occurrences in system lifecycle
- Deployments, configuration changes
- Scaling events, incidents
- Business events and user actions
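The metric types above map directly onto client-library primitives. A minimal sketch using the Python `prometheus_client` library (the metric and label names here are illustrative, not taken from a real service):
```python
# Sketch: the four Prometheus metric types via prometheus_client (illustrative names)
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter('app_requests_total', 'Requests served', ['status'])           # only increases
IN_FLIGHT = Gauge('app_in_flight_requests', 'Requests currently being handled')   # goes up and down
LATENCY = Histogram('app_request_duration_seconds', 'Request latency in seconds',
                    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5])                      # server-side buckets
PAYLOAD = Summary('app_request_size_bytes', 'Request payload size in bytes')       # client-side quantiles

def record_request(status: str, duration_s: float, size_bytes: float) -> None:
    IN_FLIGHT.inc()
    REQUESTS.labels(status=status).inc()
    LATENCY.observe(duration_s)
    PAYLOAD.observe(size_bytes)
    IN_FLIGHT.dec()

if __name__ == '__main__':
    start_http_server(8000)   # expose /metrics on :8000
    record_request('200', 0.12, 512)
```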
### Prometheus Architecture
Prometheus is a pull-based monitoring system with key components:
**Time-Series Database (TSDB)**
- Stores metrics as time-series data
- Efficient compression and retention policies
- Local storage with optional remote storage
**Scrape Targets**
- Service discovery (Kubernetes, Consul, EC2, etc.)
- Static configuration
- Relabeling for flexible target selection
**PromQL Query Engine**
- Powerful query language for metrics analysis
- Aggregation, filtering, and mathematical operations
- Range vectors and instant vectors
**Alertmanager**
- Alert rule evaluation
- Grouping, silencing, and routing
- Integration with PagerDuty, Slack, email, webhooks
**Exporters**
- Bridge between Prometheus and systems
- Node exporter, cAdvisor, custom exporters
- Third-party exporters for databases, services
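To make the pull model concrete, this is roughly what a scrape target's `/metrics` endpoint returns in the Prometheus text exposition format (the values are illustrative):
```text
# HELP http_requests_total Total HTTP requests served
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 1027
http_requests_total{method="POST",endpoint="/api/orders",status="500"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.5056e+07
```
Prometheus scrapes such endpoints on every `scrape_interval`, attaches the configured target labels, and appends the samples to its TSDB.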
### Metric Labels and Cardinality
Labels are key-value pairs attached to metrics:
```prometheus
http_requests_total{method="GET", endpoint="/api/users", status="200"}
```
**Label Best Practices:**
- Use labels for dimensions you query/aggregate by
- Avoid high-cardinality labels (user IDs, timestamps)
- Keep label names consistent across metrics
- Use relabeling to normalize externally supplied labels (a scrape-time sketch follows this list)
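A sketch of label normalization at scrape time, assuming a client library accidentally exposes a high-cardinality `session_id` label (the job name and label name are hypothetical):
```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:8080']
    metric_relabel_configs:
      # Drop the high-cardinality label from every ingested series
      - action: labeldrop
        regex: session_id
```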
**Cardinality Considerations:**
- Each unique label combination = new time-series
- High cardinality = increased memory and storage
- Monitor cardinality with `prometheus_tsdb_head_series` (active series) and `prometheus_tsdb_symbol_table_size_bytes`; see the queries after this list
- Use recording rules to pre-aggregate high-cardinality metrics
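A quick way to find cardinality offenders, sketched as ad-hoc PromQL run against the Prometheus instance itself:
```promql
# Ten metric names with the most active series
topk(10, count by (__name__) ({__name__=~".+"}))

# Total number of active series in the head block
prometheus_tsdb_head_series
```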
### Recording Rules
Pre-compute frequently-used or expensive queries:
```yaml
groups:
- name: api_performance
interval: 30s
rules:
- record: api:http_request_duration_seconds:p99
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
- record: api:http_requests:rate5m
expr: rate(http_requests_total[5m])
```
**Benefits:**
- Faster dashboard loading
- Reduced query load on Prometheus
- Consistent metric naming conventions
- Enable complex aggregations
### Service Level Objectives (SLOs)
Define and track reliability targets:
**SLI (Service Level Indicator)**: Metric measuring service quality
- Availability: % of successful requests
- Latency: % of requests under threshold
- Throughput: Requests per second
**SLO (Service Level Objective)**: Target for SLI
- 99.9% availability (43.8 minutes downtime/month)
- 95% of requests < 200ms
- 1000 RPS sustained
**SLA (Service Level Agreement)**: Contract with consequences
- External commitments to customers
- Financial penalties for SLO violations
**Error Budget**: Acceptable failure rate
- Error budget = 100% - SLO
- 99.9% SLO = 0.1% error budget
- Use budget for innovation vs. reliability tradeoff
## Prometheus Setup and Configuration
### Basic Prometheus Configuration
```yaml
# prometheus.yml
global:
scrape_interval: 15s # Default scrape interval
evaluation_interval: 15s # Alert rule evaluation interval
external_labels:
cluster: 'production'
region: 'us-west-2'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Load rules
rule_files:
- 'rules/*.yml'
- 'alerts/*.yml'
# Scrape configurations
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node exporter for system metrics
- job_name: 'node'
static_configs:
- targets:
- 'node1:9100'
- 'node2:9100'
- 'node3:9100'
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+):.*'
replacement: '${1}'
# Application metrics
- job_name: 'api'
static_configs:
- targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
labels:
env: 'production'
tier: 'backend'
```
### Kubernetes Service Discovery
```yaml
scrape_configs:
# Kubernetes API server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Kubernetes pods with prometheus.io annotations
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape: "true" annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use the port from prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        target_label: __address__
        replacement: ${1}:${2}
# Use the path from prometheus.io/path annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
# Add namespace label
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
# Add pod name label
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Kubernetes services
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter:9115
- source_labels: [__param_target]
target_label: instance
```
### Storage and Retention
```yaml
# Storage path and retention are set via command-line flags, not in prometheus.yml:
#   --storage.tsdb.path=/prometheus/data
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB
# Remote write for long-term storage
remote_write:
- url: "https://prometheus-remote-storage.example.com/api/v1/write"
basic_auth:
username: prometheus
password_file: /etc/prometheus/remote_storage_password
queue_config:
capacity: 10000
max_shards: 50
max_samples_per_send: 5000
write_relabel_configs:
# Drop high-cardinality metrics
- source_labels: [__name__]
regex: 'container_network_.*'
action: drop
# Remote read for querying historical data
remote_read:
- url: "https://prometheus-remote-storage.example.com/api/v1/read"
basic_auth:
username: prometheus
password_file: /etc/prometheus/remote_storage_password
read_recent: true
```
## PromQL: The Prometheus Query Language
### Instant Vectors and Selectors
```promql
# Basic metric selection
http_requests_total
# Filter by label
http_requests_total{job="api", status="200"}
# Regex matching
http_requests_total{status=~"2..|3.."}
# Negative matching
http_requests_total{status!="500"}
# Multiple label matchers
http_requests_total{job="api", method="GET", status=~"2.."}
```
### Range Vectors and Aggregations
```promql
# 5-minute range vector
http_requests_total[5m]
# Rate of increase per second
rate(http_requests_total[5m])
# Increase over time window
increase(http_requests_total[1h])
# Average over time
avg_over_time(cpu_usage[5m])
# Max/Min over time
max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])
# Standard deviation
stddev_over_time(response_time_seconds[5m])
```
### Aggregation Operators
```promql
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Sum grouped by job
sum by (job) (rate(http_requests_total[5m]))
# Average grouped by multiple labels
avg by (job, instance) (cpu_usage)
# Count number of series
count(up == 1)
# Topk and bottomk
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_MemAvailable_bytes)
# Quantile across instances
quantile(0.95, http_request_duration_seconds)
```
### Mathematical Operations
```promql
# Arithmetic operations: memory utilization percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Comparison operators
http_request_duration_seconds > 0.5
# Logical operators
up == 1 and rate(http_requests_total[5m]) > 100
# Vector matching
rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])
```
### Advanced PromQL Patterns
```promql
# Request success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Latency percentiles (histogram)
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Predict linear growth
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
# Detect anomalies with standard deviation
abs(cpu_usage - avg_over_time(cpu_usage[1h]))
>
3 * stddev_over_time(cpu_usage[1h])
# CPU utilization per instance (USE method)
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
/
count(node_cpu_seconds_total{mode="idle"}) by (instance)
# Burn rate for SLO
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
>
(14.4 * (1 - 0.999)) # For 99.9% SLO
```
## Alerting with Prometheus and Alertmanager
### Alert Rule Definitions
```yaml
# alerts/api_alerts.yml
groups:
- name: api_alerts
interval: 30s
rules:
# High error rate alert
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
> 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
runbook_url: "https://runbooks.example.com/HighErrorRate"
# High latency alert
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
) > 1
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "High latency on {{ $labels.service }}"
description: "P99 latency is {{ $value }}s on {{ $labels.service }}"
# Service down alert
- alert: ServiceDown
expr: up{job="api"} == 0
for: 2m
labels:
severity: critical
team: sre
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"
# Disk space alert
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"}
/
node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
for: 5m
labels:
severity: warning
team: sre
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"
# Memory pressure alert
- alert: HighMemoryUsage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 10m
labels:
severity: warning
team: sre
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"
# CPU saturation alert
- alert: HighCPUUsage
expr: |
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 15m
labels:
severity: warning
team: sre
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
```
### Alertmanager Configuration
```yaml
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# Templates for notifications
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route tree for alert distribution
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'team-default'
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
# Critical alerts also go to Slack
- match:
severity: critical
receiver: 'slack-critical'
group_wait: 0s
# Warning alerts to Slack only
- match:
severity: warning
receiver: 'slack-warnings'
# Team-specific routing
- match:
team: backend
receiver: 'team-backend'
- match:
team: frontend
receiver: 'team-frontend'
# Database alerts to DBA team
- match_re:
service: 'postgres|mysql|mongodb'
receiver: 'team-dba'
# Alert receivers/integrations
receivers:
- name: 'team-default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
severity: '{{ .CommonLabels.severity }}'
- name: 'slack-critical'
slack_configs:
- channel: '#incidents'
title: 'CRITICAL: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
color: 'danger'
send_resolved: true
- name: 'slack-warnings'
slack_configs:
- channel: '#monitoring'
title: 'Warning: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'warning'
send_resolved: true
- name: 'team-backend'
slack_configs:
- channel: '#team-backend'
send_resolved: true
email_configs:
- to: '[email protected]'
from: '[email protected]'
smarthost: 'smtp.gmail.com:587'
auth_username: '[email protected]'
auth_password_file: '/etc/alertmanager/email_password'
- name: 'team-frontend'
slack_configs:
- channel: '#team-frontend'
send_resolved: true
- name: 'team-dba'
slack_configs:
- channel: '#team-dba'
send_resolved: true
pagerduty_configs:
- service_key: 'DBA_PAGERDUTY_KEY'
# Inhibition rules (suppress alerts)
inhibit_rules:
# Inhibit warnings if critical alert is firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
# Don't alert on instance down if cluster is down
- source_match:
alertname: 'ClusterDown'
target_match_re:
alertname: 'InstanceDown|ServiceDown'
equal: ['cluster']
```
### Multi-Window Multi-Burn-Rate Alerts for SLOs
```yaml
# SLO-based alerting using burn rate
groups:
- name: slo_alerts
interval: 30s
rules:
# Fast burn (1h window, 5m burn)
- alert: ErrorBudgetBurnFast
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * (1 - 0.999))
and
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > (14.4 * (1 - 0.999))
for: 2m
labels:
severity: critical
slo: "99.9%"
annotations:
summary: "Fast error budget burn detected"
description: "Error rate is burning through 99.9% SLO budget 14.4x faster than normal"
# Slow burn (6h window, 30m burn)
- alert: ErrorBudgetBurnSlow
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * (1 - 0.999))
and
(
sum(rate(http_requests_total{status=~"5.."}[30m]))
/
sum(rate(http_requests_total[30m]))
) > (6 * (1 - 0.999))
for: 15m
labels:
severity: warning
slo: "99.9%"
annotations:
summary: "Slow error budget burn detected"
description: "Error rate is burning through 99.9% SLO budget 6x faster than normal"
```
## Grafana Dashboards and Visualization
### Dashboard JSON Structure
```json
{
"dashboard": {
"title": "API Performance Dashboard",
"tags": ["api", "performance", "production"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
"time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
},
"templating": {
"list": [
{
"name": "cluster",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, cluster)",
"refresh": 1,
"multi": false,
"includeAll": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{cluster=\"$cluster\"}, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"query": "1m,5m,10m,30m,1h",
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service)",
"legendFormat": "{{ service }}",
"refId": "A"
}
],
"yaxes": [
{"format": "reqps", "label": "Requests/sec"},
{"format": "short"}
],
"legend": {
"show": true,
"values": true,
"current": true,
"avg": true,
"max": true
}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",status=~\"5..\"}[$interval])) by (service) / sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service) * 100",
"legendFormat": "{{ service }} error %",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "label": "Error Rate"},
{"format": "short"}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"name": "High Error Rate",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 3,
"title": "Latency Percentiles",
"type": "graph",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p99",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p95",
"refId": "B"
},
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p50",
"refId": "C"
}
],
"yaxes": [
{"format": "s", "label": "Duration"},
{"format": "short"}
]
}
]
}
}
```
### RED Method Dashboard
The RED method focuses on Request rate, Error rate, and Duration:
```json
{
"panels": [
{
"title": "Request Rate (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)"
}
]
},
{
"title": "Error Rate % (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100"
}
]
},
{
"title": "Duration p99 (per service)",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))"
}
]
}
]
}
```
### USE Method Dashboard
The USE method monitors Utilization, Saturation, and Errors:
```json
{
"panels": [
{
"title": "CPU Utilization %",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) * 100)"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"targets": [
{
"expr": "node_load1"
}
]
},
{
"title": "Memory Utilization %",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
}
]
},
{
"title": "Disk I/O Utilization %",
"targets": [
{
"expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100"
}
]
},
{
"title": "Network Errors",
"targets": [
{
"expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])"
}
]
}
]
}
```
## Exporters and Metric Collection
### Node Exporter for System Metrics
```bash
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter --web.listen-address=":9100" \
--collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
--collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"
```
**Key Metrics from Node Exporter:**
- `node_cpu_seconds_total`: CPU usage by mode
- `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`: Memory
- `node_disk_io_time_seconds_total`: Disk I/O
- `node_network_receive_bytes_total`, `node_network_transmit_bytes_total`: Network
- `node_filesystem_size_bytes`, `node_filesystem_avail_bytes`: Disk space
### Custom Application Exporter (Python)
```python
# app_exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import time
import random
# Define metrics
REQUEST_COUNT = Counter(
'app_requests_total',
'Total app requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'app_request_duration_seconds',
'Request duration in seconds',
['method', 'endpoint'],
buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)
ACTIVE_USERS = Gauge(
'app_active_users',
'Number of active users'
)
QUEUE_SIZE = Gauge(
'app_queue_size',
'Current queue size',
['queue_name']
)
DATABASE_CONNECTIONS = Gauge(
'app_database_connections',
'Number of database connections',
['pool', 'state']
)
CACHE_HITS = Counter(
'app_cache_hits_total',
'Total cache hits',
['cache_name']
)
CACHE_MISSES = Counter(
'app_cache_misses_total',
'Total cache misses',
['cache_name']
)
def simulate_metrics():
"""Simulate application metrics"""
while True:
# Simulate requests
method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
status = random.choice(['200', '200', '200', '400', '500'])
REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()
# Simulate request duration
duration = random.uniform(0.01, 2.0)
REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)
# Update gauges
ACTIVE_USERS.set(random.randint(100, 1000))
QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))
# Database connection pool
DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))
# Cache metrics
if random.random() > 0.3:
CACHE_HITS.labels(cache_name='redis').inc()
else:
CACHE_MISSES.labels(cache_name='redis').inc()
time.sleep(1)
if __name__ == '__main__':
# Start metrics server on port 8000
start_http_server(8000)
print("Metrics server started on port 8000")
simulate_metrics()
```
### Custom Exporter (Go)
```go
package main
import (
"log"
"math/rand"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "app_requests_total",
Help: "Total number of requests",
},
[]string{"method", "endpoint", "status"},
)
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "app_request_duration_seconds",
Help: "Request duration in seconds",
Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
},
[]string{"method", "endpoint"},
)
activeUsers = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "app_active_users",
Help: "Number of active users",
},
)
databaseConnections = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "app_database_connections",
Help: "Number of database connections",
},
[]string{"pool", "state"},
)
)
func init() {
prometheus.MustRegister(requestsTotal)
prometheus.MustRegister(requestDuration)
prometheus.MustRegister(activeUsers)
prometheus.MustRegister(databaseConnections)
}
func simulateMetrics() {
ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()
for range ticker.C {
// Simulate requests
methods := []string{"GET", "POST", "PUT", "DELETE"}
endpoints := []string{"/api/users", "/api/products", "/api/orders"}
statuses := []string{"200", "200", "200", "400", "500"}
method := methods[rand.Intn(len(methods))]
endpoint := endpoints[rand.Intn(len(endpoints))]
status := statuses[rand.Intn(len(statuses))]
requestsTotal.WithLabelValues(method, endpoint, status).Inc()
requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)
// Update gauges
activeUsers.Set(float64(rand.Intn(900) + 100))
databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
}
}
func main() {
go simulateMetrics()
http.Handle("/metrics", promhttp.Handler())
log.Println("Starting metrics server on :8000")
log.Fatal(http.ListenAndServe(":8000", nil))
}
```
### PostgreSQL Exporter
```yaml
# docker-compose.yml for postgres_exporter
version: '3.8'
services:
postgres-exporter:
image: prometheuscommunity/postgres-exporter
environment:
DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
ports:
- "9187:9187"
command:
- '--collector.stat_statements'
- '--collector.stat_database'
- '--collector.replication'
```
**Key PostgreSQL Metrics:**
- `pg_up`: Database reachability
- `pg_stat_database_tup_returned`: Rows read
- `pg_stat_database_tup_inserted`: Rows inserted
- `pg_stat_database_deadlocks`: Deadlock count
- `pg_stat_replication_lag`: Replication lag in seconds
- `pg_locks_count`: Active locks
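These metrics plug straight into the alerting patterns shown earlier. A hedged sketch of alert expressions, assuming your postgres_exporter version exposes the metric names listed above:
```promql
# Database unreachable
pg_up == 0

# Deadlocks occurring
rate(pg_stat_database_deadlocks[5m]) > 0

# Replication lag above 30 seconds
pg_stat_replication_lag > 30
```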
### Blackbox Exporter for Probing
```yaml
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200]
method: GET
preferred_ip_protocol: "ip4"
http_post_json:
prober: http
http:
method: POST
headers:
Content-Type: application/json
body: '{"key":"value"}'
valid_status_codes: [200, 201]
tcp_connect:
prober: tcp
timeout: 5s
icmp:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
```
```yaml
# Prometheus config for blackbox exporter
scrape_configs:
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://example.com
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
```
## Best Practices
### Metric Naming Conventions
Follow Prometheus naming best practices:
```
# Format: <namespace>_<subsystem>_<metric>_<unit>
# Good examples
http_requests_total # Counter
http_request_duration_seconds # Histogram
database_connections_active # Gauge
cache_hits_total # Counter
memory_usage_bytes # Gauge
# Include unit suffixes
_seconds, _bytes, _total, _ratio, _percentage
# Avoid
RequestCount # Use snake_case
http_requests # Missing _total for counter
request_time # Missing unit (should be _seconds)
```
### Label Guidelines
```promql
# Good: Low cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}
# Bad: High cardinality labels (avoid)
http_requests_total{user_id="12345", session_id="abc-def-ghi"}
# Good: Use bounded label values
http_requests_total{status_class="2xx"}
# Bad: Unbounded label values
http_requests_total{response_size="1234567"}
```
### Recording Rule Patterns
```yaml
groups:
- name: performance_rules
interval: 30s
rules:
# Pre-aggregate expensive queries
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
# Namespace aggregations
- record: namespace:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (namespace)
# SLI calculations
- record: job:http_requests_success:rate5m
expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)
- record: job:http_requests_error_rate:ratio
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
```
### Alert Design Principles
1. **Alert on symptoms, not causes**: Alert on user-facing issues
2. **Make alerts actionable**: Include runbook links
3. **Use appropriate severity levels**: Critical, warning, info
4. **Set proper thresholds**: Based on historical data
5. **Include context in annotations**: Help on-call engineers
6. **Group related alerts**: Reduce alert fatigue
7. **Use inhibition rules**: Suppress redundant alerts
8. **Test alert rules**: Verify they fire when expected (see the `promtool` sketch below)
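Alert rules can be unit-tested with `promtool test rules`. A sketch for the `ServiceDown` rule defined earlier (file paths are illustrative):
```yaml
# tests/api_alerts_test.yml -- run with: promtool test rules tests/api_alerts_test.yml
rule_files:
  - ../alerts/api_alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # api-1 reports down (up == 0) for the whole test window
      - series: 'up{job="api", instance="api-1:8080"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: sre
              job: api
              instance: api-1:8080
            exp_annotations:
              summary: "Service api-1:8080 is down"
              description: "api on api-1:8080 has been down for more than 2 minutes"
```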
### Dashboard Best Practices
1. **One dashboard per audience**: SRE, developers, business
2. **Use consistent time ranges**: Make comparisons easier
3. **Include SLI/SLO metrics**: Show business impact
4. **Add annotations for deploys**: Correlate changes with metrics
5. **Use template variables**: Make dashboards reusable
6. **Show trends and aggregates**: Not just raw metrics
7. **Include links to runbooks**: Enable quick response
8. **Use appropriate visualizations**: Graphs, gauges, tables
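One way to keep dashboards consistent across environments is to provision them as code rather than editing them in the UI. A minimal sketch of Grafana file-based dashboard provisioning (paths and folder names are illustrative):
```yaml
# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'
    folder: 'Production'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards   # dashboard JSON files live here
```
Data sources can be provisioned the same way under `/etc/grafana/provisioning/datasources/`, which pairs well with template variables for reusable dashboards.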
### High Availability Setup
```yaml
# Prometheus HA with Thanos
# Deploy multiple Prometheus instances with same config
# Use Thanos to deduplicate and provide global view
# prometheus-1.yml
global:
external_labels:
cluster: 'prod'
replica: '1'
# prometheus-2.yml
global:
external_labels:
cluster: 'prod'
replica: '2'
# Thanos sidecar configuration
# Uploads blocks to object storage
# Provides StoreAPI for querying
```
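The comments above gloss over the sidecar invocation itself. As a hedged sketch, typical Thanos sidecar flags look like this (the object-storage config path is an assumption):
```yaml
# Container args for the Thanos sidecar running next to each Prometheus replica
args:
  - sidecar
  - --prometheus.url=http://localhost:9090
  - --tsdb.path=/prometheus
  - --objstore.config-file=/etc/thanos/objstore.yml   # bucket credentials live here
  - --grpc-address=0.0.0.0:10901                      # StoreAPI endpoint for Thanos Query
```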
### Capacity Planning Queries
```promql
# Disk space exhaustion prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
# Memory growth trend
predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)
# Request rate growth
predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)
# Approximate average storage growth in bytes/second (assuming 30-day retention)
prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
```
## Advanced Patterns
### Federation for Multi-Cluster Monitoring
```yaml
# Global Prometheus federating from cluster Prometheus instances
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="prometheus"}'
- '{__name__=~"job:.*"}' # All recording rules
static_configs:
- targets:
- 'prometheus-us-west:9090'
- 'prometheus-us-east:9090'
- 'prometheus-eu-central:9090'
```
### Cost Monitoring Pattern
```yaml
# Track cloud costs with custom metrics
groups:
- name: cost_tracking
rules:
- record: cloud:cost:hourly_rate
expr: |
(
sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03 # CPU cost/hour
+
sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005 # Memory cost/hour
)
- record: cloud:cost:monthly_estimate
expr: cloud:cost:hourly_rate * 730 # Hours in average month
```
### Custom SLO Implementation
```yaml
# SLO: 99.9% availability for API
groups:
- name: api_slo
interval: 30s
rules:
# Success rate SLI
- record: api:sli:success_rate
expr: |
sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
# Error budget remaining (30 days)
- record: api:error_budget:remaining
expr: |
1 - (
(1 - api:sli:success_rate)
/
(1 - 0.999)
)
# Latency SLI: 1 if p99 < 500ms, else 0 (the bool modifier keeps the series continuous)
- record: api:sli:latency_success_rate
expr: |
(
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
            ) < bool 0.5
)
```
## Examples Summary
This skill includes 20+ comprehensive examples covering:
1. Prometheus configuration (basic, Kubernetes SD, storage)
2. PromQL queries (instant vectors, range vectors, aggregations)
3. Mathematical operations and advanced patterns
4. Alert rule definitions (error rate, latency, resource usage)
5. Alertmanager configuration (routing, receivers, inhibition)
6. Multi-window multi-burn-rate SLO alerts
7. Grafana dashboard JSON (full dashboard, RED method, USE method)
8. Custom exporters (Python, Go)
9. Third-party exporters (PostgreSQL, Blackbox)
10. Recording rules for performance
11. Federation for multi-cluster monitoring
12. Cost monitoring and SLO implementation
13. High availability patterns
14. Capacity planning queries
---
**Skill Version**: 1.0.0
**Last Updated**: October 2025
**Skill Category**: Observability, Monitoring, SRE, DevOps
**Compatible With**: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes
This skill provides a practical, production-ready guide to observability and monitoring using Prometheus, Grafana, exporters, PromQL and Alertmanager. It focuses on metrics collection, time-series analysis, alerting, dashboards, and SLO-driven reliability for cloud-native and distributed systems. Use it to design, deploy, and operate a scalable monitoring stack with actionable alerts and efficient queries.
The skill explains Prometheus architecture (scraping, TSDB, remote write/read, Alertmanager) and how exporters and service discovery feed metrics into Prometheus. It teaches PromQL patterns, recording rules, and alert rule design so you can pre-compute heavy queries, reduce cardinality, and route alerts. It also covers SLOs/SLIs, federation, remote storage, and Grafana dashboards for visualization and incident response.
## FAQ

**How do I reduce metric cardinality?**
Remove high-cardinality label values (user IDs, request IDs), normalize labels via `relabel_configs`, and use recording rules to pre-aggregate where possible.

**When should I use `remote_write` vs. local retention?**
Keep recent, high-resolution data locally for fast queries and alerting; use `remote_write` to offload long-term storage, compliance retention, and cross-cluster historical analysis.