This skill enables production-grade observability using Prometheus, Grafana, and OpenTelemetry to collect metrics, set alerts, and build dashboards. Add it to your agents with `npx playbooks add skill manutej/luxor-claude-marketplace --skill observability-monitoring`.
---
name: observability-monitoring
description: Comprehensive observability and monitoring skill covering Prometheus, Grafana, metrics collection, alerting, exporters, PromQL, and production monitoring patterns for distributed systems and cloud-native applications
---
# Observability & Monitoring
A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.
## When to Use This Skill
Use this skill when:
- Setting up monitoring for production systems and applications
- Implementing metrics collection and observability for microservices
- Creating dashboards and visualizations for system health monitoring
- Defining alerting rules and incident response automation
- Analyzing system performance and capacity using time-series data
- Implementing SLIs, SLOs, and SLAs for service reliability
- Debugging production issues using metrics and traces
- Building custom exporters for application-specific metrics
- Setting up federation for multi-cluster monitoring
- Migrating from legacy monitoring to cloud-native solutions
- Implementing cost monitoring and optimization tracking
- Creating real-time operational dashboards for DevOps teams
## Core Concepts
### The Four Pillars of Observability
Modern observability is built on four fundamental pillars:
1. **Metrics**: Numerical measurements of system behavior over time
- Counter: Monotonically increasing values (requests served, errors)
- Gauge: Point-in-time values that go up and down (memory usage, temperature)
- Histogram: Distribution of values (request duration buckets)
   - Summary: Similar to a histogram, but quantiles are calculated client-side (see the client-library sketch after this list)
2. **Logs**: Discrete events with contextual information
- Structured logging (JSON, key-value pairs)
- Centralized log aggregation (ELK, Loki)
- Correlation with metrics and traces
3. **Traces**: Request flow through distributed systems
- Span: Single unit of work with start/end time
- Trace: Collection of spans representing end-to-end request
- OpenTelemetry for distributed tracing
4. **Events**: Significant occurrences in system lifecycle
- Deployments, configuration changes
- Scaling events, incidents
- Business events and user actions
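The metric types above map directly onto client-library primitives. A minimal sketch using the Python `prometheus_client` library (the metric and label names here are illustrative, not taken from a real service):
```python
# Sketch: the four Prometheus metric types via prometheus_client (illustrative names)
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

REQUESTS = Counter('app_requests_total', 'Requests served', ['status'])           # only increases
IN_FLIGHT = Gauge('app_in_flight_requests', 'Requests currently being handled')   # goes up and down
LATENCY = Histogram('app_request_duration_seconds', 'Request latency in seconds',
                    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5])                      # server-side buckets
PAYLOAD = Summary('app_request_size_bytes', 'Request payload size in bytes')       # client-side quantiles

def record_request(status: str, duration_s: float, size_bytes: float) -> None:
    IN_FLIGHT.inc()
    REQUESTS.labels(status=status).inc()
    LATENCY.observe(duration_s)
    PAYLOAD.observe(size_bytes)
    IN_FLIGHT.dec()

if __name__ == '__main__':
    start_http_server(8000)   # expose /metrics on :8000
    record_request('200', 0.12, 512)
```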
### Prometheus Architecture
Prometheus is a pull-based monitoring system with key components:
**Time-Series Database (TSDB)**
- Stores metrics as time-series data
- Efficient compression and retention policies
- Local storage with optional remote storage
**Scrape Targets**
- Service discovery (Kubernetes, Consul, EC2, etc.)
- Static configuration
- Relabeling for flexible target selection
**PromQL Query Engine**
- Powerful query language for metrics analysis
- Aggregation, filtering, and mathematical operations
- Range vectors and instant vectors
**Alertmanager**
- Alert rule evaluation
- Grouping, silencing, and routing
- Integration with PagerDuty, Slack, email, webhooks
**Exporters**
- Bridge between Prometheus and systems
- Node exporter, cAdvisor, custom exporters
- Third-party exporters for databases, services
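To make the pull model concrete, this is roughly what a scrape target's `/metrics` endpoint returns in the Prometheus text exposition format (the values are illustrative):
```text
# HELP http_requests_total Total HTTP requests served
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/users",status="200"} 1027
http_requests_total{method="POST",endpoint="/api/orders",status="500"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.5056e+07
```
Prometheus scrapes such endpoints on every `scrape_interval`, attaches the configured target labels, and appends the samples to its TSDB.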
### Metric Labels and Cardinality
Labels are key-value pairs attached to metrics:
```prometheus
http_requests_total{method="GET", endpoint="/api/users", status="200"}
```
**Label Best Practices:**
- Use labels for dimensions you query/aggregate by
- Avoid high-cardinality labels (user IDs, timestamps)
- Keep label names consistent across metrics
- Use relabeling to normalize externally supplied labels (a scrape-time sketch follows this list)
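A sketch of label normalization at scrape time, assuming a client library accidentally exposes a high-cardinality `session_id` label (the job name and label name are hypothetical):
```yaml
scrape_configs:
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:8080']
    metric_relabel_configs:
      # Drop the high-cardinality label from every ingested series
      - action: labeldrop
        regex: session_id
```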
**Cardinality Considerations:**
- Each unique label combination = new time-series
- High cardinality = increased memory and storage
- Monitor cardinality with `prometheus_tsdb_head_series` (active series) and `prometheus_tsdb_symbol_table_size_bytes`; see the queries after this list
- Use recording rules to pre-aggregate high-cardinality metrics
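A quick way to find cardinality offenders, sketched as ad-hoc PromQL run against the Prometheus instance itself:
```promql
# Ten metric names with the most active series
topk(10, count by (__name__) ({__name__=~".+"}))

# Total number of active series in the head block
prometheus_tsdb_head_series
```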
### Recording Rules
Pre-compute frequently-used or expensive queries:
```yaml
groups:
- name: api_performance
interval: 30s
rules:
- record: api:http_request_duration_seconds:p99
expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
- record: api:http_requests:rate5m
expr: rate(http_requests_total[5m])
```
**Benefits:**
- Faster dashboard loading
- Reduced query load on Prometheus
- Consistent metric naming conventions
- Enable complex aggregations
### Service Level Objectives (SLOs)
Define and track reliability targets:
**SLI (Service Level Indicator)**: Metric measuring service quality
- Availability: % of successful requests
- Latency: % of requests under threshold
- Throughput: Requests per second
**SLO (Service Level Objective)**: Target for SLI
- 99.9% availability (43.8 minutes downtime/month)
- 95% of requests < 200ms
- 1000 RPS sustained
**SLA (Service Level Agreement)**: Contract with consequences
- External commitments to customers
- Financial penalties for SLO violations
**Error Budget**: Acceptable failure rate
- Error budget = 100% - SLO
- 99.9% SLO = 0.1% error budget
- Use budget for innovation vs. reliability tradeoff
## Prometheus Setup and Configuration
### Basic Prometheus Configuration
```yaml
# prometheus.yml
global:
scrape_interval: 15s # Default scrape interval
evaluation_interval: 15s # Alert rule evaluation interval
external_labels:
cluster: 'production'
region: 'us-west-2'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Load rules
rule_files:
- 'rules/*.yml'
- 'alerts/*.yml'
# Scrape configurations
scrape_configs:
# Prometheus self-monitoring
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node exporter for system metrics
- job_name: 'node'
static_configs:
- targets:
- 'node1:9100'
- 'node2:9100'
- 'node3:9100'
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: '([^:]+):.*'
replacement: '${1}'
# Application metrics
- job_name: 'api'
static_configs:
- targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
labels:
env: 'production'
tier: 'backend'
```
### Kubernetes Service Discovery
```yaml
scrape_configs:
# Kubernetes API server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Kubernetes pods with prometheus.io annotations
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with prometheus.io/scrape: "true" annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use the port from prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        target_label: __address__
        replacement: ${1}:${2}
# Use the path from prometheus.io/path annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
# Add namespace label
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
# Add pod name label
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Kubernetes services
- job_name: 'kubernetes-services'
kubernetes_sd_configs:
- role: service
metrics_path: /probe
params:
module: [http_2xx]
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
action: keep
regex: true
- source_labels: [__address__]
target_label: __param_target
- target_label: __address__
replacement: blackbox-exporter:9115
- source_labels: [__param_target]
target_label: instance
```
### Storage and Retention
```yaml
# Storage path and retention are set via command-line flags, not in prometheus.yml:
#   --storage.tsdb.path=/prometheus/data
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB
# Remote write for long-term storage
remote_write:
- url: "https://prometheus-remote-storage.example.com/api/v1/write"
basic_auth:
username: prometheus
password_file: /etc/prometheus/remote_storage_password
queue_config:
capacity: 10000
max_shards: 50
max_samples_per_send: 5000
write_relabel_configs:
# Drop high-cardinality metrics
- source_labels: [__name__]
regex: 'container_network_.*'
action: drop
# Remote read for querying historical data
remote_read:
- url: "https://prometheus-remote-storage.example.com/api/v1/read"
basic_auth:
username: prometheus
password_file: /etc/prometheus/remote_storage_password
read_recent: true
```
## PromQL: The Prometheus Query Language
### Instant Vectors and Selectors
```promql
# Basic metric selection
http_requests_total
# Filter by label
http_requests_total{job="api", status="200"}
# Regex matching
http_requests_total{status=~"2..|3.."}
# Negative matching
http_requests_total{status!="500"}
# Multiple label matchers
http_requests_total{job="api", method="GET", status=~"2.."}
```
### Range Vectors and Aggregations
```promql
# 5-minute range vector
http_requests_total[5m]
# Rate of increase per second
rate(http_requests_total[5m])
# Increase over time window
increase(http_requests_total[1h])
# Average over time
avg_over_time(cpu_usage[5m])
# Max/Min over time
max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])
# Standard deviation
stddev_over_time(response_time_seconds[5m])
```
### Aggregation Operators
```promql
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Sum grouped by job
sum by (job) (rate(http_requests_total[5m]))
# Average grouped by multiple labels
avg by (job, instance) (cpu_usage)
# Count number of series
count(up == 1)
# Topk and bottomk
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_MemAvailable_bytes)
# Quantile across instances
quantile(0.95, http_request_duration_seconds)
```
### Mathematical Operations
```promql
# Arithmetic operations: memory utilization percentage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
# Comparison operators
http_request_duration_seconds > 0.5
# Logical operators
up == 1 and rate(http_requests_total[5m]) > 100
# Vector matching
rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])
```
### Advanced PromQL Patterns
```promql
# Request success rate
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100
# Latency percentiles (histogram)
histogram_quantile(0.99,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Predict linear growth
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)
# Detect anomalies with standard deviation
abs(cpu_usage - avg_over_time(cpu_usage[1h]))
>
3 * stddev_over_time(cpu_usage[1h])
# CPU utilization per instance (USE method)
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
/
count(node_cpu_seconds_total{mode="idle"}) by (instance)
# Burn rate for SLO
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
>
(14.4 * (1 - 0.999)) # For 99.9% SLO
```
## Alerting with Prometheus and Alertmanager
### Alert Rule Definitions
```yaml
# alerts/api_alerts.yml
groups:
- name: api_alerts
interval: 30s
rules:
# High error rate alert
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
> 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
runbook_url: "https://runbooks.example.com/HighErrorRate"
# High latency alert
- alert: HighLatency
expr: |
histogram_quantile(0.99,
sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
) > 1
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "High latency on {{ $labels.service }}"
description: "P99 latency is {{ $value }}s on {{ $labels.service }}"
# Service down alert
- alert: ServiceDown
expr: up{job="api"} == 0
for: 2m
labels:
severity: critical
team: sre
annotations:
summary: "Service {{ $labels.instance }} is down"
description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"
# Disk space alert
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/"}
/
node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
for: 5m
labels:
severity: warning
team: sre
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"
# Memory pressure alert
- alert: HighMemoryUsage
expr: |
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 10m
labels:
severity: warning
team: sre
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"
# CPU saturation alert
- alert: HighCPUUsage
expr: |
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 15m
labels:
severity: warning
team: sre
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
```
### Alertmanager Configuration
```yaml
# alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'
# Templates for notifications
templates:
- '/etc/alertmanager/templates/*.tmpl'
# Route tree for alert distribution
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'team-default'
routes:
# Critical alerts go to PagerDuty
- match:
severity: critical
receiver: 'pagerduty-critical'
continue: true
# Critical alerts also go to Slack
- match:
severity: critical
receiver: 'slack-critical'
group_wait: 0s
# Warning alerts to Slack only
- match:
severity: warning
receiver: 'slack-warnings'
# Team-specific routing
- match:
team: backend
receiver: 'team-backend'
- match:
team: frontend
receiver: 'team-frontend'
# Database alerts to DBA team
- match_re:
service: 'postgres|mysql|mongodb'
receiver: 'team-dba'
# Alert receivers/integrations
receivers:
- name: 'team-default'
slack_configs:
- channel: '#alerts'
title: 'Alert: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
send_resolved: true
- name: 'pagerduty-critical'
pagerduty_configs:
- service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
severity: '{{ .CommonLabels.severity }}'
- name: 'slack-critical'
slack_configs:
- channel: '#incidents'
title: 'CRITICAL: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
color: 'danger'
send_resolved: true
- name: 'slack-warnings'
slack_configs:
- channel: '#monitoring'
title: 'Warning: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
color: 'warning'
send_resolved: true
- name: 'team-backend'
slack_configs:
- channel: '#team-backend'
send_resolved: true
email_configs:
- to: '[email protected]'
from: '[email protected]'
smarthost: 'smtp.gmail.com:587'
auth_username: '[email protected]'
auth_password_file: '/etc/alertmanager/email_password'
- name: 'team-frontend'
slack_configs:
- channel: '#team-frontend'
send_resolved: true
- name: 'team-dba'
slack_configs:
- channel: '#team-dba'
send_resolved: true
pagerduty_configs:
- service_key: 'DBA_PAGERDUTY_KEY'
# Inhibition rules (suppress alerts)
inhibit_rules:
# Inhibit warnings if critical alert is firing
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
# Don't alert on instance down if cluster is down
- source_match:
alertname: 'ClusterDown'
target_match_re:
alertname: 'InstanceDown|ServiceDown'
equal: ['cluster']
```
### Multi-Window Multi-Burn-Rate Alerts for SLOs
```yaml
# SLO-based alerting using burn rate
groups:
- name: slo_alerts
interval: 30s
rules:
# Fast burn (1h window, 5m burn)
- alert: ErrorBudgetBurnFast
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
) > (14.4 * (1 - 0.999))
and
(
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) > (14.4 * (1 - 0.999))
for: 2m
labels:
severity: critical
slo: "99.9%"
annotations:
summary: "Fast error budget burn detected"
description: "Error rate is burning through 99.9% SLO budget 14.4x faster than normal"
# Slow burn (6h window, 30m burn)
- alert: ErrorBudgetBurnSlow
expr: |
(
sum(rate(http_requests_total{status=~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
) > (6 * (1 - 0.999))
and
(
sum(rate(http_requests_total{status=~"5.."}[30m]))
/
sum(rate(http_requests_total[30m]))
) > (6 * (1 - 0.999))
for: 15m
labels:
severity: warning
slo: "99.9%"
annotations:
summary: "Slow error budget burn detected"
description: "Error rate is burning through 99.9% SLO budget 6x faster than normal"
```
## Grafana Dashboards and Visualization
### Dashboard JSON Structure
```json
{
"dashboard": {
"title": "API Performance Dashboard",
"tags": ["api", "performance", "production"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
"time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
},
"templating": {
"list": [
{
"name": "cluster",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, cluster)",
"refresh": 1,
"multi": false,
"includeAll": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{cluster=\"$cluster\"}, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"query": "1m,5m,10m,30m,1h",
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service)",
"legendFormat": "{{ service }}",
"refId": "A"
}
],
"yaxes": [
{"format": "reqps", "label": "Requests/sec"},
{"format": "short"}
],
"legend": {
"show": true,
"values": true,
"current": true,
"avg": true,
"max": true
}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service=~\"$service\",status=~\"5..\"}[$interval])) by (service) / sum(rate(http_requests_total{service=~\"$service\"}[$interval])) by (service) * 100",
"legendFormat": "{{ service }} error %",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "label": "Error Rate"},
{"format": "short"}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"name": "High Error Rate",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 3,
"title": "Latency Percentiles",
"type": "graph",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p99",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p95",
"refId": "B"
},
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
"legendFormat": "{{ service }} p50",
"refId": "C"
}
],
"yaxes": [
{"format": "s", "label": "Duration"},
{"format": "short"}
]
}
]
}
}
```
### RED Method Dashboard
The RED method focuses on Request rate, Error rate, and Duration:
```json
{
"panels": [
{
"title": "Request Rate (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)"
}
]
},
{
"title": "Error Rate % (per service)",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100"
}
]
},
{
"title": "Duration p99 (per service)",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))"
}
]
}
]
}
```
### USE Method Dashboard
The USE method monitors Utilization, Saturation, and Errors:
```json
{
"panels": [
{
"title": "CPU Utilization %",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[$__rate_interval])) * 100)"
}
]
},
{
"title": "CPU Saturation (Load Average)",
"targets": [
{
"expr": "node_load1"
}
]
},
{
"title": "Memory Utilization %",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100"
}
]
},
{
"title": "Disk I/O Utilization %",
"targets": [
{
"expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100"
}
]
},
{
"title": "Network Errors",
"targets": [
{
"expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])"
}
]
}
]
}
```
## Exporters and Metric Collection
### Node Exporter for System Metrics
```bash
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
./node_exporter --web.listen-address=":9100" \
--collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
--collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"
```
**Key Metrics from Node Exporter:**
- `node_cpu_seconds_total`: CPU usage by mode
- `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`: Memory
- `node_disk_io_time_seconds_total`: Disk I/O
- `node_network_receive_bytes_total`, `node_network_transmit_bytes_total`: Network
- `node_filesystem_size_bytes`, `node_filesystem_avail_bytes`: Disk space
### Custom Application Exporter (Python)
```python
# app_exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary
import time
import random
# Define metrics
REQUEST_COUNT = Counter(
'app_requests_total',
'Total app requests',
['method', 'endpoint', 'status']
)
REQUEST_DURATION = Histogram(
'app_request_duration_seconds',
'Request duration in seconds',
['method', 'endpoint'],
buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]
)
ACTIVE_USERS = Gauge(
'app_active_users',
'Number of active users'
)
QUEUE_SIZE = Gauge(
'app_queue_size',
'Current queue size',
['queue_name']
)
DATABASE_CONNECTIONS = Gauge(
'app_database_connections',
'Number of database connections',
['pool', 'state']
)
CACHE_HITS = Counter(
'app_cache_hits_total',
'Total cache hits',
['cache_name']
)
CACHE_MISSES = Counter(
'app_cache_misses_total',
'Total cache misses',
['cache_name']
)
def simulate_metrics():
"""Simulate application metrics"""
while True:
# Simulate requests
method = random.choice(['GET', 'POST', 'PUT', 'DELETE'])
endpoint = random.choice(['/api/users', '/api/products', '/api/orders'])
status = random.choice(['200', '200', '200', '400', '500'])
REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()
# Simulate request duration
duration = random.uniform(0.01, 2.0)
REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)
# Update gauges
ACTIVE_USERS.set(random.randint(100, 1000))
QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))
# Database connection pool
DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))
# Cache metrics
if random.random() > 0.3:
CACHE_HITS.labels(cache_name='redis').inc()
else:
CACHE_MISSES.labels(cache_name='redis').inc()
time.sleep(1)
if __name__ == '__main__':
# Start metrics server on port 8000
start_http_server(8000)
print("Metrics server started on port 8000")
simulate_metrics()
```
### Custom Exporter (Go)
```go
package main
import (
"log"
"math/rand"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
requestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "app_requests_total",
Help: "Total number of requests",
},
[]string{"method", "endpoint", "status"},
)
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "app_request_duration_seconds",
Help: "Request duration in seconds",
Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
},
[]string{"method", "endpoint"},
)
activeUsers = prometheus.NewGauge(
prometheus.GaugeOpts{
Name: "app_active_users",
Help: "Number of active users",
},
)
databaseConnections = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "app_database_connections",
Help: "Number of database connections",
},
[]string{"pool", "state"},
)
)
func init() {
prometheus.MustRegister(requestsTotal)
prometheus.MustRegister(requestDuration)
prometheus.MustRegister(activeUsers)
prometheus.MustRegister(databaseConnections)
}
func simulateMetrics() {
ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()
for range ticker.C {
// Simulate requests
methods := []string{"GET", "POST", "PUT", "DELETE"}
endpoints := []string{"/api/users", "/api/products", "/api/orders"}
statuses := []string{"200", "200", "200", "400", "500"}
method := methods[rand.Intn(len(methods))]
endpoint := endpoints[rand.Intn(len(endpoints))]
status := statuses[rand.Intn(len(statuses))]
requestsTotal.WithLabelValues(method, endpoint, status).Inc()
requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)
// Update gauges
activeUsers.Set(float64(rand.Intn(900) + 100))
databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
}
}
func main() {
go simulateMetrics()
http.Handle("/metrics", promhttp.Handler())
log.Println("Starting metrics server on :8000")
log.Fatal(http.ListenAndServe(":8000", nil))
}
```
### PostgreSQL Exporter
```yaml
# docker-compose.yml for postgres_exporter
version: '3.8'
services:
postgres-exporter:
image: prometheuscommunity/postgres-exporter
environment:
DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
ports:
- "9187:9187"
command:
- '--collector.stat_statements'
- '--collector.stat_database'
- '--collector.replication'
```
**Key PostgreSQL Metrics:**
- `pg_up`: Database reachability
- `pg_stat_database_tup_returned`: Rows read
- `pg_stat_database_tup_inserted`: Rows inserted
- `pg_stat_database_deadlocks`: Deadlock count
- `pg_stat_replication_lag`: Replication lag in seconds
- `pg_locks_count`: Active locks
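These metrics plug straight into the alerting patterns shown earlier. A hedged sketch of alert expressions, assuming your postgres_exporter version exposes the metric names listed above:
```promql
# Database unreachable
pg_up == 0

# Deadlocks occurring
rate(pg_stat_database_deadlocks[5m]) > 0

# Replication lag above 30 seconds
pg_stat_replication_lag > 30
```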
### Blackbox Exporter for Probing
```yaml
# blackbox.yml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200]
method: GET
preferred_ip_protocol: "ip4"
http_post_json:
prober: http
http:
method: POST
headers:
Content-Type: application/json
body: '{"key":"value"}'
valid_status_codes: [200, 201]
tcp_connect:
prober: tcp
timeout: 5s
icmp:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
```
```yaml
# Prometheus config for blackbox exporter
scrape_configs:
- job_name: 'blackbox-http'
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://example.com
- https://api.example.com/health
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
```
## Best Practices
### Metric Naming Conventions
Follow Prometheus naming best practices:
```
# Format: <namespace>_<subsystem>_<metric>_<unit>
# Good examples
http_requests_total # Counter
http_request_duration_seconds # Histogram
database_connections_active # Gauge
cache_hits_total # Counter
memory_usage_bytes # Gauge
# Include unit suffixes
_seconds, _bytes, _total, _ratio, _percentage
# Avoid
RequestCount # Use snake_case
http_requests # Missing _total for counter
request_time # Missing unit (should be _seconds)
```
### Label Guidelines
```promql
# Good: Low cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}
# Bad: High cardinality labels (avoid)
http_requests_total{user_id="12345", session_id="abc-def-ghi"}
# Good: Use bounded label values
http_requests_total{status_class="2xx"}
# Bad: Unbounded label values
http_requests_total{response_size="1234567"}
```
### Recording Rule Patterns
```yaml
groups:
- name: performance_rules
interval: 30s
rules:
# Pre-aggregate expensive queries
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
# Namespace aggregations
- record: namespace:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (namespace)
# SLI calculations
- record: job:http_requests_success:rate5m
expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)
- record: job:http_requests_error_rate:ratio
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
/
sum(rate(http_requests_total[5m])) by (job)
```
### Alert Design Principles
1. **Alert on symptoms, not causes**: Alert on user-facing issues
2. **Make alerts actionable**: Include runbook links
3. **Use appropriate severity levels**: Critical, warning, info
4. **Set proper thresholds**: Based on historical data
5. **Include context in annotations**: Help on-call engineers
6. **Group related alerts**: Reduce alert fatigue
7. **Use inhibition rules**: Suppress redundant alerts
8. **Test alert rules**: Verify they fire when expected (see the `promtool` sketch below)
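Alert rules can be unit-tested with `promtool test rules`. A sketch for the `ServiceDown` rule defined earlier (file paths are illustrative):
```yaml
# tests/api_alerts_test.yml -- run with: promtool test rules tests/api_alerts_test.yml
rule_files:
  - ../alerts/api_alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # api-1 reports down (up == 0) for the whole test window
      - series: 'up{job="api", instance="api-1:8080"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 5m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              team: sre
              job: api
              instance: api-1:8080
            exp_annotations:
              summary: "Service api-1:8080 is down"
              description: "api on api-1:8080 has been down for more than 2 minutes"
```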
### Dashboard Best Practices
1. **One dashboard per audience**: SRE, developers, business
2. **Use consistent time ranges**: Make comparisons easier
3. **Include SLI/SLO metrics**: Show business impact
4. **Add annotations for deploys**: Correlate changes with metrics
5. **Use template variables**: Make dashboards reusable
6. **Show trends and aggregates**: Not just raw metrics
7. **Include links to runbooks**: Enable quick response
8. **Use appropriate visualizations**: Graphs, gauges, tables
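One way to keep dashboards consistent across environments is to provision them as code rather than editing them in the UI. A minimal sketch of Grafana file-based dashboard provisioning (paths and folder names are illustrative):
```yaml
# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'default'
    folder: 'Production'
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards   # dashboard JSON files live here
```
Data sources can be provisioned the same way under `/etc/grafana/provisioning/datasources/`, which pairs well with template variables for reusable dashboards.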
### High Availability Setup
```yaml
# Prometheus HA with Thanos
# Deploy multiple Prometheus instances with same config
# Use Thanos to deduplicate and provide global view
# prometheus-1.yml
global:
external_labels:
cluster: 'prod'
replica: '1'
# prometheus-2.yml
global:
external_labels:
cluster: 'prod'
replica: '2'
# Thanos sidecar configuration
# Uploads blocks to object storage
# Provides StoreAPI for querying
```
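The comments above gloss over the sidecar invocation itself. As a hedged sketch, typical Thanos sidecar flags look like this (the object-storage config path is an assumption):
```yaml
# Container args for the Thanos sidecar running next to each Prometheus replica
args:
  - sidecar
  - --prometheus.url=http://localhost:9090
  - --tsdb.path=/prometheus
  - --objstore.config-file=/etc/thanos/objstore.yml   # bucket credentials live here
  - --grpc-address=0.0.0.0:10901                      # StoreAPI endpoint for Thanos Query
```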
### Capacity Planning Queries
```promql
# Disk space exhaustion prediction
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0
# Memory growth trend
predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)
# Request rate growth
predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)
# Approximate average storage growth in bytes/second (assuming 30-day retention)
prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
```
## Advanced Patterns
### Federation for Multi-Cluster Monitoring
```yaml
# Global Prometheus federating from cluster Prometheus instances
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="prometheus"}'
- '{__name__=~"job:.*"}' # All recording rules
static_configs:
- targets:
- 'prometheus-us-west:9090'
- 'prometheus-us-east:9090'
- 'prometheus-eu-central:9090'
```
### Cost Monitoring Pattern
```yaml
# Track cloud costs with custom metrics
groups:
- name: cost_tracking
rules:
- record: cloud:cost:hourly_rate
expr: |
(
sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03 # CPU cost/hour
+
sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005 # Memory cost/hour
)
- record: cloud:cost:monthly_estimate
expr: cloud:cost:hourly_rate * 730 # Hours in average month
```
### Custom SLO Implementation
```yaml
# SLO: 99.9% availability for API
groups:
- name: api_slo
interval: 30s
rules:
# Success rate SLI
- record: api:sli:success_rate
expr: |
sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
/
sum(rate(http_requests_total{job="api"}[5m]))
# Error budget remaining (30 days)
- record: api:error_budget:remaining
expr: |
1 - (
(1 - api:sli:success_rate)
/
(1 - 0.999)
)
# Latency SLI: 1 if p99 < 500ms, else 0 (the bool modifier keeps the series continuous)
- record: api:sli:latency_success_rate
expr: |
(
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
            ) < bool 0.5
)
```
## Examples Summary
This skill includes 20+ comprehensive examples covering:
1. Prometheus configuration (basic, Kubernetes SD, storage)
2. PromQL queries (instant vectors, range vectors, aggregations)
3. Mathematical operations and advanced patterns
4. Alert rule definitions (error rate, latency, resource usage)
5. Alertmanager configuration (routing, receivers, inhibition)
6. Multi-window multi-burn-rate SLO alerts
7. Grafana dashboard JSON (full dashboard, RED method, USE method)
8. Custom exporters (Python, Go)
9. Third-party exporters (PostgreSQL, Blackbox)
10. Recording rules for performance
11. Federation for multi-cluster monitoring
12. Cost monitoring and SLO implementation
13. High availability patterns
14. Capacity planning queries
---
**Skill Version**: 1.0.0
**Last Updated**: October 2025
**Skill Category**: Observability, Monitoring, SRE, DevOps
**Compatible With**: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes
This skill provides a practical, production-ready guide to observability and monitoring using Prometheus, Grafana, exporters, PromQL and Alertmanager. It focuses on metrics collection, time-series analysis, alerting, dashboards, and SLO-driven reliability for cloud-native and distributed systems. Use it to design, deploy, and operate a scalable monitoring stack with actionable alerts and efficient queries.
The skill explains Prometheus architecture (scraping, TSDB, remote write/read, Alertmanager) and how exporters and service discovery feed metrics into Prometheus. It teaches PromQL patterns, recording rules, and alert rule design so you can pre-compute heavy queries, reduce cardinality, and route alerts. It also covers SLOs/SLIs, federation, remote storage, and Grafana dashboards for visualization and incident response.
## FAQ

**How do I reduce metric cardinality?**
Remove high-cardinality label values (user IDs, request IDs), normalize labels via `relabel_configs`, and use recording rules to pre-aggregate where possible.

**When should I use `remote_write` vs. local retention?**
Keep recent, high-resolution data locally for fast queries and alerting; use `remote_write` to offload long-term storage, compliance retention, and cross-cluster historical analysis.