
prometheus-monitoring skill

This skill helps you set up Prometheus monitoring with custom metrics, scraping, and alerts to improve observability across services.

npx playbooks add skill aj-geddes/useful-ai-prompts --skill prometheus-monitoring

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
5.7 KB
---
name: prometheus-monitoring
description: Set up Prometheus monitoring for applications with custom metrics, scraping configurations, and service discovery. Use when implementing time-series metrics collection, monitoring applications, or building observability infrastructure.
---

# Prometheus Monitoring

## Overview

Implement comprehensive Prometheus monitoring infrastructure for collecting, storing, and querying time-series metrics from applications and infrastructure.

## When to Use

- Setting up metrics collection
- Creating custom application metrics
- Configuring scraping targets
- Implementing service discovery
- Building monitoring infrastructure

## Instructions

### 1. **Prometheus Configuration**

```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: production

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - '/etc/prometheus/alert_rules.yml'

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'api-service'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: /metrics
    scrape_interval: 10s

  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: 'true'
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: (.+)
        target_label: __metrics_path__
```
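
The `kubernetes-pods` job above keeps only pods that opt in via annotations and rewrites the scrape path from a second annotation. A sketch of a pod spec carrying those annotations (pod name, image, and port are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api-service
  annotations:
    prometheus.io/scrape: "true"   # matched by the 'keep' relabel rule
    prometheus.io/path: "/metrics" # copied into __metrics_path__
spec:
  containers:
    - name: api
      image: example/api-service:latest
      ports:
        - containerPort: 8080
```

Note that `prometheus.io/*` annotations are only a convention; they take effect solely because the relabel_configs above reference them.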

### 2. **Node.js Metrics Implementation**

```javascript
// metrics.js
const promClient = require('prom-client');
const register = new promClient.Registry();

promClient.collectDefaultMetrics({ register });

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register]
});

const requestsTotal = new promClient.Counter({
  name: 'requests_total',
  help: 'Total requests',
  labelNames: ['method', 'route', 'status_code'],
  registers: [register]
});

// Express middleware — register before routes so they are measured
const express = require('express');
const app = express();

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    // Use the matched route template when available to keep label cardinality low
    const route = req.route ? req.route.path : req.path;
    httpRequestDuration
      .labels(req.method, route, res.statusCode)
      .observe(duration);
    requestsTotal
      .labels(req.method, route, res.statusCode)
      .inc();
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics()); // metrics() returns a Promise in prom-client v13+
});

module.exports = { register, httpRequestDuration, requestsTotal };
```

### 3. **Python Prometheus Integration**

```python
from prometheus_client import Counter, Histogram, start_http_server
from flask import Flask, request
import time

app = Flask(__name__)

request_count = Counter('requests_total', 'Total requests', ['method', 'endpoint'])
request_duration = Histogram('request_duration_seconds', 'Request duration', ['method', 'endpoint'])

@app.before_request
def before():
    request.start_time = time.time()

@app.after_request
def after(response):
    duration = time.time() - request.start_time
    request_count.labels(request.method, request.path).inc()
    request_duration.labels(request.method, request.path).observe(duration)
    return response

if __name__ == '__main__':
    start_http_server(8000)  # metrics exposed on http://localhost:8000/metrics
    app.run(port=5000)       # application traffic served on port 5000
```
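
The Flask app serves traffic on port 5000 while `start_http_server(8000)` exposes metrics on a separate port, so Prometheus must scrape port 8000. A sketch of the matching scrape job (the job name is illustrative):

```yaml
scrape_configs:
  - job_name: 'flask-service'
    static_configs:
      - targets: ['localhost:8000']  # the prometheus_client HTTP server, not the Flask port
```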

### 4. **Alert Rules**

```yaml
# /etc/prometheus/alert_rules.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: sum(rate(requests_total{status_code=~"5.."}[5m])) / sum(rate(requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate: {{ $value }}"

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p95 latency: {{ $value }}s"

      - alert: HighMemoryUsage
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory: {{ $value }}"
```
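
Firing alerts are routed and delivered by Alertmanager (the `alerting` block in prometheus.yml points at localhost:9093), which has its own configuration file. A minimal alertmanager.yml sketch; the webhook URL is a placeholder for your real receiver (Slack, PagerDuty, email, etc.):

```yaml
# /etc/alertmanager/alertmanager.yml
route:
  receiver: default
  group_by: ['alertname', 'severity']
  group_wait: 30s
  repeat_interval: 4h

receivers:
  - name: default
    webhook_configs:
      - url: 'http://localhost:5001/alerts'  # placeholder endpoint
```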

### 5. **Docker Compose Setup**

```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

volumes:
  prometheus_data:
```
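
Because prometheus.yml expects an Alertmanager on port 9093, you can run one in the same stack; a sketch of the extra service, assuming an alertmanager.yml sits next to the compose file:

```yaml
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
```

Inside the Compose network, containers reach each other by service name, so targets in prometheus.yml should use `alertmanager:9093` and `node-exporter:9100` rather than `localhost` when everything runs in containers.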

## Best Practices

### ✅ DO
- Use consistent metric naming conventions
- Add comprehensive labels for filtering
- Set appropriate scrape intervals (10-60s)
- Implement retention policies
- Monitor Prometheus itself
- Test alert rules before deployment
- Document metric meanings

### ❌ DON'T
- Add unbounded cardinality labels
- Scrape too frequently (< 10s)
- Ignore metric naming conventions
- Create alerts without runbooks
- Store raw event data in Prometheus
- Use counters for gauge-like values

## Key Prometheus Queries

```promql
rate(requests_total[5m])  # Request rate
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le))  # p95 latency
rate(requests_total{status_code=~"5.."}[5m])  # Error rate
```
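
Expressions like the p95 query are relatively expensive to evaluate on every dashboard refresh; recording rules let Prometheus precompute them on the `evaluation_interval`. A sketch using the conventional level:metric:operation naming (the rule name is illustrative); add the file to `rule_files` in prometheus.yml:

```yaml
# /etc/prometheus/recording_rules.yml
groups:
  - name: latency
    rules:
      - record: job:request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (job, le))
```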

Overview

This skill sets up Prometheus monitoring for applications and infrastructure, including configuration, custom metrics, scraping, service discovery, and alerting. It provides concrete examples for Node.js and Python metric instrumentation, Prometheus scrape configs, alert rules, and a Docker Compose deployment to run Prometheus and node-exporter.

How this skill works

Configure Prometheus with global settings, scrape_configs, and rule_files to collect metrics from static targets and Kubernetes pods using service discovery and relabeling. Instrument applications with Prometheus client libraries (Node.js or Python) to expose /metrics endpoints, then define alerting rules and PromQL queries to detect problems at runtime. Optionally run Prometheus and exporters with Docker Compose and persist TSDB data.

When to use it

  • When implementing time-series metrics collection for applications and infrastructure
  • When you need custom application metrics (histograms, counters, gauges)
  • When configuring scraping targets and service discovery (static or Kubernetes)
  • When building alerting rules and observability pipelines
  • When deploying Prometheus quickly for development or production via Docker Compose

Best practices

  • Follow consistent metric naming and label conventions to simplify queries
  • Limit label cardinality to avoid excessive memory use and series explosion
  • Set scrape intervals between 10s and 60s based on metric criticality
  • Test alert rules and include runbooks for each alert before enabling
  • Monitor Prometheus itself (exporters, TSDB usage, and retention policies)

Example use cases

  • Instrument a Node.js API with request duration histogram and total request counter, then scrape /metrics
  • Expose Flask app metrics via prometheus_client and run an HTTP metrics endpoint on a dedicated port
  • Configure Prometheus to discover Kubernetes pods using annotations and relabeling
  • Write alerts for high error rate, high p95 latency, and low available memory
  • Deploy Prometheus and node-exporter with Docker Compose for a self-contained monitoring stack

FAQ

How do I avoid high cardinality in labels?

Avoid including unique identifiers (user IDs, full URLs, timestamps) as label values; instead use coarse labels like route names and status codes.

What scrape_interval should I choose?

Use 10–60s depending on how fresh the data must be; critical services can use 10s, but balance freshness against Prometheus resource usage.