
observability-prometheus-configurator skill


This skill configures Prometheus with alerting, recording rules, service discovery, federation, and PromQL optimization to improve monitoring reliability.

npx playbooks add skill williamzujkowski/cognitive-toolworks --skill observability-prometheus-configurator

---
name: Prometheus Configuration Specialist
slug: observability-prometheus-configurator
description: Configure Prometheus with alerting, recording rules, service discovery (K8s, Consul, EC2), federation, PromQL optimization, and Alertmanager.
capabilities:
  - Prometheus scrape configuration with service discovery
  - Alerting rules with multi-window burn rate patterns
  - Recording rules for pre-computing expensive queries
  - Relabeling for metric filtering and label transformation
  - Federation for multi-DC and cross-service monitoring
  - PromQL query optimization and cardinality management
  - Alertmanager routing and notification configuration
  - Prometheus 3.0+ features (UTF-8, OTLP, Remote Write 2.0)
inputs:
  - Service topology and scrape targets
  - Service discovery mechanism (Kubernetes, Consul, EC2, file_sd)
  - Alert definitions with severity levels
  - Recording rule requirements
  - Alertmanager notification channels (PagerDuty, Slack, email)
  - Federation topology (if multi-DC or cross-service)
  - Cardinality constraints and retention requirements
outputs:
  - prometheus.yml configuration file
  - Alerting rules YAML files
  - Recording rules YAML files
  - Alertmanager configuration
  - Relabeling strategies for cardinality management
  - PromQL query optimization recommendations
  - Federation endpoint configuration
  - Service discovery relabel configs
keywords:
  - prometheus
  - monitoring
  - observability
  - alerting
  - recording-rules
  - service-discovery
  - kubernetes-sd
  - promql
  - federation
  - alertmanager
  - metrics
  - relabeling
  - cardinality
  - burn-rate
  - slo
version: "1.0.0"
owner: cognitive-toolworks
license: MIT
security: "No sensitive data allowed in metric labels. Use relabeling to drop secrets. Avoid high-cardinality labels (user IDs, request IDs)."
links:
  - title: "Prometheus 3.0 Release (November 2024)"
    url: "https://prometheus.io/blog/2024/11/14/prometheus-3-0/"
    accessed: "2025-10-26"
  - title: "Prometheus Configuration Documentation"
    url: "https://prometheus.io/docs/prometheus/latest/configuration/configuration/"
    accessed: "2025-10-26"
  - title: "Prometheus Alerting Best Practices"
    url: "https://prometheus.io/docs/practices/alerting/"
    accessed: "2025-10-26"
  - title: "Prometheus Recording Rules"
    url: "https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/"
    accessed: "2025-10-26"
  - title: "Prometheus Naming Conventions"
    url: "https://prometheus.io/docs/practices/naming/"
    accessed: "2025-10-26"
---

# Prometheus Configuration Specialist

## Purpose & When-To-Use

**Trigger conditions:**

* You need to configure Prometheus for metrics collection from Kubernetes, Consul, EC2, or static targets
* You need to create alerting rules with burn rate calculations or multi-window patterns
* You need to optimize PromQL queries or reduce cardinality for high-volume metrics
* You need to set up federation for multi-datacenter or cross-service monitoring
* You need to configure Alertmanager routing with grouping, inhibition, or multiple receivers
* You need to pre-compute expensive queries using recording rules

**Complements:**

* `observability-stack-configurator`: For overall observability stack design
* `observability-unified-dashboard`: For Grafana dashboard design with Prometheus datasources
* `observability-slo-calculator`: For SLO/error budget definitions that drive alerting rules

**Out of scope:**

* Application instrumentation (use language-specific Prometheus client libraries)
* Long-term metrics storage (use Thanos, Cortex, or Mimir)
* Log aggregation (use Loki or ELK)
* Distributed tracing (use Tempo, Jaeger, or Zipkin)

---

## Pre-Checks

**Time normalization:**

* Compute `NOW_ET` using NIST/time.gov semantics (America/New_York, ISO-8601)
* Use `NOW_ET` for all access dates in citations

**Verify inputs:**

* ✅ **Required:** At least one scrape target specification (service discovery config or static targets)
* ✅ **Required:** Prometheus version specified (recommend 3.0+ for UTF-8, OTLP, Remote Write 2.0)
* ⚠️ **Optional:** Alert definitions (if alerting is needed)
* ⚠️ **Optional:** Recording rule definitions (if query optimization is needed)
* ⚠️ **Optional:** Alertmanager receivers (PagerDuty, Slack, email, webhook)
* ⚠️ **Optional:** Federation topology (if multi-DC or cross-service monitoring is required)

**Validate service discovery:**

* If `kubernetes_sd_config`: Verify Kubernetes API access and RBAC permissions
* If `consul_sd_config`: Verify Consul agent accessibility and service catalog
* If `ec2_sd_config`: Verify AWS credentials and EC2 instance tags
* If `file_sd_config`: Verify JSON/YAML file path and refresh interval
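
Where `kubernetes_sd_configs` is used, Prometheus needs read access to the objects for its `role`. A minimal RBAC sketch, assuming a `prometheus` ServiceAccount in a `monitoring` namespace (both names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  # Objects Prometheus must get/list/watch for pod, service, and endpoints discovery
  - apiGroups: [""]
    resources: ["nodes", "nodes/metrics", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus      # placeholder ServiceAccount
    namespace: monitoring # placeholder namespace
```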

**Check cardinality constraints:**

* Every unique combination of label key-value pairs creates a new time series
* High-cardinality labels (user IDs, request IDs, timestamps) cause memory/storage issues
* Recommended: <10M active time series per Prometheus instance
* Use `metric_relabel_configs` to drop high-cardinality labels

**Source freshness:**

* Prometheus 3.0 released November 14, 2024 (accessed `NOW_ET`)
* Prometheus 3.5 (upcoming LTS release, 2025)
* Alerting best practices and recording rule conventions stable across versions

**Abort if:**

* No scrape targets specified → **EMIT TODO:** "Specify at least one scrape target (kubernetes_sd, consul_sd, ec2_sd, static_configs)"
* Service discovery config incomplete → **EMIT TODO:** "Provide complete service discovery configuration (API endpoints, credentials, filters)"
* Alert definitions lack severity or description → **EMIT TODO:** "Add severity label and description annotation to all alerts"

---

## Procedure

### T1: Basic Prometheus Setup (≤2k tokens, 80% use case)

**Scenario:** Single service with static targets or file-based service discovery, basic alerting, no recording rules.

**Steps:**

1. **Global Configuration:**
   * Set `scrape_interval: 15s` (balance between data freshness and storage)
   * Set `evaluation_interval: 15s` (how often to evaluate alerting/recording rules)
   * Set `external_labels` for federation or remote write (e.g., `datacenter: us-east-1`)

2. **Scrape Configuration:**
   * Define `job_name` (logical grouping, e.g., `api-service`, `postgres-exporter`)
   * Choose service discovery:
     * **Static:** `static_configs` with `targets: ['localhost:9090']`
     * **File-based:** `file_sd_configs` with `files: ['/etc/prometheus/targets/*.json']`
   * Set `scrape_interval` override if different from global

3. **Basic Alerting Rules:**
   * Create `alerts.yml` with groups
   * Alert on symptoms (high latency, error rate) not causes (CPU, disk)
   * Include severity label (`severity: critical|warning|info`)
   * Add description and summary annotations

4. **Alertmanager Integration:**
   * Configure the `alerting.alertmanagers` section with `static_configs` pointing to the Alertmanager instance
   * Set `send_resolved: true` on the notification receivers in `alertmanager.yml` so resolved alerts also trigger notifications (a minimal end-to-end sketch follows below)
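
A minimal T1 sketch tying these steps together; the target addresses, Alertmanager host, and `datacenter` label are placeholders:

```yaml
# prometheus.yml (T1 sketch; all hostnames and ports are placeholders)
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    datacenter: us-east-1

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts.yml'

scrape_configs:
  - job_name: 'api-service'
    static_configs:
      - targets: ['api-service:8080']
```

The referenced `alerts.yml` follows the schema in the Output Contract section.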

**Output:**

* `prometheus.yml` with global config, single scrape job, alerting rules file reference
* `alerts.yml` with 2-5 basic alerts
* No recording rules (kept out of T1 for simplicity)

**Token budget:** ≤2000 tokens

---

### T2: Multi-Service Discovery + Recording Rules (≤6k tokens)

**Scenario:** Multiple services with Kubernetes/Consul/EC2 service discovery, recording rules for expensive queries, Alertmanager routing with grouping.

**Steps:**

1. **Service Discovery Configuration:**

   **Kubernetes Service Discovery:**
   * Use `kubernetes_sd_configs` with `role: pod` (discovers every pod; the relabeling below keeps only pods with the `prometheus.io/scrape: "true"` annotation)
   * Relabeling pattern:
     ```yaml
     relabel_configs:
       - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
         action: keep
         regex: true
       - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
         action: replace
         target_label: __metrics_path__
         regex: (.+)
       - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
         action: replace
         target_label: __address__
         regex: ([^:]+)(?::\d+)?;(\d+)
         replacement: $1:$2
       - source_labels: [__meta_kubernetes_namespace]
         action: replace
         target_label: kubernetes_namespace
       - source_labels: [__meta_kubernetes_pod_name]
         action: replace
         target_label: kubernetes_pod_name
     ```
   * Supported roles: `node`, `pod`, `service`, `endpoints`, `ingress` (accessed `NOW_ET`: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config)

   **Consul Service Discovery:**
   * Use `consul_sd_configs` with `server: 'consul.service.consul:8500'`
   * Filter by service tags: `tags: ['production', 'monitoring-enabled']`
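   * A minimal `consul_sd_configs` sketch reusing the server address and tags above (the `service` label mapping is illustrative):
     ```yaml
     scrape_configs:
       - job_name: 'consul-services'
         consul_sd_configs:
           - server: 'consul.service.consul:8500'
             tags: ['production', 'monitoring-enabled']  # services must carry all listed tags
         relabel_configs:
           # Copy the Consul service name into a stable label
           - source_labels: [__meta_consul_service]
             target_label: service
     ```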

   **EC2 Service Discovery:**
   * Use `ec2_sd_configs` with AWS region and filters
   * Relabel based on EC2 tags: `__meta_ec2_tag_<tagkey>`
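   * A minimal `ec2_sd_configs` sketch; the region, tag filter, and exporter port 9100 are placeholders:
     ```yaml
     scrape_configs:
       - job_name: 'ec2-nodes'
         ec2_sd_configs:
           - region: us-east-1
             port: 9100                  # placeholder exporter port
             filters:
               - name: tag:Environment   # placeholder tag filter
                 values: [production]
         relabel_configs:
           # Surface the EC2 Name tag as an instance_name label
           - source_labels: [__meta_ec2_tag_Name]
             target_label: instance_name
     ```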

2. **Recording Rules:**

   * **Naming convention:** `level:metric:operations` (accessed `NOW_ET`: https://prometheus.io/docs/practices/naming/)
   * **Level:** Aggregation level (`job`, `instance`, `cluster`)
   * **Metric:** Base metric name
   * **Operations:** Aggregation operations (`sum`, `avg`, `rate`)
   * **Example:**
     ```yaml
     groups:
       - name: api_recording_rules
         interval: 30s
         rules:
           - record: job:http_requests_total:rate5m
             expr: sum(rate(http_requests_total[5m])) by (job)
           - record: job:http_request_duration_seconds:p95
             expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
     ```
   * **Use cases:** Pre-compute dashboard queries, optimize slow PromQL queries, aggregate high-cardinality metrics

3. **Relabeling Strategies:**

   **Metric Relabeling (`metric_relabel_configs`):**
   * Drop high-cardinality labels:
     ```yaml
     metric_relabel_configs:
       # labeldrop matches the regex against label *names*; source_labels is not used
       - action: labeldrop
         regex: user_id
       # Drop whole metrics matching an expensive name pattern
       - source_labels: [__name__]
         action: drop
         regex: 'expensive_metric_.*'
     ```

   **Target Relabeling (`relabel_configs`):**
   * Modify labels before scraping (transform service discovery metadata)

4. **Alerting Rules (Advanced):**

   **Multi-Window Burn Rate Alerts:**
   * Detect fast SLO burn (error budget exhausted in days instead of weeks)
   * Pair a long window (e.g., 1h) with a short window (e.g., 5m) so the alert fires quickly and clears quickly once the burn stops
   * Example: 14.4× burn rate (exhausts a 30-day budget in about 2 days) for critical, 6× for warning
   * **Pattern:**
     ```yaml
     groups:
       - name: slo_alerts
         rules:
           - alert: ErrorBudgetBurn_Critical
             expr: |
                (
                  sum by (job) (rate(http_requests_total{status=~"5.."}[1h]))
                  /
                  sum by (job) (rate(http_requests_total[1h]))
                ) > (14.4 * 0.001)
                and
                (
                  sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
                  /
                  sum by (job) (rate(http_requests_total[5m]))
                ) > (14.4 * 0.001)
             for: 2m
             labels:
               severity: critical
             annotations:
               summary: "Error budget burning 14.4× faster than allowed"
               description: "{{ $labels.job }} has {{ $value | humanizePercentage }} error rate (SLO: 99.9%, budget exhausted in 2 days)"
     ```

   **Symptom-Based Alerts:**
   * Alert on latency, error rate, saturation (not CPU/memory directly)
   * **Golden Signals:** Latency, Traffic, Errors, Saturation
   * Example:
     ```yaml
     - alert: HighLatency
       expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le)) > 0.5
       for: 5m
       labels:
         severity: warning
       annotations:
         summary: "High latency on {{ $labels.job }}"
         description: "p95 latency is {{ $value }}s (threshold: 0.5s)"
     ```

5. **Alertmanager Routing:**

   * **Routing tree:** Group alerts by `cluster` + `alertname`, wait 30s for batch
   * **Receivers:** PagerDuty (critical), Slack (warning/info), email (all)
   * **Inhibition:** Suppress lower-severity alerts when higher-severity alerts are firing
   * **Example:**
     ```yaml
     route:
       receiver: 'default-email'
       group_by: ['cluster', 'alertname']
       group_wait: 30s
       group_interval: 5m
       repeat_interval: 4h
       routes:
         - match:
             severity: critical
           receiver: 'pagerduty'
         - match:
             severity: warning
           receiver: 'slack'

     receivers:
       - name: 'pagerduty'
         pagerduty_configs:
           - service_key: '<PD_SERVICE_KEY>'
       - name: 'slack'
         slack_configs:
           - api_url: '<SLACK_WEBHOOK_URL>'
             channel: '#alerts'
             title: '{{ .GroupLabels.alertname }}'
             text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
       - name: 'default-email'
         email_configs:
           - to: '[email protected]'

     inhibit_rules:
       - source_match:
           severity: critical
         target_match:
           severity: warning
         equal: ['cluster', 'alertname']
     ```

**Output:**

* `prometheus.yml` with Kubernetes/Consul/EC2 service discovery, relabeling configs
* `recording_rules.yml` with 5-10 recording rules (level:metric:operations naming)
* `alerts.yml` with multi-window burn rate alerts and symptom-based alerts
* `alertmanager.yml` with routing tree, receivers, inhibition rules

**Token budget:** ≤6000 tokens

---

### T3: Enterprise Federation + PromQL Optimization (≤12k tokens)

**Scenario:** Multi-datacenter federation, cardinality management, PromQL query optimization, Prometheus 3.0+ features (UTF-8, OTLP, Remote Write 2.0).

**Steps:**

1. **Federation Configuration:**

   **Hierarchical Federation (Multi-DC):**
   * **Pattern:** Per-datacenter Prometheus servers scrape local services, global Prometheus server federates aggregated metrics
   * **Benefits:** Scales to tens of datacenters and millions of nodes (accessed `NOW_ET`: https://prometheus.io/docs/prometheus/latest/federation/)
   * **Global server config:**
     ```yaml
     scrape_configs:
       - job_name: 'federate-us-east-1'
         scrape_interval: 30s
         honor_labels: true
         metrics_path: '/federate'
         params:
           'match[]':
             - '{job="prometheus"}'
             - '{__name__=~"job:.*"}'  # Only federate aggregated recording rules
         static_configs:
           - targets:
             - 'prometheus-us-east-1:9090'
             - 'prometheus-us-west-2:9090'
     ```

   **Cross-Service Federation:**
   * **Pattern:** Service A Prometheus federates metrics from Service B Prometheus to correlate cross-service metrics
   * **Use case:** Cluster scheduler federating resource usage from multiple service Prometheus servers
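   * A sketch of one service's Prometheus federating another's pre-aggregated series; the peer hostname is a placeholder:
     ```yaml
     scrape_configs:
       - job_name: 'federate-service-b'
         honor_labels: true
         metrics_path: '/federate'
         params:
           'match[]':
             - '{__name__=~"job:.*"}'  # only pre-aggregated recording rules
         static_configs:
           - targets: ['prometheus-service-b:9090']  # placeholder peer Prometheus
     ```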

2. **PromQL Optimization:**

   **Query Performance Best Practices:**
   * **Filter early:** Use label matchers to narrow time series before aggregation
     * ❌ **Slow:** `sum(http_requests_total)` (aggregates 10k+ time series)
     * ✅ **Fast:** `sum(http_requests_total{job="api-service", status=~"5.."})` (aggregates 10-50 time series)
   * **Avoid broad selectors:** Never use bare metric names (`api_http_requests_total`) without labels
   * **Use recording rules:** Pre-compute expensive queries (accessed `NOW_ET`: https://prometheus.io/docs/prometheus/latest/querying/basics/)
   * **Limit time ranges:** Avoid queries over >24h without recording rules
   * **Example optimized query:**
     ```promql
     # Compute error rate using pre-recorded job-level metrics (fast)
     job:http_requests_total:rate5m{job="api-service", status=~"5.."}
     /
     job:http_requests_total:rate5m{job="api-service"}
     ```

   **Cardinality Management:**
   * **Problem:** High-cardinality labels (user IDs, request IDs) create millions of time series → memory/disk explosion
   * **Detection:** Query `topk(10, count by (__name__)({__name__=~".+"}))` to find high-cardinality metrics
   * **Solutions:**
     1. **Drop labels:** Use `metric_relabel_configs` with `labeldrop` to remove high-cardinality labels
     2. **Aggregate:** Use recording rules to pre-aggregate high-cardinality metrics
     3. **Drop or sample:** Use `action: drop` to stop ingesting non-essential metrics, or `hashmod` + `keep` to retain a deterministic fraction of series
   * **Example cardinality reduction:**
     ```yaml
     metric_relabel_configs:
       # Drop the high-cardinality user_id label (labeldrop matches label names)
       - action: labeldrop
         regex: user_id
       # Keep only series whose status is 5xx (all other series for this job are dropped)
       - source_labels: [status]
         action: keep
         regex: '5..'
     ```

3. **Prometheus 3.0+ Features:**

   **UTF-8 Support (Prometheus 3.0+):**
   * **Feature:** Allows all valid UTF-8 characters in metric and label names (accessed `NOW_ET`: https://prometheus.io/blog/2024/11/14/prometheus-3-0/)
   * **Example:** metric and label *names* may now contain dots or non-Latin characters; label values were already UTF-8 before 3.0 (see the quoted-selector sketch below)
   * **Migration:** UTF-8 name support is enabled by default in Prometheus 3.0
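   * A hedged PromQL sketch of the quoted-selector syntax for UTF-8 names (the metric and label names are illustrative):
     ```promql
     # Quote a dotted metric name inside the braces; quote label names that need it
     {"http.requests.total", "service.name"="checkout"}
     ```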

   **OpenTelemetry OTLP Receiver (Prometheus 3.0+):**
   * **Feature:** Prometheus can receive OTLP metrics natively
   * **Endpoint:** `/api/v1/otlp/v1/metrics`
   * **Configuration:** enable the receiver with the `--web.enable-otlp-receiver` flag; it is served on the normal Prometheus listen port. Resource-attribute handling is tuned in `prometheus.yml`:
     ```yaml
     otlp:
       # Copy selected OTel resource attributes into metric labels
       promote_resource_attributes:
         - service.name
         - service.namespace
     ```
   * **Use case:** Consolidate Prometheus and OpenTelemetry pipelines

   **Remote Write 2.0 (Prometheus 3.0+):**
   * **Feature:** Native support for metadata, exemplars, created timestamps, native histograms
   * **Benefits:** Better interoperability with long-term storage (Thanos, Cortex, Mimir)
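   * A hedged `remote_write` sketch opting into the 2.0 protobuf message; the endpoint URL is a placeholder:
     ```yaml
     remote_write:
       - url: 'https://mimir.example.com/api/v1/push'     # placeholder long-term storage endpoint
         protobuf_message: io.prometheus.write.v2.Request # Remote Write 2.0 (default: prometheus.WriteRequest)
         send_exemplars: true
     ```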

4. **Advanced Relabeling Patterns:**

   **Extract Kubernetes Annotations into Labels:**
   ```yaml
   relabel_configs:
     - source_labels: [__meta_kubernetes_pod_annotation_app_version]
       action: replace
       target_label: version
     - source_labels: [__meta_kubernetes_pod_annotation_team]
       action: replace
       target_label: team
   ```

   **Drop Expensive Metrics Based on Name Pattern:**
   ```yaml
   metric_relabel_configs:
     - source_labels: [__name__]
       action: drop
       regex: 'go_.*|process_.*'  # Drop Go runtime metrics to save storage
   ```

5. **Recording Rules for Aggregation:**

   **Multi-Level Aggregation:**
   ```yaml
   groups:
     - name: instance_aggregation
       interval: 30s
       rules:
         # Level 1: Instance-level
         - record: instance:http_requests_total:rate5m
           expr: sum(rate(http_requests_total[5m])) by (instance, job, status)

         # Level 2: Job-level (aggregates Level 1)
         - record: job:http_requests_total:rate5m
           expr: sum(instance:http_requests_total:rate5m) by (job, status)

         # Level 3: Cluster-level (aggregates Level 2)
         - record: cluster:http_requests_total:rate5m
           expr: sum(job:http_requests_total:rate5m) by (status)
   ```

6. **Alertmanager Advanced Features:**

   **Time-Based Routing (Mute Alerts During Maintenance):**
   ```yaml
   route:
     routes:
       - match:
           severity: warning
         mute_time_intervals:
           - weekends
           - maintenance_window

   mute_time_intervals:
     - name: weekends
       time_intervals:
         - weekdays: ['saturday', 'sunday']
     - name: maintenance_window
       time_intervals:
         - times:
           - start_time: '23:00'
             end_time: '01:00'
   ```

   **Grouping by Multiple Labels:**
   ```yaml
   route:
     group_by: ['cluster', 'namespace', 'alertname']
     group_wait: 30s
     group_interval: 5m
     repeat_interval: 12h
   ```

**Output:**

* `prometheus.yml` with federation endpoints, OTLP receiver, Remote Write 2.0
* Multi-level recording rules (instance → job → cluster aggregation)
* Cardinality management relabeling configs
* PromQL optimization recommendations with query examples
* Alertmanager advanced routing (time-based muting, multi-label grouping)

**Token budget:** ≤12000 tokens

---

## Decision Rules

**When to use federation vs remote write:**

* **Federation:** Multi-DC with global aggregation, <10 Prometheus servers
* **Remote Write:** Long-term storage, >10 Prometheus servers, different retention policies

**When to create recording rules:**

* Query execution time >5s on Grafana dashboard
* Query used in multiple dashboards or alerts
* High-cardinality metric needs pre-aggregation (e.g., >100k time series)

**Alert severity assignment:**

* **Critical:** User-impacting outage, page on-call engineer immediately (e.g., API error rate >5%)
* **Warning:** Potential issue, notify Slack, no page (e.g., API latency p95 >500ms)
* **Info:** FYI notification, email only (e.g., deployment completed)

**Service discovery selection:**

* **Kubernetes:** Use `kubernetes_sd_configs` with `role: pod` for dynamic pod discovery
* **Consul:** Use `consul_sd_configs` for VM-based infrastructure with Consul service catalog
* **EC2:** Use `ec2_sd_configs` for AWS instances with consistent tagging
* **File-based:** Use `file_sd_configs` for static infrastructure or external service discovery
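
A minimal `file_sd_configs` sketch; the file path and target addresses are placeholders, and Prometheus re-reads the file on change:

```yaml
scrape_configs:
  - job_name: 'file-discovered'
    file_sd_configs:
      - files: ['/etc/prometheus/targets/*.json']
        refresh_interval: 5m

# Example /etc/prometheus/targets/api.json:
# [{"targets": ["api-1:8080", "api-2:8080"], "labels": {"env": "production"}}]
```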

**Cardinality limits:**

* **Target:** <10M active time series per Prometheus instance
* **Alert:** If `prometheus_tsdb_symbol_table_size_bytes` >1GB or `prometheus_tsdb_head_series` >10M
* **Action:** Drop high-cardinality labels or aggregate with recording rules
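
A hedged alert sketch for the series target above (the threshold and `for` duration are placeholders):

```yaml
- alert: HighActiveSeries
  expr: prometheus_tsdb_head_series > 10000000
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Prometheus {{ $labels.instance }} approaching series limit"
    description: "{{ $value }} active series (target: <10M); drop labels or add recording rules"
```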

**Abort conditions:**

* Prometheus memory usage >80% of available → reduce cardinality or add recording rules
* Scrape duration >scrape interval → increase interval or optimize exporters
* Alert fatigue (>50 alerts firing) → review alert thresholds and use inhibition rules

---

## Output Contract

**prometheus.yml schema:**

```yaml
global:
  scrape_interval: <duration>
  evaluation_interval: <duration>
  external_labels:
    <label_name>: <label_value>

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['<alertmanager_host>:<port>']

rule_files:
  - 'alerts.yml'
  - 'recording_rules.yml'

scrape_configs:
  - job_name: '<job_name>'
    kubernetes_sd_configs: [...]  # OR consul_sd_configs, ec2_sd_configs, static_configs
    relabel_configs: [...]
    metric_relabel_configs: [...]
```

**alerts.yml schema:**

```yaml
groups:
  - name: <group_name>
    rules:
      - alert: <alert_name>
        expr: <promql_expression>
        for: <duration>
        labels:
          severity: critical|warning|info
        annotations:
          summary: <short_description>
          description: <detailed_description_with_templating>
```

**recording_rules.yml schema:**

```yaml
groups:
  - name: <group_name>
    interval: <duration>
    rules:
      - record: <level>:<metric>:<operations>
        expr: <promql_expression>
        labels:
          <label_name>: <label_value>
```

**alertmanager.yml schema:**

```yaml
route:
  receiver: <default_receiver>
  group_by: [<label_name>, ...]
  group_wait: <duration>
  group_interval: <duration>
  repeat_interval: <duration>
  routes:
    - match:
        <label_name>: <label_value>
      receiver: <receiver_name>

receivers:
  - name: <receiver_name>
    pagerduty_configs: [...]
    slack_configs: [...]
    email_configs: [...]

inhibit_rules:
  - source_match:
      <label_name>: <label_value>
    target_match:
      <label_name>: <label_value>
    equal: [<label_name>, ...]
```

**Required fields:**

* `prometheus.yml`: `global.scrape_interval`, `scrape_configs[].job_name`
* `alerts.yml`: `alert`, `expr`, `labels.severity`, `annotations.summary`
* `recording_rules.yml`: `record`, `expr`
* `alertmanager.yml`: `route.receiver`, `receivers[].name`

**Validation:**

* All PromQL expressions syntactically valid: `promtool check rules <file.yml>`
* Prometheus config valid: `promtool check config prometheus.yml`
* Alertmanager config valid: `amtool check-config alertmanager.yml`
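
A minimal validation sketch (file names are placeholders) suitable for a pre-commit hook or CI step:

```bash
# Validate the main config (also parses the referenced rule_files)
promtool check config prometheus.yml

# Validate rule files individually
promtool check rules alerts.yml recording_rules.yml

# Validate the Alertmanager config
amtool check-config alertmanager.yml
```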

---

## Examples

### Example 1: Kubernetes Service Discovery with Recording Rules

**Scenario:** Scrape all pods with `prometheus.io/scrape: "true"` annotation, create recording rules for API latency.

**prometheus.yml:**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```

**recording_rules.yml:**

```yaml
groups:
  - name: api_latency
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
      - record: job:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
```
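
An alert can then reference the pre-computed series directly; a sketch assuming the `api-service` job name and a 0.5s threshold (both placeholders):

```yaml
groups:
  - name: api_latency_alerts
    rules:
      - alert: HighP95Latency
        expr: job:http_request_duration_seconds:p95{job="api-service"} > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency on {{ $labels.job }}"
          description: "p95 latency is {{ $value }}s (threshold: 0.5s)"
```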

---

## Quality Gates

**Token budgets:**

* **T1:** ≤2000 tokens (basic scrape + alerting)
* **T2:** ≤6000 tokens (service discovery + recording rules + Alertmanager routing)
* **T3:** ≤12000 tokens (federation + PromQL optimization + cardinality management)

**Safety:**

* ❌ **Never:** Include secrets in metric labels (passwords, API keys, tokens)
* ❌ **Never:** Use high-cardinality labels (user IDs, request IDs, UUIDs) without aggregation
* ✅ **Always:** Validate PromQL expressions with `promtool check rules`
* ✅ **Always:** Use `metric_relabel_configs` to drop secrets if accidentally exposed

**Auditability:**

* All Prometheus configs in version control (Git)
* Recording rule naming follows `level:metric:operations` convention
* Alert annotations include `summary` and `description` with templating
* Alertmanager routing documented with receiver purposes

**Determinism:**

* Same scrape targets + same relabeling = same time series
* Recording rules evaluated at fixed intervals (deterministic)
* Alert grouping by `cluster` + `alertname` produces predictable batches

**Performance:**

* Scrape duration <80% of scrape interval (avoid missed scrapes)
* PromQL query execution time <5s (use recording rules if slower)
* Cardinality <10M active time series per Prometheus instance
* Alert evaluation time <1s (use recording rules to pre-aggregate)

---

## Resources

**Official Documentation:**

* Prometheus 3.0 announcement (UTF-8, OTLP, Remote Write 2.0): https://prometheus.io/blog/2024/11/14/prometheus-3-0/ (accessed `NOW_ET`)
* Configuration reference: https://prometheus.io/docs/prometheus/latest/configuration/configuration/ (accessed `NOW_ET`)
* Alerting best practices: https://prometheus.io/docs/practices/alerting/ (accessed `NOW_ET`)
* Recording rules: https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/ (accessed `NOW_ET`)
* Naming conventions: https://prometheus.io/docs/practices/naming/ (accessed `NOW_ET`)
* Federation: https://prometheus.io/docs/prometheus/latest/federation/ (accessed `NOW_ET`)

**Tooling:**

* `promtool`: Validate Prometheus configs and PromQL queries
* `amtool`: Validate Alertmanager configs and manage silences
* Prometheus exporters: Node Exporter, Blackbox Exporter, PostgreSQL Exporter, etc.

**Related Skills:**

* `observability-stack-configurator`: Overall observability stack design
* `observability-unified-dashboard`: Grafana dashboard design with Prometheus datasources
* `observability-slo-calculator`: SLO/error budget definitions for alerting rules
* `kubernetes-manifest-generator`: Kubernetes deployment manifests for Prometheus + Alertmanager

## Overview

This skill configures Prometheus for production-grade metrics collection, alerting, recording rules, federation, and Alertmanager routing. It supports Kubernetes, Consul, EC2, and file-based service discovery and focuses on PromQL optimization and cardinality control. The goal is reliable scraping, efficient queries, and actionable alerting for single-cluster to multi-datacenter deployments.

## How this skill works

The skill inspects provided service discovery specs, Prometheus version, and alert/recording definitions, and emits recommended prometheus.yml, alerts.yml, recording_rules.yml, and alertmanager.yml artifacts. It validates SD credentials/permissions, checks cardinality risks, recommends relabeling/metric_relabel_configs, and derives recording rules to precompute expensive aggregations. It also proposes federation topologies and optimized PromQL patterns to reduce query cost and improve dashboard performance.

## When to use it

* Setting up Prometheus to scrape Kubernetes, Consul, EC2, or static/file targets
* Creating symptom-based alerting and multi-window burn-rate alerts for SLOs
* Adding recording rules to speed up expensive or high-cardinality queries
* Designing federation for multi-datacenter or cross-service aggregation
* Configuring Alertmanager routing, grouping, and inhibition for incident workflows

## Best practices

* Normalize global `scrape_interval` and `evaluation_interval` (15s recommended) and override per-job when needed
* Drop or aggregate high-cardinality labels via `metric_relabel_configs` before they create new time series
* Pre-compute expensive aggregations with recording rules named using the `level:metric:operations` convention
* Filter early in PromQL using label matchers; avoid broad selectors and long time ranges without recording rules
* Group alerts by cluster + alertname and use inhibition rules to suppress noisy lower-severity alerts

## Example use cases

* T1: Single-service setup — static targets, basic alerts, Alertmanager integration, no recording rules
* T2: Multi-service discovery — Kubernetes/Consul/EC2 SD, relabeling, 5–10 recording rules, multi-window burn-rate alerts, Alertmanager routing
* T3: Enterprise — hierarchical federation across datacenters, cardinality reduction, PromQL performance tuning, Remote Write/OTLP integration
* Alertmanager routing example: PagerDuty for critical, Slack for warnings, email fallback with group_wait/group_interval controls

## FAQ

**What inputs are required to generate configurations?**

Provide at least one scrape target specification (kubernetes_sd, consul_sd, ec2_sd, or static_configs) and the Prometheus version. Alert definitions, recording rules, receivers, and federation topology are optional but recommended for advanced outputs.

**How do I prevent high-cardinality storms?**

Identify offending labels, drop or rename them with metric_relabel_configs, pre-aggregate via recording rules, and avoid emitting user-specific IDs/timestamps as labels. Aim for under ~10M active series per Prometheus instance.