
monitoring skill


This skill helps you implement and maintain observability with Prometheus, Grafana, the ELK Stack, and distributed tracing across services.

npx playbooks add skill pluginagentmarketplace/custom-plugin-devops --skill monitoring

Review the files below or copy the command above to add this skill to your agents.

Files (7)
SKILL.md
2.6 KB
---
name: monitoring-skill
description: Monitoring and observability with Prometheus, Grafana, ELK Stack, and distributed tracing.
sasmp_version: "1.3.0"
bonded_agent: 06-monitoring-observability
bond_type: PRIMARY_BOND

parameters:
  - name: pillar
    type: string
    required: false
    enum: ["metrics", "logs", "traces", "all"]
    default: "all"
  - name: tool
    type: string
    required: false
    enum: ["prometheus", "grafana", "elk", "jaeger"]
    default: "prometheus"

retry_config:
  strategy: exponential_backoff
  initial_delay_ms: 1000
  max_retries: 3

observability:
  logging: structured
  metrics: enabled
---

# Monitoring & Observability Skill

## Overview
Master the three pillars of observability: metrics, logs, and traces.

## Parameters
| Name | Type | Required | Default | Description |
|------|------|----------|---------|-------------|
| pillar | string | No | all | Observability pillar to focus on: metrics, logs, traces, or all |
| tool | string | No | prometheus | Primary tool to focus on |

## Core Topics

### MANDATORY
- Prometheus metrics and PromQL
- Grafana dashboards
- ELK Stack basics
- SLIs, SLOs, error budgets
- Alerting rules
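
Alerting rules like those above live in a Prometheus rules file. A minimal sketch of a high-error-rate alert (the thresholds, group name, and metric labels are illustrative assumptions, not part of this skill's shipped config):

```yaml
# rules/alerts.yml -- illustrative; thresholds and names are assumptions
groups:
  - name: example-alerts
    rules:
      - alert: HighErrorRate
        # Fire when more than 5% of a job's requests return 5xx over 5 minutes
        expr: |
          100 * sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
              / sum(rate(http_requests_total[5m])) by (job) > 5
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High 5xx error rate on {{ $labels.job }}"
```

The `for: 10m` clause keeps a brief error spike from paging anyone; the alert only fires once the condition has held for the full window.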

### OPTIONAL
- Distributed tracing
- OpenTelemetry
- Custom exporters
- Log correlation

### ADVANCED
- High cardinality handling
- Recording rules
- Federation
- Continuous profiling
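
Recording rules precompute expensive aggregations so that dashboards and alerts read a cheap, pre-aggregated series instead of re-crunching raw histogram buckets on every refresh. A minimal sketch (the metric and rule names are illustrative; the `level:metric:operation` naming follows Prometheus convention):

```yaml
# rules/recording.yml -- illustrative names
groups:
  - name: latency-recording
    interval: 30s
    rules:
      # Precompute p99 latency per service from raw histogram buckets
      - record: service:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
```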

## Quick Reference

```promql
# Request rate per service
sum(rate(http_requests_total[5m])) by (service)

# 99th-percentile request latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Error rate: % of requests returning 5xx
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
```

```bash
# Prometheus API
curl http://localhost:9090/api/v1/targets
curl 'http://localhost:9090/api/v1/query?query=up'
curl -X POST http://localhost:9090/-/reload   # requires --web.enable-lifecycle

# Alertmanager
amtool silence add alertname="HighLatency" --duration=2h
amtool alert
```

## SRE Golden Signals
| Signal | Metric |
|--------|--------|
| Latency | `histogram_quantile(0.99, ...)` |
| Traffic | `sum(rate(requests_total[5m]))` |
| Errors | `rate(errors_total[5m])` |
| Saturation | `node_memory_MemAvailable_bytes` |
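
Note that the saturation example reports raw available memory, which falls as saturation rises; a percentage form is usually easier to graph and alert on (assumes node_exporter metrics):

```promql
# Memory saturation as a percentage of total (node_exporter)
100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```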

## Troubleshooting

### Common Failures
| Symptom | Root Cause | Solution |
|---------|------------|----------|
| No data | Scrape failing | Check targets page |
| Alert not firing | PromQL error | Test in UI |
| High cardinality | Too many labels | Reduce labels |
| Slow queries | Too much data | Add aggregation |

### Debug Checklist
1. Check targets: `/targets`
2. Test query in UI
3. Check logs: `journalctl -u prometheus`
4. Verify time sync (NTP)
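
Step 1 of the checklist can be scripted against the `/api/v1/targets` endpoint. A minimal sketch in Python; the payload shape follows Prometheus's documented response, while the sample jobs and addresses are made up for illustration:

```python
import json

def down_targets(targets_json: str):
    """Return (job, instance) pairs for scrape targets not in 'up' state.

    Expects the JSON body returned by Prometheus's /api/v1/targets endpoint.
    """
    data = json.loads(targets_json)
    return [
        (t["labels"].get("job"), t["labels"].get("instance"))
        for t in data.get("data", {}).get("activeTargets", [])
        if t.get("health") != "up"
    ]

# Sample payload shaped like the /api/v1/targets response (made-up targets)
sample = json.dumps({
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node", "instance": "10.0.0.5:9100"}, "health": "down"},
        {"labels": {"job": "api", "instance": "10.0.0.6:8080"}, "health": "up"},
    ]},
})
print(down_targets(sample))  # [('node', '10.0.0.5:9100')]
```

In practice you would feed it the body of `curl http://localhost:9090/api/v1/targets` and alert or page on any non-empty result.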

### Recovery Procedures

#### Prometheus OOM
1. Check cardinality
2. Reduce retention
3. Add federation
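
Once the cardinality offender is identified, it can often be dropped at ingest time with `metric_relabel_configs` before it is ever stored. A sketch (the job, target, and metric names are assumptions):

```yaml
# prometheus.yml (fragment) -- drop a known high-cardinality metric at ingest
scrape_configs:
  - job_name: api
    static_configs:
      - targets: ["localhost:8080"]
    metric_relabel_configs:
      # Drop the offending series before storage; relabeling here runs
      # after the scrape but before ingestion
      - source_labels: [__name__]
        regex: "http_request_debug_info"
        action: drop
```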

## Resources
- [Prometheus Docs](https://prometheus.io/docs)
- [Grafana Docs](https://grafana.com/docs)

Overview

This skill provides a focused monitoring and observability toolkit built around Prometheus, Grafana, the ELK Stack, and distributed tracing. It helps teams instrument services, build dashboards, define SLIs/SLOs, and set up robust alerting and recovery procedures. The goal is to make metrics, logs, and traces actionable, enabling faster incident resolution and reliable service levels.

How this skill works

The skill inspects and documents common observability setups: Prometheus for metrics and PromQL, Grafana for visualizations, ELK for log aggregation, and tracing tools for request flows. It includes practical queries, alert examples, troubleshooting checklists, and recovery steps for typical failures like OOMs or missing scrapes. It also covers SRE concepts such as SLIs, SLOs, and error budgets, along with strategies for handling high cardinality and tuning performance.

When to use it

  • When you need to instrument services and establish core observability (metrics, logs, traces).
  • When creating Grafana dashboards and PromQL alerts for production services.
  • When troubleshooting missing metrics, slow queries, or alert misfires.
  • When defining SLIs, SLOs, and alerting thresholds tied to business outcomes.
  • When you need guidance on scaling Prometheus (cardinality, federation, retention).

Best practices

  • Start with the SRE golden signals: latency, traffic, errors, saturation.
  • Keep Prometheus label cardinality low; avoid high-cardinality dynamic labels.
  • Use recording rules and aggregation to improve query performance.
  • Define clear SLIs and SLOs first, then derive alerting rules and error budgets.
  • Correlate logs and traces with metric alerts to reduce time-to-detect and time-to-restore.

Example use cases

  • Create a 99th-percentile latency dashboard using histogram_quantile and Grafana panels.
  • Write alerting rules for error-rate increases and route them through Alertmanager, with silences for planned maintenance.
  • Investigate missing metrics by checking /targets, Prometheus logs, and NTP time sync.
  • Mitigate Prometheus OOM by diagnosing cardinality, reducing retention, or adding federation.
  • Implement end-to-end request tracing with OpenTelemetry and correlate traces to slow metrics.

FAQ

Which observability pillar should I prioritize first?

Start with metrics to get service-level visibility, then add logs for context and traces for request-level diagnosis.

How do I reduce Prometheus query slowness?

Introduce recording rules to precompute heavy aggregations, lower retention for raw series, and reduce label cardinality.