sla-monitor-generator skill

/.claude/skills/sla-monitor-generator

This skill generates SLA/SLO/SLI monitoring configurations for tracking reliability and error budgets and for alerting on service health.

npx playbooks add skill dexploarer/hyper-forge --skill sla-monitor-generator

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
1.5 KB
---
name: sla-monitor-generator
description: Generate SLA/SLO/SLI monitoring configurations for reliability tracking and error budget management. Activates for SLO setup, reliability targets, and error budget configuration.
allowed-tools: [Read, Write, Edit, Bash, Grep, Glob]
---

# SLA Monitor Generator

Define and monitor Service Level Objectives (SLOs) and track error budgets.

## SLO Definition Example

```yaml
slos:
  - name: api-availability
    sli:
      metric: http_requests_total
      filter: status < 500   # good events: every non-5xx response
    target: 99.9  # percent of requests that must be good
    window: 30d

  - name: api-latency
    sli:
      metric: http_request_duration_seconds
      percentile: 99
    target: 200  # p99 ceiling in ms; the metric is in seconds, so compare against 0.2
    window: 30d

  - name: error-rate
    sli:
      metric: http_requests_total
      filter: status >= 500  # bad events: 5xx responses
    target: 0.1  # percent; the complement of the 99.9% availability target
    window: 30d
```
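
To make these targets concrete, an availability target can be converted into an allowed amount of downtime. The following is a minimal Python sketch, not part of the skill itself; the function name `error_budget_minutes` is hypothetical, and it assumes the whole budget is spent as full downtime.

```python
def error_budget_minutes(target_percent: float, window_days: int) -> float:
    """Return the minutes of full downtime an availability SLO allows per window."""
    window_minutes = window_days * 24 * 60
    allowed_bad_fraction = (100.0 - target_percent) / 100.0
    return window_minutes * allowed_bad_fraction


# The api-availability SLO above: 99.9% over a 30-day window.
budget = error_budget_minutes(99.9, 30)
print(f"{budget:.1f} minutes")  # prints "43.2 minutes"
```

Each extra nine cuts the budget tenfold, which is one reason a 100% target leaves no room for change at all.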

## Prometheus Alerting Rules

```yaml
groups:
  - name: slo-alerts
    rules:
      - alert: SLOBudgetBurnRate
        # 0.001 is the error budget of a 99.9% SLO; 14.4 is the burn-rate multiplier.
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[5m]))
                 / sum(rate(http_requests_total[5m])))
          ) > 0.001 * 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast burn: at this rate, 2% of the 30-day error budget is consumed in 1 hour"
```
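
The `14.4` multiplier follows from the figures in the alert summary: spending 2% of a 30-day budget within 1 hour. Below is a short Python sketch of that arithmetic. The function names are hypothetical, and the 30-day window is taken from the SLO definitions in this file.

```python
def burn_rate_threshold(budget_fraction: float, window_hours: float, alert_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within `alert_hours`."""
    return budget_fraction * window_hours / alert_hours


def hours_to_exhaustion(burn_rate: float, window_hours: float) -> float:
    """Hours until the whole budget is gone if the burn rate is sustained."""
    return window_hours / burn_rate


window_hours = 30 * 24  # 30-day SLO window
rate = burn_rate_threshold(0.02, window_hours, 1)
print(round(rate, 2))                                     # 14.4
print(round(hours_to_exhaustion(rate, window_hours), 1))  # 50.0
```

A burn rate of 1 spends the budget exactly over the window, so a sustained 14.4 empties a 30-day budget in roughly two days.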

## Best Practices
- ✅ Define SLIs based on user experience
- ✅ Set realistic SLO targets (99.9% not 100%)
- ✅ Track error budgets continuously
- ✅ Alert on burn rate, not just breaches
- ✅ Review and adjust SLOs quarterly

Overview

This skill generates SLA, SLO, and SLI monitoring configurations to track reliability and manage error budgets. It produces concrete SLO definitions and alerting rules for services whose metrics are collected by a system such as Prometheus. Use it to translate reliability targets into actionable monitoring and burn-rate alerts.

How this skill works

The skill takes the desired reliability targets and SLI definitions (metrics, filters, percentiles, windows) and outputs YAML configurations for SLOs and alert rules. It builds expressions for availability, latency, and error-rate SLIs and creates Prometheus-style alerting rules for budget burn rate and breaches. Adjust the metric names, label filters, and windows to match your instrumentation before adding the output to your monitoring pipeline.
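
As an illustration of that translation step, the Python sketch below builds a burn-rate expression from an availability target. It is an assumption about how such a generator could work, not the skill's actual code; `burn_rate_expr` and its parameters are hypothetical.

```python
def burn_rate_expr(metric: str, good_selector: str, target_percent: float,
                   burn_rate: float, lookback: str = "5m") -> str:
    """Build a PromQL comparison of the recent error ratio against a burn-rate threshold."""
    # Allowed error ratio, e.g. 0.001 for a 99.9% target (rounded to hide float noise).
    allowed = round(1 - target_percent / 100, 10)
    good = f"sum(rate({metric}{{{good_selector}}}[{lookback}]))"
    total = f"sum(rate({metric}[{lookback}]))"
    return f"(1 - ({good} / {total})) > {allowed} * {burn_rate}"


expr = burn_rate_expr("http_requests_total", 'status!~"5.."', 99.9, 14.4)
print(expr)
```

With these inputs the output matches the comparison in the `SLOBudgetBurnRate` rule from SKILL.md, apart from line breaks.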

When to use it

  • When defining or updating SLOs for a service or API
  • When onboarding new services into your reliability program
  • Before deploying changes that could impact error budgets
  • When you need standardized alerting for burn rate and SLO breaches
  • During quarterly reliability reviews and SLO tuning

Best practices

  • Define SLIs from user experience signals (availability, latency, error rate)
  • Set realistic SLO targets and avoid 100% goals
  • Use rolling windows (e.g., 30d) and appropriate percentiles for latency SLIs
  • Alert on burn rate and trends, not only on threshold breaches
  • Continuously track error budget and tie policy/actions to budget state
  • Review and adjust SLOs and alerts at least quarterly
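
Tying actions to budget state can be sketched as follows. This is a minimal Python example under stated assumptions: the function names and the 25% threshold are hypothetical, and the event counts would come from your own metrics over the SLO window.

```python
def budget_remaining(bad_events: int, total_events: int, target_percent: float) -> float:
    """Fraction of the error budget left in the window (negative when overspent)."""
    allowed_bad = total_events * (100.0 - target_percent) / 100.0
    if allowed_bad == 0:
        return 0.0  # no traffic, or a 100% target that leaves no budget
    return 1.0 - bad_events / allowed_bad


def release_policy(remaining: float) -> str:
    """Map budget state to an action; the thresholds here are illustrative only."""
    if remaining <= 0.0:
        return "freeze"    # budget exhausted: pause risky releases
    if remaining < 0.25:
        return "restrict"  # budget nearly gone: reliability work first
    return "normal"


remaining = budget_remaining(bad_events=400, total_events=1_000_000, target_percent=99.9)
print(round(remaining, 2), release_policy(remaining))  # 0.6 normal
```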

Example use cases

  • Generate SLO definitions for an API gateway with availability and p99 latency targets
  • Create Prometheus alert rules that trigger on fast error budget burn rates
  • Translate service metrics into SLIs for an internal reliability dashboard
  • Standardize SLO YAML across microservices in a TypeScript-based platform
  • Create automated alerts to pause risky releases when error budget is exhausted

FAQ

Can this output be used directly with Prometheus Alertmanager?

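Mostly. The alerting rules use the Prometheus rule-file format and are loaded by the Prometheus server, which sends firing alerts to Alertmanager for routing and notification. The SLO YAML is a generic definition rather than a native Prometheus format. In both cases, adjust metric names and label filters to match your instrumentation.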

How are error budgets and burn rates calculated?

Burn rate is the error ratio observed over a lookback interval divided by the error ratio the SLO allows (1 minus the target). A burn rate of 1 spends the budget exactly over the SLO window. The generated PromQL compares the recent error ratio against the allowed error ratio multiplied by the burn-rate threshold, for example `0.001 * 14.4` for a 99.9% target.
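
The same ratio can be computed directly from event counts. A minimal Python sketch, assuming a target below 100% and a hypothetical function name `burn_rate`:

```python
def burn_rate(bad_events: int, total_events: int, target_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    if not 0.0 < target_percent < 100.0:
        raise ValueError("target_percent must be between 0 and 100, exclusive")
    if total_events == 0:
        return 0.0  # no traffic in the lookback interval, so nothing was burned
    observed = bad_events / total_events
    allowed = (100.0 - target_percent) / 100.0
    return observed / allowed


# 144 failed requests out of 10,000 in the lookback interval, 99.9% target.
rate = burn_rate(144, 10_000, 99.9)
print(round(rate, 1))  # 14.4
```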