
sla-monitor-generator skill

This skill helps generate SLA/SLO/SLI monitoring configurations to track reliability, error budgets, and alerting for service health.

npx playbooks add skill dexploarer/hyper-forge --skill sla-monitor-generator

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

1.5 KB

---
name: sla-monitor-generator
description: Generate SLA/SLO/SLI monitoring configurations for reliability tracking and error budget management. Activates for SLO setup, reliability targets, and error budget configuration.
allowed-tools: [Read, Write, Edit, Bash, Grep, Glob]
---

# SLA Monitor Generator

Define and monitor Service Level Objectives (SLOs) and track error budgets.

## SLO Definition Example

```yaml
slos:
  - name: api-availability
    sli: 
      metric: http_requests_total
      filter: status < 500
    target: 99.9  # 99.9% availability
    window: 30d
    
  - name: api-latency
    sli:
      metric: http_request_duration_seconds
      percentile: 99
    target: 0.2  # 200 ms at p99 (the metric is recorded in seconds)
    window: 30d

  - name: error-rate
    sli:
      metric: http_requests_total
      filter: status >= 500
    target: 0.1  # < 0.1% error rate
    window: 30d
```
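
To make the availability target above concrete, the following sketch converts a target and window into an error budget expressed as allowed downtime. The function name `error_budget_minutes` is hypothetical and not part of the skill, and it assumes the budget is measured in wall-clock minutes rather than in failed requests.

```python
def error_budget_minutes(target_percent: float, window_days: int) -> float:
    """Return the allowed downtime in minutes for an availability SLO."""
    allowed_failure_fraction = 1 - target_percent / 100
    window_minutes = window_days * 24 * 60
    return allowed_failure_fraction * window_minutes


# A 99.9% target over 30 days leaves roughly 43.2 minutes of budget.
print(round(error_budget_minutes(99.9, 30), 1))
```

A 100% target yields a budget of zero, which is one way to see why such targets leave no room for change.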

## Prometheus Alerting Rules

```yaml
groups:
  - name: slo-alerts
    rules:
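      - alert: SLOBudgetBurnRate
        # Fires when the error ratio exceeds 14.4x the 0.1% budget rate over
        # both the last hour and the last 5 minutes. The 1h window matches the
        # "2% of budget in 1 hour" claim; the 5m window lets the alert reset
        # quickly once the errors stop.
        expr: |
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[1h]))
                 / sum(rate(http_requests_total[1h])))
          ) > 0.001 * 14.4
          and
          (
            1 - (sum(rate(http_requests_total{status!~"5.."}[5m]))
                 / sum(rate(http_requests_total[5m])))
          ) > 0.001 * 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast burn rate detected: 2% of the 30-day error budget consumed in 1 hour"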
```
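
The 14.4 multiplier in the rule above follows from simple arithmetic: spending 2% of a 30-day budget within one hour means burning it 14.4 times faster than the sustainable rate. The sketch below reproduces that calculation; both helper names are hypothetical and exist only for illustration.

```python
def burn_rate_multiplier(budget_fraction: float, alert_window_hours: float,
                         slo_window_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(slo_target_percent: float, multiplier: float) -> float:
    """Error ratio above which a burn-rate alert should fire."""
    return (1 - slo_target_percent / 100) * multiplier


multiplier = burn_rate_multiplier(0.02, 1)  # 2% of the budget in 1 hour
print(round(multiplier, 1))                               # 14.4
print(round(error_ratio_threshold(99.9, multiplier), 4))  # 0.0144
```

The second value matches the `0.001 * 14.4` threshold in the rule: for a 99.9% SLO, the alert fires once more than about 1.44% of requests fail.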

## Best Practices
- ✅ Define SLIs based on user experience
- ✅ Set realistic SLO targets (99.9%, not 100%)
- ✅ Track error budgets continuously
- ✅ Alert on burn rate, not just breaches
- ✅ Review and adjust SLOs quarterly

Overview

This skill generates SLA, SLO, and SLI monitoring configurations to track reliability and manage error budgets. It produces concrete SLO definitions and alerting rules suitable for systems that expose metrics (e.g., Prometheus). Use it to translate reliability targets into actionable monitoring and burn-rate alerts.

How this skill works

The skill takes the desired reliability targets and SLI definitions (metrics, filters, percentiles, windows) and outputs YAML configurations for SLOs and alert rules. It builds expressions for availability, latency, and error-rate SLIs and creates Prometheus-style alerting rules for budget burn rate and breaches. The output can be pasted into your monitoring pipeline once metric names and windows are adjusted to match your setup.
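
As an illustration of that translation step, the sketch below turns one availability SLO into a Prometheus-style alerting rule represented as a dictionary. It is not the skill's actual implementation: the function name, the default `1h` window, the `2m` hold time, and the `status` label are all assumptions chosen to mirror the examples in SKILL.md.

```python
def availability_burn_rate_rule(name: str, metric: str, target_percent: float,
                                window: str = "1h", multiplier: float = 14.4) -> dict:
    """Build a Prometheus-style alerting rule dict for an availability SLO."""
    budget = round(1 - target_percent / 100, 6)  # allowed error ratio
    # Error ratio = 1 - (non-5xx request rate / total request rate).
    expr = (
        f'(1 - (sum(rate({metric}{{status!~"5.."}}[{window}])) '
        f'/ sum(rate({metric}[{window}])))) > {budget} * {multiplier}'
    )
    return {
        "alert": f"{name}-burn-rate",
        "expr": expr,
        "for": "2m",
        "labels": {"severity": "critical"},
    }


rule = availability_burn_rate_rule("api-availability", "http_requests_total", 99.9)
print(rule["expr"])
```

The resulting dictionary can be serialized to YAML with whichever library your platform already uses; the label name in the selector must match your own instrumentation.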

When to use it

When defining or updating SLOs for a service or API
When onboarding new services into your reliability program
Before deploying changes that could impact error budgets
When you need standardized alerting for burn rate and SLO breaches
During quarterly reliability reviews and SLO tuning

Best practices

Define SLIs from user experience signals (availability, latency, error rate)
Set realistic SLO targets and avoid 100% goals
Use rolling windows (e.g., 30d) and appropriate percentiles for latency SLIs
Alert on burn rate and trends, not only on threshold breaches
Continuously track error budget and tie policy/actions to budget state
Review and adjust SLOs and alerts at least quarterly
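
One way to tie actions to budget state is to compute the unspent share of the budget from request counts and map it to a policy. The sketch below assumes a request-based budget; the function names, the 25% threshold, and the action labels are illustrative choices, not values prescribed by the skill.

```python
def remaining_budget_fraction(total_requests: int, failed_requests: int,
                              target_percent: float) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    allowed_failures = total_requests * (1 - target_percent / 100)
    if allowed_failures == 0:
        return 0.0  # no traffic or a 100% target: nothing left to spend
    return 1 - failed_requests / allowed_failures


def budget_policy(remaining: float) -> str:
    """Map budget state to an action; the thresholds are illustrative only."""
    if remaining <= 0:
        return "freeze-releases"
    if remaining < 0.25:
        return "slow-down"
    return "normal"


remaining = remaining_budget_fraction(1_000_000, 400, 99.9)
print(round(remaining, 2), budget_policy(remaining))  # 0.6 normal
```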

Example use cases

Generate SLO definitions for an API gateway with availability and p99 latency targets
Create Prometheus alert rules that trigger on fast error budget burn rates
Translate service metrics into SLIs for an internal reliability dashboard
Standardize SLO YAML across microservices in a TypeScript-based platform
Create automated alerts to pause risky releases when error budget is exhausted

FAQ

Can this output be used directly with Prometheus Alertmanager?

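Partly. The alerting rules follow the Prometheus rule-file format, so Prometheus can load and evaluate them and forward firing alerts to Alertmanager for routing and notification. The SLO definition YAML is not a native Prometheus format and serves as input for your own tooling. In both cases, metric names and label filters should be adjusted to match your instrumentation.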

How are error budgets and burn rates calculated?

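Burn rate is the observed error ratio over a lookback interval divided by the error ratio the SLO allows (1 minus the target). A burn rate of 1 spends the budget exactly over the full SLO window; higher values spend it proportionally faster. The skill translates this into PromQL expressions that compare the recent error ratio to the allowed error ratio scaled by the burn multiplier.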
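
A small worked example may help here. The sketch below computes the burn rate from an observed error ratio and estimates how long a full budget would last at that rate; the function names are hypothetical, and the 30-day window is assumed from the SLO examples.

```python
def burn_rate(observed_error_ratio: float, slo_target_percent: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    allowed_error_ratio = 1 - slo_target_percent / 100
    return observed_error_ratio / allowed_error_ratio


def hours_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # no errors, so the budget is never spent
    return slo_window_days * 24 / rate


rate = burn_rate(0.0144, 99.9)  # 1.44% of requests failing against a 99.9% SLO
print(round(rate, 1))                    # 14.4
print(round(hours_to_exhaustion(rate)))  # 50
```

At a sustained burn rate of 14.4, a 30-day budget is gone in about 50 hours, which is why that level is usually treated as page-worthy.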