
sre-reliability skill


This skill helps manage service reliability by defining and monitoring SLOs/SLIs, managing error budgets, and driving availability improvements.

npx playbooks add skill shaul1991/shaul-agents-plugin --skill sre-reliability

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

483 B

---
name: sre-reliability
description: SRE Reliability Agent. Responsible for service reliability, SLO/SLI management, and availability improvement.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob
---

# SRE Reliability Agent

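## Role
Responsible for managing service reliability and availability.

## Responsibilities
- Define and monitor SLOs/SLIs
- Manage error budgets
- Improve availability
- Performance engineering

## Output locations
- SLO definitions: `docs/slo/`
- Monitoring: `monitoring/`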

Overview

This skill is an SRE Reliability Agent focused on improving service availability and operational resilience. It helps define and manage SLOs/SLIs, track error budgets, and guide performance and reliability improvements. The agent produces clear artifacts for monitoring and SLO governance to support engineering and ops teams.

How this skill works

The agent inspects service telemetry, SLI measurements, and incident history to recommend SLO targets and error budget policies. It identifies monitoring gaps, proposes alerts and dashboards, and suggests remediation or capacity changes to reduce downtime. Outputs include SLO definitions, monitoring configurations, and prioritized reliability actions.
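To make the relationship between an SLI, an SLO target, and an error budget concrete, here is a minimal Python sketch. It is not the agent's own implementation; the `ErrorBudget` class, its field names, and the sample numbers are assumptions chosen for illustration, and it treats availability as a simple ratio of good requests to total requests.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical helper: request-based availability SLI and its error budget."""

    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests the SLI counts as bad

    @property
    def sli(self) -> float:
        """Measured availability: good requests divided by total requests."""
        if self.total_requests == 0:
            return 1.0  # no traffic, so nothing has violated the SLO
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Number of failures the SLO tolerates over the window."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the error budget still unspent (negative if overspent)."""
        if self.allowed_failures == 0:
            return 0.0  # a 100% target, or no traffic, leaves no budget to spend
        return 1 - self.failed_requests / self.allowed_failures


# Illustrative numbers only, not taken from a real service.
budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(f"SLI: {budget.sli:.4%}")
print(f"Budget remaining: {budget.remaining_fraction:.0%}")
```

With a 99.9% target, one million requests leave room for roughly 1,000 failures, so 400 failures consume about 40% of the budget. A time-based SLI (minutes of downtime) would follow the same arithmetic with minutes in place of requests.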

When to use it

When launching or revising service-level objectives and indicators (SLOs/SLIs)
After repeated incidents or when error budgets are near exhaustion
During capacity planning or before major releases
To establish or improve monitoring and observability practices
When you need objective measures for reliability and operational risk

Best practices

Start with a small set of meaningful SLIs tied to user experience rather than internal metrics
Define clear, measurable SLO targets and document error budget policies for risk-based decision making
Automate SLI collection and alerting to ensure objective, timely signals
Use post-incident analysis to refine SLOs and remediation runbooks
Prioritize corrective work based on error budget depletion and customer impact

Example use cases

Create SLOs for latency and availability for a customer-facing API and store definitions under docs/slo/
Monitor request error rates and latency trends, and update monitoring assets under monitoring/
Recommend throttling or rollback policies when error budget burn rate is high
Drive a performance engineering plan to reduce tail latency and improve capacity headroom
Report periodic reliability metrics to product and engineering leadership using SLO-backed dashboards
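As an illustration of the latency use case above, a latency SLI is commonly expressed as the fraction of requests served faster than a threshold, which is then compared with the SLO target. The sketch below assumes that formulation; the function names, the 300 ms threshold, the 95% target, and the sample latencies are all hypothetical.

```python
def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Return the fraction of requests that completed within threshold_ms."""
    if not latencies_ms:
        return 1.0  # no requests observed, so none were slow
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


def meets_slo(sli: float, target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= target


# Made-up request latencies in milliseconds.
samples = [120.0, 95.0, 310.0, 88.0, 150.0, 99.0, 101.0, 640.0, 130.0, 97.0]
sli = latency_sli(samples, threshold_ms=300.0)
print(f"latency SLI: {sli:.0%}, meets 95% target: {meets_slo(sli, 0.95)}")
```

Counting requests under a threshold, rather than averaging latency, keeps the SLI sensitive to the slow tail that users actually notice.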

FAQ

What artifacts does the agent produce?

SLO definitions, monitoring recommendations, alert thresholds, and prioritized reliability actions.

How does it use error budgets?

Error budgets guide risk decisions: high burn rates trigger mitigations and restrict risky releases until the budget recovers.
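One way to picture this policy is as a burn-rate check, where burn rate is the observed error rate divided by the error rate the SLO allows. The Python sketch below is an assumption about how such a gate could look, not the agent's actual logic; the `freeze_at` and `mitigate_at` thresholds and the decision labels are hypothetical and would need tuning for each service.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate relative to the rate the SLO allows.

    A value of 1.0 means the budget would be spent exactly at the end of
    the SLO window; higher values exhaust it sooner.
    """
    allowed = 1 - slo_target
    if allowed <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    return error_rate / allowed


def release_decision(rate: float, freeze_at: float = 2.0, mitigate_at: float = 10.0) -> str:
    """Map a burn rate to a policy action (thresholds are illustrative)."""
    if rate >= mitigate_at:
        return "mitigate"          # e.g. roll back or throttle
    if rate >= freeze_at:
        return "freeze-releases"   # hold risky changes until the burn slows
    return "allow-releases"


# A 0.5% error rate against a 99.9% SLO burns the budget about 5x too fast.
rate = burn_rate(error_rate=0.005, slo_target=0.999)
print(release_decision(rate))
```

In practice, burn rate is usually evaluated over more than one time window so that a brief spike does not freeze releases while a slow, sustained burn still gets caught.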