This skill helps manage service reliability by defining and monitoring SLOs/SLIs, tracking error budgets, and driving availability improvements.
npx playbooks add skill shaul1991/shaul-agents-plugin --skill sre-reliability

Review the files below or copy the command above to add this skill to your agents.
---
name: sre-reliability
description: SRE Reliability Agent. Responsible for service reliability, SLO/SLI management, and availability improvement.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob
---
# SRE Reliability Agent
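## Role
Manages service reliability and availability.
## Responsibilities
- Define and monitor SLOs/SLIs
- Manage error budgets
- Improve availability
- Performance engineering
## Artifact locations
- SLO definitions: `docs/slo/`
- Monitoring: `monitoring/`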
This skill is an SRE Reliability Agent focused on improving service availability and operational resilience. It helps define and manage SLOs/SLIs, track error budgets, and guide performance and reliability improvements. The agent produces clear artifacts for monitoring and SLO governance to support engineering and ops teams.
The agent inspects service telemetry, SLI measurements, and incident history to recommend SLO targets and error budget policies. It synthesizes monitoring gaps, proposes alerts and dashboards, and suggests remediation or capacity changes to reduce downtime. Outputs include SLO definitions, monitoring configurations, and prioritized reliability actions.
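To make the relationship between an SLI measurement, an SLO target, and the error budget concrete, here is a minimal Python sketch. The `Slo` record, its field names, the service name, and the request counts are all hypothetical and chosen for illustration; they are not the agent's actual schema or output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO record; field names are illustrative only."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must be good


def availability_sli(good: int, total: int) -> float:
    """Fraction of good events; an empty window counts as fully available."""
    return 1.0 if total == 0 else good / total


def error_budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Share of the error budget still unspent (1.0 = untouched, below 0 = overspent)."""
    allowed_bad = (1.0 - slo.target) * total
    if allowed_bad == 0:
        # No failures are permitted (empty window or a 100% target).
        return 1.0 if good == total else float("-inf")
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


# Assumed example: 1,000,000 requests in the window, 500 of them failed.
slo = Slo(name="checkout-availability", target=0.999)
print(availability_sli(999_500, 1_000_000))
print(error_budget_remaining(slo, 999_500, 1_000_000))
```

With a 99.9% target, 1,000,000 requests allow about 1,000 failures, so 500 failures leave roughly half of the budget unspent.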
What artifacts does the agent produce?
SLO definitions, monitoring recommendations, alert thresholds, and prioritized reliability actions.
How does it use error budgets?
Error budgets guide risk decisions: high burn rates trigger mitigations and restrict risky releases until the budget stabilizes.
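As a rough illustration of the burn-rate idea, the sketch below computes a burn rate as the observed error rate divided by the error rate the SLO allows, then gates releases on it. The function names and the `freeze_threshold` default of 2.0 are assumptions made for this example, not values the skill prescribes.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would be spent exactly over the SLO period;
    higher values mean it is being spent faster.
    """
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (bad / total) / allowed_error_rate


def release_allowed(rate: float, freeze_threshold: float = 2.0) -> bool:
    """Hypothetical policy: block risky releases once the burn rate reaches the threshold."""
    return rate < freeze_threshold


# Assumed example window: 30 failures out of 10,000 requests against a 99.9% SLO.
rate = burn_rate(bad=30, total=10_000, slo_target=0.999)
print(rate, release_allowed(rate))
```

In this example the service burns budget about three times faster than the SLO allows, so the assumed policy would hold risky releases until the rate drops back under the threshold.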