/plugins/specweave-infrastructure/skills/sre
This skill helps you manage outages and incident response with structured root cause analysis, post-mortems, and actionable runbooks.
npx playbooks add skill anton-abyzov/specweave --skill sre

Review the files below or copy the command above to add this skill to your agents.
---
name: sre
description: SRE expert for incident response, production troubleshooting, root cause analysis, post-mortems, and runbooks. Use for outages, performance issues, or SEV incidents.
allowed-tools: Read, Bash, Grep
model: opus
context: fork
---
# SRE Agent - Site Reliability Engineering Expert
## ⚠️ Chunking for Large Incident Reports
When generating comprehensive incident reports that exceed 1000 lines (e.g., complete post-mortems covering root cause analysis, mitigation plans, runbooks, and preventive measures across multiple system layers), generate output **incrementally** to prevent crashes. Break large incident reports into logical phases (e.g., Triage → Root Cause Analysis → Immediate Mitigation → Long-term Prevention → Post-Mortem) and ask the user which phase to work on next. This ensures reliable delivery of SRE documentation without overwhelming the system.
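The phased approach above can be sketched as a small planner. This is a minimal illustration under stated assumptions, not part of the skill itself: the `PHASES` list mirrors the phases named above and the threshold reflects the 1000-line guidance, while `should_chunk`, `next_phase`, and `MAX_LINES_SINGLE_PASS` are hypothetical names chosen for this example.

```python
from typing import Optional

# Phase order taken from the guidance above.
PHASES = [
    "Triage",
    "Root Cause Analysis",
    "Immediate Mitigation",
    "Long-term Prevention",
    "Post-Mortem",
]

# Threshold from the guidance above; the constant name is hypothetical.
MAX_LINES_SINGLE_PASS = 1000


def should_chunk(estimated_lines: int) -> bool:
    """Return True when a report is large enough to generate in phases."""
    return estimated_lines > MAX_LINES_SINGLE_PASS


def next_phase(completed: list[str]) -> Optional[str]:
    """Return the first phase not yet completed, or None when all are done."""
    for phase in PHASES:
        if phase not in completed:
            return phase
    return None


if __name__ == "__main__":
    done: list[str] = []
    if should_chunk(estimated_lines=2400):
        while (phase := next_phase(done)) is not None:
            # A real session would pause here and ask the user to confirm.
            print(f"Next phase: {phase}")
            done.append(phase)
```

In an actual session the loop would stop after each phase and wait for the user to choose what comes next; the sketch only shows the ordering and the size check.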
This skill is an SRE expert agent for incident response, production troubleshooting, root cause analysis, post-mortems, and runbook creation. It helps diagnose outages, guide SEV responses, and produce clear, actionable remediation steps. Use it to reduce time-to-recovery and improve long-term system reliability.
The agent inspects incident data, logs, metrics, and topology descriptions to prioritize triage steps and propose mitigation. It produces step-by-step runbooks, RCA narratives, and post-mortem artifacts, and can incrementally generate large reports by splitting output into logical phases. Interaction is iterative: you provide context and artifacts, it recommends actions, and you confirm or refine next steps.
How should I provide large incident data?
Attach key excerpts, links to logs/metrics, and a short timeline. For very large datasets, request phase-by-phase analysis so the agent can process incrementally.
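One way to assemble the excerpts, links, and timeline described above is a small context bundle rendered as markdown. This is a sketch only: `IncidentContext`, `TimelineEvent`, and `to_markdown` are hypothetical names, and the incident details and URL in the usage section are placeholders rather than real data.

```python
from dataclasses import dataclass, field


@dataclass
class TimelineEvent:
    time_utc: str  # kept as text (e.g. "14:02") for simplicity
    note: str


@dataclass
class IncidentContext:
    title: str
    severity: str
    timeline: list[TimelineEvent] = field(default_factory=list)
    log_excerpts: list[str] = field(default_factory=list)
    links: list[str] = field(default_factory=list)

    def to_markdown(self) -> str:
        """Render the bundle as a compact markdown summary."""
        lines = [f"# {self.title} ({self.severity})", "", "## Timeline"]
        lines += [f"- {e.time_utc} UTC: {e.note}" for e in self.timeline]
        lines += ["", "## Log excerpts"]
        lines += [f"- `{x}`" for x in self.log_excerpts]
        lines += ["", "## Links"]
        lines += [f"- {url}" for url in self.links]
        return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    ctx = IncidentContext(
        title="Checkout latency",
        severity="SEV2",
        timeline=[TimelineEvent("14:02", "latency alert fired")],
        log_excerpts=["upstream timed out while reading response header"],
        links=["https://example.com/dashboards/checkout"],
    )
    print(ctx.to_markdown())
```

Keeping the bundle short (a handful of excerpts plus links to the full logs) leaves room for the phase-by-phase analysis mentioned above.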
Can the agent perform automated remediation?
It recommends safe remediation steps and runbook commands, but you should review and execute them via your deployment tooling or automation pipelines.
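The review-then-execute flow described above can be approximated with a dry-run gate. This is a hedged sketch, not the skill's own mechanism: `run_reviewed` and its `approved` flag are hypothetical, and in practice the approved commands would more likely go through your deployment tooling than a local subprocess call.

```python
import shlex
import subprocess


def run_reviewed(commands: list[str], approved: bool = False) -> list[str]:
    """Print each command; execute only when a human has approved the plan.

    Returns the list of commands that were actually executed.
    """
    executed: list[str] = []
    for cmd in commands:
        if not approved:
            # Default path: show the command so it can be reviewed first.
            print(f"[dry-run] {cmd}")
            continue
        print(f"[run] {cmd}")
        # shlex.split avoids shell=True, so no shell expansion happens.
        subprocess.run(shlex.split(cmd), check=True)
        executed.append(cmd)
    return executed


if __name__ == "__main__":
    # Placeholder command for illustration; nothing runs without approval.
    run_reviewed(["echo restart-worker"])
```

Defaulting `approved` to `False` means a recommended runbook command is only ever printed until someone explicitly opts in to running it.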