sre skill

/plugins/specweave-infrastructure/skills/sre

This skill helps you manage outages and incident response with structured root cause analysis, post-mortems, and actionable runbooks.

This is most likely a fork of the sw-sre skill from openclaw.
npx playbooks add skill anton-abyzov/specweave --skill sre

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: sre
description: SRE expert for incident response, production troubleshooting, root cause analysis, post-mortems, and runbooks. Use for outages, performance issues, or SEV incidents.
allowed-tools: Read, Bash, Grep
model: opus
context: fork
---

# SRE Agent - Site Reliability Engineering Expert

## ⚠️ Chunking for Large Incident Reports

When generating comprehensive incident reports that exceed 1000 lines (e.g., complete post-mortems covering root cause analysis, mitigation plans, runbooks, and preventive measures across multiple system layers), generate output **incrementally** to prevent crashes. Break large incident reports into logical phases (e.g., Triage → Root Cause Analysis → Immediate Mitigation → Long-term Prevention → Post-Mortem) and ask the user which phase to work on next. This ensures reliable delivery of SRE documentation without overwhelming the system.

Overview

This skill is an SRE expert agent for incident response, production troubleshooting, root cause analysis, post-mortems, and runbook creation. It helps diagnose outages, guide SEV responses, and produce clear, actionable remediation steps. Use it to reduce time-to-recovery and improve long-term system reliability.

How this skill works

The agent inspects incident data, logs, metrics, and topology descriptions to prioritize triage steps and propose mitigation. It produces step-by-step runbooks, RCA narratives, and post-mortem artifacts, and can incrementally generate large reports by splitting output into logical phases. Interaction is iterative: you provide context and artifacts, it recommends actions, and you confirm or refine next steps.
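The phase-based splitting described above could be modeled roughly as follows. This is a minimal sketch, not the skill's actual implementation: the `PhasedReport` class and its methods are hypothetical names introduced for illustration, while the phase list and the 1000-line threshold come from the skill text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

# Phase names as listed in the skill description.
PHASES = [
    "Triage",
    "Root Cause Analysis",
    "Immediate Mitigation",
    "Long-term Prevention",
    "Post-Mortem",
]

# Threshold mentioned in the skill text; treated here as a simple line count.
MAX_LINES = 1000


@dataclass
class PhasedReport:
    """Hypothetical container holding one text section per phase."""

    sections: Dict[str, str] = field(default_factory=dict)

    def add(self, phase: str, text: str) -> None:
        if phase not in PHASES:
            raise ValueError(f"unknown phase: {phase}")
        self.sections[phase] = text

    def total_lines(self) -> int:
        return sum(len(text.splitlines()) for text in self.sections.values())

    def needs_chunking(self) -> bool:
        # Deliver phase by phase once the whole report passes the threshold.
        return self.total_lines() > MAX_LINES

    def next_phase(self, done: Set[str]) -> Optional[str]:
        # Return the first phase, in canonical order, not yet delivered.
        for phase in PHASES:
            if phase in self.sections and phase not in done:
                return phase
        return None


report = PhasedReport()
report.add("Triage", "line\n" * 600)
report.add("Root Cause Analysis", "line\n" * 600)
if report.needs_chunking():
    print("Next phase to deliver:", report.next_phase(done={"Triage"}))
```

Keeping the phases in a fixed order means the question "which phase next?" always has a sensible default, while the user can still pick a different one.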

When to use it

  • During active outages or SEV incidents to coordinate triage and mitigation
  • For performance degradation investigations and root cause analysis
  • When preparing incident post-mortems and safety-focused runbooks
  • To produce runbooks for on-call engineers and automated playbooks
  • When you need structured, audit-ready incident documentation

Best practices

  • Provide concise context: service, symptom, time window, and links to key logs and metrics
  • Share relevant configuration, topology, and recent deploys to speed RCA
  • Use the phase-based approach for very large reports (Triage → RCA → Mitigation → Prevention → Post-Mortem)
  • Request incremental output for reports exceeding ~1000 lines to avoid crashes
  • Validate proposed remediation in a staging environment before production changes
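One way to make the "concise context" practice concrete is to collect the fields in a small structure before handing them to the agent. The sketch below is illustrative only: `IncidentContext`, its field names, and the example values (service name, URL) are assumptions, not part of the skill.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class IncidentContext:
    """Hypothetical bundle of the context items listed above."""

    service: str
    symptom: str
    window_start: datetime
    window_end: datetime
    evidence_links: List[str] = field(default_factory=list)  # logs/metrics
    recent_deploys: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        # Report which items are still absent before the context is shared.
        missing = []
        if not self.service.strip():
            missing.append("service")
        if not self.symptom.strip():
            missing.append("symptom")
        if self.window_end <= self.window_start:
            missing.append("time window")
        if not self.evidence_links:
            missing.append("logs/metrics links")
        return missing

    def summary(self) -> str:
        return (
            f"{self.service}: {self.symptom} "
            f"({self.window_start:%Y-%m-%d %H:%M} to {self.window_end:%H:%M} UTC), "
            f"{len(self.evidence_links)} evidence link(s), "
            f"{len(self.recent_deploys)} recent deploy(s)"
        )


# Placeholder values for illustration only.
ctx = IncidentContext(
    service="checkout-api",
    symptom="p99 latency above 2s",
    window_start=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
    window_end=datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc),
)
print(ctx.missing_fields())
```

A check like `missing_fields()` is a cheap way to notice, before an investigation starts, that no logs or metrics links were attached.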

Example use cases

  • Lead a SEV1 outage: prioritize mitigation, propose safe rollbacks, and draft the post-mortem
  • Analyze latency spikes: correlate traces, identify bottlenecks, and recommend tuning
  • Create a runbook for database failover with step-by-step verification checks
  • Draft a cross-team prevention plan after repeated cache-eviction incidents
  • Generate an audit-ready post-mortem structured into timeline, impact, RCA, and action items
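The four-part post-mortem structure named in the last use case (timeline, impact, RCA, action items) could be captured in a small template like the one below. This is a sketch under assumptions: the `PostMortem` and `ActionItem` classes and the Markdown layout are hypothetical and do not reflect the skill's actual output format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ActionItem:
    description: str
    owner: str
    done: bool = False


@dataclass
class PostMortem:
    """Hypothetical post-mortem record with the four sections named above."""

    title: str
    impact: str
    root_cause: str
    # (ISO 8601 timestamp, event) pairs; ISO strings sort chronologically
    # as long as they share the same format and time zone.
    timeline: List[Tuple[str, str]] = field(default_factory=list)
    action_items: List[ActionItem] = field(default_factory=list)

    def render(self) -> str:
        lines = [f"# Post-mortem: {self.title}", "", "## Timeline"]
        for timestamp, event in sorted(self.timeline):
            lines.append(f"- {timestamp} {event}")
        lines += ["", "## Impact", self.impact]
        lines += ["", "## RCA", self.root_cause]
        lines += ["", "## Action items"]
        for item in self.action_items:
            box = "x" if item.done else " "
            lines.append(f"- [{box}] {item.description} (owner: {item.owner})")
        return "\n".join(lines)


# Placeholder incident data for illustration only.
pm = PostMortem(
    title="Cache eviction storm",
    impact="Elevated error rate on the storefront for 25 minutes.",
    root_cause="Cache memory limit too low for peak traffic.",
    timeline=[
        ("2024-01-01T10:05Z", "Mitigation applied"),
        ("2024-01-01T10:00Z", "Alert fired"),
    ],
    action_items=[ActionItem("Raise cache memory limit", "platform-team")],
)
print(pm.render())
```

Sorting the timeline at render time lets contributors append events in any order while the document still reads chronologically.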

FAQ

How should I provide large incident data?

Attach key excerpts, links to logs/metrics, and a short timeline. For very large datasets, request phase-by-phase analysis so the agent can process incrementally.

Can the agent perform automated remediation?

It recommends safe remediation steps and runbook commands; review them first, then execute them through your own deployment tooling or automation pipelines.
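The review-before-execute workflow in this answer can be sketched as a simple approval gate. Everything here is illustrative: `RemediationStep`, `plan`, and `approved_commands` are hypothetical helpers, the deployment name is a placeholder, and the code deliberately never executes anything.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class RemediationStep:
    """Hypothetical wrapper around one recommended command."""

    description: str
    command: str
    approved: bool = False  # set by a human reviewer, never by the agent


def plan(steps: List[RemediationStep]) -> List[str]:
    # Dry-run listing: shows every step with its review status, runs nothing.
    listing = []
    for number, step in enumerate(steps, start=1):
        status = "APPROVED" if step.approved else "NEEDS REVIEW"
        listing.append(f"{number}. [{status}] {step.description}: {step.command}")
    return listing


def approved_commands(steps: List[RemediationStep]) -> List[List[str]]:
    # Only approved steps are turned into argv lists for your own tooling.
    return [shlex.split(step.command) for step in steps if step.approved]


# Placeholder commands; "checkout-api" is an assumed deployment name.
steps = [
    RemediationStep(
        "Roll back the last deploy",
        "kubectl rollout undo deployment/checkout-api",
    ),
    RemediationStep(
        "Check rollout status",
        "kubectl rollout status deployment/checkout-api",
        approved=True,
    ),
]
for line in plan(steps):
    print(line)
```

Handing `approved_commands(steps)` to an existing pipeline keeps execution in the tooling you already trust, with the agent's output acting only as a proposal.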