/grey-haven-plugins/incident-response/skills/incident-response
This skill guides incident responders through detection, investigation, mitigation, recovery, and postmortems using SRE best practices.
To add this skill to your agents, run:
`npx playbooks add skill greyhaven-ai/claude-code-config --skill incident-response`
---
name: grey-haven-incident-response
description: "Handle production incidents with SRE best practices including detection, investigation, mitigation, recovery, and postmortems. Use when dealing with production outages, SEV1/SEV2 incidents, creating postmortems, or updating runbooks."
# v2.0.43: Skills to auto-load for incident response
skills:
- grey-haven-code-style
- grey-haven-observability-monitoring
- grey-haven-smart-debugging
# v2.0.74: Tools for incident response
allowed-tools:
- Read
- Write
- Bash
- Grep
- Glob
- TodoWrite
- WebFetch
---
# Incident Response Skill
Handle production incidents with SRE best practices including detection, investigation, mitigation, recovery, and postmortems.
## Description
Production incident response following SRE methodologies with incident timeline tracking, RCA documentation, and runbook updates.
## What's Included
- **Examples**: SEV1 incident handling, postmortem templates
- **Reference**: SRE best practices, incident severity levels
- **Templates**: Incident reports, RCA documents, runbook updates
## Use When
- Production outages
- SEV1/SEV2 incidents
- Postmortem creation
- Runbook updates
## Related Agents
- `incident-responder`
**Skill Version**: 1.0
This skill configures an incident response agent to manage production incidents using established SRE practices. It covers detection, investigation, mitigation, recovery, and postmortem workflows, with the goal of restoring service and reducing recurrence. It includes templates, severity guidance, and runbook update helpers to standardize response across teams.
The agent inspects incoming alerts, categorizes incident severity, and builds a live timeline of events and actions. It guides responders through investigation steps, suggests mitigation actions, and records decisions and communications. After recovery, it generates a postmortem draft and proposes concrete runbook updates and RCA items to close the loop.
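The flow described above (classify severity, then keep a live timeline of events and actions) could be modeled roughly as follows. This is a minimal sketch, not the skill's actual implementation: the `classify_severity` thresholds, the `SEV3` level, and the `Incident` and `TimelineEntry` names are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Severity(Enum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3  # assumed lower tier; the skill text only names SEV1/SEV2


def classify_severity(error_rate: float, users_affected_pct: float) -> Severity:
    """Map alert signals to a severity using hypothetical thresholds."""
    if error_rate >= 0.5 or users_affected_pct >= 50:
        return Severity.SEV1
    if error_rate >= 0.05 or users_affected_pct >= 5:
        return Severity.SEV2
    return Severity.SEV3


@dataclass
class TimelineEntry:
    at: datetime
    kind: str  # e.g. "alert", "action", "decision", "comms"
    note: str


@dataclass
class Incident:
    incident_id: str
    severity: Severity
    timeline: list[TimelineEntry] = field(default_factory=list)

    def record(self, kind: str, note: str, at: Optional[datetime] = None) -> TimelineEntry:
        """Append a timestamped entry; defaults to the current UTC time."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), kind, note)
        self.timeline.append(entry)
        return entry


if __name__ == "__main__":
    incident = Incident("INC-001", classify_severity(error_rate=0.6, users_affected_pct=20))
    incident.record("alert", "Checkout error rate above threshold")
    incident.record("action", "Rolled back latest deploy")
    for entry in incident.timeline:
        print(entry.at.isoformat(), entry.kind, entry.note)
```

Recording decisions and communications as timeline entries alongside alerts keeps one ordered record that a postmortem draft can be built from later.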
## FAQ
**Can this skill handle multiple active incidents?**
Yes. It supports managing separate incident timelines and tagging resources per incident so teams can switch context safely.
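One way to picture separate timelines and per-incident resource tags is a small registry keyed by incident ID. The sketch below is illustrative only: `IncidentRegistry` and its methods are hypothetical names and do not come from the skill.

```python
from __future__ import annotations

from collections import defaultdict


class IncidentRegistry:
    """Keeps one timeline and one set of resource tags per incident ID."""

    def __init__(self) -> None:
        self._timelines: dict[str, list[str]] = defaultdict(list)
        self._resources: dict[str, set[str]] = defaultdict(set)

    def log(self, incident_id: str, note: str) -> None:
        """Append a note to the timeline of a single incident."""
        self._timelines[incident_id].append(note)

    def tag(self, incident_id: str, resource: str) -> None:
        """Associate a resource (service, host, database) with an incident."""
        self._resources[incident_id].add(resource)

    def timeline(self, incident_id: str) -> list[str]:
        """Return a copy of the timeline so callers cannot mutate it."""
        return list(self._timelines.get(incident_id, []))

    def incidents_touching(self, resource: str) -> list[str]:
        """List the incidents that have tagged the given resource."""
        return sorted(i for i, tags in self._resources.items() if resource in tags)


if __name__ == "__main__":
    registry = IncidentRegistry()
    registry.log("INC-001", "Database failover started")
    registry.log("INC-002", "API latency alert fired")
    registry.tag("INC-001", "db-primary")
    registry.tag("INC-002", "db-primary")
    print(registry.incidents_touching("db-primary"))
```

Looking up which incidents share a resource is what makes context switching safer: a responder can check whether an action on that resource would affect another open incident.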
**Does it create final postmortems automatically?**
It drafts structured postmortems with timelines, RCA, and action items; human review is required to finalize and assign ownership.
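A draft of this kind might be rendered as below. This is a sketch under assumptions: the `draft_postmortem` function, the section headings, and the `UNASSIGNED` owner placeholder are illustrative and are not the skill's actual template.

```python
from __future__ import annotations


def draft_postmortem(
    title: str,
    timeline: list[tuple[str, str]],
    root_cause: str,
    action_items: list[str],
) -> str:
    """Render a Markdown postmortem draft that is explicitly marked as unreviewed."""
    lines = [
        f"# Postmortem: {title}",
        "",
        "Status: DRAFT (human review required)",
        "",
        "## Timeline",
    ]
    lines += [f"- {when}: {what}" for when, what in timeline]
    lines += ["", "## Root Cause Analysis", root_cause, "", "## Action Items"]
    # Owners are left blank on purpose: assignment happens during human review.
    lines += [f"- [ ] {item} (owner: UNASSIGNED)" for item in action_items]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        draft_postmortem(
            title="Checkout outage",
            timeline=[("10:02 UTC", "Alert fired"), ("10:15 UTC", "Deploy rolled back")],
            root_cause="Placeholder: to be confirmed during review.",
            action_items=["Add canary stage to the deploy pipeline"],
        )
    )
```

Keeping the draft status and unassigned owners visible in the output makes the required human review step hard to skip.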