
incident-response skill


/grey-haven-plugins/incident-response/skills/incident-response

This skill guides incident responders through detection, investigation, mitigation, recovery, and postmortems using SRE best practices.

npx playbooks add skill greyhaven-ai/claude-code-config --skill incident-response


SKILL.md

---
name: grey-haven-incident-response
description: "Handle production incidents with SRE best practices including detection, investigation, mitigation, recovery, and postmortems. Use when dealing with production outages, SEV1/SEV2 incidents, creating postmortems, or updating runbooks."
# v2.0.43: Skills to auto-load for incident response
skills:
  - grey-haven-code-style
  - grey-haven-observability-monitoring
  - grey-haven-smart-debugging
# v2.0.74: Tools for incident response
allowed-tools:
  - Read
  - Write
  - Bash
  - Grep
  - Glob
  - TodoWrite
  - WebFetch
---

# Incident Response Skill

Handle production incidents with SRE best practices including detection, investigation, mitigation, recovery, and postmortems.

## Description

Production incident response following SRE methodologies with incident timeline tracking, RCA documentation, and runbook updates.

## What's Included

- **Examples**: SEV1 incident handling, postmortem templates
- **Reference**: SRE best practices, incident severity levels
- **Templates**: Incident reports, RCA documents, runbook updates

## Use When

- Production outages
- SEV1/SEV2 incidents
- Postmortem creation
- Runbook updates

## Related Agents

- `incident-responder`

**Skill Version**: 1.0

Overview

This skill configures an incident response agent to manage production incidents using established SRE practices. It covers the detection, investigation, mitigation, recovery, and postmortem workflows needed to restore service and reduce recurrence. It also includes templates, severity guidance, and runbook update helpers so that teams respond in a consistent way.

How this skill works

The agent inspects incoming alerts, categorizes incident severity, and builds a live timeline of events and actions. It guides responders through investigation steps, suggests mitigation actions, and records decisions and communications. After recovery, it generates a postmortem draft and proposes concrete runbook updates and RCA items to close the loop.
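The flow above (inspect an alert, categorize severity, start a timeline) can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `Alert` fields, the `classify` thresholds, and the `Incident` class are hypothetical and are not taken from the skill's own files.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import List, Optional


class Severity(Enum):
    SEV1 = 1  # assumed meaning: severe customer-facing outage
    SEV2 = 2  # assumed meaning: significant degradation
    SEV3 = 3  # assumed meaning: minor or internal impact


@dataclass
class Alert:
    # Hypothetical alert shape; real alert payloads will differ.
    service: str
    error_rate: float       # fraction of failing requests, 0.0 to 1.0
    customer_facing: bool


@dataclass
class TimelineEntry:
    at: datetime
    actor: str
    note: str


@dataclass
class Incident:
    title: str
    severity: Severity
    timeline: List[TimelineEntry] = field(default_factory=list)

    def log(self, actor: str, note: str, at: Optional[datetime] = None) -> None:
        """Append a timestamped entry; defaults to the current UTC time."""
        self.timeline.append(
            TimelineEntry(at or datetime.now(timezone.utc), actor, note)
        )


def classify(alert: Alert) -> Severity:
    # Illustrative thresholds only; tune them to your own severity policy.
    if alert.customer_facing and alert.error_rate >= 0.5:
        return Severity.SEV1
    if alert.customer_facing or alert.error_rate >= 0.5:
        return Severity.SEV2
    return Severity.SEV3


def open_incident(alert: Alert) -> Incident:
    """Create an incident from an alert and record the first timeline entry."""
    incident = Incident(f"{alert.service} errors", classify(alert))
    incident.log("agent", f"Alert received, classified as {incident.severity.name}")
    return incident
```

Storing timestamps in UTC keeps entries comparable when responders work across time zones.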

When to use it

- Active production outages and service degradation
- SEV1 or SEV2 incidents requiring coordinated response
- Creating structured postmortems after an incident
- Updating or validating runbooks based on lessons learned
- Training on incident response procedures and tabletop exercises

Best practices

- Triage quickly: prioritize stabilizing service before deep analysis
- Maintain a clear, timestamped incident timeline for all actions
- Document decisions and who approved mitigations in real time
- Focus postmortems on root causes and actionable prevention items
- Keep runbooks concise, tested, and version-controlled
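One way to make the practice of recording decisions and approvers concrete is to refuse a mitigation entry that has no named approver. The sketch below is hypothetical: `Decision` and `record_decision` are illustrative names, not part of the skill.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Decision:
    at: datetime        # when the decision was recorded (UTC)
    action: str         # the mitigation that was approved
    approved_by: str    # who signed off on it


def record_decision(log: List[Decision], action: str, approved_by: str) -> Decision:
    """Append a decision to the log, rejecting entries without an approver."""
    if not approved_by.strip():
        raise ValueError("every mitigation needs a named approver")
    decision = Decision(datetime.now(timezone.utc), action, approved_by.strip())
    log.append(decision)
    return decision
```

Making the record immutable (`frozen=True`) is a design choice that discourages editing the log after the fact; corrections would be appended as new entries instead.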

Example use cases

- Respond to a SEV1 outage impacting a customer-facing API and coordinate mitigations
- Draft a postmortem with timeline, RCA, and follow-up action items after recovery
- Update an existing runbook with a verified rollback procedure and new diagnostic checks
- Run a simulated incident exercise to validate team roles and communication paths
- Produce a concise incident report for stakeholders summarizing impact and next steps

FAQ

Can this skill handle multiple active incidents?

Yes. It supports managing separate incident timelines and tagging resources per incident so teams can switch context safely.
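As a rough illustration of separate timelines and per-incident resource tags, the following sketch keys everything by an incident ID. The `IncidentRegistry` class and the `INC-1` style IDs are assumptions for the example; the skill's actual bookkeeping is not shown in the source.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, List, Set, Tuple


class IncidentRegistry:
    """Keeps one timeline and one resource-tag set per incident ID."""

    def __init__(self) -> None:
        self._timelines: Dict[str, List[Tuple[datetime, str]]] = defaultdict(list)
        self._resources: Dict[str, Set[str]] = defaultdict(set)

    def log(self, incident_id: str, note: str) -> None:
        """Record a timestamped note on one incident's timeline only."""
        self._timelines[incident_id].append((datetime.now(timezone.utc), note))

    def tag(self, incident_id: str, resource: str) -> None:
        """Associate a resource (service, database, host) with an incident."""
        self._resources[incident_id].add(resource)

    def timeline(self, incident_id: str) -> List[str]:
        return [note for _, note in self._timelines[incident_id]]

    def incidents_for(self, resource: str) -> List[str]:
        """Return the sorted IDs of incidents that tagged this resource."""
        return sorted(i for i, tags in self._resources.items() if resource in tags)
```

Looking up incidents by resource shows quickly when two active incidents touch the same system, which is when context switching is most error-prone.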

Does it create final postmortems automatically?

It drafts structured postmortems with timelines, RCA, and action items; human review is required to finalize and assign ownership.
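The split between an automated draft and human finalization could be modeled as below. This is a sketch under stated assumptions: the `PostmortemDraft` fields, the `reviewed` flag, and the Markdown layout are hypothetical and do not reproduce the skill's templates.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # left unset until a human assigns it


@dataclass
class PostmortemDraft:
    title: str
    timeline: List[Tuple[str, str]]   # (timestamp, event) pairs
    root_cause: str
    action_items: List[ActionItem] = field(default_factory=list)
    reviewed: bool = False            # meant to be set only by a human reviewer

    def unowned(self) -> List[ActionItem]:
        """Action items that still have no owner."""
        return [a for a in self.action_items if a.owner is None]

    def can_finalize(self) -> bool:
        # Final only after human review and once every item has an owner.
        return self.reviewed and not self.unowned()

    def to_markdown(self) -> str:
        status = "FINAL" if self.can_finalize() else "DRAFT"
        lines = [f"# Postmortem: {self.title} ({status})", "", "## Timeline"]
        lines += [f"- {ts}: {event}" for ts, event in self.timeline]
        lines += ["", "## Root cause", self.root_cause, "", "## Action items"]
        lines += [
            f"- {a.description} (owner: {a.owner or 'unassigned'})"
            for a in self.action_items
        ]
        return "\n".join(lines)
```

The rendered document stays marked as a draft until both conditions hold, so an unreviewed postmortem cannot be mistaken for a finished one.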