
This skill guides incident responders through detection, investigation, mitigation, recovery, and postmortems using SRE best practices.

npx playbooks add skill greyhaven-ai/claude-code-config --skill incident-response

SKILL.md
---
name: grey-haven-incident-response
description: "Handle production incidents with SRE best practices including detection, investigation, mitigation, recovery, and postmortems. Use when dealing with production outages, SEV1/SEV2 incidents, creating postmortems, or updating runbooks."
# v2.0.43: Skills to auto-load for incident response
skills:
  - grey-haven-code-style
  - grey-haven-observability-monitoring
  - grey-haven-smart-debugging
# v2.0.74: Tools for incident response
allowed-tools:
  - Read
  - Write
  - Bash
  - Grep
  - Glob
  - TodoWrite
  - WebFetch
---

# Incident Response Skill

Handle production incidents with SRE best practices including detection, investigation, mitigation, recovery, and postmortems.

## Description

Provides production incident response following SRE methodologies, including incident timeline tracking, root cause analysis (RCA) documentation, and runbook updates.

## What's Included

- **Examples**: SEV1 incident handling, postmortem templates
- **Reference**: SRE best practices, incident severity levels
- **Templates**: Incident reports, RCA documents, runbook updates

## Use When

- Production outages
- SEV1/SEV2 incidents
- Postmortem creation
- Runbook updates

## Related Agents

- `incident-responder`

**Skill Version**: 1.0

Overview

This skill equips an incident response agent to manage production incidents using established SRE practices. It covers detection, investigation, mitigation, recovery, and postmortem workflows to restore service and reduce the chance of recurrence. The skill includes templates, severity guidance, and runbook update helpers to standardize response across teams.

How this skill works

The agent inspects incoming alerts, categorizes incident severity, and builds a live timeline of events and actions. It guides responders through investigation steps, suggests mitigation actions, and records decisions and communications. After recovery, it generates a postmortem draft and proposes concrete runbook updates and RCA items to close the loop.
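
As a rough illustration of the triage-and-timeline step described above, the following Python sketch classifies an alert into a severity level and appends timestamped entries to an incident timeline. The `Alert` fields, the thresholds, the `SEV3` fallback, and the `classify_severity` function are all hypothetical and not part of the skill; real severity criteria come from your team's own definitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Alert:
    # Hypothetical alert shape; a real alert payload will differ.
    service: str
    error_rate: float       # fraction of failing requests, 0.0 to 1.0
    customer_facing: bool


@dataclass
class TimelineEntry:
    timestamp: datetime
    actor: str
    action: str


@dataclass
class Incident:
    title: str
    severity: str
    timeline: list[TimelineEntry] = field(default_factory=list)

    def log(self, actor: str, action: str) -> TimelineEntry:
        # Stamp every entry in UTC so the timeline is unambiguous.
        entry = TimelineEntry(datetime.now(timezone.utc), actor, action)
        self.timeline.append(entry)
        return entry


def classify_severity(alert: Alert) -> str:
    # Illustrative thresholds only (assumed, not defined by the skill).
    if alert.customer_facing and alert.error_rate >= 0.5:
        return "SEV1"
    if alert.customer_facing and alert.error_rate >= 0.1:
        return "SEV2"
    return "SEV3"  # assumed catch-all for lower-impact alerts


alert = Alert(service="checkout-api", error_rate=0.62, customer_facing=True)
incident = Incident(
    title=f"{alert.service} elevated errors",
    severity=classify_severity(alert),
)
incident.log("agent", f"Alert received; classified as {incident.severity}")
incident.log("on-call", "Rolled back latest deploy")
```

Keeping classification as a small pure function makes the thresholds easy to review and adjust separately from the timeline bookkeeping.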

When to use it

  • Active production outages and service degradation
  • SEV1 or SEV2 incidents requiring coordinated response
  • Creating structured postmortems after an incident
  • Updating or validating runbooks based on lessons learned
  • Training teams on incident response procedures, including tabletop exercises

Best practices

  • Triage quickly: prioritize stabilizing service before deep analysis
  • Maintain a clear, timestamped incident timeline for all actions
  • Document decisions and who approved mitigations in real time
  • Focus postmortems on root causes and actionable prevention items
  • Keep runbooks concise, tested, and version-controlled
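
The timeline and approval practices above could be captured as structured records rather than free-form notes. The sketch below is one assumed shape for such a record; the `Decision` class, its field names, and the example timestamp are hypothetical and not a format defined by this skill.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Decision:
    timestamp: datetime
    action: str
    approved_by: str

    def as_line(self) -> str:
        # ISO 8601 in UTC keeps entries sortable and unambiguous across time zones.
        ts = self.timestamp.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        return f"{ts} | {self.action} | approved by {self.approved_by}"


def record_decision(
    action: str, approved_by: str, at: Optional[datetime] = None
) -> Decision:
    # Refuse to record a mitigation without a named approver.
    if not approved_by.strip():
        raise ValueError("every mitigation needs a named approver")
    return Decision(at or datetime.now(timezone.utc), action, approved_by)


# Example with a fixed, made-up timestamp so the output is reproducible.
d = record_decision(
    "Roll back release to previous version",
    "incident commander",
    at=datetime(2024, 1, 15, 14, 32, 0, tzinfo=timezone.utc),
)
print(d.as_line())
# prints "2024-01-15T14:32:00Z | Roll back release to previous version | approved by incident commander"
```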

Example use cases

  • Respond to a SEV1 outage impacting a customer-facing API and coordinate mitigations
  • Draft a postmortem with timeline, RCA, and follow-up action items after recovery
  • Update an existing runbook with a verified rollback procedure and new diagnostic checks
  • Run a simulated incident exercise to validate team roles and communication paths
  • Produce a concise incident report for stakeholders summarizing impact and next steps

FAQ

Can this skill handle multiple active incidents?

Yes. It supports managing separate incident timelines and tagging resources per incident so teams can switch context safely.

Does it create final postmortems automatically?

No. It drafts structured postmortems with timelines, RCA, and action items, but a human must review each draft to finalize it and assign ownership.
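
To illustrate the draft-then-review flow, here is a minimal Python sketch that renders a postmortem draft from a timeline, a root-cause summary, and action items. The section headings, the `render_postmortem_draft` function, the `DRAFT` marker, and the sample data are illustrative assumptions, not the skill's actual template. Action items start unassigned because ownership is left to human reviewers.

```python
def render_postmortem_draft(title, severity, timeline, root_cause, action_items):
    """Return a Markdown postmortem draft as a single string.

    timeline: list of (timestamp_string, description) tuples
    action_items: list of strings; owners are deliberately left unassigned
    """
    lines = [
        f"# Postmortem: {title} ({severity})",
        "",
        "Status: DRAFT (requires human review)",
        "",
        "## Timeline",
    ]
    lines += [f"- {ts}: {desc}" for ts, desc in timeline]
    lines += ["", "## Root Cause", root_cause, "", "## Action Items"]
    lines += [f"- [ ] {item} (owner: unassigned)" for item in action_items]
    return "\n".join(lines)


# Made-up sample data for demonstration only.
draft = render_postmortem_draft(
    title="Checkout API outage",
    severity="SEV1",
    timeline=[
        ("14:20Z", "Error-rate alert fired"),
        ("14:32Z", "Rollback approved and started"),
        ("14:41Z", "Error rate back to baseline"),
    ],
    root_cause="A configuration change removed a required connection setting.",
    action_items=[
        "Add config validation to the deploy pipeline",
        "Document the rollback procedure in the runbook",
    ],
)
print(draft)
```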