
incident-response skill

/skills/incident-response

This skill guides incident response from investigation through prevention, documenting the root cause and postmortem so that systemic fixes can follow.

npx playbooks add skill phrazzld/claude-config --skill incident-response

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (1.5 KB)
---
name: incident-response
description: |
  Investigate, fix, postmortem, prevent.
  Full incident lifecycle from bug report to systemic prevention.
  Use when: production down, critical bug, incident response, post-incident review.
  Composes: /investigate, /fix, /postmortem, /codify-learning.
argument-hint: <bug-report>
effort: max
---

# /incident-response

Fix the fire. Then prevent the next one.

## Role

Incident commander running the response lifecycle.

## Objective

Resolve the incident described in `$ARGUMENTS`. Fix it, verify it, learn from it, prevent recurrence.

## Latitude

- Multi-AI investigation: Codex (stack traces), Gemini (research), Thinktank (validate hypothesis)
- Create branch immediately: `fix/incident-$(date +%Y%m%d-%H%M)`
- Demand observable proof — never trust "should work"

## Workflow

1. **Investigate** — `/investigate $ARGUMENTS` (creates INCIDENT.md with timeline, evidence, root cause)
2. **Branch** — `fix/incident-$(date +%Y%m%d-%H%M)` from main (see the sketch after this list)
3. **Fix** — `/fix "Root cause from investigation"` (Codex delegation + verify)
4. **Verify** — Observable proof: log entries, metrics, database state. Mark UNVERIFIED until confirmed.
5. **Postmortem** — `/postmortem` (blameless: summary, timeline, 5 Whys, follow-ups)
6. **Prevent** — If systemic: create prevention issue, optionally `/autopilot` it
7. **Codify** — `/codify-learning` (regression test, agent update, monitoring rule)
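
A minimal shell sketch of the branch step, assuming a git checkout with `main` as the default branch; the slash commands run inside the agent session and are shown only as comments mapping to the steps above.

```bash
# Step 2: create the incident branch from main (naming scheme from this skill).
git checkout main && git pull
git checkout -b "fix/incident-$(date +%Y%m%d-%H%M)"

# Steps 1, 3, 5, 7 run inside the agent session, not in plain shell:
#   /investigate "<bug-report>"            -> INCIDENT.md (timeline, evidence, root cause)
#   /fix "Root cause from investigation"   -> patch plus verification
#   /postmortem                            -> blameless postmortem
#   /codify-learning                       -> regression test, agent update, monitoring rule
```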

## Output

Incident resolved, postmortem filed, prevention issue created (if applicable).

Overview

This skill manages the full incident lifecycle from alert to systemic prevention. It guides an incident commander through investigation, fix deployment, verification, postmortem, and codifying learnings. The goal is to restore service quickly, prove the fix with observability, and remove root causes to prevent recurrence.

How this skill works

The skill orchestrates a stepwise workflow: run a focused investigation to build an evidence-backed timeline and identify the root cause, create a dedicated fix branch, implement and verify the fix with observable proof, and produce a blameless postmortem. It then converts findings into prevention work: regression tests, monitoring rules, and follow-up issues. The process emphasizes verifiable outcomes and repeatable artifacts (incident notes, postmortem, prevention ticket).
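
As one illustration of those artifacts, a postmortem skeleton can be seeded from the shell before the review starts. This is a hypothetical sketch: the file name and exact headings are assumptions, following the summary / timeline / 5 Whys / follow-ups structure named in the workflow.

```bash
# Hypothetical sketch: seed a blameless postmortem skeleton (file name and headings are assumptions).
cat > POSTMORTEM.md <<'EOF'
# Postmortem: <incident title>

## Summary
## Timeline
## Root cause (5 Whys)
## Follow-ups
EOF
```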

When to use it

  • Production service is degraded or unavailable
  • A critical bug impacts customers or data integrity
  • You need a formal incident response and coordination plan
  • After service is restored, to conduct a blameless post-incident review
  • When systemic risks require creating prevention work and tests

Best practices

  • Act as incident commander: coordinate investigation, fixes, and communication
  • Create a fix branch immediately to contain changes and document intent
  • Require observable verification (logs, metrics, DB state) before marking resolved
  • Keep the postmortem blameless and focused on evidence and follow-ups
  • Turn learnings into concrete artifacts: regression tests, monitoring rules, and tickets

Example use cases

  • A deployment causes a sudden increase in error rates; run the investigation, cut a fix branch, patch, and verify the error-rate drop via metrics
  • A data corruption bug is reported; trace timeline, identify root cause, apply DB-safe fix, and add verification queries plus a regression test
  • Intermittent latency spikes; gather traces and logs, create a preventative monitoring rule, and codify test cases
  • Post-incident review for an outage: compile timeline, perform 5 Whys, and create follow-up prevention issues

FAQ

What counts as observable proof?

Concrete, reproducible signals such as log entries showing error resolution, metric trends returning to baseline, or database state updates that confirm the fix.
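
As a hedged illustration of gathering that proof (the log path, error signature, metrics endpoint, and query are assumptions, not part of this skill):

```bash
# Hypothetical verification sketch; paths, metric names, and the query are assumptions.
# Logs: the error signature should stop appearing after the fix is deployed.
grep -c "PaymentTimeoutError" /var/log/app/app.log

# Metrics: error rate should return to baseline (Prometheus-style query shown as an example).
curl -s "http://prometheus.example.com/api/v1/query?query=rate(http_errors_total[5m])"

# Database state: no new bad rows since the fix shipped.
psql "$DATABASE_URL" -c "SELECT count(*) FROM orders WHERE status = 'corrupt' AND updated_at > now() - interval '1 hour';"
```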

When should I create a prevention issue?

Create a prevention issue whenever the root cause indicates a systemic gap (testing, monitoring, design) or when recurrence risk is non-trivial.
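
If prevention work is tracked in GitHub, one way to file it is with the GitHub CLI; a sketch, with the title, label, and body text as placeholders:

```bash
# Sketch: file a prevention issue with the GitHub CLI (title, label, and body are placeholders).
gh issue create \
  --title "Prevention: <short root-cause summary>" \
  --label "prevention" \
  --body "Follow-up from the incident postmortem. Root cause: <...>. Proposed work: regression test, monitoring rule."
```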

How do I keep postmortems blameless?

Focus on facts and timelines, avoid naming individuals as causes, and translate findings into actionable follow-ups and process changes.