This skill guides incident response from investigation to prevention, documenting the root cause and producing a postmortem so that systemic fixes can follow.
`npx playbooks add skill phrazzld/claude-config --skill incident-response`
Review the files below or copy the command above to add this skill to your agents.
---
name: incident-response
description: |
  Investigate, fix, postmortem, prevent.
  Full incident lifecycle from bug report to systemic prevention.
  Use when: production down, critical bug, incident response, post-incident review.
  Composes: /investigate, /fix, /postmortem, /codify-learning.
argument-hint: <bug-report>
effort: max
---
# /incident-response
Fix the fire. Then prevent the next one.
## Role
Incident commander running the response lifecycle.
## Objective
Resolve the incident described in `$ARGUMENTS`. Fix it, verify it, learn from it, prevent recurrence.
## Latitude
- Multi-AI investigation: Codex (stack traces), Gemini (research), Thinktank (validate hypothesis)
- Create branch immediately: `fix/incident-$(date +%Y%m%d-%H%M)`
- Demand observable proof — never trust "should work"
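
The branch name pattern above embeds a timestamp, so evaluating `$(date ...)` more than once can produce two different names if the minute rolls over in between. A minimal sketch, assuming a hypothetical helper called `incident_branch_name` (not part of the skill), captures the name once so it can be reused:

```shell
#!/usr/bin/env bash
# Sketch only: incident_branch_name is a hypothetical helper, not part of the skill.
# It builds the branch name from the pattern fix/incident-YYYYmmdd-HHMM.
incident_branch_name() {
  # Optional first argument: a fixed timestamp (handy for testing or replaying).
  local stamp="${1:-$(date +%Y%m%d-%H%M)}"
  printf 'fix/incident-%s\n' "$stamp"
}

# Usage (assumes the default branch is named "main", as the workflow does):
#   branch="$(incident_branch_name)"
#   git checkout -b "$branch" main
```

Storing the result in a variable keeps later commands (push, pull request, notes in INCIDENT.md) pointed at the same branch name.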
## Workflow
1. **Investigate** — `/investigate $ARGUMENTS` (creates INCIDENT.md with timeline, evidence, root cause)
2. **Branch** — `fix/incident-$(date +%Y%m%d-%H%M)` from main
3. **Fix** — `/fix "Root cause from investigation"` (Codex delegation + verify)
4. **Verify** — Observable proof: log entries, metrics, database state. Mark UNVERIFIED until confirmed.
5. **Postmortem** — `/postmortem` (blameless: summary, timeline, 5 Whys, follow-ups)
6. **Prevent** — If systemic: create prevention issue, optionally `/autopilot` it
7. **Codify** — `/codify-learning` (regression test, agent update, monitoring rule)
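
Step 4 treats a fix as UNVERIFIED until evidence confirms it. One way to make that default explicit is a small gate that reports VERIFIED only when every supplied check passes. This is a sketch under assumptions: `verification_status` and the check names in the usage comment are hypothetical, and each check is assumed to be a command or shell function that exits 0 on success.

```shell
#!/usr/bin/env bash
# Sketch only: verification_status is a hypothetical helper.
# Prints VERIFIED if every named check exits 0, otherwise UNVERIFIED.
verification_status() {
  # No evidence supplied means nothing has been proven yet.
  if [ "$#" -eq 0 ]; then
    echo "UNVERIFIED"
    return 1
  fi
  local check
  for check in "$@"; do
    # Each argument is the name of a command or function to run without arguments.
    if ! "$check" >/dev/null 2>&1; then
      echo "UNVERIFIED"
      return 1
    fi
  done
  echo "VERIFIED"
}

# Usage (check_logs and check_metrics are hypothetical functions you would define):
#   verification_status check_logs check_metrics
```

The empty-argument case deliberately returns UNVERIFIED, so "should work" without evidence never passes the gate.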
## Output
Incident resolved, postmortem filed, prevention issue created (if applicable).
This skill manages the full incident lifecycle from alert to systemic prevention. It guides an incident commander through investigation, fix deployment, verification, postmortem, and codifying learnings. The goal is to restore service quickly, prove the fix with observability, and remove root causes to prevent recurrence.
The skill orchestrates a stepwise workflow: run a focused investigation to build an evidence-backed timeline and identify root cause, create a dedicated fix branch, implement and verify fixes with observable proof, and produce a blameless postmortem. It then converts findings into prevention work: regression tests, monitoring rules, and follow-up issues. The process emphasizes verifiable outcomes and repeatable artifacts (incident notes, postmortem, prevention ticket).
What counts as observable proof?
Concrete, reproducible signals such as log entries showing error resolution, metric trends returning to baseline, or database state updates that confirm the fix.
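
As an illustration of log-based proof, the sketch below counts lines that still match an error pattern in a log file. The helper name `fix_confirmed_in_log`, the pattern, and the log path are all assumptions for the example. It only makes sense when the log covers a post-fix window that saw real traffic.

```shell
#!/usr/bin/env bash
# Sketch only: fix_confirmed_in_log is a hypothetical helper.
# Arguments: an error pattern and the path to a log covering the post-fix window.
fix_confirmed_in_log() {
  local pattern="$1" logfile="$2"
  # An unreadable log is not proof of anything.
  [ -r "$logfile" ] || { echo "UNVERIFIED: cannot read $logfile"; return 2; }
  local count
  # grep -c prints the match count; it exits 1 on zero matches, hence "|| true".
  count="$(grep -c -- "$pattern" "$logfile" || true)"
  if [ "$count" -eq 0 ]; then
    echo "VERIFIED: 0 lines match '$pattern'"
  else
    echo "UNVERIFIED: $count lines still match '$pattern'"
    return 1
  fi
}

# Usage (pattern and path are placeholders):
#   fix_confirmed_in_log 'TimeoutError' /var/log/app/post-fix.log
```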
When should I create a prevention issue?
Create a prevention issue whenever the root cause indicates a systemic gap (testing, monitoring, design) or when recurrence risk is non-trivial.
How do I keep postmortems blameless?
Focus on facts and timelines, avoid naming individuals as causes, and translate findings into actionable follow-ups and process changes.
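
A fixed skeleton can help keep the write-up on facts rather than people. The sketch below prints one with the sections the workflow lists for the postmortem (summary, timeline, 5 Whys, follow-ups). The helper name `postmortem_skeleton` and the output file name in the usage comment are assumptions, and the real `/postmortem` command may use a different layout.

```shell
#!/usr/bin/env bash
# Sketch only: postmortem_skeleton is a hypothetical helper, not the /postmortem command.
# Prints an empty postmortem outline to stdout.
postmortem_skeleton() {
  local title="${1:-Untitled incident}"
  cat <<EOF
# Postmortem: $title

## Summary

## Timeline

## 5 Whys

## Follow-ups
EOF
}

# Usage (the file name is a placeholder):
#   postmortem_skeleton "Checkout errors" > POSTMORTEM.md
```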