
incident-responder skill


This skill helps you orchestrate production incident response from triage through post-mortem, reducing downtime and preventing recurrence.

npx playbooks add skill eddiebe147/claude-settings --skill incident-responder

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: Incident Responder
slug: incident-responder
description: Manage production incidents with structured response, debugging, and post-mortem documentation
category: technical
complexity: advanced
version: "1.0.0"
author: "ID8Labs"
triggers:
  - "production incident"
  - "site down"
  - "service outage"
tags:
  - incident-response
  - debugging
  - devops
---

# Incident Responder

Handle production incidents with urgency and precision. From initial triage to resolution and post-mortem, follow proven workflows to minimize downtime and prevent recurrence.

## Core Workflows

### Workflow 1: Incident Triage
1. **Detection** - Confirm the incident and its scope
2. **Severity Assessment** - Classify impact level (SEV1-4)
3. **Communication** - Notify stakeholders
4. **Team Assembly** - Rally required responders
5. **Initial Diagnosis** - Identify likely cause

### Workflow 2: Resolution
1. **Containment** - Stop the bleeding
2. **Root Cause** - Identify underlying issue
3. **Fix Implementation** - Deploy the solution
4. **Verification** - Confirm resolution
5. **Status Update** - Communicate resolution

### Workflow 3: Post-Mortem
1. **Timeline** - Document what happened when
2. **Root Cause Analysis** - Run a 5 Whys analysis
3. **Action Items** - Identify preventive measures
4. **Documentation** - Write post-mortem report
5. **Review** - Share learnings with team

## Quick Reference

| Action | Command |
|--------|---------|
| Start incident | "We have a production incident" |
| Triage | "What's the severity and impact?" |
| Post-mortem | "Create post-mortem for incident" |

Overview

This skill provides a structured incident response workflow to manage production outages from detection through post-mortem. It guides responders through triage, containment, resolution, and documentation to minimize downtime and prevent recurrence. The skill emphasizes clear communication, rapid diagnosis, and actionable follow-ups.

How this skill works

The skill inspects incident context and walks teams through three core workflows: Triage, Resolution, and Post-Mortem. It prompts severity assessment, coordinates responders, suggests containment steps, and helps capture timelines and root-cause analysis. Outputs include status messages, verification checks, and a ready-to-share post-mortem draft.
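One way to picture the three workflows is as ordered checklists. The sketch below is illustrative only: the step names are copied from SKILL.md, while the `Checklist` class and its methods are hypothetical helpers and not part of the skill itself.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Step names come from the three workflows in SKILL.md.
WORKFLOWS = {
    "Triage": ["Detection", "Severity Assessment", "Communication",
               "Team Assembly", "Initial Diagnosis"],
    "Resolution": ["Containment", "Root Cause", "Fix Implementation",
                   "Verification", "Status Update"],
    "Post-Mortem": ["Timeline", "Root Cause Analysis", "Action Items",
                    "Documentation", "Review"],
}


@dataclass
class Checklist:
    """Hypothetical helper that tracks progress through one workflow."""
    workflow: str
    done: list[str] = field(default_factory=list)

    def next_step(self) -> str | None:
        """Return the first step not yet completed, or None when finished."""
        for step in WORKFLOWS[self.workflow]:
            if step not in self.done:
                return step
        return None

    def complete(self, step: str) -> None:
        """Mark a step done, refusing steps taken out of order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"expected {expected!r}, got {step!r}")
        self.done.append(step)
```

Enforcing the order is a design choice made for this sketch; a real incident may call for revisiting earlier steps, so a looser tracker could be just as reasonable.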

When to use it

  • A production service is degraded or unavailable
  • Multiple users or customers report the same error or outage
  • The root cause is still unclear after initial checks and coordinated responders are needed
  • An incident must be documented for compliance or learning
  • An incident has been resolved and needs a formal post-mortem with assigned action items

Best practices

  • Confirm scope and impact before escalating to avoid unnecessary disruption
  • Assign a single incident commander to coordinate communications and decisions
  • Contain first, then fix: prioritize stopping customer impact over permanent changes
  • Keep concise, timestamped timeline entries during the incident
  • Create clear, actionable post-mortem items with owners and due dates
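The last two practices above can be sketched as small record types. This is a minimal illustration under stated assumptions: `TimelineEntry`, `ActionItem`, and `log` are hypothetical names, and timestamps are assumed to be timezone-aware and rendered in UTC.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date, datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime  # assumed timezone-aware
    note: str

    def render(self) -> str:
        """Format the entry as one concise, timestamped line in UTC."""
        stamp = self.at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")
        return f"{stamp} UTC  {self.note}"


@dataclass
class ActionItem:
    """A post-mortem follow-up; owner and due date are required fields."""
    description: str
    owner: str
    due: date


def log(timeline: list[TimelineEntry], note: str,
        at: datetime | None = None) -> None:
    """Append an entry (defaulting to now) and keep the timeline in order."""
    at = at or datetime.now(timezone.utc)
    timeline.append(TimelineEntry(at, note))
    timeline.sort(key=lambda entry: entry.at)
```

Making `owner` and `due` required means an action item without them cannot be constructed, which mirrors the practice of never leaving follow-ups unassigned.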

Example use cases

  • Start triage when monitoring alerts indicate service latency or error spikes
  • Assemble relevant engineers and stakeholders for a SEV1 outage
  • Guide responders through containment when a database or cache is overwhelmed
  • Draft a post-mortem with a 5 Whys root-cause section and tracked action items
  • Run verification steps after a hotfix or rollback and publish status updates

FAQ

Who should act as the incident commander?

Choose one experienced engineer or team lead who can make quick decisions and coordinate communications until the incident is resolved.

What severity levels should I use?

Use SEV1 for a full production outage affecting many customers, SEV2 for a major service degradation, SEV3 for limited impact, and SEV4 for minor issues or single-customer incidents.
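The four levels could be encoded roughly as follows. Treat this as a sketch rather than a policy: the input flags, the `MANY_CUSTOMERS` threshold, and the exact decision order are assumptions made for illustration, since the skill does not define numeric cutoffs.

```python
from enum import IntEnum

# Hypothetical cutoff for "many customers"; tune this to your own service.
MANY_CUSTOMERS = 100


class Severity(IntEnum):
    SEV1 = 1  # full production outage affecting many customers
    SEV2 = 2  # major degraded service
    SEV3 = 3  # limited impact
    SEV4 = 4  # minor issue or single-customer incident


def classify(full_outage: bool, major_degradation: bool,
             customers_affected: int) -> Severity:
    """Map coarse impact signals to a severity level (illustrative rules)."""
    if full_outage and customers_affected >= MANY_CUSTOMERS:
        return Severity.SEV1
    if (full_outage or major_degradation) and customers_affected > 1:
        return Severity.SEV2
    if customers_affected > 1:
        return Severity.SEV3
    return Severity.SEV4
```

For example, `classify(False, True, 5000)` yields `Severity.SEV2`, while any incident touching a single customer falls through to `Severity.SEV4`.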

How soon should a post-mortem be written?

Start the post-mortem within 48 hours, while details are fresh, then complete and review it, with action items assigned, within one to two weeks.
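As a worked example of those windows, the sketch below computes both dates. It assumes the clock starts at the moment of resolution and uses the upper bound of two weeks for the review; neither choice is specified by the skill, and `postmortem_deadlines` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def postmortem_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Return the latest start and review times, counted from resolution."""
    return {
        "start_by": resolved_at + timedelta(hours=48),
        "review_by": resolved_at + timedelta(weeks=2),
    }


resolved = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
deadlines = postmortem_deadlines(resolved)
print(deadlines["start_by"].date())   # 2024-01-03
print(deadlines["review_by"].date())  # 2024-01-15
```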