
incident-response skill


This skill guides incident triage, containment, and postmortems to reduce outages and accelerate recovery with structured playbooks.

npx playbooks add skill nickcrew/claude-cortex --skill incident-response

Review the files below or copy the command above to add this skill to your agents.

Files (3): SKILL.md (3.1 KB)
---
name: incident-response
description: Incident triage, cascade prevention, and postmortem methodology. Use when handling production incidents, designing resilience patterns, or conducting chaos engineering exercises.
---

# Incident Response

Structured incident management from detection through postmortem, with resilience patterns for preventing and containing cascading failures.

## When to Use

- Production incident in progress (outage, degradation, data loss)
- Designing circuit breakers, bulkheads, or fallback strategies
- Conducting or planning chaos engineering exercises
- Writing or reviewing postmortem documents
- Establishing on-call procedures and escalation paths

Avoid when:
- The issue is a development-time bug with no production impact
- Designing general system architecture (use system-design instead)

## Quick Reference

| Topic | Load reference |
| --- | --- |
| **Triage Framework** | `skills/incident-response/references/triage-framework.md` |
| **Postmortem Patterns** | `skills/incident-response/references/postmortem-patterns.md` |

## Incident Response Workflow

### Phase 1: Detect

- Alert fires or user report received
- Confirm the issue is real (not a false positive)
- Identify affected services and user impact scope

### Phase 2: Triage

- Classify severity (P0-P3)
- Assign incident commander
- Open communication channel (war room, Slack channel)
- Begin status page updates
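The triage steps above can be captured in a minimal incident record. This is an illustrative sketch, not part of the skill's files; the `Incident` class and its field names are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Incident:
    """Minimal incident record opened during triage."""
    title: str
    severity: str        # "P0".."P3", per the severity framework below
    commander: str       # assigned incident commander
    channel: str         # single communication channel (war room / Slack)
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status_updates: list = field(default_factory=list)

    def post_status(self, message: str) -> None:
        """Append a timestamped entry for the status page."""
        self.status_updates.append((datetime.now(timezone.utc), message))

inc = Incident("Checkout 500s", "P1", "alice", "#inc-checkout")
inc.post_status("Investigating elevated error rate on checkout")
```

Keeping severity, commander, and channel in one record makes it harder to skip any of the three during a noisy incident.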

### Phase 3: Contain

- Stop the bleeding: rollback, feature flag, traffic shift
- Prevent cascade: circuit breakers, load shedding, bulkhead isolation
- Communicate: stakeholder updates every 15 minutes for P0/P1
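As a sketch of the cascade-prevention idea, here is a minimal failure-counting circuit breaker. The thresholds and names are illustrative, and it assumes a single-threaded caller (a production breaker would need locking and per-dependency state):

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` seconds."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker opened

    def allow(self) -> bool:
        """Should the next call to the downstream dependency proceed?"""
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Reset window elapsed: close the breaker and allow traffic again.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # still open: shed this call instead of cascading

    def record(self, success: bool) -> None:
        """Report the outcome of a call so the breaker can update its state."""
        if success:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
```

While the breaker is open, callers fall back (cached response, degraded mode) rather than queuing requests against a failing dependency.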

### Phase 4: Resolve

- Implement fix (minimal viable fix first)
- Validate in staging if time permits
- Deploy with monitoring and rollback plan ready
- Confirm recovery with metrics returning to baseline
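Confirming recovery can be as simple as checking that recent metric samples sit near the pre-incident baseline. The helper below is an illustrative sketch with assumed names and thresholds, not part of the skill:

```python
def recovered(samples: list[float], baseline: float,
              tolerance: float = 0.10, window: int = 5) -> bool:
    """True if the last `window` samples are all within `tolerance` of baseline."""
    recent = samples[-window:]
    return len(recent) == window and all(
        abs(s - baseline) <= tolerance * baseline for s in recent
    )
```

Requiring a full window of in-range samples, rather than a single good reading, guards against declaring recovery on a transient dip in the error rate.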

### Phase 5: Postmortem

- Document timeline within 48 hours
- Conduct blameless review with all participants
- Identify root cause and contributing factors
- Assign action items with owners and deadlines
- Update runbooks and alerting based on lessons learned

## Severity Framework

| Level | Impact | Response Time | Examples |
|-------|--------|---------------|----------|
| **P0** | Complete outage, data loss, security breach | Immediate (< 5 min) | Service down, data corruption, credential leak |
| **P1** | Major feature broken, significant user impact | < 30 min | Payment processing failed, auth broken for region |
| **P2** | Degraded performance, partial feature loss | < 4 hours | Elevated latency, non-critical feature unavailable |
| **P3** | Minor issue, workaround available | Next business day | UI glitch, slow report generation, cosmetic error |
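The response-time targets in the table can be encoded for alerting on missed acknowledgements. This mapping is an illustrative sketch; P3's "next business day" is approximated here as 24 hours:

```python
# Target response times in minutes, following the severity table above.
RESPONSE_TIME_MIN = {"P0": 5, "P1": 30, "P2": 240, "P3": 24 * 60}

def overdue(severity: str, minutes_since_alert: float) -> bool:
    """Has the response-time target for this severity level been missed?"""
    return minutes_since_alert > RESPONSE_TIME_MIN[severity]
```

Paging on `overdue()` keeps the escalation path honest: a P1 that sits unacknowledged for 45 minutes should escalate automatically rather than wait for someone to notice.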

## Output

- Incident timeline and severity classification
- Containment actions taken
- Postmortem document with action items
- Updated runbooks and alerting rules

## Common Mistakes

- Skipping severity classification and treating everything as P0
- Making changes without a rollback plan
- Forgetting to communicate status to stakeholders
- Writing postmortems that assign blame instead of identifying systemic issues
- Not following up on postmortem action items

Overview

This skill provides a structured incident-response playbook for detection, triage, containment, resolution, and postmortem. It focuses on preventing cascading failures with resilience patterns like circuit breakers, bulkheads, and load shedding. Use it to run live incident operations, design recovery strategies, and capture actionable postmortems.

How this skill works

The skill guides operators through five phases: Detect, Triage, Contain, Resolve, and Postmortem. It enforces severity classification (P0–P3), assigns roles (incident commander), prescribes containment actions (rollback, feature flags, traffic shifts), and defines communication cadence. After recovery it produces a timeline, root-cause analysis, and tracked action items to update runbooks and alerts.

When to use it

  • Handling a production outage, data corruption, or security incident
  • Designing resilience patterns like circuit breakers, bulkheads, or graceful degradation
  • Planning or running chaos engineering experiments to validate failure modes
  • Writing, reviewing, or enforcing postmortem and remediation processes
  • Establishing on-call procedures and escalation paths

Best practices

  • Classify severity early and consistently (P0–P3) to set response tempo
  • Assign an incident commander and open a single communication channel for coordination
  • Contain first: apply minimal viable fixes with rollback plans before full remediation
  • Communicate status frequently to stakeholders, more often for higher severities
  • Run a blameless postmortem within 48 hours with assigned owners and deadlines for actions

Example use cases

  • Triage and contain a P0 outage impacting multiple regions with traffic shifting and circuit breakers
  • Design bulkhead and load-shedding strategies for critical payment flows
  • Run a chaos engineering runbook and validate abort/rollback behavior under failure
  • Draft a postmortem that documents timeline, root cause, and tracked remediation items
  • Create or update runbooks and alert thresholds after recurring degradations

FAQ

How quickly should I start a postmortem?

Begin drafting the timeline within 48 hours and schedule a blameless review as soon as key participants are available.

When should I apply rollbacks versus feature flags?

Prefer a rollback when a recent deployment is clearly the cause and severe impact is ongoing; use feature flags to disable the offending functionality or shift traffic gradually when a full rollback is risky or the problem is isolated to a single feature.