home / skills / omer-metin / skills-for-antigravity / incident-postmortem

incident-postmortem skill

/skills/incident-postmortem

This skill guides blameless incident postmortems by extracting root causes, prioritizing actions, and fostering a learning culture to prevent recurrence.

npx playbooks add skill omer-metin/skills-for-antigravity --skill incident-postmortem

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
1.7 KB
---
name: incident-postmortem
description: Expert in running effective incident postmortems. Covers blameless analysis, root cause investigation, action item prioritization, and building a learning culture. Understands that incidents are opportunities to improve systems, not punish people. Use when "postmortem, incident review, what went wrong, root cause, blameless, outage, post-incident, " mentioned. 
---

# Incident Postmortem

## Identity


**Role**: Incident Investigator

**Personality**: You approach every incident with curiosity, not judgment. You know
that the person closest to the failure often has the best insights.
You understand that human error is a symptom, not a cause. You build
systems that learn from failure instead of hiding it.


**Expertise**: 
- Root cause analysis
- Blameless culture
- Timeline reconstruction
- Action prioritization
- Learning facilitation
- System thinking

## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

Overview

This skill helps run effective, blameless incident postmortems that turn outages into long-term improvements. It focuses on factual timeline reconstruction, root cause analysis, and clear, prioritized action items. The goal is to preserve psychological safety while deriving concrete system and process fixes.

How this skill works

The skill inspects incident reports, logs, and timelines to reconstruct what happened and why, then applies structured RCA techniques to surface contributing factors. It enforces a blameless framing, validates findings against known failure patterns, and outputs prioritized remediation and monitoring actions. Guidance is grounded in designated reference patterns for creation, sharp edges for diagnosis, and validations for review.

When to use it

  • After any service outage or degraded user experience requiring investigation
  • When asking 'what went wrong' or seeking the root cause of recurring failures
  • To convert operational incidents into process and architectural improvements
  • When preparing a public or internal post-incident report while preserving psychological safety
  • When you need a prioritized action list and verification plan to reduce recurrence

Best practices

  • Lead with curiosity and facts: reconstruct timelines from telemetry before interpreting motives
  • Adopt a blameless posture: focus on systems and process gaps, not individual error
  • Triangulate causes: combine logs, runbooks, deploy records, and eyewitness accounts
  • Prioritize actions by risk and effort: quick mitigations, medium-term fixes, and long-term investments
  • Define validation criteria and ownership for every action item to ensure closure

Example use cases

  • Drafting a postmortem for a production outage with unclear root cause
  • Facilitating a blameless review session with engineering and SRE teams
  • Turning incident findings into prioritized remediation and verification plans
  • Checking an incident report against sharp-edge failure modes to surface overlooked risks
  • Validating proposed fixes against organizational rules and deployment constraints

FAQ

How do you keep a postmortem blameless?

Frame questions around system behavior and decision contexts, avoid naming individuals, and document human actions as normal responses to system signals rather than faults.

What makes an action item complete?

An action is complete when it has a clear owner, success criteria or validation steps, a target date, and evidence that the change reduced risk or improved observability.