devops-runbooks skill

This skill helps you create actionable operational runbooks, incident response playbooks, and maintenance guides with clear escalation and rollback steps.

npx playbooks add skill 89jobrien/steve --skill devops-runbooks

SKILL.md

---
name: devops-runbooks
description: Operational runbook and procedure documentation specialist. Use when
  creating incident response procedures, operational playbooks, or system maintenance
  guides.
author: Joseph OBrien
status: unpublished
updated: '2025-12-23'
version: 1.0.1
tag: skill
type: skill
---

# DevOps Runbooks Skill

Creates actionable runbooks for operational procedures, incident response, and system maintenance.

## What This Skill Does

- Creates operational runbooks
- Documents incident procedures
- Defines escalation paths
- Provides troubleshooting guides
- Documents rollback procedures
- Captures operational knowledge

## When to Use

- Incident response planning
- On-call documentation
- System maintenance procedures
- Disaster recovery planning
- Knowledge transfer

## Reference Files

- `references/RUNBOOK.template.md` - Comprehensive operational runbook format

## Runbook Structure

1. **Overview** - Purpose and when to use
2. **Prerequisites** - Access and tools needed
3. **Quick Reference** - Key commands and URLs
4. **Procedure** - Step-by-step with verification
5. **Rollback** - How to revert changes
6. **Troubleshooting** - Common issues
7. **Escalation** - When and how to escalate
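
The seven sections above can be turned into an empty runbook file to fill in. The following is a minimal sketch, not the contents of `references/RUNBOOK.template.md`: the section names and hints come from the list above, while `render_skeleton` and the heading layout are illustrative assumptions.

```python
# Section names and hints are taken from the "Runbook Structure" list.
# render_skeleton and its Markdown layout are hypothetical, for illustration only.
SECTIONS = [
    ("Overview", "Purpose and when to use"),
    ("Prerequisites", "Access and tools needed"),
    ("Quick Reference", "Key commands and URLs"),
    ("Procedure", "Step-by-step with verification"),
    ("Rollback", "How to revert changes"),
    ("Troubleshooting", "Common issues"),
    ("Escalation", "When and how to escalate"),
]


def render_skeleton(title: str) -> str:
    """Return a Markdown runbook skeleton with one numbered heading per section."""
    lines = [f"# {title}", ""]
    for number, (name, hint) in enumerate(SECTIONS, start=1):
        # The hint is emitted as an HTML comment so it does not render in the final runbook.
        lines += [f"## {number}. {name}", "", f"<!-- {hint} -->", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_skeleton("Restart the payments API"))
```

Keeping the section list in one place means every generated runbook has the same order, which helps responders find Rollback and Escalation quickly.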

## Best Practices

- Commands must be copy-pasteable
- Include expected output for each step
- Document decision points clearly
- Define rollback at each step
- Keep procedures current (test regularly)
- Include escalation contacts
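
One way to read "define rollback at each step" is that every step carries its own verification and its own revert action, so a failed check can undo everything applied so far. The sketch below models that idea with plain callables; the `Step` class and `run_procedure` function are hypothetical names, not part of this skill.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step (hypothetical structure for illustration)."""
    name: str
    apply: Callable[[], None]      # performs the change
    verify: Callable[[], bool]     # True when the expected state is reached
    rollback: Callable[[], None]   # reverts this step's change


def run_procedure(steps: List[Step]) -> bool:
    """Run steps in order; on a failed verification, roll back in reverse order."""
    applied: List[Step] = []
    for step in steps:
        step.apply()
        applied.append(step)
        if not step.verify():
            # Undo the failing step first, then every earlier step.
            for finished in reversed(applied):
                finished.rollback()
            return False
    return True


if __name__ == "__main__":
    state = {"replicas": 2, "flag": False}
    steps = [
        Step("scale up",
             lambda: state.update(replicas=4),
             lambda: state["replicas"] == 4,
             lambda: state.update(replicas=2)),
        Step("enable flag",
             lambda: state.update(flag=True),
             lambda: False,  # simulated verification failure
             lambda: state.update(flag=False)),
    ]
    print(run_procedure(steps), state)  # False {'replicas': 2, 'flag': False}
```

Rolling back in reverse order mirrors how a human responder would unwind a change: the most recent action is reverted first.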

Overview

This skill produces actionable operational runbooks and procedure documentation for incident response, maintenance, and knowledge transfer. It focuses on clear step-by-step procedures, verifiable commands, and explicit rollback and escalation instructions. Use it to turn tribal operational knowledge into repeatable, testable runbooks that teams can follow under pressure.

How this skill works

The skill inspects the target system context and desired outcome, then generates a structured runbook with seven sections: Overview, Prerequisites, Quick Reference, Procedure, Rollback, Troubleshooting, and Escalation. It outputs copy-pasteable commands, the expected output for verifying each one, decision points, and escalation contacts. The resulting runbooks are concise, test-oriented, and designed for on-call and incident use.

When to use it

- Creating incident response procedures for production outages
- Documenting on-call playbooks and escalation paths
- Preparing system maintenance or deployment procedures
- Designing disaster recovery and rollback plans
- Transferring operational knowledge between teams

Best practices

- Provide copy-pasteable commands and exact file paths or URLs
- Include expected output or verification steps for each action
- Define rollback actions at every risky step and test them regularly
- Call out decision points and criteria for escalation explicitly
- Keep runbooks versioned and review them after incidents

Example use cases

- Write an incident runbook for a web service 503 error, with triage steps and quick fixes
- Create a scheduled maintenance runbook for database schema migration with rollback
- Draft an on-call escalation guide with contacts, paging rules, and SLA thresholds
- Generate a disaster recovery procedure for restoring from backups and validating integrity
- Produce a troubleshooting guide for degraded performance including metrics to check

FAQ

Can the runbook include commands for multiple environments?

Yes. Specify environment labels (staging, production) and include environment-specific commands and verification steps.
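
As a rough illustration of environment-specific commands, the sketch below fills one set of command templates with per-environment values. The namespaces, URLs, and the deployment name `web` are placeholders invented for this example, and `commands_for` is a hypothetical helper.

```python
from string import Template
from typing import Dict, List

# Placeholder values (assumptions): replace with your real namespaces and URLs.
ENVIRONMENTS: Dict[str, Dict[str, str]] = {
    "staging": {
        "namespace": "web-staging",
        "health_url": "https://staging.example.com/healthz",
    },
    "production": {
        "namespace": "web-prod",
        "health_url": "https://www.example.com/healthz",
    },
}

# The same steps for every environment; only the substituted values differ.
STEP_TEMPLATES = [
    Template("kubectl -n $namespace rollout status deployment/web"),
    Template("curl -fsS $health_url"),
]


def commands_for(env: str) -> List[str]:
    """Return the copy-pasteable commands for one labeled environment."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    values = ENVIRONMENTS[env]
    return [template.substitute(values) for template in STEP_TEMPLATES]


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(f"# {env}")
        for command in commands_for(env):
            print(command)
```

Rejecting unknown labels outright avoids a responder silently running a staging command against production because of a typo.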

How are escalation contacts handled?

List primary and secondary contacts with roles, contact methods, and when to escalate based on defined thresholds.
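
A threshold-based escalation list could be modeled as follows. This is only a sketch: the roles, contact methods, and minute thresholds are made-up examples, and `Contact` and `contacts_to_notify` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Contact:
    role: str
    method: str                  # for example "pager" or "phone"
    escalate_after_minutes: int  # minutes unresolved before this contact is engaged


# Example policy (assumed values): primary first, then secondary, then a manager.
POLICY = [
    Contact("Primary on-call", "pager", 0),
    Contact("Secondary on-call", "pager", 15),
    Contact("Engineering manager", "phone", 30),
]


def contacts_to_notify(minutes_unresolved: int, policy: List[Contact]) -> List[Contact]:
    """Return every contact whose threshold has been reached, in policy order."""
    return [c for c in policy if minutes_unresolved >= c.escalate_after_minutes]


if __name__ == "__main__":
    for contact in contacts_to_notify(20, POLICY):
        print(f"{contact.role} via {contact.method}")
```

Writing the thresholds as data makes the "when to escalate" rule explicit instead of leaving it to judgment during an incident.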

What verification should be included?

Include exact commands, expected outputs, log locations, and success criteria so responders can confirm the system state.
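
To make "expected output" checkable rather than descriptive, a runbook step can pair its command with a pattern that the output must match. The sketch below assumes success means exit code 0 plus a regex match on standard output; `verify` is a hypothetical helper, and the child Python process stands in for a real health-check command.

```python
import re
import subprocess
import sys
from typing import List


def verify(command: List[str], expected_pattern: str, timeout: float = 10.0) -> bool:
    """Run a command; succeed only if it exits 0 and stdout matches the pattern."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    return result.returncode == 0 and re.search(expected_pattern, result.stdout) is not None


if __name__ == "__main__":
    # Stand-in for a real check: a child Python process that prints a status line.
    check = [sys.executable, "-c", "print('status: healthy')"]
    print(verify(check, r"status: healthy"))
```

Passing the command as a list rather than a shell string avoids quoting surprises, and the timeout keeps a hung check from blocking the responder.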