This skill helps you create actionable, testable runbooks and playbooks for on-call teams, accelerating incident response and documentation quality.
To add this skill to your agents, run `npx playbooks add skill nik-kale/sre-skills --skill runbook-creator`.
---
name: runbook-creator
description: Templates and patterns for creating operational runbooks and playbooks. Use when creating runbooks or playbooks, writing operational documentation, or documenting procedures for on-call teams.
---
# Runbook Creator
Templates and best practices for creating effective operational runbooks.
## When to Use This Skill
- Creating runbooks for new services
- Documenting incident response procedures
- Writing operational playbooks
- Standardizing on-call documentation
- Automating common procedures
## Runbook Principles
1. **Actionable**: Every step should be executable
2. **Testable**: Verify each step works
3. **Current**: Update when systems change
4. **Accessible**: Reachable during incidents (not hosted only behind a VPN)
5. **Linked**: Referenced from alerts
## Standard Runbook Template
Copy and customize this template:
````markdown
# [Service Name] - [Issue Type]
## Overview
Brief description of what this runbook addresses.
**Last Updated**: YYYY-MM-DD
**Owner**: [Team/Person]
**Related Alerts**: [Alert names that link here]
## Symptoms
What indicates this issue is occurring:
- [ ] Symptom 1
- [ ] Symptom 2
- [ ] Symptom 3
## Impact
- **Users Affected**: [Description]
- **Severity**: [SEV1/SEV2/SEV3/SEV4]
- **Business Impact**: [Description]
## Prerequisites
- Access to [system/tool]
- Permissions: [required permissions]
- Tools: [required CLI tools]
## Diagnostic Steps
### Step 1: [Verify the Issue]
```bash
# Command to run
kubectl get pods -n production | grep -v Running
```
**Expected Output**: [What you should see]
**If Different**: [What to do]
### Step 2: [Gather Information]
```bash
# Command to run
kubectl logs deployment/my-service -n production --tail=100
```
**Look For**: [What to look for in output]
## Resolution Steps
### Option A: [Quick Fix - e.g., Restart]
Use when: [conditions]
```bash
# Step 1: Restart the service
kubectl rollout restart deployment/my-service -n production
# Step 2: Verify pods are coming up
kubectl get pods -n production -w
```
**Verification**: [How to confirm fix worked]
### Option B: [Rollback]
Use when: [conditions]
```bash
# Step 1: Check rollout history
kubectl rollout history deployment/my-service -n production
# Step 2: Rollback to previous version
kubectl rollout undo deployment/my-service -n production
```
**Verification**: [How to confirm fix worked]
## Verification
How to confirm the issue is resolved:
- [ ] Error rate returned to normal
- [ ] Latency within SLO
- [ ] No related alerts firing
- [ ] User-facing functionality working
## Escalation
If this runbook doesn't resolve the issue:
1. **First**: Contact [Team/Person] via [Slack/Phone]
2. **Then**: Page [Escalation contact]
3. **Finally**: [Further escalation path]
## Related Resources
- [Dashboard Link](https://grafana/d/xxx)
- [Architecture Diagram](link)
- [Related Runbook](link)
## Revision History
| Date | Author | Change |
|------|--------|--------|
| YYYY-MM-DD | Name | Initial version |
````
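The `**Last Updated**` field in the template can back the "Current" principle with an automated check. The following is a minimal sketch, assuming runbooks are Markdown files that keep the `**Last Updated**: YYYY-MM-DD` line exactly as written in the template. The function names `runbook_last_updated` and `runbook_is_stale` and the cutoff argument are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: flag runbooks whose "Last Updated" date is older than a cutoff.
# Assumes the template's "**Last Updated**: YYYY-MM-DD" line is present.

# Print the date from the first "**Last Updated**:" line of a runbook file.
runbook_last_updated() {
  sed -n 's/^\*\*Last Updated\*\*:[[:space:]]*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' "$1" | head -n 1
}

# Return 0 (stale) if the date is missing or earlier than the cutoff (YYYY-MM-DD).
# ISO dates sort lexicographically, so a string comparison is enough.
runbook_is_stale() {
  local file="$1" cutoff="$2" updated
  updated="$(runbook_last_updated "$file")"
  [[ -z "$updated" || "$updated" < "$cutoff" ]]
}
```

A possible use is `runbook_is_stale path/to/runbook.md 2024-01-01 && echo "needs review"`, where the path is a placeholder. An unfilled `YYYY-MM-DD` placeholder is treated as stale because it does not parse as a date.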
## Quick Runbook Templates
### Service Restart
````markdown
# [Service] - Restart Procedure
## When to Use
- Service unresponsive
- Memory leak suspected
- After configuration change
## Steps
1. **Notify team**
```
Post in #incidents: "Restarting [service] due to [reason]"
```
2. **Restart service**
```bash
kubectl rollout restart deployment/[service] -n [namespace]
```
3. **Monitor rollout**
```bash
kubectl rollout status deployment/[service] -n [namespace]
```
4. **Verify health**
```bash
kubectl get pods -n [namespace] | grep [service]
# All pods should be Running, 1/1 Ready
```
5. **Check metrics**
- Error rate: [dashboard link]
- Latency: [dashboard link]
## Rollback
If restart makes things worse:
```bash
kubectl rollout undo deployment/[service] -n [namespace]
```
````
### Database Failover
````markdown
# [Database] - Failover Procedure
## When to Use
- Primary database unresponsive
- Planned maintenance
- Primary showing errors
## Prerequisites
- Database admin access
- Verify replica is in sync
## Pre-Failover Checks
1. **Check replication status**
```sql
-- Run on the primary: one row per connected replica
SELECT * FROM pg_stat_replication;
```
Verify: `state = 'streaming'`, lag is minimal
2. **Check replica health**
```bash
pg_isready -h replica-host -p 5432
```
## Failover Steps
1. **Stop writes to primary** (if possible)
```sql
-- Run on the primary; applies to transactions started after the reload
ALTER SYSTEM SET default_transaction_read_only = on;
SELECT pg_reload_conf();
```
2. **Promote replica**
```bash
# Run on the replica being promoted
pg_ctl promote -D /var/lib/postgresql/data
```
3. **Update connection strings**
- Update DNS/load balancer to point to new primary
- Or update application config
4. **Verify applications reconnected**
```sql
-- Run on the new primary
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
```
## Post-Failover
- [ ] Monitor error rates
- [ ] Set up new replica from old primary
- [ ] Update documentation
````
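"Lag is minimal" is easier to act on when it is a number. The following is a minimal sketch of a pre-failover gate, assuming PostgreSQL 10 or later (where `pg_stat_replication` exposes `replay_lsn`). The helper `replica_ready`, the 1 MiB default threshold, and the host name in the comment are illustrative assumptions, not part of PostgreSQL or of this skill.

```bash
#!/usr/bin/env bash
# Sketch: decide whether a replica looks safe to promote, given its
# replication state and replay lag in bytes.
# The 1 MiB default threshold is an assumption; pick one that fits your SLOs.

replica_ready() {
  local state="$1" lag_bytes="$2" max_lag_bytes="${3:-1048576}"
  [[ "$state" == "streaming" ]] || { echo "NOT READY: state=$state"; return 1; }
  [[ "$lag_bytes" =~ ^[0-9]+$ ]] || { echo "NOT READY: lag unknown"; return 1; }
  (( lag_bytes <= max_lag_bytes )) || { echo "NOT READY: lag=${lag_bytes}B"; return 1; }
  echo "READY: lag=${lag_bytes}B"
}

# Example wiring (run against the primary; the host name is a placeholder):
#   read -r state lag < <(psql -h primary-host -At -F ' ' -c \
#     "SELECT state, pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
#        FROM pg_stat_replication LIMIT 1")
#   replica_ready "$state" "$lag"
```

Keeping the decision in a pure function means the threshold logic can be tested without a database, while the `psql` wiring stays a thin, replaceable layer.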
### Cache Clear
````markdown
# [Service] - Cache Clear Procedure
## When to Use
- Stale data being served
- Cache corruption suspected
- After data migration
## Impact Assessment
- Cache clear will cause temporary latency spike
- Database load will increase temporarily
## Steps
1. **Notify team**
```
Post in #incidents: "Clearing [cache] cache due to [reason]"
```
2. **Clear cache**
**Redis - All keys**:
```bash
redis-cli -h [host] FLUSHALL
```
**Redis - Specific pattern**:
```bash
# Pass -h to both commands; without it, DEL runs against localhost
redis-cli -h [host] --scan --pattern "user:*" | xargs redis-cli -h [host] DEL
```
**Application cache**:
```bash
curl -X POST http://[service]/admin/cache/clear
```
3. **Monitor**
- Watch cache hit rate recover
- Monitor database load
- Check latency
## Verification
- Cache hit rate returning to normal
- No errors from cache operations
- Latency stabilizing
````
## Runbook Checklist
Before publishing a runbook, verify:
```
Runbook Quality Checklist:
- [ ] Title clearly describes the issue/procedure
- [ ] Symptoms section helps identify when to use
- [ ] All commands are copy-pasteable
- [ ] Expected output documented for each command
- [ ] Verification steps confirm success
- [ ] Escalation path is clear
- [ ] Links to dashboards work
- [ ] Tested by someone other than author
- [ ] Linked from relevant alerts
```
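Some checklist items can be checked mechanically before a human review. The following is a minimal sketch, assuming runbooks follow the standard template's `## Symptoms`, `## Verification`, and `## Escalation` headings. `lint_runbook` is a hypothetical helper, and it cannot judge whether the content of a section is any good.

```bash
#!/usr/bin/env bash
# Sketch: report template sections missing from a runbook file.
# Prints one line per problem; returns non-zero if any problem is found.

lint_runbook() {
  local file="$1" problems=0 section
  for section in "Symptoms" "Verification" "Escalation"; do
    if ! grep -q "^## ${section}[[:space:]]*\$" "$file"; then
      echo "missing section: ${section}"
      problems=1
    fi
  done
  # An unfilled "YYYY-MM-DD" usually means the metadata was never completed.
  if grep -q "YYYY-MM-DD" "$file"; then
    echo "unfilled date placeholder"
    problems=1
  fi
  return "$problems"
}
```

Items such as "Tested by someone other than author" still need a person, so a check like this complements the checklist rather than replacing it.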
## Automation Integration
### Runbook with Automation Hooks
````markdown
# [Service] - Automated Recovery
## Automatic Actions
The following actions run automatically:
1. Pod restart on OOMKilled (Kubernetes)
2. Scale-up on high CPU (HPA)
## Manual Steps (if auto-recovery fails)
### Check why auto-recovery failed
```bash
kubectl describe hpa [service] -n [namespace]
kubectl get events -n [namespace] --sort-by='.lastTimestamp'
```
### Manual intervention
[Steps here]
````
### Script-Backed Runbook
````markdown
# [Service] - Diagnostic Script
## Quick Diagnosis
Run the diagnostic script:
```bash
./scripts/diagnose-service.sh [service-name]
```
This script checks:
- Pod status
- Recent logs
- Resource usage
- Dependency health
## Interpreting Results
| Result | Meaning | Action |
|--------|---------|--------|
| `HEALTHY` | All checks pass | No action needed |
| `DEGRADED` | Some issues | Follow specific recommendations |
| `CRITICAL` | Major issues | Escalate immediately |
````
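The result table implies that the script reduces several checks to a single status. The following is a minimal sketch of that reduction, assuming each check reports `ok`, `warn`, or `fail`. These labels, the `overall_status` function, and the rule that any single failure means `CRITICAL` are illustrative assumptions, since the contents of `diagnose-service.sh` are not shown here.

```bash
#!/usr/bin/env bash
# Sketch: collapse per-check results into HEALTHY / DEGRADED / CRITICAL.
# Input: one result per argument, each "ok", "warn", or "fail".
# Exit status: 0 for HEALTHY, 1 for DEGRADED, 2 for CRITICAL.

overall_status() {
  local status="HEALTHY" result
  for result in "$@"; do
    case "$result" in
      fail) echo "CRITICAL"; return 2 ;;   # any failed check escalates
      warn) status="DEGRADED" ;;
      ok)   ;;
      *)    status="DEGRADED" ;;           # treat unknown results as suspect
    esac
  done
  echo "$status"
  [[ "$status" == "HEALTHY" ]]
}
```

For example, `overall_status ok warn ok` prints `DEGRADED`. Returning a distinct exit status per result lets an alerting hook branch on the outcome without parsing the printed text.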
## Common Runbook Categories
Every service should have runbooks for:
```
Essential Runbooks:
- [ ] Service restart
- [ ] Rollback deployment
- [ ] Scale up/down
- [ ] Clear cache
- [ ] Database failover (if applicable)
- [ ] Dependency failure response
- [ ] High error rate investigation
- [ ] High latency investigation
```
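Coverage of the essential list can be audited per service. The following is a minimal sketch, assuming each service keeps its runbooks in one directory with one Markdown file per category. The file names are a hypothetical naming convention, not something this skill prescribes, and database failover is left out because it only applies to some services.

```bash
#!/usr/bin/env bash
# Sketch: list essential runbooks missing from a service's runbook directory.
# File names are an assumed convention; adjust them to match your repository.

missing_runbooks() {
  local dir="$1" name rc=0
  for name in restart rollback scale cache-clear dependency-failure \
              high-error-rate high-latency; do
    if [[ ! -f "${dir}/${name}.md" ]]; then
      echo "${name}.md"
      rc=1
    fi
  done
  return "$rc"
}
```

The function prints one missing file name per line and returns non-zero when anything is missing, so it can gate a CI job or feed a periodic report.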
## Additional Resources
- [Example Runbooks](references/example-runbooks.md)
- [Runbook Automation](references/automation.md)
This skill provides templates and practical patterns for creating operational runbooks and playbooks used by SRE and platform teams. It standardizes incident procedures, diagnostic commands, verification steps, and escalation paths so teams can respond quickly and consistently. Use it to produce clear, testable runbooks that are accessible during incidents.
The skill supplies a standard runbook template with sections for overview, symptoms, impact, prerequisites, diagnostics, resolution options, verification, escalation, and revision history. It includes quick templates for common scenarios—service restart, database failover, cache clear—and a checklist to validate runbook quality. It also shows how to integrate automation hooks and script-backed diagnostics for faster recovery.
**How do I ensure a runbook stays current?**

Assign an owner, include a last-updated date, and require updates after any incident or architecture change. Test changes in a staging environment.

**What belongs in the verification section?**

Concrete checks such as error rate thresholds, SLO/latency targets, related alerts no longer firing, and user-facing functionality tests. Include dashboard links and the exact commands used to confirm.