runbook-creator skill

This skill helps you create actionable, testable runbooks and playbooks for on-call teams, speeding up incident response and improving documentation quality.

npx playbooks add skill nik-kale/sre-skills --skill runbook-creator

---
name: runbook-creator
description: Templates and patterns for creating operational runbooks and playbooks. Use when creating runbooks or playbooks, writing operational documentation, or documenting procedures for on-call teams.
---

# Runbook Creator

Templates and best practices for creating effective operational runbooks.

## When to Use This Skill

- Creating runbooks for new services
- Documenting incident response procedures
- Writing operational playbooks
- Standardizing on-call documentation
- Automating common procedures

## Runbook Principles

1. **Actionable**: Every step should be executable
2. **Testable**: Verify each step works
3. **Current**: Update when systems change
4. **Accessible**: Available during incidents (not stored somewhere reachable only over VPN)
5. **Linked**: Referenced from alerts
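
One way to check the *Linked* principle is to scan alert definitions for a runbook link. The sketch below assumes Prometheus-style rule files where each rule starts with `- alert:` and the link lives in a `runbook_url` annotation. That annotation name is a common convention rather than something this skill prescribes, and the function name `alerts_missing_runbook` is hypothetical. It matches lines rather than parsing YAML, so treat it as a rough first pass.

```bash
#!/usr/bin/env bash
# Print the name of every alert in a rule file that has no runbook_url line.
# Assumptions: each rule starts with "- alert: <Name>" and the link is stored
# under the key "runbook_url". Line-based matching, not a YAML parser.
alerts_missing_runbook() {
  awk '
    /^[ \t]*-[ \t]*alert:/ {
      if (name != "" && !linked) print name   # report the previous rule
      name = $NF
      linked = 0
      next
    }
    /^[ \t]*runbook_url:/ { linked = 1 }
    END { if (name != "" && !linked) print name }
  ' "$1"
}
```

Running `alerts_missing_runbook rules.yml` lists the alerts that still need a runbook link; empty output means every rule in the file has one.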

## Standard Runbook Template

Copy and customize this template:

```markdown
# [Service Name] - [Issue Type]

## Overview
Brief description of what this runbook addresses.

**Last Updated**: YYYY-MM-DD
**Owner**: [Team/Person]
**Related Alerts**: [Alert names that link here]

## Symptoms
What indicates this issue is occurring:
- [ ] Symptom 1
- [ ] Symptom 2
- [ ] Symptom 3

## Impact
- **Users Affected**: [Description]
- **Severity**: [SEV1/SEV2/SEV3/SEV4]
- **Business Impact**: [Description]

## Prerequisites
- Access to [system/tool]
- Permissions: [required permissions]
- Tools: [required CLI tools]

## Diagnostic Steps

### Step 1: [Verify the Issue]
```bash
# Command to run
kubectl get pods -n production | grep -v Running
```

**Expected Output**: [What you should see]
**If Different**: [What to do]

### Step 2: [Gather Information]
```bash
# Command to run
kubectl logs deployment/my-service -n production --tail=100
```

**Look For**: [What to look for in output]

## Resolution Steps

### Option A: [Quick Fix - e.g., Restart]
Use when: [conditions]

```bash
# Step 1: Restart the service
kubectl rollout restart deployment/my-service -n production

# Step 2: Verify pods are coming up
kubectl get pods -n production -w
```

**Verification**: [How to confirm fix worked]

### Option B: [Rollback]
Use when: [conditions]

```bash
# Step 1: Check rollout history
kubectl rollout history deployment/my-service -n production

# Step 2: Rollback to previous version
kubectl rollout undo deployment/my-service -n production
```

**Verification**: [How to confirm fix worked]

## Verification
How to confirm the issue is resolved:
- [ ] Error rate returned to normal
- [ ] Latency within SLO
- [ ] No related alerts firing
- [ ] User-facing functionality working

## Escalation
If this runbook doesn't resolve the issue:
1. **First**: Contact [Team/Person] via [Slack/Phone]
2. **Then**: Page [Escalation contact]
3. **Finally**: [Further escalation path]

## Related Resources
- [Dashboard Link](https://grafana/d/xxx)
- [Architecture Diagram](link)
- [Related Runbook](link)

## Revision History
| Date | Author | Change |
|------|--------|--------|
| YYYY-MM-DD | Name | Initial version |
```
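
Because the template records a `**Last Updated**` date, staleness can be checked mechanically. The following is a minimal sketch, assuming runbooks keep that line exactly as written in the template; the function names and the idea of passing a cutoff date are illustrative, not part of the skill.

```bash
#!/usr/bin/env bash
# Print the date from the first "**Last Updated**: YYYY-MM-DD" line, if any.
runbook_last_updated() {
  sed -n 's/^\*\*Last Updated\*\*:[ ]*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/p' "$1" | head -n 1
}

# Exit 0 when the runbook is older than the cutoff date or has no parseable
# date (for example, the YYYY-MM-DD placeholder was never filled in).
# ISO dates sort lexicographically, so a string comparison is sufficient.
runbook_is_stale() {
  local file=$1 cutoff=$2 updated
  updated=$(runbook_last_updated "$file")
  if [ -z "$updated" ]; then
    return 0
  fi
  [[ "$updated" < "$cutoff" ]]
}
```

With GNU `date`, a 90-day cutoff could be computed as `runbook_is_stale runbook.md "$(date -d '90 days ago' +%F)"`; the 90-day window is only an example value.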

## Quick Runbook Templates

### Service Restart

```markdown
# [Service] - Restart Procedure

## When to Use
- Service unresponsive
- Memory leak suspected
- After configuration change

## Steps

1. **Notify team**
   ```
   Post in #incidents: "Restarting [service] due to [reason]"
   ```

2. **Restart service**
   ```bash
   kubectl rollout restart deployment/[service] -n [namespace]
   ```

3. **Monitor rollout**
   ```bash
   kubectl rollout status deployment/[service] -n [namespace]
   ```

4. **Verify health**
   ```bash
   kubectl get pods -n [namespace] | grep [service]
   # All pods should be Running with every container Ready (e.g. 1/1)
   ```

5. **Check metrics**
   - Error rate: [dashboard link]
   - Latency: [dashboard link]

## Rollback
If restart makes things worse:
```bash
kubectl rollout undo deployment/[service] -n [namespace]
```
```
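
The restart, monitor, and verify steps can also be wrapped in one function so they can be rehearsed before an incident. This is a sketch under stated assumptions: `KUBECTL` and `DRY_RUN` are hypothetical knobs introduced here, and the 300-second timeout is an arbitrary example value.

```bash
#!/usr/bin/env bash
# Hypothetical knobs: KUBECTL picks the binary, DRY_RUN=1 prints each command
# instead of executing it.
KUBECTL=${KUBECTL:-kubectl}

run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Restart a deployment, wait for the rollout, then list pods for review.
restart_service() {
  local service=$1 namespace=$2
  run "$KUBECTL" rollout restart "deployment/$service" -n "$namespace" || return 1
  if ! run "$KUBECTL" rollout status "deployment/$service" -n "$namespace" --timeout=300s; then
    # Point at the rollback step rather than rolling back automatically.
    echo "Rollout did not complete; consider: $KUBECTL rollout undo deployment/$service -n $namespace" >&2
    return 1
  fi
  run "$KUBECTL" get pods -n "$namespace"
}
```

`DRY_RUN=1 restart_service my-service production` prints the three commands without touching the cluster, which is one way to have someone other than the author review the procedure.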

### Database Failover

```markdown
# [Database] - Failover Procedure

## When to Use
- Primary database unresponsive
- Planned maintenance
- Primary showing errors

## Prerequisites
- Database admin access
- Verify replica is in sync

## Pre-Failover Checks

1. **Check replication status**
   ```sql
   SELECT * FROM pg_stat_replication;
   ```
   Verify on the primary: `state = 'streaming'` and lag is minimal (for example, `replay_lsn` is close to `sent_lsn`)

2. **Check replica health**
   ```bash
   pg_isready -h replica-host -p 5432
   ```

## Failover Steps

1. **Stop writes to primary** (if possible)
   ```sql
   ALTER SYSTEM SET default_transaction_read_only = on;
   SELECT pg_reload_conf();
   ```

2. **Promote replica** (run on the replica host; adjust `-D` to its data directory)
   ```bash
   pg_ctl promote -D /var/lib/postgresql/data
   ```

3. **Update connection strings**
   - Update DNS/load balancer to point to new primary
   - Or update application config

4. **Verify applications reconnected**
   ```sql
   SELECT count(*) FROM pg_stat_activity WHERE state = 'active';
   ```

## Post-Failover
- [ ] Monitor error rates
- [ ] Set up new replica from old primary
- [ ] Update documentation
```
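
The "replica is in sync" pre-check can be turned into a pass/fail gate. The sketch below assumes PostgreSQL 10 or later (for `pg_wal_lsn_diff` and `pg_current_wal_lsn`) and that the query output arrives as `state|lag_bytes` rows, which is what `psql -At` produces. The function name, the host `primary-host`, and the 1 MiB default threshold are all illustrative.

```bash
#!/usr/bin/env bash
# Read "state|lag_bytes" rows on stdin, one per replica. Succeed only if there
# is at least one row, every replica is streaming, and every lag value is at
# or below the threshold (default 1 MiB, an example value only).
#
# Example wiring (host name is hypothetical):
#   psql -h primary-host -At -c "SELECT state,
#     pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
#     FROM pg_stat_replication" | replication_ready
replication_ready() {
  local max_lag_bytes=${1:-1048576}
  awk -F'|' -v max="$max_lag_bytes" '
    $1 != "streaming" || $2 == "" || $2 + 0 > max { bad = 1 }
    END { if (NR == 0 || bad) exit 1 }
  '
}
```

An empty result counts as a failure on purpose: no rows in `pg_stat_replication` means no replica is connected, so there is nothing safe to promote.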

### Cache Clear

```markdown
# [Service] - Cache Clear Procedure

## When to Use
- Stale data being served
- Cache corruption suspected
- After data migration

## Impact Assessment
- Clearing the cache will cause a temporary latency spike
- Database load will increase temporarily

## Steps

1. **Notify team**
   ```
   Post in #incidents: "Clearing [cache] cache due to [reason]"
   ```

2. **Clear cache**
   
   **Redis - All keys**:
   ```bash
   redis-cli -h [host] FLUSHALL
   ```
   
   **Redis - Specific pattern**:
   ```bash
   redis-cli -h [host] --scan --pattern "user:*" | xargs redis-cli -h [host] DEL
   ```
   
   **Application cache**:
   ```bash
   curl -X POST http://[service]/admin/cache/clear
   ```

3. **Monitor**
   - Watch cache hit rate recover
   - Monitor database load
   - Check latency

## Verification
- Cache hit rate returning to normal
- No errors from cache operations
- Latency stabilizing
```
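
Pattern-based deletion is easy to get wrong under pressure, so a small guard can help. The following sketch deletes matching keys one at a time and refuses an empty or catch-all pattern. `REDIS_CLI` is a hypothetical override and `delete_by_pattern` is an illustrative name; deleting key by key is slow for large key sets and is shown only for clarity.

```bash
#!/usr/bin/env bash
# Hypothetical override so a wrapper (or a test double) can replace redis-cli.
REDIS_CLI=${REDIS_CLI:-redis-cli}

# Delete every key matching a pattern on the given host, one key at a time.
delete_by_pattern() {
  local host=$1 pattern=$2 key
  case "$pattern" in
    ""|"*")
      echo "refusing catch-all pattern; use the FLUSHALL step deliberately instead" >&2
      return 2
      ;;
  esac
  # --scan iterates with SCAN, which avoids blocking the server like KEYS can.
  "$REDIS_CLI" -h "$host" --scan --pattern "$pattern" |
    while IFS= read -r key; do
      # Use the same host for DEL as for the scan; stdin is detached so the
      # delete command cannot consume the list of keys.
      "$REDIS_CLI" -h "$host" DEL "$key" </dev/null >/dev/null
      echo "deleted $key"
    done
}
```

The printed `deleted <key>` lines double as an audit trail that can be pasted into the incident channel.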

## Runbook Checklist

Before publishing a runbook, verify:

```
Runbook Quality Checklist:
- [ ] Title clearly describes the issue/procedure
- [ ] Symptoms section helps identify when to use the runbook
- [ ] All commands are copy-pasteable
- [ ] Expected output documented for each command
- [ ] Verification steps confirm success
- [ ] Escalation path is clear
- [ ] Links to dashboards work
- [ ] Tested by someone other than author
- [ ] Linked from relevant alerts
```
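
Parts of this checklist can be automated. As a minimal sketch, the function below checks that a runbook contains the main section headings from the standard template; the function name and the chosen list of sections are assumptions to adapt to your own template. It cannot judge whether commands work or links resolve.

```bash
#!/usr/bin/env bash
# Print each required "## <Section>" heading that is absent from a runbook.
# Returns 1 if anything is missing, 0 otherwise.
lint_runbook() {
  local file=$1 section missing=0
  for section in "Symptoms" "Diagnostic Steps" "Resolution Steps" "Verification" "Escalation"; do
    # -x matches the whole line, so "**Verification**:" inside a step does
    # not count as the "## Verification" section.
    if ! grep -qxF "## $section" "$file"; then
      echo "missing section: $section"
      missing=1
    fi
  done
  return "$missing"
}
```

Wired into a pre-merge check, `lint_runbook path/to/runbook.md` fails the build when a section is missing, which leaves human reviewers free to focus on the checklist items that need judgment.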

## Automation Integration

### Runbook with Automation Hooks

```markdown
# [Service] - Automated Recovery

## Automatic Actions
The following actions run automatically:
1. Pod restart on OOMKilled (Kubernetes)
2. Scale-up on high CPU (HPA)

## Manual Steps (if auto-recovery fails)

### Check why auto-recovery failed
```bash
kubectl describe hpa [service] -n [namespace]
kubectl get events -n [namespace] --sort-by='.lastTimestamp'
```

### Manual intervention
[Steps here]
```

### Script-Backed Runbook

```markdown
# [Service] - Diagnostic Script

## Quick Diagnosis
Run the diagnostic script:
```bash
./scripts/diagnose-service.sh [service-name]
```

This script checks:
- Pod status
- Recent logs
- Resource usage
- Dependency health

## Interpreting Results
| Result | Meaning | Action |
|--------|---------|--------|
| `HEALTHY` | All checks pass | No action needed |
| `DEGRADED` | Some issues | Follow specific recommendations |
| `CRITICAL` | Major issues | Escalate immediately |
```
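
The skill does not include `diagnose-service.sh` itself, so the following is only a sketch of how such a script might reduce individual check results to the three statuses in the table. The input format (`<check>:<ok|warn|fail>` lines) and the mapping (any `fail` gives `CRITICAL`, otherwise any `warn` gives `DEGRADED`) are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Read "<check>:<ok|warn|fail>" lines on stdin and print one overall status.
# Assumed mapping: any fail -> CRITICAL, else any warn -> DEGRADED,
# else HEALTHY.
summarize_checks() {
  local status=HEALTHY name result
  while IFS=: read -r name result; do
    case "$result" in
      fail) status=CRITICAL ;;
      warn) [ "$status" = CRITICAL ] || status=DEGRADED ;;
    esac
  done
  echo "$status"
}
```

Each check (pod status, recent logs, resource usage, dependency health) would emit one line, and the combined output would be piped into `summarize_checks` at the end of the script.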

## Common Runbook Categories

Every service should have runbooks for:

```
Essential Runbooks:
- [ ] Service restart
- [ ] Rollback deployment
- [ ] Scale up/down
- [ ] Clear cache
- [ ] Database failover (if applicable)
- [ ] Dependency failure response
- [ ] High error rate investigation
- [ ] High latency investigation
```

## Additional Resources

- [Example Runbooks](references/example-runbooks.md)
- [Runbook Automation](references/automation.md)

Overview

This skill provides templates and practical patterns for creating operational runbooks and playbooks used by SRE and platform teams. It standardizes incident procedures, diagnostic commands, verification steps, and escalation paths so teams can respond quickly and consistently. Use it to produce clear, testable runbooks that are accessible during incidents.

How this skill works

The skill supplies a standard runbook template with sections for overview, symptoms, impact, prerequisites, diagnostics, resolution options, verification, escalation, and revision history. It includes quick templates for common scenarios—service restart, database failover, cache clear—and a checklist to validate runbook quality. It also shows how to integrate automation hooks and script-backed diagnostics for faster recovery.

When to use it

- Creating runbooks for new services or features
- Documenting incident response and on-call playbooks
- Standardizing operational runbooks across teams
- Automating routine recovery steps and diagnostic scripts
- Preparing runbooks before major changes or maintenance windows

Best practices

- Make every step actionable and copy-pasteable with expected output noted
- Keep runbooks current: update after tests, incidents, or architecture changes
- Ensure runbooks are accessible during incidents (avoid VPN-only storage)
- Include verification steps and clear escalation paths
- Have someone other than the author test each runbook, and link runbooks from alerts

Example use cases

- Service restart runbook with step-by-step kubectl commands and rollback instructions
- Database failover procedure including pre-checks, promote steps, and connection updates
- Cache clear playbook that documents impact, notification, and verification metrics
- Automated recovery runbook showing triggers, automation hooks, and manual intervention steps
- Script-backed diagnostic runbook that runs a health script and interprets results for next actions

FAQ

How do I ensure a runbook stays current?

Assign an owner, include a last-updated date, and require updates after any incident or architecture change; test changes in a staging environment.

What belongs in the verification section?

Concrete checks like error rate thresholds, SLO/latency targets, related alerts no longer firing, and user-facing functionality tests; include dashboard links and exact commands to confirm.