
planning-disaster-recovery skill


This skill helps design and validate disaster recovery plans with RTO/RPO targets, cross-region replication, and chaos testing to ensure resilience.

npx playbooks add skill ancoleman/ai-design-components --skill planning-disaster-recovery

---
name: planning-disaster-recovery
description: Design and implement disaster recovery strategies with RTO/RPO planning, database backups, Kubernetes DR, cross-region replication, and chaos engineering testing. Use when implementing backup systems, configuring point-in-time recovery, setting up multi-region failover, or validating DR procedures.
---

# Disaster Recovery

## Purpose

Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.

## When to Use This Skill

Invoke this skill when:
- Defining recovery time objectives (RTO) and recovery point objectives (RPO)
- Implementing database backups with point-in-time recovery (PITR)
- Setting up Kubernetes cluster backup and restore workflows
- Configuring cross-region replication for high availability
- Testing disaster recovery procedures through chaos experiments
- Meeting compliance requirements (GDPR, SOC 2, HIPAA)
- Automating backup monitoring and alerting
- Designing multi-cloud disaster recovery architectures

## Core Concepts

### RTO and RPO Fundamentals

**Recovery Time Objective (RTO):** Maximum acceptable downtime after a disaster before business impact becomes unacceptable.

**Recovery Point Objective (RPO):** Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.

**Criticality Tiers:**
- **Tier 0 (Mission-Critical):** RTO < 1 hour, RPO < 5 minutes
- **Tier 1 (Production):** RTO 1-4 hours, RPO 15-60 minutes
- **Tier 2 (Important):** RTO 4-24 hours, RPO 1-6 hours
- **Tier 3 (Standard):** RTO > 24 hours, RPO > 6 hours
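
The tiers above can be turned into a small classifier. This is a minimal sketch, not part of the skill's scripts: the function names are hypothetical, objectives are given in minutes, and it assumes the stricter of the two objectives decides the tier. The tier list leaves RPO values between 5 and 15 minutes unassigned; the sketch treats them as Tier 1.

```bash
#!/usr/bin/env bash
# Map one objective (in minutes) to a tier using three ascending limits.
# Usage: tier_for <minutes> <tier0_limit> <tier1_limit> <tier2_limit>
tier_for() {
  local minutes=$1
  if   [ "$minutes" -lt "$2" ]; then echo 0
  elif [ "$minutes" -le "$3" ]; then echo 1
  elif [ "$minutes" -le "$4" ]; then echo 2
  else echo 3
  fi
}

# Usage: classify_tier <rto_minutes> <rpo_minutes>
classify_tier() {
  local by_rto by_rpo
  by_rto=$(tier_for "$1" 60 240 1440)   # RTO limits: 1 h, 4 h, 24 h
  by_rpo=$(tier_for "$2" 5 60 360)      # RPO limits: 5 min, 60 min, 6 h
  # A lower tier number is stricter, so the smaller value wins.
  if [ "$by_rto" -lt "$by_rpo" ]; then echo "$by_rto"; else echo "$by_rpo"; fi
}

classify_tier 30 3       # prints 0
classify_tier 120 30     # prints 1
classify_tier 2880 10    # prints 1 (the RPO is stricter than the RTO)
```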

### 3-2-1 Backup Rule

Maintain **3 copies** of data on **2 different media types** with **1 copy offsite**.

Example implementation:
- Primary: Production database
- Secondary: Local backup storage
- Tertiary: Cloud backup (S3/GCS/Azure)

### Backup Types

**Full Backup:** Complete copy of all data. Slowest to create, fastest to restore.

**Incremental Backup:** Only changes since last backup. Fastest to create, requires full + all incrementals to restore.

**Differential Backup:** Changes since last full backup. Restore needs only the full plus the latest differential, so it uses more storage than incrementals in exchange for a faster restore.

**Continuous Backup:** Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.
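
The difference between incremental and differential backups is easiest to see in the restore chain. The sketch below takes backup types in chronological order and prints which backups are needed to restore the most recent one. `restore_chain` and the `full`/`diff`/`incr` labels are hypothetical names for illustration, not the interface of any backup tool.

```bash
#!/usr/bin/env bash
# Usage: restore_chain <type>...   (types in chronological order)
# Prints the backups needed to restore the latest one, as "<position>:<type>".
restore_chain() {
  local chain="" full="" n=0 kind
  for kind in "$@"; do
    n=$((n + 1))
    case "$kind" in
      full) full="$n:full"; chain="$full" ;;
      diff) # A differential depends only on the most recent full backup.
            [ -n "$full" ] || { echo "no full backup before #$n" >&2; return 1; }
            chain="$full $n:diff" ;;
      incr) # An incremental depends on everything needed by the previous backup.
            [ -n "$chain" ] || { echo "no earlier backup before #$n" >&2; return 1; }
            chain="$chain $n:incr" ;;
      *)    echo "unknown backup type: $kind" >&2; return 2 ;;
    esac
  done
  echo "$chain"
}

restore_chain full incr incr         # prints: 1:full 2:incr 3:incr
restore_chain full incr diff incr    # prints: 1:full 3:diff 4:incr
```

In the second call the differential at position 3 makes the incremental at position 2 unnecessary, which is why differentials shorten restores.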

## Quick Decision Framework

### Step 1: Map RTO/RPO to Strategy

```
RTO < 1 hour, RPO < 5 min
→ Active-Active replication, continuous archiving, automated failover
→ Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR
→ Cost: Highest

RTO 1-4 hours, RPO 15-60 min
→ Warm standby, incremental backups, automated failover
→ Tools: pgBackRest, WAL-G, RDS Multi-AZ
→ Cost: High

RTO 4-24 hours, RPO 1-6 hours
→ Daily full + incremental, cross-region backup
→ Tools: pgBackRest, Velero, Restic
→ Cost: Medium

RTO > 24 hours, RPO > 6 hours
→ Weekly full + daily incremental, single region
→ Tools: pg_dump, mysqldump, S3 versioning
→ Cost: Low
```

### Step 2: Select Backup Tools by Use Case

| Use Case | Primary Tool | Alternative | Key Feature |
|----------|-------------|-------------|-------------|
| PostgreSQL production | pgBackRest | WAL-G | PITR, compression, multi-repo |
| MySQL production | Percona XtraBackup | WAL-G | Hot backups, incremental |
| MongoDB | Atlas Backup | mongodump | Continuous backup, PITR |
| Kubernetes cluster | Velero | ArgoCD + Git | PV snapshots, scheduling |
| File/object backup | Restic | Duplicity | Encryption, deduplication |
| Cross-region replication | Aurora Global DB | RDS Read Replica | Active-Active capable |

## Database Backup Patterns

### PostgreSQL with pgBackRest

**Use Case:** Production PostgreSQL with < 5 minute RPO

**Quick Start:** See `examples/postgresql/pgbackrest-config/`

Configure continuous WAL archiving with full, differential, and incremental backups to S3/GCS/Azure. Schedule a weekly full backup and daily differentials. Restore with `pgbackrest --stanza=main --delta restore`; for PITR, add `--type=time` and a `--target` timestamp.

**Detailed Guide:** `references/database-backups.md#postgresql`
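
As a rough illustration of the weekly-full, daily-differential schedule, the sketch below builds the pgBackRest command for a given weekday and a point-in-time restore command. It only prints commands. The helper names, the choice of Sunday for the full backup, and the example timestamp are assumptions; the stanza name `main` comes from the text above.

```bash
#!/usr/bin/env bash
STANZA="main"   # stanza name used in the text above

# Day of week as printed by `date +%u`: 1 = Monday ... 7 = Sunday.
# Assumption: the full backup runs on Sunday, differentials on other days.
backup_type_for_day() {
  if [ "$1" -eq 7 ]; then echo full; else echo diff; fi
}

# Print the backup command for a given day of the week.
backup_cmd() {
  echo "pgbackrest --stanza=$STANZA --type=$(backup_type_for_day "$1") backup"
}

# Print a point-in-time restore command for a target timestamp.
pitr_restore_cmd() {
  echo "pgbackrest --stanza=$STANZA --delta --type=time --target=\"$1\" restore"
}

backup_cmd "$(date +%u)"
pitr_restore_cmd "2025-01-15 09:30:00+00"   # hypothetical recovery target
```

A cron entry could run the printed backup command directly. The restore command is meant to be run on a stopped PostgreSQL instance.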

### MySQL with Percona XtraBackup

**Use Case:** MySQL production requiring hot backups

**Quick Start:** See `examples/mysql/xtrabackup/`

Perform full (`xtrabackup --backup --parallel=4`) and incremental backups with binary log archiving for PITR. Restore requires decompress, prepare, apply incrementals, and copy-back steps.

**Detailed Guide:** `references/database-backups.md#mysql`
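
The restore order matters: every prepare step except the last uses `--apply-log-only`, so that later incrementals can still be applied. The sketch below prints the command sequence for a base backup plus any number of incrementals. It is a dry-run planner under assumed directory names, not the contents of `examples/mysql/xtrabackup/`. The `--decompress` steps only apply if the backups were taken with `--compress`.

```bash
#!/usr/bin/env bash
# Usage: restore_plan <base_dir> [incremental_dir]...
# Prints the xtrabackup commands in the order they should be run.
restore_plan() {
  local base=$1; shift
  local dir remaining=$#
  echo "xtrabackup --decompress --target-dir=$base"
  for dir in "$@"; do
    echo "xtrabackup --decompress --target-dir=$dir"
  done
  if [ "$remaining" -eq 0 ]; then
    # No incrementals: a single final prepare is enough.
    echo "xtrabackup --prepare --target-dir=$base"
  else
    echo "xtrabackup --prepare --apply-log-only --target-dir=$base"
    for dir in "$@"; do
      remaining=$((remaining - 1))
      if [ "$remaining" -gt 0 ]; then
        echo "xtrabackup --prepare --apply-log-only --target-dir=$base --incremental-dir=$dir"
      else
        # The last incremental is prepared without --apply-log-only.
        echo "xtrabackup --prepare --target-dir=$base --incremental-dir=$dir"
      fi
    done
  fi
  # copy-back expects MySQL to be stopped and the data directory to be empty.
  echo "xtrabackup --copy-back --target-dir=$base"
}

restore_plan /backups/full /backups/inc1 /backups/inc2   # hypothetical paths
```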

### MongoDB Backup

**Quick Start:** Use `mongodump --gzip --numParallelCollections=4` for logical backups or MongoDB Atlas for continuous backup with PITR.

**Detailed Guide:** `references/database-backups.md#mongodb`

## Kubernetes Disaster Recovery

### Velero for Cluster Backups

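**Quick Start:** `velero install --provider aws --bucket my-backups` (abbreviated; a working install also passes the provider plugin image and credentials via `--plugins` and `--secret-file`)

Configure scheduled backups (daily for the full cluster, hourly for the production namespace) with PV snapshots. Restore with `velero restore create --from-backup <name>`. Velero supports selective restores, including namespace mappings and storage class remapping.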

**Examples:** `examples/kubernetes/velero/`
**Detailed Guide:** `references/kubernetes-dr.md`
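
The schedules described above might look like the following. This is a sketch with assumed names: the schedule names, cron times, TTLs, the `production` namespace, and the backup name are placeholders. It runs in dry-run mode by default, so it prints the commands (without shell quoting) instead of calling Velero.

```bash
#!/usr/bin/env bash
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to execute against a real cluster

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

create_schedules() {
  # Daily backup of the whole cluster at 02:00, kept for 30 days.
  run velero schedule create daily-full --schedule "0 2 * * *" --ttl 720h
  # Hourly backup of the production namespace, kept for 7 days.
  run velero schedule create prod-hourly --schedule "0 * * * *" \
    --include-namespaces production --ttl 168h
}

# Usage: restore_into <backup_name> <source_namespace> <target_namespace>
restore_into() {
  run velero restore create --from-backup "$1" --namespace-mappings "$2:$3"
}

create_schedules
restore_into daily-full-20250101020000 production production-restore
```

Restoring into a mapped namespace keeps the original workloads untouched, which makes it a reasonable default for restore tests.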

### etcd Backup

**Quick Start:** `ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db`

Create periodic etcd snapshots for control plane recovery. Restoring means rebuilding the etcd data directory on every member from the same snapshot, then restarting the control plane against it.

**Examples:** `examples/kubernetes/etcd/`
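
Periodic snapshots need sortable names and a retention limit. The sketch below is one possible shape, not the contents of `examples/kubernetes/etcd/`. The environment variable names (`ETCD_ENDPOINTS`, `ETCD_CACERT`, `ETCD_CERT`, `ETCD_KEY`) and the `snapshot-<timestamp>.db` naming scheme are assumptions. `take_snapshot` is defined but not called here, because it needs a reachable etcd member.

```bash
#!/usr/bin/env bash
# Build a timestamped snapshot path; UTC timestamps in this format sort chronologically.
snapshot_path() {
  echo "$1/snapshot-$(date -u +%Y%m%dT%H%M%SZ).db"
}

# Save a snapshot into <dir>. Secured clusters need the TLS flags below.
take_snapshot() {
  local out
  out=$(snapshot_path "$1")
  ETCDCTL_API=3 etcdctl snapshot save "$out" \
    --endpoints="${ETCD_ENDPOINTS:-https://127.0.0.1:2379}" \
    --cacert="$ETCD_CACERT" --cert="$ETCD_CERT" --key="$ETCD_KEY"
}

# Keep only the newest <keep> snapshots in <dir>.
prune_snapshots() {
  local dir=$1 keep=$2
  set -- "$dir"/snapshot-*.db      # glob results are sorted, oldest first
  [ -e "$1" ] || return 0          # nothing matched
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"
    shift
  done
}

# Typical use from cron (hypothetical directory):
#   take_snapshot /backups/etcd && prune_snapshots /backups/etcd 24
snapshot_path /backups/etcd
```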

## Cloud-Specific DR Patterns

### AWS

**Key Services:**
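- RDS: Automated backups (up to 35-day retention), PITR, Multi-AZ
- Aurora Global DB: Cross-region replication with one writer region and read-only secondary regions; cross-region failover is a managed operation that must be initiated, not automatic
- S3 CRR: Cross-region replication; Replication Time Control (RTC) adds a 15-minute replication SLA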

**Examples:** `examples/cloud/aws/`
**Detailed Guide:** `references/cloud-dr-patterns.md#aws`

### GCP

**Key Services:**
- Cloud SQL: PITR with 7-day transaction logs, 30-day retention
- GCS Multi-Regional: Automatic replication across 100+ mile separation
- Regional HA: Synchronous replication within region

**Detailed Guide:** `references/cloud-dr-patterns.md#gcp`

### Azure

**Key Services:**
- Azure Backup: VM backups with flexible retention (daily/weekly/monthly/yearly)
- Azure Site Recovery: Cross-region VM replication with 4-hour app-consistent snapshots
- Geo-Redundant Storage: Automatic replication to secondary region

**Detailed Guide:** `references/cloud-dr-patterns.md#azure`

## Cross-Region Replication Patterns

| Pattern | RTO | RPO | Cost | Use Case |
|---------|-----|-----|------|----------|
| **Active-Active** | < 1 min | < 1 min | High | Both regions serve traffic |
| **Active-Passive** | 15-60 min | 5-15 min | Medium | Standby for failover |
| **Pilot Light** | 10-30 min | 5-15 min | Low | Minimal secondary infra |
| **Warm Standby** | 5-15 min | 5-15 min | Med-High | Scaled-down secondary |

**Implementation Examples:**
- PostgreSQL streaming replication (Active-Passive)
- Aurora Global Database (Active-Active for reads; writes go to a single primary region)
- ASG scale-up automation (Pilot Light)

**Detailed Guide:** `references/cross-region-replication.md`
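
One way to read the pattern table is "pick the cheapest pattern whose worst-case RTO and RPO still meet the target". The sketch below encodes the upper bounds from the table in minutes and walks them from lowest to highest cost. `choose_pattern` is a hypothetical helper, and treating the upper end of each range as the guaranteed bound is an assumption.

```bash
#!/usr/bin/env bash
# Usage: choose_pattern <target_rto_minutes> <target_rpo_minutes>
# Rows: <pattern> <worst_case_rto> <worst_case_rpo>, ordered by cost
# (Low, Medium, Med-High, High) as in the table above.
choose_pattern() {
  local rto=$1 rpo=$2 name max_rto max_rpo
  while read -r name max_rto max_rpo; do
    if [ "$max_rto" -le "$rto" ] && [ "$max_rpo" -le "$rpo" ]; then
      echo "$name"
      return 0
    fi
  done <<'EOF'
pilot-light 30 15
active-passive 60 15
warm-standby 15 15
active-active 1 1
EOF
  echo "none"
  return 1
}

choose_pattern 45 15    # prints pilot-light
choose_pattern 20 15    # prints warm-standby
choose_pattern 5 5      # prints active-active
```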

## Testing Disaster Recovery

### Chaos Engineering

**Purpose:** Validate DR procedures through controlled failure injection.

**Test Scenarios:**
- Database failover (stop primary, measure promotion time)
- Region failure (block network, trigger DNS failover)
- Kubernetes recovery (delete namespace, restore from Velero)

**Tools:** Chaos Mesh, Gremlin, Litmus, Toxiproxy

**Examples:** `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`
**Detailed Guide:** `references/chaos-engineering.md`
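
Measuring promotion time comes down to polling a health probe after the failure is injected. The sketch below is not `examples/chaos/db-failover-test.sh`: `measure_recovery`, `is_primary`, and `POLL_INTERVAL` are hypothetical names. `is_primary` assumes PostgreSQL and a `psql` client that can reach the host.

```bash
#!/usr/bin/env bash
# Usage: measure_recovery <timeout_seconds> <probe command...>
# Polls the probe until it succeeds and prints the elapsed seconds.
measure_recovery() {
  local timeout=$1; shift
  local start now
  start=$(date +%s)
  until "$@" >/dev/null 2>&1; do
    now=$(date +%s)
    if [ $((now - start)) -ge "$timeout" ]; then
      echo "timeout after ${timeout}s" >&2
      return 1
    fi
    sleep "${POLL_INTERVAL:-1}"
  done
  now=$(date +%s)
  echo $((now - start))
}

# Example probe: succeeds once the host has left recovery (been promoted).
is_primary() {
  [ "$(psql -h "$1" -Atc 'SELECT pg_is_in_recovery()')" = "f" ]
}

# After stopping the primary, a drill could run (hypothetical host name):
#   measure_recovery 300 is_primary db-replica.example.internal
measure_recovery 5 true   # trivial probe, succeeds immediately
```

Comparing the printed value against the RTO for the system's tier turns the drill into a pass/fail check.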

### Automated DR Drills

**Run Monthly Tests:**
```bash
./scripts/dr-drill.sh --environment staging --test-type full
./scripts/test-restore.sh --backup latest --target staging-db
```

## Compliance and Retention

| Regulation | Retention | Requirements |
|------------|-----------|--------------|
| GDPR | No fixed period; 1-7 years typical (set by policy) | EU data residency, right to erasure |
| SOC 2 | 1 year+ | Secure deletion, access controls |
| HIPAA | 6 years | Encryption, PHI protection |
| PCI DSS | 3mo-1yr | Secure deletion, quarterly reviews |

**Implement with S3/GCS lifecycle policies:** 30d→Standard-IA, 90d→Glacier, 365d→Deep Archive (S3 class names; the GCS equivalents are Nearline, Coldline, and Archive)
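
For S3, the 30/90/365-day tiering can be expressed as a lifecycle configuration. The sketch below prints the JSON for a given key prefix. The rule ID, the `backups/` prefix, and the bucket name in the comment are placeholders.

```bash
#!/usr/bin/env bash
# Usage: lifecycle_json <key_prefix>
# Prints an S3 lifecycle configuration with the three transitions above.
lifecycle_json() {
  local prefix=$1
  cat <<EOF
{
  "Rules": [
    {
      "ID": "backup-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "${prefix}" },
      "Transitions": [
        { "Days": 30,  "StorageClass": "STANDARD_IA" },
        { "Days": 90,  "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
EOF
}

# To apply it (hypothetical bucket name):
#   lifecycle_json "backups/" > lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket my-backups --lifecycle-configuration file://lifecycle.json
lifecycle_json "backups/"
```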

**Immutable backups:** Use S3 Object Lock or Azure Immutable Blob Storage for ransomware protection.

**Detailed Guide:** `references/compliance-retention.md`

## Monitoring and Alerting

**Key Metrics:** Backup success rate, duration, time since last backup, RPO breach, storage utilization

**Prometheus Alerts:** VeleroBackupFailed, VeleroBackupTooOld, BackupSizeTrend
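
"Time since last backup" and "RPO breach" reduce to one comparison. The sketch below is a hypothetical helper, not one of the skill's scripts. It takes Unix timestamps, so it can be fed from a file modification time or a backup tool's metadata.

```bash
#!/usr/bin/env bash
# Usage: check_rpo <last_backup_epoch> <now_epoch> <rpo_seconds>
# Prints a status line and returns 1 when the backup is older than the RPO.
check_rpo() {
  local last=$1 now=$2 rpo=$3 age
  age=$((now - last))
  if [ "$age" -gt "$rpo" ]; then
    echo "BREACH age=${age}s rpo=${rpo}s"
    return 1
  fi
  echo "OK age=${age}s rpo=${rpo}s"
}

check_rpo 1000 1600 900            # prints: OK age=600s rpo=900s
check_rpo 1000 2000 900 || true    # prints: BREACH age=1000s rpo=900s

# With GNU stat, a real check could look like (hypothetical path):
#   check_rpo "$(stat -c %Y /backups/latest.tar)" "$(date +%s)" 900
```

The non-zero return code makes the function usable directly in cron jobs or CI steps that should fail on a breach.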

**Validation Scripts:**
```bash
./scripts/validate-backup.sh --backup latest --verify-integrity
./scripts/check-retention.sh --report-violations
./scripts/generate-dr-report.sh --format pdf
```

## Automation and Runbooks

**Automate Backup Schedules:** Cron for pgBackRest (weekly full, daily differential), Velero schedules (K8s)

**DR Runbook Steps:** Detect failure → Verify secondary → Promote → Update DNS → Notify → Document

**Detailed Guide:** `references/runbook-automation.md`
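
The runbook steps can be wired together as an ordered list of functions that stops at the first failure. This is a skeleton only: every step below is a simulated placeholder with a hypothetical name, and a real runbook would replace each body with the actual check or action.

```bash
#!/usr/bin/env bash
# Usage: run_runbook <step_function>...
# Runs each step in order and stops at the first one that fails.
run_runbook() {
  local step
  for step in "$@"; do
    echo "==> $step"
    if ! "$step"; then
      echo "FAILED at step: $step" >&2
      return 1
    fi
  done
  echo "runbook complete"
}

# Placeholder steps (simulated).
detect_failure()    { echo "primary unreachable (simulated)"; }
verify_secondary()  { echo "secondary healthy (simulated)"; }
promote_secondary() { echo "secondary promoted (simulated)"; }
update_dns()        { echo "DNS updated (simulated)"; }
notify_team()       { echo "team notified (simulated)"; }
document_incident() { echo "incident documented (simulated)"; }

run_runbook detect_failure verify_secondary promote_secondary \
  update_dns notify_team document_incident
```

Stopping on the first failure matters most between "verify secondary" and "promote": promoting an unhealthy secondary would make the outage worse.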

## Integration with Other Skills

### Related Skills

**Prerequisites:**
- `infrastructure-as-code`: Provision backup infrastructure, DR regions
- `kubernetes-operations`: K8s cluster setup for Velero
- `secret-management`: Backup encryption keys, credentials

**Parallel Skills:**
- `databases-postgresql`: PostgreSQL configuration and operations
- `databases-mysql`: MySQL configuration and operations
- `observability`: Backup monitoring, alerting
- `security-hardening`: Secure backup storage, access control

**Consumer Skills:**
- `incident-management`: Invoke DR procedures during incidents
- `compliance-frameworks`: Meet regulatory requirements

### Skill Chaining Example

```
infrastructure-as-code → secret-management → disaster-recovery → observability
       ↓                        ↓                   ↓                ↓
  Create S3 buckets      Store encryption     Configure backups   Monitor jobs
  Provision databases    keys in Vault        Set up replication  Alert failures
  Setup VPCs             Manage credentials   Test DR drills      Track metrics
```

## Best Practices

### Do

✓ Test restores regularly (monthly for critical systems)
✓ Automate backup monitoring and alerting
✓ Encrypt backups at rest and in transit
✓ Implement 3-2-1 backup rule
✓ Define and measure RTO/RPO
✓ Run chaos experiments to validate DR
✓ Document recovery procedures
✓ Store backups in different regions
✓ Use immutable backups for ransomware protection
✓ Automate DR testing in CI/CD

### Don't

✗ Assume backups work without testing
✗ Store all backups in single region
✗ Skip retention policy definition
✗ Forget to encrypt sensitive data
✗ Rely solely on cloud provider backups
✗ Ignore backup monitoring
✗ Run every backup against the primary database while it is under high load (use a replica where possible)
✗ Store encryption keys with backups

## Reference Documentation

- **RTO/RPO Planning:** `references/rto-rpo-planning.md`
- **Database Backups:** `references/database-backups.md`
- **Kubernetes DR:** `references/kubernetes-dr.md`
- **Cloud DR Patterns:** `references/cloud-dr-patterns.md`
- **Cross-Region Replication:** `references/cross-region-replication.md`
- **Chaos Engineering:** `references/chaos-engineering.md`
- **Compliance Requirements:** `references/compliance-retention.md`
- **Runbook Automation:** `references/runbook-automation.md`

## Examples

- **Runbooks:** `examples/runbooks/database-failover.md`, `examples/runbooks/region-failover.md`
- **PostgreSQL:** `examples/postgresql/pgbackrest-config/`, `examples/postgresql/walg-config/`
- **MySQL:** `examples/mysql/xtrabackup/`, `examples/mysql/walg/`
- **Kubernetes:** `examples/kubernetes/velero/`, `examples/kubernetes/etcd/`
- **Cloud:** `examples/cloud/aws/`, `examples/cloud/gcp/`, `examples/cloud/azure/`
- **Chaos:** `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`

## Scripts

- `scripts/validate-backup.sh`: Verify backup integrity
- `scripts/test-restore.sh`: Automated restore testing
- `scripts/dr-drill.sh`: Run full DR drill
- `scripts/check-retention.sh`: Verify retention policies
- `scripts/generate-dr-report.sh`: Compliance reporting

Overview

This skill helps design and implement practical disaster recovery (DR) strategies across databases, Kubernetes, and cloud infrastructure. It guides teams to define RTO/RPO, choose backup and replication patterns, automate backups, and validate recovery through chaos engineering and runbooks. Outcomes include repeatable backup workflows, tested failover procedures, and compliance-ready retention policies.

How this skill works

The skill inspects system criticality and maps RTO/RPO requirements to an appropriate DR tier and toolset. It provides concrete patterns for database backups (PITR, full/incremental), cluster backup and restore (Velero, etcd snapshots), and cross-region replication (active-active, warm standby, pilot light). It also supplies testing templates for chaos experiments, automated drills, monitoring alerts, and runbooks to operationalize recovery.

When to use it

  • Defining RTO and RPO for services and data
  • Implementing PITR and automated database backups
  • Setting up Kubernetes backups and control-plane snapshots
  • Configuring cross-region replication or multi-region failover
  • Validating DR procedures with chaos engineering and automated drills
  • Meeting regulatory retention and immutable backup requirements

Best practices

  • Classify workloads by RTO/RPO and apply matching DR tier
  • Follow the 3-2-1 backup rule and encrypt backups in transit and at rest
  • Automate scheduled backups and integrity validation with monitoring/alerts
  • Run restore tests monthly for mission-critical systems and include DR drills in CI/CD
  • Use immutable storage or object locking to protect against ransomware
  • Document and maintain runbooks: detect → verify secondary → promote → update DNS → notify

Example use cases

  • PostgreSQL production with pgBackRest + WAL archiving for sub-5-minute RPO
  • Kubernetes cluster protection using Velero with PV snapshots and etcd snapshots for control-plane recovery
  • Cross-region replication: Aurora Global DB for active-active, or pilot-light pattern with ASG scale-up automation
  • Chaos test: simulate primary DB failure and measure promotion time using scripted failover tests
  • Compliance: implement S3 lifecycle, immutability, and retention rules to meet GDPR/SOC2/HIPAA

FAQ

How do I choose between active-active and warm-standby?

Map the required RTO/RPO against budget: active-active gives sub-minute RTO/RPO at the highest cost; warm standby gives minutes-level recovery at moderate cost. Use active-passive or pilot light for lower budgets and longer acceptable recovery times.

How often should I test restores?

Test monthly for mission-critical systems, quarterly for important systems, and at least annually for standard services. Include automated validation in CI/CD for frequent, repeatable checks.