home / skills / ancoleman / ai-design-components / planning-disaster-recovery
This skill helps design and validate disaster recovery plans with RTO/RPO targets, cross-region replication, and chaotic testing to ensure resilience.
npx playbooks add skill ancoleman/ai-design-components --skill planning-disaster-recoveryReview the files below or copy the command above to add this skill to your agents.
---
name: planning-disaster-recovery
description: Design and implement disaster recovery strategies with RTO/RPO planning, database backups, Kubernetes DR, cross-region replication, and chaos engineering testing. Use when implementing backup systems, configuring point-in-time recovery, setting up multi-region failover, or validating DR procedures.
---
# Disaster Recovery
## Purpose
Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.
## When to Use This Skill
Invoke this skill when:
- Defining recovery time objectives (RTO) and recovery point objectives (RPO)
- Implementing database backups with point-in-time recovery (PITR)
- Setting up Kubernetes cluster backup and restore workflows
- Configuring cross-region replication for high availability
- Testing disaster recovery procedures through chaos experiments
- Meeting compliance requirements (GDPR, SOC 2, HIPAA)
- Automating backup monitoring and alerting
- Designing multi-cloud disaster recovery architectures
## Core Concepts
### RTO and RPO Fundamentals
**Recovery Time Objective (RTO):** Maximum acceptable downtime after a disaster before business impact becomes unacceptable.
**Recovery Point Objective (RPO):** Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.
**Criticality Tiers:**
- **Tier 0 (Mission-Critical):** RTO < 1 hour, RPO < 5 minutes
- **Tier 1 (Production):** RTO 1-4 hours, RPO 15-60 minutes
- **Tier 2 (Important):** RTO 4-24 hours, RPO 1-6 hours
- **Tier 3 (Standard):** RTO > 24 hours, RPO > 6 hours
### 3-2-1 Backup Rule
Maintain **3 copies** of data on **2 different media** types with **1 copy offsite**.
Example implementation:
- Primary: Production database
- Secondary: Local backup storage
- Tertiary: Cloud backup (S3/GCS/Azure)
### Backup Types
**Full Backup:** Complete copy of all data. Slowest to create, fastest to restore.
**Incremental Backup:** Only changes since last backup. Fastest to create, requires full + all incrementals to restore.
**Differential Backup:** Changes since last full backup. Balance between storage and restore speed.
**Continuous Backup:** Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.
## Quick Decision Framework
### Step 1: Map RTO/RPO to Strategy
```
RTO < 1 hour, RPO < 5 min
→ Active-Active replication, continuous archiving, automated failover
→ Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR
→ Cost: Highest
RTO 1-4 hours, RPO 15-60 min
→ Warm standby, incremental backups, automated failover
→ Tools: pgBackRest, WAL-G, RDS Multi-AZ
→ Cost: High
RTO 4-24 hours, RPO 1-6 hours
→ Daily full + incremental, cross-region backup
→ Tools: pgBackRest, Velero, Restic
→ Cost: Medium
RTO > 24 hours, RPO > 6 hours
→ Weekly full + daily incremental, single region
→ Tools: pg_dump, mysqldump, S3 versioning
→ Cost: Low
```
### Step 2: Select Backup Tools by Use Case
| Use Case | Primary Tool | Alternative | Key Feature |
|----------|-------------|-------------|-------------|
| PostgreSQL production | pgBackRest | WAL-G | PITR, compression, multi-repo |
| MySQL production | Percona XtraBackup | WAL-G | Hot backups, incremental |
| MongoDB | Atlas Backup | mongodump | Continuous backup, PITR |
| Kubernetes cluster | Velero | ArgoCD + Git | PV snapshots, scheduling |
| File/object backup | Restic | Duplicity | Encryption, deduplication |
| Cross-region replication | Aurora Global DB | RDS Read Replica | Active-Active capable |
## Database Backup Patterns
### PostgreSQL with pgBackRest
**Use Case:** Production PostgreSQL with < 5 minute RPO
**Quick Start:** See `examples/postgresql/pgbackrest-config/`
Configure continuous WAL archiving with full/differential/incremental backups to S3/GCS/Azure. Schedule weekly full, daily differential backups. Enable PITR with `pgbackrest --stanza=main --delta restore`.
**Detailed Guide:** `references/database-backups.md#postgresql`
### MySQL with Percona XtraBackup
**Use Case:** MySQL production requiring hot backups
**Quick Start:** See `examples/mysql/xtrabackup/`
Perform full (`xtrabackup --backup --parallel=4`) and incremental backups with binary log archiving for PITR. Restore requires decompress, prepare, apply incrementals, and copy-back steps.
**Detailed Guide:** `references/database-backups.md#mysql`
### MongoDB Backup
**Quick Start:** Use `mongodump --gzip --numParallelCollections=4` for logical backups or MongoDB Atlas for continuous backup with PITR.
**Detailed Guide:** `references/database-backups.md#mongodb`
## Kubernetes Disaster Recovery
### Velero for Cluster Backups
**Quick Start:** `velero install --provider aws --bucket my-backups`
Configure scheduled backups (daily full, hourly production namespace) with PV snapshots. Restore with `velero restore create --from-backup <name>`. Support selective restore (namespace mappings, storage class remapping).
**Examples:** `examples/kubernetes/velero/`
**Detailed Guide:** `references/kubernetes-dr.md`
### etcd Backup
**Quick Start:** `ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db`
Create periodic etcd snapshots for control plane recovery. Restore requires cluster recreation with snapshot data.
**Examples:** `examples/kubernetes/etcd/`
## Cloud-Specific DR Patterns
### AWS
**Key Services:**
- RDS: Automated backups (30-day retention), PITR, Multi-AZ
- Aurora Global DB: Cross-region active-passive with automatic failover
- S3 CRR: Cross-region replication with 15-min SLA (Replication Time Control)
**Examples:** `examples/cloud/aws/`
**Detailed Guide:** `references/cloud-dr-patterns.md#aws`
### GCP
**Key Services:**
- Cloud SQL: PITR with 7-day transaction logs, 30-day retention
- GCS Multi-Regional: Automatic replication across 100+ mile separation
- Regional HA: Synchronous replication within region
**Detailed Guide:** `references/cloud-dr-patterns.md#gcp`
### Azure
**Key Services:**
- Azure Backup: VM backups with flexible retention (daily/weekly/monthly/yearly)
- Azure Site Recovery: Cross-region VM replication with 4-hour app-consistent snapshots
- Geo-Redundant Storage: Automatic replication to secondary region
**Detailed Guide:** `references/cloud-dr-patterns.md#azure`
## Cross-Region Replication Patterns
| Pattern | RTO | RPO | Cost | Use Case |
|---------|-----|-----|------|----------|
| **Active-Active** | < 1 min | < 1 min | High | Both regions serve traffic |
| **Active-Passive** | 15-60 min | 5-15 min | Medium | Standby for failover |
| **Pilot Light** | 10-30 min | 5-15 min | Low | Minimal secondary infra |
| **Warm Standby** | 5-15 min | 5-15 min | Med-High | Scaled-down secondary |
**Implementation Examples:**
- PostgreSQL streaming replication (Active-Passive)
- Aurora Global Database (Active-Active)
- ASG scale-up automation (Pilot Light)
**Detailed Guide:** `references/cross-region-replication.md`
## Testing Disaster Recovery
### Chaos Engineering
**Purpose:** Validate DR procedures through controlled failure injection.
**Test Scenarios:**
- Database failover (stop primary, measure promotion time)
- Region failure (block network, trigger DNS failover)
- Kubernetes recovery (delete namespace, restore from Velero)
**Tools:** Chaos Mesh, Gremlin, Litmus, Toxiproxy
**Examples:** `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`
**Detailed Guide:** `references/chaos-engineering.md`
### Automated DR Drills
**Run Monthly Tests:**
```bash
./scripts/dr-drill.sh --environment staging --test-type full
./scripts/test-restore.sh --backup latest --target staging-db
```
## Compliance and Retention
| Regulation | Retention | Requirements |
|------------|-----------|--------------|
| GDPR | 1-7 years | EU data residency, right to erasure |
| SOC 2 | 1 year+ | Secure deletion, access controls |
| HIPAA | 6 years | Encryption, PHI protection |
| PCI DSS | 3mo-1yr | Secure deletion, quarterly reviews |
**Implement with S3/GCS lifecycle policies:** 30d→Standard-IA, 90d→Glacier, 365d→Deep Archive
**Immutable backups:** Use S3 Object Lock or Azure Immutable Blob Storage for ransomware protection.
**Detailed Guide:** `references/compliance-retention.md`
## Monitoring and Alerting
**Key Metrics:** Backup success rate, duration, time since last backup, RPO breach, storage utilization
**Prometheus Alerts:** VeleroBackupFailed, VeleroBackupTooOld, BackupSizeTrend
**Validation Scripts:**
```bash
./scripts/validate-backup.sh --backup latest --verify-integrity
./scripts/check-retention.sh --report-violations
./scripts/generate-dr-report.sh --format pdf
```
## Automation and Runbooks
**Automate Backup Schedules:** Cron for pgBackRest (weekly full, daily differential), Velero schedules (K8s)
**DR Runbook Steps:** Detect failure → Verify secondary → Promote → Update DNS → Notify → Document
**Detailed Guide:** `references/runbook-automation.md`
## Integration with Other Skills
### Related Skills
**Prerequisites:**
- `infrastructure-as-code`: Provision backup infrastructure, DR regions
- `kubernetes-operations`: K8s cluster setup for Velero
- `secret-management`: Backup encryption keys, credentials
**Parallel Skills:**
- `databases-postgresql`: PostgreSQL configuration and operations
- `databases-mysql`: MySQL configuration and operations
- `observability`: Backup monitoring, alerting
- `security-hardening`: Secure backup storage, access control
**Consumer Skills:**
- `incident-management`: Invoke DR procedures during incidents
- `compliance-frameworks`: Meet regulatory requirements
### Skill Chaining Example
```
infrastructure-as-code → secret-management → disaster-recovery → observability
↓ ↓ ↓ ↓
Create S3 buckets Store encryption Configure backups Monitor jobs
Provision databases keys in Vault Set up replication Alert failures
Setup VPCs Manage credentials Test DR drills Track metrics
```
## Best Practices
### Do
✓ Test restores regularly (monthly for critical systems)
✓ Automate backup monitoring and alerting
✓ Encrypt backups at rest and in transit
✓ Implement 3-2-1 backup rule
✓ Define and measure RTO/RPO
✓ Run chaos experiments to validate DR
✓ Document recovery procedures
✓ Store backups in different regions
✓ Use immutable backups for ransomware protection
✓ Automate DR testing in CI/CD
### Don't
✗ Assume backups work without testing
✗ Store all backups in single region
✗ Skip retention policy definition
✗ Forget to encrypt sensitive data
✗ Rely solely on cloud provider backups
✗ Ignore backup monitoring
✗ Perform backups only from primary database under high load
✗ Store encryption keys with backups
## Reference Documentation
- **RTO/RPO Planning:** `references/rto-rpo-planning.md`
- **Database Backups:** `references/database-backups.md`
- **Kubernetes DR:** `references/kubernetes-dr.md`
- **Cloud DR Patterns:** `references/cloud-dr-patterns.md`
- **Cross-Region Replication:** `references/cross-region-replication.md`
- **Chaos Engineering:** `references/chaos-engineering.md`
- **Compliance Requirements:** `references/compliance-retention.md`
- **Runbook Automation:** `references/runbook-automation.md`
## Examples
- **Runbooks:** `examples/runbooks/database-failover.md`, `examples/runbooks/region-failover.md`
- **PostgreSQL:** `examples/postgresql/pgbackrest-config/`, `examples/postgresql/walg-config/`
- **MySQL:** `examples/mysql/xtrabackup/`, `examples/mysql/walg/`
- **Kubernetes:** `examples/kubernetes/velero/`, `examples/kubernetes/etcd/`
- **Cloud:** `examples/cloud/aws/`, `examples/cloud/gcp/`, `examples/cloud/azure/`
- **Chaos:** `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`
## Scripts
- `scripts/validate-backup.sh`: Verify backup integrity
- `scripts/test-restore.sh`: Automated restore testing
- `scripts/dr-drill.sh`: Run full DR drill
- `scripts/check-retention.sh`: Verify retention policies
- `scripts/generate-dr-report.sh`: Compliance reporting
This skill helps design and implement practical disaster recovery (DR) strategies across databases, Kubernetes, and cloud infrastructure. It guides teams to define RTO/RPO, choose backup and replication patterns, automate backups, and validate recovery through chaos engineering and runbooks. Outcomes include repeatable backup workflows, tested failover procedures, and compliance-ready retention policies.
The skill inspects system criticality and maps RTO/RPO requirements to an appropriate DR tier and toolset. It provides concrete patterns for database backups (PITR, full/incremental), cluster backup and restore (Velero, etcd snapshots), and cross-region replication (active-active, warm standby, pilot light). It also supplies testing templates for chaos experiments, automated drills, monitoring alerts, and runbooks to operationalize recovery.
How do I choose between active-active and warm-standby?
Map required RTO/RPO and budget: active-active suits sub-minute RTO/RPO and highest cost; warm-standby gives minutes-level recovery at moderate cost. Use active-passive or pilot-light for lower budgets and longer acceptable recovery.
How often should I test restores?
Test monthly for mission-critical systems, quarterly for important systems, and at least annually for standard services. Include automated validation in CI/CD for frequent, repeatable checks.