planning-disaster-recovery skill

This skill helps design and validate disaster recovery plans with RTO/RPO targets, cross-region replication, and chaos testing to ensure resilience.

npx playbooks add skill ancoleman/ai-design-components --skill planning-disaster-recovery

---
name: planning-disaster-recovery
description: Design and implement disaster recovery strategies with RTO/RPO planning, database backups, Kubernetes DR, cross-region replication, and chaos engineering testing. Use when implementing backup systems, configuring point-in-time recovery, setting up multi-region failover, or validating DR procedures.
---

# Disaster Recovery

## Purpose

Provide comprehensive guidance for designing disaster recovery (DR) strategies, implementing backup systems, and validating recovery procedures across databases, Kubernetes clusters, and cloud infrastructure. Enable teams to define RTO/RPO objectives, select appropriate backup tools, configure automated failover, and test DR capabilities through chaos engineering.

## When to Use This Skill

Invoke this skill when:
- Defining recovery time objectives (RTO) and recovery point objectives (RPO)
- Implementing database backups with point-in-time recovery (PITR)
- Setting up Kubernetes cluster backup and restore workflows
- Configuring cross-region replication for high availability
- Testing disaster recovery procedures through chaos experiments
- Meeting compliance requirements (GDPR, SOC 2, HIPAA)
- Automating backup monitoring and alerting
- Designing multi-cloud disaster recovery architectures

## Core Concepts

### RTO and RPO Fundamentals

**Recovery Time Objective (RTO):** Maximum acceptable downtime after a disaster before business impact becomes unacceptable.

**Recovery Point Objective (RPO):** Maximum acceptable data loss measured in time. Defines how far back in time recovery must reach.

**Criticality Tiers:**
- **Tier 0 (Mission-Critical):** RTO < 1 hour, RPO < 5 minutes
- **Tier 1 (Production):** RTO 1-4 hours, RPO 15-60 minutes
- **Tier 2 (Important):** RTO 4-24 hours, RPO 1-6 hours
- **Tier 3 (Standard):** RTO > 24 hours, RPO > 6 hours
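
The tier boundaries above can be turned into a small classifier. This is a minimal sketch, not part of the skill's scripts: the function names are hypothetical, the stricter of the two tiers wins when RTO and RPO disagree, and an RPO of 5-15 minutes (a gap in the list) is treated as Tier 1.

```bash
# Classify a workload into a criticality tier from RTO and RPO, both in minutes.
# Assumption: when RTO and RPO point to different tiers, the stricter tier wins.
# Assumption: an RPO of 5-15 minutes (not covered by the list) counts as Tier 1.

tier_for_rto() {
  if [ "$1" -lt 60 ]; then echo 0
  elif [ "$1" -le 240 ]; then echo 1
  elif [ "$1" -le 1440 ]; then echo 2
  else echo 3
  fi
}

tier_for_rpo() {
  if [ "$1" -lt 5 ]; then echo 0
  elif [ "$1" -le 60 ]; then echo 1
  elif [ "$1" -le 360 ]; then echo 2
  else echo 3
  fi
}

classify_tier() {   # $1 = RTO minutes, $2 = RPO minutes
  rto_tier=$(tier_for_rto "$1")
  rpo_tier=$(tier_for_rpo "$2")
  if [ "$rto_tier" -lt "$rpo_tier" ]; then echo "$rto_tier"; else echo "$rpo_tier"; fi
}

classify_tier 30 3       # RTO 30 min, RPO 3 min  -> prints 0
classify_tier 180 30     # RTO 3 h,   RPO 30 min -> prints 1
classify_tier 2880 720   # RTO 48 h,  RPO 12 h   -> prints 3
```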

### 3-2-1 Backup Rule

Maintain **3 copies** of data on **2 different media** types with **1 copy offsite**.

Example implementation:
- Primary: Production database
- Secondary: Local backup storage
- Tertiary: Cloud backup (S3/GCS/Azure)
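
A backup inventory can be checked against the rule mechanically. The sketch below assumes a made-up inventory format (one copy per line: name, media type, `onsite` or `offsite`); `check_321` is a hypothetical helper, not one of the skill's scripts.

```bash
# Check a backup inventory against the 3-2-1 rule.
# Input (stdin): one copy per line as "<name> <media> <onsite|offsite>".
# The inventory format is an assumption made for this sketch.
check_321() {
  awk '
    NF >= 3 { copies++; media[$2] = 1; if ($3 == "offsite") offsite++ }
    END {
      for (m in media) kinds++
      printf "copies=%d media=%d offsite=%d\n", copies, kinds, offsite
      # Exit 0 only when all three conditions hold.
      exit !(copies >= 3 && kinds >= 2 && offsite >= 1)
    }'
}

check_321 <<'EOF'
production-db  ssd     onsite
local-backup   disk    onsite
cloud-backup   object  offsite
EOF
```

The function exits non-zero when any of the three conditions fails, so it can gate a CI job or a scheduled audit.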

### Backup Types

**Full Backup:** Complete copy of all data. Slowest to create, fastest to restore.

**Incremental Backup:** Only changes since last backup. Fastest to create, requires full + all incrementals to restore.

**Differential Backup:** Changes since last full backup. Balance between storage and restore speed.

**Continuous Backup:** Real-time or near-real-time backup via WAL/binlog archiving. Lowest RPO.
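
The restore cost of each backup type shows up in the chain of backups a restore has to read. The sketch below computes that chain for the newest backup in a sequence. It assumes an incremental depends on the previous backup of any type and a differential depends on the most recent full; individual tools differ in the details, and `restore_chain` is a hypothetical helper.

```bash
# Print the backups needed to restore the newest backup in a sequence.
# Arguments: backup types in creation order, F=full, D=differential, I=incremental.
# Assumption: an incremental builds on the previous backup of any type, and a
# differential builds on the most recent full backup.
restore_chain() {
  reversed=""
  for b in "$@"; do reversed="$b $reversed"; done   # newest first

  chain=""
  have_diff=0
  for b in $reversed; do
    case "$b" in
      F) chain="F $chain"; echo "${chain% }"; return 0 ;;   # every chain starts at a full
      D) if [ "$have_diff" -eq 0 ]; then chain="D $chain"; have_diff=1; fi ;;
      I) if [ "$have_diff" -eq 0 ]; then chain="I $chain"; fi ;;
    esac
  done
  echo "no full backup found" >&2
  return 1
}

restore_chain F I I D I I   # prints: F D I I
```

With only incrementals after the full (`F I I I`), every backup is in the chain, which is why incremental-only schedules restore more slowly.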

## Quick Decision Framework

### Step 1: Map RTO/RPO to Strategy

```
RTO < 1 hour, RPO < 5 min
→ Active-Active replication, continuous archiving, automated failover
→ Tools: Aurora Global DB, GCS Multi-Region, pgBackRest PITR
→ Cost: Highest

RTO 1-4 hours, RPO 15-60 min
→ Warm standby, incremental backups, automated failover
→ Tools: pgBackRest, WAL-G, RDS Multi-AZ
→ Cost: High

RTO 4-24 hours, RPO 1-6 hours
→ Daily full + incremental, cross-region backup
→ Tools: pgBackRest, Velero, Restic
→ Cost: Medium

RTO > 24 hours, RPO > 6 hours
→ Weekly full + daily incremental, single region
→ Tools: pg_dump, mysqldump, S3 versioning
→ Cost: Low
```

### Step 2: Select Backup Tools by Use Case

| Use Case | Primary Tool | Alternative | Key Feature |
|----------|-------------|-------------|-------------|
| PostgreSQL production | pgBackRest | WAL-G | PITR, compression, multi-repo |
| MySQL production | Percona XtraBackup | WAL-G | Hot backups, incremental |
| MongoDB | Atlas Backup | mongodump | Continuous backup, PITR |
| Kubernetes cluster | Velero | ArgoCD + Git | PV snapshots, scheduling |
| File/object backup | Restic | Duplicity | Encryption, deduplication |
| Cross-region replication | Aurora Global DB | RDS Read Replica | Low-lag replication, readable secondary regions |

## Database Backup Patterns

### PostgreSQL with pgBackRest

**Use Case:** Production PostgreSQL with < 5 minute RPO

**Quick Start:** See `examples/postgresql/pgbackrest-config/`

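Configure continuous WAL archiving with full, differential, and incremental backups to S3/GCS/Azure. Schedule weekly full and daily differential backups. Restore with `pgbackrest --stanza=main --delta restore`; for PITR, add `--type=time --target="<timestamp>"` so recovery stops at the chosen point instead of replaying all archived WAL.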

**Detailed Guide:** `references/database-backups.md#postgresql`
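
As a rough illustration of the schedule and restore flow, the sketch below wraps the pgBackRest commands in a dry-run helper that prints each command unless `DRY_RUN=0` is set. The wrapper, the function names, and the cron times are assumptions for this example; the stanza name `main` comes from the text above.

```bash
# Dry-run wrapper: prints commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

STANZA="${STANZA:-main}"

backup_full() { run pgbackrest --stanza="$STANZA" --type=full backup; }
backup_diff() { run pgbackrest --stanza="$STANZA" --type=diff backup; }

# Restore to a point in time, e.g. restore_pitr "2025-01-15 09:30:00+00".
# --delta reuses files already present in the data directory.
restore_pitr() {
  run pgbackrest --stanza="$STANZA" --delta \
    --type=time --target="$1" --target-action=promote restore
}

# Example crontab entries (times are placeholders):
#   0 2 * * 0    pgbackrest --stanza=main --type=full backup   # weekly full
#   0 2 * * 1-6  pgbackrest --stanza=main --type=diff backup   # daily differential
```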

### MySQL with Percona XtraBackup

**Use Case:** MySQL production requiring hot backups

**Quick Start:** See `examples/mysql/xtrabackup/`

Perform full (`xtrabackup --backup --parallel=4`) and incremental backups with binary log archiving for PITR. Restore requires decompress, prepare, apply incrementals, and copy-back steps.

**Detailed Guide:** `references/database-backups.md#mysql`
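
The backup and restore steps can be sketched as follows. The directory layout under `/data/backups` and the function names are hypothetical, and the `run` helper only prints each command unless `DRY_RUN=0` is set. If the backup was taken with compression, an `xtrabackup --decompress` step would come before the prepare phase.

```bash
# Dry-run wrapper: prints commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

BACKUP_ROOT="${BACKUP_ROOT:-/data/backups}"   # hypothetical location

full_backup() {
  run xtrabackup --backup --parallel=4 --target-dir="$BACKUP_ROOT/full"
}

incremental_backup() {   # $1 = incremental name, $2 = directory it builds on
  run xtrabackup --backup --target-dir="$BACKUP_ROOT/$1" --incremental-basedir="$2"
}

# Prepare the full backup, then merge one incremental into it.
# --apply-log-only keeps the backup able to accept further incrementals, so it
# is left off when the final incremental is applied.
prepare_with_incremental() {   # $1 = name of the last incremental
  run xtrabackup --prepare --apply-log-only --target-dir="$BACKUP_ROOT/full"
  run xtrabackup --prepare --target-dir="$BACKUP_ROOT/full" \
    --incremental-dir="$BACKUP_ROOT/$1"
}

# Copy the prepared files into an empty MySQL data directory.
copy_back() { run xtrabackup --copy-back --target-dir="$BACKUP_ROOT/full"; }
```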

### MongoDB Backup

**Quick Start:** Use `mongodump --gzip --numParallelCollections=4` for logical backups or MongoDB Atlas for continuous backup with PITR.

**Detailed Guide:** `references/database-backups.md#mongodb`

## Kubernetes Disaster Recovery

### Velero for Cluster Backups

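**Quick Start:** `velero install --provider aws --bucket my-backups` (a working install also needs `--plugins` for the provider plugin image, credentials via `--secret-file`, and usually `--backup-location-config region=<region>`)

Configure scheduled backups (daily full, hourly for the production namespace) with PV snapshots. Restore with `velero restore create --from-backup <name>`. Velero supports selective restores, including namespace mappings and storage class remapping.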

**Examples:** `examples/kubernetes/velero/`
**Detailed Guide:** `references/kubernetes-dr.md`
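
The schedules and a selective restore might look like the sketch below. Schedule names, cron times, TTLs, and the `production` namespace are placeholders, and the `run` helper prints each command unless `DRY_RUN=0` is set.

```bash
# Dry-run wrapper: prints commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

create_schedules() {
  # Daily backup of the whole cluster at 02:00, kept for 30 days.
  run velero schedule create daily-full --schedule="0 2 * * *" --ttl 720h
  # Hourly backup of one namespace, kept for 7 days.
  run velero schedule create prod-hourly --schedule="0 * * * *" \
    --include-namespaces production --ttl 168h
}

# Restore a backup into a different namespace.
# Usage: restore_to_namespace <backup-name> <source-namespace> <target-namespace>
restore_to_namespace() {
  run velero restore create --from-backup "$1" --namespace-mappings "$2:$3"
}
```

Restoring into a mapped namespace is a convenient way to rehearse a restore without touching the live workload.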

### etcd Backup

**Quick Start:** `ETCDCTL_API=3 etcdctl snapshot save /backups/etcd/snapshot.db`

Create periodic etcd snapshots for control plane recovery. Restoring means rebuilding each etcd member's data directory from the snapshot and restarting the control plane against it.

**Examples:** `examples/kubernetes/etcd/`
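
A timestamped snapshot and a restore into a fresh data directory could be scripted roughly as below. The certificate paths assume a kubeadm-style layout, and `etcdutl` assumes etcd 3.5 or later (older releases use `etcdctl snapshot restore`); both are assumptions, as is the `run` dry-run helper.

```bash
# Dry-run wrapper: prints commands by default; set DRY_RUN=0 to execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

SNAP_DIR="${SNAP_DIR:-/backups/etcd}"
PKI="${PKI:-/etc/kubernetes/pki/etcd}"   # kubeadm-style layout; an assumption

snapshot_save() {
  snap="$SNAP_DIR/snapshot-$(date +%Y%m%d-%H%M%S).db"
  run env ETCDCTL_API=3 etcdctl snapshot save "$snap" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert="$PKI/ca.crt" --cert="$PKI/server.crt" --key="$PKI/server.key"
}

# Restore writes a new data directory; etcd is then started against it.
snapshot_restore() {   # $1 = snapshot file, $2 = new data directory
  run etcdutl snapshot restore "$1" --data-dir "$2"
}
```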

## Cloud-Specific DR Patterns

### AWS

**Key Services:**
- RDS: Automated backups (up to 35-day retention), PITR, Multi-AZ
- Aurora Global DB: Cross-region replication with one writer region, read-only secondary regions, and managed failover
- S3 CRR: Cross-region replication with 15-min SLA (Replication Time Control)

**Examples:** `examples/cloud/aws/`
**Detailed Guide:** `references/cloud-dr-patterns.md#aws`

### GCP

**Key Services:**
- Cloud SQL: PITR with 7-day transaction logs, 30-day retention
- GCS Multi-Regional: Automatic replication across 100+ mile separation
- Regional HA: Synchronous replication within region

**Detailed Guide:** `references/cloud-dr-patterns.md#gcp`

### Azure

**Key Services:**
- Azure Backup: VM backups with flexible retention (daily/weekly/monthly/yearly)
- Azure Site Recovery: Cross-region VM replication with 4-hour app-consistent snapshots
- Geo-Redundant Storage: Automatic replication to secondary region

**Detailed Guide:** `references/cloud-dr-patterns.md#azure`

## Cross-Region Replication Patterns

| Pattern | RTO | RPO | Cost | Use Case |
|---------|-----|-----|------|----------|
| **Active-Active** | < 1 min | < 1 min | High | Both regions serve traffic |
| **Active-Passive** | 15-60 min | 5-15 min | Medium | Standby for failover |
| **Pilot Light** | 10-30 min | 5-15 min | Low | Minimal secondary infra |
| **Warm Standby** | 5-15 min | 5-15 min | Med-High | Scaled-down secondary |

**Implementation Examples:**
- PostgreSQL streaming replication (Active-Passive)
- Aurora Global Database (single writer region; secondary regions serve reads)
- ASG scale-up automation (Pilot Light)

**Detailed Guide:** `references/cross-region-replication.md`
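
To shortlist patterns for a given requirement, the table can be filtered by worst-case figures. This sketch uses the upper bound of each range in the table, treats "< 1 min" as 1 minute, and uses a hypothetical function name; it is a starting point for discussion rather than a sizing tool.

```bash
# List replication patterns whose worst-case RTO/RPO (minutes) fit a requirement.
# Figures are the upper bounds from the table above; "< 1 min" is treated as 1.
patterns_meeting() {   # $1 = required RTO minutes, $2 = required RPO minutes
  while read -r name rto rpo cost; do
    if [ "$rto" -le "$1" ] && [ "$rpo" -le "$2" ]; then
      echo "$name (cost: $cost)"
    fi
  done <<'EOF'
Active-Active 1 1 High
Warm-Standby 15 15 Med-High
Pilot-Light 30 15 Low
Active-Passive 60 15 Medium
EOF
}

# Requirement: recover within 30 minutes, lose at most 15 minutes of data.
patterns_meeting 30 15
```

For the 30/15 requirement this lists Active-Active, Warm-Standby, and Pilot-Light, leaving cost as the deciding factor.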

## Testing Disaster Recovery

### Chaos Engineering

**Purpose:** Validate DR procedures through controlled failure injection.

**Test Scenarios:**
- Database failover (stop primary, measure promotion time)
- Region failure (block network, trigger DNS failover)
- Kubernetes recovery (delete namespace, restore from Velero)

**Tools:** Chaos Mesh, Gremlin, Litmus, Toxiproxy

**Examples:** `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`
**Detailed Guide:** `references/chaos-engineering.md`
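
Measuring promotion time comes down to polling a health check after the failure is injected. The sketch below is one way to do that; `measure_recovery` and `is_primary` are hypothetical helpers, the host name is a placeholder, and this is not a reproduction of `examples/chaos/db-failover-test.sh`.

```bash
# Poll a health check until it succeeds and print the elapsed seconds.
# Usage: measure_recovery <timeout_s> <interval_s> <check command...>
measure_recovery() {
  timeout_s=$1; interval_s=$2; shift 2
  start=$(date +%s)
  while :; do
    if "$@" >/dev/null 2>&1; then
      echo $(( $(date +%s) - start ))
      return 0
    fi
    if [ $(( $(date +%s) - start )) -ge "$timeout_s" ]; then
      echo "timed out after ${timeout_s}s" >&2
      return 1
    fi
    sleep "$interval_s"
  done
}

# True once the given PostgreSQL host has left recovery, i.e. was promoted.
is_primary() {
  [ "$(psql -h "$1" -Atc 'SELECT pg_is_in_recovery()' 2>/dev/null)" = "f" ]
}

# Example: stop the primary, then time the standby's promotion and compare the
# result with the RTO target.
#   measure_recovery 300 2 is_primary standby.example.internal
```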

### Automated DR Drills

**Run Monthly Tests:**
```bash
./scripts/dr-drill.sh --environment staging --test-type full
./scripts/test-restore.sh --backup latest --target staging-db
```

## Compliance and Retention

| Regulation | Retention | Requirements |
|------------|-----------|--------------|
| GDPR | No fixed period (1-7 years is common policy) | Storage limitation, cross-border transfer rules, right to erasure |
| SOC 2 | 1 year+ | Secure deletion, access controls |
| HIPAA | 6 years | Encryption, PHI protection |
| PCI DSS | 3mo-1yr | Secure deletion, quarterly reviews |

**Implement with S3/GCS lifecycle policies:** 30d→Standard-IA, 90d→Glacier, 365d→Deep Archive (S3 class names; the GCS counterparts are Nearline, Coldline, and Archive)
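
For S3, the tiering schedule maps onto a lifecycle configuration like the one generated below. The rule ID, the `backups/` prefix, and the bucket name are placeholders, and the helper function is only a convenient way to keep the JSON next to the command that applies it.

```bash
# Print an S3 lifecycle configuration for the 30/90/365-day tiering schedule.
# The rule ID and the "backups/" prefix are placeholders.
lifecycle_json() {
  cat <<'EOF'
{
  "Rules": [
    {
      "ID": "backup-tiering",
      "Status": "Enabled",
      "Filter": { "Prefix": "backups/" },
      "Transitions": [
        { "Days": 30,  "StorageClass": "STANDARD_IA" },
        { "Days": 90,  "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
EOF
}

# Apply it (bucket name is a placeholder):
#   lifecycle_json > lifecycle.json
#   aws s3api put-bucket-lifecycle-configuration \
#     --bucket my-backups --lifecycle-configuration file://lifecycle.json
```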

**Immutable backups:** Use S3 Object Lock or Azure Immutable Blob Storage for ransomware protection.

**Detailed Guide:** `references/compliance-retention.md`

## Monitoring and Alerting

**Key Metrics:** Backup success rate, duration, time since last backup, RPO breach, storage utilization

**Prometheus Alerts:** VeleroBackupFailed, VeleroBackupTooOld, BackupSizeTrend
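
The "time since last backup" and "RPO breach" metrics reduce to a simple age comparison. A minimal sketch, assuming the timestamp of the last successful backup is available as a Unix epoch; `check_rpo` is a hypothetical helper, not one of the skill's scripts.

```bash
# Report whether the newest backup is older than the RPO allows.
# Usage: check_rpo <last_backup_epoch> <rpo_minutes> [now_epoch]
check_rpo() {
  now=${3:-$(date +%s)}
  age_min=$(( (now - $1) / 60 ))
  if [ "$age_min" -gt "$2" ]; then
    echo "RPO BREACH: last backup ${age_min}m ago, target ${2}m"
    return 1
  fi
  echo "ok: last backup ${age_min}m ago, target ${2}m"
}

# Example: a backup finished 10 minutes before "now", with a 15-minute RPO.
check_rpo 1000 15 1600   # prints: ok: last backup 10m ago, target 15m
```

The non-zero exit status on a breach lets a cron job or monitoring probe alert without parsing the message.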

**Validation Scripts:**
```bash
./scripts/validate-backup.sh --backup latest --verify-integrity
./scripts/check-retention.sh --report-violations
./scripts/generate-dr-report.sh --format pdf
```

## Automation and Runbooks

**Automate Backup Schedules:** Cron for pgBackRest (weekly full, daily differential), Velero schedules (K8s)

**DR Runbook Steps:** Detect failure → Verify secondary → Promote → Update DNS → Notify → Document

**Detailed Guide:** `references/runbook-automation.md`
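
The runbook sequence can be automated as an ordered list of steps that halts on the first failure. In the sketch below every step body is a placeholder that only prints a message; real checks and actions would replace them.

```bash
# Run the runbook steps in order and stop at the first failure.
# Each step is a placeholder; replace the bodies with real checks and actions.
detect_failure()   { echo "primary unreachable"; }
verify_secondary() { echo "secondary healthy, replication lag acceptable"; }
promote()          { echo "secondary promoted"; }
update_dns()       { echo "DNS pointed at secondary"; }
notify()           { echo "on-call and stakeholders notified"; }
document()         { echo "timeline recorded"; }

run_runbook() {
  for step in detect_failure verify_secondary promote update_dns notify document; do
    echo "[$(date -u +%H:%M:%S)] $step"   # timestamped log line per step
    if ! "$step"; then
      echo "step failed: $step, stopping for manual intervention" >&2
      return 1
    fi
  done
}
```

Stopping before `update_dns` when `promote` fails matters: redirecting traffic to a secondary that was never promoted would turn a partial outage into a full one.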

## Integration with Other Skills

### Related Skills

**Prerequisites:**
- `infrastructure-as-code`: Provision backup infrastructure, DR regions
- `kubernetes-operations`: K8s cluster setup for Velero
- `secret-management`: Backup encryption keys, credentials

**Parallel Skills:**
- `databases-postgresql`: PostgreSQL configuration and operations
- `databases-mysql`: MySQL configuration and operations
- `observability`: Backup monitoring, alerting
- `security-hardening`: Secure backup storage, access control

**Consumer Skills:**
- `incident-management`: Invoke DR procedures during incidents
- `compliance-frameworks`: Meet regulatory requirements

### Skill Chaining Example

```
infrastructure-as-code → secret-management → disaster-recovery → observability
       ↓                        ↓                   ↓                ↓
  Create S3 buckets      Store encryption     Configure backups   Monitor jobs
  Provision databases    keys in Vault        Set up replication  Alert failures
  Setup VPCs             Manage credentials   Test DR drills      Track metrics
```

## Best Practices

### Do

✓ Test restores regularly (monthly for critical systems)
✓ Automate backup monitoring and alerting
✓ Encrypt backups at rest and in transit
✓ Implement 3-2-1 backup rule
✓ Define and measure RTO/RPO
✓ Run chaos experiments to validate DR
✓ Document recovery procedures
✓ Store backups in different regions
✓ Use immutable backups for ransomware protection
✓ Automate DR testing in CI/CD

### Don't

✗ Assume backups work without testing
✗ Store all backups in single region
✗ Skip retention policy definition
✗ Forget to encrypt sensitive data
✗ Rely solely on cloud provider backups
✗ Ignore backup monitoring
✗ Run backups only against the primary database when it is under high load (use a replica instead)
✗ Store encryption keys with backups

## Reference Documentation

- **RTO/RPO Planning:** `references/rto-rpo-planning.md`
- **Database Backups:** `references/database-backups.md`
- **Kubernetes DR:** `references/kubernetes-dr.md`
- **Cloud DR Patterns:** `references/cloud-dr-patterns.md`
- **Cross-Region Replication:** `references/cross-region-replication.md`
- **Chaos Engineering:** `references/chaos-engineering.md`
- **Compliance Requirements:** `references/compliance-retention.md`
- **Runbook Automation:** `references/runbook-automation.md`

## Examples

- **Runbooks:** `examples/runbooks/database-failover.md`, `examples/runbooks/region-failover.md`
- **PostgreSQL:** `examples/postgresql/pgbackrest-config/`, `examples/postgresql/walg-config/`
- **MySQL:** `examples/mysql/xtrabackup/`, `examples/mysql/walg/`
- **Kubernetes:** `examples/kubernetes/velero/`, `examples/kubernetes/etcd/`
- **Cloud:** `examples/cloud/aws/`, `examples/cloud/gcp/`, `examples/cloud/azure/`
- **Chaos:** `examples/chaos/db-failover-test.sh`, `examples/chaos/region-failure-test.sh`

## Scripts

- `scripts/validate-backup.sh`: Verify backup integrity
- `scripts/test-restore.sh`: Automated restore testing
- `scripts/dr-drill.sh`: Run full DR drill
- `scripts/check-retention.sh`: Verify retention policies
- `scripts/generate-dr-report.sh`: Compliance reporting

Overview

This skill helps design and implement practical disaster recovery (DR) strategies across databases, Kubernetes, and cloud infrastructure. It guides teams to define RTO/RPO, choose backup and replication patterns, automate backups, and validate recovery through chaos engineering and runbooks. Outcomes include repeatable backup workflows, tested failover procedures, and compliance-ready retention policies.

How this skill works

The skill inspects system criticality and maps RTO/RPO requirements to an appropriate DR tier and toolset. It provides concrete patterns for database backups (PITR, full/incremental), cluster backup and restore (Velero, etcd snapshots), and cross-region replication (active-active, warm standby, pilot light). It also supplies testing templates for chaos experiments, automated drills, monitoring alerts, and runbooks to operationalize recovery.

When to use it

- Defining RTO and RPO for services and data
- Implementing PITR and automated database backups
- Setting up Kubernetes backups and control-plane snapshots
- Configuring cross-region replication or multi-region failover
- Validating DR procedures with chaos engineering and automated drills
- Meeting regulatory retention and immutable backup requirements

Best practices

- Classify workloads by RTO/RPO and apply matching DR tier
- Follow the 3-2-1 backup rule and encrypt backups in transit and at rest
- Automate scheduled backups and integrity validation with monitoring/alerts
- Run restore tests monthly for mission-critical systems and include DR drills in CI/CD
- Use immutable storage or object locking to protect against ransomware
- Document and maintain runbooks: detect → verify secondary → promote → update DNS → notify

Example use cases

- PostgreSQL production with pgBackRest + WAL archiving for sub-5-minute RPO
- Kubernetes cluster protection using Velero with PV snapshots and etcd snapshots for control-plane recovery
- Cross-region replication: Aurora Global DB for readable secondary regions with managed failover, or a pilot-light pattern with ASG scale-up automation
- Chaos test: simulate primary DB failure and measure promotion time using scripted failover tests
- Compliance: implement S3 lifecycle, immutability, and retention rules to meet GDPR/SOC2/HIPAA

FAQ

How do I choose between active-active and warm-standby?

Compare the required RTO/RPO with the budget: active-active suits sub-minute RTO/RPO at the highest cost, while warm standby gives recovery within minutes at moderate cost. Use active-passive or pilot light when budgets are lower and longer recovery is acceptable.

How often should I test restores?

Test monthly for mission-critical systems, quarterly for important systems, and at least annually for standard services. Include automated validation in CI/CD for frequent, repeatable checks.