disaster-recovery-planner skill

/.claude/skills/disaster-recovery-planner

This skill designs disaster recovery strategies with RTO/RPO targets, multi-region failover, and backup automation to ensure business continuity.

npx playbooks add skill dexploarer/hyper-forge --skill disaster-recovery-planner

SKILL.md

---
name: disaster-recovery-planner
description: Design disaster recovery strategies including backup, failover, RTO/RPO planning, and multi-region deployment for business continuity.
allowed-tools: [Read, Write, Edit, Bash, Grep, Glob]
---

# Disaster Recovery Planner

Design comprehensive disaster recovery strategies for business continuity.

## RTO/RPO Targets

| Tier | RTO | RPO | Cost | Use Case |
|------|-----|-----|------|----------|
| Critical | < 1 hour | < 5 min | High | Payment, Auth |
| Important | < 4 hours | < 1 hour | Medium | Orders, Inventory |
| Standard | < 24 hours | < 24 hours | Low | Reports, Analytics |
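
The tier table can also be encoded as a lookup so that scripts and runbooks read the same targets. A minimal sketch, assuming a hypothetical helper named `dr_targets` that reports the table's upper bounds in minutes:

```bash
# Hypothetical helper: print the RTO/RPO upper bounds (in minutes) for a tier.
# The values mirror the table above; the function name is an assumption.
dr_targets() {
  case "$1" in
    critical)  echo "rto_minutes=60 rpo_minutes=5" ;;
    important) echo "rto_minutes=240 rpo_minutes=60" ;;
    standard)  echo "rto_minutes=1440 rpo_minutes=1440" ;;
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}
```

For example, `dr_targets critical` prints `rto_minutes=60 rpo_minutes=5`, and an unrecognized tier returns a non-zero status so callers can fail fast.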

## Multi-Region Failover

```yaml
# AWS Route53 Health Checks and Failover
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        Type: HTTPS
        ResourcePath: /health
        FullyQualifiedDomainName: api-us-east-1.example.com
        Port: 443
        RequestInterval: 30
        FailureThreshold: 3

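  # Primary record: served while PrimaryHealthCheck passes
  DNSFailover:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z123456
      Name: api.example.com
      Type: A
      SetIdentifier: Primary
      Failover: PRIMARY
      AliasTarget:
        # Hosted zone that contains the alias target record (same zone here)
        HostedZoneId: Z123456
        DNSName: api-us-east-1.example.com
      HealthCheckId: !Ref PrimaryHealthCheck

  # Secondary record: answers once the primary is marked unhealthy.
  # The us-west-2 hostname is an assumed placeholder for the standby region.
  DNSFailoverSecondary:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: Z123456
      Name: api.example.com
      Type: A
      SetIdentifier: Secondary
      Failover: SECONDARY
      AliasTarget:
        HostedZoneId: Z123456
        DNSName: api-us-west-2.example.com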
```
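
With the health check settings above, Route53 needs several consecutive failed probes before it marks the primary unhealthy, so failover is not instant. As a rough estimate, detection takes about `RequestInterval` times `FailureThreshold` seconds, and clients may keep the old answer for up to the TTL they cached. The sketch below does that arithmetic; the function name and the 60-second TTL in the example are assumptions, not values from the template.

```bash
# Rough worst-case seconds before clients reach the secondary:
# (probe interval x consecutive failures) + DNS TTL cached by resolvers.
# This is an approximation; real timing also depends on resolver behavior.
failover_delay_seconds() {
  interval=$1   # RequestInterval from the health check, in seconds
  threshold=$2  # FailureThreshold from the health check
  ttl=$3        # TTL clients may cache, in seconds (assumed value)
  echo $(( interval * threshold + ttl ))
}

# Example: 30 s interval, 3 failures, assumed 60 s TTL
failover_delay_seconds 30 3 60   # prints 150
```

An estimate like this only covers DNS cutover. Time to promote a standby database or warm up capacity in the second region still counts against the RTO.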

## Database Backup Strategy

```bash
#!/bin/bash
# Automated backup script (assumes GNU date and a configured AWS CLI)
set -euo pipefail

TIMESTAMP=$(date +%Y%m%d_%H%M%S)
DB_NAME="production_db"
# S3 bucket names cannot contain underscores, so map them to hyphens
S3_BUCKET="s3://backups-${DB_NAME//_/-}"
RETENTION_DAYS=30

# Full backup daily; pipefail makes the script exit non-zero if pg_dump fails
pg_dump -Fc "$DB_NAME" | \
  aws s3 cp - "${S3_BUCKET}/full/${TIMESTAMP}.dump"

# Point-in-time recovery (WAL archiving)
# No --delete: pruning the local archive must not remove WAL segments from S3
aws s3 sync /var/lib/postgresql/wal_archive \
  "${S3_BUCKET}/wal/"

# Cleanup old backups
CUTOFF=$(date -d "-${RETENTION_DAYS} days" +%s)
aws s3 ls "${S3_BUCKET}/full/" | \
  while read -r createDay createTime _size fileName; do
    # Skip prefix rows and blank lines that carry no object key
    [[ -n "$fileName" ]] || continue
    if [[ $(date -d "$createDay $createTime" +%s) -lt $CUTOFF ]]; then
      aws s3 rm "${S3_BUCKET}/full/${fileName}"
    fi
  done
```
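
A backup is only useful if it can be read back. One hedged way to sketch automated verification is to pick the newest object under `full/` and confirm that `pg_restore` can list its table of contents. The function names `latest_backup_key` and `verify_backup` are hypothetical, and listing the archive is a lightweight check, not a substitute for a periodic full restore into a scratch database.

```bash
# Read `aws s3 ls` output on stdin (date, time, size, key per row) and print
# the key of the most recent object. Rows without a key (prefixes) are skipped.
# Assumes keys contain no spaces, as with the timestamped names used above.
latest_backup_key() {
  awk 'NF >= 4 { print $1 " " $2 " " $4 }' | sort | tail -n 1 | awk '{ print $3 }'
}

# Hypothetical verification step: download the newest dump and ask pg_restore
# to list its contents, which fails if the archive is unreadable.
# Assumes the AWS CLI and PostgreSQL client tools are installed.
verify_backup() {
  bucket=$1   # bucket URL, e.g. the S3_BUCKET value from the backup script
  key=$(aws s3 ls "${bucket}/full/" | latest_backup_key)
  [ -n "$key" ] || { echo "no backups found under ${bucket}/full/" >&2; return 1; }
  tmp=$(mktemp) || return 1
  aws s3 cp "${bucket}/full/${key}" "$tmp" --only-show-errors &&
    pg_restore --list "$tmp" > /dev/null
  status=$?
  rm -f "$tmp"
  return "$status"
}
```

Calling `verify_backup "$S3_BUCKET"` after each backup run returns non-zero when no dump exists or the newest one cannot be parsed, which makes it easy to hook into alerting.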

## Best Practices
- ✅ Test DR procedures quarterly
- ✅ Automate backup verification
- ✅ Document runbooks thoroughly
- ✅ Deploy critical systems across multiple regions
- ✅ Monitor backup success/failure
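
Monitoring backup success can start with a staleness check: alert when the newest backup is older than expected. A minimal sketch, assuming a hypothetical `backup_is_fresh` helper that compares Unix epoch timestamps; the 26-hour threshold in the example is an assumed grace period for a daily job, not a value from this skill.

```bash
# Hypothetical freshness check: succeed only if the newest backup is
# no older than max_age_hours. Timestamps are Unix epoch seconds.
backup_is_fresh() {
  last_backup_epoch=$1
  now_epoch=$2
  max_age_hours=$3
  age_seconds=$(( now_epoch - last_backup_epoch ))
  [ "$age_seconds" -le $(( max_age_hours * 3600 )) ]
}

# Example wiring (the alert command is a placeholder):
# backup_is_fresh "$last" "$(date +%s)" 26 || echo "ALERT: backup is stale" >&2
```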

Overview

This skill designs disaster recovery strategies to keep your platform and services resilient. It covers RTO/RPO tiering, automated backups, WAL-based point-in-time recovery, and multi-region failover patterns. The guidance is practical and focused on minimizing downtime and data loss for production systems.

How this skill works

The skill assesses service criticality and maps each component to RTO/RPO targets. It prescribes concrete backup scripts, retention rules, and health-check-backed DNS failover configurations. It also produces runbooks for failover, recovery verification steps, and test schedules to validate readiness.

When to use it

- Planning business continuity for production services
- Defining backup and retention policies for databases
- Designing multi-region or cross-AZ failover for critical APIs
- Developing runbooks and automation for recovery drills
- Setting RTO/RPO targets for new or changing systems

Best practices

- Classify systems by criticality and assign RTO/RPO tiers (Critical/Important/Standard)
- Automate full and incremental backups, and archive WAL for point-in-time recovery
- Use health-checked DNS failover or load-balancer-based cross-region failover for services
- Test DR procedures quarterly and verify backups automatically after creation
- Document step-by-step runbooks and maintain observable alerts for backup/failover failures

Example use cases

- Designing a DR plan for payment and authentication services with <1 hour RTO and <5 minute RPO
- Implementing daily full backups plus WAL archiving to S3 for PostgreSQL with 30-day retention
- Configuring Route53 health checks and primary/secondary DNS failover for an API endpoint
- Creating automated backup verification jobs and alerting on failures
- Drafting runbooks and scheduled quarterly failover tests for a multi-region deployment

FAQ

How do I choose RTO and RPO for a service?

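Start from business impact. RTO is the longest downtime the business can tolerate, and RPO is the largest window of data loss it can accept. Payment and auth usually belong in Critical (RTO < 1 hour, RPO < 5 min), order systems in Important (RTO < 4 hours, RPO < 1 hour), and analytics in Standard (RTO and RPO < 24 hours). Tighter targets cost more, so weigh that cost against the downtime and data loss you can accept.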
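
One way to sanity-check a tier choice: the worst-case data loss of a backup scheme is roughly the gap between two successive recovery points. Daily dumps alone can lose up to about a day of writes, far outside a Critical target, so a tight RPO generally needs continuous WAL archiving or replication. A minimal sketch with a hypothetical `meets_rpo` helper (the name and the minute-based units are assumptions):

```bash
# Succeeds if the gap between recovery points (minutes) is strictly below
# the RPO target (minutes), matching the "<" bounds used for the tiers.
meets_rpo() {
  recovery_point_interval_min=$1
  rpo_target_min=$2
  [ "$recovery_point_interval_min" -lt "$rpo_target_min" ]
}

# Hourly dumps against the Standard target (RPO < 24 hours = 1440 min)
meets_rpo 60 1440 && echo "hourly dumps fit Standard"
# Daily dumps against the Critical target (RPO < 5 min)
meets_rpo 1440 5 || echo "daily dumps miss Critical; add WAL archiving"
```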

How often should I test disaster recovery?

Run full DR tests at least quarterly and perform more frequent partial tests (e.g., backup restores) monthly; automate verification where possible.