---
name: ops-disaster-recovery
description: |
  Structured workflow for disaster recovery planning, implementation, and testing
  including RTO/RPO definition, DR strategy selection, and failover procedures.

trigger: |
  - DR strategy development
  - DR plan review/update
  - DR testing/drills
  - Post-incident DR improvement

skip_when: |
  - Day-to-day backup operations -> standard procedures
  - Application-level redundancy -> use ring-dev-team specialists
  - Single-instance failure recovery -> standard runbooks

related:
  similar: [ops-capacity-planning]
  uses: [infrastructure-architect]
---

# Disaster Recovery Workflow

This skill defines the structured process for disaster recovery planning and testing. Use it for comprehensive DR strategy development and validation.

---

## DR Planning Phases

| Phase | Focus | Output |
|-------|-------|--------|
| **1. Business Impact** | Define criticality and requirements | BIA document |
| **2. Strategy Selection** | Choose appropriate DR strategy | DR strategy |
| **3. Architecture Design** | Design DR infrastructure | DR architecture |
| **4. Runbook Development** | Document failover procedures | DR runbooks |
| **5. Testing** | Validate DR capabilities | Test report |
| **6. Maintenance** | Keep DR current | Update schedule |

---

## Phase 1: Business Impact Analysis

### Service Classification

Classify services by business criticality:

| Tier | Definition | RTO | RPO | Example Services |
|------|------------|-----|-----|------------------|
| **Tier 1** | Critical - business cannot operate | <15 min | <1 min | Payment processing |
| **Tier 2** | Important - significant impact | <1 hour | <15 min | Customer portal |
| **Tier 3** | Standard - moderate impact | <4 hours | <1 hour | Internal tools |
| **Tier 4** | Low - minimal impact | <24 hours | <24 hours | Dev environments |
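
The tier targets above can be encoded for automated checks. This is an illustrative sketch (the function names and the minute values are taken from the table, not from any standard tooling):

```shell
#!/usr/bin/env bash
# Illustrative tier -> RTO target in minutes, mirroring the table above.
rto_minutes() {
  case "$1" in
    1) echo 15 ;;     # Tier 1: <15 min
    2) echo 60 ;;     # Tier 2: <1 hour
    3) echo 240 ;;    # Tier 3: <4 hours
    4) echo 1440 ;;   # Tier 4: <24 hours
    *) echo "unknown tier: $1" >&2; return 1 ;;
  esac
}

# Compare a measured recovery time (in minutes) against the tier target.
meets_rto() {
  local tier=$1 measured=$2
  [ "$measured" -le "$(rto_minutes "$tier")" ]
}
```

A DR test harness could then fail fast, e.g. `meets_rto 1 12 && echo PASS`.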

### BIA Template

```markdown
## Business Impact Analysis

**Assessment Date:** YYYY-MM-DD
**Assessed By:** [name]

### Service Classification

| Service | Business Function | Revenue Impact | Tier | RTO | RPO |
|---------|------------------|----------------|------|-----|-----|
| payment-api | Process transactions | $X,XXX/hour | 1 | 15 min | 1 min |
| customer-portal | Customer access | $XXX/hour | 2 | 1 hour | 15 min |
| admin-tools | Internal operations | $0/hour | 3 | 4 hours | 1 hour |

### Data Classification

| Data Type | Classification | Backup Frequency | Retention |
|-----------|---------------|------------------|-----------|
| Transaction data | Critical | Continuous | 7 years |
| Customer data | Important | Hourly | 3 years |
| Application logs | Standard | Daily | 90 days |

### Dependencies

| Service | Dependencies | DR Impact |
|---------|--------------|-----------|
| payment-api | Database, payment-gateway | All must fail over together |
| customer-portal | Database, auth-service | Sequential failover possible |
```

---

## Phase 2: Strategy Selection

### DR Strategy Comparison

| Strategy | RTO | RPO | Cost | Complexity | Best For |
|----------|-----|-----|------|------------|----------|
| **Backup & Restore** | Hours | Hours | $ | Low | Tier 4 services |
| **Pilot Light** | 30-60 min | Minutes | $$ | Medium | Tier 3 services |
| **Warm Standby** | 10-30 min | Seconds-Minutes | $$$ | Medium-High | Tier 2 services |
| **Hot Standby** | <10 min | Seconds | $$$$ | High | Tier 1 services |
| **Multi-Active** | Near-zero | Near-zero | $$$$$ | Very High | Ultra-critical |
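
A quick back-of-envelope comparison of outage exposure versus DR spend helps sanity-check the strategy choice before the full selection matrix. All figures below are hypothetical inputs, not recommendations:

```shell
#!/usr/bin/env bash
# Back-of-envelope check: does expected outage cost justify the DR spend?
revenue_per_hour=5000        # $/hour at risk for the service (hypothetical)
expected_outage_hours=8      # expected annual downtime without DR (hypothetical)
dr_cost_per_month=2000       # monthly cost of the candidate strategy (hypothetical)

outage_cost=$(( revenue_per_hour * expected_outage_hours ))
dr_cost_annual=$(( dr_cost_per_month * 12 ))

echo "Expected annual outage cost: \$${outage_cost}"
echo "Annual DR cost:              \$${dr_cost_annual}"
if [ "$outage_cost" -gt "$dr_cost_annual" ]; then
  echo "DR spend is justified by outage exposure"
fi
```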

### Strategy Selection Matrix

```markdown
## DR Strategy Selection

### Requirements Summary

| Requirement | Value |
|-------------|-------|
| Target RTO | [X minutes/hours] |
| Target RPO | [X minutes/hours] |
| Budget | $[X,XXX]/month for DR |
| Compliance | [frameworks] |

### Strategy Decision

**Selected Strategy:** [Pilot Light / Warm Standby / Hot Standby]

**Rationale:**
1. RTO requirement of [X] achieved by [strategy]
2. RPO requirement of [X] achieved with [replication method]
3. Budget of $[X]/month supports [strategy] (~XX% of production cost)
4. Compliance requirement for [X] met with [features]

### Trade-offs Accepted

| Trade-off | Impact | Mitigation |
|-----------|--------|------------|
| Higher DR cost | +$X/month | Justified by RTO requirement |
| Manual failover steps | 5-10 min added | Automation planned Q2 |
```

---

## Phase 3: Architecture Design

### DR Architecture Components

| Component | Primary | DR | Replication |
|-----------|---------|----|----|
| DNS | Route53 | Route53 | Global service |
| Load Balancer | ALB (us-east-1) | ALB (us-west-2) | Configuration sync |
| Compute | EKS (us-east-1) | EKS (us-west-2) | GitOps deployment |
| Database | Aurora (us-east-1) | Aurora Global (us-west-2) | Async replication |
| Storage | S3 (us-east-1) | S3 (us-west-2) | Cross-region replication |
| Secrets | Secrets Manager | Secrets Manager | Manual sync |
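
Async database replication only meets the RPO while lag stays below the target, so lag should be checked continuously. A minimal sketch of the threshold logic follows; the lag value is passed in as an argument so the logic is testable offline, whereas a real script would fetch it from CloudWatch (e.g. the `AuroraGlobalDBReplicationLag` metric via `aws cloudwatch get-metric-statistics --namespace AWS/RDS ...`). The default threshold is an assumption tied to the Tier 1 RPO of 1 minute:

```shell
#!/usr/bin/env bash
# Sketch of a replication-lag check for the Aurora Global replica.
# lag_ms would come from CloudWatch in a real monitoring script.
lag_ok() {
  local lag_ms=$1 threshold_ms=${2:-60000}   # default: 60s (Tier 1 RPO assumption)
  [ "$lag_ms" -le "$threshold_ms" ]
}

if lag_ok 850; then
  echo "replication lag within RPO target"
else
  echo "ALERT: replication lag exceeds RPO target"
fi
```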

### Architecture Diagram Template

```
Primary Region (us-east-1)          DR Region (us-west-2)
┌─────────────────────────┐         ┌─────────────────────────┐
│                         │         │                         │
│  ┌─────────────────┐    │         │  ┌─────────────────┐    │
│  │     ALB         │    │         │  │     ALB         │    │
│  └────────┬────────┘    │         │  └────────┬────────┘    │
│           │             │         │           │ (standby)   │
│  ┌────────┴────────┐    │         │  ┌────────┴────────┐    │
│  │  EKS Cluster    │    │         │  │  EKS Cluster    │    │
│  │  (Active)       │    │         │  │  (Standby)      │    │
│  └────────┬────────┘    │         │  └────────┬────────┘    │
│           │             │         │           │             │
│  ┌────────┴────────┐    │  async  │  ┌────────┴────────┐    │
│  │  Aurora         │────┼────────►│  │  Aurora         │    │
│  │  (Primary)      │    │         │  │  (Replica)      │    │
│  └─────────────────┘    │         │  └─────────────────┘    │
│                         │         │                         │
└─────────────────────────┘         └─────────────────────────┘
              │                               │
              └───────────┬───────────────────┘
                          │
                   ┌──────┴──────┐
                   │   Route53   │
                   │   (Global)  │
                   └─────────────┘
```

---

## Phase 4: Runbook Development

### Failover Runbook Structure

````markdown
## Failover Runbook: [Service Name]

**Version:** 1.0
**Last Updated:** YYYY-MM-DD
**Owner:** [team]

### Pre-Conditions

- [ ] DR region healthy (check dashboard)
- [ ] Replication lag <[X seconds/minutes]
- [ ] On-call personnel available
- [ ] Communication channels ready

### Failover Decision Criteria

| Criteria | Automatic | Manual |
|----------|-----------|--------|
| Primary region unavailable >5 min | Yes | - |
| Replication lag >15 min | - | Yes |
| Data corruption detected | - | Yes |
| Planned maintenance | - | Yes |

### Failover Steps

1. **Verify DR Readiness** (2 min)
   ```bash
   # Check DR database status
   aws rds describe-db-clusters --region us-west-2

   # Check EKS cluster status
   kubectl --context=dr get nodes
   ```

2. **Stop Writes to Primary** (1 min)
   ```bash
   # Scale down primary services
   kubectl --context=primary scale deployment/api --replicas=0
   ```

3. **Promote DR Database** (5 min)
   ```bash
   # Promote Aurora replica
   aws rds failover-global-cluster \
     --global-cluster-identifier my-global-cluster \
     --target-db-cluster-identifier dr-cluster
   ```

4. **Activate DR Services** (2 min)
   ```bash
   # Scale up DR services
   kubectl --context=dr scale deployment/api --replicas=10
   ```

5. **Update DNS** (1-5 min propagation)
   ```bash
   # Disable the primary health check so Route53 fails traffic over to DR
   aws route53 update-health-check \
     --health-check-id xxx \
     --disabled
   ```

6. **Verify Service** (5 min)
   ```bash
   # Health check
   curl https://api.example.com/health

   # Synthetic transaction
   ./scripts/synthetic-test.sh
   ```

### Rollback Steps

[If failover causes issues, steps to return to primary]

### Communication Template

**Internal:**
> DR failover initiated for [service] at [time UTC].
> Estimated completion: [X minutes].
> IC: [name]

**External (if customer-facing):**
> We are currently experiencing issues with [service].
> Our team is working to restore service.
> Status page: [url]
````
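
The decision criteria from the runbook can be encoded as a function so the on-call tooling and the runbook cannot drift apart. A minimal sketch, with thresholds taken from the criteria table (the function name and argument order are illustrative):

```shell
#!/usr/bin/env bash
# Sketch: encode the failover decision criteria. Inputs are minutes and yes/no flags.
failover_mode() {
  local outage_min=$1 lag_min=$2 corruption=$3 maintenance=$4
  if [ "$corruption" = yes ] || [ "$maintenance" = yes ]; then
    echo manual              # data corruption / planned maintenance: human decision
  elif [ "$outage_min" -gt 5 ]; then
    echo automatic           # primary region unavailable >5 min
  elif [ "$lag_min" -gt 15 ]; then
    echo manual              # replication lag >15 min
  else
    echo none
  fi
}
```

For example, `failover_mode 6 0 no no` reports that automatic failover should trigger.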

---

## Phase 5: Testing

### DR Test Types

| Test Type | Frequency | Scope | Impact |
|-----------|-----------|-------|--------|
| **Tabletop** | Quarterly | Full scenario walkthrough | None |
| **Component** | Monthly | Individual component failover | Minimal |
| **Partial** | Quarterly | Non-production failover | Low |
| **Full** | Annually | Production failover | Moderate |
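
Component and partial tests lean on synthetic checks like the `./scripts/synthetic-test.sh` referenced in the runbook. A hypothetical minimal version is sketched below; the endpoint is illustrative, and a real synthetic test would exercise a full user transaction rather than a bare health probe. The status evaluation is separated from the network call so it can be verified offline:

```shell
#!/usr/bin/env bash
# Sketch of a minimal synthetic check (hypothetical, not the real script).
check_status() {  # pass the HTTP status code returned by the probe
  case "$1" in
    200) echo healthy ;;
    5*)  echo failing ;;
    *)   echo degraded ;;
  esac
}

# Real probe (requires network):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://api.example.com/health)
#   check_status "$code"
```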

### DR Test Template

```markdown
## DR Test Report

**Test Date:** YYYY-MM-DD
**Test Type:** [Tabletop/Component/Partial/Full]
**Scope:** [services tested]

### Test Objectives

1. Validate RTO of <[X minutes]
2. Validate RPO of <[X minutes]
3. Verify runbook accuracy
4. Identify gaps in DR readiness

### Test Results

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| RTO | 15 min | 12 min | PASS |
| RPO | 1 min | 45 sec | PASS |
| Data integrity | 100% | 100% | PASS |
| Runbook accuracy | 100% | 85% | PARTIAL |

### Timeline

| Time | Action | Status |
|------|--------|--------|
| 10:00 | Test initiated | OK |
| 10:02 | Primary shutdown simulated | OK |
| 10:08 | DR database promoted | OK |
| 10:12 | DR services activated | OK |
| 10:15 | Service verified | OK |

### Issues Found

| Issue | Severity | Action Required |
|-------|----------|-----------------|
| Step 4 command incorrect | Medium | Update runbook |
| DNS propagation slower than expected | Low | Reduce TTL |

### Lessons Learned

1. [Lesson 1]
2. [Lesson 2]

### Action Items

| Item | Owner | Due Date |
|------|-------|----------|
| Update runbook step 4 | @ops | YYYY-MM-DD |
| Reduce DNS TTL | @platform | YYYY-MM-DD |
```
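
Achieved RTO should be derived from the recorded timeline rather than estimated after the fact. A small sketch (assuming same-day `HH:MM` timestamps as in the sample timeline, and that the clock starts at test initiation):

```shell
#!/usr/bin/env bash
# Sketch: derive achieved RTO from two HH:MM timeline timestamps (same day).
minutes_between() {
  local start=$1 end=$2
  local s=$(( 10#${start%%:*} * 60 + 10#${start##*:} ))
  local e=$(( 10#${end%%:*} * 60 + 10#${end##*:} ))
  echo $(( e - s ))
}

# Sample timeline: test initiated 10:00, DR services activated 10:12.
minutes_between 10:00 10:12   # prints 12, matching the 12 min actual RTO
```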

---

## Phase 6: Maintenance

### DR Maintenance Schedule

| Activity | Frequency | Owner |
|----------|-----------|-------|
| Runbook review | Quarterly | Platform team |
| DR test | Per test schedule | SRE team |
| Replication monitoring | Daily (automated) | Monitoring |
| Cost review | Monthly | FinOps |
| Architecture review | Annually | Architecture team |

---

## Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "DR can be added later" | DR added later is rarely tested | **DR is day-1 requirement** |
| "Backups are good enough" | Backups != DR. RTO is hours vs minutes. | **Design proper DR strategy** |
| "Too expensive for DR" | DR cost << outage cost | **Calculate business impact** |
| "We'll figure it out during incident" | Panic != good decisions | **Document runbooks NOW** |
| "Tested last year, still good" | Systems change constantly | **Test regularly** |

---

## Dispatch Specialist

For DR planning tasks, dispatch:

```
Task tool:
  subagent_type: "ring:infrastructure-architect"
  model: "opus"
  prompt: |
    DR PLANNING REQUEST
    Services: [services requiring DR]
    RTO Requirement: [target]
    RPO Requirement: [target]
    Current State: [existing DR if any]
    REQUEST: [design/review/test planning]
```

---

## Overview

This skill provides a structured workflow for disaster recovery (DR) planning, implementation, and testing. It guides teams through Business Impact Analysis, strategy selection, architecture design, runbook creation, testing, and ongoing maintenance. The workflow enforces measurable RTO/RPO goals and repeatable failover procedures to reduce outage risk and recovery time.

## How this skill works

The skill inspects service criticality, captures RTO/RPO requirements, and maps them to appropriate DR strategies (Backup & Restore, Pilot Light, Warm Standby, Hot Standby, Multi-Active). It produces artifacts: BIA templates, strategy decisions, DR architecture diagrams, failover runbooks, test plans, and maintenance schedules. It also defines test types, metrics to validate RTO/RPO, and a cadence for reviews and updates.

## When to use it

- Design DR for new or existing services with uptime requirements
- Translate business RTO/RPO requirements into technical strategy
- Create runnable failover runbooks for on-call teams
- Validate DR capability through scheduled tests and post-test reports
- Maintain DR readiness across architecture and cost reviews

## Best practices

- Classify services into tiers (Tier 1–4) and assign specific RTO/RPO targets
- Choose a strategy that balances RTO/RPO, cost, and operational complexity
- Document precise pre-conditions, decision criteria, and step-by-step failover commands
- Run tabletop and component tests frequently; run full production failover tests annually
- Automate replication, monitoring, and synthetic checks; keep runbooks and diagrams versioned

## Example use cases

- Create a BIA for payment and customer-facing services to set Tier 1/2 requirements
- Select Warm Standby for a customer portal to meet <1 hour RTO with acceptable cost
- Build runbooks to promote read replicas and scale DR EKS clusters during failover
- Run quarterly tabletop exercises and document gap-driven action items
- Schedule daily replication monitoring and monthly cost reviews to prevent drift

## FAQ

### How do I pick a DR strategy for a service?

Start from the service tier and target RTO/RPO. Map those targets to strategies in the comparison table, then validate budget and operational complexity before selecting Pilot Light, Warm Standby, Hot Standby, or Multi-Active.

### What test cadence should I follow?

Use a mix: tabletop quarterly, component monthly, partial quarterly, and at least one full production failover annually. Adjust frequency for higher-tier services or regulatory requirements.