
chaos-engineering-resilience skill

/v3/assets/skills/chaos-engineering-resilience

This skill helps you validate the resilience of distributed systems by orchestrating controlled failures with automatic rollbacks, measured against well-defined steady-state metrics.

npx playbooks add skill proffesor-for-testing/agentic-qe --skill chaos-engineering-resilience

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
4.6 KB
---
name: chaos-engineering-resilience
description: "Chaos engineering principles, controlled failure injection, resilience testing, and system recovery validation. Use when testing distributed systems, building confidence in fault tolerance, or validating disaster recovery."
category: specialized-testing
priority: high
tokenEstimate: 900
agents: [qe-chaos-engineer, qe-performance-tester, qe-production-intelligence]
implementation_status: optimized
optimization_version: 1.0
last_optimized: 2025-12-02
dependencies: []
quick_reference_card: true
tags: [chaos, resilience, fault-injection, distributed-systems, recovery, netflix]
---

# Chaos Engineering & Resilience Testing

<default_to_action>
When testing system resilience or injecting failures:
1. DEFINE steady state (normal metrics: error rate, latency, throughput)
2. HYPOTHESIZE system continues in steady state during failure
3. INJECT real-world failures (network, instance, disk, CPU)
4. OBSERVE and measure deviation from steady state
5. FIX weaknesses discovered, document runbooks, repeat

**Quick Chaos Steps:**
- Start small: Dev → Staging → 1% prod → gradual rollout
- Define clear rollback triggers (error_rate > 5%)
- Measure blast radius, never exceed planned scope
- Document findings → runbooks → improved resilience

**Critical Success Factors:**
- Controlled experiments with automatic rollback
- Steady state must be measurable
- Start in non-production, graduate to production
</default_to_action>

## Quick Reference Card

### When to Use
- Distributed systems validation
- Disaster recovery testing
- Building confidence in fault tolerance
- Pre-production resilience verification

### Failure Types to Inject
| Category | Failures | Tools |
|----------|----------|-------|
| **Network** | Latency, packet loss, partition | tc, toxiproxy |
| **Infrastructure** | Instance kill, disk failure, CPU | Chaos Monkey |
| **Application** | Exceptions, slow responses, leaks | Gremlin, LitmusChaos |
| **Dependencies** | Service outage, timeout | WireMock |
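
For example, the network row above can be exercised on a Linux host with `tc`/netem. A minimal sketch (the interface name `eth0`, root privileges, and driving `tc` from Node.js are assumptions; the qe-chaos-engineer agent may use a different mechanism):

```typescript
// Sketch: inject 500ms of network latency with tc/netem, then roll back.
// Assumes a Linux host with iproute2's `tc`, root privileges, and that
// `eth0` is the interface to degrade (adjust for your environment).
import { execSync } from 'node:child_process';

const IFACE = 'eth0';

function injectLatency(delayMs: number): void {
  execSync(`tc qdisc add dev ${IFACE} root netem delay ${delayMs}ms`);
}

function rollbackLatency(): void {
  // Deleting the netem qdisc restores normal network behaviour.
  execSync(`tc qdisc del dev ${IFACE} root netem`);
}

// Example: hold the 500ms delay for 5 minutes, then always roll back.
injectLatency(500);
setTimeout(rollbackLatency, 5 * 60 * 1000);
```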

### Blast Radius Progression
```
Dev (safe) → Staging → 1% prod → 10% → 50% → 100%
    ↓           ↓         ↓                   ↓
  Learn     Validate   Careful         Full confidence
```

### Steady State Metrics
| Metric | Normal | Alert Threshold |
|--------|--------|-----------------|
| Error rate | < 0.1% | > 1% |
| p99 latency | < 200ms | > 500ms |
| Throughput | baseline | -20% |
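
As a sketch of how these thresholds can be checked in code (the `Metrics` shape and field names are illustrative, not a fixed schema):

```typescript
// Illustrative metric snapshot; field names are assumptions, not a fixed schema.
interface Metrics {
  errorRate: number;     // fraction of failed requests (0.001 = 0.1%)
  p99LatencyMs: number;  // 99th-percentile latency in milliseconds
  throughputRps: number; // requests per second
}

// True while the system stays within the steady state defined above.
function withinSteadyState(current: Metrics, baseline: Metrics): boolean {
  return (
    current.errorRate < 0.01 &&                            // alert at > 1%
    current.p99LatencyMs < 500 &&                          // alert at > 500ms
    current.throughputRps >= baseline.throughputRps * 0.8  // alert at -20% vs baseline
  );
}
```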

---

## Chaos Experiment Structure

```typescript
// Chaos experiment definition
const experiment = {
  name: 'Database latency injection',
  hypothesis: 'System handles 500ms DB latency gracefully',
  steadyState: {
    errorRate: '< 0.1%',
    p99Latency: '< 300ms'
  },
  method: {
    type: 'network-latency',
    target: 'database',
    delay: '500ms',
    duration: '5m'
  },
  rollback: {
    automatic: true,
    trigger: 'errorRate > 5%'
  }
};
```
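
One way such a definition can drive an automated run is to poll the steady-state metric during the injection window and roll back as soon as the trigger fires. A sketch with caller-supplied hooks (`inject`, `remove`, and `readErrorRate` are assumptions, not part of a published API):

```typescript
// Sketch of an experiment run with automatic rollback. The inject, remove,
// and readErrorRate hooks are assumptions supplied by the caller.
async function runWithAutoRollback(opts: {
  inject: () => Promise<void>;          // start the failure (e.g. add 500ms DB latency)
  remove: () => Promise<void>;          // undo the failure
  readErrorRate: () => Promise<number>; // observed error rate, 0..1
  durationMs: number;                   // planned experiment window
  pollMs?: number;                      // how often to re-check steady state
}): Promise<void> {
  const { inject, remove, readErrorRate, durationMs, pollMs = 5_000 } = opts;
  await inject();
  const deadline = Date.now() + durationMs;
  try {
    while (Date.now() < deadline) {
      const errorRate = await readErrorRate();
      if (errorRate > 0.05) {           // rollback trigger: errorRate > 5%
        console.warn(`Rolling back early: error rate ${errorRate} exceeded 5%`);
        break;
      }
      await new Promise((resolve) => setTimeout(resolve, pollMs));
    }
  } finally {
    await remove();                     // rollback always runs, even on errors
  }
}
```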

---

## Agent-Driven Chaos

```typescript
// qe-chaos-engineer runs controlled experiments
await Task("Chaos Experiment", {
  target: 'payment-service',
  failure: 'terminate-random-instance',
  blastRadius: '10%',
  duration: '5m',
  steadyStateHypothesis: {
    metric: 'success-rate',
    threshold: 0.99
  },
  autoRollback: true
}, "qe-chaos-engineer");

// Validates:
// - System recovers automatically
// - Error rate stays within threshold
// - No data loss
// - Alerts triggered appropriately
```

---

## Agent Coordination Hints

### Memory Namespace
```
aqe/chaos-engineering/
├── experiments/*       - Experiment definitions & results
├── steady-states/*     - Baseline measurements
├── runbooks/*          - Generated recovery procedures
└── blast-radius/*      - Impact analysis
```

### Fleet Coordination
```typescript
const chaosFleet = await FleetManager.coordinate({
  strategy: 'chaos-engineering',
  agents: [
    'qe-chaos-engineer',          // Experiment execution
    'qe-performance-tester',      // Baseline metrics
    'qe-production-intelligence'  // Production monitoring
  ],
  topology: 'sequential'
});
```

---

## Related Skills
- [shift-right-testing](../shift-right-testing/) - Production testing
- [performance-testing](../performance-testing/) - Load testing
- [test-environment-management](../test-environment-management/) - Environment stability

---

## Remember

**Break things on purpose to prevent unplanned outages.** Find weaknesses before users do. Define steady state, inject failures, measure impact, fix weaknesses, create runbooks. Start small, increase blast radius gradually.

**With Agents:** `qe-chaos-engineer` automates chaos experiments with blast radius control, automatic rollback, and comprehensive resilience validation. Generates runbooks from experiment results.

Overview

This skill codifies chaos engineering and resilience testing practices for distributed systems. It provides structured experiments, controlled failure injection, automatic rollback controls, and artifacts like runbooks and blast-radius analysis to validate recovery and fault tolerance. Use it to build measurable confidence in system resilience across environments.

How this skill works

Define a measurable steady state (error rate, latency, throughput) and state a hypothesis that the system will remain within that steady state during an injected failure. Execute controlled failures (network latency, instance termination, CPU/disk stress, dependency outages) with a scoped blast radius and automated rollback triggers. Collect metrics, compare deviations, generate findings and runbooks, then iterate to remediate weaknesses.

When to use it

  • Validating fault tolerance of distributed services before release
  • Testing disaster recovery procedures and failover behavior
  • Measuring system behavior under degraded dependencies
  • Progressively rolling out experiments from dev to production (1% → 100%)
  • Building operational runbooks and incident playbooks from experiment results

Best practices

  • Always define and measure a clear steady state before injecting failures
  • Start small: run in dev/staging, then gradual production rollouts with capped blast radius
  • Automate rollback with concrete triggers (e.g., error_rate > 5%) to limit impact
  • Record experiment metadata and observations, then generate runbooks for fixes (see the sketch after this list)
  • Limit blast radius and monitor key metrics (error rate, p99 latency, throughput) continuously
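
A runbook entry distilled from an experiment could look like this sketch (the shape and values are illustrative; adapt them to your incident tooling):

```typescript
// Illustrative runbook entry generated from an experiment's findings.
interface RunbookEntry {
  failureMode: string;    // what was injected
  observedImpact: string; // deviation from steady state
  detection: string;      // alert or metric that surfaces the failure
  mitigation: string[];   // ordered recovery steps
  owner: string;
}

const dbLatencyRunbook: RunbookEntry = {
  failureMode: 'Database latency > 500ms',
  observedImpact: 'p99 latency rose to ~450ms; error rate stayed below 1%',
  detection: 'p99 latency alert at 500ms',
  mitigation: [
    'Check connection pool saturation',
    'Fail over reads to a replica',
    'Enable degraded mode for non-critical queries',
  ],
  owner: 'payments-oncall',
};
```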

Example use cases

  • Inject 500ms database latency to verify p99 latency and error-rate thresholds
  • Terminate a subset of instances (10% blast radius) to validate auto-scaling and failover
  • Simulate a downstream service outage to verify graceful degradation and retry behavior
  • Run scheduled chaos in staging to generate runbooks for common failure modes
  • Coordinate agents to run sequential experiments: baseline collection, injection, and recovery validation

FAQ

How do I limit blast radius safely?

Scope experiments to specific services or percentages of traffic, start in non-production, set automatic rollback triggers, and incrementally increase scope only after validating outcomes.
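
One way to make that progression explicit is a staged plan where each step has a gate before scope widens; a sketch (values mirror the dev → staging → production progression above and are adjustable):

```typescript
// Illustrative rollout plan; percentages mirror the blast radius progression.
interface RolloutStage {
  environment: 'dev' | 'staging' | 'production';
  trafficPercent: number; // blast radius as a share of traffic
  proceedIf: string;      // gate before widening scope
}

const rolloutPlan: RolloutStage[] = [
  { environment: 'dev',        trafficPercent: 100, proceedIf: 'hypothesis holds, no surprises' },
  { environment: 'staging',    trafficPercent: 100, proceedIf: 'steady state maintained' },
  { environment: 'production', trafficPercent: 1,   proceedIf: 'error rate < 1%, no rollback triggered' },
  { environment: 'production', trafficPercent: 10,  proceedIf: 'steady state maintained' },
  { environment: 'production', trafficPercent: 50,  proceedIf: 'steady state maintained' },
  { environment: 'production', trafficPercent: 100, proceedIf: 'full confidence' },
];
```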

What metrics should I monitor during experiments?

At a minimum, monitor error rate, p99 latency, and throughput relative to baseline. Add business-level success metrics relevant to your service.