home / skills / copyleftdev / sk1llz / chaos-engineering

chaos-engineering skill

/organizations/netflix/chaos-engineering

This skill helps you validate system resilience using chaos engineering principles to safely inject failures and verify steady-state throughout production.

npx playbooks add skill copyleftdev/sk1llz --skill chaos-engineering

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
18.1 KB
---
name: netflix-chaos-engineering
description: Apply Netflix's chaos engineering methodology to build resilient systems. Emphasizes controlled failure injection, steady-state hypothesis testing, and building confidence through experimentation. Use when you need to verify system resilience under turbulent conditions.
---

# Netflix Chaos Engineering

## Overview

Netflix invented chaos engineering in response to their 2008 migration to AWS. Facing the reality that cloud infrastructure fails unpredictably, they created Chaos Monkey—and eventually the entire Simian Army—to proactively inject failures and build confidence in system resilience.

## The Pioneers

### Casey Rosenthal (Father of Chaos Engineering)

Led Netflix's Chaos Engineering team from 2015, formalizing the discipline and co-authoring the definitive O'Reilly book. Now CEO of Verica. His key insight: chaos engineering is about **building confidence**, not breaking things.

### Nora Jones

Co-pioneered chaos engineering at Netflix, co-authored the book, and later founded Jeli to apply these principles to incident analysis. Emphasized the human factors in resilience.

## References

- **Book**: "Chaos Engineering: System Resiliency in Practice" (O'Reilly, 2020)
- **Principles**: https://principlesofchaos.org/
- **Netflix Tech Blog**: https://netflixtechblog.com/

## Core Philosophy

> "The best way to avoid failure is to fail constantly."

> "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."

> "We're not trying to break things. We're trying to build confidence."

Chaos engineering is NOT about breaking things randomly. It's a disciplined approach to discovering systemic weaknesses before they cause outages.

## The Principles of Chaos Engineering

```
1. Build a Hypothesis around Steady State Behavior
   Define what "normal" looks like in measurable terms
   
2. Vary Real-World Events
   Inject failures that actually happen: server crashes, network issues, etc.
   
3. Run Experiments in Production
   Staging environments hide real-world complexity
   
4. Automate Experiments to Run Continuously
   One-time tests give false confidence
   
5. Minimize Blast Radius
   Start small, expand as confidence grows
```

## The Simian Army

Netflix's suite of chaos tools:

| Tool | Purpose |
|------|---------|
| **Chaos Monkey** | Randomly terminates instances |
| **Chaos Kong** | Simulates entire region failure |
| **Latency Monkey** | Injects artificial delays |
| **Conformity Monkey** | Finds instances not following best practices |
| **Janitor Monkey** | Cleans up unused resources |
| **Security Monkey** | Finds security vulnerabilities |

## When Implementing

### Always

- Define steady-state hypothesis before experimenting
- Start with smallest blast radius possible
- Have a "stop button" to halt experiments
- Run experiments in production (with safeguards)
- Automate experiments to run continuously
- Involve the whole team, not just SRE

### Never

- Inject chaos without a hypothesis
- Start with catastrophic failures
- Run experiments without monitoring
- Chaos without stakeholder buy-in
- Treat chaos as a one-time activity
- Forget to document learnings

### Prefer

- Gradual expansion of blast radius
- Automated experiments over manual
- Production over staging (with safeguards)
- Hypothesis-driven experiments
- Business metrics over technical metrics

## Implementation Patterns

### Chaos Experiment Structure

```python
# chaos_experiment.py
# The anatomy of a chaos experiment

from dataclasses import dataclass
from typing import Callable, Optional
from datetime import datetime, timedelta
import time

@dataclass
class SteadyStateHypothesis:
    """Define what 'normal' looks like"""
    name: str
    description: str
    probe: Callable[[], float]       # Returns a metric value
    tolerance_min: float
    tolerance_max: float
    
    def is_satisfied(self) -> bool:
        value = self.probe()
        return self.tolerance_min <= value <= self.tolerance_max

@dataclass
class ChaosAction:
    """The failure to inject"""
    name: str
    description: str
    execute: Callable[[], None]      # Inject the failure
    rollback: Callable[[], None]     # Undo the failure

@dataclass
class ChaosExperiment:
    """A complete chaos experiment"""
    name: str
    description: str
    hypothesis: SteadyStateHypothesis
    action: ChaosAction
    duration_seconds: int
    
    def run(self) -> dict:
        results = {
            'experiment': self.name,
            'started_at': datetime.now().isoformat(),
            'hypothesis_before': None,
            'hypothesis_during': None,
            'hypothesis_after': None,
            'success': False
        }
        
        # 1. Verify steady state BEFORE
        print(f"Checking steady state before experiment...")
        results['hypothesis_before'] = self.hypothesis.is_satisfied()
        
        if not results['hypothesis_before']:
            print("Steady state not satisfied before experiment. Aborting.")
            return results
        
        try:
            # 2. Inject the failure
            print(f"Injecting chaos: {self.action.name}")
            self.action.execute()
            
            # 3. Monitor during experiment
            print(f"Monitoring for {self.duration_seconds} seconds...")
            time.sleep(self.duration_seconds)
            results['hypothesis_during'] = self.hypothesis.is_satisfied()
            
        finally:
            # 4. Always rollback
            print(f"Rolling back: {self.action.name}")
            self.action.rollback()
        
        # 5. Verify steady state AFTER
        print("Checking steady state after rollback...")
        time.sleep(5)  # Allow recovery
        results['hypothesis_after'] = self.hypothesis.is_satisfied()
        
        # Success = hypothesis held during and after
        results['success'] = (
            results['hypothesis_during'] and 
            results['hypothesis_after']
        )
        
        results['completed_at'] = datetime.now().isoformat()
        return results


# Example: Test resilience to instance termination
def create_instance_termination_experiment(instance_id: str):
    
    def check_error_rate():
        # Query your monitoring system
        return get_error_rate_percentage()
    
    def terminate_instance():
        # Actually terminate the instance
        ec2.terminate_instances(InstanceIds=[instance_id])
    
    def noop_rollback():
        # Auto-scaling should replace the instance
        pass
    
    hypothesis = SteadyStateHypothesis(
        name="Error rate within tolerance",
        description="Error rate should remain below 1%",
        probe=check_error_rate,
        tolerance_min=0,
        tolerance_max=1.0
    )
    
    action = ChaosAction(
        name=f"Terminate instance {instance_id}",
        description="Simulate instance failure",
        execute=terminate_instance,
        rollback=noop_rollback
    )
    
    return ChaosExperiment(
        name="Instance Termination Resilience",
        description="Verify system handles instance loss gracefully",
        hypothesis=hypothesis,
        action=action,
        duration_seconds=300
    )
```

### Chaos Monkey Implementation

```python
# chaos_monkey.py
# Simplified Chaos Monkey - random instance termination

import random
import time
from datetime import datetime
from typing import List, Optional

class ChaosMonkey:
    """
    Netflix's Chaos Monkey: randomly terminates instances
    to ensure services can handle instance failures.
    """
    
    def __init__(self, 
                 cloud_client,
                 excluded_services: List[str] = None,
                 probability: float = 0.1,
                 schedule_start_hour: int = 9,
                 schedule_end_hour: int = 15):
        """
        Args:
            cloud_client: AWS/GCP/Azure client
            excluded_services: Services to never touch
            probability: Chance of termination per run (0-1)
            schedule_start_hour: Only run after this hour
            schedule_end_hour: Stop running after this hour
        """
        self.client = cloud_client
        self.excluded = set(excluded_services or [])
        self.probability = probability
        self.start_hour = schedule_start_hour
        self.end_hour = schedule_end_hour
        self.termination_log = []
    
    def is_within_schedule(self) -> bool:
        """Only cause chaos during business hours (when humans can respond)"""
        hour = datetime.now().hour
        weekday = datetime.now().weekday()
        
        # Monday-Friday, 9am-3pm
        return weekday < 5 and self.start_hour <= hour < self.end_hour
    
    def get_eligible_instances(self) -> List[dict]:
        """Get instances that can be terminated"""
        all_instances = self.client.list_instances()
        
        eligible = []
        for instance in all_instances:
            service = instance.get('service_name', '')
            
            # Skip excluded services
            if service in self.excluded:
                continue
            
            # Skip if service has < 2 instances (no redundancy)
            service_count = sum(
                1 for i in all_instances 
                if i.get('service_name') == service
            )
            if service_count < 2:
                continue
            
            # Skip if instance is too new (let it warm up)
            age_minutes = instance.get('age_minutes', 0)
            if age_minutes < 30:
                continue
            
            eligible.append(instance)
        
        return eligible
    
    def run(self) -> Optional[dict]:
        """Execute one round of chaos"""
        
        # Check schedule
        if not self.is_within_schedule():
            return {'action': 'skipped', 'reason': 'outside schedule'}
        
        # Check probability
        if random.random() > self.probability:
            return {'action': 'skipped', 'reason': 'probability check'}
        
        # Get eligible instances
        eligible = self.get_eligible_instances()
        if not eligible:
            return {'action': 'skipped', 'reason': 'no eligible instances'}
        
        # Select random victim
        victim = random.choice(eligible)
        
        # Terminate!
        result = {
            'action': 'terminated',
            'instance_id': victim['id'],
            'service': victim.get('service_name'),
            'timestamp': datetime.now().isoformat()
        }
        
        self.client.terminate_instance(victim['id'])
        self.termination_log.append(result)
        
        return result
    
    def run_continuously(self, interval_seconds: int = 300):
        """Run chaos monkey on a schedule"""
        print("Chaos Monkey starting... 🐵")
        
        while True:
            result = self.run()
            if result['action'] == 'terminated':
                print(f"🔥 Terminated {result['instance_id']} "
                      f"({result['service']})")
            else:
                print(f"😴 Skipped: {result['reason']}")
            
            time.sleep(interval_seconds)
```

### Steady State Metrics

```python
# steady_state.py
# Define and monitor steady state

from dataclasses import dataclass
from typing import List, Callable
from prometheus_client import CollectorRegistry, Gauge

@dataclass
class BusinessMetric:
    """
    Netflix insight: measure BUSINESS metrics, not just technical ones.
    Users don't care about CPU; they care about streams starting.
    """
    name: str
    description: str
    query: Callable[[], float]
    unit: str
    
    # Steady state bounds
    min_healthy: float
    max_healthy: float

# Netflix's key business metric
streams_per_second = BusinessMetric(
    name="streams_starting_per_second",
    description="Rate of successful stream starts",
    query=lambda: prometheus.query("rate(streams_started_total[1m])"),
    unit="streams/sec",
    min_healthy=50000,
    max_healthy=200000
)

class SteadyStateMonitor:
    """Monitor steady state during chaos experiments"""
    
    def __init__(self, metrics: List[BusinessMetric]):
        self.metrics = metrics
        self.baseline = {}
    
    def capture_baseline(self, duration_seconds: int = 300):
        """Capture baseline metrics before experiment"""
        samples = {m.name: [] for m in self.metrics}
        
        for _ in range(duration_seconds // 10):
            for metric in self.metrics:
                samples[metric.name].append(metric.query())
            time.sleep(10)
        
        # Calculate baseline statistics
        for metric in self.metrics:
            values = samples[metric.name]
            self.baseline[metric.name] = {
                'mean': sum(values) / len(values),
                'min': min(values),
                'max': max(values)
            }
    
    def check_steady_state(self) -> dict:
        """Check if all metrics are within healthy bounds"""
        results = {}
        all_healthy = True
        
        for metric in self.metrics:
            current = metric.query()
            healthy = metric.min_healthy <= current <= metric.max_healthy
            
            results[metric.name] = {
                'current': current,
                'healthy_range': (metric.min_healthy, metric.max_healthy),
                'is_healthy': healthy
            }
            
            if not healthy:
                all_healthy = False
        
        results['all_healthy'] = all_healthy
        return results
    
    def deviation_from_baseline(self) -> dict:
        """How far are we from baseline?"""
        deviations = {}
        
        for metric in self.metrics:
            current = metric.query()
            baseline = self.baseline.get(metric.name, {}).get('mean', current)
            
            if baseline != 0:
                deviation_pct = ((current - baseline) / baseline) * 100
            else:
                deviation_pct = 0
            
            deviations[metric.name] = {
                'current': current,
                'baseline': baseline,
                'deviation_percent': deviation_pct
            }
        
        return deviations
```

### Blast Radius Control

```python
# blast_radius.py
# Control the scope of chaos experiments

from enum import Enum
from dataclasses import dataclass
from typing import List, Optional

class BlastRadius(Enum):
    """Start small, expand as confidence grows"""
    SINGLE_INSTANCE = 1      # One instance
    SERVICE_PERCENTAGE = 2   # X% of a service
    ENTIRE_SERVICE = 3       # All instances of a service
    AVAILABILITY_ZONE = 4    # Entire AZ
    REGION = 5               # Entire region (Chaos Kong)

@dataclass
class ExperimentScope:
    """Define the scope of an experiment"""
    blast_radius: BlastRadius
    target_service: Optional[str] = None
    target_percentage: float = 0.1
    target_az: Optional[str] = None
    target_region: Optional[str] = None
    
    def get_targets(self, all_instances: List[dict]) -> List[dict]:
        """Get instances within the blast radius"""
        
        if self.blast_radius == BlastRadius.SINGLE_INSTANCE:
            # Just one random instance
            import random
            eligible = [i for i in all_instances 
                       if i.get('service') == self.target_service]
            return [random.choice(eligible)] if eligible else []
        
        elif self.blast_radius == BlastRadius.SERVICE_PERCENTAGE:
            # X% of a service
            import random
            eligible = [i for i in all_instances 
                       if i.get('service') == self.target_service]
            count = max(1, int(len(eligible) * self.target_percentage))
            return random.sample(eligible, min(count, len(eligible)))
        
        elif self.blast_radius == BlastRadius.ENTIRE_SERVICE:
            return [i for i in all_instances 
                   if i.get('service') == self.target_service]
        
        elif self.blast_radius == BlastRadius.AVAILABILITY_ZONE:
            return [i for i in all_instances 
                   if i.get('az') == self.target_az]
        
        elif self.blast_radius == BlastRadius.REGION:
            # Chaos Kong - nuclear option
            return [i for i in all_instances 
                   if i.get('region') == self.target_region]
        
        return []


class GraduatedChaos:
    """Gradually increase blast radius as confidence grows"""
    
    def __init__(self, service: str):
        self.service = service
        self.current_level = BlastRadius.SINGLE_INSTANCE
        self.success_streak = 0
        self.required_successes = 5  # Before escalating
    
    def record_result(self, success: bool):
        if success:
            self.success_streak += 1
            if self.success_streak >= self.required_successes:
                self.escalate()
        else:
            self.success_streak = 0
            self.de_escalate()
    
    def escalate(self):
        """Increase blast radius"""
        levels = list(BlastRadius)
        current_idx = levels.index(self.current_level)
        if current_idx < len(levels) - 1:
            self.current_level = levels[current_idx + 1]
            self.success_streak = 0
            print(f"Escalating to {self.current_level.name}")
    
    def de_escalate(self):
        """Decrease blast radius after failure"""
        levels = list(BlastRadius)
        current_idx = levels.index(self.current_level)
        if current_idx > 0:
            self.current_level = levels[current_idx - 1]
            print(f"De-escalating to {self.current_level.name}")
```

## Mental Model

Netflix chaos engineering asks:

1. **What is steady state?** Define normal in measurable terms
2. **What could go wrong?** Real-world failures to simulate
3. **What's our hypothesis?** System should maintain steady state
4. **How small can we start?** Minimize blast radius
5. **Did we learn something?** Every experiment should teach us

## Signature Netflix Moves

- Chaos Monkey for random instance termination
- Steady state hypothesis before every experiment
- Business metrics over technical metrics
- Production experiments (with safeguards)
- Graduated blast radius expansion
- Automated, continuous chaos

Overview

This skill applies Netflix's chaos engineering methodology to help teams build resilient systems through disciplined failure injection and continuous experimentation. It emphasizes defining steady-state hypotheses, running controlled experiments (often in production with safeguards), and automating tests to increase confidence. The goal is to reveal systemic weaknesses before they cause outages, not to randomly break things. Practical patterns and lightweight Python primitives are provided for probes, actions, and experiments.

How this skill works

The skill provides a small Python framework that models a SteadyStateHypothesis, ChaosAction, and ChaosExperiment. It verifies steady state before injecting an action, monitors during a bounded window, always executes rollback, and then rechecks steady state to determine success. It includes a ChaosMonkey pattern for randomized instance termination and a SteadyStateMonitor focused on business metrics rather than low-level signals. The components are intended to be adapted to your cloud client, monitoring, and incident-control tooling.

When to use it

  • When you need to validate that production systems tolerate realistic component failures
  • Before major launches or architecture changes to gain confidence in recovery paths
  • To automate continuous resilience checks as part of operational hygiene
  • When you want to prioritize business metrics (e.g., user-facing KPIs) over infrastructure signals
  • When onboarding SRE or service teams to a hypothesis-driven testing practice

Best practices

  • Always define a measurable steady-state hypothesis before any experiment
  • Minimize blast radius: start small, expand only as confidence grows
  • Automate experiments and run them continuously with a clear stop button
  • Prefer production experiments with safeguards over pristine staging tests
  • Involve product, SRE, and on-call teams and document learnings after every run

Example use cases

  • Terminate a single instance in a service group to verify load redistribution and error rate tolerance
  • Inject latency into a downstream dependency and observe business metrics like successful requests/sec
  • Simulate regional outage with a controlled tool to validate failover and traffic routing
  • Run a scheduled Chaos Monkey to ensure auto-scaling replaces terminated instances and no single-service outage occurs
  • Automate continuous smoke experiments that check top-level user flows every day

FAQ

Is chaos engineering the same as randomly breaking production?

No. Chaos engineering is hypothesis-driven, measurable, and constrained. Experiments are designed to minimize blast radius and to build confidence, not cause uncontrolled damage.

Should I run experiments in production or staging?

Prefer production with safeguards when possible. Staging often lacks real-world complexity. If you must use staging, ensure it mirrors production behavior closely and accept limitations in validity.