
testing-chaos-designer skill


This skill helps design hypothesis-driven chaos experiments to test resilience, define blast radii, and generate tool-specific configurations.

npx playbooks add skill williamzujkowski/cognitive-toolworks --skill testing-chaos-designer

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: Chaos Engineering Experiment Designer
slug: testing-chaos-designer
description: Design chaos engineering experiments to test system resilience with controlled failure injection, hypothesis formulation, and blast radius control.
capabilities:
  - Define steady-state hypotheses for distributed systems
  - Design controlled chaos experiments with measurable outcomes
  - Configure blast radius limits to minimize production impact
  - Generate experiment specifications for Chaos Mesh, LitmusChaos, and Chaos Monkey
  - Implement progressive failure injection strategies
  - Create experiment reports with resilience metrics
inputs:
  - system_architecture: "Description of target system components, dependencies, and deployment topology"
  - resilience_goals: "Specific reliability objectives (e.g., RTO, RPO, availability targets)"
  - experiment_scope: "Boundaries for chaos testing (services, regions, blast radius)"
  - existing_monitoring: "Available observability tools and steady-state metrics"
outputs:
  - experiment_plan: "Complete chaos experiment specification with hypothesis, variables, and success criteria"
  - implementation_config: "Tool-specific configuration (Chaos Mesh YAML, LitmusChaos CRDs, etc.)"
  - safety_controls: "Blast radius limits, abort conditions, and rollback procedures"
  - reporting_template: "Experiment execution report structure with resilience metrics"
keywords:
  - chaos engineering
  - resilience testing
  - failure injection
  - steady state hypothesis
  - blast radius
  - chaos mesh
  - litmuschaos
  - chaos monkey
  - SRE
  - distributed systems
version: 1.0.0
owner: cognitive-toolworks
license: CC-BY-SA-4.0
security: Public - no sensitive data
links:
  - https://principlesofchaos.org/
  - https://chaos-mesh.org/
  - https://litmuschaos.io/
  - https://netflix.github.io/chaosmonkey/
---

## Purpose & When-To-Use

**Trigger conditions:**
- Resilience testing needed for distributed system or microservices architecture
- Disaster recovery validation beyond traditional testing
- SRE practice adoption requiring systematic failure experimentation
- Production confidence gaps in system behavior under failure conditions
- Pre-deployment validation of fault tolerance mechanisms
- Post-incident chaos engineering to prevent recurrence

**Use this skill to:**
- Design hypothesis-driven chaos experiments with measurable outcomes
- Define steady-state baselines and deviation thresholds
- Configure controlled failure injection with progressive escalation
- Generate tool-specific experiment configurations (Chaos Mesh, LitmusChaos, Chaos Monkey)
- Establish blast radius controls and abort conditions
- Create reproducible experiment workflows integrated with CI/CD

**Do NOT use for:**
- Traditional load or performance testing (use testing-strategy-composer)
- Security penetration testing (use security-assessment-framework)
- Functional correctness testing (use testing-strategy-composer)
- One-off manual fault injection without hypothesis or measurement

## Pre-Checks

**Time normalization:**
```
NOW_ET = 2025-10-25T21:30:36-04:00
```

**Required inputs validation:**
- [ ] `system_architecture` includes component diagram with dependencies
- [ ] `resilience_goals` specify quantitative targets (e.g., 99.9% availability)
- [ ] `experiment_scope` defines clear boundaries (services, environments, regions)
- [ ] `existing_monitoring` lists available metrics, dashboards, and alerting

**Source freshness checks:**
- Principles of Chaos Engineering (accessed 2025-10-25T21:30:36-04:00): https://principlesofchaos.org/
- Chaos Mesh v2.x documentation (accessed 2025-10-25T21:30:36-04:00): https://chaos-mesh.org/
- LitmusChaos 3.x framework (accessed 2025-10-25T21:30:36-04:00): https://litmuschaos.io/
- Netflix Chaos Monkey practices (accessed 2025-10-25T21:30:36-04:00): https://netflix.github.io/chaosmonkey/

**Abort conditions:**
- If `system_architecture` lacks dependency information → request clarification
- If no monitoring baseline exists → emit TODO: establish steady-state metrics first
- If production environment lacks rollback capabilities → restrict to non-prod only

## Procedure

### Tier 1: Quick Experiment Design (≤2k tokens)

**Fast path for common scenarios:**

1. **Validate experiment readiness**
   - Check monitoring baseline exists
   - Verify rollback capabilities
   - Confirm blast radius boundaries

2. **Define steady-state hypothesis** (a Prometheus-rule sketch follows the T1 deliverable below)
   - Identify key user-facing metrics (latency, error rate, throughput)
   - Establish normal operating ranges from historical data
   - Example: "P95 latency < 200ms AND error rate < 0.1% during business hours"

3. **Select failure scenario** (common patterns)
   - Pod/instance termination (Chaos Monkey pattern)
   - Network latency/partition injection
   - Resource exhaustion (CPU, memory, I/O)
   - Cloud region/availability zone failure

4. **Configure minimal experiment**
   - Start with 1-5% traffic/instances
   - 5-minute duration maximum
   - Auto-abort if steady state violated by >20%
   - Single service scope

5. **Output T1 experiment spec**
   ```yaml
   experiment_name: "<service>-<failure-type>-v1"
   hypothesis: "<steady-state-assertion>"
   scope: "<service-name> in <environment>"
   blast_radius: "<percentage> of instances"
   duration: "5m"
   abort_conditions: "<steady-state-threshold>"
   ```

**T1 deliverable:** Minimal experiment specification ready for review.
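
A minimal sketch of how the example hypothesis from step 2 could be encoded as a Prometheus alerting rule, so that a firing alert can serve as the auto-abort trigger from step 4. The metric names (`http_request_duration_seconds_bucket`, `http_requests_total`), label selectors, and alert name are assumptions rather than part of this skill's contract; adapt them to the metrics listed in `existing_monitoring`.

```yaml
groups:
  - name: chaos-steady-state
    rules:
      - alert: ChaosSteadyStateViolated
        expr: |
          # P95 latency above 200ms OR 5xx error rate above 0.1%
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="payment-service"}[1m])) by (le)
          ) > 0.2
          or
          sum(rate(http_requests_total{service="payment-service",code=~"5.."}[1m]))
            / sum(rate(http_requests_total{service="payment-service"}[1m])) > 0.001
        for: 1m
        labels:
          action: abort-chaos-experiment
        annotations:
          summary: "Steady-state hypothesis violated; abort fault injection"
```

Wiring the firing alert to an actual abort (webhook, chaos operator pause, or pipeline gate) is deployment-specific and left to the implementer.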

---

### Tier 2: Production-Ready Experiment (≤6k tokens)

**Extended validation with tool-specific configuration:**

1. **Enhanced steady-state definition**
   - Define multiple observability signals (Golden Signals: latency, traffic, errors, saturation)
   - Specify SLO-aligned thresholds
   - Include downstream dependency health checks
   - Configure Prometheus/Datadog queries for real-time validation

2. **Advanced failure scenario design**
   - Select from chaos engineering taxonomy (accessed 2025-10-25T21:30:36-04:00): https://principlesofchaos.org/
     - Infrastructure failures: instance termination, disk failure, network partition
     - Application failures: process crash, memory leak simulation, database connection pool exhaustion
     - Dependency failures: upstream service degradation, third-party API timeout
   - Define progressive escalation path: 1% → 5% → 25% → 50%

3. **Blast radius and safety controls**
   - Geographic boundaries: single AZ, multi-AZ, or multi-region
   - Service boundaries: leaf services before core platform services
   - Time boundaries: off-peak hours, maintenance windows
   - Automated abort triggers:
     - Steady state deviation > configured threshold (e.g., 15%)
     - Customer-facing SLO breach
     - Manual kill switch activation
   - Rollback procedures: immediate fault injection termination, traffic rerouting, instance replacement

4. **Generate tool-specific configuration** (a NetworkChaos sketch follows the T2 deliverable below)

   **For Kubernetes + Chaos Mesh:**
   - PodChaos for instance termination
   - NetworkChaos for latency/partition injection
   - StressChaos for resource exhaustion
   - IOChaos for disk failure simulation

   **For Kubernetes + LitmusChaos:**
   - ChaosExperiment CRD definition
   - ChaosEngine linking workload to fault
   - Probes for steady-state validation
   - ChaosResult for metrics export

   **For AWS + Chaos Monkey:**
   - ASG-scoped termination policies
   - Conformity Monkey for architectural validation
   - Simian Army integration

5. **Monitoring and reporting setup**
   - Pre-experiment baseline capture (15-30 minutes)
   - During-experiment real-time dashboards
   - Post-experiment comparison analysis
   - Prometheus metrics export for:
     - `chaos_experiment_duration_seconds`
     - `chaos_steady_state_deviation_percent`
     - `chaos_blast_radius_instances_affected`

6. **Experiment execution workflow**
   ```
   1. Baseline collection (pre-experiment)
   2. Fault injection start
   3. Continuous steady-state monitoring
   4. Auto-abort on threshold breach OR manual intervention
   5. Fault injection termination
   6. Recovery validation (post-experiment)
   7. Results analysis and report generation
   ```

**T2 deliverable:** Production-ready experiment with tool configs, safety controls, and monitoring integration.
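
As referenced in step 4, a Chaos Mesh latency-injection manifest might look like the sketch below. The names, namespace, latency values, and 5% blast radius are illustrative, mirroring the conservative defaults above rather than any required configuration.

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-upstream-delay     # hypothetical experiment name
  namespace: staging
spec:
  action: delay                    # inject added network latency
  mode: fixed-percent
  value: "5"                       # blast radius: 5% of matching pods
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "25"
  duration: "5m"
```

Escalating along the path in step 2 then means changing only `value` between runs while the rest of the manifest stays fixed, which keeps results comparable across iterations.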

**T2 sources:**
- Chaos Mesh fault types (accessed 2025-10-25T21:30:36-04:00): https://chaos-mesh.org/ - supports PodChaos, NetworkChaos, IOChaos, TimeChaos, StressChaos
- LitmusChaos CRD architecture (accessed 2025-10-25T21:30:36-04:00): https://litmuschaos.io/ - uses ChaosExperiment, ChaosEngine, ChaosResult custom resources
- Netflix best practices (accessed 2025-10-25T21:30:36-04:00): Start small (single node), enable monitoring first, gradual escalation, automate over time
- Google Cloud chaos engineering (accessed 2025-10-25T21:30:36-04:00): Build hypothesis around steady state, replicate real-world conditions, minimize blast radius

---

### Tier 3: Advanced Experiment Suite (≤12k tokens)

**Comprehensive resilience validation (use only when explicitly requested):**

1. **Multi-dimensional experiment matrix**
   - Combine failure modes: network partition + instance termination
   - Cascade scenarios: upstream dependency failure → downstream impact
   - Time-based variations: gradual degradation vs sudden failure
   - Geographic distribution: multi-region failover validation

2. **Automated experiment pipelines**
   - CI/CD integration for continuous chaos testing (a pipeline sketch follows the T3 deliverable below)
   - GameDay automation with scheduled experiment runs
   - Regression testing for resilience (post-deployment validation)

3. **Advanced metrics and analysis**
   - MTTR (Mean Time To Recovery) calculation
   - Blast radius expansion rate
   - Failure propagation graph
   - Resilience score calculation

4. **Org-wide chaos engineering program**
   - Skill development and training plans
   - Runbook generation from experiments
   - Blameless postmortem templates
   - Chaos engineering maturity assessment

**T3 deliverable:** Enterprise-scale chaos engineering program with automation, metrics, and cultural integration.
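
To make the CI/CD integration item above concrete, a scheduled pipeline can apply a chaos manifest, hold for the experiment window, and gate on the steady-state alert. The sketch below assumes GitHub Actions, a runner already authenticated to the cluster, a repository variable `PROM_URL`, the manifest path `chaos/payment-pod-kill.yaml`, and the `ChaosSteadyStateViolated` rule sketched in Tier 1; all of these are illustrative, not prescribed by this skill.

```yaml
name: scheduled-chaos-gameday
on:
  schedule:
    - cron: "0 14 * * 3"            # hypothetical weekly GameDay window
jobs:
  run-experiment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Apply chaos experiment
        run: kubectl apply -f chaos/payment-pod-kill.yaml
      - name: Hold for experiment window
        run: sleep 300              # matches a 5-minute experiment duration
      - name: Gate on steady-state alert
        env:
          PROM_URL: ${{ vars.PROM_URL }}
        run: |
          # Fail the run if the steady-state alert is firing in Prometheus.
          firing=$(curl -s "$PROM_URL/api/v1/alerts" \
            | jq '[.data.alerts[] | select(.labels.alertname=="ChaosSteadyStateViolated" and .state=="firing")] | length')
          test "$firing" -eq 0
      - name: Clean up
        if: always()
        run: kubectl delete -f chaos/payment-pod-kill.yaml --ignore-not-found
```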

## Decision Rules

**Experiment scope selection:**
- If system is new (<6 months in production) → T1 minimal experiment, non-production only
- If system has established monitoring + SLOs → T2 production experiment with 1-5% blast radius
- If mature resilience practice exists → T2 with progressive escalation to 25-50%
- If multi-team coordination needed → T3 with GameDay orchestration

**Tool selection:**
- If Kubernetes-native deployment → prefer Chaos Mesh or LitmusChaos
- If AWS EC2/ASG workloads → consider Chaos Monkey or AWS Fault Injection Simulator
- If multi-cloud or hybrid → Chaos Mesh (cloud-agnostic) or Gremlin (SaaS)
- If budget constraints → open-source LitmusChaos or Chaos Mesh over commercial Gremlin

**Safety thresholds:**
- **Abort if:** Steady-state deviation >15% OR customer SLO breach OR manual intervention
- **Start conservatively:** 1-5% blast radius, 5-minute duration
- **Escalate gradually:** 2x blast radius per iteration if previous experiment passed
- **Production readiness gate:** 3+ successful non-production experiments before production testing

**Ambiguity handling:**
- If steady-state metrics unclear → work with SRE/ops to define; emit TODO list
- If blast radius boundaries ambiguous → default to most conservative (1%, single AZ, leaf services)
- If rollback procedures undefined → restrict to non-production until procedures documented

## Output Contract

**Primary output: `experiment_plan` (JSON)**
```json
{
  "experiment_id": "string (unique identifier)",
  "hypothesis": {
    "steady_state": "string (measurable assertion)",
    "metrics": [
      {
        "name": "string (e.g., p95_latency_ms)",
        "baseline": "number (historical average)",
        "threshold": "number (max acceptable deviation)"
      }
    ]
  },
  "failure_injection": {
    "type": "string (pod-kill|network-delay|cpu-stress|region-failure)",
    "target": "string (service/component name)",
    "parameters": "object (tool-specific config)"
  },
  "blast_radius": {
    "scope": "string (service|AZ|region)",
    "percentage": "number (1-100)",
    "max_instances": "number"
  },
  "duration": "string (ISO 8601 duration, e.g., PT5M)",
  "abort_conditions": [
    "string (condition triggering experiment termination)"
  ],
  "rollback_procedure": "string (steps to restore normal state)"
}
```
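
For illustration, a filled-in `experiment_plan` for the payment-service scenario shown under Examples might look like the following; all values are hypothetical and would normally be derived from the provided inputs and monitoring baseline.

```json
{
  "experiment_id": "payment-pod-kill-v1",
  "hypothesis": {
    "steady_state": "P95 latency < 200ms AND error rate < 0.1%",
    "metrics": [
      { "name": "p95_latency_ms", "baseline": 120, "threshold": 200 },
      { "name": "error_rate_percent", "baseline": 0.02, "threshold": 0.1 }
    ]
  },
  "failure_injection": {
    "type": "pod-kill",
    "target": "payment-service",
    "parameters": { "mode": "one", "namespace": "staging" }
  },
  "blast_radius": {
    "scope": "service",
    "percentage": 33,
    "max_instances": 1
  },
  "duration": "PT5M",
  "abort_conditions": [
    "p95_latency_ms > 200 for 1 minute",
    "error_rate_percent > 0.1 for 1 minute",
    "manual kill switch activated"
  ],
  "rollback_procedure": "Terminate fault injection, let the Deployment controller reschedule the killed pod, and verify steady state for 10 minutes."
}
```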

**Secondary output: `implementation_config` (tool-specific YAML/JSON)**
- Chaos Mesh: PodChaos, NetworkChaos, or StressChaos YAML manifest
- LitmusChaos: ChaosExperiment, ChaosEngine CRDs
- Chaos Monkey: Configuration properties or API payloads

**Tertiary output: `safety_controls` (checklist)**
- [ ] Monitoring dashboards configured
- [ ] Alerting thresholds set
- [ ] Rollback runbook accessible
- [ ] Stakeholder notification plan
- [ ] Manual abort procedure documented
- [ ] Post-experiment cleanup steps defined

**Required fields:** All JSON schema fields above are mandatory. Missing fields → skill emits TODO and stops.

## Examples

**Example: Pod termination experiment for payment service**

```yaml
# Input
system_architecture: "Payment service (3 replicas) → Database (RDS)"
resilience_goals: "99.9% availability, P95 latency <200ms"
experiment_scope: "Payment service pods in staging, 1 pod max"
existing_monitoring: "Prometheus + Grafana, payment_request_duration_ms"

# Output (Chaos Mesh PodChaos)
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-pod-kill-exp
  namespace: staging
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-service
  duration: 5m
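  # Note: spec.scheduler follows the Chaos Mesh 1.x API; in Chaos Mesh 2.x,
  # recurring runs are defined with a separate Schedule resource instead.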
  scheduler:
    cron: "@every 1h"  # Automated GameDay
```
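
**Example: LitmusChaos variant of the same pod-termination experiment**

A sketch under assumed names: the namespace, service account, and environment values are illustrative, and field names should be verified against the LitmusChaos version in use.

```yaml
# Output (LitmusChaos ChaosEngine)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-delete-engine
  namespace: staging
spec:
  engineState: active
  appinfo:
    appns: staging
    applabel: app=payment-service
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION   # seconds of fault injection
              value: "300"
            - name: CHAOS_INTERVAL         # seconds between successive pod deletions
              value: "60"
            - name: PODS_AFFECTED_PERC     # blast radius as a percentage of matching pods
              value: "34"
```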

## Quality Gates

**Token budgets (enforced):**
- T1: ≤2000 tokens (minimal experiment spec)
- T2: ≤6000 tokens (production-ready with tool config)
- T3: ≤12000 tokens (advanced suite + program design)

**Safety requirements:**
- Every experiment MUST define abort conditions
- Blast radius MUST be explicitly bounded
- Production experiments REQUIRE successful non-prod validation first

**Auditability:**
- All experiments logged with timestamp, executor, and results
- Changes to experiment parameters tracked in version control
- Results exported to observability platform (Prometheus/Datadog)

**Determinism:**
- Same experiment specification → reproducible results (within statistical variance)
- Randomized failure injection uses seeded RNG for replay capability

**Quality checklist:**
- [ ] Steady-state hypothesis is measurable and falsifiable
- [ ] Failure injection reflects real-world scenarios
- [ ] Blast radius minimizes customer impact
- [ ] Monitoring captures experiment success/failure
- [ ] Rollback procedure tested and documented

## Resources

**Official documentation:**
- Principles of Chaos Engineering (accessed 2025-10-25T21:30:36-04:00): https://principlesofchaos.org/
- Chaos Mesh documentation (accessed 2025-10-25T21:30:36-04:00): https://chaos-mesh.org/
- LitmusChaos framework (accessed 2025-10-25T21:30:36-04:00): https://litmuschaos.io/
- Netflix Chaos Monkey (accessed 2025-10-25T21:30:36-04:00): https://netflix.github.io/chaosmonkey/

**Templates and examples:**
- See `resources/experiment-template.yaml` for full experiment specification
- See `resources/blast-radius-config.json` for safety boundary examples

**Related skills:**
- `cloud-native-deployment-orchestrator` - for understanding Kubernetes deployment topology
- `devops-pipeline-architect` - for CI/CD integration of chaos experiments
- `observability-slo-calculator` - for defining steady-state thresholds aligned with SLOs

Overview

This skill designs hypothesis-driven chaos engineering experiments to validate and improve system resilience. It produces minimal-to-production-ready experiment plans, tool-specific configs (Chaos Mesh, LitmusChaos, Chaos Monkey), and safety controls to limit blast radius and enable safe execution. The focus is on measurable steady-state assertions, controlled failure injection, and integrated monitoring and rollback procedures.

How this skill works

The skill validates readiness inputs (architecture, resilience goals, scope, monitoring) and selects an appropriate tier: quick, production-ready, or advanced suite. It formulates a steady-state hypothesis, chooses failure scenarios, bounds blast radius and duration, and emits a JSON experiment_plan plus tool manifests and a safety checklist. Abort conditions, progressive escalation rules, and monitoring queries are included to ensure automated and manual safeguards.

When to use it

  • You need to test the resilience of a distributed system or microservices architecture.
  • You want to validate disaster recovery or fault tolerance beyond unit and integration tests.
  • You are adopting SRE practices and want systematic failure experimentation.
  • You are preparing a production deployment and need to verify fault handling and rollback paths.
  • You are following up on an incident and want to confirm mitigations and prevent recurrence.

Best practices

  • Start conservatively: 1–5% blast radius, short duration (e.g., 5 minutes), single-service scope.
  • Define a falsifiable steady-state hypothesis with SLO-aligned metrics and baselines.
  • Require monitoring baseline and rollback capabilities before production runs.
  • Use progressive escalation (1% → 5% → 25% → 50%) only after passing earlier stages.
  • Include automated abort triggers and a manual kill switch in every experiment.
  • Log experiments, export results to observability, and run blameless postmortems.

Example use cases

  • T1 quick experiment: Pod termination on a non-prod payment service with a 5m duration and 1% blast radius.
  • T2 production-ready: Network latency injection in staging with Prometheus probes, auto-abort on >15% steady-state deviation, and Chaos Mesh YAML output.
  • T2 tool conversion: Generate LitmusChaos CRDs with probes and a ChaosEngine linked to the target workload.
  • T3 advanced: CI-integrated GameDay pipeline that runs scheduled experiments and computes MTTR and resilience score.
  • Post-incident validation: Recreate partial outage scenario to confirm remediation and update runbooks.

FAQ

What inputs are required to generate a valid experiment plan?

You must provide system_architecture with dependency details, resilience_goals with quantitative targets, experiment_scope, and existing_monitoring listing metrics and dashboards.

When must experiments be restricted to non-production?

If monitoring baselines or rollback procedures are missing, or the system has been in production for less than six months, restrict experiments to non-production until those gaps are closed; if dependency information is missing, request clarification before designing the experiment.