
qe-chaos-resilience skill

/v3/assets/skills/qe-chaos-resilience

This skill helps you perform chaos engineering and resilience testing using controlled fault injection, load testing, and disaster recovery validation.

npx playbooks add skill proffesor-for-testing/agentic-qe --skill qe-chaos-resilience

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (4.7 KB)
---
name: "QE Chaos Resilience"
description: "Chaos engineering and resilience testing including fault injection, load testing, and system recovery validation."
---

# QE Chaos Resilience

## Purpose

Guide the use of v3's chaos engineering capabilities including controlled fault injection, load/stress testing, resilience validation, and disaster recovery testing.

## Activation

- When testing system resilience
- When performing chaos experiments
- When load/stress testing
- When validating disaster recovery
- When testing circuit breakers

## Quick Start

```bash
# Run chaos experiment
aqe chaos run --experiment network-latency --target api-service

# Load test
aqe chaos load --scenario peak-traffic --duration 30m

# Stress test to breaking point
aqe chaos stress --endpoint /api/users --max-users 10000

# Test circuit breaker
aqe chaos circuit-breaker --service payment-service
```

## Agent Workflow

```typescript
// Chaos experiment
Task("Run chaos experiment", `
  Execute controlled chaos on api-service:
  - Inject 500ms network latency
  - Monitor service health metrics
  - Verify circuit breaker activation
  - Measure recovery time
  - Document findings
`, "qe-chaos-engineer")

// Load testing
Task("Performance load test", `
  Run load test simulating Black Friday traffic:
  - Ramp up to 10,000 concurrent users
  - Maintain load for 30 minutes
  - Monitor response times and error rates
  - Identify bottlenecks
  - Compare against SLAs
`, "qe-load-tester")
```

## Chaos Experiments

### 1. Fault Injection

```typescript
await chaosEngineer.injectFault({
  target: 'api-service',
  fault: {
    type: 'latency',
    parameters: {
      delay: '500ms',
      jitter: '100ms',
      percentage: 50
    }
  },
  duration: '5m',
  monitoring: {
    metrics: ['response_time', 'error_rate', 'throughput'],
    alerts: true
  },
  rollback: {
    automatic: true,
    trigger: 'error_rate > 10%'
  }
});
```

### 2. Load Testing

```typescript
await loadTester.execute({
  scenario: 'peak-traffic',
  profile: {
    rampUp: '5m',
    steadyState: '30m',
    rampDown: '5m'
  },
  users: {
    initial: 100,
    target: 5000,
    pattern: 'linear'
  },
  assertions: {
    p95_latency: '<500ms',
    error_rate: '<1%',
    throughput: '>1000rps'
  }
});
```

### 3. Stress Testing

```typescript
await loadTester.stressTest({
  endpoint: '/api/checkout',
  strategy: 'step-increase',
  steps: [100, 500, 1000, 2000, 5000],
  stepDuration: '5m',
  findBreakingPoint: true,
  monitoring: {
    resourceUtilization: true,
    databaseConnections: true,
    memoryUsage: true
  }
});
```

### 4. Resilience Validation

```typescript
await resilienceTester.validate({
  scenarios: [
    'database-failover',
    'cache-failure',
    'external-service-timeout',
    'pod-termination'
  ],
  expectations: {
    gracefulDegradation: true,
    automaticRecovery: true,
    dataIntegrity: true,
    recoveryTime: '<30s'
  }
});
```

## Fault Types

| Fault | Description | Use Case |
|-------|-------------|----------|
| Latency | Add network delay | Test timeouts |
| Packet Loss | Drop network packets | Test retry logic |
| CPU Stress | Consume CPU | Test resource limits |
| Memory Pressure | Consume memory | Test OOM handling |
| Disk Full | Fill disk space | Test disk errors |
| Process Kill | Terminate process | Test recovery |
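
The same `injectFault` call shown in the fault injection example can target the other fault types in this table. A minimal sketch follows; the fault type strings and parameter names (`lossPercentage`, `signal`, and so on) are assumptions for illustration, not a confirmed v3 API.

```typescript
// Hypothetical packet-loss and process-kill faults using the injectFault()
// shape from the latency example above. Parameter names are assumptions.
await chaosEngineer.injectFault({
  target: 'checkout-service',
  fault: {
    type: 'packet-loss',
    parameters: { lossPercentage: 5, direction: 'egress' }
  },
  duration: '3m',
  rollback: { automatic: true, trigger: 'error_rate > 5%' }
});

await chaosEngineer.injectFault({
  target: 'worker-pool',
  fault: {
    type: 'process-kill',
    parameters: { signal: 'SIGKILL', processes: 1 }
  },
  duration: '1m',
  monitoring: { metrics: ['restart_count', 'queue_depth'], alerts: true }
});
```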

## Chaos Report

```typescript
interface ChaosReport {
  experiment: {
    name: string;
    target: string;
    fault: FaultConfig;
    duration: number;
  };
  results: {
    hypothesis: string;
    validated: boolean;
    metrics: {
      before: MetricSnapshot;
      during: MetricSnapshot;
      after: MetricSnapshot;
    };
    events: ChaosEvent[];
    recovery: {
      detected: boolean;
      time: number;
      automatic: boolean;
    };
  };
  findings: {
    severity: 'critical' | 'high' | 'medium' | 'low';
    description: string;
    recommendation: string;
  }[];
  artifacts: {
    logs: string;
    metrics: string;
    traces: string;
  };
}
```
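
A report consumer might summarize findings and gate a CI step on critical severity. The sketch below is a minimal example against the `ChaosReport` interface above; the report file path and the gating logic are assumptions.

```typescript
import { readFileSync } from 'node:fs';

// Hypothetical consumer: summarize a ChaosReport and fail a CI step on
// critical findings. Only the ChaosReport shape above is given; the file
// path and field usage are illustrative.
function summarize(report: ChaosReport): string {
  const critical = report.findings.filter(f => f.severity === 'critical');
  const { recovery } = report.results;
  return [
    `Experiment: ${report.experiment.name} on ${report.experiment.target}`,
    `Hypothesis validated: ${report.results.validated}`,
    `Recovery: ${recovery.detected ? `time ${recovery.time} (automatic: ${recovery.automatic})` : 'not detected'}`,
    `Critical findings: ${critical.length}`
  ].join('\n');
}

const report: ChaosReport = JSON.parse(
  readFileSync('reports/latency-api-service.json', 'utf8')
);
console.log(summarize(report));
if (report.findings.some(f => f.severity === 'critical')) {
  process.exitCode = 1; // fail the pipeline when critical findings exist
}
```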

## Safety Controls

```yaml
safety:
  blast_radius:
    max_affected_pods: 1
    max_affected_percentage: 10

  abort_conditions:
    - error_rate > 50%
    - p99_latency > 10s
    - service_unavailable

  excluded_environments:
    - production-critical

  required_approvals:
    production: 2
    staging: 0
```

## SLA Validation

```typescript
await resilienceTester.validateSLA({
  slas: {
    availability: 99.9,
    p95_latency: 500,
    error_rate: 0.1
  },
  period: '30d',
  report: {
    breaches: true,
    trends: true,
    projections: true
  }
});
```

## Coordination

**Primary Agents**: qe-chaos-engineer, qe-load-tester, qe-resilience-tester
**Coordinator**: qe-chaos-coordinator
**Related Skills**: qe-performance, qe-security-compliance

Overview

This skill enables chaos engineering and resilience testing across services, combining fault injection, load/stress testing, and recovery validation. It provides controlled experiments, safety controls, and structured reporting to identify weaknesses and verify SLAs. Use it to validate graceful degradation, automatic recovery, and system behavior under real-world failure modes.

How this skill works

The skill runs targeted experiments that inject faults (latency, packet loss, CPU/memory pressure, disk faults, process kills) and executes load or stress profiles. It monitors key metrics before, during, and after tests, enforces abort conditions and blast-radius limits, and produces a Chaos Report with findings and remediation recommendations. Recovery and SLA validation routines measure detection and repair times and confirm compliance against defined thresholds.
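
As an illustration of that before/during/after flow, a minimal sketch follows. It reuses the `chaosEngineer.injectFault` call from the SKILL.md examples; `snapshotMetrics` and `waitForRecovery` are assumed helper functions, not confirmed APIs.

```typescript
// Hypothetical end-to-end flow: snapshot metrics, inject a fault, then
// measure recovery. snapshotMetrics() and waitForRecovery() are assumed helpers.
async function runLatencyExperiment() {
  const before = await snapshotMetrics('api-service');

  await chaosEngineer.injectFault({
    target: 'api-service',
    fault: { type: 'latency', parameters: { delay: '500ms', percentage: 50 } },
    duration: '5m',
    rollback: { automatic: true, trigger: 'error_rate > 10%' }
  });

  const during = await snapshotMetrics('api-service');
  const recovered = await waitForRecovery('api-service', { timeout: '30s' });
  const after = await snapshotMetrics('api-service');

  return { before, during, after, recovered };
}
```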

When to use it

  • Validate service resilience before major releases or infrastructure changes
  • Confirm circuit breakers, retries, and graceful degradation behave under faults
  • Identify breaking points via stress and peak-load scenarios
  • Test disaster recovery, failover, and automatic recovery paths
  • Verify SLA metrics and produce compliance reports for stakeholders

Best practices

  • Start in staging with narrow blast radius and escalate cautiously toward production
  • Define clear hypotheses, success criteria, and rollback/abort conditions before each experiment (see the sketch after this list)
  • Automate monitoring and alerts for p95/p99 latency, error rate, throughput, and resource usage
  • Pair chaos runs with postmortem artifacts: logs, traces, metrics snapshots, and remediation actions
  • Use coordinated agent roles (chaos engineer, load tester, resilience tester) to separate concerns
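
A minimal sketch of an experiment definition that captures hypothesis, success criteria, and abort conditions up front; the field names are illustrative, not a fixed schema.

```typescript
// Hypothetical experiment definition written before the run.
// Field names are assumptions chosen for illustration.
const experiment = {
  name: 'api-latency-500ms',
  hypothesis: 'p95 latency stays under 800ms and the circuit breaker opens within 10s',
  target: 'api-service',
  successCriteria: {
    p95LatencyMs: { max: 800 },
    circuitBreakerOpenedWithinSec: 10,
    dataLoss: false
  },
  abortConditions: ['error_rate > 10%', 'p99_latency > 5s'],
  rollback: { automatic: true }
};
```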

Example use cases

  • Inject 500ms latency on an API to validate timeouts and retry behavior
  • Run a Black Friday-style load profile: ramp to 10k concurrent users and validate p95 < 500ms
  • Step-increase stress test to find the system breaking point and resource bottlenecks
  • Simulate database failover and verify data integrity and <30s recovery time
  • Test circuit breaker activation for a payment service under sustained errors

FAQ

How do safety controls prevent harmful experiments?

Safety controls enforce blast-radius limits, abort conditions (e.g., error-rate thresholds, p99 latency bounds), excluded environments, and required approvals, so experiments are stopped automatically when risk thresholds are breached.

What artifacts are produced after a chaos run?

The Chaos Report includes experiment details, metric snapshots (before/during/after), events, recovery timelines, severity-tagged findings, and artifacts such as logs, metrics exports, and traces.