dev-chaos-testing skill

This skill orchestrates chaos testing to verify resilient error handling by injecting faults via Toxiproxy and validating graceful degradation.

npx playbooks add skill lerianstudio/ring --skill dev-chaos-testing

SKILL.md

---
name: ring:dev-chaos-testing
title: Development cycle chaos testing (Gate 7)
category: development-cycle
tier: 1
when_to_use: |
  Use after integration testing (Gate 6) is complete.
  MANDATORY for all development tasks with external dependencies - verifies graceful degradation under failure.
description: |
  Gate 7 of development cycle - ensures chaos tests exist using Toxiproxy
  to verify graceful degradation under connection loss, latency, and partitions.

trigger: |
  - After integration testing complete (Gate 6)
  - MANDATORY for all development tasks with external dependencies
  - Verifies system behavior under failure conditions

NOT_skip_when: |
  - "Infrastructure is reliable" - All infrastructure fails eventually. Be prepared.
  - "Integration tests cover failures" - Integration tests verify happy path. Chaos verifies failures.
  - "Toxiproxy is complex" - One container, 20 minutes setup. Prevents production incidents.

sequence:
  after: [ring:dev-integration-testing]
  before: [ring:requesting-code-review]

related:
  complementary: [ring:dev-cycle, ring:dev-integration-testing, ring:qa-analyst]

input_schema:
  required:
    - name: unit_id
      type: string
      description: "Task or subtask identifier"
    - name: external_dependencies
      type: array
      items: string
      description: "External services (postgres, redis, rabbitmq, etc.)"
    - name: language
      type: string
      enum: [go, typescript]
      description: "Programming language"
  optional:
    - name: gate6_handoff
      type: object
      description: "Full handoff from Gate 6 (integration testing)"

output_schema:
  format: markdown
  required_sections:
    - name: "Chaos Testing Summary"
      pattern: "^## Chaos Testing Summary"
      required: true
    - name: "Failure Scenarios"
      pattern: "^## Failure Scenarios"
      required: true
    - name: "Handoff to Next Gate"
      pattern: "^## Handoff to Next Gate"
      required: true
  metrics:
    - name: result
      type: enum
      values: [PASS, FAIL]
    - name: dependencies_tested
      type: integer
    - name: scenarios_tested
      type: integer
    - name: recovery_verified
      type: boolean
    - name: iterations
      type: integer

verification:
  automated:
    - command: "grep -rn 'TestIntegration_Chaos_' --include='*_test.go' ."
      description: "Chaos test functions exist"
      success_pattern: "TestIntegration_Chaos_"
    - command: "grep -rn 'CHAOS.*1' --include='*_test.go' ."
      description: "CHAOS env check present"
      success_pattern: "CHAOS"
  manual:
    - "Chaos tests follow TestIntegration_Chaos_{Component}_{Scenario} naming"
    - "All external dependencies have failure scenarios"
    - "Recovery verified after each failure injection"

examples:
  - name: "Chaos tests for database operations"
    input:
      unit_id: "task-001"
      external_dependencies: ["postgres", "redis"]
      language: "go"
    expected_output: |
      ## Chaos Testing Summary
      **Status:** PASS
      **Dependencies Tested:** 2
      **Scenarios Tested:** 6
      **Recovery Verified:** Yes

      ## Failure Scenarios
      | Component | Scenario | Status | Recovery |
      |-----------|----------|--------|----------|
      | PostgreSQL | Connection Loss | PASS | Yes |
      | PostgreSQL | High Latency | PASS | Yes |
      | PostgreSQL | Network Partition | PASS | Yes |
      | Redis | Connection Loss | PASS | Yes |
      | Redis | High Latency | PASS | Yes |
      | Redis | Timeout | PASS | Yes |

      ## Handoff to Next Gate
      - Ready for Gate 8 (Code Review): YES
---

# Dev Chaos Testing (Gate 7)

## Overview

Ensure code handles **failure conditions gracefully** by injecting faults using Toxiproxy. Verify connection loss, latency, and network partitions don't cause crashes.

**Core principle:** All infrastructure fails. Chaos testing ensures your code handles it gracefully.

<block_condition>
- No chaos tests = FAIL
- Any dependency without failure test = FAIL
- Recovery not verified = FAIL
- System crashes on failure = FAIL
</block_condition>

## CRITICAL: Role Clarification

**This skill ORCHESTRATES. QA Analyst Agent (chaos mode) EXECUTES.**

| Who | Responsibility |
|-----|----------------|
| **This Skill** | Gather requirements, dispatch agent, track iterations |
| **QA Analyst Agent** | Write chaos tests, setup Toxiproxy, verify recovery |

---

## Standards Reference

**MANDATORY:** Load testing-chaos.md standards via WebFetch.

<fetch_required>
https://raw.githubusercontent.com/LerianStudio/ring/main/dev-team/docs/standards/golang/testing-chaos.md
</fetch_required>

---

## Step 0: Detect External Dependencies (Auto-Detection)

**MANDATORY:** When `external_dependencies` is empty or not provided, scan the codebase to detect them automatically before validation.

```text
if external_dependencies is empty or not provided:

  detected_dependencies = []

  1. Scan docker-compose.yml / docker-compose.yaml for service images:
     - Grep tool: pattern "postgres" in docker-compose* files → add "postgres"
     - Grep tool: pattern "mongo" in docker-compose* files → add "mongodb"
     - Grep tool: pattern "valkey" in docker-compose* files → add "valkey"
     - Grep tool: pattern "redis" in docker-compose* files → add "redis"
     - Grep tool: pattern "rabbitmq" in docker-compose* files → add "rabbitmq"

  2. Scan dependency manifests:
     if language == "go":
       - Grep tool: pattern "github.com/lib/pq" in go.mod → add "postgres"
       - Grep tool: pattern "github.com/jackc/pgx" in go.mod → add "postgres"
       - Grep tool: pattern "go.mongodb.org/mongo-driver" in go.mod → add "mongodb"
       - Grep tool: pattern "github.com/redis/go-redis" in go.mod → add "redis"
       - Grep tool: pattern "github.com/valkey-io/valkey-go" in go.mod → add "valkey"
       - Grep tool: pattern "github.com/rabbitmq/amqp091-go" in go.mod → add "rabbitmq"

     if language == "typescript":
       - Grep tool: pattern "\"pg\"" in package.json → add "postgres"
       - Grep tool: pattern "@prisma/client" in package.json → add "postgres"
       - Grep tool: pattern "\"mongodb\"" in package.json → add "mongodb"
       - Grep tool: pattern "\"mongoose\"" in package.json → add "mongodb"
       - Grep tool: pattern "\"redis\"" in package.json → add "redis"
       - Grep tool: pattern "\"ioredis\"" in package.json → add "redis"
       - Grep tool: pattern "@valkey" in package.json → add "valkey"
       - Grep tool: pattern "\"amqplib\"" in package.json → add "rabbitmq"
       - Grep tool: pattern "amqp-connection-manager" in package.json → add "rabbitmq"

  3. Deduplicate detected_dependencies
  4. Set external_dependencies = detected_dependencies

  Log: "Auto-detected external dependencies: [detected_dependencies]"
```
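
The Go branch of the scan above can be sketched as a plain substring match over the contents of `go.mod`. This is a minimal illustration, not the skill's implementation: `goModMarkers` and `detectGoDependencies` are hypothetical names, and the sketch assumes the file has already been read into a string.

```go
package main

import (
	"sort"
	"strings"
)

// goModMarkers pairs a go.mod substring with the dependency it implies.
// The pairs mirror the Go branch of the Step 0 pseudocode.
var goModMarkers = []struct{ pattern, dependency string }{
	{"github.com/lib/pq", "postgres"},
	{"github.com/jackc/pgx", "postgres"},
	{"go.mongodb.org/mongo-driver", "mongodb"},
	{"github.com/redis/go-redis", "redis"},
	{"github.com/valkey-io/valkey-go", "valkey"},
	{"github.com/rabbitmq/amqp091-go", "rabbitmq"},
}

// detectGoDependencies (hypothetical helper) returns the deduplicated,
// sorted dependency names implied by the contents of a go.mod file.
func detectGoDependencies(goMod string) []string {
	seen := map[string]bool{}
	for _, m := range goModMarkers {
		if strings.Contains(goMod, m.pattern) {
			seen[m.dependency] = true
		}
	}
	deps := make([]string, 0, len(seen))
	for d := range seen {
		deps = append(deps, d)
	}
	sort.Strings(deps) // stable order for logging and comparison
	return deps
}
```

Using a set before building the slice covers the deduplication step: a module that lists both `lib/pq` and `pgx` still yields a single `postgres` entry.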

<auto_detect_reason>
PM team task files often omit external_dependencies. If the codebase uses postgres, mongodb, valkey, redis, or rabbitmq, these are external dependencies that MUST have chaos tests. Auto-detection prevents silent skips.
</auto_detect_reason>

---

## Step 1: Validate Input

```text
REQUIRED INPUT:
- unit_id: [task/subtask being tested]
- external_dependencies: [postgres, mongodb, valkey, redis, rabbitmq, etc.] (from input OR auto-detected in Step 0)
- language: [go|typescript]

OPTIONAL INPUT:
- gate6_handoff: [full Gate 6 output]

if any REQUIRED input is missing:
  → STOP and report: "Missing required input: [field]"

if external_dependencies is empty (AFTER auto-detection in Step 0):
  → STOP and report: "No external dependencies found after codebase scan - chaos testing requires dependencies"
```
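
The validation rules above could look roughly like this in Go. The `chaosInput` struct and `stopReport` function are hypothetical names for illustration. The message for an unsupported language is an assumption, since the pseudocode only defines the "missing" and "no dependencies" reports.

```go
package main

// chaosInput is a hypothetical struct mirroring the Step 1 fields.
type chaosInput struct {
	UnitID               string
	ExternalDependencies []string // from input or Step 0 auto-detection
	Language             string
}

// stopReport returns the STOP message for the first problem found,
// or an empty string when the input is valid.
func stopReport(in chaosInput) string {
	if in.UnitID == "" {
		return "Missing required input: unit_id"
	}
	switch in.Language {
	case "":
		return "Missing required input: language"
	case "go", "typescript":
		// supported languages per the input schema
	default:
		// Assumption: the pseudocode does not define this message.
		return "Unsupported language: " + in.Language
	}
	if len(in.ExternalDependencies) == 0 {
		return "No external dependencies found after codebase scan - chaos testing requires dependencies"
	}
	return ""
}
```

The dependency check runs last so that it only fires after Step 0 has had a chance to populate the list.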

## Step 2: Dispatch QA Analyst Agent (Chaos Mode)

```text
Task tool:
  subagent_type: "ring:qa-analyst"
  model: "opus"
  prompt: |
    **MODE:** CHAOS TESTING (Gate 7)

    **Standards:** Load testing-chaos.md

    **Input:**
    - Unit ID: {unit_id}
    - External Dependencies: {external_dependencies}
    - Language: {language}

    **Requirements:**
    1. Setup Toxiproxy infrastructure in tests/utils/chaos/
    2. Create chaos tests (TestIntegration_Chaos_{Component}_{Scenario} naming)
    3. Use dual-gate pattern (CHAOS=1 env + testing.Short())
    4. Test failure scenarios: Connection Loss, High Latency, Network Partition
    5. Verify 5-phase structure: Normal → Inject → Verify → Restore → Recovery

    **Output Sections Required:**
    - ## Chaos Testing Summary
    - ## Failure Scenarios
    - ## Handoff to Next Gate
```
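
As an illustration of requirement 3, a dual-gate guard might look like the sketch below. It assumes the pattern means "run only when `CHAOS=1` is set and `-short` is not"; the authoritative definition lives in testing-chaos.md. `chaosEnabled` and `skipUnlessChaos` are hypothetical helper names.

```go
package main

import (
	"os"
	"testing"
)

// chaosEnabled reports whether both gates are open:
// the CHAOS env var equals "1" and the run is not in -short mode.
func chaosEnabled(chaosEnv string, short bool) bool {
	return chaosEnv == "1" && !short
}

// skipUnlessChaos skips the calling test unless both gates are open.
func skipUnlessChaos(t *testing.T) {
	t.Helper()
	if !chaosEnabled(os.Getenv("CHAOS"), testing.Short()) {
		t.Skip("chaos tests require CHAOS=1 and no -short flag")
	}
}

// TestIntegration_Chaos_Postgres_ConnectionLoss shows where the guard
// sits in a test that follows the required naming convention.
func TestIntegration_Chaos_Postgres_ConnectionLoss(t *testing.T) {
	skipUnlessChaos(t)
	// Normal → Inject → Verify → Restore → Recovery phases go here.
}
```

Keeping the decision in a pure function makes the gate itself easy to unit test without touching the environment.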

## Step 3: Evaluate Results

```text
Parse agent output:

if "**Status:** PASS" in output:
  → Gate 7 PASSED
  → Return success with metrics

if "**Status:** FAIL" in output:
  → Dispatch fix to implementation agent
  → Re-run chaos tests (max 3 iterations)
  → If still failing: ESCALATE to user
```
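
The evaluation loop can be sketched as follows. The report template writes the status as `**Status:** PASS`, so this sketch strips Markdown asterisks before matching. Two assumptions are made: "max 3 iterations" is read as three runs in total, and any report without a PASS status counts as a failure. The `run` callback is a hypothetical stand-in for dispatching the agent and returning its output.

```go
package main

import "strings"

const maxIterations = 3 // assumption: three runs in total

// gateOutcome summarizes how the gate ended.
type gateOutcome struct {
	Passed     bool
	Escalate   bool
	Iterations int
}

// hasStatus reports whether the output declares the given status,
// ignoring Markdown bold markers around the label.
func hasStatus(output, status string) bool {
	plain := strings.ReplaceAll(output, "*", "")
	return strings.Contains(plain, "Status: "+status)
}

// runGate calls run until it reports PASS or the iteration budget
// is spent, in which case the outcome asks for escalation.
func runGate(run func(iteration int) string) gateOutcome {
	for i := 1; i <= maxIterations; i++ {
		if hasStatus(run(i), "PASS") {
			return gateOutcome{Passed: true, Iterations: i}
		}
		// A failing report would trigger a fix dispatch here.
	}
	return gateOutcome{Escalate: true, Iterations: maxIterations}
}
```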

## Step 4: Generate Output

```text
## Chaos Testing Summary
**Status:** {PASS|FAIL}
**Dependencies Tested:** {count}
**Scenarios Tested:** {count}
**Recovery Verified:** {Yes|No}

## Failure Scenarios
| Component | Scenario | Status | Recovery |
|-----------|----------|--------|----------|
| {component} | {scenario} | {PASS|FAIL} | {Yes|No} |

## Handoff to Next Gate
- Ready for Gate 8 (Code Review): {YES|NO}
- Iterations: {count}
```
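
A rough sketch of how the template above could be filled in from per-scenario results. The types and function names are hypothetical. The PASS rule follows the block conditions: no results, any failing scenario, or any unverified recovery yields FAIL.

```go
package main

import (
	"fmt"
	"strings"
)

// scenarioResult is a hypothetical record for one table row.
type scenarioResult struct {
	Component string
	Scenario  string
	Passed    bool
	Recovered bool
}

// summary holds the values shown under "Chaos Testing Summary".
type summary struct {
	Status           string
	Dependencies     int
	Scenarios        int
	RecoveryVerified bool
}

// summarize derives the summary metrics from the scenario rows.
func summarize(results []scenarioResult) summary {
	components := map[string]bool{}
	passed, recovered := len(results) > 0, len(results) > 0
	for _, r := range results {
		components[r.Component] = true
		passed = passed && r.Passed
		recovered = recovered && r.Recovered
	}
	status := "FAIL"
	if passed && recovered {
		status = "PASS"
	}
	return summary{Status: status, Dependencies: len(components),
		Scenarios: len(results), RecoveryVerified: recovered}
}

// choose returns yes when ok is true and no otherwise.
func choose(ok bool, yes, no string) string {
	if ok {
		return yes
	}
	return no
}

// renderReport writes the three required sections in order.
func renderReport(results []scenarioResult, iterations int) string {
	s := summarize(results)
	var b strings.Builder
	b.WriteString("## Chaos Testing Summary\n")
	fmt.Fprintf(&b, "**Status:** %s\n", s.Status)
	fmt.Fprintf(&b, "**Dependencies Tested:** %d\n", s.Dependencies)
	fmt.Fprintf(&b, "**Scenarios Tested:** %d\n", s.Scenarios)
	fmt.Fprintf(&b, "**Recovery Verified:** %s\n\n", choose(s.RecoveryVerified, "Yes", "No"))
	b.WriteString("## Failure Scenarios\n")
	b.WriteString("| Component | Scenario | Status | Recovery |\n")
	b.WriteString("|-----------|----------|--------|----------|\n")
	for _, r := range results {
		fmt.Fprintf(&b, "| %s | %s | %s | %s |\n", r.Component, r.Scenario,
			choose(r.Passed, "PASS", "FAIL"), choose(r.Recovered, "Yes", "No"))
	}
	b.WriteString("\n## Handoff to Next Gate\n")
	fmt.Fprintf(&b, "- Ready for Gate 8 (Code Review): %s\n", choose(s.Status == "PASS", "YES", "NO"))
	fmt.Fprintf(&b, "- Iterations: %d\n", iterations)
	return b.String()
}
```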

---

## Failure Scenarios by Dependency

| Dependency | Required Scenarios |
|------------|-------------------|
| PostgreSQL | Connection Loss, High Latency, Network Partition |
| MongoDB | Connection Loss, High Latency, Network Partition |
| Valkey | Connection Loss, High Latency, Timeout |
| Redis | Connection Loss, High Latency, Timeout |
| RabbitMQ | Connection Loss, Network Partition, Slow Consumer |
| HTTP APIs | Timeout, 5xx Errors, Connection Refused |
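
The table above can be encoded as data and used to flag coverage gaps, since any dependency without a failure test fails the gate. This is a minimal sketch with hypothetical names; the lowercase keys (including `http` for HTTP APIs) and the handling of unknown dependencies are assumptions.

```go
package main

import "sort"

// requiredScenarios mirrors the "Failure Scenarios by Dependency" table.
// Keys are assumed lowercase identifiers; "http" stands for HTTP APIs.
var requiredScenarios = map[string][]string{
	"postgres": {"Connection Loss", "High Latency", "Network Partition"},
	"mongodb":  {"Connection Loss", "High Latency", "Network Partition"},
	"valkey":   {"Connection Loss", "High Latency", "Timeout"},
	"redis":    {"Connection Loss", "High Latency", "Timeout"},
	"rabbitmq": {"Connection Loss", "Network Partition", "Slow Consumer"},
	"http":     {"Timeout", "5xx Errors", "Connection Refused"},
}

// missingScenarios returns one "dependency: scenario" entry for every
// required scenario that does not appear in tested, sorted for stability.
func missingScenarios(deps []string, tested map[string][]string) []string {
	var missing []string
	for _, dep := range deps {
		required, known := requiredScenarios[dep]
		if !known {
			// Assumption: an unlisted dependency is reported, not ignored.
			missing = append(missing, dep+": unknown dependency")
			continue
		}
		done := map[string]bool{}
		for _, s := range tested[dep] {
			done[s] = true
		}
		for _, s := range required {
			if !done[s] {
				missing = append(missing, dep+": "+s)
			}
		}
	}
	sort.Strings(missing)
	return missing
}
```

An empty result means every listed dependency has all of its required scenarios covered.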

---

## Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "Infrastructure is reliable" | AWS, GCP, Azure all have outages. Your code must handle them. | **Write chaos tests** |
| "Integration tests cover failures" | Integration tests verify happy path. Chaos tests verify failure handling. | **Write chaos tests** |
| "Toxiproxy is complex" | One container. 20 minutes setup. Prevents production incidents. | **Write chaos tests** |
| "We have monitoring" | Monitoring detects problems. Chaos testing prevents them. | **Write chaos tests** |
| "Circuit breakers handle it" | Circuit breakers need testing too. Chaos tests verify they work. | **Write chaos tests** |

---

Overview

This skill orchestrates Gate 7 chaos testing for a development unit, ensuring services degrade gracefully under connection loss, latency, and partitions using Toxiproxy. It enforces that chaos tests exist for every external dependency and that recovery is verified. The skill gathers requirements, dispatches a QA Analyst Agent in chaos mode, and tracks iterations until pass or escalation.

How this skill works

When invoked it validates required input (unit_id, language, external_dependencies) and auto-detects common external services if dependencies are not provided. It dispatches a QA Analyst Agent with a concrete Toxiproxy-based test brief, parses the agent output for PASS/FAIL, and runs up to three remediation iterations before escalating. Final output summarizes scenarios, dependency coverage, and readiness for the next gate.

When to use it

- After integration testing (Gate 6) and before code review (Gate 8), to verify graceful degradation
- When external dependencies (Postgres, MongoDB, Redis, RabbitMQ, Valkey, HTTP APIs) are present
- If Gate 6 passed but failure-handling tests are not yet implemented
- When you need automated enforcement that chaos tests and recovery verification exist

Best practices

- Auto-detect external dependencies from docker-compose and go/package manifests when not supplied
- Require Toxiproxy setup under tests/utils/chaos/ and use descriptive test names (TestIntegration_Chaos_{Component}_{Scenario})
- Use the dual-gate pattern (CHAOS=1 env var + testing.Short()) to control chaos test execution
- Cover five phases in each test: Normal → Inject → Verify → Restore → Recovery
- Verify recovery behavior explicitly; tests must assert service remains available or recovers cleanly
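
The five phases can be sketched as a small driver. Everything here is hypothetical scaffolding: `faultInjector` abstracts whatever injects the fault (in a real test it would presumably wrap a Toxiproxy proxy), and `fakeLink` is an in-memory stand-in that exists only to exercise the flow.

```go
package main

import (
	"errors"
	"fmt"
)

// faultInjector is a hypothetical abstraction over fault injection.
type faultInjector interface {
	Inject() error
	Restore() error
}

// runFivePhases drives Normal → Inject → Verify → Restore → Recovery.
// call is the operation under test; graceful decides whether the error
// seen during the fault counts as graceful degradation.
func runFivePhases(f faultInjector, call func() error, graceful func(error) bool) error {
	if err := call(); err != nil { // 1. Normal
		return fmt.Errorf("normal phase: %w", err)
	}
	if err := f.Inject(); err != nil { // 2. Inject
		return fmt.Errorf("inject phase: %w", err)
	}
	degraded := call() // 3. Verify (result judged after restoring)
	if err := f.Restore(); err != nil { // 4. Restore, even if verify fails
		return fmt.Errorf("restore phase: %w", err)
	}
	if !graceful(degraded) {
		return fmt.Errorf("verify phase: unexpected result: %v", degraded)
	}
	if err := call(); err != nil { // 5. Recovery
		return fmt.Errorf("recovery phase: %w", err)
	}
	return nil
}

// errUnavailable is the error the fake returns while the fault is active.
var errUnavailable = errors.New("dependency unavailable")

// fakeLink simulates a dependency that fails while a fault is injected.
type fakeLink struct{ down bool }

func (l *fakeLink) Inject() error  { l.down = true; return nil }
func (l *fakeLink) Restore() error { l.down = false; return nil }

// call simulates a request to the dependency.
func (l *fakeLink) call() error {
	if l.down {
		return errUnavailable
	}
	return nil
}
```

Restoring before judging the degraded result means a failed verification never leaves the fault active for later tests.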

Example use cases

- Detecting how the app reacts when Postgres connections drop and verifying reconnection logic
- Injecting high latency to MongoDB to confirm request timeouts and fallback behavior
- Partitioning RabbitMQ to validate message handling and slow-consumer detection
- Simulating HTTP upstream 5xx and connection refused to ensure circuit breakers and retries work
- Adding Toxiproxy-based chaos tests as a mandatory step in CI for a Go microservice

FAQ

What if external_dependencies is not provided?

The skill auto-scans docker-compose and language manifests to detect common services and sets external_dependencies; it will stop with an error if none are found.

Who runs the actual chaos tests?

This skill orchestrates and dispatches a QA Analyst Agent in CHAOS mode; the QA agent implements Toxiproxy setup and test code.