
chaos-engineering-setup skill

/.claude/skills/chaos-engineering-setup

This skill helps you implement chaos engineering practices and tools for resilience testing across distributed systems.

npx playbooks add skill dexploarer/hyper-forge --skill chaos-engineering-setup

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
1.1 KB
---
name: chaos-engineering-setup
description: Implement chaos engineering practices and tools for resilience testing
allowed-tools: [Read, Write, Edit, Bash, Grep, Glob]
---

# chaos engineering setup

Implement chaos engineering practices and tools for resilience testing

## When to Use

This skill activates when you need to design, run, and analyze controlled failure experiments that validate how services, infrastructure, and workflows behave under faults.

## Quick Example

```yaml
# Configuration example for chaos-engineering-setup
# See full documentation in the skill implementation
```
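
For illustration, a concrete experiment manifest might look like the sketch below. It assumes Chaos Mesh as the fault-injection tool; the namespace and label values (`chaos-testing`, `asset-generation`, `asset-worker`) are placeholders, not part of this skill.

```yaml
# Hypothetical Chaos Mesh manifest: kill one asset-generation worker pod
# to check that queued jobs are retried and no requests are lost.
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: asset-worker-pod-kill
  namespace: chaos-testing
spec:
  action: pod-kill
  mode: one                 # affect a single pod to keep the blast radius small
  selector:
    namespaces:
      - asset-generation
    labelSelectors:
      app: asset-worker
```

Applying the manifest with `kubectl apply -f` triggers a single kill; recovery behaviour is then observed through the platform's normal monitoring.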

## Best Practices

- ✅ Follow industry standards
- ✅ Document all configurations
- ✅ Test thoroughly before production
- ✅ Monitor and alert appropriately
- ✅ Regular maintenance and updates

## Related Skills

- `microservices-orchestrator`
- `compliance-auditor`
- Use `enterprise-architect` agent for design consultation

## Implementation Guide

[Detailed implementation steps would go here in production]

This skill provides comprehensive guidance for implementing chaos engineering practices and tools for resilience testing.

Overview

This skill implements chaos engineering practices and tooling to validate and improve system resilience for an AI-powered 3D asset generation platform. It provides practical steps to design, run, and analyze failure experiments that target services, infrastructure, and workflows. The goal is to reduce downtime, surface weak dependencies, and verify recovery procedures before incidents reach production.

How this skill works

The skill defines experiment templates, integrates chaos tooling with TypeScript-based services and CI/CD pipelines, and connects experiments to observability and alerting. It automates controlled fault injection (CPU, memory, network, GPU/node failures, storage, and dependency outages), measures SLO impacts, and enforces blast-radius and safety gates. Results feed back into runbooks, monitoring dashboards, and automated rollback or remediation hooks.
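
As a hedged sketch of the controlled fault injection described above, a CPU/memory stress experiment scoped to a single rendering worker could look like the following if Chaos Mesh were the injection tool; the selectors, load levels, and duration are illustrative assumptions.

```yaml
# Hypothetical Chaos Mesh manifest: stress CPU and memory on one rendering
# worker pod to observe autoscaling and graceful degradation.
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: rendering-worker-stress
  namespace: chaos-testing
spec:
  mode: one                  # single pod only, keeping the blast radius small
  selector:
    labelSelectors:
      app: rendering-worker
  stressors:
    cpu:
      workers: 2             # number of CPU stress workers
      load: 80               # approximate CPU load percentage per worker
    memory:
      workers: 1
      size: "256MB"          # memory consumed by the memory stressor
  duration: "10m"            # the controller reverts the fault after 10 minutes
```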

When to use it

  • Before major releases or architecture changes to validate resilience assumptions
  • When onboarding new infrastructure (GPU clusters, object stores, message queues)
  • To verify incident response playbooks and automated recovery paths
  • When reducing mean time to recovery (MTTR) or validating SLOs
  • During periodic resilience audits or compliance-driven testing

Best practices

  • Define clear hypotheses, metrics, and success criteria for every experiment
  • Start small with a limited blast radius and progressively increase scope (see the scheduling sketch after this list)
  • Document configurations, experiment histories, and outcomes centrally
  • Integrate experiments into CI/CD with safety gates and canary windows
  • Ensure observability (traces, metrics, logs) and alerting are reliable before running tests
  • Automate remediation and rollback paths; rehearse runbooks regularly
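
As a concrete illustration of the blast-radius and scheduling practices above, the sketch below assumes Chaos Mesh's Schedule resource; the cron window, percentage, and selectors are illustrative assumptions rather than prescribed values.

```yaml
# Hypothetical scheduled experiment: inject latency into ~10% of asset-service
# pods during a nightly low-traffic window, never overlapping runs.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-asset-service-latency
  namespace: chaos-testing
spec:
  schedule: "0 3 * * *"        # low-traffic window (03:00 cluster time)
  concurrencyPolicy: Forbid    # do not start a run while another is active
  historyLimit: 5
  type: NetworkChaos
  networkChaos:
    action: delay
    mode: fixed-percent
    value: "10"                # start with ~10% of matching pods, widen later
    selector:
      labelSelectors:
        app: asset-service
    delay:
      latency: "200ms"
      jitter: "50ms"
    duration: "5m"
```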

Example use cases

  • Inject GPU node loss during a model training job to verify job resumption and checkpointing
  • Introduce network latency between asset generation service and storage to validate timeouts and retries
  • Simulate a message broker outage to ensure job queue persistence and backpressure handling (see the partition sketch after this list)
  • Throttle CPU or memory on rendering workers to observe autoscaling and graceful degradation
  • Run dependency chaos against an external model hosting API to test fallback models and degradation UX
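
For the message-broker case above, one hedged way to express the outage is a network partition between workers and the broker. Chaos Mesh's NetworkChaos is assumed, and all names are placeholders.

```yaml
# Hypothetical partition: make the message broker unreachable from asset
# workers for three minutes to verify queue persistence and backpressure.
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: broker-partition
  namespace: chaos-testing
spec:
  action: partition
  mode: all                    # narrow to mode: one or fixed-percent for a first run
  selector:
    labelSelectors:
      app: asset-worker
  direction: both              # drop traffic in both directions
  target:
    mode: all
    selector:
      labelSelectors:
        app: message-broker
  duration: "3m"
```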

FAQ

How do you keep experiments safe for production?

Enforce blast-radius limits, require approvals, run during low-traffic windows, use canaries, and configure automated abort triggers that fire on SLO breaches or critical alerts.
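
As one possible shape for such an abort trigger, a Prometheus alerting rule on a latency SLO can page or drive automation that pauses the experiment; the metric name, threshold, and labels below are illustrative assumptions rather than a fixed convention.

```yaml
# Hypothetical Prometheus rule: fire when p99 latency breaches the SLO so that
# on-call automation (or a human) can pause or delete the running experiment.
groups:
  - name: chaos-abort-conditions
    rules:
      - alert: ChaosExperimentSLOBreach
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="asset-api"}[5m])) by (le)
          ) > 0.5
        for: 2m
        labels:
          severity: critical
          action: abort-chaos
        annotations:
          summary: "p99 latency above SLO during a chaos experiment; abort and roll back"
```

Chaos Mesh, for instance, can pause a running experiment by annotating it with experiment.chaos-mesh.org/pause, which alert-driven automation could apply as the abort action.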

What happens if an experiment causes a real outage?

Predefine rollback and remediation actions, integrate automatic rollback hooks into experiments, and maintain detailed runbooks and postmortems to prevent recurrence.