This skill helps you implement chaos engineering practices and tools for resilience testing across distributed systems.
To add this skill to your agents, run `npx playbooks add skill dexploarer/hyper-forge --skill chaos-engineering-setup`.
---
name: chaos-engineering-setup
description: Implement chaos engineering practices and tools for resilience testing
allowed-tools: [Read, Write, Edit, Bash, Grep, Glob]
---
# Chaos Engineering Setup
Implement chaos engineering practices and tools for resilience testing.
## When to Use
This skill activates when you need to design, run, or analyze controlled failure experiments against services, infrastructure, or workflows in order to validate and improve system resilience.
## Quick Example
A minimal, tool-agnostic experiment definition; the field names below are illustrative and should be adapted to your chaos tooling:
```yaml
# Illustrative chaos experiment config (not a specific tool's schema)
experiment:
  target: { service: asset-generation-api, scope: single-replica }  # blast-radius limit
  fault: { type: network-delay, latency: 300ms, duration: 5m }
  abort_when: { slo_breach: true, critical_alert: true }
```
## Best Practices
- ✅ Start every experiment from a steady-state hypothesis and a measurable success criterion
- ✅ Limit blast radius: begin in staging, then target a single replica or canary in production
- ✅ Document every experiment, its configuration, and its results
- ✅ Monitor during experiments and wire automated abort triggers to SLO breaches and critical alerts
- ✅ Re-run experiments regularly so resilience checks keep pace with system changes
## Related Skills
- `microservices-orchestrator`
- `compliance-auditor`
- Use `enterprise-architect` agent for design consultation
## Implementation Guide
This skill implements chaos engineering practices and tooling to validate and improve system resilience for an AI-powered 3D asset generation platform. It provides practical steps to design, run, and analyze failure experiments that target services, infrastructure, and workflows. The goal is to reduce downtime, surface weak dependencies, and verify recovery procedures before incidents reach production.
The skill defines experiment templates, integrates chaos tooling with TypeScript-based services and CI/CD pipelines, and connects experiments to observability and alerting. It automates controlled fault injection (CPU, memory, network, GPU/node failures, storage, and dependency outages), measures SLO impacts, and enforces blast-radius and safety gates. Results feed back into runbooks, monitoring dashboards, and automated rollback or remediation hooks.
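As a concrete illustration, the sketch below shows how a TypeScript experiment runner might enforce a blast-radius limit and an SLO-based abort gate. All names here (`Experiment`, `injectFault`, `querySloBurnRate`) are hypothetical placeholders to be wired to your chaos tooling and observability stack, not the API of any specific tool.

```typescript
// Hypothetical types and helpers: a sketch of the experiment loop, not a real chaos tool's API.
type FaultType = "cpu" | "memory" | "network-delay" | "node-kill" | "dependency-outage";

interface Experiment {
  name: string;
  target: { service: string; maxAffectedReplicas: number }; // blast-radius limit
  fault: { type: FaultType; durationMs: number };
  abortWhen: { sloBurnRateAbove: number };                  // safety gate
}

// Placeholder integrations: wire these to your chaos tool and metrics backend.
async function injectFault(exp: Experiment): Promise<() => Promise<void>> {
  console.log(`Injecting ${exp.fault.type} into ${exp.target.service}`);
  return async () => console.log(`Reverted ${exp.fault.type}`); // returns a revert handle
}

async function querySloBurnRate(_service: string): Promise<number> {
  return 0.4; // stub: replace with a query against your observability stack
}

export async function runExperiment(exp: Experiment): Promise<"completed" | "aborted"> {
  if (exp.target.maxAffectedReplicas > 1) {
    throw new Error("Blast radius too large: start with a single replica");
  }
  const revert = await injectFault(exp);
  const deadline = Date.now() + exp.fault.durationMs;
  try {
    while (Date.now() < deadline) {
      // Abort automatically when the SLO burn rate crosses the safety gate.
      if ((await querySloBurnRate(exp.target.service)) > exp.abortWhen.sloBurnRateAbove) {
        return "aborted";
      }
      await new Promise((resolve) => setTimeout(resolve, 5_000)); // poll every 5 seconds
    }
    return "completed";
  } finally {
    await revert(); // always remove the injected fault, even on abort or error
  }
}
```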
## FAQ

**How do you keep experiments safe for production?**

Enforce blast-radius limits, require approvals, run during low-traffic windows, use canaries, and ensure automated abort triggers fire on SLO breaches or critical alerts.
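As a sketch, a pre-flight gate along these lines could encode those checks before any fault is injected; every input and threshold below is an illustrative assumption rather than a prescribed policy.

```typescript
// Illustrative pre-flight safety gate; adapt the checks and thresholds to your own policy.
interface GateInput {
  approvedBy: string[];            // required human approvals
  currentRps: number;              // live traffic on the target service
  lowTrafficRpsThreshold: number;  // only run below this load
  canaryHealthy: boolean;          // canary must be green before widening scope
  openCriticalAlerts: number;      // do not start if anything is already firing
}

export function preflightGate(input: GateInput): { allowed: boolean; reasons: string[] } {
  const reasons: string[] = [];
  if (input.approvedBy.length < 2) reasons.push("needs at least two approvals");
  if (input.currentRps > input.lowTrafficRpsThreshold) reasons.push("traffic above low-traffic window");
  if (!input.canaryHealthy) reasons.push("canary is unhealthy");
  if (input.openCriticalAlerts > 0) reasons.push("critical alerts already firing");
  return { allowed: reasons.length === 0, reasons };
}
```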
**What happens if an experiment causes a real outage?**

Predefine rollback and remediation actions, integrate automatic rollback hooks into experiments, and maintain detailed runbooks and postmortems to prevent recurrence.
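A minimal sketch of per-service remediation hooks, assuming hooks are registered alongside each experiment; the service name and hook bodies are hypothetical placeholders for your real rollback, scaling, and paging integrations.

```typescript
// Sketch of remediation hooks attached per service; replace the bodies with real integrations.
type RemediationHook = () => Promise<void>;

const remediationHooks: Record<string, RemediationHook[]> = {
  "asset-generation-api": [
    async () => console.log("Rolling back to the last known-good deployment"),
    async () => console.log("Scaling replicas back to baseline"),
    async () => console.log("Paging the on-call owner with the runbook link"),
  ],
};

// Run hooks in order; continue past failures so later remediation still executes.
export async function remediate(service: string): Promise<void> {
  for (const hook of remediationHooks[service] ?? []) {
    try {
      await hook();
    } catch (err) {
      console.error("Remediation hook failed", err);
    }
  }
}
```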