Install: `npx playbooks add skill 404kidwiz/claude-supercode-skills --skill error-coordinator-skill`
---
name: error-coordinator
description: Expert in making multi-agent systems resilient. Specializes in detecting loops, hallucinations, and failures, and implementing self-healing workflows. Use when designing error handling for agent systems, implementing retry strategies, or building resilient AI workflows.
---
# Error Coordinator
## Purpose
Provides expertise in building resilient multi-agent systems with robust error handling, failure detection, and recovery mechanisms. Covers loop detection, hallucination mitigation, and self-healing agent workflows.
## When to Use
- Designing error handling for agent systems
- Implementing retry and recovery strategies
- Building self-healing AI workflows
- Detecting agent loops and infinite recursion
- Mitigating hallucinations in agent outputs
- Implementing circuit breakers for agents
- Coordinating failure recovery across agents
## Quick Start
**Invoke this skill when:**
- Designing error handling for agent systems
- Implementing retry and recovery strategies
- Building self-healing AI workflows
- Detecting agent loops and infinite recursion
- Coordinating failure recovery across agents
**Do NOT invoke when:**
- Organizing agent teams (use agent-organizer)
- Debugging application errors (use debugger)
- Handling production incidents (use incident-responder)
- Detecting code error patterns (use error-detective)
## Decision Framework
```
Error Type Handling:
├── Transient failure → Retry with backoff
├── Rate limiting → Backoff + queue
├── Invalid output → Validation + retry with feedback
├── Loop detected → Break + escalate
├── Hallucination → Ground with context, retry
├── Agent timeout → Cancel + fallback
└── Cascading failure → Circuit breaker
Recovery Strategy:
├── Idempotent operation → Simple retry
├── Stateful operation → Checkpoint + resume
├── Critical path → Fallback agent
└── Best effort → Log + continue
```
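
The decision tree above can be read as a lookup table. The following is a minimal sketch of that idea, not part of the skill itself: the `ErrorType` enum, the strategy labels, and `choose_handling` are hypothetical names chosen for illustration.

```python
from enum import Enum


class ErrorType(Enum):
    """Hypothetical error categories mirroring the decision tree."""
    TRANSIENT = "transient failure"
    RATE_LIMIT = "rate limiting"
    INVALID_OUTPUT = "invalid output"
    LOOP = "loop detected"
    HALLUCINATION = "hallucination"
    TIMEOUT = "agent timeout"
    CASCADING = "cascading failure"


# Strategy labels are illustrative strings; a real system would map
# each error type to a handler function instead.
HANDLING = {
    ErrorType.TRANSIENT: "retry_with_backoff",
    ErrorType.RATE_LIMIT: "backoff_and_queue",
    ErrorType.INVALID_OUTPUT: "validate_and_retry_with_feedback",
    ErrorType.LOOP: "break_and_escalate",
    ErrorType.HALLUCINATION: "ground_with_context_and_retry",
    ErrorType.TIMEOUT: "cancel_and_fallback",
    ErrorType.CASCADING: "circuit_breaker",
}


def choose_handling(error_type: ErrorType) -> str:
    """Return the handling strategy label for a classified error."""
    return HANDLING[error_type]
```

Keeping the mapping in one table makes it easy to audit that every error type has an explicit response and that none falls through silently.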
## Core Workflows
### 1. Loop Detection System
1. Track agent invocation history
2. Detect repeated state patterns
3. Set maximum iteration limits
4. Implement escape hatch triggers
5. Log loop occurrences for analysis
6. Escalate to supervisor or human
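
The steps above can be sketched as a small detector that fingerprints each invocation and raises when a state repeats too often or the iteration budget runs out. This is an illustrative sketch under assumptions: `LoopDetector`, `LoopDetected`, and the default limits are hypothetical, and the agent state is assumed to be JSON-serializable.

```python
import hashlib
import json
import logging
from collections import Counter

logger = logging.getLogger("loop_detector")


class LoopDetected(Exception):
    """Raised when an agent appears stuck; the caller should escalate."""


class LoopDetector:
    def __init__(self, max_iterations=25, max_repeats=3):
        self.max_iterations = max_iterations  # hard cap on recorded calls
        self.max_repeats = max_repeats        # identical states tolerated
        self.history = []                     # ordered fingerprints
        self._seen = Counter()                # fingerprint -> occurrences

    @staticmethod
    def _fingerprint(agent, state):
        # Stable hash of (agent, state); sort_keys makes dict order irrelevant.
        payload = json.dumps({"agent": agent, "state": state},
                             sort_keys=True, default=str)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def record(self, agent, state):
        """Record one invocation; raise LoopDetected as the escape hatch."""
        fp = self._fingerprint(agent, state)
        self.history.append(fp)
        self._seen[fp] += 1
        if self._seen[fp] >= self.max_repeats:
            logger.warning("loop: %s repeated a state %d times",
                           agent, self._seen[fp])
            raise LoopDetected(f"{agent} repeated the same state")
        if len(self.history) > self.max_iterations:
            logger.warning("loop: iteration limit %d exceeded",
                           self.max_iterations)
            raise LoopDetected("maximum iterations exceeded")
```

The orchestrator would call `record` before each agent step and treat `LoopDetected` as the signal to stop and hand off to a supervisor or human.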
### 2. Hallucination Mitigation
1. Ground responses with source data
2. Implement output validation
3. Cross-check with retrieval
4. Add confidence scoring
5. Flag low-confidence outputs
6. Provide feedback for retry
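
As a rough illustration of steps 2 through 6, the sketch below scores each claim by lexical overlap with the supplied sources and builds retry feedback for unsupported claims. Token overlap is a deliberately naive stand-in for real grounding (retrieval, entailment checks, or similar); `support_score`, `review_output`, and the `0.6` threshold are assumptions for the example only.

```python
import re


def _tokens(text):
    """Lowercase alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def support_score(claim, sources):
    """Fraction of the claim's tokens found in the best-matching source."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return max(
        (len(claim_tokens & _tokens(src)) / len(claim_tokens) for src in sources),
        default=0.0,
    )


def review_output(claims, sources, threshold=0.6):
    """Flag low-confidence claims and prepare feedback for a retry."""
    flagged = [c for c in claims if support_score(c, sources) < threshold]
    feedback = None
    if flagged:
        feedback = (
            "These statements are not supported by the provided sources; "
            "revise them or cite evidence:\n"
            + "\n".join(f"- {c}" for c in flagged)
        )
    return {"accepted": not flagged, "flagged": flagged, "feedback": feedback}
```

The returned `feedback` string can be appended to the next prompt so the retry is informed rather than blind.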
### 3. Circuit Breaker Implementation
1. Track failure rates per agent
2. Define failure threshold
3. Open circuit on threshold breach
4. Provide fallback behavior
5. Implement half-open state for testing
6. Close circuit on recovery
7. Monitor and alert on breaker state
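
A minimal circuit breaker covering the closed, open, and half-open states might look like the sketch below. It is an assumption-laden simplification: it counts consecutive failures rather than a failure rate, it is not thread-safe, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are hypothetical. Monitoring and alerting (step 7) are left out.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is open and no fallback was provided."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable for tests
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback=None):
        """Run fn() through the breaker, using fallback() when it cannot run."""
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                # Still open: skip the agent call entirely.
                if fallback is None:
                    raise CircuitOpenError("circuit is open")
                return fallback()
            self.state = "half_open"  # timeout elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            if fallback is None:
                raise
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.state = "closed"
        return result
```

One breaker instance per agent keeps a single failing agent from tripping calls to healthy ones.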
## Best Practices
- Implement timeouts for all agent calls
- Use exponential backoff with jitter
- Log all failures with full context
- Design for graceful degradation
- Test failure scenarios explicitly
- Monitor error rates and patterns
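
The first three practices can be combined in one helper. The sketch below assumes an asyncio-based agent call; `call_with_retry`, `backoff_delay`, and their default values are hypothetical, and only timeouts and `ConnectionError` are treated as transient here.

```python
import asyncio
import logging
import random

logger = logging.getLogger("agent_retry")


def backoff_delay(attempt, base_delay=0.5, max_delay=8.0):
    """Exponential backoff with full jitter: uniform in [0, capped delay]."""
    return random.uniform(0, min(max_delay, base_delay * 2 ** attempt))


async def call_with_retry(make_call, *, attempts=4, timeout=10.0,
                          base_delay=0.5, max_delay=8.0, sleep=asyncio.sleep):
    """Await make_call() with a per-attempt timeout and bounded retries.

    make_call is a zero-argument function returning a fresh awaitable.
    """
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            # Log every failure with enough context to diagnose it later.
            logger.warning("attempt %d/%d failed: %r", attempt + 1, attempts, exc)
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the error
            await sleep(backoff_delay(attempt, base_delay, max_delay))
```

The jitter spreads retries from many callers over time, which is what prevents synchronized retry bursts against a recovering service.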
## Anti-Patterns
| Anti-Pattern | Problem | Correct Approach |
|--------------|---------|------------------|
| Infinite retries | Resource exhaustion | Max retry limits |
| Silent failures | Hidden problems | Log and alert |
| No timeouts | Hung processes | Always set timeouts |
| Fixed retry interval | Thundering herd | Exponential backoff with jitter |
| No fallbacks | Complete failure | Graceful degradation |
## Overview
This skill provides actionable guidance for making multi-agent systems resilient: detecting loops, hallucinations, and operational failures, and implementing self-healing workflows. It focuses on practical error-handling patterns (retries, circuit breakers, validation, and escalation) so that agent networks keep operating reliably. Use it to design, audit, or implement recovery strategies across agent interactions.
## How It Works
The skill inspects agent invocation histories, output validation signals, and telemetry to detect failure modes such as infinite loops, inconsistent outputs (hallucinations), timeouts, and cascading failures. It prescribes concrete responses: retries with backoff, grounding and cross-checks for hallucinations, circuit breaker state transitions, and explicit escape hatches for interrupting loops. It also defines logging, escalation, and checkpointing patterns that enable automated or human-led recovery.
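
The checkpointing pattern mentioned above can be sketched as follows. This is one possible shape, assuming each step is a named function over a JSON-serializable state dictionary; `run_with_checkpoints` and the checkpoint file layout are hypothetical.

```python
import json
import os
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, fn) steps in order, skipping steps already checkpointed.

    Each fn takes the current state dict and returns the new state dict.
    """
    path = Path(checkpoint_path)
    if path.exists():
        saved = json.loads(path.read_text())  # resume from the last checkpoint
    else:
        saved = {"done": [], "state": {}}
    done, state = saved["done"], saved["state"]

    for name, fn in steps:
        if name in done:
            continue  # completed in an earlier run
        state = fn(state)  # an exception here leaves the last checkpoint intact
        done.append(name)
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp = Path(str(path) + ".tmp")
        tmp.write_text(json.dumps({"done": done, "state": state}))
        os.replace(tmp, path)
    return state
```

After a failure, rerunning the same call with the same checkpoint path resumes at the first step that did not complete, which is what makes stateful operations safe to retry.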
## FAQ
**How do I detect a loop reliably in a distributed agent system?**
Track invocation history and key state hashes across agents, set repetition thresholds and maximum iteration counts, then trigger an escape hatch and escalate when thresholds are exceeded.
**When should I open a circuit breaker versus retrying?**
Retry with backoff for transient failures; open the breaker when failure rates exceed a defined threshold to prevent cascading impact and switch to a fallback behavior.
**How can I reduce hallucinations without blocking agent creativity?**
Add lightweight grounding: validate critical facts against trusted sources, attach confidence scores, flag low-confidence outputs for review, and allow creative responses when not mission-critical.