
error-coordinator-skill skill


This skill helps design resilient multi-agent systems by detecting loops, mitigating hallucinations, and coordinating self-healing workflows with robust error

npx playbooks add skill 404kidwiz/claude-supercode-skills --skill error-coordinator-skill

SKILL.md


---
name: error-coordinator
description: Expert in making multi-agent systems resilient. Specializes in detecting loops, hallucinations, and failures, and implementing self-healing workflows. Use when designing error handling for agent systems, implementing retry strategies, or building resilient AI workflows.
---

# Error Coordinator

## Purpose
Provides expertise in building resilient multi-agent systems with robust error handling, failure detection, and recovery mechanisms. Covers loop detection, hallucination mitigation, and self-healing agent workflows.

## When to Use
- Designing error handling for agent systems
- Implementing retry and recovery strategies
- Building self-healing AI workflows
- Detecting agent loops and infinite recursion
- Mitigating hallucinations in agent outputs
- Implementing circuit breakers for agents
- Coordinating failure recovery across agents

## Quick Start
**Invoke this skill when:**
- Designing error handling for agent systems
- Implementing retry and recovery strategies
- Building self-healing AI workflows
- Detecting agent loops and infinite recursion
- Coordinating failure recovery across agents

**Do NOT invoke when:**
- Organizing agent teams (use agent-organizer)
- Debugging application errors (use debugger)
- Handling production incidents (use incident-responder)
- Detecting code error patterns (use error-detective)

## Decision Framework
```
Error Type Handling:
├── Transient failure → Retry with backoff
├── Rate limiting → Backoff + queue
├── Invalid output → Validation + retry with feedback
├── Loop detected → Break + escalate
├── Hallucination → Ground with context, retry
├── Agent timeout → Cancel + fallback
└── Cascading failure → Circuit breaker

Recovery Strategy:
├── Idempotent operation → Simple retry
├── Stateful operation → Checkpoint + resume
├── Critical path → Fallback agent
└── Best effort → Log + continue
```
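
As a rough sketch, the "Error Type Handling" tree above could be encoded as a lookup table so that an orchestrator picks a response in one place. The `ErrorType` enum, the action labels, and `choose_action` are hypothetical names introduced for illustration; they are not part of the skill.

```python
from enum import Enum, auto


class ErrorType(Enum):
    TRANSIENT = auto()
    RATE_LIMITED = auto()
    INVALID_OUTPUT = auto()
    LOOP_DETECTED = auto()
    HALLUCINATION = auto()
    AGENT_TIMEOUT = auto()
    CASCADING_FAILURE = auto()


# One entry per branch of the tree; the action strings are illustrative labels.
HANDLING = {
    ErrorType.TRANSIENT: "retry_with_backoff",
    ErrorType.RATE_LIMITED: "backoff_and_queue",
    ErrorType.INVALID_OUTPUT: "validate_and_retry_with_feedback",
    ErrorType.LOOP_DETECTED: "break_and_escalate",
    ErrorType.HALLUCINATION: "ground_with_context_and_retry",
    ErrorType.AGENT_TIMEOUT: "cancel_and_fallback",
    ErrorType.CASCADING_FAILURE: "open_circuit_breaker",
}


def choose_action(error_type: ErrorType) -> str:
    """Return the handling action label for a classified error."""
    return HANDLING[error_type]
```

Keeping the mapping in data rather than in nested conditionals makes it easier to audit that every error type has a defined response.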

## Core Workflows

### 1. Loop Detection System
1. Track agent invocation history
2. Detect repeated state patterns
3. Set maximum iteration limits
4. Implement escape hatch triggers
5. Log loop occurrences for analysis
6. Escalate to supervisor or human
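
The steps above can be sketched as a small detector that hashes each agent's state and raises once a state repeats too often or the run exceeds an iteration cap. This is a minimal sketch: `LoopDetector`, `LoopDetected`, and the default limits are assumptions chosen for illustration, and it assumes the agent state can be serialized to JSON.

```python
import hashlib
import json
from collections import Counter


class LoopDetected(Exception):
    """Raised when a run should be interrupted and escalated."""


class LoopDetector:
    def __init__(self, max_iterations=25, max_repeats=3):
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.history = []      # (agent, state hash) for every invocation
        self.seen = Counter()  # how often each (agent, state hash) occurred

    def record(self, agent, state):
        """Record one invocation; raise LoopDetected when a limit is hit."""
        digest = hashlib.sha256(
            json.dumps(state, sort_keys=True, default=str).encode()
        ).hexdigest()
        key = (agent, digest)
        self.history.append(key)
        self.seen[key] += 1
        if self.seen[key] >= self.max_repeats:
            raise LoopDetected(f"{agent} repeated the same state {self.seen[key]} times")
        if len(self.history) >= self.max_iterations:
            raise LoopDetected(f"iteration limit of {self.max_iterations} reached")
```

The caller would catch `LoopDetected`, log `detector.history` for later analysis, and hand the task to a supervisor or a human.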

### 2. Hallucination Mitigation
1. Ground responses with source data
2. Implement output validation
3. Cross-check with retrieval
4. Add confidence scoring
5. Flag low-confidence outputs
6. Provide feedback for retry
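
One way to picture the validate-then-retry-with-feedback cycle is shown below. The grounding check is deliberately crude: it treats a claim as supported only if it appears verbatim (case-insensitively) in a source, which is a stand-in for whatever retrieval or entailment check a real system would use. `unsupported_claims`, `generate_grounded`, and the `generate(feedback)` callable are hypothetical names.

```python
from typing import Callable


def unsupported_claims(claims, sources):
    """Return the claims that do not appear verbatim in any source."""
    lowered = [s.lower() for s in sources]
    return [c for c in claims if not any(c.lower() in s for s in lowered)]


def generate_grounded(generate: Callable[[str], list], sources, max_attempts=3):
    """Call generate(feedback) until every claim is grounded or attempts run out.

    Returns (claims, grounded). grounded is False when the final attempt still
    contains unsupported claims, so the caller can flag the output for review.
    """
    feedback = ""
    claims = []
    for _ in range(max_attempts):
        claims = generate(feedback)
        missing = unsupported_claims(claims, sources)
        if not missing:
            return claims, True
        # Tell the agent exactly which claims failed so the retry is targeted.
        feedback = "These claims are not supported by the sources: " + "; ".join(missing)
    return claims, False
```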

### 3. Circuit Breaker Implementation
1. Track failure rates per agent
2. Define failure threshold
3. Open circuit on threshold breach
4. Provide fallback behavior
5. Implement half-open state for testing
6. Close circuit on recovery
7. Monitor and alert on breaker state
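
A minimal sketch of the closed, open, and half-open states follows. It simplifies step 1 by counting consecutive failures rather than a failure rate, it is not thread-safe, and the `CircuitBreaker` class, its parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow a probe call through
        return "open"

    def call(self, agent, fallback):
        """Run agent() unless the circuit is open; use fallback() on failure."""
        if self.state == "open":
            return fallback()
        try:
            result = agent()
        except Exception:
            self.failures += 1
            # A failed half-open probe, or too many failures while closed,
            # opens (or re-opens) the circuit and restarts the reset timer.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Exposing `state` as a property gives monitoring code a single value to export and alert on.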

## Best Practices
- Implement timeouts for all agent calls
- Use exponential backoff with jitter
- Log all failures with full context
- Design for graceful degradation
- Test failure scenarios explicitly
- Monitor error rates and patterns
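
For the backoff practice above, a capped retry loop with "full jitter" might look like the sketch below. The helper names, default delays, and the choice of which exceptions count as retryable are assumptions; a real system would tune them per agent.

```python
import random
import time


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def retry(operation, max_attempts=5, retry_on=(TimeoutError, ConnectionError),
          sleep=time.sleep):
    """Call operation() up to max_attempts times, sleeping between failures."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure, do not hide it
            sleep(backoff_delay(attempt))
```

Randomizing the delay spreads out clients that failed at the same moment, and the attempt cap keeps a persistent outage from turning into an unbounded retry loop.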

## Anti-Patterns
| Anti-Pattern | Problem | Correct Approach |
|--------------|---------|------------------|
| Infinite retries | Resource exhaustion | Max retry limits |
| Silent failures | Hidden problems | Log and alert |
| No timeouts | Hung processes | Always set timeouts |
| Fixed retry interval | Thundering herd | Exponential backoff with jitter |
| No fallbacks | Complete failure | Graceful degradation |

Overview

This skill provides actionable expertise for making multi-agent systems resilient: it covers detecting loops, hallucinations, and operational failures, and implementing self-healing workflows. It focuses on practical error-handling patterns (retries, circuit breakers, validation, and escalation) so that agent networks keep operating reliably. Use it to design, audit, or implement recovery strategies across agent interactions.

How this skill works

The skill inspects agent invocation histories, output validation signals, and telemetry to detect failure modes such as infinite loops, inconsistent outputs (hallucinations), timeouts, and cascading failures. It prescribes concrete responses: retries with backoff, grounding and cross-checks for hallucinations, circuit breaker state transitions, and explicit escape hatches for loop interruptions. It also defines logging, escalation, and checkpointing patterns to enable automated or human-led recovery.

When to use it

- Designing error handling and recovery for a multi-agent workflow
- Implementing retry strategies, exponential backoff, and jitter
- Adding loop detection and escape hatches to prevent infinite recursion
- Mitigating hallucinations by grounding outputs and adding validation
- Creating circuit breakers and fallback agents to contain cascading failures
- Coordinating cross-agent failure recovery and escalation paths

Best practices

- Always set timeouts for agent calls and treat timed-out requests as failures
- Use exponential backoff with jitter and cap retry attempts to avoid resource exhaustion
- Validate outputs with schema checks and source grounding before acting on them
- Track invocation history and state snapshots to detect repeated patterns
- Design graceful degradation and fallback agents for critical paths
- Log failures with full context and monitor error rates and breaker states
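
The timeout and fallback practices can be combined in a few lines with `asyncio.wait_for`, which cancels the awaited call when the deadline passes. This is a sketch under the assumption that agents are exposed as async callables; `call_with_timeout` and its parameters are hypothetical.

```python
import asyncio


async def call_with_timeout(primary, fallback, timeout_s=10.0):
    """Await primary(); if it exceeds timeout_s, await fallback() instead."""
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the primary call at this point.
        return await fallback()
```

A caller would typically also log the timeout with the agent name and request context before the fallback runs.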

Example use cases

- Preventing an assistant network from looping between subagents by detecting repeated state sequences and triggering an escape hatch
- Mitigating hallucinations by cross-checking generated facts against retrieved sources and retrying with corrective feedback
- Implementing per-agent circuit breakers that open on high failure rates and route requests to fallback agents
- Designing retry policies for transient API errors with idempotency checks and checkpoint resume for stateful flows
- Coordinating escalation to human supervisors for unresolved failures after automated recovery attempts
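
To make the checkpoint-and-resume idea for stateful flows concrete, here is a small sketch that records each completed step in a JSON file and skips recorded steps on the next run. The function name, the file-based store, and the `(name, callable)` step format are assumptions; step results must be JSON-serializable, and the file write shown here is not atomic.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, callable) steps in order, skipping steps already recorded."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    for name, step in steps:
        if name in done:
            continue  # completed in an earlier run; do not repeat side effects
        done[name] = step()
        path.write_text(json.dumps(done))  # persist after every completed step
    return done
```

If a step raises, the exception propagates and the checkpoint file still holds every step that finished, so a rerun resumes at the failed step.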

FAQ

How do I detect a loop reliably in a distributed agent system?

Track invocation history and key state hashes across agents, set repetition thresholds and maximum iteration counts, then trigger an escape hatch and escalate when thresholds are exceeded.

When should I open a circuit breaker versus retrying?

Retry with backoff for transient failures; open the breaker when failure rates exceed a defined threshold to prevent cascading impact and switch to a fallback behavior.

How can I reduce hallucinations without blocking agent creativity?

Add lightweight grounding: validate critical facts against trusted sources, attach confidence scores, flag low-confidence outputs for review, and allow creative responses when not mission-critical.
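
A possible shape for that gating rule is sketched below: only mission-critical outputs with a low confidence score are held for review, and everything else passes through. The `AgentOutput` fields, the 0.7 threshold, and the assumption that a confidence score is already available are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class AgentOutput:
    text: str
    confidence: float  # assumed to be in [0, 1], produced by an upstream scorer
    critical: bool     # True when downstream actions depend on factual accuracy


def needs_review(output: AgentOutput, threshold: float = 0.7) -> bool:
    """Flag only critical outputs whose confidence falls below the threshold."""
    return output.critical and output.confidence < threshold
```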