error-handling skill

/templates/.claude/skills/error-handling

This skill helps you implement robust error handling with validation, recovery, and clear feedback to improve system stability.

npx playbooks add skill dasien/claudemultiagenttemplate --skill error-handling

SKILL.md

---
name: "Error Handling Strategies"
description: "Implement robust error handling with proper validation, recovery mechanisms, and clear feedback for system stability"
category: "implementation"
required_tools: ["Read", "Write", "Edit", "Grep"]
---

# Error Handling Strategies

## Purpose
Implement robust error handling that gracefully manages failures, provides clear feedback, and maintains system stability.

## When to Use
- Writing any code that can fail
- Handling external API calls
- Processing user input
- Managing file I/O operations
- Dealing with network requests

## Key Capabilities
1. **Error Detection** - Identify potential failure points
2. **Error Recovery** - Implement fallback strategies
3. **Error Communication** - Provide clear, actionable messages

## Approach
1. Identify what can go wrong (invalid input, network failure, etc.)
2. Validate inputs before processing
3. Use try-catch or error returns appropriately
4. Provide context in error messages
5. Log errors with sufficient debugging information
6. Implement retry logic for transient failures
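
Steps 2 through 4 can be sketched as below. This is a minimal illustration, not part of the skill itself: `ValidationError` and `parse_port` are hypothetical names, and the port range check is just a convenient example of validating input before it is used.

```python
class ValidationError(ValueError):
    """Raised when caller-supplied input fails validation (hypothetical name)."""


def parse_port(raw):
    """Validate a port number supplied as text before it is used."""
    if not isinstance(raw, str):
        # A non-string argument is a bug in the calling code, so fail fast.
        raise TypeError(f"port must be a str, got {type(raw).__name__}")
    text = raw.strip()
    if not (text.isascii() and text.isdigit()):
        # Say what was expected and what was received.
        raise ValidationError(f"port must be a whole number, got {raw!r}")
    port = int(text)
    if not 1 <= port <= 65535:
        raise ValidationError(f"port must be between 1 and 65535, got {port}")
    return port
```

Each message names the field, the expectation, and the offending value, so the caller can correct the input without reading the source.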

## Example
**Context**: File reading operation
````python
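import json
import logging

logger = logging.getLogger(__name__)

class ConfigurationError(Exception):
    """Raised when the config file exists but cannot be used."""

def get_default_config():
    # Placeholder defaults; replace with the application's real defaults.
    return {}

def read_config(filepath):
    try:
        with open(filepath, 'r') as f:
            return json.load(f)
    except FileNotFoundError:
        logger.error(f"Config file not found: {filepath}")
        return get_default_config()
    except json.JSONDecodeError as e:
        logger.error(f"Invalid JSON in {filepath}: {e}")
        # Chain the original exception so the traceback keeps the parse details.
        raise ConfigurationError(f"Config file is malformed at line {e.lineno}") from e
    except PermissionError as e:
        logger.error(f"Cannot read {filepath}: Permission denied")
        raise ConfigurationError(f"Insufficient permissions for {filepath}") from e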
````

## Best Practices
- ✅ Fail fast for programming errors
- ✅ Recover gracefully from external failures
- ✅ Include context in error messages
- ❌ Avoid: Silent failures or generic error messages

Overview

This skill teaches practical error handling strategies to improve reliability and observability in multi-agent systems. It focuses on validation, recovery mechanisms, and clear feedback so failures are managed predictably. The goal is to maintain system stability while giving developers and users actionable information.

How this skill works

The skill inspects code paths for likely failure points, enforces input validation, and applies structured exception handling or error returns. It integrates logging, contextual error messages, and retry/fallback logic for transient faults. For programming errors it recommends fail-fast behavior; for external errors it prescribes graceful recovery and clear user-facing feedback.
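
The split between failing fast and recovering gracefully can be sketched as below. This is an illustrative assumption rather than code from the skill: `load_with_fallback` is a hypothetical helper, and treating `OSError` as the marker of an external failure is a simplification that suits file and socket errors in Python.

```python
import logging

logger = logging.getLogger(__name__)


def load_with_fallback(fetch, fallback, what):
    """Return fetch(); on an external failure, log it and return the fallback."""
    try:
        return fetch()
    except OSError as exc:
        # OSError covers ConnectionError, TimeoutError, and file errors.
        logger.warning("%s unavailable (%s); using fallback", what, exc)
        return fallback
    # Anything else (TypeError, KeyError, ...) is treated as a programming
    # error and propagates to the caller unchanged.
```

Catching only the external-failure family keeps bugs visible: a `TypeError` raised inside `fetch` still reaches the caller instead of being masked by the fallback.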

When to use it

- When writing any code that can fail, including agents and background workers
- When calling external APIs or services with potential timeouts or errors
- During user input processing and validation paths
- When performing file I/O, configuration loading, or parsing
- For network requests, database access, and inter-process communication

Best practices

- Identify and document failure modes early in design
- Validate inputs and fail fast for programming errors
- Log errors with context (operation, inputs, correlation IDs) without leaking secrets
- Implement retries with backoff for transient failures and circuit breakers for persistent issues
- Return clear, actionable messages to callers; escalate unexpected errors
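
The circuit breaker mentioned above can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: the class names, the default threshold of 3 failures, and the 30-second cool-down are all hypothetical, and a production breaker would usually add locking and metrics.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock  # injectable so the cool-down can be tested
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call. A single failure
            # re-opens the circuit because the count sits just below the threshold.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is known to be failing.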

Example use cases

- Wrap configuration loading with specific handlers for missing files, parse errors, and permission issues
- Add retry logic and exponential backoff when interacting with flaky external APIs
- Validate agent task payloads at the queue boundary and reject malformed tasks with reasons
- Log and surface errors in automated workflows so operators can triage failures quickly
- Use typed error classes to distinguish recoverable vs. fatal failures in worker processes
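
Two of these use cases, payload validation at the queue boundary and typed error classes, fit together naturally. The sketch below is one possible shape: every name in it (`TaskError`, `RecoverableError`, `FatalError`, `validate_payload`, `process`) and the required fields `task_id` and `action` are assumptions made for illustration.

```python
class TaskError(Exception):
    """Base class for task failures (hypothetical hierarchy)."""


class RecoverableError(TaskError):
    """The task may succeed if it is retried later."""


class FatalError(TaskError):
    """The task can never succeed as submitted."""


def validate_payload(payload):
    """Reject malformed tasks at the boundary, stating the reason."""
    if not isinstance(payload, dict):
        raise FatalError(f"payload must be a dict, got {type(payload).__name__}")
    missing = [key for key in ("task_id", "action") if key not in payload]
    if missing:
        raise FatalError(f"payload missing required fields: {', '.join(missing)}")
    return payload


def process(payload, handler):
    """Run one task and report what the queue should do with it."""
    try:
        task = validate_payload(payload)
        handler(task)
    except RecoverableError as exc:
        return ("requeue", str(exc))
    except FatalError as exc:
        return ("reject", str(exc))
    return ("done", "")
```

Because the worker branches on the exception type rather than on message text, handlers can signal intent simply by choosing which class to raise. Exceptions outside the hierarchy are not caught and surface as unexpected errors.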

FAQ

How do I choose between retrying and failing immediately?

Retry transient failures (network blips, timeouts) with limited attempts and backoff; fail fast for deterministic programming errors or invalid inputs.
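
A minimal sketch of that rule, assuming that `ConnectionError` and `TimeoutError` are the transient failures of interest. The `retry` helper, its defaults, and the jitter formula are illustrative choices, not a prescribed implementation.

```python
import random
import time

# Assumption: these built-in exceptions mark failures worth retrying.
TRANSIENT_ERRORS = (ConnectionError, TimeoutError)


def retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func, retrying transient errors with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TRANSIENT_ERRORS:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from many clients failing at once.
            sleep(delay + random.uniform(0, delay / 2))
```

Any exception not listed in `TRANSIENT_ERRORS`, such as a `ValueError` from invalid input, propagates on the first attempt, which is the fail-fast half of the answer.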

What information should I include in logs without exposing secrets?

Include operation name, input identifiers (not raw secrets), timestamps, stack traces, and correlation IDs; redact or omit sensitive fields.
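
One way to apply this with Python's standard `logging` module is sketched below. The `SENSITIVE_KEYS` set, the `redact` and `log_failure` helpers, and the key=value message layout are assumptions for illustration; the right list of sensitive fields depends on the application.

```python
import logging

# Assumption: field names that must never appear in logs.
SENSITIVE_KEYS = {"password", "token", "api_key", "authorization"}


def redact(fields):
    """Return a copy of fields with sensitive values masked."""
    return {
        key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }


def log_failure(logger, operation, correlation_id, fields):
    """Log the active exception with context; call from inside an except block."""
    logger.error(
        "operation=%s correlation_id=%s fields=%s",
        operation,
        correlation_id,
        redact(fields),
        exc_info=True,  # attaches the current stack trace to the record
    )
```

Timestamps are normally added by the logging formatter (for example `%(asctime)s`), so they do not need to be passed in by hand.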