m13-domain-error skill

This skill helps design robust domain error handling by classifying errors, guiding recovery, and applying retry, fallback, and circuit-breaker patterns.

npx playbooks add skill zhanghandong/rust-skills --skill m13-domain-error

---
name: m13-domain-error
description: "Use when designing domain error handling. Keywords: domain error, error categorization, recovery strategy, retry, fallback, domain error hierarchy, user-facing vs internal errors, error code design, circuit breaker, graceful degradation, resilience, error context, backoff, retry with backoff, error recovery, transient vs permanent error, 领域错误, 错误分类, 恢复策略, 重试, 熔断器, 优雅降级"
user-invocable: false
---

# Domain Error Strategy

> **Layer 2: Design Choices**

## Core Question

**Who needs to handle this error, and how should they recover?**

Before designing error types:
- Is this user-facing or internal?
- Is recovery possible?
- What context is needed for debugging?

---

## Error Categorization

| Error Type | Audience | Recovery | Example |
|------------|----------|----------|---------|
| User-facing | End users | Guide action | `InvalidEmail`, `NotFound` |
| Internal | Developers | Debug info | `DatabaseError`, `ParseError` |
| System | Ops/SRE | Monitor/alert | `ConnectionTimeout`, `RateLimited` |
| Transient | Automation | Retry | `NetworkError`, `ServiceUnavailable` |
| Permanent | Human | Investigate | `ConfigInvalid`, `DataCorrupted` |
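
The table above can be encoded directly so that callers ask the category, not the individual error, how to react. This is a minimal sketch; `ErrorCategory`, `Recovery`, and their method names are hypothetical and chosen only to mirror the table's columns.

```rust
/// Hypothetical categories, one per row of the categorization table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ErrorCategory {
    UserFacing,
    Internal,
    System,
    Transient,
    Permanent,
}

/// Hypothetical recovery actions, one per "Recovery" cell of the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Recovery {
    GuideAction,
    DebugInfo,
    MonitorAlert,
    Retry,
    Investigate,
}

impl ErrorCategory {
    /// Who is expected to act on an error of this category.
    pub fn audience(self) -> &'static str {
        match self {
            Self::UserFacing => "end users",
            Self::Internal => "developers",
            Self::System => "ops/SRE",
            Self::Transient => "automation",
            Self::Permanent => "human",
        }
    }

    /// What the audience is expected to do about it.
    pub fn recovery(self) -> Recovery {
        match self {
            Self::UserFacing => Recovery::GuideAction,
            Self::Internal => Recovery::DebugInfo,
            Self::System => Recovery::MonitorAlert,
            Self::Transient => Recovery::Retry,
            Self::Permanent => Recovery::Investigate,
        }
    }
}
```

Keeping the mapping in one `match` means that adding a category forces a compile-time decision about its audience and recovery.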

---

## Thinking Prompt

Before designing error types:

1. **Who sees this error?**
   - End user → friendly message, actionable
   - Developer → detailed, debuggable
   - Ops → structured, alertable

2. **Can we recover?**
   - Transient → retry with backoff
   - Degradable → fallback value
   - Permanent → fail fast, alert

3. **What context is needed?**
   - Call chain → anyhow::Context
   - Request ID → structured logging
   - Input data → error payload

---

## Trace Up ↑

To domain constraints (Layer 3):

```
"How should I handle payment failures?"
    ↑ Ask: What are the business rules for retries?
    ↑ Check: domain-fintech (transaction requirements)
    ↑ Check: SLA (availability requirements)
```

| Question | Trace To | Ask |
|----------|----------|-----|
| Retry policy | domain-* | What's acceptable latency for retry? |
| User experience | domain-* | What message should users see? |
| Compliance | domain-* | What must be logged for audit? |

---

## Trace Down ↓

To implementation (Layer 1):

```
"Need typed errors"
    ↓ m06-error-handling: thiserror for libraries
    ↓ m04-zero-cost: Error enum design

"Need error context"
    ↓ m06-error-handling: anyhow::Context
    ↓ Logging: tracing with fields

"Need retry logic"
    ↓ m07-concurrency: async retry patterns
    ↓ Crates: tokio-retry, backoff
```

---

## Quick Reference

| Recovery Pattern | When | Implementation |
|------------------|------|----------------|
| Retry | Transient failures | exponential backoff |
| Fallback | Degraded mode | cached/default value |
| Circuit Breaker | Cascading failures | failsafe-rs |
| Timeout | Slow operations | `tokio::time::timeout` |
| Bulkhead | Isolation | separate thread pools |
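
To make the circuit-breaker row concrete, here is a small std-only sketch of the idea. It is not the failsafe-rs API; `CircuitBreaker`, `BreakerError`, and the threshold/cooldown parameters are hypothetical names, and a production breaker would also need thread safety.

```rust
use std::time::{Duration, Instant};

/// Hypothetical minimal breaker: opens after `threshold` consecutive failures
/// and rejects calls until `cooldown` has elapsed.
pub struct CircuitBreaker {
    threshold: u32,
    cooldown: Duration,
    consecutive_failures: u32,
    opened_at: Option<Instant>,
}

#[derive(Debug, PartialEq)]
pub enum BreakerError<E> {
    /// The breaker is open; the operation was not attempted.
    Open,
    /// The operation ran and failed.
    Inner(E),
}

impl CircuitBreaker {
    pub fn new(threshold: u32, cooldown: Duration) -> Self {
        Self { threshold, cooldown, consecutive_failures: 0, opened_at: None }
    }

    pub fn call<T, E>(
        &mut self,
        op: impl FnOnce() -> Result<T, E>,
    ) -> Result<T, BreakerError<E>> {
        if let Some(opened_at) = self.opened_at {
            if opened_at.elapsed() < self.cooldown {
                return Err(BreakerError::Open);
            }
            // Cooldown elapsed: fall through and allow one trial call.
        }
        match op() {
            Ok(value) => {
                // Any success closes the breaker and resets the count.
                self.consecutive_failures = 0;
                self.opened_at = None;
                Ok(value)
            }
            Err(err) => {
                self.consecutive_failures = self.consecutive_failures.saturating_add(1);
                if self.consecutive_failures >= self.threshold {
                    // Open (or re-open after a failed trial call).
                    self.opened_at = Some(Instant::now());
                }
                Err(BreakerError::Inner(err))
            }
        }
    }
}
```

While the breaker is open, callers get `BreakerError::Open` immediately, which is what stops a failing dependency from dragging down its callers.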

## Error Hierarchy

```rust
#[derive(thiserror::Error, Debug)]
pub enum AppError {
    // User-facing
    #[error("Invalid input: {0}")]
    Validation(String),

    // Transient (retryable)
    #[error("Service temporarily unavailable")]
    ServiceUnavailable(#[source] reqwest::Error),

    // Internal (log details, show generic)
    #[error("Internal error")]
    Internal(#[source] anyhow::Error),
}

impl AppError {
    pub fn is_retryable(&self) -> bool {
        matches!(self, Self::ServiceUnavailable(_))
    }
}
```
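
The "log details, show generic" comment in the enum above could be implemented as two separate views of the same error. The sketch below is a std-only stand-in, not the `AppError` above: `AppErrorSketch` replaces the `reqwest` and `anyhow` sources with plain strings, and `user_message` and `log_detail` are hypothetical method names.

```rust
/// Std-only stand-in for `AppError` (assumption: sources are plain strings
/// here, whereas the original holds `reqwest::Error` and `anyhow::Error`).
#[derive(Debug)]
pub enum AppErrorSketch {
    Validation(String),
    ServiceUnavailable(String),
    Internal(String),
}

impl AppErrorSketch {
    /// Text that is safe to show to end users: actionable where possible,
    /// and never containing internal details.
    pub fn user_message(&self) -> String {
        match self {
            Self::Validation(reason) => format!("Invalid input: {reason}"),
            Self::ServiceUnavailable(_) => {
                "The service is temporarily unavailable. Please try again later.".to_string()
            }
            Self::Internal(_) => "Something went wrong on our side.".to_string(),
        }
    }

    /// Full detail, intended for logs only.
    pub fn log_detail(&self) -> String {
        format!("{self:?}")
    }
}
```

A request handler would typically log `log_detail()` with a request ID and return only `user_message()` to the client.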

## Retry Pattern

```rust
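use std::future::Future;
use std::time::Duration;

use tokio_retry::{strategy::ExponentialBackoff, Retry};

/// Runs `f`, retrying on any error: at most 5 retries after the first attempt.
async fn with_retry<F, Fut, T, E>(f: F) -> Result<T, E>
where
    // `impl Trait` is not allowed in an `Fn` bound, so the future type
    // gets its own generic parameter.
    F: FnMut() -> Fut,
    Fut: Future<Output = Result<T, E>>,
{
    // tokio-retry raises the base to the n-th power, so the delays here are
    // 100 ms, then 10 s (the cap) for each remaining retry.
    let strategy = ExponentialBackoff::from_millis(100)
        .max_delay(Duration::from_secs(10))
        .take(5);

    Retry::spawn(strategy, f).await
}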
```
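
The helper above retries on every error. To retry only transient failures, as the categorization recommends, the loop has to consult something like `is_retryable`. The following is a blocking, std-only sketch of that idea. `CallError` and `retry_transient` are hypothetical names, and an async version would swap `thread::sleep` for an async sleep.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical error type with a transient/permanent split.
#[derive(Debug, PartialEq)]
pub enum CallError {
    Transient(String),
    Permanent(String),
}

impl CallError {
    pub fn is_retryable(&self) -> bool {
        matches!(self, Self::Transient(_))
    }
}

/// Calls `op` at least once and at most `max_attempts` times. After each
/// retryable failure it sleeps, then doubles the delay (capped at `max_delay`).
/// Permanent errors are returned immediately without sleeping.
pub fn retry_transient<T>(
    max_attempts: u32,
    base_delay: Duration,
    max_delay: Duration,
    mut op: impl FnMut() -> Result<T, CallError>,
) -> Result<T, CallError> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if err.is_retryable() && attempt < max_attempts => {
                thread::sleep(delay);
                delay = delay.saturating_mul(2).min(max_delay);
                attempt += 1;
            }
            // Permanent error, or attempts exhausted: fail fast.
            Err(err) => return Err(err),
        }
    }
}
```

Bounding both the attempt count and the delay covers the "retry everything" and "infinite retry" mistakes listed under Common Mistakes.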

---

## Common Mistakes

| Mistake | Why Wrong | Better |
|---------|-----------|--------|
| One error type for everything | Callers cannot act on it | Categorize by audience |
| Retry everything | Wastes resources on permanent failures | Retry only transient errors |
| Infinite retry | Self-inflicted DoS | Max attempts + backoff |
| Expose internal errors | Security risk | User-friendly messages |
| No context | Hard to debug | `.context()` everywhere |

---

## Anti-Patterns

| Anti-Pattern | Why Bad | Better |
|--------------|---------|--------|
| String errors | No structure | thiserror types |
| `panic!` for recoverable errors | Bad UX | `Result` with context |
| Ignore errors | Silent failures | Log or propagate |
| `Box<dyn Error>` everywhere | Lost type info | thiserror |
| Errors as control flow in the happy path | Performance cost | Early validation |

---

## Related Skills

| When | See |
|------|-----|
| Error handling basics | m06-error-handling |
| Retry implementation | m07-concurrency |
| Domain modeling | m09-domain |
| User-facing APIs | domain-* |

## Overview

This skill guides design of domain error handling so errors are actionable, secure, and recoverable. It focuses on categorizing errors by audience, deciding recovery strategies (retry, fallback, degrade), and providing the right context for debugging and observability. The goal is clear, consistent error types and policies that support resilient systems.

## How this skill works

The skill inspects the error's audience (user, developer, ops) and maps it to recovery patterns (retry, fallback, fail-fast). It defines a domain error hierarchy and helper methods (e.g., is_retryable) and prescribes structured context and logging to trace failures. It also links recovery choices to implementation primitives like exponential backoff, circuit breakers, timeouts, and bulkheads.

## When to use it

- Designing error enums or typed error values for a domain
- Choosing retry, fallback, or fail-fast policies for operations
- Defining user-facing vs internal error messages and logging
- Specifying observability needs: IDs, request context, and structured fields
- Creating resilience rules (circuit breakers, timeouts, bulkheads)

## Best practices

- Categorize errors by audience: user-friendly for end users, detailed for developers, structured for Ops
- Treat transient and permanent errors differently: retry with backoff only for transient errors
- Include contextual data (request ID, input snapshot, call chain) for debugging
- Avoid exposing internal error details in user messages; log details securely
- Set limits: max attempts, max backoff, and clear circuit-breaker thresholds

## Example use cases

- Payment processing: classify network vs validation failures and apply retry or fail-fast according to business rules
- API input validation: return typed user-facing errors with actionable guidance
- Backend service calls: treat 5xx as transient with exponential backoff and circuit breaker
- Feature degrade: use fallback/cached values when downstream is unavailable
- SRE alerting: emit structured system errors for monitoring and escalation

## FAQ

**How do I decide retry vs fallback?**

If the error is transient (network, temporary service) prefer retry with exponential backoff; if recovery by substitution is acceptable, use a fallback to preserve availability.
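
The fallback half of that answer can be sketched as a small helper that records whether the caller received a fresh or a substituted value. `Served` and `with_fallback` are hypothetical names, and where the cached value comes from is left to the caller.

```rust
/// Tells the caller whether the value is fresh or came from the fallback,
/// so degraded responses can be flagged or counted.
#[derive(Debug, PartialEq)]
pub enum Served<T> {
    Fresh(T),
    Fallback(T),
}

/// Hypothetical helper: run `primary`; if it fails and a fallback value is
/// available, serve that instead. Without a fallback, the error propagates.
pub fn with_fallback<T, E>(
    primary: impl FnOnce() -> Result<T, E>,
    fallback: Option<T>,
) -> Result<Served<T>, E> {
    match primary() {
        Ok(value) => Ok(Served::Fresh(value)),
        Err(err) => match fallback {
            Some(cached) => Ok(Served::Fallback(cached)),
            None => Err(err),
        },
    }
}
```

Returning `Served::Fallback` instead of a bare value keeps degraded mode visible, so it can be logged and monitored.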

**What context should errors carry?**

Include request ID, relevant input data, and call-chain context. Use structured logging and attach context via error libraries to aid post-mortem without leaking sensitive data.
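
As a std-only illustration of attaching context, the sketch below wraps a source error with a request ID and an operation label, then renders the call chain for logs. `ContextError` and `render_chain` are hypothetical. In practice `anyhow::Context` and `tracing` fields would usually play these roles, and any input data added to the payload should be scrubbed of sensitive values first.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical wrapper that adds a request ID and an operation label
/// to any underlying error.
#[derive(Debug)]
pub struct ContextError {
    pub request_id: String,
    pub operation: &'static str,
    pub source: Box<dyn Error + Send + Sync + 'static>,
}

impl fmt::Display for ContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} failed (request_id={})", self.operation, self.request_id)
    }
}

impl Error for ContextError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        // Drop the Send + Sync bounds to match the trait's return type.
        let cause: &(dyn Error + 'static) = &*self.source;
        Some(cause)
    }
}

/// Renders an error and all of its sources as "outer: inner: ...".
pub fn render_chain(err: &(dyn Error + 'static)) -> String {
    let mut out = err.to_string();
    let mut current = err.source();
    while let Some(cause) = current {
        out.push_str(": ");
        out.push_str(&cause.to_string());
        current = cause.source();
    }
    out
}
```

The top-level message stays short and greppable by request ID, while the rendered chain preserves the underlying cause for post-mortems.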