
resilience-patterns skill

/plugins/ork/skills/resilience-patterns

This skill helps you implement circuit breakers, bulkheads, and retry strategies to build fault-tolerant distributed systems and resilient LLM integrations.

npx playbooks add skill yonatangross/orchestkit --skill resilience-patterns

Review the files below or copy the command above to add this skill to your agents.

Files (14)
SKILL.md
14.1 KB
---
name: resilience-patterns
description: Production-grade fault tolerance for distributed systems. Use when implementing circuit breakers, retry with exponential backoff, bulkhead isolation patterns, or building resilience into LLM API integrations.
context: fork
agent: backend-system-architect
version: 1.0.0
author: OrchestKit AI Agent Hub
tags: [resilience, circuit-breaker, bulkhead, retry, fault-tolerance]
user-invocable: false
---

# Resilience Patterns Skill

Production-grade resilience patterns for distributed systems and LLM-based workflows. Covers circuit breakers, bulkheads, retry strategies, and LLM-specific resilience techniques.

## Overview

Use this skill when:

- Building fault-tolerant multi-agent systems
- Implementing LLM API integrations with proper error handling
- Designing distributed workflows that need graceful degradation
- Adding observability to failure scenarios
- Protecting systems from cascade failures

## Core Patterns

### 1. Circuit Breaker Pattern (reference: circuit-breaker.md)

Prevents cascade failures by "tripping" when a service exceeds failure thresholds.

```
+-------------------------------------------------------------------+
|                    Circuit Breaker States                         |
+-------------------------------------------------------------------+
|                                                                   |
|    +----------+     failures >= threshold    +----------+         |
|    |  CLOSED  | ----------------------------> |   OPEN   |        |
|    | (normal) |                              | (reject) |         |
|    +----+-----+                              +----+-----+         |
|         |                                         |               |
|         | success                    timeout      |               |
|         |                            expires      |               |
|         |         +------------+                  |               |
|         |         | HALF_OPEN  |<-----------------+               |
|         +---------+  (probe)   |                                  |
|                   +------------+                                  |
|                                                                   |
|   CLOSED:    Allow requests, count failures                       |
|   OPEN:      Reject immediately, return fallback                  |
|   HALF_OPEN: Allow probe request to test recovery                 |
|                                                                   |
+-------------------------------------------------------------------+
```

**Key Configuration:**
- `failure_threshold`: Failures before opening (default: 5)
- `recovery_timeout`: Seconds before attempting recovery (default: 30)
- `half_open_requests`: Probes to allow in half-open (default: 1)
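
A minimal synchronous sketch of the state machine above, assuming per-dependency breaker instances; the names (`CircuitBreaker`, `CircuitOpenError`) are illustrative, and `scripts/circuit-breaker.py` remains the ready-to-use version.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call while OPEN."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("failing fast: circuit is open")
            self.state = "HALF_OPEN"  # recovery timeout expired: allow one probe

        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.monotonic()
            raise

        # Any success closes the circuit and resets the failure count.
        self.state = "CLOSED"
        self.failures = 0
        return result
```

Give each external dependency its own breaker instance so a tripped YouTube client, for example, cannot reject unrelated GitHub calls.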

### 2. Bulkhead Pattern (reference: bulkhead-pattern.md)

Isolates failures by partitioning resources into independent pools.

```
+-------------------------------------------------------------------+
|                      Bulkhead Isolation                           |
+-------------------------------------------------------------------+
|                                                                   |
|   +------------------+  +------------------+                      |
|   | TIER 1: Critical |  | TIER 2: Standard |                      |
|   |  (5 workers)     |  |  (3 workers)     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  |#| |#| | |     |  |  |#| | | | |     |                      |
|   |  +-+ +-+ +-+     |  |  +-+ +-+ +-+     |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  | | | |         |  |  Queue: 2        |                      |
|   |  +-+ +-+         |  |                  |                      |
|   |  Queue: 0        |  +------------------+                      |
|   +------------------+                                            |
|                                                                   |
|   +------------------+                                            |
|   | TIER 3: Optional |   # = Active request                       |
|   |  (2 workers)     |     = Available slot                       |
|   |  +-+ +-+         |                                            |
|   |  |#| |#| FULL!   |   Tier 1: synthesis, quality_gate          |
|   |  +-+ +-+         |   Tier 2: analysis agents                  |
|   |  Queue: 5        |   Tier 3: enrichment, optional features    |
|   +------------------+                                            |
|                                                                   |
+-------------------------------------------------------------------+
```

**Tier Configuration (OrchestKit):**
| Tier | Workers | Queue | Timeout | Use Case |
|------|---------|-------|---------|----------|
| 1 (Critical) | 5 | 10 | 300s | Synthesis, quality gate |
| 2 (Standard) | 3 | 5 | 120s | Content analysis agents |
| 3 (Optional) | 2 | 3 | 60s | Enrichment, caching |
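
A hedged asyncio sketch of the tiers above; `Bulkhead` and `BulkheadFullError` are illustrative names (the shipped template is `scripts/bulkhead.py`), and the numbers mirror the table's defaults.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a tier's workers and wait queue are both exhausted."""


class Bulkhead:
    """Semaphore-based isolation: `workers` concurrent slots plus a bounded wait queue."""

    def __init__(self, workers: int, queue: int, timeout: float):
        self._slots = asyncio.Semaphore(workers)
        self._capacity = workers + queue  # max calls running or waiting
        self._admitted = 0
        self._timeout = timeout

    async def run(self, coro_fn, *args, **kwargs):
        if self._admitted >= self._capacity:
            raise BulkheadFullError("tier saturated, shedding load")
        self._admitted += 1
        try:
            async with self._slots:
                return await asyncio.wait_for(coro_fn(*args, **kwargs), self._timeout)
        finally:
            self._admitted -= 1


# Tier layout matching the table above.
TIERS = {
    "critical": Bulkhead(workers=5, queue=10, timeout=300),
    "standard": Bulkhead(workers=3, queue=5, timeout=120),
    "optional": Bulkhead(workers=2, queue=3, timeout=60),
}
```

Critical work then runs as, for example, `await TIERS["critical"].run(synthesis_agent, payload)`, and an exhausted optional tier sheds load without touching the critical pool.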

### 3. Retry Strategies (reference: retry-strategies.md)

Intelligent retry logic with exponential backoff and jitter.

```
+-------------------------------------------------------------------+
|                   Exponential Backoff + Jitter                    |
+-------------------------------------------------------------------+
|                                                                   |
|   Attempt 1:  --> X (fail)                                        |
|               wait: 1s +/- 0.5s                                   |
|                                                                   |
|   Attempt 2:  --> X (fail)                                        |
|               wait: 2s +/- 1s                                     |
|                                                                   |
|   Attempt 3:  --> X (fail)                                        |
|               wait: 4s +/- 2s                                     |
|                                                                   |
|   Attempt 4:  --> OK (success)                                    |
|                                                                   |
|   Formula: delay = min(base * 2^attempt, max_delay) * jitter      |
|   Jitter:  random(0.5, 1.5) to prevent thundering herd            |
|                                                                   |
+-------------------------------------------------------------------+
```

**Error Classification for Retries:**
```python
RETRYABLE_ERRORS = {
    # HTTP/Network
    408, 429, 500, 502, 503, 504,  # HTTP status codes
    ConnectionError, TimeoutError,  # Network errors

    # LLM-specific
    "rate_limit_exceeded",
    "model_overloaded",
    "context_length_exceeded",  # Retry with truncation
}

NON_RETRYABLE_ERRORS = {
    400, 401, 403, 404,  # Client errors
    "invalid_api_key",
    "content_policy_violation",
    "invalid_request_error",
}
```
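
A minimal sketch of the backoff formula as a decorator, assuming HTTP-status errors are surfaced as exceptions before they reach it; `retry`, `fetch_metadata`, and the parameter defaults are illustrative (the configurable version lives in `scripts/retry-handler.py`).

```python
import functools
import random
import time

RETRYABLE_EXCEPTIONS = (ConnectionError, TimeoutError)


def retry(max_attempts: int = 4, base_delay: float = 1.0, max_delay: float = 30.0):
    """delay = min(base * 2^attempt, max_delay) * random(0.5, 1.5)."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except RETRYABLE_EXCEPTIONS:
                    if attempt == max_attempts - 1:
                        raise  # retry budget exhausted: surface the error
                    delay = min(base_delay * 2 ** attempt, max_delay)
                    time.sleep(delay * random.uniform(0.5, 1.5))  # jitter

        return wrapper

    return decorator


@retry(max_attempts=4)
def fetch_metadata(url: str) -> bytes:
    ...  # a transient ConnectionError/TimeoutError raised here is retried
```

Non-retryable errors (invalid keys, policy violations) should not appear in `RETRYABLE_EXCEPTIONS` at all, so they propagate on the first attempt.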

### 4. LLM-Specific Resilience (reference: llm-resilience.md)

Patterns specific to LLM API integrations.

```
+-------------------------------------------------------------------+
|                    LLM Fallback Chain                             |
+-------------------------------------------------------------------+
|                                                                   |
|   Request --> [Primary Model] --success--> Response               |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Fallback Model] --success--> Response              |
|                     |                                             |
|                   fail                                            |
|                     v                                             |
|               [Cached Response] --hit--> Response                 |
|                     |                                             |
|                   miss                                            |
|                     v                                             |
|               [Default Response] --> Graceful Degradation         |
|                                                                   |
|   Example Chain:                                                  |
|   1. claude-sonnet-4-20250514 (primary)                           |
|   2. gpt-4o-mini (fallback)                                       |
|   3. Semantic cache lookup                                        |
|   4. "Analysis unavailable" + partial results                     |
|                                                                   |
+-------------------------------------------------------------------+
```
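
A compact sketch of the chain, assuming each step is an async callable that returns a response or signals a miss by returning `None` or raising; `with_fallbacks`, the step stubs, and the default string are illustrative (see `scripts/llm-fallback-chain.py` for the full template).

```python
from typing import Awaitable, Callable, Optional

Step = Callable[[str], Awaitable[Optional[str]]]


async def with_fallbacks(prompt: str, steps: list[Step], default: str) -> str:
    """Walk the chain in order; fall through to the default on total failure."""
    for step in steps:
        try:
            response = await step(prompt)
            if response is not None:
                return response
        except Exception:
            continue  # log + emit a metric here, then try the next step
    return default  # graceful degradation rather than a hard failure


# Illustrative stubs for the chain shown above.
async def call_primary(prompt: str) -> Optional[str]: ...
async def call_fallback(prompt: str) -> Optional[str]: ...
async def semantic_cache(prompt: str) -> Optional[str]: ...


async def analyze(prompt: str) -> str:
    return await with_fallbacks(
        prompt,
        steps=[call_primary, call_fallback, semantic_cache],
        default="Analysis unavailable",
    )
```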

**Token Budget Management:**
```
+-------------------------------------------------------------------+
|                     Token Budget Guard                            |
+-------------------------------------------------------------------+
|                                                                   |
|   Input: 8,000 tokens                                             |
|   +---------------------------------------------+                 |
|   |#################################            |                 |
|   +---------------------------------------------+                 |
|                                          ^                        |
|                                          |                        |
|                                    Context Limit (16K)            |
|                                                                   |
|   Strategy when approaching limit:                                |
|   1. Summarize earlier context (compress 4:1)                     |
|   2. Drop low-priority content (optional fields)                  |
|   3. Split into multiple requests                                 |
|   4. Fail fast with "content too large" error                     |
|                                                                   |
+-------------------------------------------------------------------+
```
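
A rough sketch of the guard, covering strategies 2 and 4 above; the 4-characters-per-token estimate is a placeholder for the provider's real tokenizer, and the names are illustrative (the shipped template is `scripts/token-budget.py`).

```python
CONTEXT_LIMIT = 16_000
RESPONSE_RESERVE = 2_000  # headroom for the model's output tokens


def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic; use the model's tokenizer in practice


def guard_budget(prompt: str, optional_sections: list[str]) -> str:
    """Drop low-priority sections until the request fits, then fail fast."""
    budget = CONTEXT_LIMIT - RESPONSE_RESERVE
    parts = [prompt, *optional_sections]
    while estimate_tokens("\n".join(parts)) > budget and len(parts) > 1:
        parts.pop()  # strategy 2: shed the lowest-priority section first
    final = "\n".join(parts)
    if estimate_tokens(final) > budget:
        raise ValueError("content too large")  # strategy 4: fail fast
    return final
```

Summarization (strategy 1) and request splitting (strategy 3) would slot in before the final fail-fast check; they are omitted here for brevity.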

## Quick Reference

| Pattern | When to Use | Key Benefit |
|---------|-------------|-------------|
| Circuit Breaker | External service calls | Prevent cascade failures |
| Bulkhead | Multi-tenant/multi-agent | Isolate failures |
| Retry + Backoff | Transient failures | Automatic recovery |
| Fallback Chain | Critical operations | Graceful degradation |
| Token Budget | LLM calls | Cost control, prevent failures |

## OrchestKit Integration Points

1. **Workflow Agents**: Each agent wrapped with circuit breaker + bulkhead tier
2. **LLM Calls**: All model invocations use fallback chain + retry logic
3. **External APIs**: Circuit breaker on YouTube, arXiv, GitHub APIs
4. **Database Ops**: Bulkhead isolation for read vs write operations

## Files in This Skill

### References (Conceptual Guides)
- `references/circuit-breaker.md` - Deep dive on circuit breaker pattern
- `references/bulkhead-pattern.md` - Bulkhead isolation strategies
- `references/retry-strategies.md` - Retry algorithms and error classification
- `references/llm-resilience.md` - LLM-specific patterns
- `references/error-classification.md` - How to categorize errors

### Templates (Code Patterns)
- `scripts/circuit-breaker.py` - Ready-to-use circuit breaker class
- `scripts/bulkhead.py` - Semaphore-based bulkhead implementation
- `scripts/retry-handler.py` - Configurable retry decorator
- `scripts/llm-fallback-chain.py` - Multi-model fallback pattern
- `scripts/token-budget.py` - Token budget guard implementation

### Examples
- `examples/orchestkit-workflow-resilience.md` - Full OrchestKit integration example

### Checklists
- `checklists/pre-deployment-resilience.md` - Production readiness checklist
- `checklists/circuit-breaker-setup.md` - Circuit breaker configuration guide

## 2026 Best Practices

1. **Adaptive Thresholds**: Use sliding windows, not fixed counters (see the sketch after this list)
2. **Observability First**: Every circuit trip = alert + metric + trace
3. **Graceful Degradation**: Always have a fallback, even if partial
4. **Health Endpoints**: Separate health check from circuit state
5. **Chaos Testing**: Regularly test failure scenarios in staging
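
Practice 1 can be illustrated with a time-based window, where only recent failures count toward tripping; `SlidingWindowCounter` is an assumed name for illustration, not an OrchestKit API.

```python
import time
from collections import deque


class SlidingWindowCounter:
    """Counts failures observed within the last `window_seconds`."""

    def __init__(self, window_seconds: float = 60.0):
        self.window = window_seconds
        self.failures: deque[float] = deque()

    def record_failure(self) -> None:
        self.failures.append(time.monotonic())

    def count(self) -> int:
        cutoff = time.monotonic() - self.window
        while self.failures and self.failures[0] < cutoff:
            self.failures.popleft()  # expire failures that left the window
        return len(self.failures)
```

A breaker built on this counter trips on `counter.count() >= threshold`, so a burst of old failures cannot keep a recovered service locked out.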

---

## Related Skills

- `observability-monitoring` - Metrics and alerting for circuit breaker state changes
- `caching-strategies` - Cache as fallback layer in degradation scenarios
- `error-handling-rfc9457` - Structured error responses for resilience failures
- `background-jobs` - Async processing with retry and failure handling

## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Circuit breaker recovery | Half-open probe | Gradual recovery, prevents immediate re-failure |
| Retry algorithm | Exponential backoff + jitter | Prevents thundering herd, respects rate limits |
| Bulkhead isolation | Semaphore-based tiers | Simple, efficient, prioritizes critical operations |
| LLM fallback | Model chain with cache | Graceful degradation, cost optimization, availability |

---

## Capability Details

### circuit-breaker
**Keywords:** circuit breaker, failure threshold, cascade failure, trip, half-open
**Solves:**
- Prevent cascade failures when external services fail
- Automatically recover when services come back online
- Fail fast instead of waiting for timeouts

### bulkhead
**Keywords:** bulkhead, isolation, semaphore, thread pool, resource pool, tier
**Solves:**
- Isolate failures to prevent entire system crashes
- Prioritize critical operations over optional ones
- Limit concurrent requests to protect resources

### retry-strategies
**Keywords:** retry, backoff, exponential, jitter, thundering herd
**Solves:**
- Handle transient failures automatically
- Avoid overwhelming recovering services
- Classify errors as retryable vs non-retryable

### llm-resilience
**Keywords:** LLM, fallback, model, token budget, rate limit, context length
**Solves:**
- Handle LLM API rate limits gracefully
- Fall back to alternative models when primary fails
- Manage token budgets to prevent context overflow

### error-classification
**Keywords:** error, retryable, transient, permanent, classification
**Solves:**
- Determine which errors should be retried
- Categorize errors by severity and recoverability
- Map HTTP status codes to resilience actions

Overview

This skill provides production-grade resilience patterns for distributed systems and LLM integrations. It bundles circuit breakers, bulkhead isolation, retry with exponential backoff and jitter, and LLM-specific fallback and token budget strategies. The content includes templates, examples, and checklists to harden multi-agent workflows and API calls.

How this skill works

The skill inspects call sites and workflow boundaries to apply patterns: it wraps external calls with circuit breaker logic, partitions work with bulkhead semaphores, and applies configurable retry handlers with exponential backoff and jitter. For LLMs it implements a fallback chain, semantic cache lookups, and token budget guards to prevent context overflow and to provide graceful degradation.

When to use it

  • Protect external API calls that can cause cascade failures (e.g., GitHub, YouTube, arXiv).
  • Isolate multi-tenant or multi-agent workloads to prevent noisy neighbors from impacting critical paths.
  • Implement robust LLM integrations that need rate-limit handling, fallback models, and token management.
  • Add automated retry and backoff for transient network errors while avoiding thundering-herd effects.
  • Establish graceful degradation and observability for critical production workflows.

Best practices

  • Use adaptive thresholds (sliding windows) rather than fixed counters for circuit decisions.
  • Instrument every circuit state change with metrics, logs, and traces to drive alerts and postmortems.
  • Always include a fallback chain for critical operations: alternative model, cache hit, then default response.
  • Classify errors explicitly into retryable and non-retryable sets before applying retries.
  • Run chaos testing in staging to validate bulkhead and circuit behavior under realistic failure modes.

Example use cases

  • Wrap a third-party metadata API with a circuit breaker and retry handler to fail fast and recover gracefully.
  • Partition LLM request processing into critical synthesis workers and optional enrichment workers using bulkheads.
  • Implement a model fallback chain: primary model → cheaper fallback model → semantic cache → safe default.
  • Apply token budget guard to long conversation contexts: summarize, drop low-priority fields, or split requests.
  • Add observability hooks so every circuit trip triggers alerts and links to traces for rapid diagnosis.

FAQ

How do I choose failure thresholds for a circuit breaker?

Start with conservative defaults (e.g., 5 failures, 30s recovery) and use traffic patterns and SLOs to tune thresholds; prefer sliding windows to capture recent behavior.

When should I retry vs. fall back to another model?

Retry transient errors (timeouts, 429, 5xx). If retries fail or the model is overloaded, invoke the fallback model or cached response to preserve user experience.