error-recovery-and-resilience skill

/frontend/.github-skills/error-recovery-and-resilience

This skill helps you build resilient UI with layered error boundaries, retries, and fallback orchestration to improve user experience.

npx playbooks add skill harborgrid-justin/lexiflow-premium --skill error-recovery-and-resilience

SKILL.md

---
name: error-recovery-and-resilience
description: Engineer resilient UI systems with layered error boundaries, retries, and fallback orchestration.
---

# Error Recovery and Resilience (React 18)

## Summary

Engineer resilient UI systems with layered error boundaries, retries, and fallback orchestration.

## Key Capabilities

- Define error containment strategies for nested UI regions.
- Implement retry policies with exponential backoff and jitter.
- Integrate error telemetry and automated recovery flows.

## PhD-Level Challenges

- Prove containment of error cascades across boundaries.
- Model user impact of fallback UI pathways.
- Evaluate resilience improvements with chaos testing.

## Acceptance Criteria

- Demonstrate isolated error recovery without full app reload.
- Provide telemetry of error boundaries and recovery paths.
- Include chaos-test results and mitigation strategies.

Overview

This skill teaches how to engineer resilient UI systems using layered error boundaries, retry policies, and fallback orchestration. It focuses on isolating failures so user-facing regions can recover independently without a full app reload. The guidance covers telemetry integration and chaos-testing strategies to validate improvements.

How this skill works

The skill inspects UI component hierarchies and recommends placement of nested error boundaries to contain failures. It defines retry policies (exponential backoff with jitter) and coordinates fallback UIs and automated recovery flows. Telemetry hooks record boundary activations, retries, and final outcomes to drive further tuning and alerting.
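The retry policy described above can be sketched as follows. This is a minimal illustration, not the skill's own implementation: the `RetryPolicy` shape, the function names, and the choice of "full jitter" (a uniform random delay below an exponentially growing ceiling) are all assumptions made for the example. The `wait` and `random` parameters are injectable so the behaviour can be checked deterministically.

```typescript
// Sketch only: names and defaults are illustrative assumptions.
interface RetryPolicy {
  maxAttempts: number; // total attempts, including the first call
  baseDelayMs: number; // delay ceiling before the first retry
  maxDelayMs: number;  // upper bound on any single delay ceiling
}

// Full jitter: uniform delay in [0, min(maxDelayMs, baseDelayMs * 2^retryIndex)).
function backoffDelay(
  retryIndex: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(
    policy.maxDelayMs,
    policy.baseDelayMs * Math.pow(2, retryIndex),
  );
  return Math.floor(random() * ceiling);
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function retryWithBackoff<T>(
  task: () => Promise<T>,
  policy: RetryPolicy,
  wait: (ms: number) => Promise<void> = sleep,
  random: () => number = Math.random,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < policy.maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Do not wait after the final attempt; the error is rethrown below.
      if (attempt < policy.maxAttempts - 1) {
        await wait(backoffDelay(attempt, policy, random));
      }
    }
  }
  throw lastError;
}
```

Randomizing the whole delay, rather than adding a small random offset to a fixed schedule, spreads clients that failed at the same moment across the full window, which is the point of combining backoff with jitter.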

When to use it

When single-component failures must not crash the entire application
When network-dependent UI regions need automated retry and graceful degradation
When you need observable recovery paths for incident analysis
When introducing features that should tolerate partial outages
When validating resilience improvements with chaos testing

Best practices

Place fine-grained error boundaries around independent UI regions and broader boundaries at the layout level
Combine exponential backoff with jitter to avoid synchronized retry storms
Expose clear, minimal fallback UIs that offer retry or safe alternatives
Emit structured telemetry for boundary triggers, retry attempts, and user actions
Run targeted chaos tests and use results to refine containment and timeout thresholds
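One way to make the telemetry practice concrete is a small discriminated union of events plus a per-boundary summary. The event names, fields, and the in-memory recorder below are hypothetical and exist only for illustration; a real application would forward events to whatever telemetry backend it already uses.

```typescript
// Hypothetical event shapes; field names are illustrative assumptions.
type ResilienceEvent =
  | { kind: "boundary_triggered"; boundaryId: string; errorName: string; at: number }
  | { kind: "retry_attempt"; boundaryId: string; attempt: number; delayMs: number; at: number }
  | { kind: "recovered"; boundaryId: string; attempts: number; at: number }
  | { kind: "user_action"; boundaryId: string; action: "retry" | "dismiss"; at: number };

type TelemetrySink = (event: ResilienceEvent) => void;

interface BoundarySummary {
  triggered: number;
  retries: number;
  recovered: number;
}

// In-memory recorder, useful in tests or as a buffer before upload.
function createRecorder(): { sink: TelemetrySink; events: ResilienceEvent[] } {
  const events: ResilienceEvent[] = [];
  const sink: TelemetrySink = (event) => {
    events.push(event);
  };
  return { sink, events };
}

// Aggregates events per boundary so activations can be compared with recoveries.
function summarize(events: readonly ResilienceEvent[]): Map<string, BoundarySummary> {
  const summary = new Map<string, BoundarySummary>();
  for (const event of events) {
    const entry = summary.get(event.boundaryId) ?? { triggered: 0, retries: 0, recovered: 0 };
    switch (event.kind) {
      case "boundary_triggered":
        entry.triggered += 1;
        break;
      case "retry_attempt":
        entry.retries += 1;
        break;
      case "recovered":
        entry.recovered += 1;
        break;
      case "user_action":
        break; // user actions are recorded but not aggregated in this sketch
    }
    summary.set(event.boundaryId, entry);
  }
  return summary;
}
```

A boundary whose `triggered` count is far above its `recovered` count is a candidate for a different fallback or a tighter containment region.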

Example use cases

Recovering a comments widget while keeping the rest of the page interactive
Retrying a payment method fetch with exponential backoff and showing a cached fallback
Orchestrating a staged fallback: lightweight view, then cached data, then retry option
Measuring user impact by correlating boundary activations with session metrics
Validating error containment via simulated downstream service failures
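The staged fallback use case above could be modelled roughly as an ordered list of stages, where the first stage that produces something renderable wins and the retry option is the last resort. The `FallbackStage` and `FallbackResult` types and the `pickFallback` function are hypothetical names introduced for this sketch.

```typescript
// Hypothetical types for illustration; not part of any library.
interface FallbackStage<T> {
  name: string;
  // Returns a value to render, or undefined when this stage has nothing to show.
  resolve: () => T | undefined;
}

type FallbackResult<T> =
  | { kind: "rendered"; stage: string; value: T }
  | { kind: "retry-option"; skipped: string[] };

// Walks the stages in order and returns the first one that yields a value.
// If every stage fails or is empty, the caller should offer a retry option.
function pickFallback<T>(stages: readonly FallbackStage<T>[]): FallbackResult<T> {
  const skipped: string[] = [];
  for (const stage of stages) {
    try {
      const value = stage.resolve();
      if (value !== undefined) {
        return { kind: "rendered", stage: stage.name, value };
      }
    } catch {
      // A failing stage must not break the chain; fall through to the next one.
    }
    skipped.push(stage.name);
  }
  return { kind: "retry-option", skipped };
}
```

Returning the list of skipped stages alongside the final outcome gives telemetry enough detail to show which fallback path a user actually saw.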

FAQ

How granular should error boundaries be?

Start with boundaries around independently meaningful UI regions (widgets, panels) and adjust granularity based on failure patterns and telemetry.
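To reason about granularity before touching components, it can help to model the tree and compute which regions a single failure would take down. The sketch below is a plain data model, not a React API: `UiNode`, `blastRadius`, and the helper functions are hypothetical. It relies on two documented React behaviours: an error boundary catches errors thrown in its child tree but not in itself, and an error that no boundary catches unmounts the whole tree.

```typescript
// Hypothetical model of a component tree; not a React API.
interface UiNode {
  id: string;
  isBoundary?: boolean; // true if this node acts as an error boundary
  children?: UiNode[];
}

// Path from `node` down to the node with `targetId`, or undefined if absent.
function findPath(node: UiNode, targetId: string): UiNode[] | undefined {
  if (node.id === targetId) return [node];
  for (const child of node.children ?? []) {
    const rest = findPath(child, targetId);
    if (rest) return [node, ...rest];
  }
  return undefined;
}

function descendantIds(node: UiNode): string[] {
  const ids: string[] = [];
  for (const child of node.children ?? []) {
    ids.push(child.id, ...descendantIds(child));
  }
  return ids;
}

// Ids of every node replaced by fallback UI when `failingId` throws while
// rendering, or undefined when no boundary sits above it (whole tree lost).
function blastRadius(root: UiNode, failingId: string): string[] | undefined {
  const path = findPath(root, failingId);
  if (!path) throw new Error(`unknown node: ${failingId}`);
  // Search strict ancestors only, nearest first: a boundary cannot catch itself.
  const ancestors = path.slice(0, -1).reverse();
  for (const ancestor of ancestors) {
    if (ancestor.isBoundary) return descendantIds(ancestor);
  }
  return undefined;
}
```

Comparing the blast radius of likely failure points with and without a candidate boundary gives a quick estimate of whether adding that boundary is worth it.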

When should I prefer fallback UI over automatic retry?

Use retries for transient failures that are likely to clear on their own within a short time; show a fallback UI when retries exceed their limit or when user action is required.
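The rule in this answer can be written down as a small decision function. Everything here is an assumption made for illustration: the `FailureContext` fields, the function names, and in particular the set of HTTP status codes treated as transient, which follows a common convention rather than anything the skill specifies.

```typescript
type RecoveryDecision = "retry" | "fallback";

// Hypothetical description of a failed request.
interface FailureContext {
  status: number;              // HTTP status of the failed response
  requiresUserAction: boolean; // e.g. expired session or invalid input
  attemptsMade: number;        // attempts so far, including the one that just failed
  maxAttempts: number;         // total attempts allowed by the retry policy
}

// Assumption: these statuses usually indicate a temporary condition.
const TRANSIENT_STATUSES: ReadonlySet<number> = new Set([408, 429, 502, 503, 504]);

function isTransientStatus(status: number): boolean {
  return TRANSIENT_STATUSES.has(status);
}

function decideRecovery(ctx: FailureContext): RecoveryDecision {
  // Retrying cannot help when the user has to do something first.
  if (ctx.requiresUserAction) return "fallback";
  // Non-transient failures are unlikely to succeed on a second try.
  if (!isTransientStatus(ctx.status)) return "fallback";
  // Transient failure: retry until the policy's attempt budget is spent.
  return ctx.attemptsMade < ctx.maxAttempts ? "retry" : "fallback";
}
```

For example, a 503 on the first of three allowed attempts yields `"retry"`, while the same 503 on the third attempt, or a 401 at any point, yields `"fallback"`.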