
jepsen-testing skill


This skill helps you architect, run, and interpret Jepsen-style correctness tests for distributed systems under faults, with formal checkers and actionable reports.

npx playbooks add skill plurigrid/asi --skill jepsen-testing


SKILL.md
---
name: jepsen-testing
description: Jepsen-style correctness testing for distributed systems under faults (partitions, crashes, clock skew) using concurrent operation histories and formal checkers (linearizability/serializability and Elle-style anomalies). Use when designing, implementing, or running Jepsen tests, or interpreting histories/violations.
---

# Jepsen Testing

## Intake

- Identify the system under test and the exact client surface (Redis, S3, Kafka, HTTP, gRPC).
- Define what "acknowledged" means for each operation (what the client treats as committed).
- Write the claimed consistency guarantees as a checkable property (linearizable register, RYW, monotonic reads, serializable txns).
- Specify the failure model to test (crash-stop, partitions, clock skew, disk stalls, restarts).
- Decide whether the test must be multi-surface (write via Redis, read via S3) to validate cross-frontend coherence.
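A concrete record shape makes "acknowledged" checkable rather than informal. The sketch below is one illustrative encoding in Python (the field names and `Status` enum are assumptions, not a fixed Jepsen format); the key point is the three-way result status, where indeterminate operations must be treated as possibly committed:

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    OK = "ok"      # definitive success: the client treats the op as committed
    FAIL = "fail"  # definitive failure: the op definitely did not happen
    INFO = "info"  # indeterminate (timeout/crash): may or may not have happened

@dataclass(frozen=True)
class Op:
    client: int       # logical client/process id
    f: str            # "read" or "write" (register semantics)
    value: object     # argument for writes, observed result for reads
    invoke_ns: int    # monotonic invocation timestamp
    complete_ns: int  # monotonic completion timestamp
    status: Status

# "Acknowledged" == Status.OK. Treating INFO ops as failed is a classic
# source of false passes: an indeterminate write may still take effect.
ack = Op(client=0, f="write", value=1, invoke_ns=10, complete_ns=25, status=Status.OK)
assert ack.status is Status.OK
```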

## Workload Design

- Prefer the smallest workload that can falsify the claim.
- Use a mix of reads and writes that creates ambiguous interleavings.
- Add a "witness" invariant that is easy to explain:
  - A lost acknowledged write.
  - A read sees a value that cannot be explained by any sequential execution respecting real-time order.
  - List-append: an element is lost or duplicated, or the observed order implies a cycle.
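The first two list-append witnesses above (lost or duplicated elements) can be checked with a few lines; a minimal sketch in Python, assuming a final read returns the full list (function name and input shapes are illustrative):

```python
def list_append_witnesses(acknowledged, observed):
    """Check two easy-to-explain list-append invariants.

    acknowledged: set of elements whose append got a definitive ok
    observed: the list returned by a final read
    Returns human-readable violations; an empty list means no witness fired.
    """
    violations = []
    seen = set()
    for x in observed:
        if x in seen:
            violations.append(f"duplicate element {x!r} in observed list")
        seen.add(x)
    for x in acknowledged:
        if x not in seen:
            violations.append(f"lost acknowledged append {x!r}")
    return violations

# One lost acknowledged append (2) and one duplicate (3):
violations = list_append_witnesses({1, 2, 3}, [1, 3, 3])
assert len(violations) == 2
```

Cycle detection over observed orders is harder and is what Elle-style checkers are for; this sketch only covers the invariants a reader can verify by eye.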

## Checker Selection

- Register or map semantics: use a linearizability checker.
- Transactional / multi-object semantics: use Elle-style anomaly detection (write cycles, dirty reads, lost updates).
- If linearizability is too strong for the product, explicitly select a weaker model and encode it (do not silently downgrade).
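For register semantics, the core of a linearizability check fits in a page. Below is a Wing & Gong-style brute-force search sketched in Python; it is exponential, which is fine for the minimized histories this skill aims for, but it is not a substitute for a production checker like Knossos:

```python
def linearizable(history, init=None):
    """Brute-force linearizability check for a single register.

    history: list of (f, value, invoke, complete) tuples, where f is
    "read" or "write" and invoke/complete are real-time bounds.
    Returns True iff some total order respects real-time order and
    sequential register semantics.
    """
    def search(pending, state):
        if not pending:
            return True
        # An op may linearize next only if no other op completed
        # strictly before it was invoked (real-time order).
        min_complete = min(op[3] for op in pending)
        for op in pending:
            f, value, invoke, complete = op
            if invoke > min_complete:
                continue  # some earlier-completing op must go first
            if f == "read" and value != state:
                continue  # this read cannot observe the current state
            next_state = value if f == "write" else state
            rest = [o for o in pending if o is not op]
            if search(rest, next_state):
                return True
        return False
    return search(list(history), init)

# write(1) completes before the read starts, so reading None is a violation:
assert not linearizable([("write", 1, 0, 5), ("read", None, 10, 15)])
assert linearizable([("write", 1, 0, 5), ("read", 1, 10, 15)])
```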

## Fault (Nemesis) Selection

- Partitions: majority/minority splits, bridge partitions, flapping partitions.
- Process faults: kill and restart, node reboot, rolling restarts.
- Time faults: clock offsets and jumps if the system relies on time.
- Storage faults: fsync latency, I/O stalls, disk-full behavior (only if safe and reversible).
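Whatever faults you pick, the schedule itself should be deterministic so a failing run can be replayed. A minimal sketch of a seeded majority/minority partition schedule generator (the function and structure are illustrative, not a real Jepsen nemesis API):

```python
import random

def partition_schedule(nodes, rounds, seed=42):
    """Generate a deterministic majority/minority partition schedule.

    Each round isolates a randomly chosen minority; a fixed seed makes
    the schedule identical on every run, which is what lets a failing
    history be reproduced later.
    """
    rng = random.Random(seed)  # seeded generator => replayable schedule
    schedule = []
    for _ in range(rounds):
        k = rng.randint(1, (len(nodes) - 1) // 2)  # strict minority size
        minority = sorted(rng.sample(nodes, k))
        majority = sorted(set(nodes) - set(minority))
        schedule.append((minority, majority))
    return schedule

nodes = ["n1", "n2", "n3", "n4", "n5"]
sched = partition_schedule(nodes, rounds=3)
# Same seed, same schedule -- the property Run and Minimize relies on:
assert sched == partition_schedule(nodes, rounds=3)
```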

## Run and Minimize

- Start with a short, low-concurrency run until the harness is stable.
- When a failure appears, minimize by reducing:
  - Keys, operation count, and concurrency.
  - Fault intensity and schedule complexity.
- Preserve determinism (fixed seeds, fixed partition schedule) so a failing history can be reproduced.
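The minimization loop above can be automated with a ddmin-style shrink: repeatedly drop chunks of the failing history as long as the reduced run still fails. A greedy sketch in Python, assuming a deterministic `still_fails` predicate (same seed, same nemesis schedule):

```python
def minimize(ops, still_fails):
    """Greedy ddmin-style shrink of a failing operation list.

    ops: the failing history; still_fails: deterministic predicate that
    re-runs the test on a candidate and reports whether it still fails.
    """
    n = 2  # number of chunks to try removing
    while len(ops) >= 2:
        chunk = max(1, len(ops) // n)
        reduced = False
        for i in range(0, len(ops), chunk):
            candidate = ops[:i] + ops[i + chunk:]
            if candidate and still_fails(candidate):
                ops, n, reduced = candidate, max(2, n - 1), True
                break
        if not reduced:
            if chunk == 1:
                break  # cannot remove any single op: 1-minimal
            n = min(len(ops), n * 2)  # retry with finer-grained chunks
    return ops

# Toy failure: the bug reproduces whenever ops 3 and 7 are both present.
fails = lambda ops: 3 in ops and 7 in ops
assert sorted(minimize(list(range(10)), fails)) == [3, 7]
```

Note the shrink is only sound because the predicate is deterministic; with a flaky reproducer, minimization discards operations that were actually necessary.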

## Reporting

- State the exact claim under test and the precise pass/fail property.
- Include the workload, nemesis schedule, and a minimal failing history excerpt.
- Distinguish availability failures (timeouts) from safety failures (incorrect ok results).
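The availability/safety split can be computed mechanically from the recorded history. A small sketch, assuming per-operation records with a result status and a checker-assigned violation flag (an illustrative shape, not a fixed Jepsen format):

```python
def classify_failures(history):
    """Summarize availability symptoms vs. safety violations.

    history: list of dicts like {"status": "ok"|"fail"|"info",
    "violation": bool}. "info" means indeterminate (e.g. a timeout);
    a checker violation on an ok result is a safety bug.
    """
    availability = sum(1 for op in history if op["status"] == "info")
    safety = sum(1 for op in history
                 if op["status"] == "ok" and op.get("violation"))
    return {"availability_failures": availability,
            "safety_failures": safety}

summary = classify_failures([
    {"status": "ok", "violation": False},
    {"status": "info", "violation": False},  # timeout: unavailable, not unsafe
    {"status": "ok", "violation": True},     # acknowledged but wrong: unsafe
])
assert summary == {"availability_failures": 1, "safety_failures": 1}
```

Timeouts alone are a (possibly acceptable) availability story; any nonzero safety count is a correctness bug and should headline the report.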


Overview

This skill provides Jepsen-style correctness testing for distributed systems, focusing on faults like partitions, crashes, and clock skew, and on interpreting concurrent histories with formal checkers. It helps design workloads, choose checkers (linearizability, serializability, Elle-style anomaly detection), run fault injectors, and minimize failing histories for reproducible debugging. Use it when designing, implementing, or evaluating distributed consistency claims.

How this skill works

The skill inspects the system under test by exercising its client surface with controlled, concurrent workloads while injecting nemeses (partitions, process kills, clock faults, storage stalls). It records a fine-grained operation history (timestamps, observed results, acknowledgements) and runs formal checkers to detect violations like non-linearizable operations, write-loss, or Elle-style cycles. When failures occur it provides guidance to minimize the failing history and to produce a compact, reproducible report that separates availability from safety failures.

When to use it

  • Validating a written consistency claim (e.g., linearizability, RYW, serializability) before release
  • Investigating surprising client-observed anomalies in production or tests
  • Comparing alternative failure models and their impact on correctness
  • Designing minimal workloads that expose concurrency bugs
  • Reducing noisy Jepsen runs to a small, reproducible failing scenario

Best practices

  • Define the exact client surface and what an "acknowledged" operation means before testing
  • Start small: use minimal keys, low concurrency, and short runs to stabilize the harness
  • Choose the strongest checker that matches your claimed guarantees; encode weaker models explicitly if needed
  • Inject faults conservatively and preserve determinism with fixed seeds and schedules for reproducibility
  • When a violation appears, systematically minimize keys, concurrency, and fault complexity to isolate the root cause

Example use cases

  • Prove or disprove a claimed linearizable API for a distributed key-value store under network partitions
  • Detect lost acknowledged writes and provide a minimized history demonstrating the loss
  • Validate serializability of a multi-shard transaction protocol using Elle-style anomaly detection
  • Test system behavior under clock skew to reveal time-dependent correctness violations
  • Compare safety and availability trade-offs across different restart and partition schedules

FAQ

What should I record during a run to make checking possible?

Record operation type, key(s), arguments, client-observed result, start and end timestamps, and the exact nemesis schedule so the history is replayable and checkable.

If linearizability is too strong, what do I do?

Explicitly define a weaker model you intend to satisfy, encode it as a checker, and document that choice rather than silently downgrading the test.