home / skills / copyleftdev / sk1llz / gray
This skill helps design reliable data systems using Jim Gray's ACID principles, WAL, and fault-tolerant recovery to ensure data integrity.
npx playbooks add skill copyleftdev/sk1llz --skill grayReview the files below or copy the command above to add this skill to your agents.
---
name: gray-transaction-systems
description: Design data systems using Jim Gray's principles of transaction processing and fault tolerance. Emphasizes ACID properties, recovery mechanisms, and the science of making systems reliable. Use when building databases, payment systems, or any system where data integrity is paramount.
---
# Jim Gray Style Guide
## Overview
Jim Gray (1944–2007) was the father of transaction processing. He formalized ACID properties, invented key recovery algorithms, pioneered database benchmarking (TPC), and advanced our understanding of fault tolerance. Turing Award winner (1998). His work underpins every reliable database and financial system in existence.
## Core Philosophy
> "A transaction is a transformation of state that has the properties of atomicity, consistency, isolation, and durability."
> "Simplicity does not precede complexity, but follows it."
> "The key to performance is elegance, not battalions of special cases."
## Design Principles
1. **ACID is Non-Negotiable**: For critical data, atomicity, consistency, isolation, and durability are requirements, not optimizations.
2. **Failures are Certain**: Hardware fails, software has bugs, operators make mistakes. Design for recovery, not just operation.
3. **Measure Everything**: You can't improve what you can't measure. Benchmarks reveal truth.
4. **Modularity Enables Reliability**: Separate concerns cleanly. Each component should be independently testable and replaceable.
5. **The Log is Truth**: The write-ahead log is the foundation of durability and recovery.
## The ACID Properties
```text
ATOMICITY: All or nothing. A transaction either completes entirely or has no effect.
CONSISTENCY: Transactions transform the database from one valid state to another.
ISOLATION: Concurrent transactions appear to execute serially.
DURABILITY: Once committed, data survives any subsequent failure.
```
### Isolation Levels (from weakest to strongest)
```text
READ UNCOMMITTED: See uncommitted changes (dirty reads possible)
READ COMMITTED: Only see committed changes (non-repeatable reads possible)
REPEATABLE READ: Same query returns same rows (phantom reads possible)
SERIALIZABLE: Full isolation (no anomalies, but limits concurrency)
```
## When Designing Systems
### Always
- Use write-ahead logging (WAL) for durability
- Design idempotent operations where possible
- Plan for crash recovery from the start
- Test failure scenarios explicitly
- Measure transaction throughput AND latency
- Document consistency guarantees clearly
- Use timeouts on all distributed operations
### Never
- Assume commits are durable without fsync
- Mix transaction boundaries with business logic haphazardly
- Ignore the difference between isolation levels
- Design recovery as an afterthought
- Trust in-memory state without persistence guarantees
- Assume network operations will succeed
### Prefer
- Pessimistic locking over optimistic when conflicts are common
- Shorter transactions over longer ones
- Explicit transaction boundaries over implicit
- Simple recovery mechanisms over clever optimizations
- Proven algorithms over novel approaches for critical paths
## Key Concepts
### Write-Ahead Logging (WAL)
```text
The fundamental rule: WRITE THE LOG BEFORE THE DATA
1. Before modifying data, write the intended change to the log
2. Ensure the log record is durable (fsync)
3. Only then apply the change to the data pages
4. Periodically checkpoint (flush dirty pages, truncate log)
Recovery:
1. Read the log from last checkpoint
2. REDO all committed transactions
3. UNDO all uncommitted transactions
```
### The Five-Minute Rule (1987, updated over time)
```text
Original insight: There's a break-even point for caching
If data is accessed more frequently than once per break-even interval,
keep it in memory. Otherwise, fetch from disk.
The rule: Break-even interval ≈ (Price per MB of disk) / (Price per MB of RAM × disk accesses/sec)
In 1987: ~5 minutes
In 2007: Still ~5 minutes (both got cheaper proportionally)
Today: SSD changes the math, but the principle remains
```
### Transaction States
```text
┌─────────────────────────┐
▼ │
┌──────────┐ ┌──────────┐ │
│ ACTIVE │───▶│ PARTIALLY│────┘
└──────────┘ │ COMMITTED│
│ └──────────┘
│ │
▼ ▼
┌──────────┐ ┌──────────┐
│ FAILED │ │COMMITTED │
└──────────┘ └──────────┘
│
▼
┌──────────┐
│ ABORTED │
└──────────┘
```
### Two-Phase Commit (2PC)
```text
Coordinator Participants
│ │
│──── PREPARE ────────────────▶│
│ │ (write to log, lock resources)
│◀─── VOTE (YES/NO) ──────────│
│ │
│ (if all YES) │
│──── COMMIT ─────────────────▶│
│ │ (commit, release locks)
│◀─── ACK ────────────────────│
│ │
If any participant votes NO, or timeout: ABORT all.
```
## Code Patterns
### Implementing WAL in Principle
```python
from dataclasses import dataclass
from enum import Enum
from typing import Any
import os
class LogRecordType(Enum):
BEGIN = "BEGIN"
UPDATE = "UPDATE"
COMMIT = "COMMIT"
ABORT = "ABORT"
CHECKPOINT = "CHECKPOINT"
@dataclass
class LogRecord:
lsn: int # Log Sequence Number
txn_id: int
record_type: LogRecordType
table: str = ""
key: Any = None
before_value: Any = None # For UNDO
after_value: Any = None # For REDO
class WriteAheadLog:
"""
Jim Gray's WAL protocol: Log before data.
"""
def __init__(self, log_path: str):
self.log_path = log_path
self.lsn = 0
self.log_file = open(log_path, 'a+b')
def append(self, record: LogRecord) -> int:
"""Append record to log and force to disk."""
self.lsn += 1
record.lsn = self.lsn
# Serialize and write
data = self._serialize(record)
self.log_file.write(data)
# CRITICAL: Force to stable storage
self.log_file.flush()
os.fsync(self.log_file.fileno())
return self.lsn
def _serialize(self, record: LogRecord) -> bytes:
# Implementation detail
pass
```
### Transaction Manager Pattern
```python
from contextlib import contextmanager
from threading import Lock
from typing import Generator
class TransactionManager:
"""
Manages transaction lifecycle with ACID guarantees.
"""
def __init__(self, wal: WriteAheadLog, storage: Storage):
self.wal = wal
self.storage = storage
self.active_txns: dict[int, Transaction] = {}
self.lock = Lock()
self.next_txn_id = 0
@contextmanager
def transaction(self) -> Generator[Transaction, None, None]:
"""
Context manager for transaction scope.
Usage:
with tm.transaction() as txn:
txn.update('accounts', 'alice', balance=100)
txn.update('accounts', 'bob', balance=200)
# Auto-commit on success, auto-abort on exception
"""
txn = self._begin()
try:
yield txn
self._commit(txn)
except Exception:
self._abort(txn)
raise
def _begin(self) -> Transaction:
with self.lock:
txn_id = self.next_txn_id
self.next_txn_id += 1
# Log the begin FIRST
self.wal.append(LogRecord(
lsn=0, txn_id=txn_id, record_type=LogRecordType.BEGIN
))
txn = Transaction(txn_id, self.wal, self.storage)
self.active_txns[txn_id] = txn
return txn
def _commit(self, txn: Transaction) -> None:
# Log commit record
self.wal.append(LogRecord(
lsn=0, txn_id=txn.txn_id, record_type=LogRecordType.COMMIT
))
# Release locks, clean up
txn.release_locks()
del self.active_txns[txn.txn_id]
def _abort(self, txn: Transaction) -> None:
# UNDO all changes using log records
txn.rollback()
self.wal.append(LogRecord(
lsn=0, txn_id=txn.txn_id, record_type=LogRecordType.ABORT
))
txn.release_locks()
del self.active_txns[txn.txn_id]
```
### Idempotent Operations
```python
from hashlib import sha256
from datetime import datetime, timedelta
class IdempotentExecutor:
"""
Ensure operations execute exactly once, even with retries.
Gray's insight: Idempotency transforms "at-least-once"
into "exactly-once" semantics.
"""
def __init__(self, storage):
self.storage = storage
self.executed_ops: dict[str, tuple[datetime, Any]] = {}
def execute(
self,
idempotency_key: str,
operation: callable,
ttl: timedelta = timedelta(hours=24)
) -> Any:
"""
Execute operation exactly once for given key.
"""
# Check if already executed
if idempotency_key in self.executed_ops:
timestamp, result = self.executed_ops[idempotency_key]
if datetime.now() - timestamp < ttl:
return result # Return cached result
# Execute and store result
result = operation()
self.executed_ops[idempotency_key] = (datetime.now(), result)
return result
```
## Mental Model
Jim Gray approached systems as a scientist:
1. **Define the invariants**: What must always be true?
2. **Identify failure modes**: What can go wrong?
3. **Design recovery**: How do we get back to a valid state?
4. **Measure and benchmark**: Quantify performance precisely
5. **Simplify**: Remove complexity until it breaks, then add back only what's needed
### The Failure Model
```text
Types of failures (increasingly severe):
1. Transaction failure → Abort and undo
2. System failure → Restart and recover from log
3. Media failure → Restore from backup + log
4. Disaster → Failover to remote site
Design for all of them.
```
## Warning Signs
You're violating Gray's principles if:
- You don't know your system's durability guarantees
- Transactions span user think time (long-held locks)
- Recovery is "we'll figure it out if it happens"
- You can't explain your isolation level choice
- You assume fsync is called when it isn't
- Your benchmarks don't include failure scenarios
## Additional Resources
- For detailed philosophy, see [philosophy.md](philosophy.md)
- For references (papers, books), see [references.md](references.md)
This skill encodes Jim Gray's transaction-processing and fault-tolerance principles into practical design guidance and Python patterns. It focuses on delivering ACID guarantees, robust recovery, and measurable reliability for critical data systems. Use it to design databases, payment processors, or any system where data integrity and predictable failure behavior are mandatory.
The skill inspects system designs and suggests concrete mechanisms: write-ahead logging (WAL), explicit transaction boundaries, idempotency, and recovery procedures like REDO/UNDO from log checkpoints. It provides patterns for a transaction manager, WAL append/flush semantics, two-phase commit coordination, and idempotent executors with TTL-based caching. It emphasizes measurement, failure testing, and clear documentation of consistency guarantees.
Do I always need WAL and fsync for durability?
For critical durability guarantees, yes: write-ahead logging plus an explicit fsync (or equivalent stable storage flush) is required. Decide trade-offs only after documenting the risk.
When should I prefer pessimistic locking over optimistic?
Prefer pessimistic locking when conflicts are common or the cost of retries is high. Use optimistic concurrency for low-conflict workloads to improve throughput.