
This skill guides PostgreSQL/MySQL sharding with hash, range, and directory strategies to improve throughput and multi-tenant isolation.

```bash
npx playbooks add skill secondsky/claude-skills --skill database-sharding
```


---
name: database-sharding
description: Database sharding for PostgreSQL/MySQL with hash/range/directory strategies. Use for horizontal scaling, multi-tenant isolation, billions of records, or when hitting wrong shard keys, hotspots, cross-shard transactions, or rebalancing issues.
keywords: database sharding, horizontal partitioning, shard key, consistent hashing,
  hash sharding, range sharding, directory sharding, cross-shard queries, shard rebalancing,
  scatter-gather pattern, multi-tenant sharding, shard router, distributed database,
  database scalability, shard migration, hotspot shards, shard key selection,
  cross-shard aggregation, two-phase commit, saga pattern, virtual shards
license: MIT
---

# database-sharding

Comprehensive database sharding patterns for horizontal scaling with hash, range, and directory-based strategies.

---

## Quick Start (10 Minutes)

**Step 1**: Choose sharding strategy from templates:
```bash
# Hash-based (even distribution)
cat templates/hash-router.ts

# Range-based (time-series data)
cat templates/range-router.ts

# Directory-based (multi-tenancy)
cat templates/directory-router.ts
```

**Step 2**: Vet your shard key against these selection criteria (see the validation sketch below):
- ✅ **High cardinality** (millions of unique values)
- ✅ **Even distribution** (no single value > 5%)
- ✅ **Immutable** (never changes)
- ✅ **Query alignment** (in 80%+ of WHERE clauses)
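
The cardinality and skew criteria can be measured directly against production data before you commit to a key. A minimal sketch, assuming the `pg` driver and a hypothetical `orders` table with `user_id` as the candidate key:

```typescript
import { Pool } from 'pg';

// Hypothetical connection; substitute your own config.
const db = new Pool({ host: 'db.example.com', database: 'app' });

// Identifiers are interpolated for brevity; validate them in real code.
async function checkShardKey(table: string, column: string) {
  // High cardinality: distinct values should far exceed the shard count.
  const { rows: [card] } = await db.query(
    `SELECT COUNT(DISTINCT ${column}) AS n FROM ${table}`
  );

  // Even distribution: no single value should exceed ~5% of rows.
  const { rows: [top] } = await db.query(
    `SELECT ${column} AS value,
            COUNT(*)::float / SUM(COUNT(*)) OVER () AS share
       FROM ${table} GROUP BY ${column}
      ORDER BY share DESC LIMIT 1`
  );

  console.log(`distinct values: ${card.n}`);
  console.log(`largest single value: ${(top.share * 100).toFixed(2)}% of rows`);
}

await checkShardKey('orders', 'user_id');
```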

**Step 3**: Implement router:
```typescript
import { HashRouter } from './hash-router';

const router = new HashRouter([
  { id: 'shard_0', connection: { host: 'db0.example.com' } },
  { id: 'shard_1', connection: { host: 'db1.example.com' } },
  { id: 'shard_2', connection: { host: 'db2.example.com' } },
  { id: 'shard_3', connection: { host: 'db3.example.com' } },
]);

// Query single shard
const user = await router.query('user_123', 'SELECT * FROM users WHERE id = $1', ['user_123']);
```
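
The routing core of such a router is small enough to sketch. A minimal version of the idea, assuming the `pg` driver (the bundled template is the authoritative implementation):

```typescript
import { createHash } from 'crypto';
import { Pool, PoolConfig } from 'pg';

interface ShardConfig { id: string; connection: PoolConfig; }

class SimpleHashRouter {
  private pools: Pool[];

  constructor(private shards: ShardConfig[]) {
    this.pools = shards.map((s) => new Pool(s.connection));
  }

  // Hash the shard key to a shard index. Plain modulo is shown for clarity;
  // production routers should use consistent hashing (see Error 4 below).
  private shardFor(key: string): number {
    const digest = createHash('md5').update(key).digest();
    return digest.readUInt32BE(0) % this.shards.length;
  }

  async query(shardKey: string, sql: string, params: unknown[] = []) {
    return this.pools[this.shardFor(shardKey)].query(sql, params);
  }
}
```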

---

## Critical Rules

### ✓ Always Do

| Rule | Reason |
|------|--------|
| **Include shard key in queries** | Avoid scanning all shards (100x slower) |
| **Monitor shard distribution** | Detect hotspots before they cause outages |
| **Plan for rebalancing upfront** | Cannot easily add shards later |
| **Choose immutable shard key** | Changing key = data migration nightmare |
| **Test distribution with production data** | Synthetic data hides real hotspots |
| **Denormalize for data locality** | Keep related data on same shard |

### ✗ Never Do

| Anti-Pattern | Why It's Bad |
|--------------|--------------|
| **Sequential ID with range sharding** | Latest shard gets all writes (hotspot) |
| **Timestamp as shard key** | Recent shard overwhelmed |
| **Cross-shard transactions without 2PC** | Data corruption, inconsistency |
| **Simple modulo without consistent hashing** | Cannot add shards without full re-shard |
| **Nullable shard key** | Special NULL handling creates hotspots |
| **No shard routing layer** | Hardcoded shards = cannot rebalance |

---

## Top 7 Critical Errors

### Error 1: Wrong Shard Key Choice (Hotspots)
**Symptom**: One shard receives 80%+ of traffic
**Fix**:
```typescript
// ❌ Bad: Low cardinality (status field)
shard_key = order.status; // 90% are 'pending' → shard_0 overloaded

// ✅ Good: High cardinality (user_id)
shard_key = order.user_id; // Millions of users, even distribution
```

### Error 2: Missing Shard Key in Queries
**Symptom**: Queries scan ALL shards (extremely slow)
**Fix**:
```sql
-- ❌ Bad: No shard key
SELECT * FROM orders WHERE status = 'shipped'; -- Scans all 100 shards!

-- ✅ Good: Include shard key
SELECT * FROM orders WHERE user_id = ? AND status = 'shipped'; -- Targets 1 shard
```

### Error 3: Sequential IDs with Range Sharding
**Symptom**: Latest shard gets all writes
**Fix**:
```typescript
// ❌ Bad: Range sharding with auto-increment
// Shard 0: 1-1M, Shard 1: 1M-2M, Shard 2: 2M+ → All new writes to Shard 2!

// ✅ Good: Hash-based sharding
const shardId = hash(id) % shardCount; // Even distribution
```

### Error 4: No Rebalancing Strategy
**Symptom**: Stuck with initial shard count, cannot scale
**Fix**:
```typescript
// ❌ Bad: Simple modulo
const shardId = hash(key) % shardCount; // Adding 5th shard breaks ALL keys

// ✅ Good: Consistent hashing
const ring = new ConsistentHashRing(shards);
const shardId = ring.getNode(key); // Only ~1/N of keys move when adding a shard
```
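
A minimal sketch of what such a ring might look like, using virtual nodes for smoother distribution (MD5 placement is an assumption; any well-distributed hash works):

```typescript
import { createHash } from 'crypto';

class ConsistentHashRing {
  private ring: { hash: number; node: string }[] = [];

  // vnodes: virtual nodes per physical shard, smoothing the distribution.
  constructor(nodes: string[], private vnodes = 128) {
    for (const node of nodes) this.addNode(node);
  }

  private hash(key: string): number {
    return createHash('md5').update(key).digest().readUInt32BE(0);
  }

  addNode(node: string) {
    for (let i = 0; i < this.vnodes; i++) {
      this.ring.push({ hash: this.hash(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.hash - b.hash);
  }

  // Binary-search for the first virtual node at or after the key's hash,
  // wrapping to the start of the ring if none is found.
  getNode(key: string): string {
    const h = this.hash(key);
    let lo = 0, hi = this.ring.length - 1, ans = 0;
    while (lo <= hi) {
      const mid = (lo + hi) >> 1;
      if (this.ring[mid].hash >= h) { ans = mid; hi = mid - 1; }
      else lo = mid + 1;
    }
    return this.ring[ans].node;
  }
}
```

Because only the ring segment between a new node's virtual points and their predecessors is remapped, adding a shard moves roughly 1/N of the keyspace.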

### Error 5: Cross-Shard Transactions
**Symptom**: Data inconsistency, partial writes
**Fix**:
```sql
-- ❌ Bad: Cross-shard transaction (will corrupt)
BEGIN;
UPDATE shard_1.accounts SET balance = balance - 100 WHERE id = 'A';
UPDATE shard_2.accounts SET balance = balance + 100 WHERE id = 'B';
COMMIT; -- If shard_2 fails, shard_1 already committed!
```

```typescript
// ✅ Good: Two-Phase Commit or Saga pattern
const txn = new TwoPhaseCommitTransaction();
txn.addOperation(shard_1, 'UPDATE accounts SET balance = balance - 100 WHERE id = ?', ['A']);
txn.addOperation(shard_2, 'UPDATE accounts SET balance = balance + 100 WHERE id = ?', ['B']);
await txn.execute(); // Atomic across shards
```
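
Where a 2PC coordinator is unavailable or too heavyweight, a saga trades atomicity for eventual consistency by compensating completed steps on failure. A minimal sketch (the `debit`/`credit` helpers and `shard1`/`shard2` handles are hypothetical):

```typescript
interface SagaStep {
  execute: () => Promise<void>;
  compensate: () => Promise<void>; // undoes execute's effect
}

async function runSaga(steps: SagaStep[]) {
  const completed: SagaStep[] = [];
  try {
    for (const step of steps) {
      await step.execute();
      completed.push(step);
    }
  } catch (err) {
    // Roll back whatever already succeeded, in reverse order.
    for (const step of completed.reverse()) {
      await step.compensate();
    }
    throw err;
  }
}

// Debit on one shard, credit on another; each step can be undone.
await runSaga([
  { execute: () => debit(shard1, 'A', 100), compensate: () => credit(shard1, 'A', 100) },
  { execute: () => credit(shard2, 'B', 100), compensate: () => debit(shard2, 'B', 100) },
]);
```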

### Error 6: Mutable Shard Key
**Symptom**: Records move shards, causing duplicates
**Fix**:
```typescript
// ❌ Bad: Shard by country (user relocates)
shard_key = user.country; // User moves US → CA, now in different shard!

// ✅ Good: Shard by immutable user_id
shard_key = user.id; // Never changes
```

### Error 7: No Monitoring
**Symptom**: Silent hotspots, sudden performance degradation
**Fix**: Track these metrics:
- Per-shard record counts (shards should stay within 20% of each other)
- Query distribution (no shard handling more than 40% of queries)
- Storage per shard (alert at 80% capacity)
- p99 latency per shard
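
These checks can be automated. A minimal sketch of a record-count balance check, assuming one `pg` pool per shard and a hypothetical `users` table:

```typescript
import { Pool } from 'pg';

async function checkDistribution(pools: Map<string, Pool>) {
  const counts = new Map<string, number>();
  for (const [id, pool] of pools) {
    const { rows } = await pool.query('SELECT COUNT(*) AS n FROM users');
    counts.set(id, Number(rows[0].n));
  }
  const values = [...counts.values()];
  // Alert when the largest shard holds 20%+ more rows than the smallest.
  if (Math.max(...values) > Math.min(...values) * 1.2) {
    console.warn('shard imbalance detected', Object.fromEntries(counts));
  }
}
```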

**Load** `references/error-catalog.md` for all 10 errors with detailed fixes.

---

## Sharding Strategies

| Strategy | Best For | Pros | Cons |
|----------|----------|------|------|
| **Hash** | User data, even load critical | No hotspots, predictable | Range queries scatter |
| **Range** | Time-series, logs, append-only | Range queries efficient, archival | Recent shard hotspot |
| **Directory** | Multi-tenancy, complex routing | Flexible, easy rebalancing | Lookup overhead, SPOF |

**Load** `references/sharding-strategies.md` for detailed comparisons with production examples (Instagram, Discord, Salesforce).

---

## Shard Key Selection Criteria

| Criterion | Importance | Check Method |
|-----------|------------|--------------|
| **High cardinality** | Critical | `COUNT(DISTINCT shard_key)` > shard_count × 100 |
| **Even distribution** | Critical | No value > 5% of total |
| **Immutable** | Critical | Value never changes |
| **Query alignment** | High | 80%+ queries include it |
| **Data locality** | Medium | Related records together |

**Decision Tree**:
- User-focused app → `user_id`
- Multi-tenant SaaS → `tenant_id`
- Time-series/logs → `timestamp` (range sharding)
- Product catalog → `product_id`

**Load** `references/shard-key-selection.md` for comprehensive decision trees and testing strategies.

---

## Configuration Summary

### Hash-Based Router

```typescript
import { HashRouter } from './templates/hash-router';

const router = new HashRouter([
  { id: 'shard_0', connection: { /* PostgreSQL config */ } },
  { id: 'shard_1', connection: { /* PostgreSQL config */ } },
]);

// Automatically routes to correct shard
const user = await router.query('user_123', 'SELECT * FROM users WHERE id = $1', ['user_123']);
```

### Range-Based Router

```typescript
import { RangeRouter } from './templates/range-router';

const router = new RangeRouter(shardConfigs, [
  { start: Date.parse('2024-01-01'), end: Date.parse('2024-04-01'), shardId: 'shard_q1' },
  { start: Date.parse('2024-04-01'), end: Date.parse('2024-07-01'), shardId: 'shard_q2' },
  { start: Date.parse('2024-07-01'), end: Infinity, shardId: 'shard_q3' },
]);

// Range queries target specific shards
const janEvents = await router.queryRange(
  Date.parse('2024-01-01'),
  Date.parse('2024-02-01'),
  'SELECT * FROM events WHERE created_at BETWEEN $1 AND $2'
);
```
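
Resolving which shards a range query touches is an interval-overlap test. A sketch of that lookup in isolation (the `ranges` array mirrors the configuration above):

```typescript
interface ShardRange { start: number; end: number; shardId: string; }

// Return every shard whose [start, end) interval overlaps [from, to).
function shardsForRange(ranges: ShardRange[], from: number, to: number): string[] {
  return ranges
    .filter((r) => r.start < to && r.end > from)
    .map((r) => r.shardId);
}

// The January query above would resolve to just ['shard_q1'].
```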

### Directory-Based Router

```typescript
import { DirectoryRouter } from './templates/directory-router';

const router = new DirectoryRouter(directoryDBConfig, shardConfigs);

// Assign tenant to specific shard
await router.assignShard('tenant_acme', 'shard_enterprise');

// Route automatically
const users = await router.query('tenant_acme', 'SELECT * FROM users');
```
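
Under the hood, a directory router is a lookup table plus a cache. A minimal sketch of the mapping layer (the `shard_map` schema and cache policy are assumptions):

```typescript
import { Pool } from 'pg';

// Directory schema, kept on its own highly available database:
//   CREATE TABLE shard_map (tenant_id TEXT PRIMARY KEY, shard_id TEXT NOT NULL);

class DirectoryLookup {
  private cache = new Map<string, string>();

  constructor(private directory: Pool) {}

  async shardFor(tenantId: string): Promise<string> {
    const cached = this.cache.get(tenantId);
    if (cached) return cached;
    const { rows } = await this.directory.query(
      'SELECT shard_id FROM shard_map WHERE tenant_id = $1',
      [tenantId],
    );
    if (rows.length === 0) throw new Error(`no shard assigned for ${tenantId}`);
    this.cache.set(tenantId, rows[0].shard_id);
    return rows[0].shard_id;
  }
}
```

Caching the mapping mitigates the lookup overhead and single-point-of-failure concerns noted in the strategy table.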

---

## When to Load References

### Choosing Strategy
**Load** `references/sharding-strategies.md` when:
- Deciding between hash, range, directory
- Need production examples (Instagram, Discord)
- Planning hybrid approaches

### Selecting Shard Key
**Load** `references/shard-key-selection.md` when:
- Choosing shard key for new project
- Evaluating existing shard key
- Testing distribution with production data

### Implementation
**Load** `references/implementation-patterns.md` when:
- Building shard router from scratch
- Implementing consistent hashing
- Need transaction handling (2PC, Saga)
- Setting up monitoring/metrics

### Cross-Shard Operations
**Load** `references/cross-shard-queries.md` when:
- Need to aggregate across shards (COUNT, SUM, AVG)
- Implementing cross-shard joins
- Building pagination across shards
- Optimizing scatter-gather patterns

### Rebalancing
**Load** `references/rebalancing-guide.md` when:
- Adding new shards
- Migrating data between shards
- Planning zero-downtime migrations
- Balancing uneven load

### Error Prevention
**Load** `references/error-catalog.md` when:
- Troubleshooting performance issues
- Reviewing shard architecture
- Wanting the full list of 10 documented errors with fixes

---

## Complete Setup Checklist

**Before Sharding**:
- [ ] Tested shard key distribution with production data
- [ ] Shard key in 80%+ of queries
- [ ] Monitoring infrastructure ready
- [ ] Rebalancing strategy planned

**Router Implementation**:
- [ ] Shard routing layer (not hardcoded shards)
- [ ] Connection pooling per shard
- [ ] Error handling and retries
- [ ] Metrics collection (queries/shard, latency)

**Shard Configuration**:
- [ ] 4-8 shards initially (room to grow)
- [ ] Consistent hashing or virtual shards
- [ ] Replicas per shard (HA)
- [ ] Backup strategy per shard

**Application Changes**:
- [ ] All queries include shard key
- [ ] Cross-shard joins eliminated (denormalized)
- [ ] Transaction boundaries respected
- [ ] Connection pooling configured

---

## Production Example

**Before** (Single database overwhelmed):
```typescript
import { Pool } from 'pg';

// Single PostgreSQL instance
const db = new Pool({ host: 'db.example.com' });

// All 10M users on one server
const users = await db.query('SELECT * FROM users WHERE status = $1', ['active']);
// Query time: 5000ms (slow!)
// DB CPU: 95%
// Disk: 500GB, growing
```

**After** (Sharded across 8 servers):
```typescript
import { HashRouter } from './templates/hash-router';

// Hash-based sharding with 8 shards
const router = new HashRouter([
  { id: 'shard_0', connection: { host: 'db0.example.com' } },
  { id: 'shard_1', connection: { host: 'db1.example.com' } },
  // ... 6 more shards
]);

// Query single user (targets 1 shard)
const user = await router.query('user_123', 'SELECT * FROM users WHERE id = $1', ['user_123']);
// Query time: 10ms (500x faster!)

// Query all shards (scatter-gather)
const allActive = await router.queryAll('SELECT * FROM users WHERE status = $1', ['active']);
// Query time: 800ms (parallelized across 8 shards, 6x faster than single)

// Result: Each shard handles ~1.25M users
// DB CPU per shard: 20%
// Disk per shard: 65GB
// Can scale to 16 shards easily (consistent hashing)
```
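
The `queryAll` call above is a scatter-gather: the same statement fans out to every shard in parallel and the per-shard results are merged. A minimal sketch (plain concatenation; ordering, LIMIT, and pagination need the patterns in `references/cross-shard-queries.md`):

```typescript
import { Pool } from 'pg';

// Scatter: run the query on every shard concurrently.
// Gather: flatten the per-shard rows into one result set.
async function queryAll(pools: Pool[], sql: string, params: unknown[] = []) {
  const results = await Promise.all(pools.map((pool) => pool.query(sql, params)));
  return results.flatMap((result) => result.rows);
}
```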

---

## Known Issues Prevention

All 10 documented errors prevented:
1. ✅ Wrong shard key (hotspots) → Test distribution first
2. ✅ Missing shard key in queries → Code review, linting
3. ✅ Cross-shard transactions → Use 2PC or Saga pattern
4. ✅ Sequential ID hotspots → Use hash-based sharding
5. ✅ No rebalancing strategy → Consistent hashing from day 1
6. ✅ Timestamp sharding hotspots → Hybrid hash+range approach
7. ✅ Mutable shard key → Choose immutable keys (user_id)
8. ✅ No routing layer → Abstract with router from start
9. ✅ No monitoring → Track per-shard metrics
10. ✅ Weak hash function → Use a hash with good distribution (MurmurHash3, xxHash, MD5)

**See**: `references/error-catalog.md` for detailed fixes

---

## Resources

**Templates**:
- `templates/hash-router.ts` - Hash-based sharding
- `templates/range-router.ts` - Range-based sharding
- `templates/directory-router.ts` - Directory-based sharding
- `templates/cross-shard-aggregation.ts` - Aggregation patterns

**References**:
- `references/sharding-strategies.md` - Strategy comparison
- `references/shard-key-selection.md` - Key selection guide
- `references/implementation-patterns.md` - Router implementations
- `references/cross-shard-queries.md` - Query patterns
- `references/rebalancing-guide.md` - Migration strategies
- `references/error-catalog.md` - All 10 errors documented

**Production Examples**:
- Instagram: Range sharding for media
- Discord: Hash sharding for messages
- Salesforce: Directory sharding for orgs

---

**Production-tested** | **10 errors prevented** | **MIT License**

## Overview

This skill provides production-ready database sharding patterns for PostgreSQL and MySQL using hash, range, and directory strategies. It bundles router templates, configuration guidance, and a checklist for scaling horizontally to billions of records with multi-tenant isolation. The content focuses on practical rules, common error fixes, and rebalancing strategies for reliable production use.

## How this skill works

The skill supplies router implementations (HashRouter, RangeRouter, DirectoryRouter) and templates that route queries to the correct shard based on a shard key, a time range, or a directory mapping. It helps you evaluate shard key characteristics, distribution metrics, and query alignment, and recommends consistent hashing plus two-phase commit or Saga patterns for cross-shard operations. It also includes monitoring and rebalancing guidance to prevent hotspots and enable safe scaling.

## When to use it

- You need horizontal scaling for billions of rows
- You run a multi-tenant SaaS and need tenant isolation
- You observe a single server or shard hitting CPU/disk limits
- You face hotspots, wrong shard keys, or uneven distribution
- You need a rebalancing plan before adding shards
- You must avoid cross-shard transaction failures

## Best practices

- Pick a high-cardinality, immutable shard key used in 80%+ of queries
- Always include the shard key in WHERE clauses to target one shard
- Plan and test rebalancing using consistent hashing or virtual shards
- Monitor per-shard metrics: record counts, query distribution, p99 latency, and storage
- Denormalize for data locality and avoid cross-shard joins when possible
- Use 2PC or Saga for safe cross-shard updates and retries

## Example use cases

- Hash sharding user data to evenly distribute load across 8+ servers
- Range sharding time-series logs for efficient range queries and archival
- Directory sharding to assign tenants to specific shards and support per-tenant scaling
- Detect and fix hotspots by swapping the shard key from timestamp to user_id or hash(id)
- Perform scatter-gather aggregation across shards for global reports with parallel queries

## FAQ

**How do I pick the right shard key?**

Choose a high-cardinality, immutable field that appears in most WHERE clauses; test distribution with production data and ensure no value exceeds ~5% of traffic.

**Can I add shards later without downtime?**

Use consistent hashing or virtual shards and a planned rebalancing process; simple modulo hashing will force massive reshuffles and likely downtime.