---
name: Backfill and Reconciliation Playbook
description: Playbook for backfilling historical data and reconciling inconsistencies between data sources
---
# Backfill and Reconciliation Playbook
## Overview
A playbook for backfilling historical data and reconciling (detecting and fixing) inconsistencies between data sources.
## Why This Matters
- **Data completeness**: No gaps in data
- **Consistency**: Data matches across all sources
- **Trust**: Confidence that data is accurate
- **Recovery**: Can fix issues when they occur
---
## Backfill Strategy
### 1. Identify Gap
```sql
-- Find missing dates
WITH date_range AS (
  SELECT generate_series(
    '2024-01-01'::date,
    CURRENT_DATE,
    '1 day'::interval
  )::date AS date
)
SELECT dr.date
FROM date_range dr
LEFT JOIN analytics.daily_metrics dm ON dr.date = dm.date
WHERE dm.date IS NULL
ORDER BY dr.date;
-- Output: Missing dates that need backfill
```
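The missing dates returned by the query above are often scattered across several gaps. Before launching backfills it helps to collapse them into contiguous ranges so each range becomes one backfill run. A minimal sketch (the input list is assumed to come from the gap query):

```python
from datetime import date, timedelta

def group_into_ranges(missing_dates):
    """Collapse a sorted list of dates into contiguous (start, end) ranges."""
    ranges = []
    for d in sorted(missing_dates):
        if ranges and d == ranges[-1][1] + timedelta(days=1):
            ranges[-1] = (ranges[-1][0], d)  # extend the current range
        else:
            ranges.append((d, d))            # start a new range
    return ranges

# Example: two gaps -> two ranges (Jan 3-4 and Jan 10)
gaps = [date(2024, 1, 3), date(2024, 1, 4), date(2024, 1, 10)]
print(group_into_ranges(gaps))
```

Each resulting `(start, end)` pair can then be fed directly into the batch backfill below.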
### 2. Backfill in Batches
```python
from datetime import date, timedelta
import time

def backfill_data(start_date: date, end_date: date, batch_size: int = 7):
    """Backfill data in batches to avoid overwhelming the system."""
    current = start_date
    while current <= end_date:
        # batch covers `batch_size` days inclusive of `current`
        batch_end = min(current + timedelta(days=batch_size - 1), end_date)
        print(f"Backfilling {current} to {batch_end}")

        # Run backfill for this batch
        run_dbt_model(
            model='daily_metrics',
            vars={'start_date': current, 'end_date': batch_end}
        )

        # Verify before moving to the next batch
        verify_backfill(current, batch_end)

        current = batch_end + timedelta(days=1)
        time.sleep(60)  # Rate limit between batches

# Example
backfill_data(date(2024, 1, 1), date(2024, 1, 31))
```
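The `verify_backfill` helper used above is not defined in this playbook. One possible sketch: count the distinct dates now present in the target table and fail loudly if any day in the batch is still empty. Here `count_days` is an assumed callable that in practice would wrap a `COUNT(DISTINCT date)` query against the warehouse:

```python
from datetime import date

def verify_backfill(start_date: date, end_date: date, count_days) -> bool:
    """Verify every date in [start_date, end_date] now has data.

    `count_days(start, end)` returns the number of distinct dates loaded,
    e.g. by wrapping:
        SELECT COUNT(DISTINCT date) FROM analytics.daily_metrics
        WHERE date BETWEEN %(start)s AND %(end)s
    """
    expected = (end_date - start_date).days + 1
    actual = count_days(start_date, end_date)
    if actual < expected:
        raise RuntimeError(
            f"Backfill incomplete: {actual}/{expected} days present "
            f"between {start_date} and {end_date}"
        )
    return True
```

Raising on an incomplete batch stops the loop before later batches mask the gap.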
---
## Reconciliation
### Compare Sources
```python
import pandas as pd

def reconcile_sources(source_a: str, source_b: str, key: str):
    """Compare two data sources and report discrepancies."""
    # Get data from both sources
    df_a = pd.DataFrame(db_a.query(f"SELECT * FROM {source_a}"))
    df_b = pd.DataFrame(db_b.query(f"SELECT * FROM {source_b}"))

    # Full outer join on the key, tagging each row's origin
    merged = df_a.merge(df_b, on=key, how='outer',
                        indicator=True, suffixes=('_a', '_b'))

    # Records only in A
    only_a = merged[merged['_merge'] == 'left_only']
    print(f"Only in {source_a}: {len(only_a)} records")

    # Records only in B
    only_b = merged[merged['_merge'] == 'right_only']
    print(f"Only in {source_b}: {len(only_b)} records")

    # Records in both, but with differing column values
    both = merged[merged['_merge'] == 'both']
    for col in df_a.columns:
        if col == key:
            continue
        col_a, col_b = f"{col}_a", f"{col}_b"
        if col_a in both.columns and col_b in both.columns:
            # NaN != NaN is True, so exclude rows where both sides
            # are null or they would count as mismatches
            diff = both[(both[col_a] != both[col_b])
                        & ~(both[col_a].isna() & both[col_b].isna())]
            if len(diff) > 0:
                print(f"Column {col}: {len(diff)} mismatches")
    return merged

# Example
discrepancies = reconcile_sources('users', 'users_replica', 'user_id')
```
---
## Reconciliation Patterns
### Pattern 1: Count Reconciliation
```sql
-- Compare counts
SELECT
  'source_a' AS source,
  COUNT(*)   AS count
FROM source_a.users
WHERE created_at::date = '2024-01-16'
UNION ALL
SELECT
  'source_b',
  COUNT(*)
FROM source_b.users
WHERE created_at::date = '2024-01-16';
-- Expected: Counts should match
```
### Pattern 2: Sum Reconciliation
```sql
-- Compare aggregates
SELECT
  'orders'    AS source,
  SUM(amount) AS total_amount,
  COUNT(*)    AS order_count
FROM orders
WHERE order_date = '2024-01-16'
UNION ALL
SELECT
  'billing',
  SUM(charge_amount),
  COUNT(*)
FROM billing_records
WHERE charge_date = '2024-01-16';
-- Expected: Totals should match
```
### Pattern 3: Key Reconciliation
```sql
-- Find missing keys
SELECT user_id
FROM source_a.users
WHERE created_at::date = '2024-01-16'
EXCEPT
SELECT user_id
FROM source_b.users
WHERE created_at::date = '2024-01-16';
-- Output: user_ids in A but not in B
```
---
## Automated Reconciliation
```python
# Daily reconciliation job
import schedule

def daily_reconciliation():
    """Run daily reconciliation checks."""
    checks = [
        {
            'name': 'user_count',
            'query_a': 'SELECT COUNT(*) FROM users',
            'query_b': 'SELECT COUNT(*) FROM users_replica',
            'tolerance': 0      # Must match exactly
        },
        {
            'name': 'order_total',
            'query_a': 'SELECT SUM(amount) FROM orders WHERE date = CURRENT_DATE',
            'query_b': 'SELECT SUM(amount) FROM billing WHERE date = CURRENT_DATE',
            'tolerance': 0.01   # Allow up to 1% relative difference
        }
    ]

    for check in checks:
        result_a = db_a.query(check['query_a'])[0][0]
        result_b = db_b.query(check['query_b'])[0][0]

        diff = abs(result_a - result_b)
        diff_pct = diff / result_a if result_a > 0 else 0

        if diff_pct > check['tolerance']:
            alert(f"Reconciliation failed: {check['name']}")
            alert(f"Source A: {result_a}, Source B: {result_b}, Diff: {diff_pct:.2%}")
        else:
            print(f"✓ {check['name']} reconciled")

# Schedule
schedule.every().day.at("06:00").do(daily_reconciliation)
```
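A failing check can sometimes be repaired automatically when the fix is additive and idempotent: copy only the rows that exist in the source but not the replica, keyed on the primary key, so re-running the fix never duplicates data. A hedged sketch (the tables `users`/`users_replica`, the key `user_id`, and the `db` connection handle are illustrative assumptions):

```python
def auto_fix_missing_rows(missing_ids, dry_run=True):
    """Copy rows present in the source but absent from the replica.

    Idempotent: the NOT EXISTS anti-join means re-running the INSERT
    adds nothing new. Destructive fixes (deletes, overwrites) should
    stay manual.
    """
    if dry_run:
        print(f"Would copy {len(missing_ids)} rows")
        return 0
    db.execute(
        """
        INSERT INTO users_replica
        SELECT s.* FROM users s
        WHERE s.user_id = ANY(%s)
          AND NOT EXISTS (
            SELECT 1 FROM users_replica r WHERE r.user_id = s.user_id
          )
        """,
        (list(missing_ids),),
    )
    return len(missing_ids)
```

Defaulting to `dry_run=True` keeps the job observable first; the write path is enabled only after the discrepancy report has been reviewed.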
---
## Summary
**Backfill:** Fill in historical data
**Process:**
1. Identify gaps
2. Backfill in batches
3. Verify completeness
**Reconciliation:** Verify consistency across sources
**Patterns:**
- Count reconciliation
- Sum reconciliation
- Key reconciliation
**Automation:**
- Daily reconciliation jobs
- Alerts on discrepancies
- Auto-fix (if safe)
**Best Practices:**
- Batch processing
- Rate limiting
- Verification
- Audit logging
This skill is a practical playbook for backfilling historical data and reconciling inconsistencies across data sources. It provides step-by-step patterns, code snippets, and automation templates to identify gaps, run controlled backfills, and detect or fix mismatches. The playbook focuses on repeatable, safe operations to restore completeness and preserve trust in analytical results.
The playbook first identifies missing records or date ranges using lightweight queries and date-series generation. It prescribes batch backfills with rate limiting and verification steps to avoid overwhelming systems. For reconciliation, it compares sources by key, counts, and aggregates, producing discrepancy reports and supporting automated daily checks with alerting and optional auto-fix when safe.
**How large should backfill batches be?**
Choose a batch size that balances throughput and system load; start small (daily or weekly) and increase after monitoring resource impact.

**When is auto-fix safe?**
Auto-fix is safe when the fix is deterministic, idempotent, and non-destructive; always require stricter checks and a rollback plan for financial data.