---
name: Backfill and Reconciliation Playbook
description: Playbook for backfilling historical data and reconciling inconsistencies between data sources
---
# Backfill and Reconciliation Playbook
## Overview
A playbook for backfilling historical data and reconciling (detecting and fixing) inconsistencies between data sources.
## Why This Matters
- **Data completeness**: No gaps in data
- **Consistency**: Data matches across all sources
- **Trust**: Confidence that data is accurate
- **Recovery**: Can fix issues when they occur
---
## Backfill Strategy
### 1. Identify Gap
```sql
-- Find missing dates
WITH date_range AS (
  SELECT generate_series(
    '2024-01-01'::date,
    CURRENT_DATE,
    '1 day'::interval
  )::date AS date
)
SELECT dr.date
FROM date_range dr
LEFT JOIN analytics.daily_metrics dm ON dr.date = dm.date
WHERE dm.date IS NULL
ORDER BY dr.date;
-- Output: Missing dates that need backfill
```
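The missing dates returned by the query above are often scattered across several gaps. Before launching backfills it helps to collapse them into contiguous ranges so each range becomes one backfill run. A minimal sketch (the input list is assumed to come from the gap query):

```python
from datetime import date, timedelta

def group_into_ranges(missing_dates):
    """Collapse a sorted list of dates into contiguous (start, end) ranges."""
    ranges = []
    for d in sorted(missing_dates):
        if ranges and d == ranges[-1][1] + timedelta(days=1):
            ranges[-1] = (ranges[-1][0], d)  # extend the current range
        else:
            ranges.append((d, d))            # start a new range
    return ranges

# Example: two gaps -> two ranges (Jan 3-4 and Jan 10)
gaps = [date(2024, 1, 3), date(2024, 1, 4), date(2024, 1, 10)]
print(group_into_ranges(gaps))
```

Each resulting `(start, end)` pair can then be fed directly into the batch backfill below.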
### 2. Backfill in Batches
```python
from datetime import date, timedelta
import time

def backfill_data(start_date: date, end_date: date, batch_size: int = 7):
    """Backfill data in batches to avoid overwhelming the system."""
    current = start_date
    while current <= end_date:
        # batch covers `batch_size` days inclusive of `current`
        batch_end = min(current + timedelta(days=batch_size - 1), end_date)
        print(f"Backfilling {current} to {batch_end}")

        # Run backfill for this batch
        run_dbt_model(
            model='daily_metrics',
            vars={'start_date': current, 'end_date': batch_end}
        )

        # Verify before moving to the next batch
        verify_backfill(current, batch_end)

        current = batch_end + timedelta(days=1)
        time.sleep(60)  # Rate limit between batches

# Example
backfill_data(date(2024, 1, 1), date(2024, 1, 31))
```
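The `verify_backfill` helper used above is not defined in this playbook. One possible sketch: count the distinct dates now present in the target table and fail loudly if any day in the batch is still empty. Here `count_days` is an assumed callable that in practice would wrap a `COUNT(DISTINCT date)` query against the warehouse:

```python
from datetime import date

def verify_backfill(start_date: date, end_date: date, count_days) -> bool:
    """Verify every date in [start_date, end_date] now has data.

    `count_days(start, end)` returns the number of distinct dates loaded,
    e.g. by wrapping:
        SELECT COUNT(DISTINCT date) FROM analytics.daily_metrics
        WHERE date BETWEEN %(start)s AND %(end)s
    """
    expected = (end_date - start_date).days + 1
    actual = count_days(start_date, end_date)
    if actual < expected:
        raise RuntimeError(
            f"Backfill incomplete: {actual}/{expected} days present "
            f"between {start_date} and {end_date}"
        )
    return True
```

Raising on an incomplete batch stops the loop before later batches mask the gap.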
---
## Reconciliation
### Compare Sources
```python
import pandas as pd

def reconcile_sources(source_a: str, source_b: str, key: str):
    """Compare two data sources and report discrepancies."""
    # Get data from both sources
    df_a = pd.DataFrame(db_a.query(f"SELECT * FROM {source_a}"))
    df_b = pd.DataFrame(db_b.query(f"SELECT * FROM {source_b}"))

    # Full outer join on the key, tagging each row's origin
    merged = df_a.merge(df_b, on=key, how='outer',
                        indicator=True, suffixes=('_a', '_b'))

    # Records only in A
    only_a = merged[merged['_merge'] == 'left_only']
    print(f"Only in {source_a}: {len(only_a)} records")

    # Records only in B
    only_b = merged[merged['_merge'] == 'right_only']
    print(f"Only in {source_b}: {len(only_b)} records")

    # Records in both, but with differing column values
    both = merged[merged['_merge'] == 'both']
    for col in df_a.columns:
        if col == key:
            continue
        col_a, col_b = f"{col}_a", f"{col}_b"
        if col_a in both.columns and col_b in both.columns:
            # NaN != NaN is True, so exclude rows where both sides
            # are null or they would count as mismatches
            diff = both[(both[col_a] != both[col_b])
                        & ~(both[col_a].isna() & both[col_b].isna())]
            if len(diff) > 0:
                print(f"Column {col}: {len(diff)} mismatches")
    return merged

# Example
discrepancies = reconcile_sources('users', 'users_replica', 'user_id')
```
---
## Reconciliation Patterns
### Pattern 1: Count Reconciliation
```sql
-- Compare counts
SELECT
  'source_a' AS source,
  COUNT(*)   AS count
FROM source_a.users
WHERE created_at::date = '2024-01-16'
UNION ALL
SELECT
  'source_b',
  COUNT(*)
FROM source_b.users
WHERE created_at::date = '2024-01-16';
-- Expected: Counts should match
```
### Pattern 2: Sum Reconciliation
```sql
-- Compare aggregates
SELECT
  'orders'    AS source,
  SUM(amount) AS total_amount,
  COUNT(*)    AS order_count
FROM orders
WHERE order_date = '2024-01-16'
UNION ALL
SELECT
  'billing',
  SUM(charge_amount),
  COUNT(*)
FROM billing_records
WHERE charge_date = '2024-01-16';
-- Expected: Totals should match
```
### Pattern 3: Key Reconciliation
```sql
-- Find missing keys
SELECT user_id
FROM source_a.users
WHERE created_at::date = '2024-01-16'
EXCEPT
SELECT user_id
FROM source_b.users
WHERE created_at::date = '2024-01-16';
-- Output: user_ids in A but not in B
```
---
## Automated Reconciliation
```python
# Daily reconciliation job
import schedule

def daily_reconciliation():
    """Run daily reconciliation checks."""
    checks = [
        {
            'name': 'user_count',
            'query_a': 'SELECT COUNT(*) FROM users',
            'query_b': 'SELECT COUNT(*) FROM users_replica',
            'tolerance': 0      # Must match exactly
        },
        {
            'name': 'order_total',
            'query_a': 'SELECT SUM(amount) FROM orders WHERE date = CURRENT_DATE',
            'query_b': 'SELECT SUM(amount) FROM billing WHERE date = CURRENT_DATE',
            'tolerance': 0.01   # Allow up to 1% relative difference
        }
    ]

    for check in checks:
        result_a = db_a.query(check['query_a'])[0][0]
        result_b = db_b.query(check['query_b'])[0][0]

        diff = abs(result_a - result_b)
        diff_pct = diff / result_a if result_a > 0 else 0

        if diff_pct > check['tolerance']:
            alert(f"Reconciliation failed: {check['name']}")
            alert(f"Source A: {result_a}, Source B: {result_b}, Diff: {diff_pct:.2%}")
        else:
            print(f"✓ {check['name']} reconciled")

# Schedule
schedule.every().day.at("06:00").do(daily_reconciliation)
```
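A failing check can sometimes be repaired automatically when the fix is additive and idempotent: copy only the rows that exist in the source but not the replica, keyed on the primary key, so re-running the fix never duplicates data. A hedged sketch (the tables `users`/`users_replica`, the key `user_id`, and the `db` connection handle are illustrative assumptions):

```python
def auto_fix_missing_rows(missing_ids, dry_run=True):
    """Copy rows present in the source but absent from the replica.

    Idempotent: the NOT EXISTS anti-join means re-running the INSERT
    adds nothing new. Destructive fixes (deletes, overwrites) should
    stay manual.
    """
    if dry_run:
        print(f"Would copy {len(missing_ids)} rows")
        return 0
    db.execute(
        """
        INSERT INTO users_replica
        SELECT s.* FROM users s
        WHERE s.user_id = ANY(%s)
          AND NOT EXISTS (
            SELECT 1 FROM users_replica r WHERE r.user_id = s.user_id
          )
        """,
        (list(missing_ids),),
    )
    return len(missing_ids)
```

Defaulting to `dry_run=True` keeps the job observable first; the write path is enabled only after the discrepancy report has been reviewed.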
---
## Summary
**Backfill:** Fill in historical data
**Process:**
1. Identify gaps
2. Backfill in batches
3. Verify completeness
**Reconciliation:** Verify consistency across sources
**Patterns:**
- Count reconciliation
- Sum reconciliation
- Key reconciliation
**Automation:**
- Daily reconciliation jobs
- Alerts on discrepancies
- Auto-fix (if safe)
**Best Practices:**
- Batch processing
- Rate limiting
- Verification
- Audit logging
This skill is a practical playbook for backfilling historical data and reconciling inconsistencies across data sources. It provides step-by-step patterns, code snippets, and automation templates to identify gaps, run controlled backfills, and detect or fix mismatches. The playbook focuses on repeatable, safe operations to restore completeness and preserve trust in analytical results.
The playbook first identifies missing records or date ranges using lightweight queries and date-series generation. It prescribes batch backfills with rate limiting and verification steps to avoid overwhelming systems. For reconciliation, it compares sources by key, counts, and aggregates, producing discrepancy reports and supporting automated daily checks with alerting and optional auto-fix when safe.
**How large should backfill batches be?**
Choose a batch size that balances throughput and system load; start small (daily or weekly) and increase after monitoring resource impact.

**When is auto-fix safe?**
Auto-fix is safe when the fix is deterministic, idempotent, and non-destructive; always require stricter checks and a rollback plan for financial data.