
incremental-fetch skill

/skills/incremental-fetch

This skill builds resilient data ingestion pipelines that track two watermarks so they can fetch both incremental and historical data without duplicates.

npx playbooks add skill rohunvora/cool-claude-skills --skill incremental-fetch

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
2.8 KB
---
name: incremental-fetch
description: "Build resilient data ingestion pipelines from APIs. Use when creating scripts that fetch paginated data from external APIs (Twitter, exchanges, any REST API) and need to track progress, avoid duplicates, handle rate limits, and support both incremental updates and historical backfills. Triggers: 'ingest data from API', 'pull tweets', 'fetch historical data', 'sync from X', 'build a data pipeline', 'fetch without re-downloading', 'resume the download', 'backfill older data'. NOT for: simple one-shot API calls, websocket/streaming connections, file downloads, or APIs without pagination."
---

# Incremental Fetch

Build data pipelines that never lose progress and never re-fetch existing data.

## The Two Watermarks Pattern

Track TWO cursors to support both forward and backward fetching:

| Watermark | Purpose | API Parameter |
|-----------|---------|---------------|
| `newest_id` | Fetch new data since last run | `since_id` |
| `oldest_id` | Backfill older data | `until_id` |

A single watermark only fetches forward. Two watermarks enable:
- Regular runs: fetch NEW data (since `newest_id`)
- Backfill runs: fetch OLD data (until `oldest_id`)
- No overlap, no gaps
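
A minimal sketch of where those two cursors could live, assuming SQLite; the table and column names are illustrative, not the schema from references/patterns.md:

```python
import sqlite3

# One possible shape for the watermark state (an assumption, not patterns.md).
conn = sqlite3.connect("pipeline.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS ingestion_state (
        source     TEXT PRIMARY KEY,  -- e.g. 'twitter:user_timeline'
        newest_id  INTEGER,           -- forward watermark -> since_id on update runs
        oldest_id  INTEGER,           -- backward watermark -> until_id on backfills
        updated_at TEXT
    )
""")
conn.commit()
```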

## Critical: Data vs Watermark Saving

These are different operations with different timing:

| What | When to Save | Why |
|------|--------------|-----|
| **Data records** | After EACH page | Resilience: interrupted on page 47? Keep 46 pages |
| **Watermarks** | ONCE at end of run | Correctness: only commit progress after full success |

```
fetch page 1 → save records → fetch page 2 → save records → ... → update watermarks
```
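
A minimal sketch of that loop, assuming numeric IDs and hypothetical helpers `fetch_page`, `save_records`, and `save_watermarks` backed by your own client and storage:

```python
def run_fetch(api_params, state):
    """Fetch pages until exhausted; persist records per page, watermarks once at the end."""
    newest_seen = state.get("newest_id")
    oldest_seen = state.get("oldest_id")
    params = dict(api_params)
    while True:
        page = fetch_page(params)                 # one paginated API call (placeholder)
        records = page.get("records", [])
        if records:
            save_records(records)                 # persist EVERY page immediately
            for rid in (int(r["id"]) for r in records):
                newest_seen = rid if newest_seen is None else max(newest_seen, rid)
                oldest_seen = rid if oldest_seen is None else min(oldest_seen, rid)
        if not page.get("next_token"):
            break
        params["pagination_token"] = page["next_token"]
    # Only after the whole run succeeded: commit progress exactly once.
    save_watermarks(newest=newest_seen, oldest=oldest_seen)
```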

## Workflow Decision Tree

```
First run (no watermarks)?
├── YES → Full fetch (no since_id, no until_id)
└── NO → Backfill flag set?
    ├── YES → Backfill mode (until_id = oldest_id)
    └── NO → Update mode (since_id = newest_id)
```
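
In code, the same decision could look roughly like this; the `state` dict mirrors the watermark columns and `backfill` is an assumed flag:

```python
def build_params(state, backfill=False):
    """Translate watermark state into API query parameters."""
    if state.get("newest_id") is None:
        return {}                                  # first run: full fetch, no bounds
    if backfill:
        return {"until_id": state["oldest_id"]}    # reach further back in history
    return {"since_id": state["newest_id"]}        # regular run: only new records
```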

## Implementation Checklist

1. **Database**: Create ingestion_state table (see patterns.md)
2. **Fetch loop**: Insert records immediately after each API page
3. **Watermark tracking**: Track newest/oldest IDs seen in this run
4. **Watermark update**: Save watermarks ONCE at end of successful run
5. **Retry**: Exponential backoff with jitter (a sketch follows this checklist)
6. **Rate limits**: Wait for reset or skip and record for next run
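
For item 5, a hedged sketch of exponential backoff with full jitter; `fetch_page` and `RetryableError` are placeholders for your own client, not fixed names:

```python
import random
import time

def fetch_with_retry(params, max_attempts=5, base=1.0, cap=60.0):
    """Retry transient failures with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fetch_page(params)              # your own API client call
        except RetryableError:
            if attempt == max_attempts - 1:
                raise                              # give up after the last attempt
            delay = random.uniform(0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
```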

## Pagination Types

This pattern works best with **ID-based pagination** (numeric IDs that can be compared). For other pagination types:

| Type | Adaptation |
|------|------------|
| **Cursor/token** | Store cursor string instead of ID; can't compare numerically |
| **Timestamp** | Use `last_timestamp` column; compare as dates |
| **Offset/limit** | Store page number; resume from last saved page |
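
As one illustration, a cursor-based variant stores the opaque token verbatim; the `next_cursor` field and helper names below are assumptions, not API specifics:

```python
def resume_params(state):
    """Cursor-based APIs: the watermark is an opaque string, resumed verbatim."""
    cursor = state.get("next_cursor")
    return {"cursor": cursor} if cursor else {}

def after_page(page, run_state):
    save_records(page["records"])
    # Keep the latest cursor in memory; persist it as the watermark only once
    # the run finishes, exactly like the numeric-ID case above.
    run_state["next_cursor"] = page.get("next_cursor")
```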

See [references/patterns.md](references/patterns.md) for schemas and code examples.

Overview

This skill builds resilient, idempotent data ingestion pipelines for paginated REST APIs. It focuses on never losing progress and avoiding duplicate downloads by tracking two watermarks for forward and backward fetches. Use it to support regular incremental syncs and controlled historical backfills with robust retry and rate-limit handling.

How this skill works

The core is the Two Watermarks pattern: track newest_id to fetch new records (via since_id) and oldest_id to backfill older records (via until_id). The fetch loop persists each API page of records immediately, while watermarks are saved only once at the end of a successful run. Retries use exponential backoff with jitter, rate limits are waited out or deferred to the next run, and the pattern adapts to cursor, timestamp, or offset pagination.
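
As a rough sketch of the rate-limit branch (the `reset_at` timestamp and return values are illustrative, not a fixed interface):

```python
import time

def on_rate_limit(reset_at, deadline):
    """Wait for the limit to reset if it happens soon, otherwise stop cleanly."""
    if reset_at is not None and reset_at <= deadline:
        time.sleep(max(0.0, reset_at - time.time()))   # wait out the window, then retry
        return "retry"
    # Stop without updating watermarks; saved pages are kept, and the next
    # scheduled run resumes from the last committed watermark.
    return "stop"
```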

When to use it

  • Building a repeatable ingestion job that must resume after failures
  • Fetching paginated API data without re-downloading already ingested records
  • Implementing scheduled incremental syncs plus occasional historical backfills
  • Handling APIs that return numeric IDs, cursors, or timestamps for paging
  • Avoiding duplicate data and ensuring no gaps across runs

Best practices

  • Persist each page of records as soon as it arrives; don’t commit watermarks alongside page writes
  • Commit watermarks only after the whole run completes successfully to avoid skipping data
  • Track both newest and oldest seen IDs during a run to enable forward updates and backward backfills
  • Use exponential backoff with jitter for retries and respect rate-limit reset headers
  • Design your ingestion_state storage to be atomic and queryable for monitoring and restarts (a commit sketch follows this list)
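
A sketch of such an atomic commit against the SQLite table sketched earlier (still an assumption, not the schema from patterns.md):

```python
def commit_watermarks(conn, source, newest_seen, oldest_seen):
    """Move both watermarks in one transaction so they can never diverge."""
    if newest_seen is None:
        return  # nothing fetched this run; leave watermarks untouched
    with conn:  # sqlite3 connection as context manager = a single transaction
        conn.execute(
            """
            INSERT INTO ingestion_state (source, newest_id, oldest_id, updated_at)
            VALUES (?, ?, ?, datetime('now'))
            ON CONFLICT(source) DO UPDATE SET
                newest_id  = MAX(COALESCE(newest_id, excluded.newest_id), excluded.newest_id),
                oldest_id  = MIN(COALESCE(oldest_id, excluded.oldest_id), excluded.oldest_id),
                updated_at = excluded.updated_at
            """,
            (source, newest_seen, oldest_seen),
        )
```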

Example use cases

  • Syncing a social feed (e.g., tweets) daily while occasionally backfilling older posts
  • Ingesting paginated exchange trade history and resuming after partial failures
  • Building a scheduled ETL that fetches incremental API results without duplicates
  • Resuming a long historical download from the last saved page after interruption
  • Switching between update mode and backfill mode based on watermark state

FAQ

What if the API uses cursors instead of numeric IDs?

Store the cursor string as the watermark and resume from that cursor; you can’t numerically compare cursors, so rely on the API’s pagination direction and state tracking.

When should I save watermarks versus data records?

Save data records immediately after each page is fetched. Save watermarks only once at the end of a fully successful run to avoid skipping data if the run fails mid-way.