
incremental-fetch skill

/bundles/backend/skills/incremental-fetch

This skill builds resilient incremental data pipelines that track two watermarks (one for fetching newer data, one for backfilling older data) to avoid duplicates and gaps.

This is most likely a fork of the incremental-fetch skill from rohunvora.
npx playbooks add skill shipshitdev/library --skill incremental-fetch

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: incremental-fetch
description: "Build resilient data ingestion pipelines from APIs. Use when creating scripts that fetch paginated data from external APIs (Twitter, exchanges, any REST API) and need to track progress, avoid duplicates, handle rate limits, and support both incremental updates and historical backfills. Triggers: 'ingest data from API', 'pull tweets', 'fetch historical data', 'sync from X', 'build a data pipeline', 'fetch without re-downloading', 'resume the download', 'backfill older data'. NOT for: simple one-shot API calls, websocket/streaming connections, file downloads, or APIs without pagination."
---

# Incremental Fetch

Build data pipelines that never lose progress and never re-fetch existing data.

## The Two Watermarks Pattern

Track TWO cursors to support both forward and backward fetching:

| Watermark | Purpose | API Parameter |
|-----------|---------|---------------|
| `newest_id` | Fetch new data since last run | `since_id` |
| `oldest_id` | Backfill older data | `until_id` |

A single watermark only fetches forward. Two watermarks enable:

- Regular runs: fetch NEW data (since `newest_id`)
- Backfill runs: fetch OLD data (until `oldest_id`)
- No overlap, no gaps
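
A minimal sketch of the state backing this pattern, using SQLite. The `ingestion_state` table matches the checklist below, but the exact columns are assumptions; the skill's own schema is in references/patterns.md.

```python
import sqlite3

# Hypothetical schema: one row per source, holding both watermarks.
conn = sqlite3.connect("pipeline.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS ingestion_state (
        source    TEXT PRIMARY KEY,  -- e.g. 'twitter:home_timeline'
        newest_id INTEGER,           -- high watermark: pass as since_id
        oldest_id INTEGER            -- low watermark: pass as until_id
    )
""")
conn.commit()

def load_watermarks(source: str):
    """Return (newest_id, oldest_id), or (None, None) on the first run."""
    row = conn.execute(
        "SELECT newest_id, oldest_id FROM ingestion_state WHERE source = ?",
        (source,),
    ).fetchone()
    return (row[0], row[1]) if row else (None, None)
```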

## Critical: Data vs Watermark Saving

These are different operations with different timing:

| What | When to Save | Why |
|------|--------------|-----|
| **Data records** | After EACH page | Resilience: interrupted on page 47? Keep 46 pages |
| **Watermarks** | ONCE at end of run | Correctness: only commit progress after full success |

```
fetch page 1 → save records → fetch page 2 → save records → ... → update watermarks
```
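
In code, that ordering could look like the following sketch, where `fetch_page`, `save_records`, and `update_watermarks` are hypothetical stand-ins for your API client and storage layer:

```python
def run_ingestion(source, fetch_page, save_records, update_watermarks):
    """Persist records after every page; persist watermarks once at the end."""
    newest_seen = oldest_seen = None
    cursor = None
    while True:
        records, cursor = fetch_page(cursor)  # one API page per call
        if not records:
            break
        save_records(records)  # persist now: a crash on page 47 keeps pages 1-46
        ids = [int(r["id"]) for r in records]
        newest_seen = max(ids) if newest_seen is None else max(newest_seen, max(ids))
        oldest_seen = min(ids) if oldest_seen is None else min(oldest_seen, min(ids))
        if cursor is None:  # last page reached
            break
    # Commit progress once, only after every page succeeded.
    if newest_seen is not None:
        update_watermarks(source, newest_seen, oldest_seen)
```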

## Workflow Decision Tree

```
First run (no watermarks)?
├── YES → Full fetch (no since_id, no until_id)
└── NO → Backfill flag set?
    ├── YES → Backfill mode (until_id = oldest_id)
    └── NO → Update mode (since_id = newest_id)
```
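
The same tree as a mode-selection function, assuming ID-based pagination and a `backfill` flag supplied by the caller:

```python
def choose_fetch_params(newest_id, oldest_id, backfill: bool) -> dict:
    """Map the decision tree onto ID-based API query parameters."""
    if newest_id is None and oldest_id is None:
        return {}                       # first run: full fetch
    if backfill:
        return {"until_id": oldest_id}  # backfill mode: walk backward
    return {"since_id": newest_id}      # update mode: only new data
```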

## Implementation Checklist

1. **Database**: Create ingestion_state table (see patterns.md)
2. **Fetch loop**: Insert records immediately after each API page
3. **Watermark tracking**: Track newest/oldest IDs seen in this run
4. **Watermark update**: Save watermarks ONCE at end of successful run
5. **Retry**: Exponential backoff with jitter (see the sketch after this list)
6. **Rate limits**: Wait for reset or skip and record for next run
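
For items 5 and 6, a retry wrapper along these lines is one option; `TransientError` is a hypothetical stand-in for whatever your HTTP client raises on retryable failures:

```python
import random
import time

class TransientError(Exception):
    """Stand-in for the exception your HTTP client raises on 429/5xx responses."""

def fetch_with_retry(do_fetch, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry a fetch callable with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return do_fetch()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # full jitter: sleep anywhere in [0, base_delay * 2**attempt]
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```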

## Pagination Types

This pattern works best with **ID-based pagination** (numeric IDs that can be compared). For other pagination types:

| Type | Adaptation |
|------|------------|
| **Cursor/token** | Store cursor string instead of ID; can't compare numerically |
| **Timestamp** | Use `last_timestamp` column; compare as dates |
| **Offset/limit** | Store page number; resume from last saved page |
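
As an example of the cursor adaptation, the sketch below checkpoints the cursor after each page (the `load_cursor`/`save_cursor` helpers are assumptions). Unlike numeric watermarks, this is safe to do mid-run, because the cursor only ever advances past records that have already been persisted:

```python
def run_cursor_ingestion(source, fetch_page, save_records, load_cursor, save_cursor):
    """Resumable fetch for opaque-cursor pagination."""
    cursor = load_cursor(source)  # None on the first run
    while True:
        records, next_cursor = fetch_page(cursor)
        if records:
            save_records(records)         # records first...
        if next_cursor is None:
            break                         # no more pages
        save_cursor(source, next_cursor)  # ...then the checkpoint
        cursor = next_cursor
```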

See [references/patterns.md](references/patterns.md) for schemas and code examples.

Overview

This skill builds resilient, incremental data ingestion pipelines for paginated REST APIs. It implements a two-watermark pattern to support both forward updates and retrospective backfills while preventing duplicates and gaps. The pattern is designed for ID-based pagination, with adaptations for cursor-, timestamp-, and offset-based APIs, and it includes retry logic, rate-limit handling, and atomic watermark commits.

How this skill works

The skill tracks two watermarks: newest_id for fetching newer records (passed as since_id) and oldest_id for backfills (passed as until_id). It saves fetched records immediately after each page and commits updated watermarks only once the entire run completes successfully. It supports exponential backoff with jitter, handles rate limits by waiting for the reset or deferring to the next run, and adapts to cursor or timestamp pagination by storing the appropriate marker type.

When to use it

  • Building a scheduled ingestion job that must resume after interruptions without re-downloading data
  • Syncing from APIs that return paginated results (e.g., tweets, exchange trades, REST endpoints)
  • Running periodic incremental updates plus occasional historical backfills
  • Avoiding duplicates when API responses overlap between runs
  • Handling APIs with rate limits where runs may need to pause and resume

Best practices

  • Persist raw page data to your DB immediately after each successful page fetch for resilience
  • Only update watermarks once the run finishes successfully to avoid data loss or skipped ranges
  • Use two watermarks (newest and oldest) so forward fetches and backfills never overlap
  • Implement exponential backoff with jitter and record rate-limit state for the next run (see the sketch after this list)
  • Choose the watermark type that matches the API (ID, cursor/token, timestamp, or page number)
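
One way to record rate-limit state, assuming the API signals limits with HTTP 429 and an `X-RateLimit-Reset` epoch header (a common convention, not a universal one); `save_state` is a stand-in for persisting a small key/value blob alongside the watermarks:

```python
import time

def handle_rate_limit(response, save_state) -> bool:
    """Return True if the caller may keep fetching, False to stop for this run."""
    if response.status_code != 429:
        return True
    reset_at = float(response.headers.get("X-RateLimit-Reset", 0))
    wait = reset_at - time.time()
    if 0 < wait <= 60:
        time.sleep(wait)  # short cooldown: wait it out in-process
        return True
    save_state({"rate_limited_until": reset_at})  # long cooldown: defer to next run
    return False
```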

Example use cases

  • Daily job that pulls new tweets since the last run while occasionally backfilling older tweets
  • Ingesting paginated trade or order data from an exchange API without reprocessing the same trades
  • Resumable historical data import where a crash should not force re-downloading all pages
  • Building a sync service that maintains exact progress markers across distributed workers
  • Converting cursor-based pagination into resumable checkpoints by storing the last cursor

FAQ

What if the API uses cursors instead of numeric IDs?

Store the cursor string as the watermark and treat it as an opaque marker; you cannot compare values numerically but you can resume from the saved cursor.

When should I update watermarks during a run?

Never update the persistent watermarks mid-run. Save page data continuously, but persist the new watermarks only after the full ingestion run completes successfully.
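
To make that final commit atomic, both watermarks can be written in a single transaction. A sketch against the hypothetical `ingestion_state` table from earlier, assuming the run saw at least one record:

```python
import sqlite3

def commit_watermarks(conn: sqlite3.Connection, source, newest_id, oldest_id):
    """Write both watermarks in one transaction at the end of a successful run."""
    with conn:  # commits on success, rolls back if anything raises
        conn.execute(
            """
            INSERT INTO ingestion_state (source, newest_id, oldest_id)
            VALUES (?, ?, ?)
            ON CONFLICT(source) DO UPDATE SET
                newest_id = MAX(ingestion_state.newest_id, excluded.newest_id),
                oldest_id = MIN(ingestion_state.oldest_id, excluded.oldest_id)
            """,
            (source, newest_id, oldest_id),
        )
```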