
This skill helps you extract and organize Snowflake documentation efficiently with cached, configurable scraping and depth control.

npx playbooks add skill sfc-gh-dflippo/snowflake-dbt-demo --skill doc-scraper

Review the files below or copy the command above to add this skill to your agents.

Files (7)
SKILL.md
---
name: doc-scraper
description:
  Generic web scraper for extracting and organizing Snowflake documentation with intelligent caching
  and configurable spider depth. Scrapes any section of docs.snowflake.com controlled by
  --base-path.
---

# Snowflake Documentation Scraper

Scrapes docs.snowflake.com sections to Markdown with SQLite caching (7-day expiration).

## Usage

**First-time setup** (auto-installs uv and doc-scraper):

```bash
python3 .claude/skills/doc-scraper/scripts/doc_scraper.py
```

**Subsequent runs:**

```bash
doc-scraper --output-dir=./snowflake-docs
doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/"
doc-scraper --output-dir=./snowflake-docs --spider-depth=2
```

## Command Options

| Option           | Default           | Description                                                                     |
| ---------------- | ----------------- | ------------------------------------------------------------------------------- |
| `--output-dir`   | **Required**      | Output directory for scraped docs                                               |
| `--base-path`    | `/en/migrations/` | URL section to scrape                                                           |
| `--spider-depth` | `1`               | Link depth: 0 = seed pages only, 1 = follow links once, 2 = follow links twice  |
| `--limit`        | None              | Cap the number of URLs processed (for testing)                                  |
| `--dry-run`      | -                 | Preview without writing files                                                   |
## Output

```text
output-dir/
├── SKILL.md              # Auto-generated index
├── scraper_config.yaml   # Editable config (auto-created)
├── .cache/               # SQLite cache (auto-managed)
└── en/migrations/*.md    # Scraped pages with frontmatter
```

## Configuration

Auto-created at `{output-dir}/scraper_config.yaml`:

```yaml
rate_limiting:
  max_concurrent_threads: 4
spider:
  max_pages: 1000
  allowed_paths: ["/en/"]
scraped_pages:
  expiration_days: 7
```

## Troubleshooting

| Issue            | Solution                              |
| ---------------- | ------------------------------------- |
| Too many pages   | Lower `--spider-depth` or edit config |
| Missing pages    | Increase `--spider-depth`             |
| Cache corruption | Delete `{output-dir}/.cache/` (rare)  |

Overview

This skill scrapes sections of docs.snowflake.com and converts pages to organized Markdown files with frontmatter. It uses a local SQLite cache with a configurable expiration to avoid re-downloading unchanged pages. The spider depth and base path are configurable to control crawl scope and volume.

How this skill works

Run the scraper with an output directory and optional flags to set the base path and spider depth. The tool crawls the allowed paths, follows links up to the specified depth, converts each page to Markdown, and records page metadata in frontmatter. A SQLite cache stores fetched pages for the configured expiration period, which speeds up repeated runs and reduces load on the site. Rate-limiting and thread settings in scraper_config.yaml control concurrency.
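
For example, a scoped crawl of the migrations section might look like the following; the output directory and flag values are illustrative, and only flags from the Command Options table above are used:

```bash
# Crawl the migrations docs, following links one level deep from the seed pages.
# The cache and scraper_config.yaml are created inside the output directory.
doc-scraper \
  --output-dir=./snowflake-docs \
  --base-path="/en/migrations/" \
  --spider-depth=1
```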

When to use it

  • When you need an offline, searchable copy of Snowflake documentation sections.
  • To build documentation sites, knowledge bases, or data platform playbooks from Snowflake docs.
  • When auditing or tracking changes to specific Snowflake docs over time.
  • To extract docs for inclusion in internal developer portals or training materials.

Best practices

  • Set --base-path to limit the crawl to only the documentation area you need.
  • Start with --spider-depth=0 or 1 to verify scope before increasing depth.
  • Use the --limit flag for testing large sections to avoid long runs.
  • Inspect and tune scraper_config.yaml for max threads and allowed paths to balance speed and site load.
  • Use --dry-run to preview which pages will be processed before writing files (a short workflow sketch follows this list).
  • Remove the .cache directory only if you suspect cache corruption; otherwise rely on expiration_days to refresh content.
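
A conservative ramp-up, using only the documented flags (the paths and numbers are illustrative), might look like this:

```bash
# 1. Preview a small sample without writing anything.
doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/" \
  --spider-depth=0 --limit=20 --dry-run

# 2. Scrape just the seed pages and review the Markdown output.
doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/" \
  --spider-depth=0

# 3. Widen the crawl once the scope looks right.
doc-scraper --output-dir=./snowflake-docs --base-path="/en/sql-reference/" \
  --spider-depth=1
```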

Example use cases

  • Export the SQL reference section to Markdown for integration into an internal docs site using --base-path="/en/sql-reference/".
  • Create a weekly snapshot of a migrations guide by running the scraper with caching and storing the output in a version-controlled repository (see the sketch after this list).
  • Generate a trimmed, offline copy of admin and security docs by adjusting allowed_paths and spider depth.
  • Test new scraping rules on a small set using --limit and --dry-run before full extraction.
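
The weekly-snapshot case could be scripted roughly as follows. The git workflow and the idea of running it from cron or CI are assumptions about your environment, not features of the skill itself:

```bash
#!/usr/bin/env bash
# Hypothetical weekly snapshot: re-scrape the migrations guide and commit any changes.
# Assumes ./snowflake-docs is tracked in a git repository.
set -euo pipefail

doc-scraper --output-dir=./snowflake-docs --base-path="/en/migrations/"

cd snowflake-docs
git add -A
git commit -m "Docs snapshot $(date +%F)" || echo "No changes since the last snapshot."
```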

FAQ

How long are cached pages kept?

Cached pages expire after the expiration_days value configured in scraper_config.yaml (default: 7 days).

How do I restrict the crawl to a small area?

Set --base-path to the desired docs path and lower --spider-depth to limit link traversal; the allowed_paths setting in scraper_config.yaml also constrains which links the spider will follow.
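
Concretely, reusing the default migrations path as an example:

```bash
# Restrict the crawl to one docs section and do not follow links beyond the seed pages.
doc-scraper --output-dir=./snowflake-docs \
  --base-path="/en/migrations/" --spider-depth=0
# To narrow it further, edit allowed_paths in ./snowflake-docs/scraper_config.yaml.
```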

Can I preview actions without writing files?

Yes. Use --dry-run to see which URLs would be processed without creating output files.