---
name: documentation-scraper
description: Use when needing to scrape documentation websites into markdown for AI context. Triggers on "scrape docs", "download documentation", "get docs for [library]", or creating local copies of online documentation. CRITICAL - always analyze sitemap first before scraping.
---
# Documentation Scraper with slurp-ai
## Overview
slurp-ai scrapes documentation websites and compiles them into a single markdown file optimized for AI agent context. It is lightweight, fast, and deterministic - it does NOT use AI to scrape; it produces output FOR AI consumption.
## CRITICAL: Run Outside Sandbox
**All commands in this skill MUST be run outside the sandbox.** Use `dangerouslyDisableSandbox: true` for all Bash commands including:
- `which slurp` (installation check)
- `node analyze-sitemap.js` (sitemap analysis)
- `slurp` (scraping)
- File inspection commands (`wc`, `head`, `cat`, etc.)
The sandbox blocks network access and file operations required for web scraping.
## Pre-Flight: Check Installation
**Before scraping, verify slurp-ai is installed:**
```bash
which slurp || echo "NOT INSTALLED"
```
If not installed, ask the user to run:
```bash
npm install -g slurp-ai
```
**Requires:** Node.js v20+
**Do NOT proceed with scraping until slurp-ai is confirmed installed.**
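A minimal pre-flight sketch combining both checks (the version parsing assumes the usual `vX.Y.Z` output of `node --version`):
```bash
# Hypothetical pre-flight helper: verify Node.js >= 20 and slurp-ai availability
if ! command -v node >/dev/null 2>&1; then
  echo "Node.js not found - install Node.js v20+ first" && exit 1
fi
node_major=$(node --version | sed 's/^v//' | cut -d. -f1)
if [ "$node_major" -lt 20 ]; then
  echo "Node.js v20+ required (found v$node_major)" && exit 1
fi
command -v slurp >/dev/null 2>&1 || { echo "slurp-ai NOT INSTALLED - run: npm install -g slurp-ai"; exit 1; }
echo "Pre-flight OK"
```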
## Commands
| Command | Purpose |
|---------|---------|
| `slurp <url>` | Fetch and compile in one step |
| `slurp fetch <url> [version]` | Download docs to partials only |
| `slurp compile` | Compile partials into single file |
| `slurp read <package> [version]` | Read local documentation |
**Output:** Creates `slurp_compiled/compiled_docs.md` from partials in `slurp_partials/`.
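The two-step variant looks like this (a sketch built from the command table above; the optional version argument is omitted):
```bash
# Step 1: download pages into ./slurp_partials only (no compilation)
slurp fetch https://expressjs.com/en/4x/api.html

# Step 2: compile the partials into a single file
slurp compile
# -> ./slurp_compiled/compiled_docs.md
```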
## CRITICAL: Analyze Sitemap First
**Before running slurp, ALWAYS analyze the sitemap.** This reveals the complete site structure and informs your `--base-path` and `--max` decisions.
### Step 1: Run Sitemap Analysis
Use the included `analyze-sitemap.js` script:
```bash
node analyze-sitemap.js https://docs.example.com
```
This outputs:
- Total page count (informs `--max`)
- URLs grouped by section (informs `--base-path`)
- Suggested slurp commands with appropriate flags
- Sample URLs to understand naming patterns
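If the script is not available, a rough manual pass over the standard sitemap location gives a similar picture (this assumes the site serves `sitemap.xml` at the root; a sitemap-index file would need one more hop):
```bash
# Pull URLs out of the sitemap, then count them and group by top-level section
curl -s https://docs.example.com/sitemap.xml \
  | grep -o '<loc>[^<]*</loc>' \
  | sed -e 's/<loc>//' -e 's/<\/loc>//' > urls.txt

wc -l < urls.txt                                  # total page count (informs --max)
cut -d/ -f4 urls.txt | sort | uniq -c | sort -rn  # pages per section (informs --base-path)
```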
### Step 2: Interpret the Output
Example output:
```
Total URLs in sitemap: 247

URLs by top-level section:
  /docs   182 pages
  /api     45 pages
  /blog    20 pages

Suggested --base-path options:
  https://docs.example.com/docs/guides/     (67 pages)
  https://docs.example.com/docs/reference/  (52 pages)
  https://docs.example.com/api/             (45 pages)

Recommended slurp commands:
  # Just "/docs/guides" section (67 pages)
  slurp https://docs.example.com/docs/guides/ --base-path https://docs.example.com/docs/guides/ --max 80
```
### Step 3: Choose Scope Based on Analysis
| Sitemap Shows | Action |
|---------------|--------|
| < 50 pages total | Scrape entire site: `slurp <url> --max 60` |
| 50-200 pages | Scope to relevant section with `--base-path` |
| 200+ pages | Must scope down - pick specific subsection |
| No sitemap found | Start with `--max 30`, inspect partials, adjust |
### Step 4: Frame the Slurp Command
With sitemap data, you can now set accurate parameters:
```bash
# From sitemap: /docs/api has 45 pages
slurp https://docs.example.com/docs/api/intro \
--base-path https://docs.example.com/docs/api/ \
--max 55
```
**Key insight:** The starting URL is where crawling begins; the base path filters which links get followed. They can differ, which is useful when the base path itself returns a 404 (see the example below).
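For instance, a sketch of the 404 case (URLs are illustrative):
```bash
# https://docs.example.com/docs/api/ itself returns 404, so crawling starts at a
# real page while followed links are still restricted to the /docs/api/ section
slurp https://docs.example.com/docs/api/getting-started \
  --base-path https://docs.example.com/docs/api/ \
  --max 55
```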
## Common Scraping Patterns
### Library Documentation (versioned)
```bash
# Express.js 4.x docs
slurp https://expressjs.com/en/4x/api.html --base-path https://expressjs.com/en/4x/
# React docs (latest)
slurp https://react.dev/learn --base-path https://react.dev/learn
```
### API Reference Only
```bash
slurp https://docs.example.com/api/introduction --base-path https://docs.example.com/api/
```
### Full Documentation Site
```bash
slurp https://docs.example.com/
```
## CLI Options
| Flag | Default | Purpose |
|------|---------|---------|
| `--max <n>` | 20 | Maximum pages to scrape |
| `--concurrency <n>` | 5 | Parallel page requests |
| `--headless <bool>` | true | Use headless browser |
| `--base-path <url>` | start URL | Filter links to this prefix |
| `--output <dir>` | `./slurp_partials` | Output directory for partials |
| `--retry-count <n>` | 3 | Retries for failed requests |
| `--retry-delay <ms>` | 1000 | Delay between retries |
| `--yes` | - | Skip confirmation prompts |
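A combined example (flag values are illustrative, not recommendations from the slurp-ai docs):
```bash
# Slower, more resilient scrape into a custom partials directory, no prompts
slurp https://docs.example.com/docs/ \
  --base-path https://docs.example.com/docs/ \
  --max 100 \
  --concurrency 3 \
  --retry-count 5 \
  --retry-delay 2000 \
  --output ./slurp_partials_docs \
  --yes
```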
### Compile Options
| Flag | Default | Purpose |
|------|---------|---------|
| `--input <dir>` | `./slurp_partials` | Input directory |
| `--output <file>` | `./slurp_compiled/compiled_docs.md` | Output file |
| `--preserve-metadata` | true | Keep metadata blocks |
| `--remove-navigation` | true | Strip nav elements |
| `--remove-duplicates` | true | Eliminate duplicates |
| `--exclude <json>` | - | JSON array of regex patterns to exclude |
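For example, `--exclude` takes a JSON array of regex patterns; the exact shell quoting here is an assumption worth verifying:
```bash
# Compile from a custom partials directory, dropping changelog and blog pages
slurp compile \
  --input ./slurp_partials_docs \
  --output ./slurp_compiled/docs_no_blog.md \
  --exclude '["changelog", "/blog/"]'
```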
### When to Disable Headless Mode
Use `--headless false` for:
- Static HTML documentation sites
- Faster scraping when JS rendering not needed
**Default is headless (true)** - works for most modern doc sites including SPAs.
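For a plain-HTML docs site, that might look like:
```bash
# Skip the headless browser for a faster crawl of static HTML
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --headless false
```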
## Output Structure
```
slurp_partials/            # Intermediate files
├── page1.md
└── page2.md
slurp_compiled/            # Final output
└── compiled_docs.md       # Compiled result
```
## Quick Reference
```bash
# 1. ALWAYS analyze sitemap first
node analyze-sitemap.js https://docs.example.com
# 2. Scrape with informed parameters (from sitemap analysis)
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80
# 3. Skip prompts for automation
slurp https://docs.example.com/ --yes
# 4. Check output
head -n 100 slurp_compiled/compiled_docs.md
```
## Common Issues
| Problem | Cause | Solution |
|---------|-------|----------|
| Wrong `--max` value | Guessing page count | Run `analyze-sitemap.js` first |
| Too few pages scraped | `--max` limit (default 20) | Set `--max` based on sitemap analysis |
| Missing content | JS not rendering | Ensure `--headless true` (default) |
| Crawl stuck/slow | Rate limiting | Reduce concurrency, e.g. `--concurrency 3` |
| Duplicate sections | Similar content | Use `--remove-duplicates` (default) |
| Wrong pages included | Base path too broad | Use sitemap to find correct `--base-path` |
| Prompts blocking automation | Interactive mode | Add `--yes` flag |
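For the "Too few pages scraped" and "Wrong pages included" rows, a quick check is to compare the partials against the sitemap count (paths assume the default output directory; partial filenames are determined by slurp-ai, but they usually hint at the source pages):
```bash
# How many pages were actually scraped? Compare with the sitemap analysis total.
ls slurp_partials/ | wc -l

# Eyeball the scraped page names to spot off-scope content
ls slurp_partials/ | head -20
```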
## Post-Scrape Usage
The output markdown is designed for AI context injection:
```bash
# Check file size (context budget)
wc -c slurp_compiled/compiled_docs.md
# Preview structure
grep "^#" slurp_compiled/compiled_docs.md | head -30
# Use with Claude Code - reference in prompt or via @file
```
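A rough token estimate helps judge the context budget (the 4-characters-per-token figure is a common rule of thumb, not a tokenizer-exact count):
```bash
# Approximate token count: bytes / 4
bytes=$(wc -c < slurp_compiled/compiled_docs.md)
echo "~$((bytes / 4)) tokens"
```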
## When NOT to Use
- **API specs in OpenAPI/Swagger**: Use dedicated parsers instead
- **GitHub READMEs**: Fetch directly via raw.githubusercontent.com (see the example after this list)
- **npm package docs**: Often better to read source + README
- **Frequently updated docs**: Consider caching strategy
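For the GitHub README case above, fetching the raw file directly is usually enough (OWNER, REPO, and the branch name are placeholders):
```bash
# Grab the README straight from raw.githubusercontent.com - no scraping needed
curl -s https://raw.githubusercontent.com/OWNER/REPO/main/README.md -o README_context.md
```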
## FAQ
**Do I need to run anything inside the sandbox?**
No. All network and file commands must run outside the sandbox; the sandbox blocks the required operations.
**What if slurp-ai is not installed?**
Ask the user to run `npm install -g slurp-ai` and ensure Node.js v20+ is installed, then re-run `which slurp` to confirm.
**What if there is no sitemap?**
Start with `--max 30`, run a short fetch, inspect the partials to understand the structure, then re-run `analyze-sitemap.js` or adjust `--base-path` and `--max` accordingly.
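A sketch of that exploratory loop (URLs and directory names are illustrative):
```bash
# 1. Small exploratory crawl when no sitemap exists
slurp https://docs.example.com/ --max 30

# 2. Inspect what was captured to infer the site structure
ls slurp_partials/ | head -30

# 3. Re-run with a tighter scope once a section pattern is clear
slurp https://docs.example.com/docs/ --base-path https://docs.example.com/docs/ --max 80
```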