
This skill scrapes documentation websites into organized reference files for offline archives or Claude skills, producing categorized, searchable output.

npx playbooks add skill jmagly/aiwg --skill doc-scraper

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (6.2 KB)
---
name: doc-scraper
description: Scrape documentation websites into organized reference files. Use when converting docs sites to searchable references or building Claude skills.
tools: Read, Write, Bash, WebFetch
---

# Documentation Scraper Skill

## Purpose

Single responsibility: Convert documentation websites into organized, categorized reference files suitable for Claude skills or offline archives. (BP-4)

## Grounding Checkpoint (Archetype 1 Mitigation)

Before executing, VERIFY:

- [ ] Target URL is accessible (test with `curl -I`)
- [ ] Documentation structure is identifiable (inspect page for content selectors)
- [ ] Output directory is writable
- [ ] Rate limiting requirements are known (check robots.txt)

**DO NOT proceed without verification. Inspect before scraping.**

## Uncertainty Escalation (Archetype 2 Mitigation)

ASK USER instead of guessing when:

- Content selector is ambiguous (multiple `<article>` or `<main>` elements)
- URL patterns are unclear (include/exclude rules cannot be determined)
- Category mapping is uncertain (content doesn't fit predefined categories)
- Rate limiting is unknown (no robots.txt, unclear ToS)

**NEVER substitute missing configuration with assumptions.**

## Context Scope (Archetype 3 Mitigation)

| Context Type | Included | Excluded |
|--------------|----------|----------|
| RELEVANT | Target URL, selectors, output path | Unrelated documentation |
| PERIPHERAL | Similar site examples for selector hints | Historical scrape data |
| DISTRACTOR | Other projects, unrelated URLs | Previous failed attempts |

## Workflow Steps

### Step 1: Verify Target (Grounding)

```bash
# Test URL accessibility
curl -I <target-url>

# Check robots.txt
curl <base-url>/robots.txt

# Inspect page structure (use browser dev tools or fetch sample)
```
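If browser dev tools are not handy, a short script can fetch one sample page and count candidate content containers, covering the selector and output-directory items of the grounding checklist. This is a minimal sketch, assuming Python with `requests` and `beautifulsoup4` installed; the selectors checked are common defaults, not guaranteed matches for your site.

```python
# inspect_page.py - grounding check for one sample docs page (sketch)
import os
import pathlib
import sys

import requests                 # assumed installed
from bs4 import BeautifulSoup   # assumed installed (beautifulsoup4)

url = sys.argv[1]               # e.g. https://docs.example.com/guide/intro
resp = requests.get(url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

# Count candidate main-content containers to spot ambiguous selectors early.
for selector in ("article", "main", "div[role='main']"):
    print(f"{selector!r}: {len(soup.select(selector))} match(es)")

# One match is ideal; zero or several means the main_content selector
# needs refinement (or user input - see Uncertainty Escalation).

# Confirm the output directory is writable before scraping.
out_dir = pathlib.Path(sys.argv[2]) if len(sys.argv) > 2 else pathlib.Path("output")
print("output directory writable:", out_dir.is_dir() and os.access(out_dir, os.W_OK))
```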

### Step 2: Create Configuration

Generate scraper config based on inspection:

```json
{
  "name": "skill-name",
  "description": "When to use this skill",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article",
    "title": "h1",
    "code_blocks": "pre code"
  },
  "url_patterns": {
    "include": ["/docs", "/guide", "/api"],
    "exclude": ["/blog", "/changelog", "/releases"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "installation"],
    "api_reference": ["api", "reference", "methods"],
    "guides": ["guide", "tutorial", "how-to"]
  },
  "rate_limit": 0.5,
  "max_pages": 500
}
```
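The `categories` block maps keyword lists to category buckets. As an illustration of how such keyword matching against a page's URL and title can work (the function and matching rule below are assumptions for the sketch, not the tool's actual logic):

```python
# categorize.py - illustrative keyword-to-category mapping (sketch)
CATEGORIES = {
    "getting_started": ["intro", "quickstart", "installation"],
    "api_reference": ["api", "reference", "methods"],
    "guides": ["guide", "tutorial", "how-to"],
}

def categorize(url: str, title: str, default: str = "uncategorized") -> str:
    """Return the first category whose keywords appear in the URL or title."""
    haystack = f"{url} {title}".lower()
    for category, keywords in CATEGORIES.items():
        if any(keyword in haystack for keyword in keywords):
            return category
    return default

print(categorize("https://docs.example.com/docs/quickstart", "Quickstart"))
# -> getting_started
```

Keywords are matched as substrings here, so broad terms like "api" can over-match; prefer specific keywords when categories overlap.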

### Step 3: Execute Scraping

**Option A: With skill-seekers (if installed)**

```bash
# Verify skill-seekers is available
pip show skill-seekers

# Run scraper
skill-seekers scrape --config config.json

# For large docs, use async mode
skill-seekers scrape --config config.json --async --workers 8
```

**Option B: Manual scraping guidance**

1. Use sitemap.xml or crawl from the starting URL
2. Extract content using the configured selectors
3. Categorize pages based on URL patterns and keywords
4. Save to an organized directory structure (a minimal sketch of this loop follows below)
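The sketch below illustrates these steps with Python, `requests`, and `beautifulsoup4` (both assumed to be installed); it is a simplified stand-in, not the skill-seekers implementation, and the base URL, selector, and rate limit mirror the sample config above.

```python
# manual_scrape.py - minimal crawl-and-extract loop (sketch, not skill-seekers)
import json
import pathlib
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://docs.example.com/"   # config: base_url
MAIN_CONTENT = "article"                 # config: selectors.main_content
RATE_LIMIT = 0.5                         # config: rate_limit (seconds between requests)
MAX_PAGES = 100                          # config: max_pages

out_pages = pathlib.Path("output/myframework_data/pages")
out_pages.mkdir(parents=True, exist_ok=True)

seen, queue = set(), [BASE_URL]
while queue and len(seen) < MAX_PAGES:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)

    resp = requests.get(url, timeout=30)
    if resp.status_code != 200:
        continue
    soup = BeautifulSoup(resp.text, "html.parser")

    # Step 2: extract the main content with the configured selector.
    content = soup.select_one(MAIN_CONTENT)
    if content is not None:
        title = soup.find("h1")
        record = {
            "url": url,
            "title": title.get_text(strip=True) if title else "",
            "text": content.get_text("\n", strip=True),
        }
        name = urlparse(url).path.strip("/").replace("/", "_") or "index"
        (out_pages / f"{name}.json").write_text(json.dumps(record, indent=2))

    # Step 1 (continued): enqueue same-site links for further crawling.
    for link in soup.find_all("a", href=True):
        target = urljoin(url, link["href"]).split("#")[0]
        if target.startswith(BASE_URL):
            queue.append(target)

    time.sleep(RATE_LIMIT)  # respect the configured rate limit
```

Step 3 (categorization) is omitted here for brevity; the keyword-mapping sketch under Step 2 of the workflow can be reused for it.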

### Step 4: Validate Output

```bash
# Check output structure
ls -la output/<skill-name>/

# Verify content quality
head -50 output/<skill-name>/references/index.md

# Count extracted pages
find output/<skill-name>_data/pages -name "*.json" | wc -l
```
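A short script can also cross-check the raw page files against the scrape summary. This is a sketch; `total_pages` is an assumed key, so adjust it to whatever fields summary.json actually contains.

```python
# validate_output.py - sanity-check scrape output (sketch)
import json
import pathlib
import sys

skill = sys.argv[1]                          # e.g. "myframework"
data_dir = pathlib.Path(f"output/{skill}_data")

pages = list((data_dir / "pages").glob("*.json"))
print(f"raw page files: {len(pages)}")

summary_path = data_dir / "summary.json"
if summary_path.exists():
    summary = json.loads(summary_path.read_text())
    reported = summary.get("total_pages")    # assumed field name
    print(f"summary reports: {reported}")
    if reported is not None and reported != len(pages):
        print("WARNING: summary and raw page counts disagree")

# Flag zero-byte page files, which usually indicate failed extractions.
empty = [p.name for p in pages if p.stat().st_size == 0]
print(f"empty page files: {len(empty)}")
```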

## Recovery Protocol (Archetype 4 Mitigation)

On error:

1. **PAUSE** - Stop scraping, preserve already-fetched pages
2. **DIAGNOSE** - Check error type:
   - `Connection error` → Verify URL, check network
   - `Selector not found` → Re-inspect page structure
   - `Rate limited` → Increase delay, reduce workers
   - `Memory/disk` → Reduce batch size, clear temp files
3. **ADAPT** - Adjust configuration based on diagnosis
4. **RETRY** - Resume from checkpoint (max 3 attempts; see the retry sketch below)
5. **ESCALATE** - Ask the user for guidance
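
A minimal sketch of that bounded retry loop, wrapping the scrape command and resuming from checkpoints (it assumes the `--resume` flag shown under Checkpoint Support; adapt the command and backoff to your setup):

```python
# retry_scrape.py - bounded retry with checkpoint resume (sketch)
import subprocess
import time

MAX_ATTEMPTS = 3
BACKOFF_SECONDS = 30

cmd = ["skill-seekers", "scrape", "--config", "config.json", "--resume"]

for attempt in range(1, MAX_ATTEMPTS + 1):
    result = subprocess.run(cmd)
    if result.returncode == 0:
        print("scrape completed")
        break
    # PAUSE / DIAGNOSE / ADAPT happen here in practice:
    # inspect the error, adjust config.json, then let the loop retry.
    print(f"attempt {attempt} failed (exit code {result.returncode})")
    if attempt < MAX_ATTEMPTS:
        time.sleep(BACKOFF_SECONDS * attempt)  # simple linear backoff
else:
    print("max attempts reached - escalate to the user for guidance")
```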

## Checkpoint Support

State saved to: `.aiwg/working/checkpoints/doc-scraper/`

Resume interrupted scrape:
```bash
skill-seekers scrape --config config.json --resume
```

Clear checkpoint and start fresh:
```bash
skill-seekers scrape --config config.json --fresh
```

## Output Structure

```
output/<skill-name>/
├── SKILL.md              # Main skill description
├── references/           # Categorized documentation
│   ├── index.md          # Category index
│   ├── getting_started.md
│   ├── api_reference.md
│   └── guides.md
├── scripts/              # (empty, for user additions)
└── assets/               # (empty, for user additions)

output/<skill-name>_data/
├── pages/                # Raw scraped JSON (one per page)
└── summary.json          # Scrape statistics
```

## Configuration Templates

### Minimal Config

```json
{
  "name": "myframework",
  "base_url": "https://docs.example.com/",
  "max_pages": 100
}
```

### Full Config

```json
{
  "name": "myframework",
  "description": "MyFramework documentation for building web apps",
  "base_url": "https://docs.example.com/",
  "selectors": {
    "main_content": "article, main, div[role='main']",
    "title": "h1, .title",
    "code_blocks": "pre code, .highlight code",
    "navigation": "nav, .sidebar"
  },
  "url_patterns": {
    "include": ["/docs/", "/api/", "/guide/"],
    "exclude": ["/blog/", "/changelog/", "/v1/", "/v2/"]
  },
  "categories": {
    "getting_started": ["intro", "quickstart", "install", "setup"],
    "concepts": ["concept", "overview", "architecture"],
    "api": ["api", "reference", "method", "function"],
    "guides": ["guide", "tutorial", "how-to", "example"],
    "advanced": ["advanced", "internals", "customize"]
  },
  "rate_limit": 0.5,
  "max_pages": 1000,
  "checkpoint": {
    "enabled": true,
    "interval": 100
  }
}
```

## Troubleshooting

| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| No content extracted | Selector mismatch | Inspect page, update `main_content` selector |
| Wrong pages scraped | URL pattern issue | Check `include`/`exclude` patterns |
| Rate limited | Too aggressive | Increase `rate_limit` to 1.0+ seconds |
| Memory issues | Too many pages | Add `max_pages` limit, enable checkpoints |
| Categories wrong | Keyword mismatch | Update category keywords in config |

## References

- Skill Seekers: https://github.com/jmagly/Skill_Seekers
- REF-001: Production-Grade Agentic Workflows (BP-1, BP-4, BP-9)
- REF-002: LLM Failure Modes (Archetype 1-4 mitigations)

Overview

This skill scrapes documentation websites and converts them into organized, categorized reference files suitable for searchable archives or Claude skills. It focuses on reliable extraction, category mapping, checkpointed runs, and safe rate-limited crawling. The skill outputs a tidy directory of reference pages plus raw JSON and summary metadata for downstream processing.

How this skill works

First, the scraper validates the target: URL accessibility, robots.txt, selector viability, and a writable output path. Next, it generates a config with base_url, CSS selectors, include/exclude URL patterns, categories, and rate limits. The scraper then crawls pages (via the sitemap or from the starting URL), extracts main content and code blocks using the configured selectors, categorizes pages, saves structured markdown and raw JSON, and records checkpoints so interrupted runs can resume.

When to use it

  • Converting a docs site into local, searchable reference files for an LLM skill
  • Building offline archives or knowledge bases from API docs or guides
  • Standardizing documentation into categorized markdown for developer portals
  • Preparing source material for agentic workflows or Claude skill training
  • Migrating docs content while preserving code samples and structure

Best practices

  • Inspect the site manually before running: test with curl -I and check robots.txt
  • Define precise selectors for main content, titles, and code blocks to avoid noise
  • Use include/exclude URL patterns and category keywords to reduce misclassification
  • Respect rate limits and start with low concurrency; increase only after testing
  • Enable checkpoints for large crawls and verify output structure before bulk retries

Example use cases

  • Scrape a library's API docs into categorized markdown for offline reference
  • Harvest tutorial and guide pages to bootstrap a Claude skill knowledge base
  • Create a searchable developer handbook from scattered docs and how-tos
  • Migrate multiple versioned docs into a single standardized folder layout
  • Extract code examples and save raw JSON for downstream analysis or testing

FAQ

What do I do if selectors fail to match content?

Pause the scrape, inspect the page structure with dev tools, refine the main_content and title selectors, then resume from the last checkpoint.

How does the skill avoid getting rate limited?

Configure a conservative rate_limit and reduce workers; check robots.txt, and increase delays if you encounter 429 responses or repeated failures.