
This skill extracts clean, structured content from web pages and returns it in Markdown, JSON, or plain text, saving time and improving data quality.

npx playbooks add skill codingheader/myskills --skill 21pounder-web-scrape

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: web-scrape
description: Intelligent web scraper with content extraction, multiple output formats, and error handling
version: 3.0.0
---

# Web Scraping Skill v3.0

## Usage

```
/web-scrape <url> [options]
```

**Options:**
- `--format=markdown|json|text` - Output format (default: markdown)
- `--full` - Include full page content (skip smart extraction)
- `--screenshot` - Also save a screenshot
- `--scroll` - Scroll to load dynamic content (infinite scroll pages)

**Examples:**
```
/web-scrape https://example.com/article
/web-scrape https://news.site.com/story --format=json
/web-scrape https://spa-app.com/page --scroll --screenshot
```

---

## Execution Flow

### Phase 1: Navigate and Load

```
1. mcp__playwright__browser_navigate
   url: "<target URL>"

2. mcp__playwright__browser_wait_for
   time: 2  (allow initial render)
```

**If `--scroll` option:** Execute scroll sequence to trigger lazy loading:
```
3. mcp__playwright__browser_evaluate
   function: "async () => {
     // Scroll to the bottom three times, pausing 1s per pass
     // so lazy-loaded content has time to request and render.
     for (let i = 0; i < 3; i++) {
       window.scrollTo(0, document.body.scrollHeight);
       await new Promise(r => setTimeout(r, 1000));
     }
     window.scrollTo(0, 0);  // return to top before capture
   }"
```

### Phase 2: Capture Content

```
4. mcp__playwright__browser_snapshot
   → Returns full accessibility tree with all text content
```

**If `--screenshot` option:**
```
5. mcp__playwright__browser_take_screenshot
   filename: "scraped_<domain>_<timestamp>.png"
   fullPage: true
```
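
For reference, one way to build that filename (a sketch; the helper name is illustrative, since the agent composes the name itself rather than calling a tool API):

```javascript
// Illustrative helper matching the scraped_<domain>_<timestamp>.png pattern.
function screenshotName(url) {
  const domain = new URL(url).hostname.replace(/^www\./, "");
  const ts = new Date().toISOString().replace(/[:.]/g, "-");
  return `scraped_${domain}_${ts}.png`;
}
```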

### Phase 3: Close Browser

```
6. mcp__playwright__browser_close
```

---

## Smart Content Extraction

After getting the snapshot, apply intelligent extraction:

### Step 1: Identify Content Type

| Page Type | Indicators | Extraction Strategy |
|-----------|------------|---------------------|
| **Article/Blog** | `<article>`, long paragraphs, date/author | Extract main article body |
| **Product Page** | Price, "Add to Cart", specs | Extract title, price, description, specs |
| **Documentation** | Code blocks, headings hierarchy | Preserve structure and code |
| **List/Search** | Repeated item patterns | Extract as structured list |
| **Landing Page** | Hero section, CTAs | Extract key messaging |
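
One way to approximate this classification over the snapshot text — a minimal sketch, assuming the text is available as a plain string; keywords and thresholds are assumptions to tune against real pages:

```javascript
// Illustrative page-type classifier; not a definitive taxonomy.
function classifyPage(text) {
  const t = text.toLowerCase();
  if (/add to cart|add to basket|(in|out of) stock/.test(t)) return "product";
  if (/results for|sort by|page \d+ of \d+/.test(t)) return "list";
  if (/api reference|installation|usage:/.test(t)) return "docs";
  // Several long paragraphs suggest an article or blog post.
  const longLines = text.split("\n").filter(l => l.split(/\s+/).length > 40);
  if (longLines.length >= 3) return "article";
  return "landing";
}
```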

### Step 2: Filter Noise

**ALWAYS REMOVE these elements from output:**
- Navigation menus and breadcrumbs
- Footer content (copyright, links)
- Sidebars (ads, related articles, social links)
- Cookie banners and popups
- Comments section (unless specifically requested)
- Share buttons and social widgets
- Login/signup prompts
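
If it helps to prune the live DOM before snapshotting, a sketch like the following could be passed to browser_evaluate; the selector list is an assumption and far from exhaustive:

```javascript
// Illustrative noise pruning before capture; real pages vary widely.
() => {
  const noise = [
    "nav", "footer", "aside",
    '[class*="cookie"]', '[class*="popup"]',
    '[class*="share"]', '[class*="social"]', "#comments",
  ];
  noise.forEach(sel =>
    document.querySelectorAll(sel).forEach(el => el.remove())
  );
  return document.title; // sanity check: page still responds
}
```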

### Step 3: Structure the Content

**For Articles:**
```markdown
# [Title]

**Source:** [URL]
**Date:** [if available]
**Author:** [if available]

---

[Main content in clean markdown]
```

**For Product Pages:**
```markdown
# [Product Name]

**Price:** [price]
**Availability:** [in stock/out of stock]

## Description
[product description]

## Specifications
| Spec | Value |
|------|-------|
| ... | ... |
```

---

## Output Formats

### Markdown (default)
Clean, readable markdown with proper headings, lists, and formatting.

### JSON
```json
{
  "url": "https://...",
  "title": "Page Title",
  "type": "article|product|docs|list",
  "content": {
    "main": "...",
    "metadata": {}
  },
  "extracted_at": "ISO timestamp"
}
```
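
Assembling that shape from extracted pieces is straightforward — a sketch, with the helper name ours rather than part of the skill:

```javascript
// Illustrative builder for the JSON output shape above.
function toJsonOutput(url, title, type, main, metadata = {}) {
  return {
    url,
    title,
    type, // "article" | "product" | "docs" | "list"
    content: { main, metadata },
    extracted_at: new Date().toISOString(),
  };
}
```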

### Text
Plain text with minimal formatting, suitable for further processing.

---

## Error Handling

### Navigation Errors

| Error | Detection | Action |
|-------|-----------|--------|
| **Timeout** | Page doesn't load in 30s | Report error, suggest retry |
| **404 Not Found** | "404" in title/content | Report "Page not found" |
| **403 Forbidden** | "403", "Access Denied" | Report access restriction |
| **CAPTCHA** | "captcha", "verify you're human" | Report CAPTCHA detected, cannot proceed |
| **Paywall** | "subscribe", "premium content" | Extract visible content, note paywall |
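
A minimal sketch of this detection over the captured title and visible text — the patterns are assumptions drawn from the table above, not a complete classifier:

```javascript
// Illustrative block detection; tune patterns per site.
function detectBlock(title, bodyText) {
  const t = `${title}\n${bodyText}`.toLowerCase();
  if (/\b404\b|page not found/.test(t)) return "404";
  if (/\b403\b|access denied/.test(t)) return "403";
  if (/captcha|verify you'?re human/.test(t)) return "captcha";
  if (/subscribe|premium content/.test(t)) return "paywall";
  return null; // nothing detected
}
```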

### Recovery Actions

```
If page load fails:
1. Report the specific error to user
2. Suggest: "Try again?" or "Different URL?"
3. Close browser cleanly

If content is blocked:
1. Report what was detected (CAPTCHA/paywall/geo-block)
2. Extract any available preview content
3. Suggest alternatives if applicable
```

---

## Advanced Scenarios

### Single Page Applications (SPA)
```
1. Navigate to URL
2. Wait longer (3-5 seconds) for JS hydration
3. Use browser_wait_for with specific text if known
4. Then snapshot
```
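
When no known text exists for browser_wait_for, a readiness probe via browser_evaluate is one option — a sketch, with the root selectors being assumptions that vary by framework:

```javascript
// Illustrative hydration probe: poll until the app root has real text.
async () => {
  for (let i = 0; i < 10; i++) {
    const root = document.querySelector("#root, #app, main") || document.body;
    if (root.innerText.trim().length > 200) return true;
    await new Promise(r => setTimeout(r, 500));
  }
  return false; // still sparse after ~5s; report slow hydration
}
```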

### Infinite Scroll Pages
```
1. Navigate
2. Execute scroll loop (see Phase 1)
3. Snapshot after scrolling completes
```

### Pages with Click-to-Reveal Content
```
1. Snapshot first to identify clickable elements
2. Use browser_click on "Read more" / "Show all" buttons
3. Wait briefly
4. Snapshot again for full content
```
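
Where individual browser_click calls are impractical, one alternative is a single browser_evaluate pass — a sketch, assuming the buttons carry labels like these:

```javascript
// Illustrative one-pass expansion of click-to-reveal content.
async () => {
  const labels = ["read more", "show more", "show all", "expand"];
  const targets = [...document.querySelectorAll("button, a")].filter(el =>
    labels.some(l => el.textContent.trim().toLowerCase().startsWith(l))
  );
  for (const el of targets) {
    el.click();
    await new Promise(r => setTimeout(r, 300)); // let content render
  }
  return targets.length; // how many reveals were triggered
}
```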

### Multi-page Articles
```
1. Scrape first page
2. Identify "Next" or pagination links
3. Ask user: "Article has X pages. Scrape all?"
4. If yes, iterate through pages and combine
```
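
Identifying the pagination link can also be done in-page via browser_evaluate — a sketch whose rel/text heuristics are assumptions:

```javascript
// Illustrative "Next" link probe for step 2 above.
() => {
  const byRel = document.querySelector('a[rel="next"]');
  if (byRel) return byRel.href;
  const byText = [...document.querySelectorAll("a")].find(a =>
    /^(next|›|»)/i.test(a.textContent.trim())
  );
  return byText ? byText.href : null;
}
```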

---

## Performance Guidelines

| Metric | Target | How |
|--------|--------|-----|
| **Speed** | < 15 seconds | Minimal waits, parallel where possible |
| **Token Usage** | < 5000 tokens | Smart extraction, not full DOM |
| **Reliability** | > 95% success | Proper error handling |

---

## Security Notes

- Never execute arbitrary JavaScript from the page
- Don't follow redirects to suspicious domains
- Don't submit forms or click login buttons
- Don't scrape pages that require authentication (unless the user explicitly supplies a credential flow)
- Respect robots.txt whenever the user asks you to

---

## Quick Reference

**Minimum viable scrape (4 tool calls):**
```
1. browser_navigate → 2. browser_wait_for → 3. browser_snapshot → 4. browser_close
```

**Full-featured scrape (with scroll + screenshot):**
```
1. browser_navigate
2. browser_wait_for
3. browser_evaluate (scroll)
4. browser_snapshot
5. browser_take_screenshot
6. browser_close
```

Remember: The goal is to deliver **clean, useful content** to the user, not raw HTML/DOM dumps.

Overview

This skill is an intelligent web scraper that navigates pages, captures content, and returns cleaned, structured outputs in Markdown, JSON, or plain text. It supports SPA and infinite-scroll pages, optional screenshots, and robust error handling to surface problems like CAPTCHAs or paywalls. The focus is delivering readable, noise-free content rather than raw DOM dumps.

How this skill works

The skill opens a headless browser, waits for page render, optionally performs a scroll sequence or interacts with reveal buttons, then captures an accessibility snapshot and (optionally) a full-page screenshot. It applies smart extraction rules to detect page type (article, product, docs, list, landing), strip navigation/ads/popups, and structure the result into Markdown, JSON, or text. Errors such as timeouts, 404/403, CAPTCHAs, and paywalls are detected and reported with suggested recovery actions.

When to use it

  • Extract a clean article or blog post for summarization or archiving
  • Collect product details (title, price, specs) from e-commerce pages
  • Save structured documentation or code snippets with preserved headings
  • Scrape search results or listing pages into a structured list
  • Capture SPA or infinite-scroll pages where dynamic loading is required

Best practices

  • Prefer smart extraction (default) to limit token usage and remove noise; use --full only when you need the entire page
  • Use --scroll for infinite-scroll pages and --screenshot when a visual record is required
  • Provide the exact URL and, for SPAs, a known text hint (e.g., the article title) so hydration can be awaited reliably
  • Respect site restrictions: avoid authenticated areas, don't submit forms or click login buttons, and follow robots.txt when requested
  • If scraping multi-page content, confirm before iterating through pagination to avoid unnecessary requests

Example use cases

  • Quickly convert a news article to clean Markdown for content curation
  • Pull product specs and price from a product page into JSON for comparison tools
  • Capture API documentation or tutorials with code blocks preserved for developer notes
  • Aggregate listing pages into structured JSON for downstream processing
  • Detect and report access issues like paywalls or CAPTCHAs when automated scraping cannot proceed

FAQ

What output formats are available?

Markdown (default), JSON, and plain text are supported; choose via --format.

How does it handle infinite-scroll or SPA pages?

Use --scroll to run a scroll loop before snapshot. For SPAs, the scraper waits longer or for specific text to ensure hydration before capturing.

What happens if a CAPTCHA or paywall is detected?

The tool reports the detection, extracts any visible preview content, and suggests alternatives; it does not bypass CAPTCHAs or paywalls.