This skill extracts clean, structured content from web pages and returns it as Markdown, JSON, or plain text instead of raw HTML dumps.
Install with: `npx playbooks add skill codingheader/myskills --skill 21pounder-web-scrape`
---
name: web-scrape
description: Intelligent web scraper with content extraction, multiple output formats, and error handling
version: 3.0.0
---
# Web Scraping Skill v3.0
## Usage
```
/web-scrape <url> [options]
```
**Options:**
- `--format=markdown|json|text` - Output format (default: markdown)
- `--full` - Include full page content (skip smart extraction)
- `--screenshot` - Also save a screenshot
- `--scroll` - Scroll to load dynamic content (infinite scroll pages)
**Examples:**
```
/web-scrape https://example.com/article
/web-scrape https://news.site.com/story --format=json
/web-scrape https://spa-app.com/page --scroll --screenshot
```
---
## Execution Flow
### Phase 1: Navigate and Load
```
1. mcp__playwright__browser_navigate
   url: "<target URL>"
2. mcp__playwright__browser_wait_for
   time: 2 (seconds; allow initial render)
```
**If the `--scroll` option is set:** Execute a scroll sequence to trigger lazy loading:
```
3. mcp__playwright__browser_evaluate
   function: "async () => {
     // scroll to the bottom three times to trigger lazy loading
     for (let i = 0; i < 3; i++) {
       window.scrollTo(0, document.body.scrollHeight);
       await new Promise(r => setTimeout(r, 1000));
     }
     window.scrollTo(0, 0);  // return to the top before capture
   }"
```
### Phase 2: Capture Content
```
4. mcp__playwright__browser_snapshot
   → Returns full accessibility tree with all text content
```
**If the `--screenshot` option is set:**
```
5. mcp__playwright__browser_take_screenshot
   filename: "scraped_<domain>_<timestamp>.png"
   fullPage: true
```
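The `scraped_<domain>_<timestamp>.png` convention can be built with a small helper; a minimal sketch (the helper name and exact timestamp format are assumptions, not part of the Playwright tool API):
```javascript
// Illustrative: build "scraped_<domain>_<timestamp>.png" from the target URL.
function screenshotFilename(targetUrl) {
  const domain = new URL(targetUrl).hostname.replace(/^www\./, "");
  const timestamp = new Date().toISOString().replace(/[:.]/g, "-");
  return `scraped_${domain}_${timestamp}.png`;
}
// screenshotFilename("https://example.com/article")
// → "scraped_example.com_<ISO timestamp with ':' and '.' replaced by '-'>.png"
```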
### Phase 3: Close Browser
```
6. mcp__playwright__browser_close
```
---
## Smart Content Extraction
After getting the snapshot, apply intelligent extraction:
### Step 1: Identify Content Type
| Page Type | Indicators | Extraction Strategy |
|-----------|------------|---------------------|
| **Article/Blog** | `<article>`, long paragraphs, date/author | Extract main article body |
| **Product Page** | Price, "Add to Cart", specs | Extract title, price, description, specs |
| **Documentation** | Code blocks, headings hierarchy | Preserve structure and code |
| **List/Search** | Repeated item patterns | Extract as structured list |
| **Landing Page** | Hero section, CTAs | Extract key messaging |
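These indicators can be checked with a single `mcp__playwright__browser_evaluate` call before extraction; a heuristic sketch (the selectors and thresholds are assumptions and will need tuning per site):
```javascript
// Heuristic page-type detection (illustrative thresholds, not a specification).
() => {
  const text = document.body.innerText;
  if (document.querySelector("article") && text.length > 2000) return "article";
  if (/add to cart/i.test(text) || document.querySelector('[itemprop="price"]')) return "product";
  if (document.querySelectorAll("pre code").length >= 3) return "docs";
  if (document.querySelectorAll("li h2 a, li h3 a").length >= 5) return "list";
  return "landing";
}
```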
### Step 2: Filter Noise
**ALWAYS REMOVE these elements from output:**
- Navigation menus and breadcrumbs
- Footer content (copyright, links)
- Sidebars (ads, related articles, social links)
- Cookie banners and popups
- Comments section (unless specifically requested)
- Share buttons and social widgets
- Login/signup prompts
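When it is easier to strip noise in the browser than in the snapshot text, the same filtering can run through `browser_evaluate` before Phase 2; a minimal sketch (the selector list is illustrative and incomplete):
```javascript
// Remove common noise elements before taking the snapshot (illustrative selectors).
() => {
  const noise = ["nav", "footer", "aside", "[class*='cookie']", "[class*='popup']",
                 "[class*='share']", "[id*='comment']", "[class*='newsletter']"];
  let removed = 0;
  noise.forEach(sel => document.querySelectorAll(sel).forEach(el => { el.remove(); removed++; }));
  return removed; // number of elements stripped
}
```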
### Step 3: Structure the Content
**For Articles:**
```markdown
# [Title]
**Source:** [URL]
**Date:** [if available]
**Author:** [if available]
---
[Main content in clean markdown]
```
**For Product Pages:**
```markdown
# [Product Name]
**Price:** [price]
**Availability:** [in stock/out of stock]
## Description
[product description]
## Specifications
| Spec | Value |
|------|-------|
| ... | ... |
```
---
## Output Formats
### Markdown (default)
Clean, readable markdown with proper headings, lists, and formatting.
### JSON
```json
{
  "url": "https://...",
  "title": "Page Title",
  "type": "article|product|docs|list",
  "content": {
    "main": "...",
    "metadata": {}
  },
  "extracted_at": "ISO timestamp"
}
```
### Text
Plain text with minimal formatting, suitable for further processing.
---
## Error Handling
### Navigation Errors
| Error | Detection | Action |
|-------|-----------|--------|
| **Timeout** | Page doesn't load in 30s | Report error, suggest retry |
| **404 Not Found** | "404" in title/content | Report "Page not found" |
| **403 Forbidden** | "403", "Access Denied" | Report access restriction |
| **CAPTCHA** | "captcha", "verify you're human" | Report CAPTCHA detected, cannot proceed |
| **Paywall** | "subscribe", "premium content" | Extract visible content, note paywall |
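A sketch of how these conditions can be detected from the rendered page via `browser_evaluate` (the keyword patterns are assumptions, not exhaustive):
```javascript
// Illustrative block/error detection from title + visible text.
() => {
  const text = (document.title + " " + document.body.innerText.slice(0, 3000)).toLowerCase();
  if (/\b404\b|not found/.test(text)) return "not_found";
  if (/\b403\b|access denied/.test(text)) return "forbidden";
  if (/captcha|verify you're human/.test(text)) return "captcha";
  if (/subscribe to continue|premium content/.test(text)) return "paywall";
  return "ok";
}
```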
### Recovery Actions
```
If page load fails:
  1. Report the specific error to the user
  2. Suggest: "Try again?" or "Different URL?"
  3. Close the browser cleanly

If content is blocked:
  1. Report what was detected (CAPTCHA/paywall/geo-block)
  2. Extract any available preview content
  3. Suggest alternatives if applicable
```
---
## Advanced Scenarios
### Single Page Applications (SPA)
```
1. Navigate to URL
2. Wait longer (3-5 seconds) for JS hydration
3. Use browser_wait_for with specific text if known
4. Then snapshot
```
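When no known text is available to wait for, hydration can be approximated by polling for meaningful content; a sketch via `browser_evaluate` (the 500-character threshold and timings are assumptions):
```javascript
// Poll up to ~5s for the page body to fill in (illustrative hydration check).
async () => {
  for (let i = 0; i < 10; i++) {
    if (document.body.innerText.trim().length > 500) return true;
    await new Promise(r => setTimeout(r, 500));
  }
  return false; // still sparse: slow hydration, or a blocked/empty page
}
```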
### Infinite Scroll Pages
```
1. Navigate
2. Execute scroll loop (see Phase 1)
3. Snapshot after scrolling completes
```
### Pages with Click-to-Reveal Content
```
1. Snapshot first to identify clickable elements
2. Use browser_click on "Read more" / "Show all" buttons
3. Wait briefly
4. Snapshot again for full content
```
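As an alternative to clicking elements one by one, all expanders can be triggered in a single `browser_evaluate` pass; a sketch (the button-text patterns are assumptions):
```javascript
// Click common "reveal" controls in one pass (heuristic text matching).
// Note: matching <a> elements may navigate; restrict selectors on real pages.
() => {
  const pattern = /read more|show (more|all)|expand/i;
  let clicked = 0;
  document.querySelectorAll("button, a, [role='button']").forEach(el => {
    if (pattern.test(el.textContent)) { el.click(); clicked++; }
  });
  return clicked; // how many controls were activated
}
```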
### Multi-page Articles
```
1. Scrape first page
2. Identify "Next" or pagination links
3. Ask user: "Article has X pages. Scrape all?"
4. If yes, iterate through pages and combine
```
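Pagination links can be located before prompting the user; a sketch (the `rel="next"` check is standard markup, the text patterns are assumptions):
```javascript
// Find a likely "next page" link, or null if none (heuristic).
() => {
  const byRel = document.querySelector('a[rel="next"]');
  if (byRel) return byRel.href;
  const byText = [...document.querySelectorAll("a")]
    .find(a => /^(next|older posts?|»|→)/i.test(a.textContent.trim()));
  return byText ? byText.href : null;
}
```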
---
## Performance Guidelines
| Metric | Target | How |
|--------|--------|-----|
| **Speed** | < 15 seconds | Minimal waits, parallel where possible |
| **Token Usage** | < 5000 tokens | Smart extraction, not full DOM |
| **Reliability** | > 95% success | Proper error handling |
---
## Security Notes
- Never execute arbitrary JavaScript from the page
- Don't follow redirects to suspicious domains
- Don't submit forms or click login buttons
- Don't scrape pages that require authentication, unless the user explicitly provides a credentials flow
- Respect robots.txt when the user asks you to
---
## Quick Reference
**Minimum viable scrape (4 tool calls):**
```
1. browser_navigate → 2. browser_wait_for → 3. browser_snapshot → 4. browser_close
```
**Full-featured scrape (with scroll + screenshot):**
```
1. browser_navigate
2. browser_wait_for
3. browser_evaluate (scroll)
4. browser_snapshot
5. browser_take_screenshot
6. browser_close
```
Remember: The goal is to deliver **clean, useful content** to the user, not raw HTML/DOM dumps.