
web-scraper skill

/web-scraper

This skill fetches web pages and converts HTML to clean markdown, enabling you to read articles and extract content from URLs.

npx playbooks add skill zephyrwang6/myskill --skill web-scraper

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
2.0 KB
---
name: web-scraper
description: Fetch and extract content from web pages, converting HTML to clean markdown. Use when users want to read web articles, extract information from URLs, scrape web content, or when the built-in WebFetch tool fails due to network restrictions. Trigger when user provides URLs to read, asks to fetch web content, or needs to extract text from websites.
---

# Web Scraper

Fetch web page content and convert it to clean markdown.

## Usage

Run the fetch script to get web content:

```bash
python3 scripts/fetch_url.py <url> [options]
```

### Options

- `--timeout <seconds>`: Request timeout (default: 30)
- `--max-length <chars>`: Maximum output length (default: 100000)
- `--raw`: Output raw HTML instead of markdown

### Examples

**Fetch single URL:**
```bash
python3 scripts/fetch_url.py "https://example.com/article"
```

**Fetch with custom timeout:**
```bash
python3 scripts/fetch_url.py "https://example.com/article" --timeout 60
```
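
**Limit output length or inspect raw HTML** (both `--max-length` and `--raw` are documented above):
```bash
python3 scripts/fetch_url.py "https://example.com/article" --max-length 20000
python3 scripts/fetch_url.py "https://example.com/article" --raw
```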

**Fetch multiple URLs in parallel:**
```bash
for url in "https://url1.com" "https://url2.com"; do
  python3 scripts/fetch_url.py "$url" &
done
wait
```

## Workflow

1. **Single URL**: Run `fetch_url.py` with the URL
2. **Multiple URLs**: Run multiple fetch commands in parallel using background processes
3. **Handle errors**: If a URL fails, check the following (a retry sketch is shown below):
   - Network connectivity
   - URL validity
   - Whether the site blocks automated requests (try a different User-Agent or browser automation)
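
Where a fetch needs a retry, a minimal sketch is below. It assumes `fetch_url.py` exits with a nonzero status on failure, which this document does not state explicitly:

```bash
# Retry up to 3 times, raising the timeout on each attempt
url="https://example.com/article"
for attempt in 1 2 3; do
  if python3 scripts/fetch_url.py "$url" --timeout $((30 * attempt)); then
    break
  fi
  echo "Attempt $attempt failed for $url" >&2
done
```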

## Output Format

The script converts HTML to clean markdown:
- Headings → `#`, `##`, `###`, etc.
- Lists → `-` for unordered, `1.` for ordered
- Bold/Italic → `**bold**`, `*italic*`
- Code blocks preserved
- Navigation, footer, and ads removed
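
As a hypothetical before/after (not the script's literal output), HTML such as:

```html
<h2>Section</h2>
<p><strong>Bold</strong> and <em>italic</em> text</p>
<ul><li>First item</li></ul>
```

would come out roughly as:

```markdown
## Section

**Bold** and *italic* text

- First item
```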

## Troubleshooting

**403 Forbidden**: The website blocks automated requests. Consider:
- Trying a different User-Agent, or browser automation for sites that also require JavaScript rendering (not supported by this script)
- Accessing from a different network

**Timeout errors**: Increase timeout with `--timeout 60`

**Empty content**: Website may require JavaScript to render content
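
A quick way to check for the JavaScript case is to fetch with the documented `--raw` flag and see whether the page is mostly scripts (the URL here is hypothetical):

```bash
# Count <script> tags in the raw HTML; a high count alongside empty
# markdown output suggests the content is rendered client-side.
python3 scripts/fetch_url.py "https://example.com/spa-page" --raw | grep -o "<script" | wc -l
```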

Overview

This skill fetches web page content and converts HTML into clean, readable Markdown. It extracts article content while stripping navigation, footers, and ads to deliver focused text suitable for reading or downstream processing. Use it when you need plain-text versions of web pages, or when built-in web fetchers fail due to environment restrictions.

How this skill works

The tool requests a URL, parses the returned HTML, and extracts main article elements such as headings, paragraphs, lists, and code blocks. It then maps HTML structures to Markdown syntax (headings, lists, bold/italic, fenced code) and removes common noise like navigation bars and advertisements. Options control request timeout, maximum output length, and raw-HTML output when needed.
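
One rough way to see the boilerplate removal is to compare output sizes with and without the documented --raw flag; the converted Markdown should be noticeably smaller than the raw HTML when navigation, footers, and ads are stripped:

```bash
# Size of the full HTML, including nav, footer, ads, and scripts
python3 scripts/fetch_url.py "https://example.com/article" --raw | wc -c
# Size of the extracted markdown, typically much smaller
python3 scripts/fetch_url.py "https://example.com/article" | wc -c
```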

When to use it

  • You want a clean, Markdown version of an article for reading or note-taking.
  • You need to extract textual content from a URL for analysis or summarization.
  • The environment restricts the built-in WebFetch tool or other fetch methods fail.
  • You are scraping multiple pages and want a simple script to run in parallel.
  • A website returns HTML but you need to strip boilerplate (nav, footer, ads).

Best practices

  • Increase --timeout for slow sites or large pages to avoid premature errors.
  • Use --max-length to limit output size when processing many pages or long articles.
  • If you get 403 responses, try changing the User-Agent or use browser automation for JS-heavy sites.
  • Run multiple fetches in parallel with background processes when collecting many URLs (see the sketch after this list).
  • Verify target pages don’t require JavaScript rendering; the scraper does not run a browser.
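
A combined sketch of these practices, fetching several pages in parallel with a longer timeout, a length cap, and one output file per URL (the file naming is illustrative):

```bash
for url in "https://example.com/a" "https://example.com/b"; do
  out="$(basename "$url").md"   # e.g. a.md, b.md
  python3 scripts/fetch_url.py "$url" --timeout 60 --max-length 50000 > "$out" &
done
wait
```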

Example use cases

  • Fetch a single article: convert https://example.com/article to Markdown for archiving.
  • Batch scrape: run multiple fetch processes in parallel to collect a list of news pages.
  • Content extraction for NLP: produce plain text for summarization, topic modeling, or entity extraction.
  • Troubleshooting fetch failures: increase timeout or switch to raw HTML to debug parsing issues.

FAQ

What options control request behavior?

You can set --timeout (seconds) and --max-length (characters). Use --raw to output raw HTML instead of Markdown.

Why am I getting empty content or missing text?

Many sites require JavaScript to render content. This scraper does not execute JS, so try a browser-based approach for those pages.

How do I handle 403 Forbidden errors?

403 often means the site blocks automated requests. Try a different User-Agent, access from another network, or use browser automation, which drives a real browser and is harder to block.
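
As a fallback sketch for debugging (this uses curl rather than the skill's script, and the User-Agent string is only an example):

```bash
# Fetch the raw HTML with a browser-like User-Agent, following redirects
curl -sL -A "Mozilla/5.0 (X11; Linux x86_64)" "https://example.com/article" -o article.html
```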