
web-scraper skill

/skills/tools/web-scraper

This skill helps you extract structured data from web pages using CSS selectors with rate limiting and pagination support.

npx playbooks add skill aidotnet/moyucode --skill web-scraper

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
---
name: web-scraper
description: Extract data from web pages using CSS selectors, with support for pagination, rate limiting, and multiple output formats.
metadata:
  short-description: Scrape data from websites
source:
  repository: https://github.com/cheeriojs/cheerio
  license: MIT
---

# Web Scraper Tool

## Description
Extract structured data from web pages using CSS selectors with rate limiting and pagination support.

## Trigger
- `/scrape` command
- User requests web data extraction
- User needs to parse HTML

## Usage

```bash
# Scrape single page
python scripts/web_scraper.py --url "https://example.com" --selector ".item" --output data.json

# Scrape with multiple selectors
python scripts/web_scraper.py --url "https://example.com" --selectors "title:.title,price:.price,link:a@href"

# Scrape multiple pages
python scripts/web_scraper.py --urls urls.txt --selector ".product" --output products.json --delay 2
```
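
The `field:selector@attr` spec passed to `--selectors` reads as "extract this field with this CSS selector, optionally taking an attribute instead of the element text". The sketch below shows one way such a spec could be parsed and applied with BeautifulSoup; it is illustrative only, not the bundled script's actual source, and the helper names are made up.

```python
# Illustrative parser for a "field:selector@attr" spec such as
# "title:.title,price:.price,link:a@href". Not the script's real code.
from bs4 import BeautifulSoup

def parse_selector_spec(spec):
    """Map field name -> (css selector, optional attribute)."""
    fields = {}
    for part in spec.split(","):
        name, selector = part.split(":", 1)
        attr = None
        if "@" in selector:
            selector, attr = selector.rsplit("@", 1)
        fields[name] = (selector, attr)
    return fields

def extract_fields(html, spec):
    """Apply each selector to the page and return one record."""
    soup = BeautifulSoup(html, "html.parser")
    record = {}
    for name, (selector, attr) in parse_selector_spec(spec).items():
        el = soup.select_one(selector)
        if el is None:
            record[name] = None
        elif attr:
            record[name] = el.get(attr)          # e.g. href for "a@href"
        else:
            record[name] = el.get_text(strip=True)
    return record
```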

## Tags
`scraping`, `web`, `html`, `data-extraction`, `automation`

## Compatibility
- Codex: ✅
- Claude Code: ✅

Overview

This skill extracts structured data from web pages using CSS selectors, with built-in support for pagination, rate limiting, and multiple output formats. It is designed for quick, repeatable data extraction from static HTML pages and simple listing sites. Use it to convert page elements into JSON, CSV, or other structured outputs for analysis or automation.

How this skill works

You provide one or more CSS selectors that map to fields on the target pages. The tool fetches each page, applies the selectors to the document, follows pagination rules when configured, and waits a configurable delay between requests. Results can be emitted as JSON, CSV, or streamed output, and the tool can also take a list of URLs to process in batch.
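
A minimal sketch of that fetch, select, and delay loop, assuming the requests and BeautifulSoup libraries (the bundled script may be implemented differently):

```python
# Minimal sketch of the fetch -> select -> delay loop described above.
import time
import requests
from bs4 import BeautifulSoup

def scrape(urls, selector, delay=2.0):
    results = []
    for url in urls:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "html.parser")
        # One result per element matched by the CSS selector.
        results.extend(el.get_text(strip=True) for el in soup.select(selector))
        time.sleep(delay)  # rate limiting between requests
    return results
```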

When to use it

  • Extracting product, article, or listing data from static HTML pages
  • Converting repeated page structures into CSV or JSON for analysis
  • Automating data collection across paginated search or category pages
  • Scraping multiple URLs from a list with controlled request rate
  • Rapid prototyping of small scraping jobs without a headless browser

Best practices

  • Start with specific CSS selectors and test them on a single page before batching (see the sketch after this list)
  • Use pagination rules or URL patterns to avoid missing items across pages
  • Set a reasonable delay between requests to avoid throttling or bans
  • Validate output format early (JSON keys or CSV columns) to simplify downstream processing
  • Respect robots.txt and site terms; limit load on target servers
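
A quick way to sanity-check a selector on one page before running a batch (a sketch assuming requests and BeautifulSoup; the URL and selector are placeholders):

```python
# Single-page selector check before batching; names are placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"   # page you intend to scrape
selector = ".item"            # selector you plan to batch with

soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
matches = soup.select(selector)
print(f"{len(matches)} matches")
for el in matches[:5]:
    print(el.get_text(strip=True))  # eyeball the first few results
```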

Example use cases

  • Scrape product titles, prices, and links from an e-commerce category into products.json
  • Collect article headlines and publish dates across paginated blog archives into a CSV
  • Process a list of candidate profile pages from urls.txt and extract contact fields
  • Convert HTML tables or repeated card components into structured JSON for reporting (see the sketch after this list)
  • Run scheduled batches to update a dataset while throttling requests with a delay
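
For the table-to-JSON case, here is a sketch of the conversion using BeautifulSoup only; it is illustrative, and the skill's own output handling may differ.

```python
# Turn an HTML table into JSON records: header row -> keys, body rows -> values.
import json
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget A</td><td>9.99</td></tr>
  <tr><td>Widget B</td><td>14.50</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table tr")
headers = [th.get_text(strip=True) for th in rows[0].select("th")]
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in row.select("td"))))
    for row in rows[1:]
]
print(json.dumps(records, indent=2))
```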

FAQ

Which output formats are supported?

Common outputs include JSON and CSV; the tool can also stream results or write to a specified file.
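
As a sketch of the difference, the same flat records could be written to either format as below, assuming each record shares the same keys; this is not the tool's exact output code.

```python
# Writing identical records as JSON and as CSV.
import csv
import json

records = [
    {"title": "Widget A", "price": "9.99"},
    {"title": "Widget B", "price": "14.50"},
]

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

with open("data.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
```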

How does pagination work?

You can supply a pagination selector or URL pattern. The scraper follows pages until no new items are found or a page limit is reached.
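
A sketch of that stopping logic, assuming a CSS selector for the "next page" link; the function and parameter names are illustrative, not the skill's actual API.

```python
# Follow a "next page" link until no new items, no link, or the page limit.
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def scrape_paginated(start_url, item_selector, next_selector,
                     max_pages=10, delay=2.0):
    items, url, pages = [], start_url, 0
    while url and pages < max_pages:
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        new = [el.get_text(strip=True) for el in soup.select(item_selector)]
        if not new:               # stop when a page yields no new items
            break
        items.extend(new)
        next_link = soup.select_one(next_selector)
        href = next_link.get("href") if next_link else None
        url = urljoin(url, href) if href else None
        pages += 1
        time.sleep(delay)         # rate limiting between page fetches
    return items
```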