
web-scraper-as-a-service skill

/skills/seanwyngaard/web-scraper-as-a-service

This skill helps you generate client-ready web scrapers from briefs, run them, clean data, and deliver packaged results.

npx playbooks add skill openclaw/skills --skill web-scraper-as-a-service

---
name: web-scraper-as-a-service
description: Build client-ready web scrapers with clean data output. Use when creating scrapers for clients, extracting data from websites, or delivering scraping projects.
argument-hint: "[target-url-or-brief]"
allowed-tools: Read, Write, Edit, Grep, Glob, Bash, WebFetch, WebSearch
---

# Web Scraper as a Service

Turn scraping briefs into deliverable scraping projects. Generates the scraper, runs it, cleans the data, and packages everything for the client.

## How to Use

```
/web-scraper-as-a-service "Scrape all products from example-store.com — need name, price, description, images. CSV output."
/web-scraper-as-a-service https://example.com --fields "title,price,rating,url" --format csv
/web-scraper-as-a-service brief.txt
```

## Scraper Generation Pipeline

### Step 1: Analyze the Target

Before writing any code:

1. **Fetch the target URL** to understand the page structure
2. **Identify**:
   - Is the site server-rendered (static HTML) or client-rendered (JavaScript/SPA)?
   - What anti-scraping measures are visible? (Cloudflare, CAPTCHAs, rate limits)
   - Pagination pattern (URL params, infinite scroll, load more button)
   - Data structure (product cards, table rows, list items)
   - Total estimated volume (number of pages/items)
3. **Choose the right tool**:
   - Static HTML → Python + `requests` + `BeautifulSoup`
   - JavaScript-rendered → Python + `playwright`
   - API available → Direct API calls (check network tab patterns)
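
The static-versus-rendered check above can be approximated by looking at the raw HTML for a value you can see in the browser. The following is a minimal sketch, assuming you already have the raw response body and one sample string from the rendered page. `choose_engine` and its return labels are hypothetical names used only for illustration, and the heuristic will miss edge cases such as data embedded in inline JSON.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def choose_engine(raw_html: str, sample_text: str) -> str:
    """Return a hypothetical engine label based on a simple heuristic.

    If text visible in the browser is already in the raw HTML, the page is
    probably server-rendered; otherwise it is probably rendered by JavaScript.
    """
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()  # flush any buffered text
    visible = " ".join(" ".join(parser.chunks).split())
    if sample_text in visible:
        return "requests+beautifulsoup"
    return "playwright"
```

A sample string that appears only inside a `<script>` tag is treated as not visible, which is why such a page is routed to Playwright here.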

### Step 2: Build the Scraper

Generate a complete Python script in a `scraper/` directory:

```
scraper/
  scrape.py           # Main scraper script
  requirements.txt    # Dependencies
  config.json         # Target URLs, fields, settings
  README.md           # Setup and usage instructions for client
```

**`scrape.py` must include**:

```python
# Required features in every scraper:

# 1. Configuration
import json
with open('config.json', encoding='utf-8') as f:  # close the file after reading
    config = json.load(f)

# 2. Rate limiting (ALWAYS — be respectful)
import time
DELAY_BETWEEN_REQUESTS = 2  # seconds, adjustable in config

# 3. Retry logic
MAX_RETRIES = 3
RETRY_DELAY = 5

# 4. User-Agent rotation
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    # ... at least 5 user agents
]

# 5. Progress tracking
def report_progress(current, total, items_collected):
    print(f"Scraping page {current}/{total}: {items_collected} items collected")

# 6. Error handling
# - Log errors but don't crash on individual page failures
# - Save progress incrementally (don't lose data on crash)
# - Write errors to error_log.txt

# 7. Output
# - Save data incrementally (append to file, don't hold in memory)
# - Support CSV and JSON output
# - Clean and normalize data before saving

# 8. Resume capability
# - Track last successfully scraped page/URL
# - Can resume from where it left off if interrupted
```
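
The rate limiting, retry, and error logging requirements listed above could be combined in a single helper. This is a sketch under stated assumptions, not a definitive implementation: `fetch_with_retry`, `log_to_file`, and the injected `fetch` callable are hypothetical names, and the user-agent strings are placeholders rather than real browser values.

```python
import random
import time
from typing import Callable, Optional

# Placeholder values; replace with the agents you actually intend to send.
USER_AGENTS = ["example-agent/1.0", "example-agent/1.1"]


def log_to_file(message: str, path: str = "error_log.txt") -> None:
    """Append one error line to the log file."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(message + "\n")


def fetch_with_retry(
    url: str,
    fetch: Callable[[str, dict], str],
    max_retries: int = 3,
    retry_delay: float = 5.0,
    delay_between_requests: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    log_error: Callable[[str], None] = log_to_file,
) -> Optional[str]:
    """Call fetch(url, headers) up to max_retries times.

    Returns the response body, or None if every attempt failed, so one bad
    page does not crash the whole run.
    """
    for attempt in range(1, max_retries + 1):
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            body = fetch(url, headers)
        except Exception as exc:
            log_error(f"{url}\tattempt {attempt}\t{exc!r}")
            if attempt < max_retries:
                sleep(retry_delay)  # back off before the next attempt
        else:
            sleep(delay_between_requests)  # rate limit after each success
            return body
    return None
```

In a real scraper, `fetch` might wrap `requests.get(url, headers=headers, timeout=30).text`. Passing `sleep` and `log_error` as parameters keeps the helper testable without waiting or touching the disk.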

### Step 3: Data Cleaning

After scraping, clean the data:

1. **Remove duplicates** (by unique identifier or composite key)
2. **Normalize text** (strip extra whitespace, fix encoding issues, consistent capitalization)
3. **Validate data** (no empty required fields, prices are numbers, URLs are valid)
4. **Standardize formats** (dates to ISO 8601, currency to numbers, consistent units)
5. **Generate data quality report**:
   ```
   Data Quality Report
   ───────────────────
   Total records: 2,487
   Duplicates removed: 13
   Empty fields filled: 0
   Fields with issues: price (3 records had non-numeric values — cleaned)
   Completeness: 99.5%
   ```
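
The cleaning steps above can be sketched in a few lines of standard-library Python. This is an illustration only: `parse_price`, `clean_records`, the `url` deduplication key, and the report keys are hypothetical, and the price parser assumes `.` is the decimal separator and `,` a thousands separator.

```python
import re


def parse_price(value):
    """Convert strings such as '$1,299.00' to a float; return None on failure."""
    if isinstance(value, (int, float)):
        return float(value)
    if not isinstance(value, str):
        return None
    digits = re.sub(r"[^\d.]", "", value)  # drop currency symbols and commas
    try:
        return float(digits)
    except ValueError:
        return None


def clean_records(records, key="url", required=("name", "price")):
    """Deduplicate by key, normalize whitespace, and parse prices.

    Returns (clean_rows, report) where report holds simple quality metrics.
    """
    seen = set()
    clean = []
    duplicates = 0
    for raw in records:
        # Collapse runs of whitespace in every string field.
        row = {k: " ".join(v.split()) if isinstance(v, str) else v
               for k, v in raw.items()}
        if row.get(key) in seen:
            duplicates += 1
            continue
        seen.add(row.get(key))
        row["price"] = parse_price(row.get("price"))
        clean.append(row)

    filled = sum(1 for row in clean for field in required
                 if row.get(field) not in (None, ""))
    possible = len(clean) * len(required)
    report = {
        "total_records": len(clean),
        "duplicates_removed": duplicates,
        "completeness_pct": round(100 * filled / possible, 1) if possible else 0.0,
    }
    return clean, report
```

Completeness here is the share of required fields that are non-empty after cleaning, which is one reasonable definition among several.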

### Step 4: Client Deliverable Package

Generate a complete deliverable:

```
delivery/
  data.csv                    # Clean data in requested format
  data.json                   # JSON alternative
  data-quality-report.md      # Quality metrics
  scraper-documentation.md    # How the scraper works
  README.md                   # Quick start guide
```
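
Writing the two data files in the package above might look like the following. It is a minimal sketch, assuming the cleaned records are a list of flat dictionaries; `write_delivery` is a hypothetical helper name, and the documentation files would be generated separately.

```python
import csv
import json
from pathlib import Path


def write_delivery(records, fieldnames, out_dir="delivery"):
    """Write data.csv and data.json into out_dir and return both paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    csv_path = out / "data.csv"
    json_path = out / "data.json"

    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)

    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

    return csv_path, json_path
```

`extrasaction="ignore"` drops any keys that are not in `fieldnames`, so the CSV columns stay fixed even if some records carry extra fields.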

**`scraper-documentation.md`** includes:
- What was scraped and from where
- How many records collected
- Data fields and their descriptions
- How to re-run the scraper
- Known limitations
- Date of scraping

### Step 5: Output to User

Present:
1. **Summary**: X records scraped from Y pages, Z% data quality
2. **Sample data**: First 5 rows of the output
3. **File locations**: Where the deliverables are saved
4. **Client handoff notes**: What to tell the client about the data

## Scraper Templates

Based on the target type, use the appropriate template:

### E-commerce Product Scraper
Fields: name, price, original_price, discount, description, images, category, sku, rating, review_count, availability, url

### Real Estate Listings
Fields: address, price, bedrooms, bathrooms, sqft, lot_size, listing_type, agent, description, images, url

### Job Listings
Fields: title, company, location, salary, job_type, description, requirements, posted_date, url

### Directory/Business Listings
Fields: business_name, address, phone, website, category, rating, review_count, hours, description

### News/Blog Articles
Fields: title, author, date, content, tags, url, image
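
One way to make the templates above machine-checkable is to hold the field lists in a dictionary and compare each scraped record against it. The template keys (`ecommerce`, `real_estate`, and so on) and the `missing_fields` helper are hypothetical names chosen for this sketch; the field lists are copied from the templates above.

```python
TEMPLATES = {
    "ecommerce": ["name", "price", "original_price", "discount", "description",
                  "images", "category", "sku", "rating", "review_count",
                  "availability", "url"],
    "real_estate": ["address", "price", "bedrooms", "bathrooms", "sqft",
                    "lot_size", "listing_type", "agent", "description",
                    "images", "url"],
    "jobs": ["title", "company", "location", "salary", "job_type",
             "description", "requirements", "posted_date", "url"],
    "directory": ["business_name", "address", "phone", "website", "category",
                  "rating", "review_count", "hours", "description"],
    "news": ["title", "author", "date", "content", "tags", "url", "image"],
}


def missing_fields(record: dict, template: str) -> list:
    """Return the template fields that are absent from a scraped record."""
    return [field for field in TEMPLATES[template] if field not in record]
```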

## Ethical Scraping Rules

1. **Always respect robots.txt**: check it before scraping
2. **Rate limit**: keep a minimum 2-second delay between requests
3. **Identify yourself**: use a realistic but honest User-Agent
4. **Don't scrape personal data** (emails, phone numbers) unless the client has explicitly authorized it AND the data is publicly displayed
5. **Cache responses**: don't re-scrape pages unnecessarily
6. **Check ToS**: note whether the site's terms prohibit scraping and inform the client
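
The robots.txt and rate-limit rules above can be checked with Python's standard `urllib.robotparser` module. The sketch below assumes the robots.txt body has already been downloaded as a string; `is_allowed` and `polite_delay` are hypothetical helper names, and `example-scraper` is a placeholder user agent.

```python
from urllib.robotparser import RobotFileParser


def _parse(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots.txt permits user_agent to fetch url."""
    return _parse(robots_txt).can_fetch(user_agent, url)


def polite_delay(robots_txt: str, user_agent: str, minimum: float = 2.0) -> float:
    """Use the site's Crawl-delay when it is longer than our own minimum."""
    crawl_delay = _parse(robots_txt).crawl_delay(user_agent)
    return max(minimum, crawl_delay or 0)


robots = "User-agent: *\nCrawl-delay: 10\nDisallow: /private/\n"
print(is_allowed(robots, "example-scraper", "https://example.com/products"))
print(polite_delay(robots, "example-scraper"))
```

Taking the larger of the site's `Crawl-delay` and the 2-second minimum means the scraper never runs faster than either limit.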

Overview

This skill builds client-ready web scrapers and delivers clean, packaged data. It converts a scraping brief into a complete Python scraper, runs it, cleans and validates results, and produces a client handoff package with documentation and a data quality report.

How this skill works

The skill analyzes the target site to choose the right engine (requests/BeautifulSoup for static pages, Playwright for JS-rendered sites, or direct API calls when available). It generates a reusable Python scraper with rate limiting, retry logic, user-agent rotation, progress tracking, error logging, incremental output and resume capability. After running the scraper it deduplicates, normalizes, validates and standardizes data, then builds a delivery bundle containing data files, docs, and a quality report.

When to use it

  • Build a finished scraper packaged for a client or stakeholder.
  • Extract structured data from product pages, listings, articles or directories.
  • Deliver repeatable scrapers with resume and error handling for large volumes.
  • Produce validated CSV/JSON output plus a data quality report.
  • Create a documented handoff for non-technical clients or teams.

Best practices

  • Run an initial reconnaissance: check robots.txt, page rendering type, pagination and anti-scraping signals before coding.
  • Honor the ethical rules: keep a minimum 2-second delay, identify requests with a clear User-Agent, and avoid scraping private or personal data without authorization.
  • Save progress incrementally and implement resume logic so long jobs survive interruptions.
  • Validate and normalize output: remove duplicates, standardize dates/currencies, and ensure required fields are present.
  • Provide clear client documentation: how to run, dependencies, limitations, and expected data shape.

Example use cases

  • E-commerce scraping: collect product name, price, images and stock status for price monitoring or marketplace migration.
  • Real estate listings: gather address, price, bedrooms, images and agent info for market analysis.
  • Job aggregator: pull job title, company, location, salary and posting date into a unified feed.
  • News monitoring: extract article title, author, published date and content for research or alerting.
  • Local business directory: compile business name, address, phone and website for lead enrichment.

FAQ

What formats are supported for output?

CSV and JSON are produced by default; files are written incrementally to avoid memory issues.

How are anti-scraping protections handled?

The skill checks whether a site is client- or server-rendered and recommends Playwright for JavaScript-heavy sites. Generated scrapers include rate limiting, user-agent rotation, retries, and error logging. Complex protections such as CAPTCHAs require manual intervention, or client approval before any advanced bypass strategy is attempted.

Can the scraper resume after a crash?

Yes. Scrapers track the last successful page/URL and append output incrementally, so runs can resume without redoing completed work.
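
A rough sketch of how such resume logic could work, assuming page-numbered pagination and a small JSON checkpoint file. `load_checkpoint`, `scrape_pages`, the `last_page` key, and the `scrape_page` callable are hypothetical names for illustration, not part of the skill itself.

```python
import csv
import json
from pathlib import Path


def load_checkpoint(path) -> int:
    """Return the last successfully scraped page, or 0 if starting fresh."""
    checkpoint = Path(path)
    if checkpoint.exists():
        return json.loads(checkpoint.read_text(encoding="utf-8"))["last_page"]
    return 0


def scrape_pages(total_pages, scrape_page, out_csv, checkpoint, fieldnames):
    """Scrape pages after the checkpoint, appending rows as they arrive."""
    start = load_checkpoint(checkpoint) + 1
    new_file = not Path(out_csv).exists()
    with open(out_csv, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()  # header only on the first run
        for page in range(start, total_pages + 1):
            rows = scrape_page(page)  # may raise if the run is interrupted
            writer.writerows(rows)
            f.flush()  # push rows to disk before recording progress
            Path(checkpoint).write_text(
                json.dumps({"last_page": page}), encoding="utf-8"
            )
```

The checkpoint is written only after a page's rows have been flushed, so an interruption between the two steps re-scrapes at most one page; deduplication during cleaning would absorb that overlap.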