This skill helps you generate client-ready web scrapers from briefs, run them, clean data, and deliver packaged results.
Run `npx playbooks add skill openclaw/skills --skill web-scraper-as-a-service` to add this skill to your agents, or review the files below.
---
name: web-scraper-as-a-service
description: Build client-ready web scrapers with clean data output. Use when creating scrapers for clients, extracting data from websites, or delivering scraping projects.
argument-hint: "[target-url-or-brief]"
allowed-tools: Read, Write, Edit, Grep, Glob, Bash, WebFetch, WebSearch
---
# Web Scraper as a Service
Turn scraping briefs into deliverable scraping projects. Generates the scraper, runs it, cleans the data, and packages everything for the client.
## How to Use
```
/web-scraper-as-a-service "Scrape all products from example-store.com — need name, price, description, images. CSV output."
/web-scraper-as-a-service https://example.com --fields "title,price,rating,url" --format csv
/web-scraper-as-a-service brief.txt
```
## Scraper Generation Pipeline
### Step 1: Analyze the Target
Before writing any code:
1. **Fetch the target URL** to understand the page structure
2. **Identify**:
- Is the site server-rendered (static HTML) or client-rendered (JavaScript/SPA)?
- What anti-scraping measures are visible? (Cloudflare, CAPTCHAs, rate limits)
- Pagination pattern (URL params, infinite scroll, load more button)
- Data structure (product cards, table rows, list items)
- Total estimated volume (number of pages/items)
3. **Choose the right tool** (a quick probe like the sketch after this list can confirm which case applies):
- Static HTML → Python + `requests` + `BeautifulSoup`
- JavaScript-rendered → Python + `playwright`
- API available → Direct API calls (check network tab patterns)
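Before committing to a stack, a quick probe can usually confirm whether the data is present in the raw HTML. The sketch below is illustrative only: the URL, marker text, and User-Agent are placeholders, and it assumes `requests` and `beautifulsoup4` are installed.
```python
# Probe whether the target serves its data as static HTML (placeholder URL and text).
import requests
from bs4 import BeautifulSoup

url = "https://example-store.com/products"   # placeholder target
expected_text = "Add to cart"                # text known to appear on the rendered page

resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
soup = BeautifulSoup(resp.text, "html.parser")

if expected_text in soup.get_text():
    print("Data is in the raw HTML: requests + BeautifulSoup should work.")
else:
    print("Data is likely rendered client-side: plan for Playwright instead.")
```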
### Step 2: Build the Scraper
Generate a complete Python scraper in a `scraper/` directory:
```
scraper/
  scrape.py          # Main scraper script
  requirements.txt   # Dependencies
  config.json        # Target URLs, fields, settings
  README.md          # Setup and usage instructions for client
```
**`scrape.py` must include**:
```python
# Required features in every scraper:

# 1. Configuration
import json
with open("config.json") as f:
    config = json.load(f)

# 2. Rate limiting (ALWAYS: be respectful)
import time
DELAY_BETWEEN_REQUESTS = 2  # seconds, adjustable in config

# 3. Retry logic
MAX_RETRIES = 3
RETRY_DELAY = 5

# 4. User-Agent rotation
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36...",
    # ... at least 5 user agents
]

# 5. Progress tracking
print(f"Scraping page {current}/{total} - {items_collected} items collected")

# 6. Error handling
#    - Log errors but don't crash on individual page failures
#    - Save progress incrementally (don't lose data on a crash)
#    - Write errors to error_log.txt

# 7. Output
#    - Save data incrementally (append to file, don't hold everything in memory)
#    - Support CSV and JSON output
#    - Clean and normalize data before saving

# 8. Resume capability
#    - Track the last successfully scraped page/URL
#    - Resume from where it left off if interrupted
```
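As a rough picture of how these pieces fit together, a skeleton along the following lines could be generated. It is a sketch, not the required implementation: the `config.json` keys (`start_urls`, `delay_seconds`, `output_file`, `user_agents`), the `progress.json` resume file, and the `parse_page` selectors are all hypothetical and would be tailored per target.
```python
# Minimal skeleton for a static-HTML target; field extraction is a placeholder.
import csv
import json
import os
import random
import time

import requests
from bs4 import BeautifulSoup

with open("config.json") as f:
    config = json.load(f)

DELAY = config.get("delay_seconds", 2)
MAX_RETRIES = 3
RETRY_DELAY = 5
USER_AGENTS = config.get("user_agents", ["Mozilla/5.0 (Windows NT 10.0; Win64; x64)"])
OUTPUT_FILE = config.get("output_file", "data.csv")
PROGRESS_FILE = "progress.json"


def fetch(url: str) -> str:
    """GET a URL with retries and a rotating User-Agent; log failures, never crash."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            resp = requests.get(url, headers={"User-Agent": random.choice(USER_AGENTS)}, timeout=30)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            with open("error_log.txt", "a") as log:
                log.write(f"{url} attempt {attempt}: {exc}\n")
            time.sleep(RETRY_DELAY)
    return ""


def parse_page(html: str) -> list[dict]:
    """Placeholder extraction: real selectors depend on the target site."""
    soup = BeautifulSoup(html, "html.parser")
    return [{"title": soup.title.string if soup.title else "", "url": ""}]


def main() -> None:
    done = set()
    if os.path.exists(PROGRESS_FILE):          # resume support
        with open(PROGRESS_FILE) as f:
            done = set(json.load(f))

    urls = [u for u in config["start_urls"] if u not in done]
    for i, url in enumerate(urls, 1):
        html = fetch(url)
        if html:
            rows = parse_page(html)
            write_header = not os.path.exists(OUTPUT_FILE)
            with open(OUTPUT_FILE, "a", newline="") as f:   # incremental output
                writer = csv.DictWriter(f, fieldnames=["title", "url"])
                if write_header:
                    writer.writeheader()
                writer.writerows(rows)
        done.add(url)
        with open(PROGRESS_FILE, "w") as f:                 # track progress for resume
            json.dump(sorted(done), f)
        print(f"Scraping page {i}/{len(urls)} - {len(done)} pages completed")
        time.sleep(DELAY)                                   # rate limiting


if __name__ == "__main__":
    main()
```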
### Step 3: Data Cleaning
After scraping, clean the data (a short cleaning sketch follows the sample report below):
1. **Remove duplicates** (by unique identifier or composite key)
2. **Normalize text** (strip extra whitespace, fix encoding issues, consistent capitalization)
3. **Validate data** (no empty required fields, prices are numbers, URLs are valid)
4. **Standardize formats** (dates to ISO 8601, currency to numbers, consistent units)
5. **Generate data quality report**:
```
Data Quality Report
───────────────────
Total records: 2,487
Duplicates removed: 13
Empty fields filled: 0
Fields with issues: price (3 records had non-numeric values — cleaned)
Completeness: 99.5%
```
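A minimal cleaning pass covering steps 1-4 above might look like the following sketch; the `name`, `price`, and `url` fields are hypothetical examples, since the real field list comes from the brief.
```python
# Deduplicate, normalize, validate, and standardize scraped rows (hypothetical fields).
import re


def clean(rows: list[dict]) -> tuple[list[dict], dict]:
    seen, cleaned, issues = set(), [], 0
    for row in rows:
        key = (row.get("name", "").strip().lower(), row.get("url", ""))
        if key in seen:                                      # 1. remove duplicates
            continue
        seen.add(key)
        row["name"] = re.sub(r"\s+", " ", row.get("name", "")).strip()   # 2. normalize text
        price = re.sub(r"[^\d.]", "", str(row.get("price", "")))
        try:
            row["price"] = float(price)                      # 3./4. validate and standardize price
        except ValueError:
            issues += 1
            continue
        cleaned.append(row)
    report = {                                               # 5. counts for the quality report
        "total_records": len(cleaned),
        "duplicates_removed": len(rows) - len(cleaned) - issues,
        "records_dropped_for_issues": issues,
    }
    return cleaned, report
```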
### Step 4: Client Deliverable Package
Generate a complete deliverable:
```
delivery/
  data.csv                  # Clean data in requested format
  data.json                 # JSON alternative
  data-quality-report.md    # Quality metrics
  scraper-documentation.md  # How the scraper works
  README.md                 # Quick start guide
```
**`scraper-documentation.md`** includes:
- What was scraped and from where
- How many records were collected
- Data fields and their descriptions
- How to re-run the scraper
- Known limitations
- Date of scraping
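One possible way to assemble the bundle, assuming the cleaned rows and report dictionary from Step 3, is a small packaging helper like this. The file names follow the layout above; everything else is illustrative.
```python
# Sketch: write data.csv, data.json, and data-quality-report.md into delivery/.
import csv
import json
from pathlib import Path


def package(rows: list[dict], report: dict, out_dir: str = "delivery") -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    if rows:
        with open(out / "data.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    with open(out / "data.json", "w") as f:
        json.dump(rows, f, indent=2)
    lines = ["# Data Quality Report", ""]
    lines += [f"- {key.replace('_', ' ').capitalize()}: {value}" for key, value in report.items()]
    (out / "data-quality-report.md").write_text("\n".join(lines) + "\n")
```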
### Step 5: Output to User
Present:
1. **Summary**: X records scraped from Y pages, Z% data quality
2. **Sample data**: First 5 rows of the output
3. **File locations**: Where the deliverables are saved
4. **Client handoff notes**: What to tell the client about the data
## Scraper Templates
Based on the target type, use the appropriate template:
### E-commerce Product Scraper
Fields: name, price, original_price, discount, description, images, category, sku, rating, review_count, availability, url
### Real Estate Listings
Fields: address, price, bedrooms, bathrooms, sqft, lot_size, listing_type, agent, description, images, url
### Job Listings
Fields: title, company, location, salary, job_type, description, requirements, posted_date, url
### Directory/Business Listings
Fields: business_name, address, phone, website, category, rating, review_count, hours, description
### News/Blog Articles
Fields: title, author, date, content, tags, url, image
## Ethical Scraping Rules
1. **Always respect robots.txt** — check before scraping (see the sketch after this list)
2. **Rate limit** — minimum 2 second delay between requests
3. **Identify yourself** — use realistic but honest User-Agent
4. **Don't scrape personal data** (emails, phone numbers) unless explicitly authorized by the client AND the data is publicly displayed
5. **Cache responses** — don't re-scrape pages unnecessarily
6. **Check ToS** — note if the site's terms prohibit scraping and inform the client
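For rule 1, the robots.txt check needs nothing beyond the standard library; a minimal version, with a placeholder URL, might look like:
```python
# Check robots.txt before scraping, using only the standard library.
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, user_agent: str = "*") -> bool:
    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    parser = RobotFileParser()
    parser.set_url(urljoin(root, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(allowed_by_robots("https://example.com/products"))  # placeholder URL
```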
## Summary

This skill builds client-ready web scrapers and delivers clean, packaged data. It converts a scraping brief into a complete Python scraper, runs it, cleans and validates the results, and produces a client handoff package with documentation and a data quality report.

The skill analyzes the target site to choose the right engine (requests/BeautifulSoup for static pages, Playwright for JS-rendered sites, or direct API calls when available). It generates a reusable Python scraper with rate limiting, retry logic, user-agent rotation, progress tracking, error logging, incremental output, and resume capability. After running the scraper it deduplicates, normalizes, validates, and standardizes the data, then builds a delivery bundle containing data files, documentation, and a quality report.

## FAQ

**What formats are supported for output?**
CSV and JSON are produced by default; files are written incrementally to avoid memory issues.

**How are anti-scraping protections handled?**
The skill detects client- vs server-side rendering, recommends Playwright for JS-heavy sites, and includes rate limiting, user-agent rotation, retries, and error logging. Complex protections such as CAPTCHAs require manual intervention or client approval for advanced bypass strategies.

**Can the scraper resume after a crash?**
Yes. Scrapers track the last successfully scraped page/URL and append output incrementally, so runs can resume without redoing completed work.