This skill helps you fetch and extract web content, discover links, and organize results for efficient research and data gathering.
Install with `npx playbooks add skill openclaw/skills --skill xpr-web-scraping`.
---
name: web-scraping
description: Web scraping tools for fetching and extracting data from web pages
---
## Web Scraping
You have web scraping tools for fetching and extracting data from web pages:
**Single page:**
- `scrape_url` — fetch a URL and get cleaned text content + metadata (title, description, link count)
- Use format="text" (default) for most tasks — strips all HTML
- Use format="markdown" to preserve headings, links, lists, bold/italic
- Use format="html" only when you need raw HTML
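A minimal sketch of how a call might look, assuming `scrape_url` is exposed as an async function taking `url` and `format` and returning cleaned content plus metadata (the exact invocation shape depends on your agent runtime):

```ts
// Hypothetical invocation sketch: field and parameter names are assumptions,
// not the skill's confirmed API.
type ScrapeFormat = "text" | "markdown" | "html";

interface ScrapeResult {
  content: string;
  title?: string;
  description?: string;
  linkCount?: number;
}

declare function scrape_url(args: { url: string; format?: ScrapeFormat }): Promise<ScrapeResult>;

// Default text format: HTML stripped, good for most extraction tasks.
const article = await scrape_url({ url: "https://example.com/article" });

// Markdown format: keeps headings, links, lists, and emphasis.
const structured = await scrape_url({ url: "https://example.com/article", format: "markdown" });
```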
**Link discovery:**
- `extract_links` — fetch a page and extract all links with text and type (internal/external)
- Use the `pattern` parameter to filter by regex (e.g. `"\\.pdf$"` for PDF links)
- Links are deduplicated and resolved to absolute URLs
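A sketch of link discovery, assuming `extract_links` takes a `url` and an optional `pattern` regex and returns one entry per deduplicated absolute URL (field names are illustrative):

```ts
// Hypothetical shape of a link entry returned by extract_links.
interface LinkEntry {
  url: string;                      // absolute, deduplicated URL
  text: string;                     // anchor text
  type: "internal" | "external";
}

declare function extract_links(args: { url: string; pattern?: string }): Promise<LinkEntry[]>;

// Collect only PDF links from a report index, then keep the external ones.
const pdfLinks = await extract_links({
  url: "https://example.com/reports",
  pattern: "\\.pdf$",
});
const externalPdfs = pdfLinks.filter((link) => link.type === "external");
```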
**Multi-page research:**
- `scrape_multiple` — fetch up to 10 URLs in parallel for comparison/research
- One failure doesn't block others (uses Promise.allSettled)
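The failure isolation follows the standard `Promise.allSettled` pattern; the sketch below illustrates it with a stand-in `fetchPage` helper rather than the skill's internal code:

```ts
// Stand-in fetcher used only to illustrate the pattern; not part of the skill.
async function fetchPage(url: string): Promise<string> {
  const res = await fetch(url);
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${url}`);
  return res.text();
}

// Promise.allSettled lets every request finish independently, so one
// rejection becomes an error entry instead of aborting the whole batch.
async function scrapeMany(urls: string[]) {
  const settled = await Promise.allSettled(urls.map(fetchPage));
  return settled.map((result, i) =>
    result.status === "fulfilled"
      ? { url: urls[i], ok: true as const, content: result.value }
      : { url: urls[i], ok: false as const, error: String(result.reason) }
  );
}
```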
**Best practices:**
- Prefer "text" format for content extraction, "markdown" for preserving structure
- Don't scrape the same domain more than 5 times per minute
- Combine with `store_deliverable` to save scraped content as job evidence
- For very large pages, the content is limited to 5MB
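For the per-domain rate limit, a simple client-side guard can keep an agent under the 5-requests-per-minute ceiling; the sliding-window counter below is an illustrative sketch, not part of the skill:

```ts
// Sliding-window politeness check: allow at most `limit` requests per domain
// within `windowMs`. All names here are illustrative.
const recentRequests = new Map<string, number[]>();

function canScrape(url: string, limit = 5, windowMs = 60_000): boolean {
  const domain = new URL(url).hostname;
  const now = Date.now();
  const hits = (recentRequests.get(domain) ?? []).filter((t) => now - t < windowMs);
  if (hits.length >= limit) {
    recentRequests.set(domain, hits);
    return false; // caller should wait before hitting this domain again
  }
  hits.push(now);
  recentRequests.set(domain, hits);
  return true;
}
```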
This skill provides web scraping tools to fetch and extract structured content and metadata from web pages. It offers single-page extraction, link discovery, and parallel multi-page scraping with sensible defaults and safeguards. It is designed for research, archiving, and automated evidence collection workflows.
The skill fetches web pages and returns cleaned text, optional markdown or raw HTML, plus metadata such as title, description, and link counts. A dedicated link extractor returns deduplicated, absolute URLs with link text and a classification of internal or external. A multi-page worker fetches up to 10 URLs in parallel and isolates failures so the remaining requests still complete.
**What output formats are available?**
Three formats: text (default, strips HTML), markdown (preserves headings, links, lists), and html (raw source).
**How does the multi-page fetch handle failures?**
Each URL is fetched in parallel and settled independently (via `Promise.allSettled`), so one failing URL does not block the others; results and error details are returned per URL.