
xpr-web-scraping skill

/skills/paulgnz/xpr-web-scraping

This skill helps you fetch and extract web content, discover links, and organize results for efficient research and data gathering.

npx playbooks add skill openclaw/skills --skill xpr-web-scraping

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
1.2 KB
---
name: web-scraping
description: Web scraping tools for fetching and extracting data from web pages
---

## Web Scraping

You have web scraping tools for fetching and extracting data from web pages:

**Single page:**
- `scrape_url` — fetch a URL and get cleaned text content + metadata (title, description, link count)
  - Use format="text" (default) for most tasks — strips all HTML
  - Use format="markdown" to preserve headings, links, lists, bold/italic
  - Use format="html" only when you need raw HTML

**Link discovery:**
- `extract_links` — fetch a page and extract all links with text and type (internal/external)
  - Use the `pattern` parameter to filter by regex (e.g. `"\\.pdf$"` for PDF links)
  - Links are deduplicated and resolved to absolute URLs

**Multi-page research:**
- `scrape_multiple` — fetch up to 10 URLs in parallel for comparison/research
  - One failure doesn't block others (uses Promise.allSettled)

**Best practices:**
- Prefer "text" format for content extraction, "markdown" for preserving structure
- Don't scrape the same domain more than 5 times per minute
- Combine with `store_deliverable` to save scraped content as job evidence
- For very large pages, the content is limited to 5MB

Overview

This skill provides web scraping tools to fetch and extract structured content and metadata from web pages. It offers single-page extraction, link discovery, and parallel multi-page scraping with sensible defaults and safeguards. It is designed for research, archiving, and automated evidence collection workflows.

How this skill works

The skill fetches web pages and returns cleaned text, optional markdown or raw HTML, plus metadata such as title, description, and link counts. A dedicated link extractor returns deduplicated, absolute URLs with link text and a classification of internal or external. A multi-page worker fetches up to 10 URLs in parallel and isolates failures so other requests continue to succeed.
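The link-extraction step described above (absolute resolution, internal/external classification, deduplication) can be sketched as below. This is a minimal illustration, not the skill's actual implementation; the function name resolveLinks and its input of pre-collected href strings are assumptions.

```typescript
// Hypothetical sketch of the link-extraction step: resolve hrefs against the
// page URL, classify each link as internal or external, and deduplicate.
type ExtractedLink = { url: string; type: "internal" | "external" };

function resolveLinks(baseUrl: string, hrefs: string[]): ExtractedLink[] {
  const base = new URL(baseUrl);
  const seen = new Set<string>();
  const out: ExtractedLink[] = [];
  for (const href of hrefs) {
    let abs: URL;
    try {
      abs = new URL(href, base); // resolves relative hrefs to absolute URLs
    } catch {
      continue; // skip malformed hrefs
    }
    if (seen.has(abs.href)) continue; // deduplicate by absolute URL
    seen.add(abs.href);
    out.push({
      url: abs.href,
      type: abs.hostname === base.hostname ? "internal" : "external",
    });
  }
  return out;
}
```

For example, resolveLinks("https://example.com/docs/", ["a.html", "/b", "https://other.org/c", "a.html"]) yields three links: two internal absolute URLs and one external, with the duplicate dropped.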

When to use it

  • Extract readable text from a single URL for analysis or summarization
  • Preserve page structure using markdown for content migration or reporting
  • Discover and filter links on a page, for example to find PDFs or resource lists
  • Fetch and compare multiple pages in parallel for research or competitive analysis
  • Archive scraped content as evidence or attach to a job record

Best practices

  • Prefer format="text" for general content extraction to remove HTML noise; use format="markdown" when headings and links matter
  • Respect rate limits: avoid scraping the same domain more than five times per minute
  • Limit parallel batches to the provided cap (10) to avoid server overload or IP throttling
  • Combine scraping output with a storage step (for example store_deliverable) to persist evidence and metadata
  • Be mindful of the 5MB per-page content limit; split very large pages or fetch specific subresources
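The rate-limit guideline above can be enforced on the caller's side with a small sliding-window check. This is a sketch under the doc's stated numbers (5 requests per domain per minute); the DomainRateLimiter class is hypothetical and not part of the skill's tool set.

```typescript
// Minimal sliding-window limiter matching the "no more than 5 requests per
// domain per minute" guideline. Limit and window are the doc's numbers.
class DomainRateLimiter {
  private hits = new Map<string, number[]>();
  constructor(private limit = 5, private windowMs = 60_000) {}

  // Returns true if a request to this URL's domain is within budget,
  // recording the hit; returns false if the caller should wait.
  allow(url: string, now = Date.now()): boolean {
    const domain = new URL(url).hostname;
    const recent = (this.hits.get(domain) ?? []).filter(
      (t) => now - t < this.windowMs, // keep only hits inside the window
    );
    if (recent.length >= this.limit) return false;
    recent.push(now);
    this.hits.set(domain, recent);
    return true;
  }
}
```

Tracking per-domain timestamps (rather than a global counter) lets a batch hit several domains at full speed while still throttling repeat requests to any one of them.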

Example use cases

  • Pull cleaned article text from a news page and run summarization or entity extraction
  • Extract a list of PDF links from an academic department page using a regex pattern filter
  • Fetch 8 product pages in parallel to compare prices and feature lists
  • Archive page content and metadata as job evidence for compliance or auditing
  • Scrape content in markdown to prepare documentation drafts while preserving headings and links

FAQ

What output formats are available?

Three formats: text (default, strips HTML), markdown (preserves headings, links, lists), and html (raw source).

How does the multi-page fetch handle failures?

It uses Promise.allSettled, so one failing URL does not block the others; you receive a result or an error detail for each URL.
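The failure-isolation pattern the answer describes looks roughly like this. It is a sketch of the general Promise.allSettled technique, not the skill's code; fetchAll and its injected fetchOne callback are illustrative names.

```typescript
// Sketch of per-URL failure isolation with Promise.allSettled: each URL
// yields either a value or an error, and one rejection never blocks the rest.
async function fetchAll(
  urls: string[],
  fetchOne: (u: string) => Promise<string>,
): Promise<{ url: string; ok: boolean; value?: string; error?: string }[]> {
  const settled = await Promise.allSettled(urls.map(fetchOne));
  return settled.map((r, i) =>
    r.status === "fulfilled"
      ? { url: urls[i], ok: true, value: r.value }
      : { url: urls[i], ok: false, error: String(r.reason) },
  );
}
```

Compare with Promise.all, which would reject the whole batch on the first failure and discard the results of every URL that succeeded.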