
scrapedo-web-scraper skill

/plugins/scrapedo-web-scraper/skills/scrapedo-web-scraper

This skill fetches page text or HTML via Scrape.do when a normal fetch is blocked, bypassing CAPTCHAs and anti-bot blocks to restore access.

npx playbooks add skill artwist-polyakov/polyakov-claude-skills --skill scrapedo-web-scraper

Review the files below or copy the command above to add this skill to your agents.

Files (5)
SKILL.md
1.3 KB
---
name: scrapedo-web-scraper
description: |
  Web scraping via Scrape.do. Bypasses blocks and CAPTCHAs.
  Use AUTOMATICALLY on WebFetch errors: 403, 401, 429,
  timeout, access denied, Cloudflare block.
---

# Scrape.do Web Scraper

Scrapes web pages via the Scrape.do API. Use it when a normal fetch fails (blocking, JavaScript-rendered content).

## Usage

```bash
# Fetch page text
python scripts/scrape.py https://example.com

# Fetch HTML
python scripts/scrape.py --html https://example.com
```

## From Python

```python
from scripts.scrape import fetch_via_scrapedo

result = fetch_via_scrapedo('https://example.com')
if result['success']:
    print(result['content'])  # page text
    # result['html'] contains the original HTML
else:
    print(result['content'])  # error description
```
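The internals of `scripts/scrape.py` are not shown on this page. As an assumption only, a Scrape.do request is typically composed as a GET with the token and target URL as query parameters; the endpoint and parameter names below follow Scrape.do's public documentation but should be verified before use:

```python
# Assumption: Scrape.do accepts a GET request with `token` and `url` query
# parameters, plus an optional `render` flag for JS rendering. Verify these
# names against the current Scrape.do docs before relying on them.
from urllib.parse import urlencode

API_BASE = "https://api.scrape.do"  # assumed endpoint


def build_request_url(token, target_url, render=False):
    """Compose a Scrape.do request URL for the given target page."""
    params = {"token": token, "url": target_url}
    if render:
        params["render"] = "true"  # ask for JavaScript rendering (assumed flag)
    return f"{API_BASE}/?{urlencode(params)}"
```

`urlencode` percent-encodes the target URL, so it can safely be nested inside the query string.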

## Result

- **Success**: page text (or HTML with `--html`)
- **Error**: a clear message (missing token / rate limit / unavailable)

If an error is returned, the page is not reachable via this method.

Overview

This skill integrates Scrape.do as a fallback web scraper for pages that fail with standard HTTP fetches. It bypasses common blocks like Cloudflare, JavaScript rendering barriers, and simple CAPTCHAs to return page text or full HTML. It triggers automatically on common WebFetch errors to preserve workflow continuity.

How this skill works

The skill routes requests to the Scrape.do API when WebFetch returns 403, 401, 429, timeout, access denied, or Cloudflare block. It performs headless rendering and anti-block measures, then returns either cleaned text or raw HTML depending on the options. On API errors it returns a clear, human-readable error explaining token, rate limit, or availability issues.
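The routing rule above can be sketched as a small decision helper. The trigger list comes straight from the skill description (403, 401, 429, timeout, access denied, Cloudflare block); the function name `should_fallback` is illustrative, not part of the skill's API:

```python
# Illustrative sketch: decide whether a failed WebFetch result should be
# retried through Scrape.do. All names here are hypothetical.

FALLBACK_STATUS_CODES = {401, 403, 429}
FALLBACK_ERROR_MARKERS = ("timeout", "access denied", "cloudflare")


def should_fallback(status_code=None, error_text=""):
    """Return True when a WebFetch failure matches the skill's trigger list."""
    if status_code in FALLBACK_STATUS_CODES:
        return True
    text = error_text.lower()
    return any(marker in text for marker in FALLBACK_ERROR_MARKERS)
```

A 200 response, or an unrelated error such as a plain 404, would not trigger the fallback.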

When to use it

  • Standard fetch returns 403, 401, 429, timeout, or access denied
  • Pages blocked by Cloudflare or similar anti-bot services
  • Pages that require JavaScript rendering to load content
  • When you need raw HTML for parsing or the rendered text for NLP
  • As an automatic fallback in production scraping pipelines
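As an automatic fallback layer, the control flow is a simple try-then-route pattern. A minimal sketch with injectable fetch callables (the result shape with `success` and `content` keys mirrors the skill's Python API above; the wrapper itself is hypothetical):

```python
# Hypothetical fallback wrapper: try the normal fetcher first, and only
# route to Scrape.do when it raises or reports failure.

def fetch_with_fallback(url, primary, scrapedo):
    """primary/scrapedo are callables url -> {'success': bool, 'content': str}."""
    try:
        result = primary(url)
        if result.get("success"):
            return result
    except Exception:
        pass  # treat a raised error the same as a blocked response
    return scrapedo(url)
```

Injecting the fetchers keeps the routing logic testable without any network access.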

Best practices

  • Keep the Scrape.do API token secure and rotate it per your security policy
  • Use HTML output only when you need full markup; prefer text for downstream NLP
  • Respect robots.txt and site terms; use rate limiting and backoff to avoid bans
  • Handle returned error messages programmatically to fall back or alert on quota issues
  • Cache results for stable pages to reduce API usage and costs
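The caching and backoff advice can be combined in one thin wrapper. This is a sketch under the assumption that results are keyed by URL with a fixed TTL; all names are illustrative, and `sleep` is injectable so the backoff can be tested without waiting:

```python
import time


class CachedScraper:
    """Cache successful results per URL; retry failures with exponential backoff."""

    def __init__(self, fetch, ttl=3600, retries=3, base_delay=1.0, sleep=time.sleep):
        self.fetch = fetch          # callable url -> {'success': bool, ...}
        self.ttl = ttl              # seconds a cached result stays fresh
        self.retries = retries
        self.base_delay = base_delay
        self.sleep = sleep          # injectable for testing
        self._cache = {}            # url -> (timestamp, result)

    def get(self, url):
        hit = self._cache.get(url)
        if hit and time.time() - hit[0] < self.ttl:
            return hit[1]           # fresh cached result, no API call
        for attempt in range(self.retries):
            result = self.fetch(url)
            if result.get("success"):
                self._cache[url] = (time.time(), result)
                return result
            self.sleep(self.base_delay * 2 ** attempt)  # exponential backoff
        return result               # last failure, for the caller to inspect
```

Only successful results are cached, so a transient rate-limit error does not poison the cache.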

Example use cases

  • Automated content extraction when a site blocks direct fetches with 403/Cloudflare
  • Collecting article text from dynamic sites that require JS rendering
  • Fallback layer in a scraping service to reduce failed fetches and false negatives
  • Getting raw HTML for complex parsers that rely on client-side rendering

FAQ

What does the skill return on success?

It returns page text by default and can return raw HTML when requested.

What happens if the Scrape.do API fails?

You get a clear error message indicating a token, rate-limit, or service-availability issue; treat the source as unavailable and fall back accordingly.