
scrapeninja skill

/scrapeninja

This skill accelerates data extraction from protected sites, combining fast non-JS scraping and full JavaScript rendering with geo-targeted rotating proxies.

npx playbooks add skill vm0-ai/vm0-skills --skill scrapeninja

---
name: scrapeninja
description: High-performance web scraping API with Chrome TLS fingerprint and JS rendering
vm0_secrets:
  - SCRAPENINJA_API_KEY
---

# ScrapeNinja

High-performance web scraping API with Chrome TLS fingerprint, rotating proxies, smart retries, and optional JavaScript rendering.

> Official docs: https://scrapeninja.net/docs/

---

## When to Use

Use this skill when you need to:

- Scrape websites with anti-bot protection (Cloudflare, Datadome)
- Extract data without running a full browser (fast `/scrape` endpoint)
- Render JavaScript-heavy pages (`/scrape-js` endpoint)
- Use rotating proxies with geo selection (US, EU, Brazil, etc.)
- Extract structured data with Cheerio extractors
- Intercept AJAX requests
- Take screenshots of pages

---

## Prerequisites

1. Get an API key from RapidAPI or APIRoad:
   - RapidAPI: https://rapidapi.com/restyler/api/scrapeninja
   - APIRoad: https://apiroad.net/marketplace/apis/scrapeninja

2. Set the environment variable:

```bash
# For RapidAPI
export SCRAPENINJA_API_KEY="your-rapidapi-key"

# For APIRoad (use X-Apiroad-Key header instead)
export SCRAPENINJA_API_KEY="your-apiroad-key"
```
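
If you use APIRoad rather than RapidAPI, the host and auth header differ from the RapidAPI examples below. A minimal sketch, assuming the APIRoad base URL is `https://scrapeninja.apiroad.net` (confirm the exact host in your APIRoad dashboard):

```bash
# APIRoad variant: same JSON payload, different host and auth header.
# NOTE: the host below is an assumption -- verify it in your APIRoad dashboard.
bash -c 'curl -s -X POST "https://scrapeninja.apiroad.net/scrape" --header "Content-Type: application/json" --header "X-Apiroad-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json'
```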

---


> **Important:** When using `$VAR` in a command that pipes to another command, wrap the command containing `$VAR` in `bash -c '...'`. Due to a Claude Code bug, environment variables are silently cleared when pipes are used directly.
> ```bash
> bash -c 'curl -s "https://api.example.com" -H "Authorization: Bearer $API_KEY"'
> ```

## How to Use

### 1. Basic Scrape (Non-JS, Fast)

High-performance scraping with Chrome TLS fingerprint, no JavaScript:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq '{status: .info.statusCode, url: .info.finalUrl, bodyLength: (.body | length)}'
```
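
The same call can also be made fully self-contained with a heredoc, avoiding the separate file-editing step (a convenience sketch, functionally identical to the two-step flow above):

```bash
# Write the payload and run the scrape in one snippet.
cat > /tmp/scrapeninja_request.json <<'EOF'
{
  "url": "https://example.com"
}
EOF
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq '{status: .info.statusCode, url: .info.finalUrl}'
```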

**With custom headers and retries:**

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "headers": ["Accept-Language: en-US"],
  "retryNum": 3,
  "timeout": 15
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json'
```

### 2. Scrape with JavaScript Rendering

For JavaScript-heavy sites (React, Vue, etc.):

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "waitForSelector": "h1",
  "timeout": 20
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq '{status: .info.statusCode, bodyLength: (.body | length)}'
```
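
To develop selectors or extractors against the rendered markup, save the `body` field to disk first (a small sketch):

```bash
# Save the rendered HTML for offline inspection.
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq -r '.body' > /tmp/page.html
wc -c /tmp/page.html  # sanity check: a near-empty file means rendering failed
```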

**With screenshot:**

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "screenshot": true
}
```

Then run:

```bash
# Get the screenshot reference (URL or base64, depending on API version) from the response
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq -r '.info.screenshot'
```
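
Depending on the API version and plan, `.info.screenshot` may hold a hosted URL or inline base64 PNG data (the response-format section below shows base64). A sketch that saves the image either way:

```bash
# Save the screenshot whether the API returns a URL or base64 data.
shot=$(bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq -r '.info.screenshot')
case "$shot" in
  http*) curl -s "$shot" -o /tmp/screenshot.png ;;                 # hosted URL
  *)     printf '%s' "$shot" | base64 -d > /tmp/screenshot.png ;;  # inline base64 (use -D on older macOS)
esac
```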

### 3. Geo-Based Proxy Selection

Use proxies from specific regions:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "geo": "eu"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq .info
```

Available geos: `us`, `eu`, `br` (Brazil), `fr` (France), `de` (Germany), `4g-eu` (4G mobile, Europe)
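
To sanity-check which region your requests exit from, scrape an IP-echo service and inspect the reported country (`ipinfo.io` is just one example of such a service):

```bash
# Verify the proxy geo by scraping an IP-echo endpoint through the eu pool.
cat > /tmp/scrapeninja_request.json <<'EOF'
{
  "url": "https://ipinfo.io/json",
  "geo": "eu"
}
EOF
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq -r '.body' | jq '{ip, country}'
```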

### 4. Smart Retries

Retry on specific HTTP status codes or text patterns:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "retryNum": 3,
  "statusNotExpected": [403, 429, 503],
  "textNotExpected": ["captcha", "Access Denied"]
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json'
```
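
Assuming the API returns the last attempt's result once retries are exhausted, a sketch that fails a pipeline when the final status is still one of the blocked codes:

```bash
# Exit non-zero if the scrape is still blocked after all retries.
status=$(bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq '.info.statusCode')
case "$status" in
  403|429|503) echo "still blocked after retries (status $status)" >&2; exit 1 ;;
  *)           echo "ok (status $status)" ;;
esac
```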

### 5. Extract Data with Cheerio

Extract structured JSON using Cheerio extractor functions:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://news.ycombinator.com",
  "extractor": "function(input, cheerio) { let $ = cheerio.load(input); return $(\".titleline > a\").slice(0,5).map((i,el) => ({title: $(el).text(), url: $(el).attr(\"href\")})).get(); }"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq '.extractor'
```
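
Hand-escaping a JavaScript extractor inside a JSON string is error-prone. A sketch that builds the payload with `jq --arg`, which handles all quoting automatically:

```bash
# Build the request JSON with jq so the extractor needs no manual escaping.
extractor='function(input, cheerio) {
  let $ = cheerio.load(input);
  return $(".titleline > a").slice(0, 5)
    .map((i, el) => ({ title: $(el).text(), url: $(el).attr("href") }))
    .get();
}'
jq -n --arg url "https://news.ycombinator.com" --arg ex "$extractor" \
  '{url: $url, extractor: $ex}' > /tmp/scrapeninja_request.json
```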

### 6. Intercept AJAX Requests

Capture XHR/fetch responses:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "catchAjaxHeadersUrlMask": "api/data"
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq '.info.catchedAjax'
```
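
The captured `body` arrives as a string; if the intercepted endpoint returns JSON, a second `jq` pass parses it (assuming the intercepted response body is valid JSON):

```bash
# Extract the intercepted AJAX response and parse its body as JSON.
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json' | jq -r '.info.catchedAjax.body' | jq .
```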

### 7. Block Resources for Speed

Speed up JS rendering by blocking images and media:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "blockImages": true,
  "blockMedia": true
}
```

Then run:

```bash
bash -c 'curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: ${SCRAPENINJA_API_KEY}" -d @/tmp/scrapeninja_request.json'
```

---

## API Endpoints

| Endpoint | Description |
|----------|-------------|
| `/scrape` | Fast non-JS scraping with Chrome TLS fingerprint |
| `/scrape-js` | Full Chrome browser with JS rendering |
| `/v2/scrape-js` | Enhanced JS rendering for protected sites (APIRoad only) |

---

## Request Parameters

### Common Parameters (all endpoints)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | required | URL to scrape |
| `headers` | string[] | - | Custom HTTP headers |
| `retryNum` | int | 1 | Number of retry attempts |
| `geo` | string | `us` | Proxy geo: us, eu, br, fr, de, 4g-eu |
| `proxy` | string | - | Custom proxy URL (overrides geo) |
| `timeout` | int | 10 (`/scrape`) / 16 (`/scrape-js`) | Timeout per attempt in seconds |
| `textNotExpected` | string[] | - | Text patterns that trigger retry |
| `statusNotExpected` | int[] | [403, 502] | HTTP status codes that trigger retry |
| `extractor` | string | - | Cheerio extractor function |

### JS Rendering Parameters (`/scrape-js`, `/v2/scrape-js`)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `waitForSelector` | string | - | CSS selector to wait for |
| `postWaitTime` | int | - | Extra wait time after load (1-12s) |
| `screenshot` | bool | true | Take page screenshot |
| `blockImages` | bool | false | Block image loading |
| `blockMedia` | bool | false | Block CSS/fonts loading |
| `catchAjaxHeadersUrlMask` | string | - | URL pattern to intercept AJAX |
| `viewport` | object | 1920x1080 | Custom viewport size |
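
For example, to emulate a smaller laptop screen, pass a custom `viewport` object (the `width`/`height` field names below are an assumption based on common browser-automation conventions; confirm against the official docs):

```bash
# Request a custom viewport; the width/height keys are assumed, not confirmed.
cat > /tmp/scrapeninja_request.json <<'EOF'
{
  "url": "https://example.com",
  "viewport": { "width": 1366, "height": 768 }
}
EOF
```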

---

## Response Format

```json
{
  "info": {
    "statusCode": 200,
    "finalUrl": "https://example.com",
    "headers": ["content-type: text/html"],
    "screenshot": "base64-encoded-png",
    "catchedAjax": {
      "url": "https://example.com/api/data",
      "method": "GET",
      "body": "...",
      "status": 200
    }
  },
  "body": "<html>...</html>",
  "extractor": { "extracted": "data" }
}
```

---

## Guidelines

1. **Start with `/scrape`**: Use the fast non-JS endpoint first, only switch to `/scrape-js` if needed
2. **Retries**: Set `retryNum` to 2-3 for unreliable sites
3. **Geo Selection**: Use `eu` for European sites, `us` for American sites
4. **Extractors**: Test extractors at https://scrapeninja.net/cheerio-sandbox/
5. **Blocked Sites**: For Cloudflare/Datadome protected sites, use `/v2/scrape-js` via APIRoad
6. **Screenshots**: Set `screenshot: false` to speed up JS rendering
7. **Rate Limits**: Check your plan limits on RapidAPI/APIRoad dashboard

---

## Tools

- **Playground**: https://scrapeninja.net/scraper-sandbox
- **Cheerio Sandbox**: https://scrapeninja.net/cheerio-sandbox
- **cURL Converter**: https://scrapeninja.net/curl-to-scraper

Overview

This skill provides a high-performance web scraping API with Chrome TLS fingerprinting, rotating proxies, smart retries, and optional JavaScript rendering. It offers fast non-JS scraping and full browser rendering endpoints, plus features like geo-based proxies, Cheerio extractors, AJAX interception, and screenshots. Designed for scraping protected and JavaScript-heavy sites while minimizing infrastructure overhead.

How this skill works

You send JSON requests to dedicated endpoints: /scrape for fast non-JS HTML fetches and /scrape-js (or /v2/scrape-js via APIRoad) for full Chrome rendering. The service uses a Chrome TLS fingerprint, rotating proxies (selectable by geo), and configurable retry logic. Optional extractor functions run Cheerio on the HTML, and JS rendering can capture AJAX responses and screenshots.

When to use it

  • When sites block simple scrapers (Cloudflare, Datadome) and you need realistic TLS/browser fingerprinting.
  • When you want fast HTML fetches without running a full browser (/scrape).
  • When pages are JS-driven (React, Vue) and require rendering or AJAX interception (/scrape-js).
  • When you need geo-specific requests via rotating proxies (us, eu, br, fr, de, 4g-eu).
  • When you need structured extraction with Cheerio or screenshots for visual verification.

Best practices

  • Start with /scrape and only use /scrape-js if the page needs JS to load content.
  • Set retryNum to 2–3 and configure statusNotExpected/textNotExpected to handle transient blocks.
  • Use geo selection to match the target audience and reduce regional blocks.
  • Disable screenshots and block images/media to speed up JS rendering when visuals aren’t needed.
  • Test Cheerio extractors in the sandbox before deploying them in production requests.

Example use cases

  • Crawl product listings where HTML is available but anti-bot measures require TLS/browser mimicry.
  • Render single-page apps to extract dynamic content and capture API responses made by the page.
  • Take periodic screenshots of landing pages for monitoring or visual regression checks.
  • Collect region-restricted content by targeting proxies in the desired country or continent.
  • Use Cheerio extractors to return structured JSON (titles, links, prices) directly from responses.

FAQ

Which endpoint should I try first?

Begin with /scrape for speed; switch to /scrape-js only if the content requires JavaScript rendering.

How do I handle sites that still block requests?

Increase retryNum, add statusNotExpected/textNotExpected rules, use geo proxies, or use /v2/scrape-js via APIRoad for tougher protections.