
bright-data skill

/bright-data

This skill helps you manage the Bright Data Web Scraper API through curl for social media scraping, data extraction, and usage monitoring.

npx playbooks add skill vm0-ai/vm0-skills --skill bright-data

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
11.2 KB
---
name: bright-data
description: Bright Data Web Scraper API via curl. Use this skill for scraping social media (Twitter/X, Reddit, YouTube, Instagram, TikTok), account management, and usage monitoring.
vm0_secrets:
  - BRIGHTDATA_API_KEY
---

# Bright Data Web Scraper API

Use the Bright Data API via direct `curl` calls for **social media scraping**, **web data extraction**, and **account management**.

> Official docs: `https://docs.brightdata.com/`

---

## When to Use

Use this skill when you need to:

- **Scrape social media** - Twitter/X, Reddit, YouTube, Instagram, TikTok, LinkedIn
- **Extract web data** - Posts, profiles, comments, engagement metrics
- **Monitor usage** - Track bandwidth and request usage
- **Manage account** - Check status and zones

---

## Prerequisites

1. Sign up at [Bright Data](https://brightdata.com/)
2. Get your API key from [Settings > Users](https://brightdata.com/cp/setting/users)
3. Create a Web Scraper dataset in the [Control Panel](https://brightdata.com/cp/datasets) to get your `dataset_id`

```bash
export BRIGHTDATA_API_KEY="your-api-key"
```

### Base URL

```
https://api.brightdata.com
```

---


> **Important:** When using `$VAR` in a command that pipes to another command, wrap the command containing `$VAR` in `bash -c '...'`. Due to a Claude Code bug, environment variables are silently cleared when pipes are used directly.
> ```bash
> bash -c 'curl -s "https://api.example.com" -H "Authorization: Bearer $API_KEY"'
> ```

---

## Social Media Scraping

Bright Data supports scraping these social media platforms:

| Platform | Profiles | Posts | Comments | Reels/Videos |
|----------|----------|-------|----------|--------------|
| Twitter/X | ✅ | ✅ | - | - |
| Reddit | - | ✅ | ✅ | - |
| YouTube | ✅ | ✅ | ✅ | - |
| Instagram | ✅ | ✅ | ✅ | ✅ |
| TikTok | ✅ | ✅ | ✅ | - |
| LinkedIn | ✅ | ✅ | - | - |

---

## How to Use

### 1. Trigger Scraping (Asynchronous)

Trigger a data collection job and get a `snapshot_id` for later retrieval.

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://twitter.com/username"},
  {"url": "https://twitter.com/username2"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

**Response:**
```json
{
  "snapshot_id": "s_m4x7enmven8djfqak"
}
```
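
To feed later steps, you can capture the snapshot ID in a shell variable. A minimal sketch, assuming `jq` is installed (replace `<dataset-id>` as above):

```bash
SNAPSHOT_ID=$(bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json' | jq -r '.snapshot_id')
echo "$SNAPSHOT_ID"
```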

---

### 2. Trigger Scraping (Synchronous)

Get results immediately in the response (for small requests).

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.reddit.com/r/technology/comments/xxxxx"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

---

### 3. Monitor Progress

Check the status of a scraping job (replace `<snapshot-id>` with your actual snapshot ID):

```bash
bash -c 'curl -s "https://api.brightdata.com/datasets/v3/progress/<snapshot-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
```

**Response:**
```json
{
  "snapshot_id": "s_m4x7enmven8djfqak",
  "dataset_id": "gd_xxxxx",
  "status": "running"
}
```

Status values: `running`, `ready`, `failed`
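
A minimal polling sketch, assuming `jq` is installed and `SNAPSHOT_ID` holds the ID from step 1 (the 30-second interval is an arbitrary choice):

```bash
# Poll every 30 seconds until the job leaves the "running" state.
STATUS="running"
while [ "$STATUS" = "running" ]; do
  sleep 30
  STATUS=$(bash -c 'curl -s "https://api.brightdata.com/datasets/v3/progress/'"${SNAPSHOT_ID}"'" \
    -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"' | jq -r '.status')
  echo "status: $STATUS"
done
```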

---

### 4. Download Results

Once status is `ready`, download the collected data (replace `<snapshot-id>` with your actual snapshot ID):

```bash
bash -c 'curl -s "https://api.brightdata.com/datasets/v3/snapshot/<snapshot-id>?format=json" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
```
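
To work with the results afterwards, save them to a file and inspect them with `jq`. A sketch assuming the snapshot is returned as a JSON array (field names vary by dataset):

```bash
bash -c 'curl -s "https://api.brightdata.com/datasets/v3/snapshot/'"${SNAPSHOT_ID}"'?format=json" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"' > /tmp/brightdata_results.json

jq 'length' /tmp/brightdata_results.json        # number of records
jq '.[0] | keys' /tmp/brightdata_results.json   # field names in the first record
```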

---

### 5. List Snapshots

Get all your snapshots:

```bash
bash -c 'curl -s "https://api.brightdata.com/datasets/v3/snapshots" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"' | jq '.[] | {snapshot_id, dataset_id, status}'
```

---

### 6. Cancel Snapshot

Cancel a running job (replace `<snapshot-id>` with your actual snapshot ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/cancel?snapshot_id=<snapshot-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
```

---

## Platform-Specific Examples

### Twitter/X - Scrape Profile

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://twitter.com/elonmusk"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

**Returns:** `x_id`, `profile_name`, `biography`, `is_verified`, `followers`, `following`, `profile_image_link`

### Twitter/X - Scrape Posts

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://twitter.com/username/status/123456789"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

**Returns:** `post_id`, `text`, `replies`, `likes`, `retweets`, `views`, `hashtags`, `media`

---

### Reddit - Scrape Subreddit Posts

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.reddit.com/r/technology", "sort_by": "hot"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

**Parameters:** `url`, `sort_by` (new/top/hot)

**Returns:** `post_id`, `title`, `description`, `num_comments`, `upvotes`, `date_posted`, `community`

### Reddit - Scrape Comments

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.reddit.com/r/technology/comments/xxxxx/post_title"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

**Returns:** `comment_id`, `user_posted`, `comment_text`, `upvotes`, `replies`

---

### YouTube - Scrape Video Info

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

**Returns:** `title`, `views`, `likes`, `num_comments`, `video_length`, `transcript`, `channel_name`

### YouTube - Search by Keyword

Write to `/tmp/brightdata_request.json`:

```json
[
  {"keyword": "artificial intelligence", "num_of_posts": 50}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

### YouTube - Scrape Comments

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.youtube.com/watch?v=xxxxx", "load_replies": 3}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

**Returns:** `comment_text`, `likes`, `replies`, `username`, `date`

---

### Instagram - Scrape Profile

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.instagram.com/username"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

**Returns:** `followers`, `post_count`, `profile_name`, `is_verified`, `biography`

### Instagram - Scrape Posts

Write to `/tmp/brightdata_request.json`:

```json
[
  {
    "url": "https://www.instagram.com/username",
    "num_of_posts": 20,
    "start_date": "01-01-2024",
    "end_date": "12-31-2024"
  }
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
bash -c 'curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json'
```

---

## Account Management

### Check Account Status

```bash
bash -c 'curl -s "https://api.brightdata.com/status" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
```

**Response:**
```json
{
  "status": "active",
  "customer": "hl_xxxxxxxx",
  "can_make_requests": true,
  "ip": "x.x.x.x"
}
```
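
For scripting, `jq -e` turns the `can_make_requests` flag into an exit code. A small sketch, assuming `jq` is installed:

```bash
bash -c 'curl -s "https://api.brightdata.com/status" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"' \
  | jq -e '.can_make_requests' > /dev/null \
  && echo "account can make requests" \
  || echo "account cannot make requests"
```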

### Get Active Zones

```bash
bash -c 'curl -s "https://api.brightdata.com/zone/get_active_zones" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"' | jq '.[] | {name, type}'
```

### Get Bandwidth Usage

```bash
bash -c 'curl -s "https://api.brightdata.com/customer/bw" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"'
```

---

## Getting Dataset IDs

To use the scraping features, you need a `dataset_id`:

1. Go to [Bright Data Control Panel](https://brightdata.com/cp/datasets)
2. Create a new Web Scraper dataset or select an existing one
3. Choose the platform (Twitter, Reddit, YouTube, etc.)
4. Copy the `dataset_id` from the dataset settings

Dataset IDs can also be found in the bandwidth usage API response under the `data` field keys (e.g., `v__ds_api_gd_xxxxx` where `gd_xxxxx` is your dataset ID).
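
A sketch for pulling those IDs out of the response with `jq` and `grep`; the exact key layout is an assumption based on the pattern above, so adjust the paths if your response differs:

```bash
# List dataset IDs embedded in the bandwidth-usage key names
# (assumed shape: keys like "v__ds_api_gd_xxxxx" under "data").
bash -c 'curl -s "https://api.brightdata.com/customer/bw" \
  -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"' \
  | jq -r '.data | keys[]' \
  | grep -o 'gd_[a-z0-9]*' | sort -u
```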

---

## Common Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `url` | Target URL to scrape | `https://twitter.com/user` |
| `keyword` | Search keyword | `"artificial intelligence"` |
| `num_of_posts` | Limit number of results | `50` |
| `start_date` | Filter by date (MM-DD-YYYY) | `"01-01-2024"` |
| `end_date` | Filter by date (MM-DD-YYYY) | `"12-31-2024"` |
| `sort_by` | Sort order (Reddit) | `new`, `top`, `hot` |
| `format` | Response format | `json`, `csv` |
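
These parameters can be combined in a single input object. A sketch payload (support for each parameter varies by platform and dataset):

```json
[
  {
    "url": "https://www.reddit.com/r/technology",
    "sort_by": "top",
    "num_of_posts": 25,
    "start_date": "01-01-2024",
    "end_date": "06-30-2024"
  }
]
```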

---

## Rate Limits

- Batch mode: up to 100 concurrent requests
- Maximum input size: 1GB per batch
- Exceeding these limits returns an HTTP `429` error; see the retry sketch below
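
A minimal retry sketch for `429` responses, using the snapshots endpoint as an example (the attempt count and quadratic backoff are arbitrary choices):

```bash
for attempt in 1 2 3 4 5; do
  HTTP_CODE=$(bash -c 'curl -s -o /tmp/brightdata_response.json -w "%{http_code}" \
    "https://api.brightdata.com/datasets/v3/snapshots" \
    -H "Authorization: Bearer ${BRIGHTDATA_API_KEY}"')
  [ "$HTTP_CODE" != "429" ] && break
  sleep $((attempt * attempt))   # back off: 1s, 4s, 9s, 16s
done
cat /tmp/brightdata_response.json
```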

---

## Guidelines

1. **Create datasets first**: Use the Control Panel to create scraper datasets
2. **Use async for large jobs**: Use `/trigger` for discovery and batch operations
3. **Use sync for small jobs**: Use `/scrape` for single URL quick lookups
4. **Check status before download**: Poll `/progress` until status is `ready`
5. **Respect rate limits**: Don't exceed 100 concurrent requests
6. **Date format**: Use MM-DD-YYYY for date parameters

Overview

This skill provides ready-to-use curl examples and guidance for the Bright Data Web Scraper API, focusing on social media scraping, account checks, and usage monitoring. It helps you trigger synchronous or asynchronous scrapes, track job progress, download snapshots, and manage account/zone info using simple shell commands. It is designed for operators who prefer direct API calls and scripting with environment variables.

How this skill works

You send JSON payloads to Bright Data dataset endpoints to start scraping jobs (either async via /trigger or sync via /scrape). The API returns a snapshot_id for async runs; poll /progress until status is ready, then download results from the snapshot endpoint. Additional endpoints provide account status, active zones, and bandwidth usage. All examples use curl and expect an exported BRIGHTDATA_API_KEY environment variable.

When to use it

  • Collect posts, profiles, comments, and engagement metrics from Twitter/X, Reddit, YouTube, Instagram, TikTok, or LinkedIn.
  • Run large batch collections or scheduled discovery jobs (use async /trigger).
  • Perform quick lookups for single pages or videos (use sync /scrape).
  • Monitor account health, active zones, and bandwidth usage programmatically.
  • Integrate scraping into shell scripts, cron jobs, or CI pipelines that require simple curl calls.

Best practices

  • Create and select a Web Scraper dataset in the Bright Data Control Panel to obtain your dataset_id before calling the API.
  • Export BRIGHTDATA_API_KEY and wrap curl commands in bash -c '...' when piping to avoid environment-variable clearing issues.
  • Use /trigger for large or multi-URL jobs and poll /progress until status is ready before downloading snapshots.
  • Respect rate limits (max ~100 concurrent requests and 1GB input size per batch) and handle HTTP 429 gracefully with retries/backoff.
  • Filter results with parameters like num_of_posts, start_date/end_date (MM-DD-YYYY), and sort_by where supported to reduce costs and data volume.

Example use cases

  • Trigger a scheduled crawl of a list of Twitter profiles and download consolidated JSON snapshots for analytics.
  • Scrape YouTube video metadata and transcripts synchronously for a small set of video IDs.
  • Collect subreddit posts and nested comments for sentiment analysis using /trigger then download when ready.
  • Check account status, list active zones, and pull bandwidth usage daily to track consumption and costs.
  • Fetch recent Instagram posts for a brand within a date range using date filters to limit results.

FAQ

Do I need a dataset_id to scrape?

Yes. Create or select a Web Scraper dataset in the Control Panel and use its dataset_id in API calls.

When should I use /trigger vs /scrape?

Use /trigger for large or batch jobs (async snapshot flow). Use /scrape for quick, small synchronous lookups that return results immediately.