
This skill enables targeted X (Twitter) data collection via a connected browser, filtering by user or keywords and exporting Markdown, RSS, or JSON.

npx playbooks add skill inclusionai/aworld --skill x-scraper

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
2.7 KB
---
name: x-scraper
description: X (Twitter) scraping skill - scrapes a given user's tweets or the home recommendation feed through agent-browser (CDP), with support for keyword filtering, tab switching, and multi-format output. Use cases - scrape a timeline by user/keyword, view the home recommendation feed, generate RSS/JSON/Markdown.
---

# X Scraper (x-scraper)

## Overview

Scrapes X (Twitter) content through a CDP-connected browser (agent-browser). Two scripts are included:

1. **scrape_x_user.sh** — scrapes a given user's timeline, with optional keyword filtering
2. **scrape_x_home.sh** — scrapes the logged-in user's home recommendation feed (For you / Following)

Both scripts support Markdown / RSS / JSON output.

## Tool paths

- User scraping: `./scrape_x_user.sh`
- Home feed: `./scrape_x_home.sh`
- Dependencies: `agent-browser` (CDP connected and logged into X), `python3`
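Before running either script, it helps to confirm that agent-browser's CDP endpoint is actually reachable. A minimal sketch (the helper name is ours; `/json/version` is the standard DevTools HTTP endpoint, and 9222 matches the scripts' default `-p` value):

```python
# Build the DevTools (CDP) version-check URL for a given port; fetching it
# (e.g. with curl or urllib) before scraping confirms agent-browser is up.
def cdp_version_url(port: int = 9222) -> str:
    return f"http://127.0.0.1:{port}/json/version"

print(cdp_version_url())  # http://127.0.0.1:9222/json/version
```

If the URL returns a JSON payload, the browser is connected and the scripts can attach to it.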

---

## 1. User post scraping (scrape_x_user.sh)

Scrapes a user's latest posts by username, with optional keyword search filtering.

### Usage

```bash
./scrape_x_user.sh [-u <username>] [-k <keyword>] [-p <cdp_port>] [-n <max_scrolls>] [-o <output_file>] [-f <format>]
```

### Parameters

| Parameter | Description | Default |
|------|------|------|
| `-u` | X username (without @) | Alibaba_Qwen |
| `-k` | Search keyword (optional; if omitted, all of the user's latest posts are scraped) | - |
| `-p` | CDP port | 9222 |
| `-n` | Maximum scroll count | 10 |
| `-o` | Output file path | stdout |
| `-f` | Format: `md` \| `rss` \| `json` | md |

### Examples

```bash
./scrape_x_user.sh
./scrape_x_user.sh -k qwen3
./scrape_x_user.sh -u chenchengpro -k claw -f rss -o feed.xml
./scrape_x_user.sh -u chenchengpro -f json -n 20 -o data.json
```

---

## 2. Home feed scraping (scrape_x_home.sh)

Scrapes the logged-in user's X home recommendation feed, with support for switching between the For you and Following tabs.

### Usage

```bash
./scrape_x_home.sh [-t <tab>] [-p <cdp_port>] [-n <max_scrolls>] [-o <output_file>] [-f <format>]
```

### Parameters

| Parameter | Description | Default |
|------|------|------|
| `-t` | Feed tab: `foryou` \| `following` | foryou |
| `-p` | CDP port | 9222 |
| `-n` | Maximum scroll count | 5 |
| `-o` | Output file path | stdout |
| `-f` | Format: `md` \| `rss` \| `json` | md |

### Output fields

Each post includes: `author` (display name + handle), `time` (ISO timestamp), `text` (body), `link` (post URL), `hasMedia` (whether it contains images/video), `retweet` (retweet/pinned context)
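As a sketch of downstream processing, the snippet below filters a JSON export by keyword using the field names listed above. The sample posts are hypothetical; a real export would be loaded with `json.load` from a file produced by `-f json -o posts.json`.

```python
import json

# Hypothetical sample mirroring the documented output fields; a real
# export would come from: ./scrape_x_user.sh -f json -o posts.json
posts = json.loads("""[
  {"author": "Qwen (@Alibaba_Qwen)", "time": "2025-01-01T08:00:00Z",
   "text": "qwen3 benchmark results", "link": "https://x.com/a/status/1",
   "hasMedia": true, "retweet": ""},
  {"author": "Qwen (@Alibaba_Qwen)", "time": "2025-01-02T09:00:00Z",
   "text": "office hours this week", "link": "https://x.com/a/status/2",
   "hasMedia": false, "retweet": ""}
]""")

# Keep only posts whose text mentions the keyword (case-insensitive).
keyword = "qwen3"
hits = [p["link"] for p in posts if keyword in p["text"].lower()]
print(hits)  # ['https://x.com/a/status/1']
```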

### Examples

```bash
./scrape_x_home.sh                           # Scrape the For you feed
./scrape_x_home.sh -t following -n 10        # Scrape the Following timeline
./scrape_x_home.sh -f json -o feed.json      # JSON output to a file
./scrape_x_home.sh -n 3 -f rss -o home.xml   # Small scrape, RSS output
```

Overview

This skill scrapes X (Twitter) content via a browser connected through Chrome DevTools Protocol (agent-browser). It provides two scripts to harvest a user timeline or the logged-in home recommendation stream and exports results as Markdown, RSS, or JSON. The tool supports keyword filtering, tab switching (For you / Following), and configurable scrolling limits.

How this skill works

The scripts control a CDP-connected browser to navigate X, load timelines, and scroll to collect visible posts. scrape_x_user.sh targets a specific username and can filter posts by keyword; scrape_x_home.sh captures the logged-in account’s home feed and can switch between recommendation tabs. Output includes structured fields (author, time, text, link, media flag, retweet context) and is written to stdout or a file in Markdown, RSS, or JSON.

When to use it

  • Harvest a specific user’s recent tweets for archiving or analysis.
  • Monitor a public figure or project feed and filter by keywords or topics.
  • Capture the logged-in account’s For You or Following recommendation stream for research.
  • Generate RSS or JSON feeds from X content for downstream consumption.
  • Quickly export timelines to Markdown for note-taking or reporting.

Best practices

  • Ensure agent-browser CDP is running and the browser is logged into X before running scripts.
  • Use modest max scroll counts for initial runs to avoid long sessions and to validate results.
  • Combine keyword filters with user scraping to reduce noise and focus on relevant posts.
  • Prefer JSON output for programmatic processing; use RSS for feed consumption and Markdown for human review.
  • Rotate output file paths when collecting repeated snapshots to avoid overwriting historical captures.
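One way to rotate output paths, per the last point, is to timestamp each snapshot file so repeated runs never overwrite earlier captures. A sketch; the `snapshots/` directory and naming scheme are our own choices:

```python
from datetime import datetime, timezone

def snapshot_path(prefix: str, fmt: str = "json") -> str:
    # e.g. snapshots/Alibaba_Qwen-20250101T080000Z.json
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"snapshots/{prefix}-{stamp}.{fmt}"

# Pass the result to the scripts' -o flag, e.g.:
#   ./scrape_x_user.sh -u Alibaba_Qwen -f json -o "<generated path>"
print(snapshot_path("Alibaba_Qwen"))
```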

Example use cases

  • Daily scrape of a project lead’s timeline, filtering for release-related keywords, output to JSON for downstream analytics.
  • Capture your logged-in For You feed weekly to study recommendation differences versus Following, export as RSS for review.
  • Create a Markdown digest of a researcher’s posts filtered by topic for inclusion in a weekly newsletter.
  • Fetch 10 pages of a user timeline to build a local archive and convert to JSON for import into a search index.
  • Quickly generate an RSS feed from the home recommendation stream for a private monitoring dashboard.
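For the local-archive use case above, repeated snapshots will overlap; deduplicating on the `link` field is one way to merge them before indexing. A sketch with hypothetical sample data:

```python
# Merge two hypothetical snapshot lists, keeping the first copy of each link.
snapshot_a = [{"link": "https://x.com/a/status/1", "text": "first"},
              {"link": "https://x.com/a/status/2", "text": "second"}]
snapshot_b = [{"link": "https://x.com/a/status/2", "text": "second"},
              {"link": "https://x.com/a/status/3", "text": "third"}]

merged = {}
for post in snapshot_a + snapshot_b:
    merged.setdefault(post["link"], post)  # first occurrence wins

print(len(merged))  # 3 unique posts
```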

FAQ

What prerequisites are required to run the scripts?

A running agent-browser with CDP enabled and logged into X, plus Python 3 installed on the system.

How do I limit how many posts are collected?

Use the -n parameter to control the maximum number of scrolls; lower values reduce the number of posts fetched.

Which output format should I choose?

Use JSON for automated processing, RSS for feed readers, and Markdown for human-friendly summaries.