
This skill extracts news articles from major platforms and outputs structured JSON and Markdown for easy reuse.

npx playbooks add skill nanmicoder/claude-code-skills --skill news-extractor

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: news-extractor
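description: News site content extraction. Supports WeChat public accounts (微信公众号), Toutiao (今日头条), NetEase News (网易新闻), Sohu News (搜狐新闻), and Tencent News (腾讯新闻). Activate when the user needs to extract news content, fetch public account articles, scrape news, or get news as JSON/Markdown.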
---

# News Extractor Skill

Extracts article content from mainstream news platforms and outputs it as JSON and Markdown.

## Supported platforms

| Platform | ID | Example URL |
|------|-----|----------|
| WeChat public accounts (微信公众号) | wechat | `https://mp.weixin.qq.com/s/xxxxx` |
| Toutiao (今日头条) | toutiao | `https://www.toutiao.com/article/123456/` |
| NetEase News (网易新闻) | netease | `https://www.163.com/news/article/ABC123.html` |
| Sohu News (搜狐新闻) | sohu | `https://www.sohu.com/a/123456_789` |
| Tencent News (腾讯新闻) | tencent | `https://news.qq.com/rain/a/20251016A07W8J00` |
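
As a rough illustration of how a URL could be mapped to one of the IDs in the table above, here is a minimal sketch. The `PLATFORM_HOSTS` mapping and the `detect_platform` function are hypothetical names inferred from the example URLs only; the skill's actual matching rules are described in `references/platform-patterns.md` and may differ.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical host-to-ID mapping, inferred from the example URLs in the table.
# The skill's real URL patterns may be broader or stricter than this.
PLATFORM_HOSTS = {
    "mp.weixin.qq.com": "wechat",
    "www.toutiao.com": "toutiao",
    "www.163.com": "netease",
    "www.sohu.com": "sohu",
    "news.qq.com": "tencent",
}


def detect_platform(url: str) -> Optional[str]:
    """Return the platform ID for a URL, or None if the host is not listed."""
    host = (urlparse(url).hostname or "").lower()
    return PLATFORM_HOSTS.get(host)


print(detect_platform("https://mp.weixin.qq.com/s/xxxxx"))
print(detect_platform("https://example.com/article/1"))
```

Matching on the exact hostname keeps the sketch simple; a real implementation would likely also need to handle mobile subdomains and path patterns.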

## Installing dependencies

This skill uses uv to manage dependencies. Install them before first use:

```bash
cd ~/.claude/skills/news-extractor
uv sync
```

**Important**: Run every script with `uv run` rather than invoking `python` directly. `uv run` automatically uses the dependencies in the project's virtual environment.

### Dependency list

| Package | Purpose |
|------|------|
| pydantic | Data model validation |
| requests | HTTP requests |
| curl_cffi | Fetching with browser impersonation |
| tenacity | Retry logic |
| parsel | HTML/XPath parsing |
| demjson3 | Parsing non-standard JSON |

## Usage

### Basic usage

```bash
# Extract an article, auto-detecting the platform; writes JSON + Markdown
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL"

# Specify the output directory
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --output ./output

# Output JSON only
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format json

# Output Markdown only
uv run .claude/skills/news-extractor/scripts/extract_news.py "URL" --format markdown

# List the supported platforms
uv run .claude/skills/news-extractor/scripts/extract_news.py --list-platforms
```

### Output files

By default the script writes both formats to the output directory (default `./output`):
- `{news_id}.json` - structured JSON data
- `{news_id}.md` - the article in Markdown

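## Workflow

1. **Receive URL** - the user provides a news link
2. **Detect platform** - the platform type is identified automatically
3. **Extract content** - the matching crawler fetches and parses the page
4. **Convert formats** - JSON and Markdown are generated
5. **Write files** - the results are saved to the output directory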

## Output format

### JSON structure

```json
{
  "title": "Article title",
  "news_url": "Original URL",
  "news_id": "Article ID",
  "meta_info": {
    "author_name": "Author/source",
    "author_url": "",
    "publish_time": "2024-01-01 12:00"
  },
  "contents": [
    {"type": "text", "content": "Paragraph text", "desc": ""},
    {"type": "image", "content": "https://...", "desc": ""},
    {"type": "video", "content": "https://...", "desc": ""}
  ],
  "texts": ["Paragraph 1", "Paragraph 2"],
  "images": ["Image URL 1", "Image URL 2"],
  "videos": []
}
```
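
The structure above suggests that `texts`, `images`, and `videos` are flat views of the ordered `contents` list, grouped by `type`. That relationship is an assumption read off the example, not something the skill documents. Under that assumption, a minimal sketch of deriving the flat lists looks like this (`flatten_contents` is a hypothetical helper, not part of the skill):

```python
import json

# A trimmed-down payload in the shape documented above.
raw = """
{
  "title": "Example",
  "contents": [
    {"type": "text", "content": "First paragraph", "desc": ""},
    {"type": "image", "content": "https://example.com/a.png", "desc": ""},
    {"type": "text", "content": "Second paragraph", "desc": ""}
  ]
}
"""


def flatten_contents(contents):
    """Group the `content` values by block type, preserving document order."""
    grouped = {"text": [], "image": [], "video": []}
    for block in contents:
        # setdefault tolerates block types that are not in the documented set.
        grouped.setdefault(block["type"], []).append(block["content"])
    return grouped


article = json.loads(raw)
flat = flatten_contents(article["contents"])
print(flat["text"])
print(flat["image"])
```

Consumers that need reading order should iterate `contents`; the flat lists are convenient when only one media type matters.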

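### Markdown structure

The template below keeps its original Chinese labels: 文章信息 (article info), 作者 (author), 发布时间 (publish time), 原文链接 (original link), 正文内容 (body), 媒体资源 (media resources), 图片 (images).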

```markdown
# 文章标题

## 文章信息
**作者**: xxx
**发布时间**: 2024-01-01 12:00
**原文链接**: [链接](URL)

---

## 正文内容

段落内容...

![图片](URL)

---

## 媒体资源
### 图片 (N)
1. URL1
2. URL2
```
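
To make the mapping from the JSON payload to this layout concrete, here is a minimal sketch of a renderer. `render_markdown` is a hypothetical function written for illustration; the labels are copied from the template above, and the skill's own formatter may handle videos, captions, and missing fields differently.

```python
def render_markdown(article: dict) -> str:
    """Render an article dict into the Markdown layout shown in the template."""
    meta = article.get("meta_info", {})
    lines = [
        f"# {article['title']}",
        "",
        "## 文章信息",  # article info
        f"**作者**: {meta.get('author_name', '')}",  # author
        f"**发布时间**: {meta.get('publish_time', '')}",  # publish time
        f"**原文链接**: [链接]({article['news_url']})",  # original link
        "",
        "---",
        "",
        "## 正文内容",  # body
        "",
    ]
    for block in article.get("contents", []):
        if block["type"] == "text":
            lines += [block["content"], ""]
        elif block["type"] == "image":
            lines += [f"![图片]({block['content']})", ""]
        # Video blocks are skipped: the template does not show how they render.
    images = article.get("images", [])
    if images:
        lines += ["---", "", "## 媒体资源", f"### 图片 ({len(images)})"]
        lines += [f"{i}. {url}" for i, url in enumerate(images, start=1)]
    return "\n".join(lines) + "\n"


example = {
    "title": "Example",
    "news_url": "https://example.com/a",
    "meta_info": {"author_name": "Someone", "publish_time": "2024-01-01 12:00"},
    "contents": [
        {"type": "text", "content": "A paragraph.", "desc": ""},
        {"type": "image", "content": "https://example.com/a.png", "desc": ""},
    ],
    "images": ["https://example.com/a.png"],
}
print(render_markdown(example))
```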

## Usage examples

### Extract a WeChat public account article

```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://mp.weixin.qq.com/s/ebMzDPu2zMT_mRgYgtL6eQ"
```

Output:
```
[INFO] Platform detected: wechat (微信公众号)
[INFO] Extracting content...
[INFO] Title: Article title
[INFO] Author: Account name
[INFO] Text paragraphs: 15
[INFO] Images: 3
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.json
[SUCCESS] Saved: ./output/ebMzDPu2zMT_mRgYgtL6eQ.md
```

### Extract a Toutiao article

```bash
uv run .claude/skills/news-extractor/scripts/extract_news.py \
  "https://www.toutiao.com/article/7434425099895210546/"
```

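## Error handling

| Error | Description | Resolution |
|----------|------|----------|
| `无法识别该平台` (platform not recognized) | The URL matches no supported platform | Check that the URL is correct |
| `平台不支持` (platform not supported) | The site is not supported | This skill supports only the news sites listed above |
| `提取失败` (extraction failed) | Network error or a change in page structure | Retry, or check that the URL is valid |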
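
For the retry advice in the last row, the following is a minimal standard-library sketch of retrying with exponential backoff. The skill lists tenacity as its retry dependency; this sketch deliberately does not use it, and `with_retries` is a hypothetical helper rather than the skill's code. Retrying helps with transient network errors only; a changed page structure will fail on every attempt.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # broad on purpose: this is only a sketch
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


# Demonstration with a function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# A no-op sleep keeps the demonstration instant.
print(with_retries(flaky, sleep=lambda seconds: None))
```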

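## Notes

- For educational and research use only
- Do not scrape at scale
- Respect each target site's robots.txt and terms of service
- WeChat public account pages may require a valid Cookie (the current default configuration usually works)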

## References

- [Platform URL patterns](references/platform-patterns.md)

Overview

This skill extracts full article content from major Chinese news platforms and outputs structured JSON and Markdown. It supports WeChat public accounts, Toutiao, NetEase, Sohu, and Tencent News, producing an article payload with metadata, text blocks, images, and videos. The tool is optimized for single-URL extraction and local file output.

How this skill works

Provide a news URL and the skill auto-detects the platform by URL pattern. It fetches the page, runs platform-specific parsing logic to extract title, author, publish time, paragraphs and media, then formats results into a JSON object and a Markdown document and saves them to an output directory. The extractor includes retry and HTML parsing safeguards to handle common page variations.

When to use it

  • You need a clean structured JSON representation of a single news article for downstream processing.
  • You want a ready-to-read Markdown file of a news article for notes, reports, or archiving.
  • You need to scrape public article content from supported Chinese news platforms for research or analysis.
  • You want to batch or script extraction into a pipeline that consumes article JSON.
  • You need images and media links extracted alongside text for content auditing or indexing.

Best practices

  • Use one URL per extraction to minimize parsing errors and to make outputs deterministic.
  • Respect site terms and robots.txt; avoid high-frequency batch scraping.
  • Provide valid cookies if extracting WeChat content that requires a session.
  • Validate the saved JSON before downstream ingestion; page structure can change over time.
  • Run the extraction with `uv run` so that the project virtual environment supplies the correct dependencies.
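
The validation step can be sketched as a small check run on each saved file before ingestion. The required keys are taken from the documented JSON structure; the specific checks and the `validation_errors` name are hypothetical, and a stricter pipeline might use a pydantic model instead.

```python
from typing import List

# Top-level keys from the documented JSON structure.
REQUIRED_KEYS = {
    "title", "news_url", "news_id", "meta_info",
    "contents", "texts", "images", "videos",
}


def validation_errors(article: dict) -> List[str]:
    """Return a list of problems found in an article payload (empty if none)."""
    errors = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - article.keys())]
    if not str(article.get("title", "")).strip():
        errors.append("empty title")
    if not article.get("texts"):
        # An article with no paragraphs often means the page layout changed.
        errors.append("no text paragraphs")
    return errors


good = {
    "title": "Example", "news_url": "https://example.com/a", "news_id": "a",
    "meta_info": {}, "contents": [], "texts": ["A paragraph."],
    "images": [], "videos": [],
}
print(validation_errors(good))
print(validation_errors({"title": " "}))
```

In practice the payload would come from `json.load` on a `{news_id}.json` file in the output directory.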

Example use cases

  • Researcher exporting articles to JSON for text analysis or NLP model training.
  • Content curator saving Markdown copies of news articles for summaries or newsletters.
  • Developer integrating a one-off scraper into a data pipeline that indexes headlines and images.
  • Journalist archiving source articles with metadata (author, publish time, original link).
  • QA teams verifying that article media (images/videos) are correctly captured and linked.

FAQ

Which platforms are supported?

WeChat public accounts, Toutiao, NetEase, Sohu, and Tencent News.

What formats does the skill produce?

Structured JSON with metadata and a Markdown article file; you can choose one or both.

What if extraction fails?

Common causes are invalid URLs, page layout changes, or required cookies; retry, check the URL, or provide session cookies for protected WeChat pages.