home / skills / wwwzhouhui / skills_collection / wechat-article-fetcher

wechat-article-fetcher skill

/wechat-article-fetcher

This skill helps you fetch, parse, and save WeChat official account articles, delivering metadata, Markdown, and images for offline reading.

npx playbooks add skill wwwzhouhui/skills_collection --skill wechat-article-fetcher

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
4.4 KB
---
name: wechat-article-fetcher
description: "获取并解析微信公众号文章。从微信文章链接中提取标题、作者、公众号名称、正文内容、图片和元数据。当用户提供微信文章链接(mp.weixin.qq.com/s/...)并希望阅读、提取、下载或转换文章内容时使用。适用场景包括获取/下载微信文章、提取微信文章文字或元数据、将微信文章转换为Markdown、将微信文章连同图片保存到本地。关键词:微信公众号、文章获取、文章抓取、文章下载。"
---

# 微信公众号文章获取器

获取、解析并保存微信公众号文章,支持单篇和批量下载、元数据提取、图片下载和 Markdown 转换。

## 快速开始

获取单篇文章:

```bash
python scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/xxxxx"
```

批量获取多篇文章(空格分隔):

```bash
python scripts/fetch_wechat_article.py "url1" "url2" "url3" --output-dir ./output
```

批量获取多篇文章(逗号分隔):

```bash
python scripts/fetch_wechat_article.py "url1,url2,url3" --output-dir ./output
```

仅输出元数据(不保存文件):

```bash
python scripts/fetch_wechat_article.py "https://mp.weixin.qq.com/s/xxxxx" --json
```

## 依赖安装

```bash
pip install beautifulsoup4 html2text requests
```

## 功能说明

### 1. 获取文章并保存到本地

```bash
python scripts/fetch_wechat_article.py "<url>" --output-dir ./output
```

输出目录结构:
```
output/<公众号名称>/<日期>_<标题>/
├── index.html    # 格式化的独立HTML文件
├── article.md    # Markdown版本
├── meta.json     # 文章元数据
└── images/       # 下载的图片
```

### 2. 仅提取元数据

```bash
python scripts/fetch_wechat_article.py "<url>" --json
```

返回 JSON 包含:`title`(标题)、`author`(作者)、`account_nickname`(公众号名称)、`description`(摘要)、`create_time`(发布时间)、`content_text`(正文文本)、`content_markdown`(Markdown内容)、`cover_image`(封面图)、`source_url`(原文链接)。

### 3. 批量下载多篇文章

空格分隔多个链接:

```bash
python scripts/fetch_wechat_article.py "url1" "url2" "url3" --output-dir ./output
```

逗号分隔多个链接:

```bash
python scripts/fetch_wechat_article.py "url1,url2,url3" --output-dir ./output
```

自定义下载间隔(默认3秒,避免触发反爬):

```bash
python scripts/fetch_wechat_article.py "url1" "url2" --interval 5
```

同一公众号的文章自动归类到同一目录下。

### 4. 不下载图片

```bash
python scripts/fetch_wechat_article.py "<url>" --no-images
```

### 4. 不下载图片

```bash
python scripts/fetch_wechat_article.py "<url>" --no-images
```

### 5. 作为 Python 库调用

```python
from scripts.fetch_wechat_article import fetch_article, batch_fetch

# 单篇获取并保存
result = fetch_article("https://mp.weixin.qq.com/s/xxxxx", output_dir="./output")
print(result['title'], result['path'])

# 单篇仅获取元数据
meta = fetch_article("https://mp.weixin.qq.com/s/xxxxx", json_only=True)
print(meta['title'])
print(meta['content_text'][:200])

# 批量获取
urls = ["https://mp.weixin.qq.com/s/aaa", "https://mp.weixin.qq.com/s/bbb"]
stats = batch_fetch(urls, output_dir="./output", interval=3.0)
print(f"成功{stats['success']}篇, 失败{stats['fail']}篇")
```

主要函数参数:
- `url`:文章链接(支持短链接和长链接)
- `output_dir`:保存目录(默认:`./wechat_articles`)
- `download_img`:是否下载图片(默认:`True`)
- `to_markdown`:是否转换为 Markdown(默认:`True`)
- `json_only`:仅返回元数据字典,不保存文件

`batch_fetch` 额外参数:
- `urls`:文章链接列表
- `interval`:每篇文章之间的下载间隔秒数(默认:`3.0`)

## 注意事项

- **优先使用短链接**(`/s/xxxxx`)—— 带 `__biz` 参数的长链接可能触发验证码。
- 批量下载时默认间隔3秒,可通过 `--interval` 调整,避免触发微信反爬机制。
- 自动使用微信移动端 User-Agent 绕过访问限制。
- 微信图片使用 `data-src` 属性(非 `src`),因为采用了懒加载。
- 下载图片需要设置 `Referer: https://mp.weixin.qq.com/` 请求头。
- HTML 结构详情参见 [references/wechat_html_structure.md](references/wechat_html_structure.md)。

Overview

This skill fetches and parses WeChat Official Account (mp.weixin.qq.com) articles, extracting title, author, account name, body text, images, and metadata. It can save a formatted HTML, Markdown, images, and a JSON metadata file, or return metadata only for programmatic use. Designed for single or batch processing with configurable download intervals and optional image downloading.

How this skill works

Given one or more mp.weixin.qq.com/s/... links (short or long), the skill requests the article using a mobile WeChat User-Agent, parses the HTML to extract metadata and content, converts content to Markdown if requested, and downloads images (respecting lazy-loaded data-src and Referer requirements). It exposes a command-line script and a Python API with fetch_article and batch_fetch functions for automation.

When to use it

  • Save a single WeChat article locally including images and Markdown for offline reading.
  • Batch-download multiple articles from the same public account and organize them by account/date.
  • Extract article metadata (title, author, publish time, description, cover) for analysis or cataloging.
  • Convert WeChat articles to Markdown to republish or import into CMS tools.
  • Download images referenced by articles while preserving correct Referer and lazy-load handling.

Best practices

  • Prefer short /s/ links to avoid verification challenges with long __biz links.
  • Set a download interval (default 3s) when batching to reduce the chance of triggering anti-scraping protections.
  • Use --no-images when only text/metadata is needed to speed up processing.
  • Validate output paths and sanitize titles to avoid filesystem conflicts when saving files.
  • When downloading images, maintain Referer header and handle data-src lazy-loaded attributes.

Example use cases

  • Command-line: fetch a single article and save HTML, Markdown, images, and meta.json to output directory.
  • Batch mode: supply multiple URLs (space- or comma-separated) to download many articles with a configurable interval.
  • Developer integration: call fetch_article(url, json_only=True) inside a script to ingest article metadata into a database.
  • Content workflow: convert WeChat posts to Markdown for editing and republishing in a blog or knowledge base.
  • Archival: organize articles by public account and date for long-term storage and search.

FAQ

Can I avoid downloading images?

Yes. Use the --no-images flag or pass download_img=False to the Python API to skip image downloads and only save text and metadata.

Does it support batch downloads without rate limits?

You can batch-download, but set an interval (default 3 seconds) between requests to reduce the risk of being rate-limited or blocked by WeChat anti-scraping measures.

What output formats are available?

The tool can save a self-contained HTML, a Markdown (.md) version, downloaded images in an images/ folder, and a meta.json file containing title, author, publish time, content_text, content_markdown, and cover_image.