web-fetch skill

This skill fetches web content, including WeChat articles, and converts it to clean Markdown and PDF with noise removal and image preservation.

npx playbooks add skill jst-well-dan/skill-box --skill web-fetch


---
name: web-fetch
description: Use this skill when users want to scrape web content and convert it to clean Markdown or PDF. Handles workflows like "Save this webpage as PDF", "Fetch this article", "抓取网页内容", or "转换为PDF". Supports crawl4ai for general web scraping and Playwright-based WeChat (微信公众号) article fetching with anti-bot bypass. Automatically converts to PDF by default unless user specifies Markdown-only.
---

# Web Fetch

Fetch web content and convert to clean Markdown and PDF formats. Supports general websites and WeChat (微信公众号) articles.

## Features

- Automatic noise removal (navigation, headers, footers, sidebars)
- Image preservation with alt text
- WeChat article special handling (lazy-loaded images, metadata extraction)
- Clean Markdown output ready for translation or processing
- **PDF conversion with clean reading style**
- **CJK font support for Chinese content**
- **Both MD and PDF output by default** 

## Dependencies

```bash
# Core dependencies
pip install crawl4ai requests beautifulsoup4 markdownify

# WeChat article fetching
pip install playwright
playwright install chromium

# PDF conversion with CJK font support
pip install reportlab markdown beautifulsoup4
```

**Note**: `reportlab` provides excellent CJK font support and works on Windows/Mac/Linux without system dependencies.

## Usage

### General Web Pages

For most websites, use the crawl4ai-based fetcher:

```bash
python scripts/fetch_web_content.py <url> <output_filename>
```

Example:
```bash
python scripts/fetch_web_content.py https://example.com/article article.md
```

### WeChat Articles (微信公众号)

For WeChat articles, use the Playwright-based fetcher with anti-bot bypass:

```bash
python scripts/fetch_weixin.py <url> [output_filename]
```

Examples:
```bash
# Auto-generate filename (YYYYMMDD+Title format)
python scripts/fetch_weixin.py "https://mp.weixin.qq.com/s/xxxxx"

# Custom filename
python scripts/fetch_weixin.py "https://mp.weixin.qq.com/s/xxxxx" article.md
```

**Features:**
- Uses real Chromium browser to bypass anti-bot protections
- Handles lazy-loaded images automatically
- Auto-generates the filename from the publish date and title (YYYYMMDD+Title format)
- Supports both visible browser (for debugging) and headless mode
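
The YYYYMMDD+Title naming can be sketched as follows. This is a minimal illustration, not the code in `fetch_weixin.py`: the helper name `build_filename`, the set of stripped characters, and the `untitled` fallback are all assumptions.

```python
import re
from datetime import date

# Characters that are not allowed in Windows filenames (assumed cleanup rule).
_ILLEGAL = re.compile(r'[\\/:*?"<>|\r\n\t]+')


def build_filename(publish_date: date, title: str, ext: str = ".md") -> str:
    """Return 'YYYYMMDD' + cleaned title + extension (hypothetical helper)."""
    cleaned = _ILLEGAL.sub("", title).strip()
    # Fall back to a fixed stem if nothing is left after cleaning (assumption).
    return f"{publish_date:%Y%m%d}{cleaned or 'untitled'}{ext}"


print(build_filename(date(2025, 12, 14), "关于财政政策和货币政策的关系"))
# prints: 20251214关于财政政策和货币政策的关系.md
```

Stripping path separators and quotes keeps the generated name usable on Windows, macOS, and Linux, whatever the article title contains.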

### Convert Markdown to PDF

After fetching content to Markdown, convert to PDF:

```bash
python scripts/md_to_pdf.py <markdown_file> [--output output.pdf]
```

Examples:
```bash
# Convert single file to PDF (auto-generates output name)
python scripts/md_to_pdf.py article.md

# Convert with custom output name
python scripts/md_to_pdf.py article.md --output custom_name.pdf

# Batch convert entire directory
python scripts/md_to_pdf.py ./articles_folder --concurrency 4
```

**Features:**
- Excellent Chinese (CJK) font support using Microsoft YaHei
- Image rendering support (HTTP/HTTPS URLs and local paths)
- Automatic image scaling with aspect ratio preservation
- Both single file and batch directory conversion
- Clean, readable typography optimized for Chinese content
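
Aspect-ratio-preserving scaling comes down to applying one scale factor to both dimensions. The sketch below illustrates the arithmetic only; `fit_within` is a hypothetical helper, and the rule that images are never upscaled is an assumption, not confirmed behavior of `md_to_pdf.py`.

```python
def fit_within(width: float, height: float,
               max_width: float, max_height: float) -> tuple[float, float]:
    """Scale (width, height) to fit inside the box, keeping the aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Use the tighter of the two constraints; cap at 1.0 so small images
    # are never enlarged (assumption).
    scale = min(max_width / width, max_height / height, 1.0)
    return width * scale, height * scale


print(fit_within(1600, 900, 400, 600))  # prints: (400.0, 225.0)
```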

## Response Pattern

When a user requests web content fetching:

1. **Identify URL type:**
   - WeChat URL (`mp.weixin.qq.com`) → use `fetch_weixin.py`
   - Other URLs → use `fetch_web_content.py`

2. **Determine output format:**
   - User mentions "PDF" explicitly → MD + PDF
   - User says "only MD"/"no PDF"/"markdown only" → MD only
   - **Ambiguous request** → Ask: "Would you like PDF format as well?"

   **Detection examples:**
   - "Fetch as PDF" / "转换为PDF" → MD + PDF
   - "Save to PDF" → MD + PDF
   - "Get markdown only" / "只要markdown" → MD only
   - "Fetch this article" → **Ask user**
   - "抓取网页内容" → **Ask user**

3. **Execute fetching:**
   ```bash
   python scripts/fetch_web_content.py <url> <output>.md
   # or
   python scripts/fetch_weixin.py <url> [output.md]
   ```

   **Note:** For WeChat articles, the output filename is optional; when omitted, it is auto-generated as YYYYMMDD+Title

4. **Convert to PDF (if requested):**
   ```bash
   python scripts/md_to_pdf.py <output>.md
   ```
   This creates `<output>.pdf` alongside `<output>.md`

5. **Report results:**
   - Confirm which files were saved (both `.md` and `.pdf` when PDF was requested)
   - Show statistics for each output (for example, file size and word count)
   - Suggest next steps
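
The format detection in step 2 can be sketched as a small classifier. This is an illustrative sketch under stated assumptions, not the skill's actual logic: `Output`, `MD_ONLY_PHRASES`, and `detect_output` are hypothetical names, and the phrase list is copied from the detection examples above.

```python
from enum import Enum


class Output(Enum):
    MD_ONLY = "md"
    MD_AND_PDF = "md+pdf"
    ASK = "ask"


# Checked before the "pdf" keyword, because "no PDF" also contains "pdf".
MD_ONLY_PHRASES = ("only md", "no pdf", "markdown only", "只要markdown")


def detect_output(request: str) -> Output:
    """Classify a user request into MD only, MD + PDF, or 'ask the user'."""
    text = request.lower()
    if any(phrase in text for phrase in MD_ONLY_PHRASES):
        return Output.MD_ONLY
    if "pdf" in text:
        return Output.MD_AND_PDF
    return Output.ASK


for request in ("Fetch as PDF", "只要markdown", "Fetch this article"):
    print(request, "->", detect_output(request).name)
```

A plain substring check also matches "pdf" inside a URL such as `.../file.pdf`, so a more careful version would strip URLs from the request before classifying it.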

## Example Workflows

### Workflow 1: Fetch with PDF (Explicit Request)

```bash
# User: "Fetch this article as PDF: https://example.com/article"

# Step 1: Fetch markdown
python scripts/fetch_web_content.py https://example.com/article article.md

# Step 2: Convert to PDF
python scripts/md_to_pdf.py article.md

# Result:
# ✓ Saved: article.md (45 KB, 8,234 words)
# ✓ PDF: article.pdf (with images embedded)
```

### Workflow 2: Fetch Markdown Only

```bash
# User: "Get the markdown only"

# Step 1: Fetch markdown
python scripts/fetch_web_content.py https://example.com/article article.md

# Step 2: Skip PDF conversion

# Result:
# ✓ Saved: article.md (45 KB, 8,234 words)
```

### Workflow 3: Ambiguous Request

```bash
# User: "Fetch this article: https://example.com/article"

# Claude asks: "I'll fetch this article. Would you like me to convert it to PDF as well?"
# User: "Yes"

# Then proceed with Workflow 1
```

### Workflow 4: WeChat Article with PDF

```bash
# User: "抓取微信文章为PDF" (Fetch this WeChat article as PDF)

# Step 1: Fetch markdown (auto-generates filename as YYYYMMDD+Title)
python scripts/fetch_weixin.py "https://mp.weixin.qq.com/s/xxxxx"

# Step 2: Convert to PDF (use the auto-generated filename)
python scripts/md_to_pdf.py "20251214关于财政政策和货币政策的关系.md"

# Result:
# ✓ Saved: 20251214关于财政政策和货币政策的关系.md (Chinese content)
# ✓ PDF: 20251214关于财政政策和货币政策的关系.pdf (Chinese text and images fully supported)
```

### Batch Processing

For multiple URLs, loop through and fetch each:
```bash
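i=0
for url in url1 url2 url3; do  # replace url1 url2 url3 with real URLs
  i=$((i + 1))
  # The counter keeps names unique when two fetches finish within the same second
  filename="output_$(date +%s)_$i"
  python scripts/fetch_web_content.py "$url" "$filename.md"
  python scripts/md_to_pdf.py "$filename.md"  # Optional: add PDF
done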
```
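
The same batch can be planned from Python. The sketch below only builds the command lines and does not execute them; `plan_batch` and the `output_001.md` naming scheme are assumptions for illustration, while the script paths are the ones used above.

```python
def plan_batch(urls: list[str], prefix: str = "output",
               with_pdf: bool = True) -> list[list[str]]:
    """Return the commands to run, in order, for a list of URLs."""
    commands = []
    for index, url in enumerate(urls, start=1):
        # A zero-padded index gives each URL a unique, sortable filename.
        md_name = f"{prefix}_{index:03d}.md"
        commands.append(["python", "scripts/fetch_web_content.py", url, md_name])
        if with_pdf:
            commands.append(["python", "scripts/md_to_pdf.py", md_name])
    return commands


for command in plan_batch(["https://example.com/a", "https://example.com/b"]):
    print(" ".join(command))
```

Each command is a list of arguments, so it can be passed to `subprocess.run(command, check=True)` without shell quoting concerns.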

## Troubleshooting

| Issue | Solution |
|-------|----------|
| Empty content | Try a different CSS selector, or use the Playwright-based WeChat fetcher |
| Missing images | Check whether the site blocks external requests |
| Encoding issues | Content is saved as UTF-8 by default; open the file as UTF-8 |
| WeChat blocked | Use the Playwright fetcher, which launches a real browser to bypass anti-bot checks |
| **WeChat timeout** | Script has 60s timeout with retry - usually succeeds on second attempt |
| **Playwright not installed** | Run: `pip install playwright && playwright install chromium` |
| **PDF conversion failed** | Install dependencies: `pip install reportlab markdown beautifulsoup4` |
| **Chinese characters in PDF** | Microsoft YaHei font is automatically used (excellent CJK support) |
| **Images missing in PDF** | Check that image URLs are accessible or local image paths are correct |
| **PDF too large** | Images are embedded and scaled; original image size affects PDF size |
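
The "timeout with retry" behavior mentioned for WeChat can be pictured as a generic retry wrapper. This is a sketch of the general technique, not the code in `fetch_weixin.py`; `with_retry` and its parameters are hypothetical.

```python
import time


def with_retry(action, attempts=2, delay_seconds=0.0, retry_on=(TimeoutError,)):
    """Call action() up to `attempts` times and re-raise the last failure."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as error:
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)  # pause before the next attempt
    raise last_error


calls = []


def flaky_fetch():
    """Stand-in for a page load that times out once, then succeeds."""
    calls.append(1)
    if len(calls) == 1:
        raise TimeoutError("first attempt timed out")
    return "page content"


print(with_retry(flaky_fetch), "after", len(calls), "attempts")
```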

Overview

This skill fetches web pages and converts them into clean Markdown and PDF outputs. It supports general websites via crawl4ai and handles WeChat (mp.weixin.qq.com) articles with a Playwright-based browser to bypass anti-bot protections. By default it produces both Markdown and PDF unless you request Markdown-only.

How this skill works

The skill inspects the URL to choose the appropriate fetcher: WeChat URLs use a Playwright browser fetcher to handle lazy-loaded images and anti-bot measures, while other sites use a crawl4ai-based scraper with HTML cleaning. It strips navigation, headers, footers and sidebars, preserves images with alt text, and outputs sanitized Markdown. If PDF is requested (default), it converts the Markdown to a typographically clean PDF with CJK font support.
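
The URL inspection described above can be sketched as a hostname check. `choose_fetcher` is a hypothetical helper; the script paths and the `mp.weixin.qq.com` host come from the skill file, and exact hostname matching is an assumption.

```python
from urllib.parse import urlparse

WECHAT_HOST = "mp.weixin.qq.com"


def choose_fetcher(url: str) -> str:
    """Return the script path to run for this URL."""
    # Compare the parsed hostname rather than searching the whole URL, so a
    # query string that merely mentions the WeChat host does not match.
    host = (urlparse(url).hostname or "").lower()
    if host == WECHAT_HOST:
        return "scripts/fetch_weixin.py"
    return "scripts/fetch_web_content.py"


print(choose_fetcher("https://mp.weixin.qq.com/s/xxxxx"))
print(choose_fetcher("https://example.com/article"))
```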

When to use it

- Save a web article as a PDF for offline reading or archival.
- Extract clean Markdown for translation, editing, or republishing workflows.
- Fetch WeChat public articles that are blocked by simple scrapers.
- Batch-convert a directory of Markdown files into PDFs with proper Chinese font rendering.
- Preserve and correctly scale images in the final PDF.

Best practices

- Specify whether you want PDF output or Markdown-only to avoid extra conversion steps.
- Provide the full article URL; for ambiguous requests I’ll ask whether you want PDF too.
- For WeChat articles, use the mp.weixin.qq.com link so the Playwright fetcher can auto-generate the filename and metadata.
- Run `playwright install chromium` beforehand if you plan to use the WeChat fetcher.
- Check that image URLs are publicly accessible when images are missing in the PDF.

Example use cases

- "Save this webpage as PDF: https://example.com/article" → fetch with crawl4ai and produce article.md + article.pdf.
- "Fetch this WeChat article" → Playwright fetcher auto-generates YYYYMMDD+Title.md and converts to PDF with CJK fonts.
- "Get markdown only" → produce cleaned Markdown without PDF conversion for translation workflows.
- Batch processing a folder of articles into PDFs using md_to_pdf.py with concurrency.
- Recover a noisy webpage into readable Markdown for content analysis or republishing.

FAQ

What if the scraper returns empty content?

Try the WeChat Playwright fetcher for pages with heavy JS or adjust CSS selectors; retrying often helps.

How do I force Markdown-only output?

Tell me explicitly "Markdown only" or "no PDF" and the skill will skip PDF conversion.