fetch-url skill

This skill fetches a URL, extracts the article body in Markdown by default, and supports various output formats and browser-based strategies.

npx playbooks add skill dcjanus/prompts --skill fetch-url

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: fetch-url
description: Fetch a URL and extract its main content (Markdown by default); built-in X/Twitter URL handling improves the fetch success rate on restricted pages.
---

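Run from the directory that contains this file: `./scripts/fetch_url.py URL` (only `http` / `https` are supported).  
Note: the script must be executed directly as an executable file.

Example invocation (do not use `uv run python` or `python`):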
```bash
cd skills/fetch-url && ./scripts/fetch_url.py https://example.com --output ./page.md
```
Incorrect examples:
```bash
uv run python skills/fetch-url/scripts/fetch_url.py https://example.com --output ./page.md
python skills/fetch-url/scripts/fetch_url.py https://example.com --output ./page.md
```
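
Because the script has to be run directly, a pre-flight check can catch a missing executable bit before the first fetch. The helper below is a minimal sketch; `check_executable` is a hypothetical name and is not part of the skill.

```bash
# Hypothetical helper (not part of the skill): confirm a script can be run
# directly. Prints "ok" when the file exists and is executable; otherwise
# prints a hint and returns non-zero.
check_executable() {
  local script="$1"
  if [ ! -f "$script" ]; then
    echo "missing: $script"
    return 1
  fi
  if [ ! -x "$script" ]; then
    echo "not executable: run chmod +x $script"
    return 1
  fi
  echo "ok"
}

# Usage, from the skill directory:
# check_executable ./scripts/fetch_url.py && ./scripts/fetch_url.py https://example.com
```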

By default the script auto-detects the path of a local Chromium-based browser; if none is found, install the Playwright browser:

```bash
uv run playwright install chromium
```
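
To check in advance whether a Chromium-based browser is on `PATH`, something like the following may help. It is only a sketch: `find_chromium` is a hypothetical helper, and the candidate binary names in the usage comment are assumptions, not the list that `fetch_url.py` actually probes.

```bash
# Hypothetical helper: print the path of the first candidate command that
# exists on PATH, or return non-zero when none is found.
find_chromium() {
  local candidate
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      command -v "$candidate"
      return 0
    fi
  done
  return 1
}

# Usage (candidate names are assumptions; adjust for your system):
# browser="$(find_chromium chromium chromium-browser google-chrome)" \
#   && ./scripts/fetch_url.py https://example.com --browser-path "$browser"
```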

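Parameters:
- `--output`: write the output to a file (default: stdout).
- `--timeout-ms`: Playwright navigation timeout (milliseconds, default 60000).
- `--browser-path`: path to a local Chromium-based browser (default: auto-detected).
- `--output-format`: output format (default `markdown`); supports `csv`, `html`, `json`, `markdown`, `raw-html`, `txt`, `xml`, `xmltei`. `raw-html` outputs the rendered HTML directly, bypassing trafilatura.
- `--fetch-strategy`: available only with `markdown` output; supports `auto`, `agent`, `jina`, `browser`. Default: `auto`.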

Common `--fetch-strategy` values:
- `auto`: the default choice.
- `agent`: prefer Markdown content negotiation with the origin site.
- `jina`: prefer Jina Reader.
- `browser`: use local Playwright directly.
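
The option rules above can be checked before invoking the script. The function below is a sketch under one assumption: since `auto` is the default, it is treated as acceptable with every output format, while the other strategies require `markdown`. `validate_fetch_args` is a hypothetical name, not something the skill provides.

```bash
# Hypothetical pre-flight check mirroring the documented option rules.
# Assumption: "auto" (the default) is accepted with any output format.
validate_fetch_args() {
  local format="$1" strategy="$2"
  case "$format" in
    csv|html|json|markdown|raw-html|txt|xml|xmltei) ;;
    *) echo "unsupported format: $format"; return 1 ;;
  esac
  case "$strategy" in
    auto|agent|jina|browser) ;;
    *) echo "unsupported strategy: $strategy"; return 1 ;;
  esac
  if [ "$format" != "markdown" ] && [ "$strategy" != "auto" ]; then
    echo "--fetch-strategy $strategy requires --output-format markdown"
    return 1
  fi
  echo "ok"
}

# Usage:
# validate_fetch_args markdown jina \
#   && ./scripts/fetch_url.py https://example.com --fetch-strategy jina
```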

Environment variables:
- Set `JINA_API_KEY` to raise the Jina Reader rate limit: `JINA_API_KEY=your-token ./scripts/fetch_url.py ...`

Examples:

```bash
./scripts/fetch_url.py https://example.com --output ./page.md --timeout-ms 60000
./scripts/fetch_url.py https://example.com --fetch-strategy jina
JINA_API_KEY=your-token ./scripts/fetch_url.py https://example.com --fetch-strategy jina
./scripts/fetch_url.py https://example.com --fetch-strategy browser
./scripts/fetch_url.py https://x.com/jack/status/20 --output-format markdown
./scripts/fetch_url.py https://x.com/jack/status/20 --output-format markdown --fetch-strategy browser
```

Reference: [`scripts/fetch_url.py`](scripts/fetch_url.py)

Overview

The fetch-url skill fetches a web page and extracts its main text content (default output: Markdown). It includes special handling for X/Twitter URLs to improve retrieval success on restricted pages. The tool runs as a standalone executable script and supports multiple output formats and fetch strategies.

How this skill works

Run the script directly as an executable from its directory. Depending on the chosen fetch strategy, it prefers Markdown content negotiation with the origin site (agent), prefers Jina Reader (jina), or renders the page with local Playwright (browser), using either an auto-detected Chromium-based browser or one installed through Playwright. It then extracts the content and writes it to stdout or to a file in the requested format; the raw-html format bypasses trafilatura and emits the rendered HTML as-is.

When to use it

  • Quickly capture and save the main text of any http/https page as Markdown or other formats.
  • Fetch content from X/Twitter posts with improved handling for restricted or dynamically rendered pages.
  • Automate content collection for offline reading, archiving, or downstream NLP processing.
  • Prefer browser rendering for pages that require JS, or use Jina/agent modes for lighter, API-driven extraction.
  • Produce raw rendered HTML when you need the exact DOM output rather than cleaned text.

Best practices

  • Always run the script as an executable from its containing directory (do not invoke via python interpreter wrappers).
  • Install Playwright browsers if no local Chromium is detected (example: uv run playwright install chromium).
  • Set JINA_API_KEY in the environment when using the jina fetch strategy to raise the Jina Reader rate limit.
  • Choose --fetch-strategy based on the page (Markdown output only): browser for JS-heavy sites, agent for sites that can serve Markdown through content negotiation, jina for Jina Reader extraction, auto for sensible defaults.
  • Use --timeout-ms to increase navigation time for slow pages or complex renderings.
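
When the right strategy for a page is not known in advance, the choices above can be tried in an explicit order. This is a rough sketch, assuming the script exits non-zero when a fetch fails (the source does not state its exit codes). `fetch_with_fallback` and the `FETCH_URL_BIN` override are hypothetical names introduced for illustration.

```bash
# Hypothetical wrapper: try each strategy in order until one succeeds.
# Assumption: fetch_url.py returns a non-zero exit status on failure.
# FETCH_URL_BIN is an illustrative override, not a variable the skill reads.
fetch_with_fallback() {
  local url="$1"
  shift
  local fetch_bin="${FETCH_URL_BIN:-./scripts/fetch_url.py}" strategy
  for strategy in "$@"; do
    if "$fetch_bin" "$url" --fetch-strategy "$strategy"; then
      return 0
    fi
    echo "strategy $strategy failed, trying next" >&2
  done
  return 1
}

# Usage, from the skill directory:
# fetch_with_fallback https://example.com jina browser > page.md
```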

Example use cases

  • Save a long-form article as Markdown for offline reading: ./scripts/fetch_url.py https://example.com --output ./page.md
  • Extract a Twitter/X post as Markdown with browser rendering for dynamic content.
  • Batch-crawl and store HTML or CSV outputs for downstream data processing or ingestion.
  • Use the jina strategy with JINA_API_KEY set to extract through Jina Reader under a higher rate limit.
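
The script takes one URL per invocation, so batch collection needs a small loop around it. The sketch below reads URLs from stdin and writes one file per URL; the naming scheme, the function names `url_to_filename` and `fetch_batch`, and the `FETCH_URL_BIN` override are all assumptions made for illustration.

```bash
# Hypothetical naming scheme: turn a URL into a safe file name.
url_to_filename() {
  local url="$1" ext="${2:-md}"
  local name="${url#*://}"   # drop the scheme
  name="${name%/}"           # drop one trailing slash
  # Replace every character outside A-Z, a-z, 0-9, '.', '_' and '-' with '_'.
  name="$(printf '%s' "$name" | tr -c 'A-Za-z0-9._-' '_')"
  printf '%s.%s\n' "$name" "$ext"
}

# Hypothetical batch loop: one URL per line on stdin, blank lines skipped.
# FETCH_URL_BIN is an illustrative override, not a variable the skill reads.
fetch_batch() {
  local out_dir="$1" url
  local fetch_bin="${FETCH_URL_BIN:-./scripts/fetch_url.py}"
  mkdir -p "$out_dir"
  while IFS= read -r url; do
    [ -z "$url" ] && continue
    # stdin is redirected so the fetch command cannot consume the URL list.
    "$fetch_bin" "$url" --output "$out_dir/$(url_to_filename "$url")" < /dev/null \
      || echo "failed: $url" >&2
  done
}

# Usage, from the skill directory:
# fetch_batch ./out < urls.txt
```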

FAQ

How do I run the script?

Change to the fetch-url directory and run it directly as an executable: ./scripts/fetch_url.py https://example.com --output ./page.md. Do not call it via python or uv run python wrappers.

What if Chromium is not found on my system?

Install Playwright browsers (for example: uv run playwright install chromium) or provide a local Chromium path with --browser-path.