
article-extractor skill

/article-extractor

This skill extracts clean article content from URLs, saves it as markdown, and falls back to the Wayback Machine for dead or paywalled links.

npx playbooks add skill jrajasekera/claude-skills --skill article-extractor


SKILL.md
---
name: article-extractor
description: Extract clean article content from URLs and save as markdown. Triggers when user provides a webpage URL and wants to download it, extract content, get a clean version without ads, capture an article for offline reading, save an article, grab content from a page, archive a webpage, clip an article, or read something later. Handles blog posts, news articles, tutorials, documentation pages, and similar web content. Supports Wayback Machine for dead links or paywalled content. This skill handles the entire workflow - do NOT use web_fetch or other tools first, just call the extraction script directly with the URL.
---

# Article Extractor

Extract clean article content from URLs, removing ads, navigation, and clutter. Multi-tool fallback ensures reliability.

## Workflow

When user provides a URL to download/extract:
1. Call the extraction script directly with the URL (do NOT fetch the URL first with web_fetch)
2. Script handles fetching, extraction, and saving automatically
3. Returns clean markdown file with frontmatter

## Usage

```bash
# Basic extraction
scripts/extract-article.sh "https://example.com/article"

# Specify output location
scripts/extract-article.sh "https://example.com/article" -o my-article.md -d ~/Documents

# Try Wayback Machine if original fails
scripts/extract-article.sh "https://example.com/article" --wayback
```

Make script executable if needed: `chmod +x scripts/extract-article.sh`

## Key Options

- `-o <file>` - Output filename
- `-d <dir>` - Output directory
- `-w, --wayback` - Try Wayback Machine if extraction fails
- `-t <tool>` - Force tool: `jina`, `trafilatura`, `readability`, `fallback`
- `-q` - Quiet mode

For complete options, exit codes, tool details, and examples, see [references/tools-and-options.md](references/tools-and-options.md).

## Common Failures

- **Exit 3 (access denied)**: Paywall or login required - try `--wayback`
- **Exit 4 (no content)**: Heavy JavaScript - try a different tool with `-t`
- **Exit 2 (network)**: Connection issue - check URL
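
These failure modes lend themselves to scripting. A minimal sketch, where `suggest_retry` and the choice of `jina` as the alternate extractor are illustrative and not part of the skill:

```bash
# Map the skill's documented exit codes to a suggested retry flag.
# suggest_retry is a hypothetical helper, not part of the skill itself.
suggest_retry() {
  case "$1" in
    3) echo "--wayback" ;;   # access denied: try an archived copy
    4) echo "-t jina" ;;     # no content: force a different extractor
    *) echo "" ;;            # network or other error: fix the URL first
  esac
}
```

A wrapper can call the script, capture `$?`, and re-run with whatever `suggest_retry` returns before giving up.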

## Local Tools (Optional)

For offline extraction: `scripts/install-deps.sh`

Overview

This skill extracts clean article content from a webpage URL and saves it as a markdown file with frontmatter. It removes ads, navigation, and clutter so you get a reader-friendly version for offline reading, archiving, or publishing. The tool supports fallback extractors and Wayback Machine retrieval for dead or paywalled links.

How this skill works

Give the script a URL and it fetches the page, runs content-extraction tools, cleans HTML, and writes a markdown file with frontmatter. You must call the extraction script directly with the URL; the script handles fetching, extraction, and saving. It supports forcing a specific extractor, quiet mode, and an option to try the Wayback Machine when the original fails.

When to use it

  • Save a blog post, news article, or tutorial for offline reading
  • Archive a webpage or capture content before it changes
  • Extract documentation or long-form content for notes or publication
  • Grab content from pages behind mild paywalls using Wayback fallback
  • Prepare clean markdown for static sites or content pipelines

Best practices

  • Call the provided extraction script directly with the URL; do not prefetch the page with other tools
  • Set an explicit output path (-o and -d) to keep files organized
  • If extraction fails, try --wayback or change the extractor (-t) before troubleshooting networking
  • Make the script executable (chmod +x) and install optional local tools for best reliability
  • Use quiet mode (-q) in scripts or cron jobs to reduce logs
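
For unattended runs, quiet mode pairs naturally with cron. A sketch of a crontab entry; all paths and the schedule are placeholders:

```bash
# crontab entry: archive one page every Monday at 06:00, keeping errors in a log.
# Paths are placeholders; adjust to wherever the skill is installed.
0 6 * * 1 /path/to/scripts/extract-article.sh "https://example.com/docs" -d "$HOME/archives" -q >> "$HOME/archives/extract.log" 2>&1
```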

Example use cases

  • Download a news article and save as my-article.md for research notes
  • Archive an important tutorial before it changes, with --wayback as a fallback if the live page fails
  • Extract multiple blog posts into a content folder for reuse in a static site generator
  • Clip long-form interviews or op-eds for offline reading on a mobile device
  • Automate weekly saves of documentation pages for team knowledge backups
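
The batch use case above can be sketched as a small wrapper. `run_batch` is a hypothetical helper; it takes the extractor command as its first argument so the loop logic stays self-contained:

```bash
# Illustrative batch run: read URLs from a file, save each into an output
# directory, log failures instead of stopping, and report the failure count.
run_batch() {
  local cmd="$1" list="$2" outdir="$3"
  local failed=0
  while IFS= read -r url; do
    [ -z "$url" ] && continue   # skip blank lines
    "$cmd" "$url" -d "$outdir" -q || { echo "failed: $url" >&2; failed=$((failed + 1)); }
  done < "$list"
  echo "$failed"
}

# Example invocation (assumes a urls.txt with one URL per line):
# run_batch scripts/extract-article.sh urls.txt posts
```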

FAQ

What if the extractor reports access denied or a paywall?

Try the --wayback option to fetch an archived copy. If that fails, the page may require credentials or advanced bypasses that the extractor can’t handle.

Extraction returned no content, or the page is JavaScript-heavy. What now?

Retry with a different tool using -t (options like jina, trafilatura, readability, fallback). If network issues occurred, check the URL and your connection and consult the exit code for details.
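
One way to automate that retry, purely as a sketch: `try_tools` is a hypothetical helper that takes the extractor command as its first argument and walks the documented tool list until one succeeds.

```bash
# Try each extractor in turn; print the first tool that succeeds.
# The command is passed in as an argument so the loop is easy to swap out.
try_tools() {
  local cmd="$1" url="$2"
  local tool
  for tool in jina trafilatura readability fallback; do
    if "$cmd" "$url" -t "$tool" -q; then
      echo "$tool"
      return 0
    fi
  done
  return 1   # every extractor failed; check the URL and exit codes
}

# Example invocation:
# try_tools scripts/extract-article.sh "https://example.com/article"
```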