article-extractor skill

This skill extracts clean article text from a URL, removing clutter and saving readable content for offline use. It is most likely a fork of the article-extractor skill from michalparkola.

To add this skill to your agents, run:

```bash
npx playbooks add skill zpankz/mcp-skillset --skill article-extractor
```

The skill is defined in a single file, SKILL.md, reproduced below.
---
name: article-extractor
description: Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text. Use when user wants to download, extract, or save an article/blog post from a URL without ads, navigation, or clutter.
---

# Article Extractor

This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.

## When to Use This Skill

Activate when the user:
- Provides an article/blog URL and wants the text content
- Asks to "download this article"
- Wants to "extract the content from [URL]"
- Asks to "save this blog post as text"
- Needs clean article text without distractions

## How It Works

### Priority Order
1. **Check if tools are installed** (reader or trafilatura)
2. **Download and extract article** using best available tool
3. **Clean up the content** (remove extra whitespace, format properly)
4. **Save to file** with article title as filename
5. **Confirm location** and show preview

## Installation Check

Check for article extraction tools in this order:

### Option 1: reader (Recommended - Mozilla's Readability)

```bash
command -v reader
```

If not installed (npm package names vary; install whichever provides a `reader` command):
```bash
npm install -g @mozilla/readability-cli
# or
npm install -g reader-cli
```

### Option 2: trafilatura (Python-based, highly accurate)

```bash
command -v trafilatura
```

If not installed:
```bash
pip3 install trafilatura
```

### Option 3: Fallback (curl + simple parsing)

If neither tool is available, fall back to basic `curl` plus HTML parsing (less reliable, but dependency-free); see Method 3 below.

## Extraction Methods

### Method 1: Using reader (Best for most articles)

```bash
# Extract article
reader "URL" > article.txt
```

**Pros:**
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

### Method 2: Using trafilatura (Best for blogs/news)

```bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```

**Pros:**
- Very accurate extraction
- Good with various site structures
- Handles multiple languages

**Options** (combined in the sketch below):
- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)
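
For example, a sketch that favors precision on a noisy page, combining the flags listed above:

```bash
trafilatura --URL "URL" --output-format txt --no-comments --no-tables --precision > article.txt
```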

### Method 3: Fallback (curl + basic parsing)

```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.content = []
        self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.content_depth = 0  # nesting depth inside content tags
        self.skip_depth = 0     # nesting depth inside skipped tags

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.content_tags:
            self.content_depth += 1

    def handle_endtag(self, tag):
        # Close regions so text after </p> or </script> is classified correctly
        if tag in self.skip_tags and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.content_tags and self.content_depth > 0:
            self.content_depth -= 1

    def handle_data(self, data):
        # Keep text only inside content tags and outside skipped regions
        if self.content_depth > 0 and self.skip_depth == 0 and data.strip():
            self.content.append(data.strip())

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print('\n\n'.join(parser.content))
" > article.txt
```

**Note:** This is less reliable but works without dependencies.

## Getting Article Title

Extract title for filename:

### Using reader:
```bash
# reader outputs markdown with title at top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
```

### Using trafilatura:
```bash
# Get metadata including title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
```

### Using curl (fallback):
```bash
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```
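
`grep -oP` requires GNU grep with PCRE support; on systems without it (such as stock macOS), a `sed` equivalent works, under the same assumption that the `<title>` element sits on one line:

```bash
TITLE=$(curl -s "URL" | sed -n 's/.*<title[^>]*>\([^<]*\)<\/title>.*/\1/p' | head -n 1 | sed 's/ - .*//; s/ | .*//')
```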

## Filename Creation

Clean title for filesystem:

```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem: map / : | to '-', delete ? " < >,
# limit length, and trim trailing spaces
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```

## Complete Workflow

```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt

        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;

    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")

        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;

    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate)

        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}

    def handle_starttag(self, tag, attrs):
        if tag not in self.skip_tags:
            if tag in {'p', 'article', 'main'}:
                self.in_content = True
        if tag in {'h1', 'h2', 'h3'}:
            self.content.append('\n')

    def handle_data(self, data):
        if self.in_content and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename
FILENAME=$(echo "$TITLE" | tr '/' '-' | tr ':' '-' | tr '?' '' | tr '"' '' | tr '<>' '' | tr '|' '-' | cut -c 1-80 | sed 's/ *$//' | sed 's/^ *//')
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```

## Error Handling

### Common Issues

**1. Tool not installed**
- Try the next tool in the chain (reader → trafilatura → fallback); see the sketch after this list
- Offer to install: "Install reader with: npm install -g reader-cli"

**2. Paywall or login required**
- Extraction tools may fail
- Inform user: "This article requires authentication. Cannot extract."

**3. Invalid URL**
- Check URL format
- Retry with redirects followed (e.g. `curl -L`)

**4. No content extracted**
- Site may use heavy JavaScript
- Try fallback method
- Inform user if extraction fails

**5. Special characters in title**
- Clean title for filesystem
- Remove: `/`, `:`, `?`, `"`, `<`, `>`, `|`
- Replace with `-` or remove
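
A minimal sketch of the retry chain, covering issues 1 and 4: fall through the tools in priority order and treat empty output as a failed extraction (filenames and messages are illustrative):

```bash
URL="https://example.com/article"

extract() {
    # Try each tool in turn; an empty result counts as failure (issue 4)
    if command -v reader &> /dev/null; then
        reader "$URL" > temp_article.txt 2>/dev/null
        [ -s temp_article.txt ] && return 0
    fi
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp_article.txt 2>/dev/null
        [ -s temp_article.txt ] && return 0
    fi
    return 1
}

if ! extract; then
    echo "Could not extract article. The page may be paywalled, JavaScript-heavy, or the URL invalid."
    exit 1
fi
```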

## Output Format

### Saved File Contains:
- Article title (if available)
- Author (if available from tool)
- Main article text
- Section headings
- No navigation, ads, or clutter

### What Gets Removed:
- Navigation menus
- Ads and promotional content
- Newsletter signup forms
- Related articles sidebars
- Comment sections (optional)
- Social media buttons
- Cookie notices

## Tips for Best Results

**1. Use reader for most articles**
- Best all-around tool
- Based on Firefox Reader View
- Works on most news sites and blogs

**2. Use trafilatura for:**
- Academic articles
- News sites
- Blogs with complex layouts
- Non-English content

**3. Fallback method limitations:**
- May include some noise
- Less accurate paragraph detection
- Better than nothing for simple sites

**4. Check extraction quality:**
- Always show preview to user
- Ask if it looks correct
- Offer to try a different tool if needed (see the sketch below)
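
One way to flag a suspect extraction before showing it (the 100-word threshold is an arbitrary, illustrative choice):

```bash
WORDS=$(wc -w < "$FILENAME")
if [ "$WORDS" -lt 100 ]; then
    echo "Warning: only $WORDS words extracted; the result may be incomplete."
    echo "Consider retrying with the other tool (reader or trafilatura)."
fi
```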

## Example Usage

**Simple extraction:**
```bash
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```

**With error handling:**
```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```

## Best Practices

- ✅ Always show preview after extraction (first 10 lines)
- ✅ Verify extraction succeeded before saving
- ✅ Clean filename for filesystem compatibility
- ✅ Try fallback method if primary fails
- ✅ Inform user which tool was used
- ✅ Keep filename length reasonable (< 100 chars)

## After Extraction

Display to user (see the sketch after this list):
1. "✓ Extracted: [Article Title]"
2. "✓ Saved to: [filename]"
3. Show preview (first 10-15 lines)
4. File size and location
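
A sketch of that confirmation output, using standard `du` and `head` (the exact wording is illustrative):

```bash
echo "✓ Extracted: $TITLE"
echo "✓ Saved to: $(pwd)/$FILENAME ($(du -h "$FILENAME" | cut -f1))"
echo ""
echo "Preview (first 15 lines):"
head -n 15 "$FILENAME"
```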

Ask if needed:
- "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
- "Should I extract another article?"
