
news-articles-rename skill

/news-articles-rename

This skill renames news article files in the Vivien (PA)/News/ folder by extracting each article's headline via OCR and using it as the filename.

npx playbooks add skill christopheryeo/claude-skills --skill news-articles-rename

---
name: news-articles-rename
description: >
  OCR and rename news article files (PDFs and images) by extracting the article headline from
  the content and using it as the filename. Targets the Vivien (PA)/News/ folder. Use this skill
  whenever the user asks to rename news articles, organise news clippings, process newspaper
  PDFs, extract article titles from scanned pages, or tidy up the News folder. Also trigger when
  the user mentions "rename articles", "news folder", "article titles", or "news clippings".
---

# News Articles Rename

## Purpose

Newspaper articles saved as PDFs or images typically arrive with unhelpful filenames like
`Image 2026-02-25 15-52-38.pdf`. This skill extracts the main headline from each file using
OCR and renames it to `Article Title.pdf`, making the News folder instantly browsable.

## Target Folder

The default target is always:

```
Vivien (PA)/News/
```

## Supported File Types

Process any file with these extensions: `.pdf`, `.png`, `.jpg`, `.jpeg`

Skip hidden files (starting with `.`) and any file that doesn't match these extensions.
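
The filter above can be sketched in a few lines of Python. This is a minimal illustration, not the bundled script's code: the name `find_article_files` is hypothetical, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}

def find_article_files(folder):
    """Return supported, non-hidden files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file()
        and not p.name.startswith(".")               # skip hidden files
        and p.suffix.lower() in SUPPORTED_EXTENSIONS  # assumption: match case-insensitively
    )
```

Sorting keeps the processing order stable between runs, which makes the summary easier to compare.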

## How It Works

Run the bundled script `scripts/rename_articles.py`, which handles the full pipeline:

```bash
python3 <skill-path>/scripts/rename_articles.py "<news-folder-path>"
```

The script will:
1. Scan the folder for all supported files
2. For each PDF, extract the first page as an image (300 DPI); image files are used as they are
3. Run Tesseract OCR on the image
4. Identify the headline using heuristics (skip metadata, collect first substantial text block)
5. Apply common OCR corrections (e.g. "Al" → "AI")
6. Sanitise the headline for use as a filename
7. Rename the file, handling duplicates by appending a number
8. Print a summary table of old → new filenames
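
Steps 6 and 7 could look roughly like the sketch below. The function names, the set of characters treated as unsafe, the length cap, and the `" 2"`, `" 3"` suffix format are all assumptions made for illustration; the skill description only says that duplicates get a number appended.

```python
import re
from pathlib import Path

# Characters that are invalid or awkward in filenames on common filesystems (assumed set).
_UNSAFE = re.compile(r'[\\/:*?"<>|\x00-\x1f]')

def sanitise_headline(headline, max_length=120):
    """Turn an OCR headline into a filename-safe stem (no extension)."""
    cleaned = _UNSAFE.sub(" ", headline)
    cleaned = re.sub(r"\s+", " ", cleaned).strip(" .")
    return cleaned[:max_length].rstrip(" .")

def unique_target(folder, stem, suffix):
    """Return a path in `folder` that does not exist yet, appending ' 2', ' 3', ... if needed."""
    candidate = Path(folder) / f"{stem}{suffix}"
    counter = 2
    while candidate.exists():
        candidate = Path(folder) / f"{stem} {counter}{suffix}"
        counter += 1
    return candidate
```

A rename would then be `source.rename(unique_target(source.parent, sanitise_headline(headline), source.suffix))`, which keeps the original extension.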

The script also handles mounted filesystem lock issues automatically by copying files to a
temp directory for OCR processing when direct reads fail.
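
One way to express this fallback is shown below. It is a sketch under assumptions: `read_with_temp_fallback` and the `process` callback are hypothetical names, and treating any `OSError` as a lock problem is a simplification of whatever the script actually checks.

```python
import shutil
import tempfile
from pathlib import Path

def read_with_temp_fallback(path, process):
    """Call process(path); if it raises OSError, retry on a temporary local copy."""
    path = Path(path)
    try:
        return process(path)
    except OSError:
        with tempfile.TemporaryDirectory() as tmp:
            local_copy = Path(tmp) / path.name
            shutil.copyfile(path, local_copy)  # copy contents only, no metadata
            return process(local_copy)
```

Only the read is redirected to the copy; the rename itself still has to happen on the original file in the News folder.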

## After Running

Present the results as a clear summary table showing what was renamed:

| # | Original Filename | New Filename | Status |
|---|---|---|---|
| 1 | Image 2026-02-25 15-52-38.pdf | Headline Goes Here.pdf | ✅ Renamed |
| 2 | Image 2026-02-25 15-53-03.pdf | Another Article.pdf | ✅ Renamed |
| 3 | some-file.png | some-file.png | ⚠️ No title found |

Flag any file that could not be processed and explain why. Minor OCR artefacts in headlines
(e.g. misread characters) are expected from Tesseract and need no flag; flag a file only when
no headline could be extracted at all.
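
If the results are available as data, the table can be generated rather than typed by hand. The sketch below assumes a hypothetical list of `(old_name, new_name, renamed)` tuples; the script's real output format may differ.

```python
def format_summary(results):
    """Render (old_name, new_name, renamed) tuples as a Markdown table."""
    lines = [
        "| # | Original Filename | New Filename | Status |",
        "|---|---|---|---|",
    ]
    for index, (old_name, new_name, renamed) in enumerate(results, start=1):
        status = "✅ Renamed" if renamed else "⚠️ No title found"
        lines.append(f"| {index} | {old_name} | {new_name} | {status} |")
    return "\n".join(lines)
```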

## Important Notes

- Always process **every** file in the folder. Do not leave any file out.
- If OCR is uncertain about a headline, prefer keeping the original name over guessing wrong.
- The script handles both single-page and multi-page PDFs; only the first page is used for title extraction.
- For image files (.png, .jpg, .jpeg), OCR is run directly on the image.
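
The PDF and image paths described in the last two notes might be wired together as follows. This is a sketch, not the bundled script: it assumes the third-party packages `pdf2image`, `Pillow` and `pytesseract` (plus the Poppler and Tesseract binaries they wrap), and the script may rely on different libraries. `is_pdf` and `ocr_first_page` are hypothetical names.

```python
from pathlib import Path

def is_pdf(path):
    """True when the file must be rendered to an image before OCR."""
    return Path(path).suffix.lower() == ".pdf"

def ocr_first_page(path, dpi=300):
    """Return raw OCR text for the first page of a PDF, or for an image file."""
    # Imported lazily because these are assumed third-party dependencies.
    import pytesseract
    from PIL import Image

    if is_pdf(path):
        from pdf2image import convert_from_path
        # Render only page 1, at the 300 DPI mentioned in the pipeline steps.
        pages = convert_from_path(str(path), dpi=dpi, first_page=1, last_page=1)
        if not pages:
            return ""
        return pytesseract.image_to_string(pages[0])
    with Image.open(path) as image:
        return pytesseract.image_to_string(image)
```

Restricting rendering to the first page keeps memory use roughly constant regardless of how long the PDF is.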

Overview

This skill OCRs and renames news article files (PDFs and images) by extracting the article headline and using it as the filename. It targets the Vivien (PA)/News/ folder and converts unhelpful names like Image 2026-02-25.pdf into readable, headline-based filenames. The result makes the News folder immediately browsable and searchable.

How this skill works

The bundled script scans the target folder for supported files (.pdf, .png, .jpg, .jpeg) and skips hidden files. For each file it extracts the first page as an image (or uses the image directly), runs Tesseract OCR, and applies heuristics to locate the main headline. It cleans common OCR errors, sanitises the text for a safe filename, handles duplicates by appending numbers, and prints a concise summary of changes. If direct reads fail due to mounted filesystem locks, files are copied to a temp directory for processing.
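
To make the headline heuristic concrete, here is a minimal sketch of one possible approach. The metadata patterns, the thresholds `min_chars` and `max_lines`, and the function name `extract_headline` are assumptions; only the "Al" to "AI" correction comes from the skill description.

```python
import re

# Hypothetical correction table; only the "Al" -> "AI" entry is taken from the skill description.
OCR_CORRECTIONS = {r"\bAl\b": "AI"}

# Hypothetical patterns for lines that look like metadata rather than a headline:
# bylines, page numbers, URLs, and dates at the start of a line.
_METADATA = re.compile(
    r"^(by\s|page\s*\d+|https?://|www\.|\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})",
    re.IGNORECASE,
)

def extract_headline(ocr_text, min_chars=15, max_lines=3):
    """Return the first substantial block of non-metadata lines, or None."""
    block = []
    for raw in ocr_text.splitlines() + [""]:  # trailing blank flushes the last block
        line = raw.strip()
        if line and not _METADATA.match(line):
            block.append(line)
            if len(block) < max_lines:
                continue
        # A blank line, a metadata line, or max_lines ends the current block.
        candidate = " ".join(block)
        block = []
        if len(candidate) >= min_chars:
            for pattern, replacement in OCR_CORRECTIONS.items():
                candidate = re.sub(pattern, replacement, candidate)
            return candidate
    return None
```

Returning `None` when nothing substantial is found lets the caller keep the original filename instead of guessing. A blanket "Al" to "AI" rule will also rewrite the given name "Al", so a real correction table would need to be more careful.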

When to use it

  • You want to rename articles or tidy the News folder
  • You need to organise scanned newspaper clippings or PDFs
  • You want article titles extracted from the first page of scans
  • You need to make the Vivien (PA)/News/ folder browsable
  • You mention “rename articles”, “news folder”, “article titles”, or “news clippings”

Best practices

  • Run the script on the entire News folder so no files are skipped
  • Keep a backup of the folder before bulk renaming in case of unexpected results
  • Review the summary output and flagged files after processing
  • Prefer keeping original filenames if OCR confidence is low
  • Use the script on a machine with Tesseract installed and sufficient RAM for PDF rendering

Example use cases

  • Batch-renaming a week’s worth of scanned newspaper PDFs to their headlines
  • Cleaning up a folder of photographed clippings from a library research visit
  • Processing daily press clippings dropped into Vivien (PA)/News/ by colleagues
  • Converting messy camera-phone images of articles into searchable, headline-named files
  • Rapidly preparing a browsable archive of front-page articles for review

FAQ

What file types are supported?

PDF, PNG, JPG and JPEG files are supported. Hidden files and other extensions are skipped.

What happens if no headline can be found?

The script leaves the original filename and flags the file in the summary so you can review why OCR failed.