ebook-extractor skill

This skill extracts plain text from EPUB, MOBI, and PDF ebooks for analysis or reading without relying on LLMs.

npx playbooks add skill ratacat/claude-skills --skill ebook-extractor

SKILL.md
---
name: ebook-extractor
description: Use when user wants to extract text from ebooks (EPUB, MOBI, PDF). Use for converting ebooks to plain text for analysis, processing, or reading. Handles all common ebook formats.
---

# Ebook Text Extractor

## Overview
Extract plain text from EPUB, MOBI, and PDF files using Python scripts. No LLM calls - pure text extraction.

## Supported Formats

| Format | Tool Used | Notes |
|--------|-----------|-------|
| EPUB | `ebooklib` + `BeautifulSoup` | Direct parsing, preserves structure |
| MOBI | Calibre `ebook-convert` | Converts to EPUB first, then extracts |
| PDF | `PyMuPDF` (fitz) | Fast, handles most PDFs well |

## Usage

**Unified extractor (auto-detects format):**
```bash
python3 ~/.claude/skills/ebook-extractor/scripts/extract.py /path/to/book.epub
python3 ~/.claude/skills/ebook-extractor/scripts/extract.py /path/to/book.mobi
python3 ~/.claude/skills/ebook-extractor/scripts/extract.py /path/to/book.pdf
```
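
The routing step might look like the following minimal sketch. It assumes detection is by file extension and that the unified script hands off to the format-specific scripts in `scripts/`; `HANDLERS` and `pick_handler` are hypothetical names, not taken from `extract.py`.

```python
from pathlib import Path

# Hypothetical mapping; the real extract.py may detect formats differently.
HANDLERS = {
    ".epub": "extract_epub.py",
    ".mobi": "extract_mobi.py",
    ".pdf": "extract_pdf.py",
}


def pick_handler(path: str) -> str:
    """Return the format-specific script name for an ebook path."""
    suffix = Path(path).suffix.lower()  # case-insensitive: .EPUB == .epub
    if suffix not in HANDLERS:
        raise ValueError(f"Unsupported format: {suffix or '(no extension)'}")
    return HANDLERS[suffix]


if __name__ == "__main__":
    for name in ("book.epub", "Novel.MOBI", "paper.pdf"):
        print(name, "->", pick_handler(name))
```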

**Output options:**
```bash
# To stdout (default)
python3 scripts/extract.py book.epub

# To file
python3 scripts/extract.py book.epub -o output.txt
python3 scripts/extract.py book.epub > output.txt
```

**Format-specific scripts:**
```bash
python3 scripts/extract_epub.py book.epub
python3 scripts/extract_mobi.py book.mobi
python3 scripts/extract_pdf.py book.pdf
```

## Setup

```bash
# One-command setup (installs all dependencies)
~/.claude/skills/ebook-extractor/setup.sh

# Or manually:
pip install -r ~/.claude/skills/ebook-extractor/requirements.txt
brew install calibre  # macOS, for MOBI support
```

## Script Location
`~/.claude/skills/ebook-extractor/scripts/`

## Common Issues

| Problem | Solution |
|---------|----------|
| Missing package | Run `setup.sh` or `pip install -r requirements.txt` |
| MOBI fails | Ensure Calibre is installed: `brew install calibre` |
| PDF empty or garbled | Image-based (scanned) PDFs have no embedded text; they need OCR, which is not supported |

Overview

This skill extracts plain text from EPUB, MOBI, and PDF ebooks so you can analyze, process, or read content in downstream tools. It is implemented in Python and performs local parsing and conversion, with no external LLM calls. The tool focuses on reliable structure-aware extraction for EPUB, conversion-based handling for MOBI, and fast PDF text scraping.

How this skill works

The extractor auto-detects the input format and routes files to format-specific handlers. EPUB files are parsed with ebooklib and cleaned with BeautifulSoup to preserve reading order. MOBI files are converted to EPUB using Calibre's ebook-convert, then parsed like EPUB. PDFs are read with PyMuPDF (fitz) to pull text blocks quickly; image-only PDFs are not OCRed.
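
As an illustration of the EPUB path, here is a minimal sketch of how ebooklib and BeautifulSoup can be combined to pull text in reading order. It is not the skill's actual extract_epub.py; the function names extract_epub_text and normalize_whitespace are hypothetical, and walking the spine is an assumption about how reading order is preserved.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Strip each line and collapse runs of blank lines into a single one."""
    lines = [line.strip() for line in text.splitlines()]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()


def extract_epub_text(path: str) -> str:
    """Return the text of an EPUB's documents in spine (reading) order."""
    # Third-party imports are local so normalize_whitespace works without them.
    import ebooklib
    from bs4 import BeautifulSoup
    from ebooklib import epub

    book = epub.read_epub(path)
    chapters = []
    for entry in book.spine:
        # Spine entries read from a file are (idref, linear) tuples.
        idref = entry[0] if isinstance(entry, tuple) else entry
        item = book.get_item_with_id(idref)
        if item is None or item.get_type() != ebooklib.ITEM_DOCUMENT:
            continue
        soup = BeautifulSoup(item.get_content(), "html.parser")
        chapters.append(soup.get_text("\n"))
    return normalize_whitespace("\n\n".join(chapters))
```

The whitespace pass is optional; it is included because HTML-to-text conversion tends to leave runs of blank lines between block elements.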

When to use it

  • Convert EPUB, MOBI, or PDF ebooks to plain text for NLP, indexing, or summarization.
  • Prepare books for readability in text-only environments or screen readers.
  • Preprocess ebook content for search, topic modeling, or citation extraction.
  • Batch-convert a library of ebooks into text files for archival or analysis.
  • Quickly inspect the textual content of an ebook without opening a reader app.

Best practices

  • Run the provided setup.sh or install requirements to ensure dependencies (ebooklib, BeautifulSoup, PyMuPDF) are present.
  • Install Calibre if you need MOBI support; MOBI files are converted to EPUB before extraction.
  • For large batches, write output to files (with -o or shell redirection) rather than printing to the console, to avoid console buffering issues.
  • If a PDF returns little or no text, or garbled text, check whether it is image-based (scanned); OCR is not included and must be run as a separate step.
  • Trim or post-process extracted text (remove headers/footers) depending on downstream needs.
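
The batch advice above can be sketched as a small driver that calls the unified extractor once per book and writes each result to its own file with -o. This is an assumption-laden sketch: plan_batch, run_batch, and the output naming scheme are hypothetical, and only the script path and the -o flag come from the usage notes.

```python
import subprocess
import sys
from pathlib import Path

# Script path as given in the usage section; adjust if installed elsewhere.
EXTRACT = Path.home() / ".claude/skills/ebook-extractor/scripts/extract.py"
SUPPORTED = {".epub", ".mobi", ".pdf"}


def plan_batch(library: Path, out_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each supported ebook under `library` with a .txt output path."""
    books = sorted(
        p for p in library.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
    # Keep the original extension in the name so book.epub and book.pdf
    # do not collide; same-named files in different folders still would.
    return [(book, out_dir / (book.name + ".txt")) for book in books]


def run_batch(library: Path, out_dir: Path) -> None:
    """Run the extractor on every planned book, reporting failures to stderr."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for book, target in plan_batch(library, out_dir):
        cmd = [sys.executable, str(EXTRACT), str(book), "-o", str(target)]
        if subprocess.run(cmd).returncode != 0:
            print(f"failed: {book}", file=sys.stderr)


if __name__ == "__main__" and len(sys.argv) == 3:
    run_batch(Path(sys.argv[1]), Path(sys.argv[2]))
```

Saved as, say, batch_extract.py (a hypothetical name), it would be run as `python3 batch_extract.py ~/Books ~/Books-txt`.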

Example use cases

  • Extract chapters from an EPUB to feed into a topic modeling pipeline.
  • Convert a collection of MOBI novels to text files for full-text search indexing.
  • Pull text from a PDF academic monograph to produce summaries or citation lists.
  • Create plain-text versions of books for accessibility tools or e-readers that prefer raw text.
  • Preprocess ebook text for training or evaluation of NLP models.

FAQ

Does this skill perform OCR on scanned PDFs?

No. The PDF extractor reads embedded text via PyMuPDF. Image-only PDFs contain no embedded text, so the output will be empty or nearly so; they require an external OCR step.
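
One way to spot such files before extraction is to check how much text PyMuPDF finds per page. The sketch below is a heuristic, not part of the skill: looks_image_based, pdf_page_texts, and the 20-character threshold are all assumptions chosen for illustration.

```python
def looks_image_based(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: True when pages average fewer than `min_chars_per_page` characters."""
    if not page_texts:
        return True  # nothing extractable at all
    total = sum(len(text.strip()) for text in page_texts)
    return total / len(page_texts) < min_chars_per_page


def pdf_page_texts(path: str) -> list[str]:
    """Return the embedded text of each page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the heuristic has no dependencies

    with fitz.open(path) as doc:
        return [page.get_text() for page in doc]
```

A caller could run looks_image_based(pdf_page_texts("book.pdf")) and route the file to an OCR tool when it returns True.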

How do I enable MOBI support?

Install Calibre and make sure ebook-convert is on your PATH. The script will convert MOBI to EPUB and then extract text.
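
The conversion step can be sketched as follows, assuming the script shells out to ebook-convert with an input and an output path (the basic form of that command). The names convert_command and mobi_to_epub are hypothetical, and the real extract_mobi.py may differ.

```python
import shutil
import subprocess
from pathlib import Path


def convert_command(mobi_path: str, out_dir: str) -> list[str]:
    """Build the `ebook-convert INPUT OUTPUT` command for a MOBI file."""
    src = Path(mobi_path)
    dst = Path(out_dir) / (src.stem + ".epub")  # output format follows the extension
    return ["ebook-convert", str(src), str(dst)]


def mobi_to_epub(mobi_path: str, out_dir: str) -> Path:
    """Convert a MOBI file to EPUB with Calibre and return the new file's path."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; install Calibre first")
    cmd = convert_command(mobi_path, out_dir)
    # check=True raises CalledProcessError if the conversion fails.
    subprocess.run(cmd, check=True, capture_output=True)
    return Path(cmd[-1])
```

Checking PATH up front turns a missing Calibre install into a clear error message instead of a FileNotFoundError from subprocess.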