
iyeque-pdf-reader skill


This skill helps you extract text from PDFs, search content, generate summaries, and retrieve metadata.

npx playbooks add skill openclaw/skills --skill iyeque-pdf-reader


SKILL.md
---
name: pdf-reader
description: Extract text, search inside PDFs, and produce summaries.
homepage: "https://pymupdf.readthedocs.io"
metadata:
  {
    "openclaw":
      {
        "emoji": "📄",
        "requires": { "bins": ["python3"], "pip": ["PyMuPDF"] },
        "install":
          [
            {
              "id": "pymupdf",
              "kind": "pip",
              "package": "PyMuPDF",
              "label": "Install PyMuPDF",
            },
          ],
        "version": "1.1.0",
      },
  }
---

# PDF Reader Skill

The `pdf-reader` skill extracts text and retrieves metadata from PDF files using PyMuPDF (importable as `pymupdf`, or as `fitz` in older releases).

## Tool API

The skill provides two commands:

### extract
Extracts plain text from the specified PDF file.

- **Parameters:**
  - `file_path` (string, required): Path to the PDF file to extract text from.
  - `--max_pages` (integer, optional): Maximum number of pages to extract.

**Usage:**
```bash
python3 skills/pdf-reader/reader.py extract /path/to/document.pdf
python3 skills/pdf-reader/reader.py extract /path/to/document.pdf --max_pages 5
```

**Output:** Plain text content from the PDF.
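
The behaviour of `extract` can be approximated directly with PyMuPDF. The following is a minimal sketch under stated assumptions, not the skill's actual `reader.py`: the function names `pages_to_read` and `extract_text` are hypothetical, and it assumes a PyMuPDF release that supports `import pymupdf`.

```python
from typing import Optional


def pages_to_read(page_count: int, max_pages: Optional[int] = None) -> range:
    """Return the page indices to extract, honouring an optional page limit."""
    if max_pages is None:
        return range(page_count)
    if max_pages < 0:
        raise ValueError("max_pages must be non-negative")
    return range(min(page_count, max_pages))


def extract_text(file_path: str, max_pages: Optional[int] = None) -> str:
    """Extract plain text from a PDF (hypothetical helper, not the skill's code)."""
    # Imported lazily so the page-limit logic can be used without PyMuPDF installed.
    import pymupdf

    with pymupdf.open(file_path) as doc:
        indices = pages_to_read(doc.page_count, max_pages)
        # get_text() with no arguments returns the page's plain text.
        return "\n".join(doc.load_page(i).get_text() for i in indices)
```

Keeping the page-limit logic in its own function makes the `--max_pages` behaviour easy to check without opening a real PDF.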

### metadata
Retrieves metadata about the document.

- **Parameters:**
  - `file_path` (string, required): Path to the PDF file.

**Usage:**
```bash
python3 skills/pdf-reader/reader.py metadata /path/to/document.pdf
```

**Output:** JSON object with PDF metadata including:
- `title`: Document title
- `author`: Document author
- `subject`: Document subject
- `creator`: Application that created the PDF
- `producer`: PDF producer
- `creationDate`: Creation date
- `modDate`: Modification date
- `format`: PDF format version
- `encryption`: Encryption info (if any)
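
A JSON object with the keys listed above could be produced roughly as follows. This is a sketch, assuming the metadata comes from PyMuPDF's `Document.metadata` dictionary; `DOCUMENTED_KEYS`, `select_metadata`, and `read_metadata` are hypothetical names, and the skill's own implementation may differ.

```python
import json

# Keys listed in the skill's documentation for the metadata command.
DOCUMENTED_KEYS = (
    "title", "author", "subject", "creator", "producer",
    "creationDate", "modDate", "format", "encryption",
)


def select_metadata(raw):
    """Keep only the documented keys; entries missing from raw become None."""
    raw = raw or {}
    return {key: raw.get(key) for key in DOCUMENTED_KEYS}


def read_metadata(file_path):
    """Return the documented metadata fields as a JSON string (hypothetical helper)."""
    # Imported lazily so select_metadata can be used without PyMuPDF installed.
    import pymupdf

    with pymupdf.open(file_path) as doc:
        return json.dumps(select_metadata(doc.metadata), indent=2)
```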

## Implementation Notes

- Uses **PyMuPDF** (imported as `pymupdf`) for fast, reliable PDF processing
- Opens encrypted PDFs that need no password; returns an error when a password is required
- Handles large PDFs efficiently with `max_pages` option
- Returns structured JSON for metadata command

## Example

```bash
# Extract text from first 3 pages
python3 skills/pdf-reader/reader.py extract report.pdf --max_pages 3

# Get document metadata
python3 skills/pdf-reader/reader.py metadata report.pdf
# Output:
# {
#   "title": "Annual Report 2024",
#   "author": "John Doe",
#   "creationDate": "D:20240115120000",
#   ...
# }
```
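
The `creationDate` value above is in the PDF date format (`D:YYYYMMDDHHmmSS`, optionally followed by a timezone offset). If a regular `datetime` is needed downstream, it can be converted with a small helper. The sketch below is illustrative only; `parse_pdf_date` is a hypothetical name, not part of the skill, and it rejects non-standard variants that some PDF producers write.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz])|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> datetime:
    """Convert a PDF date string into a datetime (naive if no offset is given)."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    year, month, day, hour, minute, second, zulu, sign, off_h, off_m = match.groups()

    tz = None
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(offset if sign == "+" else -offset)

    # Omitted components fall back to the earliest valid value.
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=tz,
    )
```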

## Error Handling

- Returns an error message if the file is not found or is not a valid PDF
- Returns an error if the PDF is encrypted and requires a password
- Gracefully handles corrupted or malformed PDFs
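
The checks listed above could be ordered roughly as in the sketch below. It is an assumption about how such checks might be arranged, not the skill's actual code: `check_pdf` and the shape of its return value are hypothetical, and PyMuPDF errors are caught broadly rather than by specific exception class.

```python
import os


def check_pdf(file_path: str) -> dict:
    """Return {"ok": True, ...} or {"ok": False, "error": "..."} for a PDF path."""
    if not os.path.isfile(file_path):
        return {"ok": False, "error": f"File not found: {file_path}"}

    # Imported lazily; only needed once the file is known to exist.
    import pymupdf

    try:
        doc = pymupdf.open(file_path)
    except Exception as exc:  # unreadable, corrupted, or unsupported input
        return {"ok": False, "error": f"Not a valid PDF: {exc}"}

    with doc:
        # PyMuPDF can open other document types too, so confirm this is a PDF.
        if not doc.is_pdf:
            return {"ok": False, "error": "Not a valid PDF: unsupported document type"}
        if doc.needs_pass:
            return {"ok": False, "error": "PDF is encrypted and requires a password"}
        return {"ok": True, "page_count": doc.page_count}
```

Checking for the file before calling PyMuPDF keeps the "file not found" message distinct from parsing errors.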

Overview

This skill extracts text, searches inside PDFs, produces summaries, and retrieves basic document metadata. It is designed for quick inspection of PDF content and to surface key information without manual reading. The focus is on practical, scriptable tools that integrate with existing Python workflows.

How this skill works

The skill reads PDF files and extracts plain text using a PDF parsing library. It supports four operations: full or partial extraction (limited by page count); simple case-insensitive search that returns matching lines; chunk-based summarization, which divides long text into manageable pieces and sends each chunk to an LLM; and a metadata reader that returns title, author, and page count. The functions are lightweight and intended for programmatic use in pipelines or small utilities.
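
The case-insensitive line search described above can be sketched over already-extracted text. This is an illustrative example only; `search_lines` is a hypothetical name, and returning 1-based line numbers alongside each match is an assumption rather than documented behaviour.

```python
from typing import List, Tuple


def search_lines(text: str, query: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing query, ignoring case."""
    needle = query.casefold()
    if not needle:
        return []
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.casefold()
    ]


sample = "Termination Clause\nPayment terms\nEarly TERMINATION fee"
for number, line in search_lines(sample, "termination"):
    print(f"{number}: {line}")
```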

When to use it

  • Quickly pull raw text from a PDF for downstream processing or indexing.
  • Find occurrences of a keyword or phrase across large documents.
  • Generate concise, multi-chunk summaries of long reports or papers.
  • Extract basic metadata for cataloging or archival systems.
  • Preprocess PDFs before feeding content to search indexes or LLMs.

Best practices

  • Test extraction on representative PDFs because layout and embedded fonts affect results.
  • Limit extraction to relevant pages when performance or token limits matter.
  • Use case-insensitive queries and normalize text for more robust search results.
  • Adjust chunk size to match the LLM input limits and expected summary granularity.
  • Validate summaries against source text for accuracy, especially with technical content.

Example use cases

  • Archive ingestion: extract text and metadata to populate a searchable archive index.
  • Research review: summarize long academic papers or whitepapers into digestible points.
  • Compliance checks: search contracts and reports for specific clauses or terms.
  • Automation pipelines: preprocess PDFs before NLP tasks like entity extraction.
  • Content audit: quickly surface titles, authors, and page counts for large PDF batches.

FAQ

What formats of PDFs work best?

Digitally generated PDFs with selectable text produce the most reliable results; scanned images may require OCR before extraction.

How are summaries generated?

Text is chunked into manageable segments and each chunk is summarized by a language model; chunk size should be tuned for the model's token limits.
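
As a rough illustration of the chunking step, the sketch below splits text on whitespace into pieces no longer than a character limit and passes each piece to a caller-supplied function. The names `chunk_text` and `summarize_document` are hypothetical, the `summarize` callable stands in for whatever LLM call is used, and the limit is measured in characters rather than tokens, which is only an approximation of a model's real input limit.

```python
from typing import Callable, List


def chunk_text(text: str, max_chars: int = 4000) -> List[str]:
    """Split text into chunks of at most max_chars, breaking on whitespace."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # A single word longer than the limit is hard-split.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def summarize_document(
    text: str, summarize: Callable[[str], str], max_chars: int = 4000
) -> List[str]:
    """Apply a caller-supplied summarizer to each chunk and return the results."""
    return [summarize(chunk) for chunk in chunk_text(text, max_chars)]
```

Passing the summarizer in as a function keeps the chunking logic independent of any particular model or API.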