pdf-processing skill

This skill helps you extract text and tables from PDFs, merge documents, and automate form filling for efficient document processing.

This is most likely a fork of the pdf-processing skill from dy9759.

To add this skill to your agents, run:

npx playbooks add skill microck/ordinary-claude-skills --skill pdf-processing

SKILL.md
---
name: PDF Processing
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---

# PDF Processing

## Quick start

Use pdfplumber to extract text from PDFs:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
```

## Extracting tables

Extract tables from PDFs with automatic detection:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    for table in tables:
        for row in table:
            print(row)
```
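
Each table returned by `extract_tables()` is a list of rows, and empty cells come back as `None`. A minimal sketch of turning one table into records, assuming the first row is the header (the helper name `table_to_records` is hypothetical, not part of pdfplumber):

```python
def table_to_records(table):
    """Turn one extracted table into a list of dicts keyed by the header row."""
    if not table:
        return []
    # Assumption: the first row holds column names; blank names get a placeholder
    header = [(cell or "").strip() or f"col_{i}" for i, cell in enumerate(table[0])]
    records = []
    for row in table[1:]:
        # Skip rows where every cell is None or whitespace
        if not any(cell and cell.strip() for cell in row):
            continue
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


sample = [["Item", "Qty"], ["Widget", "2"], [None, None], ["Gadget", None]]
print(table_to_records(sample))
# [{'Item': 'Widget', 'Qty': '2'}, {'Item': 'Gadget', 'Qty': ''}]
```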

## Extracting all pages

Process multi-page documents efficiently:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    full_text = ""
    for page in pdf.pages:
        # extract_text() can return None or "" for pages without a text layer
        full_text += (page.extract_text() or "") + "\n\n"

    print(full_text)
```

## Form filling

For PDF form filling, see [FORMS.md](FORMS.md) for the complete guide including field analysis and validation.
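
FORMS.md remains the reference for this workflow. As a rough illustration only, field analysis and filling might look like the sketch below, assuming a recent pypdf release. The helper names `missing_fields` and `fill_form` are hypothetical and do not come from FORMS.md.

```python
def missing_fields(available, values):
    """Return the names in `values` that the form does not define."""
    return sorted(name for name in values if name not in available)


def fill_form(src_path, dst_path, values):
    """Fill text fields of a PDF form and write the result to dst_path."""
    # Imported lazily so missing_fields() can be used without pypdf installed
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    fields = reader.get_fields() or {}  # get_fields() returns None when there is no form

    # Validate before writing so a typo in a field name fails loudly
    unknown = missing_fields(fields, values)
    if unknown:
        raise KeyError(f"Fields not found in form: {unknown}")

    writer = PdfWriter()
    writer.append(reader)  # copy the pages and the form definition
    for page in writer.pages:
        writer.update_page_form_field_values(page, values, auto_regenerate=False)

    with open(dst_path, "wb") as output:
        writer.write(output)
```

Checking field names up front is a design choice in this sketch; without the check, a misspelled name would simply leave the field unfilled.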

## Merging PDFs

Combine multiple PDF files:

```python
from pypdf import PdfWriter

# PdfMerger is deprecated and was removed in pypdf 5.0; PdfWriter.append() replaces it
writer = PdfWriter()

for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)

writer.write("merged.pdf")
writer.close()
```

## Splitting PDFs

Extract specific pages or ranges:

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Extract pages 2-5 (reader.pages is zero-indexed, so indices 1-4)
for page_num in range(1, 5):
    writer.add_page(reader.pages[page_num])

with open("output.pdf", "wb") as output:
    writer.write(output)
```
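
Because `reader.pages` is zero-indexed while people usually write page numbers starting at 1, a small conversion helper can reduce off-by-one mistakes. This is a sketch under that assumption; `parse_page_ranges` and `split_pdf` are hypothetical names, and the `"2-5,8"` spec format is an illustration rather than anything pypdf defines.

```python
def parse_page_ranges(spec, total_pages):
    """Convert a 1-based spec such as "2-5,8" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, _, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"Invalid page range {part!r} for a {total_pages}-page PDF")
        # Pages start..end (1-based, inclusive) are indices start-1..end-1
        indices.extend(range(start - 1, end))
    return indices


def split_pdf(src_path, dst_path, spec):
    """Write the pages selected by `spec` to a new PDF."""
    # Imported lazily so parse_page_ranges() works without pypdf installed
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as output:
        writer.write(output)
```

For example, `split_pdf("input.pdf", "output.pdf", "2-5")` would select the same pages as the loop above.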

## Available packages

- **pdfplumber** - Text and table extraction (recommended)
- **pypdf** - PDF manipulation, merging, splitting
- **pdf2image** - Convert PDFs to images (requires poppler)
- **pytesseract** - OCR for scanned PDFs (requires tesseract)

## Common patterns

**Extract and save text:**
```python
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    # Fall back to "" so pages without extractable text do not break the join
    text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

**Extract tables to CSV:**
```python
import pdfplumber
import csv

with pdfplumber.open("tables.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for table in tables:
            writer.writerows(table)
```

## Error handling

Handle common PDF issues:

```python
import pdfplumber

try:
    with pdfplumber.open("document.pdf") as pdf:
        if len(pdf.pages) == 0:
            print("PDF has no pages")
        else:
            text = pdf.pages[0].extract_text()
            if text is None or text.strip() == "":
                print("Page contains no extractable text (might be scanned)")
            else:
                print(text)
except Exception as e:
    print(f"Error processing PDF: {e}")
```

## Performance tips

- Process pages in batches for large PDFs
- Use multiprocessing for multiple files
- Extract only needed pages rather than entire document
- Close PDF objects after use
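
The tips above can be combined roughly as follows. This is a sketch, not a tuned implementation: the function names, the batch size of 20, the worker count of 4, and the `.txt` output naming are all assumptions chosen for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def page_batches(total_pages, batch_size):
    """Yield (start, stop) zero-based index pairs that cover every page."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def extract_to_file(pdf_path, batch_size=20):
    """Write the text of one PDF to a .txt file next to it, batch by batch."""
    import pdfplumber  # imported lazily so page_batches() has no dependencies

    out_path = pdf_path + ".txt"
    # The with-statement closes both the PDF and the output file
    with pdfplumber.open(pdf_path) as pdf, open(out_path, "w", encoding="utf-8") as out:
        for start, stop in page_batches(len(pdf.pages), batch_size):
            batch = [page.extract_text() or "" for page in pdf.pages[start:stop]]
            out.write("\n\n".join(batch) + "\n\n")
    return out_path


def extract_many(pdf_paths, workers=4):
    """Process several PDFs in parallel, one file per worker process."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract_to_file, pdf_paths))
```

On platforms that start worker processes with `spawn` (Windows, and macOS by default), call `extract_many()` from under an `if __name__ == "__main__":` guard.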

Overview

This skill extracts text and tables from PDF files, fills PDF forms, and merges or splits documents. It provides practical code patterns and package recommendations for common PDF tasks. Use it to automate document processing, data extraction, and batch PDF manipulation.

How this skill works

The skill uses pdfplumber for text and table extraction and pypdf for merging and splitting. For scanned PDFs it recommends converting pages to images with pdf2image and applying OCR with pytesseract. It includes error handling patterns and performance tips for large or many files.

When to use it

  • You need to extract plain text from one or many PDFs.
  • You must detect and export tabular data from reports or invoices.
  • You want to fill or validate PDF form fields programmatically.
  • You need to merge multiple PDFs into one document or split a PDF by page ranges.
  • You must process scanned PDFs using OCR to get searchable text.

Best practices

  • Prefer pdfplumber for structured text and table extraction; it often preserves layout better than generic parsers.
  • For scanned or image-based PDFs, convert pages to images (pdf2image) and run OCR (pytesseract).
  • Process only needed pages when possible to save memory and time.
  • Batch pages or use multiprocessing when handling many files to improve throughput.
  • Always close PDF objects and handle exceptions to avoid file locks and partial outputs.

Example use cases

  • Extract all text from a multi-page contract and save as .txt for indexing.
  • Detect tables on financial reports and export them to CSV for analysis.
  • Merge multiple monthly statements into a single file to share with stakeholders.
  • Split a large PDF into chapters or extract page ranges for distribution.
  • Fill known PDF form fields (name, date, address) in bulk for automated document generation.

FAQ

Which Python libraries should I install first?

Start with pdfplumber for extraction and pypdf for merging/splitting. Add pdf2image and pytesseract only if you need OCR for scanned documents.

What if extract_text() returns None or empty strings?

That usually means the page is image-based. Convert the page to an image and run OCR, or check that the PDF has embedded text layers.
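
A minimal sketch of that fallback, assuming pdf2image (with poppler) and pytesseract (with tesseract) are installed. The helper names and the 300 DPI default are illustrative assumptions.

```python
def needs_ocr(text):
    """True when extract_text() produced nothing usable."""
    return text is None or not text.strip()


def page_text_with_ocr(pdf_path, page_number, dpi=300):
    """Return text for a 1-based page number, using OCR only when needed."""
    # Imported lazily so needs_ocr() works without the PDF libraries installed
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        text = pdf.pages[page_number - 1].extract_text()
    if not needs_ocr(text):
        return text

    from pdf2image import convert_from_path
    import pytesseract

    # Render only the requested page; first_page and last_page are 1-based
    images = convert_from_path(
        pdf_path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0])
```

Trying the embedded text layer first keeps OCR, which is slower and less accurate, for the pages that actually need it.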

How do I handle very large PDFs?

Process pages in batches, extract only required page ranges, and use multiprocessing for multiple files. Ensure you close PDF objects after each file.