pdf-processing skill

This skill helps you extract text and tables from PDFs, fill forms, and merge or split documents using Python libraries.

npx playbooks add skill 89jobrien/steve --skill pdf-processing

---
name: PDF Processing
description: Extract text and tables from PDF files, fill forms, merge documents.
  Use when working with PDF files or when the user mentions PDFs, forms, or document
  extraction.
author: Joseph OBrien
status: unpublished
updated: '2025-12-23'
version: 1.0.1
tag: skill
type: skill
---

# PDF Processing

## Quick start

Use pdfplumber to extract text from PDFs:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
```

## Extracting tables

Extract tables from PDFs with automatic detection:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    for table in tables:
        for row in table:
            print(row)
```
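Each table returned by `extract_tables()` is a list of rows, and each row is a list of cell strings in which empty cells appear as `None`. The sketch below shows one way to turn such a table into a list of dictionaries. It assumes the first row holds the column headers, which will not be true for every PDF. The `table_to_records` helper and the `sample` data are hypothetical and not part of pdfplumber.

```python
def table_to_records(table):
    """Turn one extracted table (a list of rows) into a list of dicts.

    Assumes the first row holds the column headers; None cells become "".
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Hypothetical sample shaped like one entry of page.extract_tables()
sample = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.99"],
    ["Gadget", None, "4.50"],
]
for record in table_to_records(sample):
    print(record)
```

In practice, pass each element of `page.extract_tables()` to the helper instead of `sample`.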

## Extracting all pages

Process multi-page documents efficiently:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    full_text = ""
    for page in pdf.pages:
        # extract_text() can return None or "" for image-only pages
        full_text += (page.extract_text() or "") + "\n\n"

    print(full_text)
```

## Form filling

For PDF form filling, see [FORMS.md](FORMS.md) for the complete guide including field analysis and validation.

## Merging PDFs

Combine multiple PDF files:

```python
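from pypdf import PdfWriter

# PdfMerger was deprecated in pypdf 3.x and removed in 5.0;
# PdfWriter.append() covers the same use case.
writer = PdfWriter()

for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)

writer.write("merged.pdf")
writer.close()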
```

## Splitting PDFs

Extract specific pages or ranges:

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Extract pages 2-5 (reader.pages is zero-indexed, so indices 1-4)
for page_num in range(1, 5):
    writer.add_page(reader.pages[page_num])

with open("output.pdf", "wb") as output:
    writer.write(output)
```
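The loop above hard-codes zero-based indices for a 1-based page range, which is an easy place for off-by-one mistakes. As a rough sketch, a small helper can translate a human-readable spec such as `"2-5,8"` into indices and validate it against the page count. The names `parse_page_ranges` and `split_pdf` and the spec format are assumptions for illustration, not part of pypdf.

```python
def parse_page_ranges(spec, total_pages):
    """Convert a 1-based spec such as "2-5,8" into zero-based page indices."""
    indices = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"Page range {part!r} is outside 1-{total_pages}")
        indices.extend(range(start - 1, end))
    return indices


def split_pdf(input_path, output_path, spec):
    """Write the pages named by spec to a new file."""
    # Imported here so the range helper can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(output_path, "wb") as output:
        writer.write(output)


print(parse_page_ranges("2-5", 10))  # [1, 2, 3, 4]
```

A call such as `split_pdf("input.pdf", "output.pdf", "2-5")` would then produce the same file as the loop above.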

## Available packages

- **pdfplumber** - Text and table extraction (recommended)
- **pypdf** - PDF manipulation, merging, splitting
- **pdf2image** - Convert PDFs to images (requires poppler)
- **pytesseract** - OCR for scanned PDFs (requires tesseract)
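The last two packages can be combined into an OCR fallback for pages where pdfplumber finds no text. The following is a minimal sketch, assuming poppler and tesseract are installed on the system. The helper names `needs_ocr` and `extract_text_with_ocr` and the default of 300 DPI are illustrative choices, not part of any of these libraries.

```python
def needs_ocr(text):
    """Return True when extraction produced no usable text."""
    return text is None or text.strip() == ""


def extract_text_with_ocr(path, page_number, dpi=300):
    """Return the text of one page (1-based), using OCR only if needed."""
    # Third-party imports are kept local; pdf2image needs poppler and
    # pytesseract needs the tesseract binary installed on the system.
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    with pdfplumber.open(path) as pdf:
        text = pdf.pages[page_number - 1].extract_text()
    if not needs_ocr(text):
        return text

    # Render just the requested page to an image, then run OCR on it.
    images = convert_from_path(
        path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0])
```

Rendering only the requested page keeps memory use low on long scanned documents. OCR quality depends heavily on scan resolution, so the DPI value may need tuning.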

## Common patterns

**Extract and save text:**

```python
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    # extract_text() can return None for image-only pages, so fall back to ""
    text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

**Extract tables to CSV:**

```python
import pdfplumber
import csv

with pdfplumber.open("tables.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

    with open("output.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for table in tables:
            writer.writerows(table)
```

## Error handling

Handle common PDF issues:

```python
import pdfplumber

try:
    with pdfplumber.open("document.pdf") as pdf:
        if len(pdf.pages) == 0:
            print("PDF has no pages")
        else:
            text = pdf.pages[0].extract_text()
            if text is None or text.strip() == "":
                print("Page contains no extractable text (might be scanned)")
            else:
                print(text)
except Exception as e:
    print(f"Error processing PDF: {e}")
```

## Performance tips

- Process pages in batches for large PDFs
- Use multiprocessing for multiple files
- Extract only needed pages rather than entire document
- Close PDF objects after use
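The first two tips can be sketched with the standard library alone. This example assumes one worker process per file is an acceptable unit of parallelism. `page_batches`, `extract_file`, and `extract_many` are hypothetical helper names, and suitable batch sizes and worker counts depend on the documents and the machine.

```python
from concurrent.futures import ProcessPoolExecutor


def page_batches(total_pages, batch_size):
    """Yield range objects covering 0..total_pages-1 in fixed-size batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield range(start, min(start + batch_size, total_pages))


def extract_file(path):
    """Worker: return (path, text) for one PDF."""
    import pdfplumber  # local import so each worker process loads it itself

    with pdfplumber.open(path) as pdf:
        text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)
    return path, text


def extract_many(paths, max_workers=None):
    """Extract text from several PDFs in parallel, one file per task."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(extract_file, paths))


if __name__ == "__main__":
    # Batches of zero-based page indices for a 7-page document
    print([list(batch) for batch in page_batches(7, 3)])
```

On platforms that start worker processes by spawning, `extract_many` should be called under the `if __name__ == "__main__":` guard.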

Overview

This skill extracts text and tables from PDF files, fills form fields, merges and splits documents, and converts pages to images for OCR. It relies on lightweight Python libraries to handle common PDF workflows reliably. Use it to automate document processing, data extraction, and batch transformations.

How this skill works

The skill uses pdfplumber for text and table extraction and pypdf for merging and splitting. For scanned or image-based PDFs it suggests converting pages with pdf2image and applying OCR with pytesseract. It includes patterns for reading all pages, extracting tables to CSV, filling form fields, and basic error handling and performance tips.

When to use it

  • You need to extract structured or unstructured text from digital PDFs
  • You must extract tables and export them to CSV or spreadsheets
  • You need to merge multiple PDFs or split large documents into parts
  • You must fill or validate PDF form fields programmatically
  • You need OCR for scanned documents or to convert pages to images

Best practices

  • Prefer pdfplumber for accurate text and table extraction from digital PDFs
  • Process only required pages to reduce memory and CPU use
  • Batch pages or use multiprocessing when handling many files
  • Fall back to pdf2image + pytesseract for scanned pages with no extractable text
  • Always close PDF objects to release file handles, and handle exceptions so one unreadable file does not stop a batch

Example use cases

  • Extract all text from a multi-page contract and save to a text file for searching
  • Detect and export invoice tables to CSV for financial ingestion
  • Merge multiple reports into a single PDF for archival or distribution
  • Split a large scanned document into chapter files and apply OCR to each page
  • Automatically fill and validate PDF forms (names, dates, checkboxes) before distribution

FAQ

What if extract_text() returns None or an empty string?

That usually means the page is scanned or contains only images, so there is no text layer to read. Convert the page to an image with pdf2image and run OCR with pytesseract.

Which library should I use for merging and splitting?

Use pypdf for reliable merging, splitting, and basic form field handling; combine it with pdfplumber for extraction tasks.