pdf-processing skill

/plugins/marketplaces/claude-code-templates/cli-tool/components/skills/document-processing/pdf-processing

This skill helps you extract text and tables from PDF files, fill forms, and merge documents using Python libraries.

This is most likely a fork of the pdf-processing skill from 89jobrien.

npx playbooks add skill krosebrook/source-of-truth-monorepo --skill pdf-processing

Review the files below or copy the command above to add this skill to your agents.

Files (2)

SKILL.md

3.1 KB

---
name: PDF Processing
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---

# PDF Processing

## Quick start

Use pdfplumber to extract text from PDFs:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
```

## Extracting tables

Extract tables from PDFs with automatic detection:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    for table in tables:
        for row in table:
            print(row)
```

## Extracting all pages

Process multi-page documents efficiently:

```python
import pdfplumber

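with pdfplumber.open("document.pdf") as pdf:
    # extract_text() can return None or "" for pages without a text layer
    page_texts = [page.extract_text() or "" for page in pdf.pages]

# Join once instead of concatenating strings inside the loop
full_text = "\n\n".join(page_texts)
print(full_text)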
```

## Form filling

For PDF form filling, see [FORMS.md](FORMS.md) for the complete guide including field analysis and validation.

## Merging PDFs

Combine multiple PDF files:

```python
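from pypdf import PdfWriter

# PdfMerger is deprecated in recent pypdf releases; PdfWriter.append() replaces it
writer = PdfWriter()

for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)  # appends every page of the file

writer.write("merged.pdf")
writer.close()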
```

## Splitting PDFs

Extract specific pages or ranges:

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Extract pages 2-5 (1-based); reader.pages is zero-indexed, so use indices 1-4
for page_num in range(1, 5):
    writer.add_page(reader.pages[page_num])

with open("output.pdf", "wb") as output:
    writer.write(output)
```
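
The loop above hard-codes zero-based indices, which is easy to get off by one. As a minimal sketch, a small helper can translate a human-style specification such as `"2-5,8"` into zero-based indices. `parse_page_ranges` is a hypothetical name introduced here for illustration; it is not part of pypdf.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Convert a 1-based spec like "2-5,8" into sorted zero-based page indices."""
    indices: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start  # a single page is a range of one
        if start < 1 or end < start or end > page_count:
            raise ValueError(f"Invalid page range {part!r} for a {page_count}-page PDF")
        indices.update(range(start - 1, end))  # shift to zero-based indices
    return sorted(indices)
```

With a `reader` and `writer` as in the example above, `for i in parse_page_ranges("2-5", len(reader.pages)): writer.add_page(reader.pages[i])` would select the same pages.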

## Available packages

- **pdfplumber** - Text and table extraction (recommended)
- **pypdf** - PDF manipulation, merging, splitting
- **pdf2image** - Convert PDFs to images (requires poppler)
- **pytesseract** - OCR for scanned PDFs (requires tesseract)

## Common patterns

**Extract and save text:**
```python
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    # Fall back to "" so pages without extractable text do not break join()
    text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

**Extract tables to CSV:**
```python
import pdfplumber
import csv

with pdfplumber.open("tables.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for table in tables:
            writer.writerows(table)
```

## Error handling

Handle common PDF issues:

```python
import pdfplumber

try:
    with pdfplumber.open("document.pdf") as pdf:
        if len(pdf.pages) == 0:
            print("PDF has no pages")
        else:
            text = pdf.pages[0].extract_text()
            if text is None or text.strip() == "":
                print("Page contains no extractable text (might be scanned)")
            else:
                print(text)
except Exception as e:
    print(f"Error processing PDF: {e}")
```

## Performance tips

- Process pages in batches for large PDFs
- Use multiprocessing for multiple files
- Extract only needed pages rather than entire document
- Close PDF objects after use
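
The first two tips can be combined roughly as follows. This is a sketch under assumptions, not a tuned implementation: `page_batches`, `extract_to_text_file`, and `extract_many` are hypothetical helper names, the batch size of 20 is arbitrary, and each worker writes its text next to the source PDF so that only one batch of text is held in memory at a time.

```python
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def page_batches(page_count, batch_size):
    """Yield ranges of zero-based page indices, batch_size pages at a time."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def extract_to_text_file(pdf_path, batch_size=20):
    """Worker: write the text of one PDF to a .txt file, one batch at a time."""
    import pdfplumber  # imported in the worker, so each process loads it itself

    out_path = Path(pdf_path).with_suffix(".txt")
    with pdfplumber.open(pdf_path) as pdf, open(out_path, "w", encoding="utf-8") as out:
        for batch in page_batches(len(pdf.pages), batch_size):
            texts = [pdf.pages[i].extract_text() or "" for i in batch]
            out.write("\n\n".join(texts) + "\n\n")
    return str(out_path)


def extract_many(pdf_paths, max_workers=None):
    """Process several PDFs in parallel, one file per worker process."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_to_text_file, pdf_paths))
```

Call `extract_many([...])` from under an `if __name__ == "__main__":` guard, since process pools re-import the main module on platforms that spawn workers.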

Overview

This skill extracts text and tables from PDFs, fills form fields, and merges or splits documents. It combines text and table extraction, OCR support for scanned pages, and PDF manipulation utilities so that document workflows can be automated. It is built for programmatic processing in Python-driven pipelines.

How this skill works

It uses pdfplumber for text and table extraction and pypdf for merging, splitting, and low-level PDF operations. For scanned or image-based PDFs, it relies on pdf2image plus pytesseract for OCR. A typical flow reads pages, extracts text and tables, optionally runs OCR, then writes the results or the updated PDF files.

When to use it

- Extract structured text or tables from multi-page PDFs
- Automate filling or validating PDF form fields
- Merge multiple PDFs into a single document or split them into ranges
- Process scanned documents requiring OCR
- Preprocess PDFs for downstream parsing, search indexing, or data export

Best practices

- Extract only the pages you need rather than whole documents to save memory and time
- Batch pages and use multiprocessing for large datasets to improve throughput
- Detect pages with no extractable text and route them to OCR to avoid wasted work
- Close PDF objects explicitly and release file handles after operations
- Validate form field names and types before writing to avoid corrupting PDFs
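
The last practice can be sketched with pypdf as below. This is an illustration under assumptions rather than the method from FORMS.md: `unknown_field_names` and `fill_form` are hypothetical helpers, only field names (not types) are checked, and all fields are assumed to sit on the first page.

```python
def unknown_field_names(available_names, values):
    """Return the keys of `values` that are not form fields in the PDF, sorted."""
    return sorted(set(values) - set(available_names))


def fill_form(template_path, output_path, values):
    """Fill fields on the first page; refuse to write if a field name is unknown."""
    # Imported here so the validation helper above works without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(template_path)
    fields = reader.get_fields() or {}  # get_fields() returns None when there is no form
    unknown = unknown_field_names(fields.keys(), values)
    if unknown:
        raise KeyError(f"Unknown form fields: {unknown}")

    writer = PdfWriter()
    writer.append(reader)  # copy the pages and the form definition into the writer
    writer.update_page_form_field_values(writer.pages[0], values, auto_regenerate=False)
    writer.write(output_path)
```

Checking names before calling `update_page_form_field_values` means a typo in a field name surfaces as an error instead of silently producing an unfilled form.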

Example use cases

- Convert multi-page reports to plain text or CSV tables for analytics
- Merge quarterly invoices into a single file for archival and delivery
- Split a large scanned dossier into client-specific PDF bundles
- Fill and validate standardized form templates (invoices, applications) programmatically
- Run OCR on scanned meeting notes and export searchable text for indexing

FAQ

What tools handle scanned PDFs?

Use pdf2image to rasterize pages and pytesseract for OCR; detect pages with no extractable text first to limit OCR work.
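
A minimal sketch of that routing, assuming poppler and tesseract are installed on the system: `needs_ocr` and `extract_text_with_ocr_fallback` are hypothetical helper names, and the 300 DPI default is an assumption rather than a recommendation from the skill.

```python
def needs_ocr(text):
    """True when a page produced no usable text layer."""
    return text is None or not text.strip()


def extract_text_with_ocr_fallback(pdf_path, dpi=300):
    """Return one string per page, running OCR only on pages without text."""
    # Optional, heavy dependencies are imported only when the function is called.
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    page_texts = []
    with pdfplumber.open(pdf_path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if needs_ocr(text):
                # Rasterize only this page; first_page/last_page are 1-based.
                images = convert_from_path(
                    pdf_path, dpi=dpi, first_page=number, last_page=number
                )
                text = pytesseract.image_to_string(images[0])
            page_texts.append(text)
    return page_texts
```

Rasterizing one page at a time keeps memory use bounded, at the cost of invoking poppler once per scanned page.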

How do I get tables into CSV?

Extract tables with pdfplumber and write rows to CSV using the csv module or a DataFrame for further cleaning.
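
Tables returned by pdfplumber are lists of rows in which empty cells can be `None` and multi-line cells can contain newlines. The sketch below shows one possible cleanup step before writing CSV, using only the standard library; `clean_table` and `table_to_csv_text` are hypothetical helper names.

```python
import csv
import io


def clean_table(table):
    """Replace None with "" and collapse whitespace (including newlines) per cell."""
    return [[" ".join((cell or "").split()) for cell in row] for row in table]


def table_to_csv_text(table):
    """Serialize a cleaned table to CSV text."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(clean_table(table))
    return out.getvalue()
```

For example, `table_to_csv_text(page.extract_tables()[0])` would give the CSV text for the first table on a pdfplumber page, provided the page contains at least one table.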

How can I merge or split files reliably?

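Use pypdf: PdfWriter.append() combines files, and PdfReader with PdfWriter.add_page() extracts specific page ranges. Write the result to a new PDF file rather than overwriting the input. Older pypdf releases also provide PdfMerger, which is deprecated in favor of PdfWriter.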