
This skill helps you extract text and tables from PDF files, fill forms, and merge documents using Python libraries.

This is most likely a fork of the pdf-processing skill from dy9759
npx playbooks add skill krosebrook/source-of-truth-monorepo --skill pdf-processing

SKILL.md
---
name: PDF Processing
description: Extract text and tables from PDF files, fill forms, merge documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction.
---

# PDF Processing

## Quick start

Use pdfplumber to extract text from PDFs:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
```

## Extracting tables

Extract tables from PDFs with automatic detection:

```python
import pdfplumber

with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()

    for table in tables:
        for row in table:
            print(row)
```

## Extracting all pages

Process multi-page documents efficiently:

```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    full_text = ""
    for page in pdf.pages:
        # extract_text() may return None for pages with no text layer
        full_text += (page.extract_text() or "") + "\n\n"

    print(full_text)
```

## Form filling

For PDF form filling, see [FORMS.md](FORMS.md) for the complete guide including field analysis and validation.

## Merging PDFs

Combine multiple PDF files:

```python
from pypdf import PdfWriter

# PdfMerger is deprecated in recent pypdf releases; PdfWriter.append() replaces it
writer = PdfWriter()

for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)

writer.write("merged.pdf")
writer.close()
```

## Splitting PDFs

Extract specific pages or ranges:

```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

# Extract pages 2-5 (pypdf indexes pages from zero, so indices 1 through 4)
for page_num in range(1, 5):
    writer.add_page(reader.pages[page_num])

with open("output.pdf", "wb") as output:
    writer.write(output)
```
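
Because pypdf indexes pages from zero, a range given in human page numbers is easy to get off by one. The following is a minimal sketch of one way to do the conversion. `page_indices` and `split_pages` are hypothetical helper names, not part of pypdf.

```python
def page_indices(first_page, last_page, total_pages):
    """Convert an inclusive 1-based page range to zero-based indices."""
    if not 1 <= first_page <= last_page <= total_pages:
        raise ValueError(
            f"invalid range {first_page}-{last_page} for {total_pages} pages"
        )
    return range(first_page - 1, last_page)


def split_pages(src, dst, first_page, last_page):
    """Copy pages first_page..last_page (1-based, inclusive) from src to dst."""
    # Imported here so page_indices() stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in page_indices(first_page, last_page, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as output:
        writer.write(output)
```

With these helpers, `split_pages("input.pdf", "output.pdf", 2, 5)` would copy the same four pages as the loop above, and an out-of-range request fails early with a `ValueError` instead of an `IndexError` partway through.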

## Available packages

- **pdfplumber** - Text and table extraction (recommended)
- **pypdf** - PDF manipulation, merging, splitting
- **pdf2image** - Convert PDFs to images (requires poppler)
- **pytesseract** - OCR for scanned PDFs (requires tesseract)

## Common patterns

**Extract and save text:**
```python
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    text = "\n\n".join((page.extract_text() or "") for page in pdf.pages)

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

**Extract tables to CSV:**
```python
import pdfplumber
import csv

with pdfplumber.open("tables.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()

    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for table in tables:
            writer.writerows(table)
```

## Error handling

Handle common PDF issues:

```python
import pdfplumber

try:
    with pdfplumber.open("document.pdf") as pdf:
        if len(pdf.pages) == 0:
            print("PDF has no pages")
        else:
            text = pdf.pages[0].extract_text()
            if text is None or text.strip() == "":
                print("Page contains no extractable text (might be scanned)")
            else:
                print(text)
except Exception as e:
    print(f"Error processing PDF: {e}")
```

## Performance tips

- Process pages in batches for large PDFs
- Use multiprocessing for multiple files
- Extract only needed pages rather than entire document
- Close PDF objects after use
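
The first two tips can be combined roughly as follows. This is a sketch under assumptions, not a tuned implementation: `page_batches`, `extract_file_text`, and `extract_many` are hypothetical names, and the batch size of 20 is arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor


def page_batches(total_pages, batch_size):
    """Yield (start, stop) zero-based index pairs that cover total_pages."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def extract_file_text(path, batch_size=20):
    """Extract the text of one PDF, walking its pages batch by batch."""
    # Imported inside the function so each worker process loads it itself.
    import pdfplumber

    chunks = []
    with pdfplumber.open(path) as pdf:
        for start, stop in page_batches(len(pdf.pages), batch_size):
            for page in pdf.pages[start:stop]:
                chunks.append(page.extract_text() or "")
    return "\n\n".join(chunks)


def extract_many(paths, max_workers=None):
    """Process several PDFs in parallel, one file per worker task."""
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(paths, pool.map(extract_file_text, paths)))
```

On platforms that start worker processes by spawning (Windows, and macOS by default), call `extract_many([...])` from under an `if __name__ == "__main__":` guard.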

Overview

This skill extracts text and tables from PDFs, fills form fields, and merges or splits documents. It combines reliable tools for text/table extraction, OCR support for scanned pages, and utilities for PDF manipulation to make document workflows automatable. It is built for programmatic processing in Python-driven pipelines.

How this skill works

It uses pdfplumber for high-fidelity text and table extraction and pypdf for merging, splitting, and low-level PDF operations. For scanned or image-based PDFs it integrates pdf2image plus pytesseract for OCR. Typical flows read pages, extract text/tables, optionally run OCR, then write results or updated PDF files.
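
The flow described above can be sketched as follows. This is an illustration under assumptions rather than the skill's own code: `needs_ocr` and `extract_page_texts` are hypothetical names, the 10-character threshold and 300 dpi are arbitrary choices, and poppler and tesseract must be installed for the OCR branch to run.

```python
def needs_ocr(text, min_chars=10):
    """Treat a page as scanned when it yields almost no extractable text."""
    return text is None or len(text.strip()) < min_chars


def extract_page_texts(path, dpi=300):
    """Return one string per page, using OCR only where the text layer is empty."""
    # Third-party imports are local so needs_ocr() works without them.
    import pdfplumber
    import pytesseract
    from pdf2image import convert_from_path

    texts = []
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text()
            if needs_ocr(text):
                # Rasterize just this page (pdf2image page numbers are 1-based).
                images = convert_from_path(
                    path, dpi=dpi, first_page=number, last_page=number
                )
                text = pytesseract.image_to_string(images[0]) if images else ""
            texts.append(text or "")
    return texts
```

Rasterizing one page at a time keeps memory use bounded and avoids running OCR on pages that already have a text layer.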

When to use it

  • Extract structured text or tables from multi-page PDFs
  • Automate filling or validating PDF form fields
  • Merge multiple PDFs into a single document or split them into ranges
  • Process scanned documents requiring OCR
  • Preprocess PDFs for downstream parsing, search indexing, or data export

Best practices

  • Extract only the pages you need rather than whole documents to save memory and time
  • Batch pages and use multiprocessing for large datasets to improve throughput
  • Detect pages with no extractable text and send only those pages to OCR, so pages that already have a text layer are not OCRed needlessly
  • Close PDF objects explicitly and release file handles after operations
  • Validate form field names and types before writing to avoid corrupting PDFs
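
The last point could be checked along these lines. This is a minimal sketch and not the procedure from FORMS.md: `find_form_value_problems` and `read_field_types` are hypothetical names, and the rules shown (unknown names, signature fields, non-string text values) are only examples of checks you might want.

```python
def find_form_value_problems(field_types, values):
    """Return a list of problems; an empty list means the values look safe.

    field_types maps field name -> PDF field type ("/Tx", "/Btn", "/Ch", "/Sig").
    values maps field name -> the value you intend to write.
    """
    problems = []
    for name, value in values.items():
        if name not in field_types:
            problems.append(f"unknown field: {name}")
        elif field_types[name] == "/Sig":
            problems.append(f"signature field cannot be filled as text: {name}")
        elif field_types[name] == "/Tx" and not isinstance(value, str):
            problems.append(f"text field needs a string value: {name}")
    return problems


def read_field_types(path):
    """Map each form field name in the PDF to its /FT type entry."""
    # Imported here so the validator above has no third-party dependency.
    from pypdf import PdfReader

    fields = PdfReader(path).get_fields() or {}  # None when there is no form
    return {name: field.get("/FT") for name, field in fields.items()}
```

A caller would run `find_form_value_problems(read_field_types("form.pdf"), values)` and only write to the PDF when the returned list is empty.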

Example use cases

  • Convert multi-page reports to plain text or CSV tables for analytics
  • Merge quarterly invoices into a single file for archival and delivery
  • Split a large scanned dossier into client-specific PDF bundles
  • Fill and validate standardized form templates (invoices, applications) programmatically
  • Run OCR on scanned meeting notes and export searchable text for indexing

FAQ

What tools handle scanned PDFs?

Use pdf2image to rasterize pages and pytesseract for OCR; detect pages with no extractable text first to limit OCR work.

How do I get tables into CSV?

Extract tables with pdfplumber and write rows to CSV using the csv module or a DataFrame for further cleaning.
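
pdfplumber returns each table as a list of rows whose cells are strings or None, often with line breaks inside a cell. Below is a small standard-library sketch of a cleaning pass before writing CSV. `clean_table` and `table_to_csv_text` are hypothetical helper names.

```python
import csv
import io


def clean_table(table):
    """Normalize cells and drop rows that are entirely empty."""
    cleaned = []
    for row in table:
        # None becomes "", and runs of whitespace (including newlines) collapse.
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def table_to_csv_text(table):
    """Render one cleaned table as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(clean_table(table))
    return buffer.getvalue()
```

The input would typically be one element of `page.extract_tables()`.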

How can I merge or split files reliably?

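Use pypdf: PdfWriter.append() merges whole files, and PdfReader together with PdfWriter.add_page() extracts specific page ranges. Write the output to a new PDF file. PdfMerger is deprecated in recent pypdf releases in favor of PdfWriter.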