This skill helps you extract text and tables, fill forms, merge, and split PDFs using Python libraries for efficient document processing.
npx playbooks add skill 89jobrien/steve --skill pdf-processing
---
name: PDF Processing
description: Extract text and tables from PDF files, fill forms, merge documents.
  Use when working with PDF files or when the user mentions PDFs, forms, or document
  extraction.
author: Joseph OBrien
status: unpublished
updated: '2025-12-23'
version: 1.0.1
tag: skill
type: skill
---
# PDF Processing
## Quick start
Use pdfplumber to extract text from PDFs:
```python
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print(text)
```
## Extracting tables
Extract tables from PDFs with automatic detection:
```python
import pdfplumber
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    tables = page.extract_tables()
    for table in tables:
        for row in table:
            print(row)
```
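Each table comes back as a list of rows, and empty cells are `None`. As a minimal sketch, assuming the first row of a table is its header, the rows can be turned into dictionaries for easier downstream use. The helper name `table_to_records` and the sample data are hypothetical, not part of pdfplumber:

```python
def table_to_records(table):
    """Turn one extracted table (a list of rows) into dicts keyed by the header row."""
    if not table:
        return []
    # Assumption: the first row holds the column names.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        # Empty cells arrive as None; normalise them to empty strings.
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Hypothetical sample shaped like one item from page.extract_tables().
sample = [["Name", "Qty"], ["Widget", "3"], ["Gadget", None]]
records = table_to_records(sample)
print(records)
```

Tables without a header row would need a different mapping, so check a few pages before relying on this shape.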
## Extracting all pages
Loop over every page to collect the text of a multi-page document:
```python
import pdfplumber
with pdfplumber.open("document.pdf") as pdf:
    full_text = ""
    for page in pdf.pages:
        # extract_text() can return None or "" for image-only pages
        full_text += (page.extract_text() or "") + "\n\n"
print(full_text)
```
## Form filling
For PDF form filling, see [FORMS.md](FORMS.md) for the complete guide including field analysis and validation.
## Merging PDFs
Combine multiple PDF files:
```python
# PdfMerger was removed in pypdf 5.0; PdfWriter.append() replaces it
from pypdf import PdfWriter
writer = PdfWriter()
for pdf in ["file1.pdf", "file2.pdf", "file3.pdf"]:
    writer.append(pdf)
writer.write("merged.pdf")
writer.close()
```
## Splitting PDFs
Extract specific pages or ranges:
```python
from pypdf import PdfReader, PdfWriter
reader = PdfReader("input.pdf")
writer = PdfWriter()
# Extract pages 2-5 (1-based); reader.pages is zero-indexed, so use indices 1-4
for page_num in range(1, 5):
    writer.add_page(reader.pages[page_num])
with open("output.pdf", "wb") as output:
    writer.write(output)
```
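The off-by-one between 1-based page numbers and zero-based indices is easy to get wrong when ranges come from user input. The sketch below shows one possible way to parse a spec such as `"2-5,8"` before splitting. The names `parse_page_spec` and `split_pdf` and the spec format are assumptions for illustration, not pypdf features:

```python
def parse_page_spec(spec, page_count):
    """Convert a 1-based spec such as "2-5,8" into sorted zero-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > page_count or start > end:
            raise ValueError(
                f"Invalid page range {part!r} for a {page_count}-page document"
            )
        # Shift from 1-based page numbers to zero-based indices.
        indices.update(range(start - 1, end))
    return sorted(indices)


def split_pdf(input_path, output_path, spec):
    """Write the pages selected by spec to a new file."""
    # Imported lazily so the parser above can be used without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    writer = PdfWriter()
    for index in parse_page_spec(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(output_path, "wb") as output:
        writer.write(output)


print(parse_page_spec("2-5,8", 10))  # [1, 2, 3, 4, 7]
```

Calling `split_pdf("input.pdf", "output.pdf", "2-5")` would then reproduce the example above.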
## Available packages
- **pdfplumber** - Text and table extraction (recommended)
- **pypdf** - PDF manipulation, merging, splitting
- **pdf2image** - Convert PDFs to images (requires poppler)
- **pytesseract** - OCR for scanned PDFs (requires tesseract)
## Common patterns
**Extract and save text:**
```python
import pdfplumber
with pdfplumber.open("input.pdf") as pdf:
    # Fall back to "" so pages without extractable text do not break the join
    text = "\n\n".join(page.extract_text() or "" for page in pdf.pages)
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
```
**Extract tables to CSV:**
```python
import pdfplumber
import csv
with pdfplumber.open("tables.pdf") as pdf:
    tables = pdf.pages[0].extract_tables()
with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for table in tables:
        writer.writerows(table)
```
## Error handling
Handle common PDF issues:
```python
import pdfplumber
try:
    with pdfplumber.open("document.pdf") as pdf:
        if len(pdf.pages) == 0:
            print("PDF has no pages")
        else:
            text = pdf.pages[0].extract_text()
            if text is None or text.strip() == "":
                print("Page contains no extractable text (might be scanned)")
            else:
                print(text)
except Exception as e:
    print(f"Error processing PDF: {e}")
```
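When a page turns out to have no extractable text, one option is to fall back to OCR with pdf2image and pytesseract. This is a rough sketch rather than a tuned pipeline: the function names, the `min_chars` threshold, and the 300 dpi default are assumptions, and it requires poppler and tesseract to be installed.

```python
def needs_ocr(text, min_chars=20):
    """Heuristic (an assumption): almost no extracted text suggests a scanned page."""
    return text is None or len(text.strip()) < min_chars


def extract_page_text(path, page_number, dpi=300):
    """Return text for a 1-based page number, using OCR only when needed."""
    # Libraries are imported lazily so needs_ocr() works without them installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        text = pdf.pages[page_number - 1].extract_text()
    if not needs_ocr(text):
        return text

    from pdf2image import convert_from_path
    import pytesseract

    # Render only the requested page to keep memory use down.
    images = convert_from_path(
        path, dpi=dpi, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0])
```

OCR output is noisier than embedded text, so it is worth keeping the direct extraction path as the default.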
## Performance tips
- Process pages in batches for large PDFs
- Use multiprocessing for multiple files
- Extract only needed pages rather than entire document
- Close PDF objects after use
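The first two tips can be sketched as follows: a generator that yields text one batch of pages at a time, so the caller can write it out before reading more, and a process pool that handles several files in parallel. The helper names, batch size, and worker count are assumptions chosen for illustration.

```python
from concurrent.futures import ProcessPoolExecutor


def batched_indices(page_count, batch_size):
    """Yield ranges of zero-based page indices, batch_size pages at a time."""
    for start in range(0, page_count, batch_size):
        yield range(start, min(start + batch_size, page_count))


def iter_text_batches(path, batch_size=25):
    """Yield the text of each batch of pages so callers can save it incrementally."""
    import pdfplumber  # imported lazily; only needed when a file is processed

    with pdfplumber.open(path) as pdf:
        for batch in batched_indices(len(pdf.pages), batch_size):
            yield "\n\n".join(pdf.pages[i].extract_text() or "" for i in batch)


def extract_file_text(path):
    """Worker: return (path, full text) for one file."""
    return path, "\n\n".join(iter_text_batches(path))


def extract_many(paths, workers=4):
    """Extract text from several files in parallel, one process per file."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(extract_file_text, paths))


if __name__ == "__main__":
    print([list(batch) for batch in batched_indices(7, 3)])
```

Parallelising per file rather than per page keeps each PDF open in a single process, which avoids passing page objects between workers.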
This skill extracts text and tables from PDF files, fills form fields, merges and splits documents, and converts pages to images for OCR. It relies on lightweight Python libraries to handle common PDF workflows reliably. Use it to automate document processing, data extraction, and batch transformations.
The skill uses pdfplumber for text and table extraction and pypdf for merging and splitting. For scanned or image-based PDFs it suggests converting pages with pdf2image and applying OCR with pytesseract. It includes patterns for reading all pages, extracting tables to CSV, filling form fields, and basic error handling and performance tips.
**What if `extract_text()` returns None or empty?**
That usually means the page is scanned or contains only images. Convert the page to an image with pdf2image and run OCR with pytesseract.
**Which library should I use for merging and splitting?**
Use pypdf for reliable merging, splitting, and basic form field handling; combine it with pdfplumber for extraction tasks.