This skill helps you work with PDFs using Python libraries: merging, splitting, extracting text and tables, and creating new documents.
To add this skill to your agents, run `npx playbooks add skill plurigrid/asi --skill pdf`.
---
name: pdf
description: Comprehensive PDF manipulation toolkit for extracting text and tables,
  creating new PDFs, merging/splitting documents, and handling forms. Use when Claude
  needs to fill in a PDF form or programmatically process, generate, or analyze PDF
  documents at scale.
license: Apache-2.0
metadata:
  trit: 0
  source: anthropics/skills
---
# PDF Processing Guide
## Quick Start
```python
from pypdf import PdfReader, PdfWriter

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text from every page
text = ""
for page in reader.pages:
    # Fall back to an empty string if a page yields no text
    text += page.extract_text() or ""
```
## Python Libraries
### pypdf - Basic Operations
#### Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()

# Append every page of each input file, in order
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```
#### Split PDF
```python
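from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")

# Write each page to its own file: page_1.pdf, page_2.pdf, ...
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)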
```
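
Splitting does not have to mean one page per file. The following is a minimal sketch, not part of the original skill, that copies a selection of pages into a single output. The helper names `parse_page_ranges` and `extract_pages` and the 1-based spec format (`"1-3,5"`, with `"8-"` meaning "page 8 to the end") are assumptions made for illustration; only `PdfReader`, `PdfWriter`, `add_page`, and `write` come from pypdf.

```python
def parse_page_ranges(spec: str, page_count: int) -> list[int]:
    """Turn a 1-based spec like "1-3,5" into sorted, unique 0-based indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        if not sep:
            end = start            # single page, e.g. "5"
        elif end_text:
            end = int(end_text)    # closed range, e.g. "1-3"
        else:
            end = page_count       # open range, e.g. "8-"
        if start < 1 or end > page_count or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        indices.update(range(start - 1, end))
    return sorted(indices)


def extract_pages(src: str, dst: str, spec: str) -> None:
    """Copy the pages named by spec from src into a new PDF at dst."""
    # Imported here so the range parser stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as output:
        writer.write(output)
```

A call such as `extract_pages("input.pdf", "selection.pdf", "1-3,5")` would then write pages 1, 2, 3, and 5 to `selection.pdf`.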
### pdfplumber - Text and Table Extraction
#### Extract Tables
```python
import pdfplumber
import pandas as pd

all_tables = []

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                # First row is treated as the header
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)
```
### reportlab - Create PDFs
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas
c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.save()
```
## Command-Line Tools
```bash
# Extract text (poppler-utils)
pdftotext input.pdf output.txt
# Merge PDFs (qpdf)
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
# Split pages
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
```
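
When these tools need to be driven from a Python script, one option is to build the argument lists explicitly and run them with `subprocess`. This is a sketch under assumptions: the function names `qpdf_merge_command`, `pdftotext_command`, and `run_tool` are hypothetical, and the argument order simply mirrors the shell commands above.

```python
import shutil
import subprocess


def qpdf_merge_command(inputs: list[str], output: str) -> list[str]:
    """Build the argv for: qpdf --empty --pages <inputs...> -- <output>."""
    if not inputs:
        raise ValueError("at least one input PDF is required")
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def pdftotext_command(pdf_path: str, txt_path: str) -> list[str]:
    """Build the argv for: pdftotext <pdf_path> <txt_path>."""
    return ["pdftotext", pdf_path, txt_path]


def run_tool(command: list[str]) -> None:
    """Run a command, failing early if the executable is not on PATH."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed or not on PATH")
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(command, check=True)
```

Passing a list instead of a shell string avoids quoting problems with file names that contain spaces. For example, `run_tool(qpdf_merge_command(["file1.pdf", "file2.pdf"], "merged.pdf"))` runs the same merge as the shell command above.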
## Quick Reference
| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| OCR scanned PDFs | pytesseract | Convert to image first |
This skill is a comprehensive PDF manipulation toolkit focused on extracting text and tables, merging and splitting documents, and creating simple PDFs programmatically. It leverages Python libraries and common command-line tools to handle digital and scanned PDFs. The goal is practical, scriptable workflows for data extraction and document assembly.
The skill uses pypdf for basic reading, merging, and splitting of PDF pages and reportlab for generating PDFs. For robust text and structured table extraction it integrates pdfplumber; for OCR on scanned pages it recommends converting pages to images and running pytesseract. It also documents equivalent command-line utilities (pdftotext, qpdf) for fast shell-based operations.
**Which tool is best for extracting tables reliably?**
Use pdfplumber to detect and extract tables, then convert the output to pandas DataFrames for cleaning and export.
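
Extracted tables usually need some cleanup before export. The sketch below assumes the shape that `page.extract_tables()` returns (a list of rows, each a list of strings or `None`) and does the cleaning in plain Python so it runs without pandas. The helper names `clean_table` and `write_csv` and the sample data are hypothetical.

```python
import csv


def clean_table(table):
    """Normalize one pdfplumber-style table (a list of rows of str or None).

    Returns (header, rows) with whitespace collapsed, None replaced by "",
    and fully empty rows dropped.
    """
    def clean_cell(cell):
        return " ".join(str(cell).split()) if cell is not None else ""

    cleaned = [[clean_cell(cell) for cell in row] for row in table]
    cleaned = [row for row in cleaned if any(row)]
    if not cleaned:
        return [], []
    return cleaned[0], cleaned[1:]


def write_csv(header, rows, path):
    """Write the cleaned table to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


# Made-up input shaped like one entry of page.extract_tables().
raw = [
    ["Item", "Unit\nprice"],
    ["Widget ", "4.50"],
    [None, None],
    ["Gadget", None],
]
header, rows = clean_table(raw)
```

If pandas is available, `pd.DataFrame(rows, columns=header)` builds the DataFrame from the cleaned values, and `write_csv(header, rows, "table.csv")` covers the export step without it.
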
**How do I handle scanned PDFs with no selectable text?**
Rasterize the pages to images, run pytesseract on each image for OCR, then post-process the text. A higher rasterization DPI (for example, 300) generally improves OCR accuracy at the cost of speed and memory.
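
The OCR workflow can be sketched as follows. This is an illustration under assumptions: the text above names pytesseract but not a rasterizer, so `pdf2image` (which requires poppler) is an assumed choice, pytesseract additionally needs the Tesseract binary installed, and the helper names `ocr_pdf` and `join_pages` are hypothetical.

```python
def ocr_pdf(pdf_path: str, dpi: int = 300) -> list[str]:
    """Rasterize each page and OCR it; returns one string per page."""
    # Third-party imports are kept local so join_pages works without them.
    import pytesseract
    from pdf2image import convert_from_path

    # Assumption: pdf2image is the rasterizer; it returns one PIL image per page.
    images = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(image) for image in images]


def join_pages(page_texts: list[str]) -> str:
    """Post-process OCR output: trim each page and label it with its number."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(sections)
```

Typical usage would be `text = join_pages(ocr_pdf("scanned.pdf"))`. Keeping rasterization and post-processing in separate functions makes it easy to try a different DPI without touching the text cleanup.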