
pdf skill


This skill helps you manipulate PDF documents programmatically by merging, splitting, extracting text and tables, rotating pages, and creating new files.

npx playbooks add skill enoch-robinson/agent-skill-collection --skill pdf

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
2.5 KB
---
name: pdf
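description: PDF processing toolkit. Extracts text and tables, creates new PDFs, merges and splits documents, rotates pages, and handles forms. Use this skill when you need to process, generate, or analyze PDF documents programmatically.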
---

# PDF Processing Guide

## Quick Start

```python
from pypdf import PdfReader, PdfWriter

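# Read the PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text from every page
text = ""
for page in reader.pages:
    # Fall back to "" so a page with no extractable text cannot break the join
    text += page.extract_text() or ""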
```

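## Choosing a Python Library

| Task | Recommended library | Use |
|------|---------------------|-----|
| Basic operations | pypdf | Merge, split, rotate, metadata |
| Text extraction | pdfplumber | Text and table extraction |
| Creating PDFs | reportlab | Generate new PDFs |
| OCR for scans | pytesseract | Recognize text in images |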

## Common Operations

### Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

### Split a PDF
```python
from pypdf import PdfReader, PdfWriter

# Write each page of input.pdf to its own single-page file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```
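
One file per page is not always the split that is wanted. As a minimal sketch, assuming fixed-size groups of pages are acceptable, the variant below writes one file per chunk. The names `chunk_ranges` and `split_by_chunks` are hypothetical helpers introduced here for illustration; they are not part of pypdf.

```python
from pathlib import Path


def chunk_ranges(n_pages, chunk_size):
    """Return 0-based, end-exclusive (start, end) page ranges."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, n_pages))
            for start in range(0, n_pages, chunk_size)]


def split_by_chunks(input_path, chunk_size, out_dir="."):
    """Write each chunk of pages to its own PDF and return the output paths."""
    # Imported lazily so chunk_ranges stays usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_path)
    outputs = []
    for start, end in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. pages_1-5.pdf
        out_path = Path(out_dir) / f"pages_{start + 1}-{end}.pdf"
        with open(out_path, "wb") as output:
            writer.write(output)
        outputs.append(out_path)
    return outputs
```

Calling `split_by_chunks("input.pdf", 5)` on a 12-page file would produce three files covering pages 1-5, 6-10, and 11-12.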

### Extract Tables
```python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
```
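
Printing rows is enough for inspection, but extracted tables usually end up in a file. The sketch below assumes that all tables from one PDF can share a single CSV and that empty cells (which pdfplumber reports as `None`) should become empty strings. `clean_table` and `tables_to_csv` are hypothetical helper names.

```python
import csv


def clean_table(table):
    """Replace None cells with "" and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in table]


def tables_to_csv(pdf_path, csv_path):
    """Write every table found in pdf_path to a single CSV file."""
    import pdfplumber  # third-party; imported lazily

    with pdfplumber.open(pdf_path) as pdf, \
            open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for page in pdf.pages:
            for table in page.extract_tables():
                writer.writerows(clean_table(table))
```

If tables on different pages have different columns, writing one CSV per table is probably the safer design.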

### Create a PDF
```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("new.pdf", pagesize=letter)
c.drawString(100, 750, "Hello World!")
c.save()
```

### Rotate Pages
```python
from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()

page = reader.pages[0]
page.rotate(90)  # rotate 90 degrees clockwise; the angle must be a multiple of 90
writer.add_page(page)

with open("rotated.pdf", "wb") as output:
    writer.write(output)
```

## Command-Line Tools

```bash
# Extract text (poppler-utils)
pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt  # preserve layout

# Merge PDFs (qpdf)
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Extract a page range ("." refers back to input.pdf)
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
```
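
When these tools are called from a Python pipeline, building the argument list separately from running it keeps the command easy to inspect and test. This is a minimal sketch that assumes `qpdf` is on `PATH`; `qpdf_extract_args` and `extract_pages` are hypothetical wrapper names.

```python
import subprocess


def qpdf_extract_args(input_path, first, last, output_path):
    """Build the qpdf argument list for extracting pages first..last (1-based)."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # "." refers back to the primary input file named before --pages.
    return ["qpdf", input_path, "--pages", ".", f"{first}-{last}", "--", output_path]


def extract_pages(input_path, first, last, output_path):
    """Run qpdf; raises CalledProcessError on any non-zero exit status."""
    subprocess.run(qpdf_extract_args(input_path, first, last, output_path), check=True)
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.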

## Quick Reference

| Task | Code |
|------|------|
| Merge | `writer.add_page(page)` |
| Split | Save each page separately |
| Extract text | `page.extract_text()` |
| Extract tables | `page.extract_tables()` |
| Create | `canvas.Canvas()` |
| Rotate | `page.rotate(90)` |

Overview

This skill provides a practical PDF processing toolkit for extracting text and tables, creating and editing PDFs, merging and splitting documents, rotating pages, and handling simple form workflows. It bundles common programmatic patterns and library recommendations to automate PDF generation and analysis tasks. Use it to build reliable pipelines for document ingestion, transformation, and export.

How this skill works

The skill uses lightweight Python libraries for different responsibilities: pypdf (basic read/write, merge, split, rotate), pdfplumber (accurate text and table extraction), reportlab (generate new PDFs), and pytesseract for OCR on scanned pages. It exposes common code patterns to read pages, extract content, assemble new documents, and perform command-line operations when needed.

When to use it

  • Automating extraction of text or tables from many PDF files for downstream analysis.
  • Merging multiple reports or splitting a large PDF into individual pages.
  • Generating PDFs programmatically, such as invoices or reports.
  • Rotating mis-scanned pages or normalizing page orientation before OCR.
  • Processing scanned documents by combining OCR with table extraction.

Best practices

  • Choose the right library per task: pypdf for structure, pdfplumber for tables, reportlab for creation, pytesseract for OCR.
  • Preserve original layout when extracting structured data by using pdfplumber's layout-aware extraction.
  • Process large PDFs one page at a time and write results out as you go, rather than accumulating every page's output in memory.
  • Validate outputs (text integrity, table columns) with small samples before batch runs.
  • Use command-line tools (pdftotext, qpdf) for fast, reliable operations in bulk processing pipelines.
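
The validation step above can be sketched as a small check run on a sample of extracted tables before a batch job. It assumes the expected column count is known in advance; `find_bad_rows` is a hypothetical helper, not a pdfplumber function.

```python
def find_bad_rows(table, expected_columns):
    """Return indexes of rows that are the wrong width or entirely empty."""
    bad = []
    for index, row in enumerate(table):
        # Extracted cells may be None, so normalize before checking.
        cells = [(cell or "").strip() for cell in row]
        if len(cells) != expected_columns or not any(cells):
            bad.append(index)
    return bad


sample = [
    ["Item", "Qty", "Price"],
    ["Pen", "2", "1.50"],
    ["Ink", "1"],          # too few columns
    [None, "", None],      # empty row
]
print(find_bad_rows(sample, 3))  # prints [2, 3]
```

A non-empty result on the sample is a signal to adjust the extraction settings before processing the full set.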

Example use cases

  • Extract all invoice tables from a folder of PDFs to CSV for accounting reconciliation.
  • Merge monthly reports into a single consolidated PDF for distribution.
  • Split a multipage scanned contract into separate page files for electronic signatures.
  • Create templated PDF receipts using reportlab populated with database values.
  • Rotate and OCR scanned forms to extract printed fields programmatically; expect handwritten fields to be recognized far less reliably.

FAQ

Which library should I use for table extraction?

Use pdfplumber for the most reliable table and layout-aware extractions; pypdf can extract text but is less structured for tables.

How do I handle scanned PDFs with only images?

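Render each page to an image, rotate or crop it as needed, then run OCR (for example with pytesseract). pdfplumber can help locate table regions before OCR only when the page still carries vector lines or a text layer; a purely image-based page gives it nothing to detect.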
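
A rough sketch of that flow follows. It assumes the `pdf2image` package (which needs poppler) is used to rasterize pages; that package is not named by this skill, and any other rasterizer would do. `needs_ocr` and `ocr_pdf` are hypothetical helpers, and the 20-character threshold is an arbitrary starting point.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Heuristic: treat a page with almost no extractable text as image-only."""
    return len((extracted_text or "").strip()) < min_chars


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize each page and OCR it; returns one string per page."""
    # Assumed dependencies: pdf2image (needs poppler) and pytesseract
    # (needs the tesseract binary). Imported lazily for that reason.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]
```

A pipeline could call `needs_ocr(page.extract_text())` first and fall back to `ocr_pdf` only for files whose pages come back empty, since OCR is much slower than reading an existing text layer.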