
pdf skill


This skill helps you manipulate PDFs with Python libraries: merging, splitting, extracting text and tables, and creating new documents.

This is most likely a fork of the pdf skill from project-n-e-k-o.

npx playbooks add skill plurigrid/asi --skill pdf

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: pdf
description: Comprehensive PDF manipulation toolkit for extracting text and tables,
  creating new PDFs, merging/splitting documents, and handling forms. Use when Claude
  needs to fill in a PDF form or programmatically process, generate, or analyze PDF
  documents at scale.
license: Apache-2.0
metadata:
  trit: 0
  source: anthropics/skills
---

# PDF Processing Guide

## Quick Start

```python
from pypdf import PdfReader

# Read a PDF
reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Extract text; image-only pages yield no text, so fall back to ""
text = ""
for page in reader.pages:
    text += (page.extract_text() or "") + "\n"
```

## Python Libraries

### pypdf - Basic Operations

#### Merge PDFs
```python
from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)
```

#### Split PDF
```python
from pypdf import PdfReader, PdfWriter

# Write each page of input.pdf to its own single-page file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```

### pdfplumber - Text and Table Extraction

#### Extract Tables
```python
import pdfplumber
import pandas as pd

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                # Treat the first row as the header; cells may be None
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)
```

### reportlab - Create PDFs

```python
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.save()
```

## Command-Line Tools

```bash
# Extract text (poppler-utils)
pdftotext input.pdf output.txt

# Merge PDFs (qpdf)
qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

# Extract pages 1-5 into a new file (qpdf)
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
```

## Quick Reference

| Task | Best Tool | Command/Code |
|------|-----------|--------------|
| Merge PDFs | pypdf | `writer.add_page(page)` |
| Split PDFs | pypdf | One page per file |
| Extract text | pdfplumber | `page.extract_text()` |
| Extract tables | pdfplumber | `page.extract_tables()` |
| Create PDFs | reportlab | Canvas or Platypus |
| OCR scanned PDFs | pytesseract | Convert to image first |

Overview

This skill is a comprehensive PDF manipulation toolkit focused on extracting text and tables, merging and splitting documents, and creating simple PDFs programmatically. It leverages Python libraries and common command-line tools to handle digital and scanned PDFs. The goal is practical, scriptable workflows for data extraction and document assembly.

How this skill works

The skill uses pypdf for basic reading, merging, and splitting of PDF pages and reportlab for generating PDFs. For robust text and structured table extraction it integrates pdfplumber; for OCR on scanned pages it recommends converting pages to images and running pytesseract. It also documents equivalent command-line utilities (pdftotext, qpdf) for fast shell-based operations.

When to use it

  • Extract narrative text or full-page content in programmatic workflows
  • Extract tabular data embedded in PDF pages for analysis or CSV export
  • Merge multiple reports or split a large PDF into page-level files
  • Create simple programmatic PDFs (reports, labels, receipts) via code
  • Perform OCR on scanned pages when text extraction fails
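
One way to decide when text extraction has "failed" is to count the visible characters extracted from each page. The sketch below is an illustration only: `pages_needing_ocr`, `extract_page_texts`, and the `min_chars` threshold are hypothetical names and values, not part of the skill, and the threshold would need tuning per document set.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text looks empty.

    page_texts: one string (or None) per page, in page order.
    min_chars:  assumed threshold; pages with fewer non-whitespace
                characters are treated as scanned images.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        visible = "".join((text or "").split())  # drop all whitespace
        if len(visible) < min_chars:
            flagged.append(number)
    return flagged


def extract_page_texts(pdf_path):
    """Extract text per page with pypdf (imported here, where it is needed)."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


# Demonstration on in-memory strings, so no PDF file is required.
sample = ["Invoice 42\nTotal: 10.00 EUR, due on receipt", "  \n", None]
print(pages_needing_ocr(sample))  # → [2, 3]
```

With a real file, `pages_needing_ocr(extract_page_texts("document.pdf"))` would list the pages to send through OCR.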

Best practices

  • Prefer pdfplumber for table extraction; convert extracted tables to pandas DataFrames for cleaning
  • Use pypdf for page-level manipulation (merge/split) to preserve layout and annotations
  • When OCR is required, rasterize pages at sufficient DPI (around 300 is a common starting point) before running pytesseract to improve accuracy
  • Validate extracted tables against headers and row counts; programmatically handle empty or merged cells
  • Keep a processing pipeline: load -> extract -> clean -> export (CSV/JSON/DF) to ensure reproducibility
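
The validation and cleaning steps above can be sketched in plain Python. The example assumes the list-of-rows shape that `page.extract_tables()` returns, where cells may be `None`. `clean_table` is a hypothetical helper, and padding short rows while rejecting long ones is one possible policy rather than the only reasonable one.

```python
def clean_table(table):
    """Normalize one table (a list of rows, each a list of cells).

    Returns (header, rows). None cells become "", whitespace is collapsed,
    fully empty rows are dropped, and short rows are padded so that every
    row has as many cells as the header.
    """
    def norm(cell):
        return " ".join(str(cell).split()) if cell is not None else ""

    cleaned = [[norm(cell) for cell in row] for row in table if row]
    cleaned = [row for row in cleaned if any(row)]  # drop all-empty rows
    if not cleaned:
        return [], []

    header, body = cleaned[0], cleaned[1:]
    width = len(header)
    rows = []
    for row in body:
        if len(row) > width:
            # More cells than headers usually means a mis-detected table.
            raise ValueError(f"row has {len(row)} cells, header has {width}")
        rows.append(row + [""] * (width - len(row)))
    return header, rows


raw = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", None],
    [None, None, None],
    ["Gadget\n(large)", "1"],
]
header, rows = clean_table(raw)
print(header)  # → ['Item', 'Qty', 'Price']
print(rows)    # → [['Widget', '2', ''], ['Gadget (large)', '1', '']]
```

The cleaned result can be passed to `pd.DataFrame(rows, columns=header)` for the export step.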

Example use cases

  • Batch-extract tables from vendor invoices into a single CSV for accounting
  • Merge chapter PDFs into one deliverable and add bookmarks or metadata
  • Split a multi-page scanned manuscript into individual page files for archiving
  • Generate templated PDF reports with reportlab and attach extracted data tables
  • Run a shell pipeline to convert PDFs to text for downstream NLP or indexing
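
The chapter-merge use case is not covered by the merge example in SKILL.md, so here is a minimal sketch of it. It assumes pypdf 3.x or later, where `PdfWriter.append()` accepts an `outline_item` argument. `merge_chapters` and the file names in the usage note are hypothetical.

```python
def merge_chapters(chapters, output_path, title):
    """Merge PDFs into one file with a bookmark per chapter.

    chapters: list of (bookmark_title, pdf_path) pairs, in reading order.
    Returns the page count of the merged document.
    """
    from pypdf import PdfWriter  # assumes pypdf >= 3.0

    writer = PdfWriter()
    for bookmark, path in chapters:
        # append() copies every page of the file; outline_item adds a
        # bookmark pointing at the first page of the appended file.
        writer.append(path, outline_item=bookmark)

    # Document-level metadata uses PDF name keys such as "/Title".
    writer.add_metadata({"/Title": title})
    page_count = len(writer.pages)
    writer.write(output_path)
    return page_count
```

A call might look like `merge_chapters([("Chapter 1", "ch1.pdf"), ("Chapter 2", "ch2.pdf")], "book.pdf", "My Book")`.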

FAQ

Which tool is best for extracting tables reliably?

Use pdfplumber to detect and extract tables, then convert the output to pandas DataFrames for cleaning and export.

How do I handle scanned PDFs with no selectable text?

Rasterize pages to images, run pytesseract for OCR, then post-process the text; a higher rasterization DPI generally improves OCR quality.
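
The rasterize, OCR, and post-process steps might look like the sketch below. It assumes `pdf2image` (which needs poppler) for rasterization, a library the skill does not name, plus `pytesseract` with the Tesseract binary installed. `ocr_pdf` and `tidy_ocr_text` are hypothetical helpers, and the clean-up rules are simple heuristics.

```python
import re


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """OCR every page of a scanned PDF and return one string per page."""
    # Imported here so the text helper below works without OCR tooling.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return [pytesseract.image_to_string(image, lang=lang) for image in images]


def tidy_ocr_text(text):
    """Light post-processing for OCR output (heuristic)."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words split across lines
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)          # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)        # keep at most one blank line
    return text.strip()


print(repr(tidy_ocr_text("Scan-\nned   text \n\n\n\nnext  line ")))
# → 'Scanned text\n\nnext line'
```

The hyphen rule also joins genuine compounds that happen to break at a line end, so it may need to be relaxed for some documents.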