
This skill enables production-ready PDF processing with forms, OCR, and batch workflows, offering robust validation and error handling for high-volume

npx playbooks add skill henkisdabro/wookstar-claude-plugins --skill pdf-processing-pro


---
name: PDF Processing Pro
description: Production-ready PDF processing with forms, tables, OCR, validation, and batch operations. Use when working with complex PDF workflows in production environments, processing large volumes of PDFs, or requiring robust error handling and validation. Do NOT use for simple text extraction - use pdf-extract for quick reads.
---

# PDF Processing Pro

Production-ready PDF processing toolkit with pre-built scripts, comprehensive error handling, and support for complex workflows.

## Quick start

### Extract text from PDF

```python
import pdfplumber

# Read the first page; fall back to "" if the page has no text layer (e.g. a scan).
with pdfplumber.open("document.pdf") as pdf:
    text = pdf.pages[0].extract_text() or ""
    print(text)
```

### Analyse PDF form (using included script)

```bash
python scripts/analyze_form.py input.pdf --output fields.json
# Returns: JSON with all form fields, types, and positions
```

### Fill PDF form with validation

```bash
python scripts/fill_form.py input.pdf data.json output.pdf
# Validates all fields before filling, includes error reporting
```

### Extract tables from PDF

```bash
python scripts/extract_tables.py report.pdf --output tables.csv
# Extracts all tables with automatic column detection
```

## Features

### Production-ready scripts

- Error handling with detailed messages and proper exit codes
- Input validation, type checking, and configurable logging
- Full type annotations and CLI interface (`--help` on all scripts)

### Comprehensive workflows

- PDF forms, table extraction, OCR processing
- Batch operations, pre/post-processing validation

## Advanced topics

### PDF form processing

Complete form workflows including field analysis, dynamic filling, validation rules, multi-page forms, and checkbox/radio handling. See [references/forms.md](references/forms.md).
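
For orientation, here is a minimal sketch of field analysis using `pypdf` directly. It is not the implementation of `analyze_form.py`, whose internals and output schema are not shown here. The helper names `describe_field_type` and `list_form_fields` are hypothetical. The type codes are the standard AcroForm `/FT` values.

```python
from typing import Any

# Standard AcroForm field-type codes (/FT) from the PDF specification.
FIELD_TYPES = {
    "/Tx": "text",
    "/Btn": "button (checkbox, radio, or push button)",
    "/Ch": "choice (list or combo box)",
    "/Sig": "signature",
}


def describe_field_type(code: Any) -> str:
    """Map an AcroForm /FT code to a readable label (hypothetical helper)."""
    return FIELD_TYPES.get(str(code), "unknown")


def list_form_fields(pdf_path: str) -> dict[str, dict[str, Any]]:
    """Return {field name: {"type": ..., "value": ...}} for a fillable PDF."""
    # Imported lazily so describe_field_type works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # get_fields() returns None when the document has no interactive form.
    fields = reader.get_fields() or {}
    return {
        name: {"type": describe_field_type(field.field_type), "value": field.value}
        for name, field in fields.items()
    }
```

Calling `list_form_fields("input.pdf")` on a PDF without a form returns an empty dict, which is a cheap way to detect non-fillable inputs before attempting a fill.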

### Table extraction

Complex table extraction including multi-page tables, merged cells, nested tables, custom detection, and CSV/Excel export. See [references/tables.md](references/tables.md).
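
As a rough illustration of the simplest case, the sketch below pulls tables with pdfplumber's `extract_tables()` and serialises them to CSV text. It assumes pdfplumber's default detection settings and does not handle multi-page or nested tables. `rows_to_csv` and `extract_tables_as_csv` are hypothetical names, not part of `extract_tables.py`.

```python
import csv
import io
from typing import Optional, Sequence

Row = Sequence[Optional[str]]


def rows_to_csv(rows: Sequence[Row]) -> str:
    """Serialise table rows to CSV text, writing empty (None) cells as ""."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Collapse line breaks inside a cell so each table row stays on one CSV line.
        writer.writerow([" ".join((cell or "").split()) for cell in row])
    return buffer.getvalue()


def extract_tables_as_csv(pdf_path: str) -> list[str]:
    """Return one CSV string per detected table, in page order."""
    # Imported lazily so rows_to_csv works without pdfplumber installed.
    import pdfplumber

    results = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() yields a list of tables; each table is a list of rows.
            for table in page.extract_tables():
                results.append(rows_to_csv(table))
    return results
```

Merged cells typically surface as `None` entries in pdfplumber's output, which is why the sketch normalises them to empty strings rather than dropping them.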

### OCR processing

Scanned PDFs and image-based documents including Tesseract integration, language support, image preprocessing, and confidence scoring. See [references/ocr.md](references/ocr.md).
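
One plausible way to obtain a confidence score is pytesseract's `image_to_data`, which reports a per-word `conf` value. The sketch below averages those values. It is an assumption about how scoring could work, not a description of this skill's scripts, and `mean_word_confidence` and `ocr_page` are hypothetical names.

```python
from typing import Any, Mapping


def mean_word_confidence(data: Mapping[str, Any]) -> float:
    """Average Tesseract confidence (0-100) over recognised words.

    Expects the dict shape returned by pytesseract.image_to_data with
    Output.DICT: parallel "text" and "conf" lists. Entries with conf -1
    describe layout boxes rather than words, so they are skipped.
    """
    scores = [
        float(conf)
        for text, conf in zip(data["text"], data["conf"])
        if str(text).strip() and float(conf) >= 0
    ]
    return sum(scores) / len(scores) if scores else 0.0


def ocr_page(image_path: str, lang: str = "eng") -> tuple[str, float]:
    """OCR one image file and return (text, mean word confidence)."""
    # Imported lazily; needs pytesseract, pillow, and the tesseract binary.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        data = pytesseract.image_to_data(
            image, lang=lang, output_type=pytesseract.Output.DICT
        )
    words = [word for word in data["text"] if word.strip()]
    return " ".join(words), mean_word_confidence(data)
```

A low mean score can flag a page for image preprocessing or manual review. Any threshold is an assumption to tune per document set.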

## Included scripts

| Script | Purpose | Usage |
|--------|---------|-------|
| analyze_form.py | Extract form field info | `python scripts/analyze_form.py input.pdf [--output fields.json] [--verbose]` |
| fill_form.py | Fill PDF forms with data | `python scripts/fill_form.py input.pdf data.json output.pdf [--validate]` |
| validate_form.py | Validate form data before filling | `python scripts/validate_form.py data.json schema.json` |
| extract_tables.py | Extract tables to CSV/Excel | `python scripts/extract_tables.py input.pdf [--output tables.csv] [--format csv\|excel]` |
| extract_text.py | Extract text with formatting | `python scripts/extract_text.py input.pdf [--output text.txt] [--preserve-formatting]` |
| merge_pdfs.py | Merge multiple PDFs | `python scripts/merge_pdfs.py file1.pdf file2.pdf --output merged.pdf` |
| split_pdf.py | Split PDF into pages | `python scripts/split_pdf.py input.pdf --output-dir pages/` |
| validate_pdf.py | Validate PDF integrity | `python scripts/validate_pdf.py input.pdf` |

## Dependencies

All scripts require:

```bash
pip install pdfplumber pypdf pillow pytesseract pandas
```

Optional for OCR:

```bash
# macOS: brew install tesseract
# Ubuntu: apt-get install tesseract-ocr
# Windows: Download from GitHub releases
```
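
A small pre-flight check can confirm that these dependencies are present before a batch starts. This sketch uses only the standard library. The function names are hypothetical, and note that the `pillow` package is imported as `PIL`.

```python
import importlib.util
import shutil

# Import names for the packages listed above (pillow is imported as PIL).
REQUIRED_MODULES = ["pdfplumber", "pypdf", "PIL", "pytesseract", "pandas"]
OCR_BINARIES = ["tesseract"]


def missing_python_modules(modules: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def missing_binaries(binaries: list[str]) -> list[str]:
    """Return the executables that are not on PATH."""
    return [name for name in binaries if shutil.which(name) is None]
```

For example, `missing_binaries(OCR_BINARIES)` returns `["tesseract"]` on a machine where Tesseract is not installed, which lets a pipeline skip or fail OCR steps early with a clear message.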

## References

| File | Contents |
|------|----------|
| [references/forms.md](references/forms.md) | Complete form processing guide |
| [references/tables.md](references/tables.md) | Advanced table extraction |
| [references/ocr.md](references/ocr.md) | Scanned PDF processing |
| [references/workflows.md](references/workflows.md) | Common workflows, error handling, performance tips, best practices |
| [references/troubleshooting.md](references/troubleshooting.md) | Troubleshooting common issues and getting help |

Overview

This skill provides a production-ready PDF processing toolkit for complex workflows including forms, tables, OCR, validation, and batch operations. It bundles ready-to-run Python scripts with robust error handling, input validation, and CLI interfaces so you can integrate reliable PDF processing into production pipelines. Use this for high-volume or mission-critical jobs where accuracy, validation, and observability matter.

How this skill works

The skill inspects PDF content and metadata, analyzes and extracts structured elements (form fields, tables, text), and applies transformations like filling forms, OCR for scanned pages, and splitting/merging. Each script validates inputs, returns clear exit codes and error messages, and supports batching and pre/post-processing hooks for pipeline integration. Optional OCR uses Tesseract with preprocessing and confidence scoring to improve accuracy on image-based documents.
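
To show how exit codes and error messages might be consumed from a pipeline, here is a minimal wrapper around `subprocess.run`. It assumes the conventional meaning of exit code 0 as success. The specific non-zero codes used by the scripts are not documented here, so the sketch only distinguishes zero from non-zero. `ScriptResult` and `run_script` are hypothetical names.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class ScriptResult:
    ok: bool          # True when the process exited with code 0 (assumed to mean success)
    returncode: int
    stdout: str
    stderr: str


def run_script(args: list[str], timeout: float = 300.0) -> ScriptResult:
    """Run `python <args...>` with the current interpreter and capture the outcome."""
    completed = subprocess.run(
        [sys.executable, *args],
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    return ScriptResult(
        ok=completed.returncode == 0,
        returncode=completed.returncode,
        stdout=completed.stdout,
        stderr=completed.stderr,
    )
```

A call such as `run_script(["scripts/validate_pdf.py", "input.pdf"])` would then let the caller log `stderr` and branch on `ok` instead of parsing output text.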

When to use it

  • Processing large volumes of PDFs in production where failures must be caught and reported.
  • Automating complex form workflows with validation, multi-page forms, and checkbox/radio handling.
  • Extracting complex or multi-page tables with merged cells and exporting to CSV/Excel.
  • Processing scanned or image PDFs that require OCR with language and preprocessing support.
  • Batch operations that need configurable logging, exit codes, and integration into CI/CD or ETL pipelines.

Best practices

  • Validate input PDFs (validate_pdf.py) and check JSON data against its schema (validate_form.py) before running fill or batch scripts to avoid partial failures.
  • Run OCR only when necessary; prefer text extraction for digital PDFs to save time and reduce noise.
  • Use the CLI verbose/logging flags in production to capture detailed diagnostics and set appropriate log rotation/retention.
  • Parallelize batch jobs carefully and monitor memory/CPU, especially during OCR and large table extraction.
  • Keep Tesseract language packs and system OCR dependencies updated on your servers for consistent results.
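
The advice to run OCR only when necessary can be sketched as a simple heuristic: try the text layer first and send a page to OCR only if almost nothing comes back. The 20-character threshold and the function names are assumptions for illustration, not settings used by this skill.

```python
from typing import Optional


def needs_ocr(extracted_text: Optional[str], min_chars: int = 20) -> bool:
    """Heuristic: treat a page as scanned when its text layer is nearly empty."""
    return len((extracted_text or "").strip()) < min_chars


def pages_needing_ocr(pdf_path: str, min_chars: int = 20) -> list[int]:
    """Return the 1-based page numbers whose text layer looks empty."""
    # Imported lazily so needs_ocr works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [
            page.page_number
            for page in pdf.pages
            if needs_ocr(page.extract_text(), min_chars)
        ]
```

Pages that contain only a short header or a page number can fall under the threshold, so the cutoff is worth tuning against a sample of real documents.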

Example use cases

  • Audit pipeline that validates and fills thousands of form PDFs with pre-validated JSON payloads and detailed error reporting.
  • Monthly financial report ingestion that extracts multi-page tables to CSV/Excel for downstream analytics.
  • Digitizing archival scanned documents with image preprocessing and Tesseract OCR, saving text and confidence scores.
  • Document management tasks: split scans into pages, merge signed pages into final PDFs, and validate PDF integrity before archiving.

FAQ

Is this tool suitable for simple text extraction?

No. For quick reads or simple text extraction, use a lightweight extractor such as the pdf-extract skill. This skill is optimized for production scenarios and complex workflows.

Which system dependencies are required for OCR?

Tesseract is required for OCR. Install via package manager (brew/apt) or download for Windows, and ensure language packs are present for target languages.

Can I run these scripts in parallel for batches?

Yes, but monitor resource usage. OCR and large table extraction are CPU and memory intensive; use controlled concurrency and job queuing.
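
A minimal sketch of controlled concurrency, using a bounded thread pool from the standard library. It assumes each job is a callable that raises on failure. Threads suit workers that shell out to the scripts. For CPU-bound work done in-process, `ProcessPoolExecutor` is the usual substitute. `run_batch` is a hypothetical helper, and the default of 4 workers is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Iterable


def run_batch(
    paths: Iterable[str],
    worker: Callable[[str], Any],
    max_workers: int = 4,
) -> tuple[dict[str, Any], dict[str, str]]:
    """Run worker(path) for every path with at most max_workers jobs in flight.

    Returns (successes, failures): results keyed by path, and error
    descriptions keyed by path. One failing file does not stop the batch.
    """
    successes: dict[str, Any] = {}
    failures: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                successes[path] = future.result()
            except Exception as exc:  # record the failure and keep going
                failures[path] = f"{type(exc).__name__}: {exc}"
    return successes, failures
```

Keeping failures in a separate mapping makes it straightforward to retry or report only the problem files after the batch finishes.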