pdf-extractor skill

/agentic/code/addons/doc-intelligence/skills/pdf-extractor

This skill extracts structured content from PDF files, enabling searchable text, tables, and images for documentation and reports.

npx playbooks add skill jmagly/aiwg --skill pdf-extractor

---
name: pdf-extractor
description: Extract text, tables, and images from PDF files. Use when converting PDF documentation, manuals, or reports to searchable text.
tools: Read, Write, Bash
---

# PDF Extractor Skill

## Purpose

Single responsibility: Extract structured content (text, tables, images) from PDF files into organized, searchable formats. (BP-4)

## Grounding Checkpoint (Archetype 1 Mitigation)

Before executing, VERIFY:

- [ ] PDF file exists and is readable (`file <path>` confirms PDF format)
- [ ] PDF is not corrupted (`pdfinfo <path>` returns metadata)
- [ ] Password known if encrypted
- [ ] Output directory is writable
- [ ] Required tools available (pdfplumber, pytesseract for OCR)

**DO NOT proceed without verification. Inspect PDF metadata first.**
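
Parts of this checklist can be automated. The following is a minimal Python sketch, not part of the skill itself: `preflight` and its `tools` default are hypothetical names chosen for illustration, and looking for the `%PDF-` marker is only a cheap stand-in for what `file` and `pdfinfo` report.

```python
import os
import shutil
from pathlib import Path


def preflight(pdf_path, output_dir, tools=("pdfinfo", "pdftotext")):
    """Return a list of problems; an empty list means it is safe to proceed."""
    problems = []
    pdf = Path(pdf_path)

    if not pdf.is_file():
        problems.append(f"not a file: {pdf}")
    else:
        # PDF readers look for the "%PDF-" marker near the start of the file.
        with pdf.open("rb") as handle:
            if b"%PDF-" not in handle.read(1024):
                problems.append(f"no PDF header found: {pdf}")

    out = Path(output_dir)
    if not out.is_dir() or not os.access(out, os.W_OK):
        problems.append(f"output directory not writable: {out}")

    # Assumption: each required tool is a command on PATH.
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"missing tool: {tool}")
    return problems
```

An empty result only shows that the basic preconditions hold. Encryption and corruption still need the `pdfinfo` inspection in Step 1.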

## Uncertainty Escalation (Archetype 2 Mitigation)

ASK USER instead of guessing when:

- PDF appears to be scanned (needs OCR) but OCR tools unavailable
- Multiple table formats detected - unclear which parser to use
- Password-protected but no password provided
- Image extraction quality unclear (resolution, format preferences)
- Language detection needed for OCR

**NEVER assume PDF structure without inspection.**

## Context Scope (Archetype 3 Mitigation)

| Context Type | Included | Excluded |
|--------------|----------|----------|
| RELEVANT | Target PDF, extraction options, output path | Other PDF files |
| PERIPHERAL | Similar PDF structure examples | Unrelated documents |
| DISTRACTOR | Previous extraction attempts | Other file formats |

## Workflow Steps

### Step 1: Inspect PDF (Grounding)

```bash
# Check file type
file document.pdf

# Get PDF metadata
pdfinfo document.pdf

# Check page count
pdfinfo document.pdf | grep Pages

# Check if encrypted
pdfinfo document.pdf | grep Encrypted
```
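
If the inspection results need to drive later steps, the `pdfinfo` output can be parsed instead of grepped. This is a sketch under the assumption that `pdfinfo` prints one `Key: value` pair per line; `parse_pdfinfo` and `inspect_pdf` are illustrative names, not part of the skill.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo output ("Key:   value" lines) into a dict of strings."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def inspect_pdf(path):
    """Run pdfinfo and return (page_count, encrypted) for the file."""
    result = subprocess.run(
        ["pdfinfo", str(path)], capture_output=True, text=True, check=True
    )
    info = parse_pdfinfo(result.stdout)
    # The Encrypted field starts with "yes" or "no".
    encrypted = info.get("Encrypted", "no").lower().startswith("yes")
    return int(info["Pages"]), encrypted
```

With `check=True`, a non-zero exit from `pdfinfo` (for example on a damaged file) raises `subprocess.CalledProcessError`, which fits the rule of stopping before extraction.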

### Step 2: Determine Extraction Strategy

| PDF Type | Detection | Strategy |
|----------|-----------|----------|
| Text-based | `pdftotext` produces readable text | Direct extraction |
| Scanned/Image | `pdftotext` produces empty or garbled output | OCR required |
| Mixed | Some pages text, some images | Hybrid approach |
| Tables | Visual grid patterns | Table extraction mode |
| Forms | Interactive fields | Form field extraction |
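
The first three rows of the table can be approximated from `pdftotext` output alone. The sketch below assumes pages are separated by form feed characters (the `pdftotext` default) and treats a page with almost no text as scanned. The `min_chars` threshold and both function names are illustrative. Garbled text is not detected here and needs a separate check.

```python
def classify_pages(pdftotext_output, min_chars=20):
    """Label each page 'text' or 'scanned' based on how much text it holds.

    min_chars is an illustrative threshold, not a value defined by this skill.
    """
    pages = pdftotext_output.split("\f")
    # A form feed typically follows the last page too, leaving an empty tail.
    if pages and pages[-1] == "":
        pages.pop()
    return [
        "text" if len("".join(page.split())) >= min_chars else "scanned"
        for page in pages
    ]


def choose_strategy(labels):
    """Map per-page labels to direct extraction, OCR, or a hybrid run."""
    kinds = set(labels)
    if not kinds:
        raise ValueError("no pages to classify")
    if kinds == {"text"}:
        return "direct"
    if kinds == {"scanned"}:
        return "ocr"
    return "hybrid"
```

Keeping the per-page labels, rather than only the overall strategy, lets a hybrid run apply OCR to just the pages that need it.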

### Step 3: Execute Extraction

**Option A: With skill-seekers (if installed)**

```bash
# Basic extraction
skill-seekers pdf --pdf document.pdf --name myskill

# With table extraction
skill-seekers pdf --pdf document.pdf --name myskill --extract-tables

# With OCR for scanned docs
skill-seekers pdf --pdf document.pdf --name myskill --ocr

# With parallel processing (large PDFs)
skill-seekers pdf --pdf document.pdf --name myskill --parallel --workers 8

# Password-protected
skill-seekers pdf --pdf document.pdf --name myskill --password "secret"
```

**Option B: Manual extraction guidance**

```bash
# Basic text extraction
pdftotext -layout document.pdf output.txt

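# Extract with Unix line endings (pages are separated by form feeds by default)
pdftotext -layout -eol unix document.pdf output.txt

# Extract images (the last argument is a filename prefix, so create the directory first)
mkdir -p images
pdfimages -all document.pdf images/img

# OCR scanned PDF (requires tesseract): render at 300 DPI, then OCR one page at a time
pdftoppm -r 300 -png document.pdf page
for img in page-*.png; do
  tesseract "$img" "${img%.png}" -l eng
done
cat page-*.txt > output.txt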
```
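
The commands above do not extract tables. As a rough sketch of that step, assume each table arrives as a list of rows whose cells are strings or `None` (the shape pdfplumber's `extract_tables()` returns for each table). The helper below, a hypothetical `rows_to_markdown`, renders such rows as a Markdown pipe table, which is the form the Step 4 `grep` looks for.

```python
def rows_to_markdown(rows):
    """Render rows (lists of cell strings or None) as a Markdown table.

    The first row is treated as the header.
    """
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def render(row):
        cells = ["" if cell is None else str(cell) for cell in row]
        cells += [""] * (width - len(cells))  # pad ragged rows
        # Escape pipes and flatten newlines so each row stays on one line.
        cells = [c.replace("|", "\\|").replace("\n", " ").strip() for c in cells]
        return "| " + " | ".join(cells) + " |"

    lines = [render(rows[0]), "| " + " | ".join(["---"] * width) + " |"]
    lines += [render(row) for row in rows[1:]]
    return "\n".join(lines)
```

Merged cells and multi-line headers will not survive this flattening, so complex tables still need manual review.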

### Step 4: Validate Output

```bash
# Check extraction quality
head -100 output/<skill-name>/references/content.md

# Verify table extraction
grep -A 10 "| " output/<skill-name>/references/*.md

# Check image extraction
ls -la output/<skill-name>/assets/images/
```
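
For repeated runs, the same spot checks can be collected into a small summary. This sketch assumes the directory layout shown under Output Structure; `summarize_output` is an illustrative helper, and the counts are a quick sanity signal rather than a quality measurement.

```python
from pathlib import Path


def summarize_output(skill_dir):
    """Count what an extraction left behind under output/<skill-name>/."""
    root = Path(skill_dir)
    summary = {"markdown_files": 0, "empty_files": [], "table_rows": 0, "images": 0}

    references = root / "references"
    md_files = sorted(references.glob("*.md")) if references.is_dir() else []
    for md in md_files:
        text = md.read_text(encoding="utf-8")
        summary["markdown_files"] += 1
        if not text.strip():
            summary["empty_files"].append(md.name)
        # Lines starting with "|" are taken to be Markdown table rows.
        summary["table_rows"] += sum(
            1 for line in text.splitlines() if line.lstrip().startswith("|")
        )

    images_dir = root / "assets" / "images"
    if images_dir.is_dir():
        summary["images"] = sum(1 for p in images_dir.iterdir() if p.is_file())
    return summary
```

Empty reference files or a table row count of zero on a table-heavy PDF are reasons to revisit the extraction strategy.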

## Recovery Protocol (Archetype 4 Mitigation)

On error:

1. **PAUSE** - Stop extraction, preserve partial output
2. **DIAGNOSE** - Check error type:
   - `File not found` → Verify path
   - `Password required` → Ask user for password
   - `Corrupt PDF` → Diagnose with `qpdf --check`, then try a repair by rewriting the file (`qpdf damaged.pdf repaired.pdf`)
   - `OCR failed` → Check tesseract installation, language packs
   - `Memory error` → Process in chunks, reduce workers
3. **ADAPT** - Switch strategy based on diagnosis
4. **RETRY** - Resume with adapted approach (max 3 attempts)
5. **ESCALATE** - Ask user for guidance
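
The protocol can be read as a bounded retry loop. The sketch below is one possible shape, with assumptions stated up front: `extract` is any callable that takes a settings dict, the `chunk_size` and `workers` keys are illustrative, and only the memory-error branch is adapted automatically. Everything else escalates.

```python
MAX_ATTEMPTS = 3  # matches "max 3 attempts" in the protocol above


def adapt(settings, error):
    """Return adjusted settings for a known error, or None to escalate."""
    if isinstance(error, MemoryError):
        # Memory error: process smaller chunks with fewer workers.
        return {
            **settings,
            "chunk_size": max(1, settings["chunk_size"] // 2),
            "workers": max(1, settings["workers"] // 2),
        }
    return None  # unknown failure: ask the user rather than guess


def run_with_recovery(extract, settings):
    """Call extract(settings), adapting and retrying up to MAX_ATTEMPTS times."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return extract(settings)
        except Exception as error:  # PAUSE: the failed attempt stops here
            adjusted = adapt(settings, error)  # DIAGNOSE and ADAPT
            if adjusted is None or attempt == MAX_ATTEMPTS:
                # ESCALATE: surface the original error to the caller.
                raise RuntimeError(f"escalate to user: {error!r}") from error
            settings = adjusted  # RETRY with the adapted approach
```

Raising instead of returning a sentinel keeps the original exception attached, so the escalation message can show the user what actually failed.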

## Checkpoint Support

State saved to: `.aiwg/working/checkpoints/pdf-extractor/`

For large PDFs, extraction saves progress per chunk:
```
checkpoints/pdf-extractor/
├── document_metadata.json
├── pages_1-50.json
├── pages_51-100.json
└── current_position.json
```
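
A minimal sketch of how per-chunk checkpoints like these could be written and resumed follows. The file names mirror the listing above, but the JSON contents are an assumption: in particular, the `last_completed_page` key is hypothetical, since the skill does not document the format of `current_position.json`.

```python
import json
from pathlib import Path


def save_chunk(checkpoint_dir, first_page, last_page, content):
    """Write one chunk, then record the position reached."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    chunk_file = directory / f"pages_{first_page}-{last_page}.json"
    chunk_file.write_text(json.dumps(content), encoding="utf-8")
    # Update the position only after the chunk itself has been written.
    position = {"last_completed_page": last_page}  # hypothetical key name
    (directory / "current_position.json").write_text(
        json.dumps(position), encoding="utf-8"
    )


def next_page(checkpoint_dir):
    """Return the first page still to process (1 when no checkpoint exists)."""
    position_file = Path(checkpoint_dir) / "current_position.json"
    if not position_file.exists():
        return 1
    position = json.loads(position_file.read_text(encoding="utf-8"))
    return position["last_completed_page"] + 1
```

Writing the chunk before the position file means an interruption between the two writes repeats a chunk on resume instead of skipping one.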

## Output Structure

```
output/<skill-name>/
├── SKILL.md              # Skill description with PDF summary
├── references/
│   ├── index.md          # Table of contents
│   ├── chapter_1.md      # Content by section
│   ├── chapter_2.md
│   └── tables.md         # Extracted tables
└── assets/
    └── images/           # Extracted images (if enabled)
        ├── page_1_fig_1.png
        └── page_5_chart_1.png
```

## Configuration Options

```json
{
  "name": "mymanual",
  "description": "Product manual documentation",
  "pdf_path": "docs/manual.pdf",
  "extract_options": {
    "chunk_size": 10,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 150,
    "ocr_enabled": false,
    "ocr_language": "eng",
    "table_extraction": true
  },
  "categories": {
    "getting_started": ["introduction", "setup", "installation"],
    "usage": ["using", "operation", "guide"],
    "reference": ["appendix", "specifications", "api"]
  }
}
```
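
The sketch below shows one way a configuration like this might be loaded and used to plan chunked extraction. It rests on two assumptions that the skill does not state: that `name`, `pdf_path`, and `extract_options` are required, and that `chunk_size` counts pages per chunk. Both function names are illustrative.

```python
import json


def load_config(text):
    """Parse a config document and check the fields used for chunking."""
    config = json.loads(text)
    # Assumption: these three keys are mandatory.
    for key in ("name", "pdf_path", "extract_options"):
        if key not in config:
            raise ValueError(f"missing config key: {key}")
    chunk_size = config["extract_options"].get("chunk_size")
    if isinstance(chunk_size, bool) or not isinstance(chunk_size, int) or chunk_size < 1:
        raise ValueError("extract_options.chunk_size must be a positive integer")
    return config


def page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into inclusive (first, last) ranges.

    Assumption: chunk_size is the number of pages per chunk.
    """
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]
```

For a 25-page document and `chunk_size` 10, `page_ranges` yields `(1, 10)`, `(11, 20)`, and `(21, 25)`.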

## Extraction Quality Metrics

| Metric | Good | Acceptable | Poor |
|--------|------|------------|------|
| Text extraction rate | >95% | 80-95% | <80% |
| Table accuracy | >90% | 70-90% | <70% |
| Image quality | >300 DPI | 150-300 DPI | <150 DPI |
| OCR confidence | >90% | 70-90% | <70% |
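
The thresholds in this table translate directly into a lookup. In the sketch below the metric keys are hypothetical names. The numbers are copied from the table, with a value exactly on the upper boundary graded as Acceptable because the Good column uses a strict `>`.

```python
# (good_above, acceptable_from) per metric, copied from the table above.
# The dictionary keys are illustrative names, not identifiers from the skill.
THRESHOLDS = {
    "text_extraction_rate": (95, 80),  # percent
    "table_accuracy": (90, 70),        # percent
    "image_dpi": (300, 150),           # dots per inch
    "ocr_confidence": (90, 70),        # percent
}


def grade(metric, value):
    """Return 'Good', 'Acceptable', or 'Poor' for a measured value."""
    good_above, acceptable_from = THRESHOLDS[metric]
    if value > good_above:
        return "Good"
    if value >= acceptable_from:
        return "Acceptable"
    return "Poor"
```

How each value is measured (for example, what counts as a correctly extracted table) is left open here, as the skill does not define it.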

## Troubleshooting

| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| Garbled text | Scanned PDF | Enable OCR mode |
| Missing tables | Complex layout | Use `--extract-tables` with pdfplumber |
| Poor OCR | Low resolution | Increase DPI, check language pack |
| Memory error | Large PDF | Use chunked extraction, reduce workers |
| Corrupt PDF | File damaged | Diagnose with `qpdf --check`; attempt repair with `mutool clean` |

## Dependencies

**Required:**
- Python 3.10+
- pdfplumber or pypdf

**Optional (for advanced features):**
- pytesseract + tesseract-ocr (for OCR)
- Pillow (for image processing)
- camelot-py (for complex tables)

## References

- Skill Seekers PDF Support: https://github.com/jmagly/Skill_Seekers/blob/main/docs/PDF_MCP_TOOL.md
- REF-001: Production-Grade Agentic Workflows (BP-1, BP-4)
- REF-002: LLM Failure Modes (Archetype 1-4 mitigations)

## Overview

This skill extracts structured content (text, tables, and images) from PDF files and organizes the output into searchable folders. It validates the PDF before running, chooses an extraction strategy (direct text, OCR, or hybrid), and saves progress with checkpoints for large documents. The goal is reliable, auditable conversion of manuals, reports, and documentation into reusable content.

## How this skill works

The skill first inspects the PDF metadata, page count, and encryption status to decide whether direct extraction or OCR is required. It then runs targeted tools (pdftotext/pdfplumber for text and tables, pdfimages for images, and Tesseract for OCR when needed), processes the file in configurable chunks, and writes structured markdown and asset files. Errors pause the run, diagnostics are collected, and the skill either retries with adjusted settings or asks the user for guidance.

## When to use it

- Converting product manuals, user guides, or technical reports into searchable documentation.
- Extracting tables from financial reports or data sheets for downstream analysis.
- Digitizing scanned or image-based PDFs where OCR is required.
- Preparing documentation for ingestion into knowledge bases or search indexes.
- Processing large PDFs that benefit from chunked extraction and checkpoints.

## Best practices

- Always run the grounding checks first: confirm that the file exists, then inspect its metadata, encryption status, and tool availability before extraction.
- Detect scanned versus text pages first; enable OCR only for the pages that need it to save time and preserve quality.
- Use chunked extraction and checkpointing for large PDFs to avoid memory errors and allow safe retries.
- Provide the OCR language and image resolution (DPI) preferences to improve accuracy.
- Validate outputs by sampling page markdown, table files, and image assets before bulk processing.

## Example use cases

- Convert a 200-page product manual into a folder of chapter-wise markdown files and extracted figures for a help center.
- Extract tables from a quarterly financial PDF to CSV-ready markdown for analysts.
- Run hybrid extraction on mixed PDFs where some pages are scanned and others contain selectable text.
- Batch-process scanned field reports with OCR enabled and save intermediate checkpoints so a run can resume after a failure.
- Recover text from legacy PDFs that report as corrupted: run the metadata checks, attempt a repair, and only then extract.

## FAQ

### What happens if the PDF is password-protected?

The skill halts and asks for the password instead of guessing; provide the password to proceed or supply an unlocked copy.

### How do I handle very large PDFs that cause memory errors?

Enable chunked extraction, reduce worker count, and use checkpoints to resume from the last successful chunk.

### When should I enable OCR?

Enable OCR when initial inspection shows pages are scanned or pdftotext returns empty/garbled text; specify language packs for best results.