
pdf-processor skill

/.claude/skills/pdf-processor

This skill helps you extract text and tables, generate PDFs from HTML or Markdown, and manage forms and metadata.

npx playbooks add skill ntaksh42/agents --skill pdf-processor

SKILL.md
---
name: pdf-processor
description: Process, extract, and generate PDF documents with text extraction and form handling. Use when working with PDF files or extracting PDF content.
---

# PDF Processor Skill

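A skill for creating, editing, and analyzing PDF files.

## Overview

Supports reading PDFs, extracting text, processing forms, and creating new PDFs.

## Key features

- **Text extraction**: Extract text and tables from PDFs
- **PDF generation**: Create PDFs from HTML or Markdown
- **Form processing**: Read and fill PDF forms
- **Split and merge**: Work with multiple PDFs
- **Watermarking**: Add security marks
- **Password protection**: Create encrypted PDFs
- **Metadata editing**: Title, author, and similar fields

## Usage

### Text extraction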

```python
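# Python + PyPDF2
import PyPDF2

with open('document.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    # Pages without a text layer (for example, scans) may yield nothing,
    # so fall back to '' and separate pages with a newline
    text = '\n'.join(page.extract_text() or '' for page in reader.pages)
    print(text)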
```
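
The feature list mentions table extraction and the library list names pdfplumber for it, but no example covers it. The following is a minimal sketch, assuming pdfplumber is installed and that the first row of each detected table is a header row. The helper names `table_to_records` and `extract_tables` are hypothetical, not part of the skill.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def table_to_records(table: List[Row]) -> List[Dict[str, str]]:
    """Turn one table (a list of rows) into dicts keyed by the header row.

    Assumption: the first row holds the column names. pdfplumber reports
    empty cells as None, which is normalized to an empty string here.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def extract_tables(path: str) -> List[List[Row]]:
    """Collect every table pdfplumber detects, page by page."""
    import pdfplumber  # third-party; imported here so the helper above has no dependencies

    tables: List[List[Row]] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables
```

Detection quality depends heavily on how the PDF draws its tables, so the output is worth spot-checking on a few representative files before relying on it.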

### PDF generation (from HTML)

```python
# Python + pdfkit/weasyprint
# pdfkit wraps the wkhtmltopdf binary, which must be installed separately
import pdfkit

pdfkit.from_file('document.html', 'output.pdf')

# or
from weasyprint import HTML
HTML('document.html').write_pdf('output.pdf')
```

### Merging PDFs

```python
from PyPDF2 import PdfMerger

merger = PdfMerger()
merger.append('file1.pdf')
merger.append('file2.pdf')
merger.write('combined.pdf')
merger.close()
```
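
The feature list covers splitting as well as merging, but only merging is shown. A minimal sketch of splitting follows, assuming a PyPDF2 version that provides `PdfReader` and `PdfWriter.add_page`. The helper names `chunk_ranges` and `split_pdf` and the output file naming are hypothetical choices made for illustration.

```python
from pathlib import Path
from typing import List, Tuple


def chunk_ranges(page_count: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) page index pairs; end is exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, page_count))
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(src: str, out_dir: str, chunk_size: int = 1) -> List[Path]:
    """Write src as several PDFs of at most chunk_size pages each."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party; imported lazily

    reader = PdfReader(src)
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for start, end in chunk_ranges(len(reader.pages), chunk_size):
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        # File names use 1-based page numbers, e.g. combined_001-010.pdf
        target = target_dir / f"{Path(src).stem}_{start + 1:03d}-{end:03d}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

For example, `split_pdf('combined.pdf', 'parts', chunk_size=10)` would write ten-page files into a `parts` directory, with a shorter final file if the page count is not a multiple of ten.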

### Form processing

```python
from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader('form.pdf')
writer = PdfWriter()

# Set values on the form fields (keys must match the field names in the PDF)
writer.add_page(reader.pages[0])
writer.update_page_form_field_values(
    writer.pages[0],
    {'name': 'John Doe', 'email': '[email protected]'}
)

with open('filled_form.pdf', 'wb') as output:
    writer.write(output)
```
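
Filling only works when the dictionary keys match the field names defined in the PDF, so it can help to list those names first. The sketch below assumes PyPDF2's `PdfReader.get_fields()`, which returns `None` when the document has no form. The helper names `form_field_names` and `unknown_fields` are hypothetical.

```python
from typing import Dict, Iterable, List


def unknown_fields(available: Iterable[str], values: Dict[str, str]) -> List[str]:
    """Return the keys in values that the form does not define, sorted."""
    return sorted(set(values) - set(available))


def form_field_names(path: str) -> List[str]:
    """List the form field names found in a PDF (empty if it has no form)."""
    from PyPDF2 import PdfReader  # third-party; imported lazily

    fields = PdfReader(path).get_fields() or {}
    return list(fields)
```

A caller could run `unknown_fields(form_field_names('form.pdf'), values)` and stop with an error if the result is not empty, rather than silently writing a form with missing values.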

### JavaScript/Node.js

```javascript
// pdf-lib
const { PDFDocument } = require('pdf-lib');
const fs = require('fs');

async function createPdf() {
  const pdfDoc = await PDFDocument.create();
  const page = pdfDoc.addPage([600, 400]);

  page.drawText('Hello World!', {
    x: 50,
    y: 350,
    size: 30
  });

  const pdfBytes = await pdfDoc.save();
  fs.writeFileSync('output.pdf', pdfBytes);
}

// Run the example and report any failure
createPdf().catch(console.error);
```

## Libraries

### Python
- **PyPDF2**: Reading and writing PDFs
- **pdfplumber**: Table extraction
- **ReportLab**: PDF generation
- **WeasyPrint**: HTML to PDF
- **pdfkit**: wkhtmltopdf wrapper

### JavaScript/Node.js
- **pdf-lib**: Creating and editing PDFs
- **pdfjs-dist**: PDF parsing
- **puppeteer**: HTML to PDF
- **jsPDF**: PDF generation in the browser

### Go
- **gofpdf**: PDF generation
- **unidoc**: Commercial library

## Version information

- Skill version: 1.0.0
- Last updated: 2025-01-22

Overview

This skill processes, extracts, and generates PDF documents to streamline PDF workflows. It supports text and table extraction, form reading and filling, PDF creation from HTML/Markdown, splitting/merging, watermarking, encryption, and metadata editing. The goal is to make common PDF tasks scriptable across Python, Node.js, and Go toolchains. It’s practical for automation, data extraction, and document production pipelines.

How this skill works

The skill documents how to read PDFs, extract page-level text and structured tables, and manipulate form fields. It also covers generating new PDFs from HTML/Markdown or by drawing content programmatically, then applying watermarks, passwords, and metadata. Typical implementations use libraries like PyPDF2/pdfplumber/WeasyPrint in Python, pdf-lib/puppeteer in Node.js, or gofpdf/unidoc in Go. The examples show basic read, write, merge, and form-fill flows that can be integrated into scripts or services.

When to use it

  • Automating extraction of text and tables from many PDF files
  • Filling or generating PDF forms for data collection or reporting
  • Converting HTML or Markdown documents into printable PDFs
  • Merging, splitting, watermarking, or encrypting PDFs for distribution
  • Editing PDF metadata and applying consistent document properties

Best practices

  • Prefer pdfplumber or table-specific parsers for reliable table extraction rather than raw text parsing
  • Use HTML-to-PDF tools (WeasyPrint, wkhtmltopdf/puppeteer) for complex layout fidelity
  • When filling forms, validate field names from the source PDF before applying values
  • Apply watermarks and password protection last, after all other edits; any later change to a digitally signed PDF can invalidate its signature
  • Process large batches one file and one page at a time rather than holding every document's content in memory
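
The ordering advice above (watermark and encrypt last) can be sketched as follows. This assumes PyPDF2 with `PageObject.merge_page`, `PdfWriter.add_metadata`, and `PdfWriter.encrypt`, and a separate one-page watermark PDF of the same page size as the source. The names `info_dict` and `finalize_pdf` are hypothetical.

```python
from typing import Dict, Optional


def info_dict(title: str, author: Optional[str] = None) -> Dict[str, str]:
    """Build a PDF document-information dictionary; keys start with '/'."""
    info = {"/Title": title}
    if author:
        info["/Author"] = author
    return info


def finalize_pdf(src: str, dst: str, watermark_pdf: str,
                 password: str, title: str) -> None:
    """Stamp every page, set metadata, then encrypt as the final step."""
    from PyPDF2 import PdfReader, PdfWriter  # third-party; imported lazily

    stamp = PdfReader(watermark_pdf).pages[0]
    writer = PdfWriter()
    for page in PdfReader(src).pages:
        page.merge_page(stamp)  # draws the watermark content over the page
        writer.add_page(page)

    writer.add_metadata(info_dict(title))
    writer.encrypt(password)  # last, so earlier steps work on plain content
    with open(dst, "wb") as handle:
        writer.write(handle)
```

Keeping encryption at the end means the earlier steps never have to open a protected file.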

Example use cases

  • Batch extract invoice line items from vendor PDFs into CSV using pdfplumber and a mapping script
  • Generate filled customer statements from HTML templates and send encrypted PDFs to recipients
  • Merge multiple report PDFs and add a draft watermark for internal review
  • Programmatically create certificates in Node.js with pdf-lib and export high-resolution PDFs
  • Automate form-filling workflows: read blank form fields, populate from a database, and save completed PDFs
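
The batch extraction use case above can be sketched with a generator, so that only one file's rows are in memory at a time. This is an illustration using the standard library only: the extraction step is passed in as a callable (`extract_rows`, a hypothetical parameter) so any parser can be plugged in, and `iter_rows` and `write_csv` are hypothetical helper names.

```python
import csv
from pathlib import Path
from typing import Callable, Iterable, Iterator, List


def iter_rows(
    pdf_dir: str,
    extract_rows: Callable[[Path], Iterable[List[str]]],
) -> Iterator[List[str]]:
    """Yield rows from each PDF in a directory, prefixed with the file name."""
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        try:
            for row in extract_rows(pdf_path):
                yield [pdf_path.name, *row]
        except Exception as exc:  # broad on purpose: one bad file should not stop the batch
            print(f"skipped {pdf_path.name}: {exc}")


def write_csv(rows: Iterable[List[str]], out_path: str, header: List[str]) -> int:
    """Stream rows into a CSV file and return how many data rows were written."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

Because `iter_rows` is lazy, passing it straight to `write_csv` processes the directory file by file without building a full list first.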

FAQ

Which libraries are recommended for text vs. layout conversion?

Use PyPDF2 for plain text extraction and pdfplumber for tables; use WeasyPrint, pdfkit (wkhtmltopdf), or puppeteer for converting HTML to visually accurate PDFs.

Can this handle scanned PDFs (images)?

Scanned PDFs need OCR before reliable text extraction; integrate an OCR engine (Tesseract or cloud OCR) to convert images to searchable text prior to parsing.
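
As a rough illustration of the OCR step, the sketch below assumes the third-party packages pdf2image (which needs Poppler) and pytesseract (which needs the Tesseract binary). Neither is named in the skill itself. The 20-character threshold and the helper names `needs_ocr` and `ocr_pdf` are hypothetical.

```python
from typing import List, Optional


def needs_ocr(extracted_text: Optional[str], min_chars: int = 20) -> bool:
    """Heuristic: a page with almost no extractable text is probably a scan."""
    return len((extracted_text or "").strip()) < min_chars


def ocr_pdf(path: str, dpi: int = 300, lang: str = "eng") -> List[str]:
    """Render each page to an image and run Tesseract on it."""
    from pdf2image import convert_from_path  # third-party; requires Poppler
    import pytesseract  # third-party; requires the Tesseract binary

    images = convert_from_path(path, dpi=dpi)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]
```

One possible flow is to try normal text extraction first and call `ocr_pdf` only when `needs_ocr` returns `True`, since OCR is much slower than reading an existing text layer.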