pdf-ocr skill

/skills/dadaniya99/pdf-ocr

This skill extracts text from scanned PDFs using OCR to make documents searchable and editable.

npx playbooks add skill openclaw/skills --skill pdf-ocr

Files (5)

SKILL.md

2.1 KB

---
name: pdf-ocr
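description: Converts scanned PDFs to Word documents. Supports Chinese OCR, automatically crops headers and footers, preserves illustrations, and keeps color chapter cover pages as images. Uses the Baidu OCR API (free quota of 1,000 calls per month). Triggered when the user asks to convert a scanned PDF to text or Word.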
---

# Scanned PDF OCR Conversion Skill 📄

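## Configuration
- **Baidu OCR API Key**: `<your API Key>`
- **Baidu OCR Secret Key**: `<your Secret Key>`
- **Free quota**: **1,000 calls/month** (1 call = 1 page); a document of 592 pages or fewer can be processed in a single run within the free quota
- **Endpoint**: General Text Recognition (High Accuracy) `accurate_basic`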
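
The configuration above maps onto two HTTP requests: one that exchanges the API Key and Secret Key for an access token, and one that posts a page image to `accurate_basic`. The following is a minimal sketch of how those requests could be assembled, not the skill's actual code. The helper names are hypothetical, the URLs and parameter names reflect the Baidu OCR REST interface as commonly documented and should be checked against the current Baidu documentation, and nothing is sent over the network here.

```python
import base64
from urllib.parse import urlencode

# Endpoints as commonly documented for Baidu AI OCR (verify before use).
TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"
OCR_URL = "https://aip.baidubce.com/rest/2.0/ocr/v1/accurate_basic"


def build_token_url(api_key: str, secret_key: str) -> str:
    """Return the URL that exchanges the key pair for an access token."""
    query = urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,         # the API Key
        "client_secret": secret_key,  # the Secret Key
    })
    return f"{TOKEN_URL}?{query}"


def build_ocr_request(access_token: str, image_bytes: bytes) -> tuple:
    """Return (url, form_body) for one page image sent to accurate_basic.

    The body is form-encoded: the image is base64-encoded first,
    then URL-encoded as the `image` field.
    """
    url = f"{OCR_URL}?{urlencode({'access_token': access_token})}"
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    body = urlencode({"image": image_b64}).encode("ascii")
    return url, body
```

Each call to the OCR URL counts as one page against the monthly quota. Reading the keys from environment variables rather than storing them in the skill file keeps them out of version control.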

## Installing dependencies
```bash
pip install pymupdf python-docx pillow
```

## Usage

```bash
python3 {baseDir}/scripts/pdf_to_docx.py <pdf_path> [output_dir]
```

The output file is written to `[output_dir]/xxx_全文_ocr.docx`. If the file is large, compress its images with the script:

```bash
python3 {baseDir}/scripts/compress_docx.py <docx_path> <output_path>
```
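
The case study in this document reports images downscaled to 600 px wide at 60% quality. As a rough sketch of the sizing step only, the hypothetical helper below computes the target dimensions while preserving the aspect ratio. It is an assumption about how such a compressor might work, not the contents of `compress_docx.py`.

```python
def target_size(width: int, height: int, max_width: int = 600) -> tuple:
    """Return (width, height) scaled down so the width is at most max_width.

    Images that are already narrow enough are left unchanged, so the
    function never upscales.
    """
    if width <= max_width:
        return (width, height)
    scale = max_width / width
    return (max_width, max(1, round(height * scale)))


# With Pillow, the result could then be applied as
#   img.resize(target_size(*img.size)).save(path, "JPEG", quality=60)
```

For example, a 2400 × 3200 page scan would be reduced to 600 × 800 before being re-saved.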

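## Processing strategy

| Page type | Detection | Handling |
|---------|---------|---------|
| Body page | Default | Crop the top 6% (header) and bottom 4% (footer), then run OCR on the text |
| Illustration page | OCR returns no text | Kept as an image embedded in the Word document |
| Color cover/chapter page | Colored pixels exceed 25% of the page | Kept as an image with a gray annotation |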
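
The table can be read as a small decision procedure. The sketch below is one possible interpretation in plain Python, with hypothetical function names. The 6%, 4%, and 25% values come from the table. The rule for what counts as a "colored" pixel (a channel spread above 30) and the order of the checks are assumptions, since this document does not specify them.

```python
HEADER_RATIO = 0.06          # top 6% cropped as header
FOOTER_RATIO = 0.04          # bottom 4% cropped as footer
COLOR_PAGE_THRESHOLD = 0.25  # more than 25% colored pixels -> cover page
CHANNEL_SPREAD = 30          # assumption: max(r,g,b) - min(r,g,b) above this is "colored"


def crop_box(width, height, header=HEADER_RATIO, footer=FOOTER_RATIO):
    """Return (left, top, right, bottom) with header and footer removed."""
    top = round(height * header)
    bottom = height - round(height * footer)
    return (0, top, width, bottom)


def color_ratio(pixels, spread=CHANNEL_SPREAD):
    """Fraction of (r, g, b) pixels that are not close to gray."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    colored = sum(1 for r, g, b in pixels if max(r, g, b) - min(r, g, b) > spread)
    return colored / len(pixels)


def classify_page(pixels, ocr_text):
    """Return 'color_cover', 'illustration', or 'body' for one page."""
    if color_ratio(pixels) > COLOR_PAGE_THRESHOLD:
        return "color_cover"    # keep as image, add gray annotation
    if not ocr_text.strip():
        return "illustration"   # OCR returned nothing: embed as image
    return "body"               # use the recognized text
```

Checking the color ratio before calling OCR would avoid spending quota on cover pages, which is why the sketch tests it first.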

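## Known limitations
- **Mixed text-and-image pages** (charts containing text): OCR picks up the text inside charts as body text, which then needs manual replacement
  - Workaround: the user locates the problem pages, reports their PDF page numbers, and replaces them manually with screenshots
- **White-background table-of-contents pages**: not detected as special pages, so they go through OCR (with mediocre results)
  - Workaround: after conversion, manually replace the table-of-contents pages with images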

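## Case study (《预测之书》, 592 pages)
- Processing time: about 20 minutes (including the 0.6 s/page interval)
- Raw output size: 303 MB (144 embedded images)
- Compressed size: 3.4 MB (images downscaled to 600 px wide, quality 60%)
- Recognition quality: high accuracy on body text; chart pages need manual handling
- Progress is saved automatically every 50 pages to guard against a mid-run crash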
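
The periodic save mentioned in the last point can be sketched as follows. This is an illustrative pattern under stated assumptions, not the script's implementation: `recognize` and `save` are hypothetical callables supplied by the caller, and the real script may store its progress differently.

```python
def process_with_checkpoints(pages, recognize, save, every=50):
    """Run recognize() on each page and call save() every `every` pages.

    save() receives the list of results accumulated so far, so a crash
    loses at most `every - 1` pages of work.
    """
    results = []
    for index, page in enumerate(pages, start=1):
        results.append(recognize(page))
        if index % every == 0:
            save(results)
    # Save the final partial batch, if any pages remain unsaved.
    if len(results) % every != 0:
        save(results)
    return results
```

For a 592-page document this would save after pages 50, 100, and so on up to 550, plus once more at the end.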

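## Notes
- The free tier is limited to QPS=2; the script already adds a 0.6-second interval per page
- The crop ratios (6% header / 4% footer) can be adjusted at the top of the script
- After OCR finishes, spot-check a few pages to verify accuracy
- The original high-resolution version stays on the server; the compressed version is for distribution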
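
The 0.6-second spacing keeps the request rate at about 1.67 per second, under the 2 QPS limit. A minimal sketch of such throttling is shown below. The generator name and its injectable `sleep` and `clock` parameters are illustrative choices, not taken from the script.

```python
import time


def throttled(items, interval=0.6, sleep=time.sleep, clock=time.monotonic):
    """Yield items so that consecutive yields are at least `interval` seconds apart.

    The wait is measured from the time the previous item was handed out.
    """
    last = None
    for item in items:
        if last is not None:
            wait = interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item
```

A page loop could then be written as `for page in throttled(pages): ...`. Across 592 pages the spacing alone accounts for roughly 355 seconds of the reported 20 minutes.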

Overview

This skill extracts text from scanned documents and image-based PDFs using optical character recognition. It converts image PDFs into searchable files, produces plain text or structured outputs, and supports batch processing. It handles printed and typed text well and offers limited handwritten recognition.

How this skill works

The skill runs OCR on each page, detects language, and returns a text layer or a searchable PDF. It can preserve layout, extract tables and form fields, and report per-page confidence metrics. Pre-processing steps like deskewing, contrast adjustment, and noise reduction improve results.

When to use it

Digitize paper archives to searchable text or PDFs
Extract form fields, tables, or plain text from scanned documents
Batch-process many image-based PDFs for indexing or backup
Prepare documents for full-text search or data extraction
Convert receipts, invoices, or printed reports into editable text

Best practices

Scan at 300 DPI minimum (600 DPI for small text) and use high contrast
Deskew and crop pages; remove noise and shadows before OCR
Specify language or allow auto-detection for mixed-language documents
Validate low-confidence pages manually, especially handwriting and cursive
Choose structured extraction for tables/forms and plain text for general content
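
As an illustration of the manual-validation practice above, the following sketch picks out pages whose confidence falls below a cutoff. The function name, the 0.8 threshold, and the page-to-score mapping are all hypothetical. Whether per-page confidence is available depends on the OCR backend in use.

```python
def pages_needing_review(confidences, threshold=0.8):
    """Return page numbers, in ascending order, whose score is below threshold.

    `confidences` maps page number -> confidence in the range 0.0 to 1.0.
    """
    return [page for page, score in sorted(confidences.items()) if score < threshold]
```

Given `{1: 0.97, 2: 0.62, 3: 0.8, 4: 0.41}`, pages 2 and 4 would be flagged for a manual check.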

Example use cases

Create searchable PDFs from legacy scanned books for a digital library
Extract names, dates, and fields from filled paper forms for RPA workflows
Batch-OCR invoices and receipts to feed accounting systems
Convert meeting notes or typed reports into editable documents
Extract table data from printed reports for spreadsheet import

FAQ

How accurate is the OCR?

Accuracy depends on document quality and type: typed documents often exceed 95%; printed books and forms are usually 80–95%; handwriting and decorative fonts are much lower and may need review.

Which output formats are available?

You can get plain text, structured extraction (fields, sections, tables), or a searchable PDF with a text layer and a processing summary.