extracting-pdf-text skill

This skill extracts text from PDFs for LLM ingestion. It supports PyMuPDF, pdfplumber, local OCR, and the Mistral OCR API, and targets RAG workflows.

npx playbooks add skill letta-ai/skills --skill extracting-pdf-text

---
name: extracting-pdf-text
description: Extract text from PDFs for LLM consumption. Use when processing PDFs for RAG, document analysis, or text extraction. Supports API services (Mistral OCR) and local tools (PyMuPDF, pdfplumber). Handles text-based PDFs, tables, and scanned documents with OCR.
---

# Extracting PDF Text for LLMs

This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.

## Quick Decision Guide

| PDF Type | Best Approach | Script |
|----------|--------------|--------|
| Simple text PDF | PyMuPDF | `scripts/extract_pymupdf.py` |
| PDF with tables | pdfplumber | `scripts/extract_pdfplumber.py` |
| Scanned/image PDF (local) | pytesseract | `scripts/extract_with_ocr.py` |
| Complex layout, highest accuracy | Mistral OCR API | `scripts/extract_mistral_ocr.py` |
| End-to-end RAG pipeline | marker-pdf | `pip install marker-pdf` |

## Recommended Workflow

1. **Try PyMuPDF first** - fastest, handles most text-based PDFs well
2. **If tables are mangled** - switch to pdfplumber
3. **If scanned/image-based** - use Mistral OCR API (best accuracy) or local OCR (free but slower)
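
The three steps above can be sketched as a small decision function. This is only an illustration: `PdfTraits` and `choose_script` are hypothetical names that do not exist in the skill, and the three boolean fields are assumptions about what a caller would already know about the PDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PdfTraits:
    """Hypothetical summary of a PDF; the fields are illustrative, not part of the skill."""
    has_text_layer: bool   # False for scanned or image-only PDFs
    tables_mangled: bool   # True if the PyMuPDF output garbled the tables
    api_available: bool    # True if a Mistral API key is configured


def choose_script(traits: PdfTraits) -> str:
    """Map the three workflow steps onto the script paths from the decision guide."""
    if not traits.has_text_layer:
        # Step 3: scanned input needs OCR; prefer the API when it is available.
        if traits.api_available:
            return "scripts/extract_mistral_ocr.py"
        return "scripts/extract_with_ocr.py"
    if traits.tables_mangled:
        # Step 2: tables came out wrong, so switch to pdfplumber.
        return "scripts/extract_pdfplumber.py"
    # Step 1: default to the fastest extractor.
    return "scripts/extract_pymupdf.py"
```

For example, `choose_script(PdfTraits(has_text_layer=False, tables_mangled=False, api_available=False))` selects the local OCR script.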

## Local Extraction (No API Required)

### PyMuPDF - Fast General Extraction

Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.

```bash
uv run scripts/extract_pymupdf.py input.pdf output.md
```

The script outputs markdown with preserved headings and paragraphs. For LLM-optimized output, it uses `pymupdf4llm`, which formats text for RAG systems.
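
The script source is not shown here, so the following is only a minimal sketch of a `pymupdf4llm` conversion, not the script's actual contents. The helper names `default_output_path` and `pdf_to_markdown` are hypothetical, and writing the output next to the input is an assumption.

```python
from pathlib import Path


def default_output_path(pdf_path: str) -> Path:
    """Place the markdown file next to the PDF, swapping the suffix to .md."""
    return Path(pdf_path).with_suffix(".md")


def pdf_to_markdown(pdf_path: str) -> Path:
    """Convert one PDF to markdown and return the path that was written."""
    # Imported lazily so the helper above works without the dependency installed.
    import pymupdf4llm

    markdown = pymupdf4llm.to_markdown(pdf_path)  # returns one markdown string
    out_path = default_output_path(pdf_path)
    out_path.write_text(markdown, encoding="utf-8")
    return out_path
```

Calling `pdf_to_markdown("input.pdf")` would write `input.md` alongside the source file.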

### pdfplumber - Table Extraction

Best for: PDFs with tables, financial documents, structured data.

```bash
uv run scripts/extract_pdfplumber.py input.pdf output.md
```

Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
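
As a rough illustration of the table-to-markdown step, the sketch below renders the row lists that pdfplumber's `extract_tables()` returns. The function names are hypothetical, and treating the first row as the header is an assumption that will not hold for every table.

```python
def table_to_markdown(rows):
    """Render rows (lists of cell values, None for empty cells) as a markdown table."""
    if not rows:
        return ""
    # Normalise cells: None becomes "", and newlines or pipes would break the table.
    clean = [
        [("" if cell is None else str(cell)).replace("\n", " ").replace("|", "\\|") for cell in row]
        for row in rows
    ]
    width = max(len(row) for row in clean)
    clean = [row + [""] * (width - len(row)) for row in clean]  # pad ragged rows
    header, body = clean[0], clean[1:]  # assumption: first row is the header
    lines = ["| " + " | ".join(header) + " |", "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path):
    """Return one markdown table per table that pdfplumber finds in the PDF."""
    import pdfplumber  # lazy import: only needed when a PDF is actually read

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                tables.append(table_to_markdown(rows))
    return tables
```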

### Local OCR - Scanned Documents

Best for: Scanned PDFs when API access is unavailable.

```bash
uv run scripts/extract_with_ocr.py input.pdf output.txt
```

Requires: `pytesseract`, `pdf2image`, and Tesseract installed (`brew install tesseract` on macOS).
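
A minimal sketch of how these pieces typically fit together, not the script's actual contents: `pdf2image` rasterizes each page (it relies on Poppler being installed), and `pytesseract` runs Tesseract on each image. The function names, the 300 DPI default, and the page-marker format are assumptions chosen for illustration.

```python
def join_pages(page_texts):
    """Combine per-page OCR text, marking page boundaries for later chunking."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize each page and OCR it with Tesseract; returns the combined text."""
    # Lazy imports so join_pages can be used without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return join_pages(pytesseract.image_to_string(image, lang=lang) for image in images)
```

Higher DPI values generally improve recognition of small print at the cost of speed and memory.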

## API-Based Extraction

### Mistral OCR API

Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.

**Pricing**: about 1,000 pages per dollar, i.e. roughly $0.001 per page

```bash
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md
```

Features:
- Outputs clean markdown
- Preserves document structure (headings, lists, tables)
- Handles images, math equations, multilingual text
- 95%+ accuracy on complex documents

For detailed API options and other services, see [references/api-services.md](references/api-services.md).
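
For orientation, here is a hedged sketch of an OCR request using the `mistralai` Python SDK. It is not the script's actual contents. The import path, the `client.ocr.process` call, the `mistral-ocr-latest` model name, and the base64 data-URL payload are assumptions based on the SDK as documented at the time of writing; check the current Mistral documentation before relying on them. The helper names are hypothetical.

```python
import base64
import os


def build_document_payload(pdf_bytes: bytes) -> dict:
    """Wrap raw PDF bytes as a base64 data URL (assumed request shape)."""
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{encoded}",
    }


def mistral_ocr_to_markdown(pdf_path: str) -> str:
    """Send one PDF to the OCR endpoint and join the per-page markdown."""
    from mistralai import Mistral  # assumed SDK import path; lazy so the helper above stays usable

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    with open(pdf_path, "rb") as handle:
        payload = build_document_payload(handle.read())
    # Assumed call and model name; verify against the current API reference.
    response = client.ocr.process(model="mistral-ocr-latest", document=payload)
    return "\n\n".join(page.markdown for page in response.pages)
```

Large files may be better uploaded first and referenced by URL rather than inlined as base64.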

## Output Format Recommendations

For LLM consumption, markdown is preferred:
- Preserves semantic structure (headings become context boundaries)
- Tables remain readable
- Compatible with most RAG chunking strategies
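
To make the "headings become context boundaries" point concrete, here is a small sketch of heading-based chunking. `chunk_by_heading` is a hypothetical helper, and splitting at every ATX heading is one simple strategy among many; production chunkers usually also cap chunk size.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+\S")  # ATX headings such as "# Title" or "### Part"
FENCE = "`" * 3                        # code-fence marker, built this way to keep the example fence-safe


def chunk_by_heading(markdown: str):
    """Split markdown into chunks, starting a new chunk at every heading line."""
    chunks, current = [], []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # ignore "#" lines inside code fences
        if HEADING.match(line) and not in_fence and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Each chunk keeps its heading as the first line, so the embedded text carries its own section context.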

For detailed comparisons of local tools, see [references/local-tools.md](references/local-tools.md).

Overview

This skill extracts text from PDFs and prepares it for large language model consumption. It supports fast local extractors (PyMuPDF, pdfplumber), local OCR for scanned files, and a higher-accuracy API option (Mistral OCR). Outputs are optimized for retrieval-augmented generation (RAG) workflows by emitting clean markdown that preserves headings, lists, and tables.

How this skill works

The skill matches the extractor to the PDF type: PyMuPDF for general text, pdfplumber for table-heavy pages, and Tesseract-based local OCR for scanned images. For complex layouts or when the highest accuracy is needed, it can call the Mistral OCR API, which returns structured markdown that preserves document structure and images. The final output is formatted to work well with the chunking and embedding pipelines used in LLM applications.
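
One way to tell a text-based PDF from a scanned one is to count the extractable characters on each page. The sketch below is an assumption about how such a check could look, not the skill's own logic: the function names, the 20-character threshold, and the majority rule are all illustrative.

```python
def looks_scanned(chars_per_page, min_chars=20):
    """Treat a PDF as scanned when most pages carry almost no extractable text."""
    if not chars_per_page:
        return True  # nothing extractable at all
    sparse = sum(1 for count in chars_per_page if count < min_chars)
    return sparse > len(chars_per_page) / 2


def count_chars_per_page(pdf_path):
    """Return the number of extractable characters on each page."""
    try:
        import pymupdf  # current import name
    except ImportError:
        import fitz as pymupdf  # older PyMuPDF releases use this name

    with pymupdf.open(pdf_path) as doc:
        return [len(page.get_text().strip()) for page in doc]
```

A caller could route the file to OCR when `looks_scanned(count_chars_per_page("input.pdf"))` is true and to PyMuPDF otherwise.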

When to use it

  • Processing text-based PDFs for ingestion into RAG or document search systems
  • Extracting tables and structured data from financial or tabular PDFs
  • Handling scanned or image-based PDFs where local OCR or API OCR is needed
  • Needing clean markdown output that preserves headings and lists for LLM context windows
  • Choosing between local (free) and API (higher accuracy and layout preservation) extraction options

Best practices

  • Start with PyMuPDF for speed and broad coverage; switch to pdfplumber if tables are misaligned
  • Use local OCR (pytesseract + pdf2image) when APIs aren’t available, but expect slower processing and slightly lower accuracy
  • Use Mistral OCR API for complex layouts, multilingual text, math, or when top accuracy is required
  • Output markdown for LLMs to preserve semantic boundaries and improve chunking and retrieval
  • Validate a few pages manually to confirm extractor choice before batch-processing large archives

Example use cases

  • Building a RAG pipeline: extract PDF content to markdown, chunk and embed for retrieval
  • Financial data extraction: convert tables to markdown for downstream parsing and analysis
  • Digitizing scanned archives: OCR scanned PDFs and preserve headings for searchable corpora
  • Academic paper processing: extract sections and equations accurately using Mistral OCR for citation analysis
  • Compliance review: extract and index contractual text for quick lookup and automated QA

FAQ

Which extractor should I try first?

Try PyMuPDF first for most text-based PDFs. If tables come out mangled, try pdfplumber; if the pages are scanned, switch to OCR.

When should I use the Mistral OCR API?

Use Mistral OCR when documents have complex layouts, multilingual content, images/equations, or when you need the highest accuracy and structure preservation.