---
name: docling
description: Docling document parser for PDF, DOCX, PPTX, HTML, images, and 15+ formats. Use when parsing documents, extracting text, converting to Markdown/HTML/JSON, chunking for RAG pipelines, or batch processing files. Triggers on DocumentConverter, convert, convert_all, export_to_markdown, HierarchicalChunker, HybridChunker, ConversionResult.
---
# Docling Document Parser
Docling is a document parsing library that converts PDFs, Word documents, PowerPoint, images, and other formats into structured data with advanced layout understanding.
## Quick Start
Basic document conversion:
```python
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869"  # URL, local path, or DocumentStream
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
```
## Core Concepts
### DocumentConverter
The main entry point for document conversion. Supports various input formats and conversion options.
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Basic converter (all formats enabled)
converter = DocumentConverter()

# Restricted formats
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX]
)

# Custom pipeline options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```
### ConversionResult
All conversion operations return a `ConversionResult` containing:
- `document`: The parsed `DoclingDocument`
- `status`: `ConversionStatus.SUCCESS`, `PARTIAL_SUCCESS`, or `FAILURE`
- `errors`: List of errors encountered during conversion
- `input`: Information about the source document
```python
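from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")
if result.status == ConversionStatus.SUCCESS:
    markdown = result.document.export_to_markdown()
    html = result.document.export_to_html()
    data = result.document.export_to_dict()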
```
## Supported Formats
### Input Formats
- **Documents**: PDF, DOCX, PPTX, XLSX
- **Markup**: HTML, Markdown, AsciiDoc
- **Data**: CSV, JSON (Docling format)
- **Images**: PNG, JPEG, TIFF, BMP, WEBP
- **Audio**: WAV, MP3
- **Video Text**: WebVTT
- **Schema-specific**: USPTO XML, JATS XML, METS-GBS
### Output Formats
- **Markdown**: `export_to_markdown()` or `save_as_markdown()`
- **HTML**: `export_to_html()` or `save_as_html()`
- **JSON**: `export_to_dict()` or `save_as_json()` (note: no `export_to_json()` method)
- **Text**: `export_to_text()` or `export_to_markdown(strict_text=True)` or `save_as_markdown(strict_text=True)`
- **DocTags**: `export_to_doctags()` or `save_as_doctags()`
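
The export methods listed above can be selected at runtime by name. The following is a minimal sketch, not part of Docling itself: `EXPORT_METHODS`, its short format labels, and `export_document` are hypothetical names introduced here, and the function only assumes the object it receives exposes the methods named in the list.

```python
from typing import Any, Dict

# Short labels (hypothetical, chosen for this sketch) mapped to the
# DoclingDocument export methods named in the list above.
EXPORT_METHODS: Dict[str, str] = {
    "markdown": "export_to_markdown",
    "html": "export_to_html",
    "json": "export_to_dict",
    "text": "export_to_text",
    "doctags": "export_to_doctags",
}


def export_document(document: Any, fmt: str) -> Any:
    """Call the export method matching `fmt` on a DoclingDocument-like object."""
    if fmt not in EXPORT_METHODS:
        raise ValueError(
            f"Unsupported format {fmt!r}; expected one of {sorted(EXPORT_METHODS)}"
        )
    # Look the method up by name and call it with its default arguments.
    return getattr(document, EXPORT_METHODS[fmt])()
```

With a real conversion result, usage would look like `export_document(result.document, "markdown")`. Note that the `"json"` label returns a dictionary, since there is no `export_to_json()` method.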
## Common Patterns
### Single File Conversion
```python
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
# Export to different formats
markdown = result.document.export_to_markdown()
html = result.document.export_to_html()
json_data = result.document.export_to_dict()
# Or save directly to file
result.document.save_as_markdown("output.md")
result.document.save_as_html("output.html")
result.document.save_as_json("output.json")
```
### Batch Processing
See [references/batch.md](references/batch.md) for details on `convert_all()`.
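
As a rough illustration of how `convert_all()` might be driven from a directory, the sketch below gathers input files and writes one Markdown file per successful result. `collect_sources`, `convert_directory`, and `DEFAULT_SUFFIXES` are hypothetical helpers, not Docling APIs; the `converter` argument is assumed to be a `DocumentConverter`, and the status is compared by enum member name to keep the sketch free of Docling imports.

```python
from pathlib import Path
from typing import Any, Iterable, List

# Extensions to pick up; a hypothetical selection for this sketch.
DEFAULT_SUFFIXES = (".pdf", ".docx", ".pptx")


def collect_sources(input_dir: Path, suffixes: Iterable[str] = DEFAULT_SUFFIXES) -> List[Path]:
    """Return files directly inside input_dir with a matching suffix, sorted."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in input_dir.iterdir() if p.is_file() and p.suffix.lower() in wanted
    )


def convert_directory(converter: Any, input_dir: Path, output_dir: Path) -> List[Path]:
    """Convert every matching file and write one Markdown file per success."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    # raises_on_error=False keeps the batch going when one file fails.
    results = converter.convert_all(collect_sources(input_dir), raises_on_error=False)
    for result in results:
        # Compare by member name so ConversionStatus does not need importing.
        if result.status.name != "SUCCESS":
            continue
        target = output_dir / f"{result.input.file.stem}.md"
        result.document.save_as_markdown(target)
        written.append(target)
    return written
```

Only the top level of `input_dir` is scanned, so two inputs cannot collide on the same output stem unless they differ only by extension.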
### URL Conversion
```python
converter = DocumentConverter()
result = converter.convert("https://example.com/document.pdf")
```
### Binary Stream Conversion
```python
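from io import BytesIO

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Wrap the raw bytes in a DocumentStream; the name's extension helps format detection
with open("document.pdf", "rb") as f:
    buf = BytesIO(f.read())
source = DocumentStream(name="document.pdf", stream=buf)
result = converter.convert(source)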
```
### Format-Specific Options
```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
# Configure PDF-specific options
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options.lang = ["en", "es"]
pipeline_options.do_table_structure = True
pipeline_options.generate_page_images = True
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
```
### Resource Limits
```python
converter = DocumentConverter()
# Limit file size (bytes) and page count
result = converter.convert(
    "large_document.pdf",
    max_file_size=20_971_520,  # 20 MB
    max_num_pages=100
)
```
### Document Chunking
See [references/chunking.md](references/chunking.md) for RAG integration.
## DoclingDocument Structure
The `DoclingDocument` is a Pydantic model representing parsed content:
```python
# Access document structure
doc = result.document
# Content items (lists)
doc.texts # TextItem instances (paragraphs, headings, etc.)
doc.tables # TableItem instances
doc.pictures # PictureItem instances
doc.key_value_items # Key-value pairs
# Structure (tree nodes)
doc.body # Main content hierarchy
doc.furniture # Headers, footers, page numbers
doc.groups # Lists, chapters, sections
# Iterate all elements in reading order
for item, level in doc.iterate_items():
    text = getattr(item, "text", "")  # tables and pictures carry no .text
    print(f"{' ' * level}{item.label}: {text[:50]}")
```
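
To get a quick overview of what a parsed document contains, the reading-order iteration above can feed a simple tally. This is a minimal sketch: `summarize_labels` is a hypothetical helper, and it assumes only that `iterate_items()` yields `(item, level)` pairs and that each item has a `label` attribute, as in the loop shown above.

```python
from collections import Counter
from typing import Any


def summarize_labels(doc: Any) -> Counter:
    """Count items per label while walking the document in reading order."""
    counts: Counter = Counter()
    for item, _level in doc.iterate_items():
        label = getattr(item, "label", None)
        # Use the enum's value when the label is an enum member,
        # otherwise fall back to its string form.
        counts[getattr(label, "value", str(label))] += 1
    return counts
```

Calling `summarize_labels(result.document).most_common()` would then list the labels from most to least frequent.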
## Advanced Features
### OCR Configuration
```python
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    TesseractOcrOptions,
    TesseractCliOcrOptions,
    OcrMacOptions,
    RapidOcrOptions,
)
# EasyOCR (default)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(lang=["en", "de"])
# Tesseract
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng", "deu"])
# RapidOCR
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions()
```
### Table Extraction Options
```python
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    TableFormerMode,
)
pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
# Use cell matching (map to PDF cells)
pipeline_options.table_structure_options.do_cell_matching = True
# Or use predicted cells
pipeline_options.table_structure_options.do_cell_matching = False
# Choose accuracy mode
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
```
### Page Images
```python
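from docling_core.types.doc import ImageRefMode
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_page_images = True  # Needed for HTML export with images
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("document.pdf")

# Export with embedded images
result.document.save_as_html("output.html", image_mode=ImageRefMode.EMBEDDED)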
```
## Error Handling
```python
from docling.datamodel.base_models import ConversionStatus
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
# convert() raises on failure by default; disable that to inspect the status instead
result = converter.convert("document.pdf", raises_on_error=False)

if result.status == ConversionStatus.SUCCESS:
    print("Conversion successful")
elif result.status == ConversionStatus.PARTIAL_SUCCESS:
    print("Partial conversion:")
    for error in result.errors:
        print(f"  {error.error_message}")
else:  # FAILURE
    print("Conversion failed:")
    for error in result.errors:
        print(f"  {error.error_message}")
```
For batch processing with error handling:
```python
# Continue processing on errors
results = converter.convert_all(
    ["doc1.pdf", "doc2.pdf", "doc3.pdf"],
    raises_on_error=False
)
for result in results:
    if result.status == ConversionStatus.SUCCESS:
        result.document.save_as_markdown(f"{result.input.file.stem}.md")
    else:
        print(f"Failed: {result.input.file}")
```
## CLI Usage
```bash
# Basic conversion
docling document.pdf
# Convert to specific output
docling --to markdown document.pdf
# With custom model path
docling --artifacts-path /path/to/models document.pdf
# Using VLM pipeline
docling --pipeline vlm --vlm-model granite_docling document.pdf
```
## Reference Documentation
- [Parsing Options](references/parsing.md) - DocumentConverter initialization, format-specific options, OCR configuration
- [Batch Processing](references/batch.md) - convert_all(), error handling, concurrency patterns
- [Chunking](references/chunking.md) - HierarchicalChunker, HybridChunker, RAG integration
- [Output Formats](references/output.md) - export_to_markdown(), export_to_html(), export_to_dict(), document structure
## Key Types
- `DocumentConverter`: Main conversion class
- `ConversionResult`: Result of conversion with document and status
- `DoclingDocument`: Unified document representation (Pydantic model)
- `InputFormat`: Enum of supported input formats
- `ConversionStatus`: SUCCESS, PARTIAL_SUCCESS, FAILURE
- `PdfPipelineOptions`: Configuration for PDF pipeline
- `ImageRefMode`: EMBEDDED, REFERENCED, PLACEHOLDER
## Integration Examples
### LangChain
```python
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownTextSplitter
converter = DocumentConverter()
result = converter.convert("document.pdf")
markdown = result.document.export_to_markdown()
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_text(markdown)
```
### LlamaIndex
```python
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker
from llama_index.core import Document
converter = DocumentConverter()
result = converter.convert("document.pdf")
chunker = HybridChunker()
chunks = list(chunker.chunk(result.document))
documents = [
    Document(text=chunk.text, metadata=chunk.meta.export_json_dict())
    for chunk in chunks
]
```
## Notes
- Docling uses a synchronous API (no native async support)
- Models are downloaded automatically on first use (can be prefetched)
- Supports local execution for air-gapped environments
- Supports GPU acceleration for OCR and table detection
- Default models run on CPU; GPU requires configuration
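
Because the API is synchronous, calling `convert()` directly inside an event loop would block it. One common workaround is to push the call onto a worker thread. The sketch below assumes Python 3.9+ for `asyncio.to_thread`; `convert_async` and `convert_many` are hypothetical helpers, not Docling APIs. Whether a single `DocumentConverter` can safely be shared across threads is not established here, so the concurrency limit defaults to 1.

```python
import asyncio
from typing import Any, Iterable, List


async def convert_async(converter: Any, source: Any) -> Any:
    """Run the blocking converter.convert() call in a worker thread."""
    return await asyncio.to_thread(converter.convert, source)


async def convert_many(converter: Any, sources: Iterable[Any], limit: int = 1) -> List[Any]:
    """Convert several sources, running at most `limit` conversions at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(source: Any) -> Any:
        # The semaphore caps how many worker threads are busy converting.
        async with semaphore:
            return await convert_async(converter, source)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(s) for s in sources))
```

From synchronous code this could be driven with `asyncio.run(convert_many(converter, ["doc1.pdf", "doc2.pdf"]))`. The event loop stays responsive, but each conversion still occupies one thread for its full duration.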