markitdown skill

This skill converts diverse file formats to Markdown optimized for LLMs, preserving structure and enabling efficient analysis.

npx playbooks add skill microck/ordinary-claude-skills --skill markitdown

---
name: markitdown
description: Convert various file formats (PDF, Office documents, images, audio, web content, structured data) to Markdown optimized for LLM processing. Use when converting documents to markdown, extracting text from PDFs/Office files, transcribing audio, performing OCR on images, extracting YouTube transcripts, or processing batches of files. Supports 20+ formats including DOCX, XLSX, PPTX, PDF, HTML, EPUB, CSV, JSON, images with OCR, and audio with transcription.
---

# MarkItDown

## Overview

MarkItDown is a Python utility that converts many file formats to Markdown for use with large language models and text analysis pipelines. It preserves document structure (headings, lists, tables, hyperlinks) while producing clean, token-efficient output.

## When to Use This Skill

Use this skill when users request:
- Converting documents to Markdown format
- Extracting text from PDF, Word, PowerPoint, or Excel files
- Performing OCR on images to extract text
- Transcribing audio files to text
- Extracting YouTube video transcripts
- Processing HTML, EPUB, or web content to Markdown
- Converting structured data (CSV, JSON, XML) to readable Markdown
- Batch converting multiple files or ZIP archives
- Preparing documents for LLM analysis or RAG systems

## Core Capabilities

### 1. Document Conversion

Convert Office documents and PDFs to Markdown while preserving structure.

**Supported formats:**
- PDF files (with optional Azure Document Intelligence integration)
- Word documents (DOCX)
- PowerPoint presentations (PPTX)
- Excel spreadsheets (XLSX, XLS)

**Basic usage:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("document.pdf")
print(result.text_content)
```

**Command-line:**
```bash
markitdown document.pdf -o output.md
```

See `references/document_conversion.md` for detailed documentation on document-specific features.
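
The Python call and the `-o` flag above can be combined into one helper. This is a minimal sketch rather than part of MarkItDown: `convert_to_file` is a hypothetical name, and it assumes only that the converter exposes `convert(path)` and that the result has a `text_content` attribute, as in the basic usage example.

```python
from pathlib import Path


def convert_to_file(converter, source, out_dir):
    """Convert `source` and write the Markdown into `out_dir`.

    `converter` is any object with a `convert(path)` method whose result
    has a `text_content` attribute (for example, a MarkItDown instance).
    """
    source = Path(source)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    result = converter.convert(str(source))
    target = out_dir / (source.stem + ".md")
    target.write_text(result.text_content, encoding="utf-8")
    return target
```

With MarkItDown installed, `convert_to_file(MarkItDown(), "document.pdf", "markdown")` would write `markdown/document.md`.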

### 2. Media Processing

Extract text from images using OCR and transcribe audio files to text.

**Supported formats:**
- Images (JPEG, PNG, GIF, etc.) with EXIF metadata extraction
- Audio files with speech transcription (requires speech_recognition)

**Image with OCR:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("image.jpg")
print(result.text_content)  # Includes EXIF metadata and OCR text
```

**Audio transcription:**
```python
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("audio.wav")
print(result.text_content)  # Transcribed speech
```

See `references/media_processing.md` for advanced media handling options.
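
Because audio transcription depends on an optional package, it can help to check for it before converting. A minimal sketch, assuming the dependency is importable under the module name `speech_recognition`; `audio_support_available` is a hypothetical helper, not a MarkItDown function.

```python
import importlib.util


def audio_support_available():
    """Return True if the `speech_recognition` module can be imported."""
    return importlib.util.find_spec("speech_recognition") is not None
```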

### 3. Web Content Extraction

Convert web-based content and e-books to Markdown.

**Supported formats:**
- HTML files and web pages
- YouTube video transcripts (via URL)
- EPUB books
- RSS feeds

**YouTube transcript:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

See `references/web_content.md` for web extraction details.
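
Before passing a URL to `convert`, it can be useful to confirm that it actually points at a YouTube video. The sketch below uses only the standard library. `youtube_video_id` is a hypothetical helper, and the URL shapes it recognizes are an assumption; this section does not state which forms MarkItDown itself accepts.

```python
from urllib.parse import urlparse, parse_qs


def youtube_video_id(url):
    """Return the video ID from a YouTube watch or youtu.be URL, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]

    if host in ("youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            ids = parse_qs(parsed.query).get("v")
            return ids[0] if ids else None
        return None
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/")
        return video_id or None
    return None
```

A caller could skip the conversion, or report a clearer error, when the helper returns `None`.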

### 4. Structured Data Handling

Convert structured data formats to readable Markdown tables.

**Supported formats:**
- CSV files
- JSON files
- XML files

**CSV to Markdown table:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("data.csv")
print(result.text_content)  # Formatted as Markdown table
```

See `references/structured_data.md` for format-specific options.
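
To make the expected output shape concrete, here is a standard-library sketch of a CSV-to-Markdown-table conversion. It illustrates the general technique only. `csv_to_markdown` is a hypothetical function, and MarkItDown's actual formatting (padding, alignment, escaping) may differ.

```python
import csv
import io


def csv_to_markdown(text):
    """Render CSV text as a Markdown table, treating the first row as the header."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def line(cells):
        # Pad short rows and escape pipes so they do not break the table.
        cells = list(cells) + [""] * (width - len(cells))
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    header, *body = rows
    out = [line(header), "| " + " | ".join(["---"] * width) + " |"]
    out.extend(line(row) for row in body)
    return "\n".join(out)
```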

### 5. Advanced Integrations

Enhance conversion quality with AI-powered features.

**Azure Document Intelligence:**
For enhanced PDF processing with better table extraction and layout analysis:
```python
from markitdown import MarkItDown

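# The endpoint enables Document Intelligence. Authentication is configured
# separately; check the MarkItDown README for the credential options in your version.
md = MarkItDown(docintel_endpoint="<endpoint>")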
result = md.convert("complex.pdf")
```

**LLM-Powered Image Descriptions:**
Generate detailed image descriptions using GPT-4o:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
result = md.convert("presentation.pptx")  # Images described with LLM
```

See `references/advanced_integrations.md` for integration details.
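
Since both integrations are optional, one way to keep them off by default is to build the constructor arguments conditionally. This is a sketch under stated assumptions: `markitdown_kwargs` is a hypothetical helper, and it uses only the keyword names shown in the examples above (`docintel_endpoint`, `llm_client`, `llm_model`).

```python
def markitdown_kwargs(docintel_endpoint=None, llm_client=None, llm_model="gpt-4o"):
    """Build keyword arguments for MarkItDown, including only enabled integrations."""
    kwargs = {}
    if docintel_endpoint:
        kwargs["docintel_endpoint"] = docintel_endpoint
    if llm_client is not None:
        # The model name is only meaningful together with a client.
        kwargs["llm_client"] = llm_client
        kwargs["llm_model"] = llm_model
    return kwargs
```

The result can be passed as `MarkItDown(**markitdown_kwargs(...))`; with no arguments it yields a plain converter.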

### 6. Batch Processing

Process multiple files or entire ZIP archives at once.

**ZIP file processing:**
```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("archive.zip")
print(result.text_content)  # All files converted and concatenated
```

**Batch script:**
Use the provided batch processing script for directory conversion:
```bash
python scripts/batch_convert.py /path/to/documents /path/to/output
```

See `scripts/batch_convert.py` for implementation details.
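
The script itself is not reproduced here. As a rough illustration of directory conversion in general, and not the contents of `scripts/batch_convert.py`, a loop might look like the sketch below. `batch_convert`, its `suffixes` default, and the failure-collection behaviour are all assumptions made for the example.

```python
from pathlib import Path


def batch_convert(converter, input_dir, output_dir, suffixes=(".pdf", ".docx", ".html")):
    """Convert matching files under `input_dir`; return (converted, failed) lists."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    converted, failed = [], []

    for source in sorted(input_dir.rglob("*")):
        if not source.is_file() or source.suffix.lower() not in suffixes:
            continue
        # Mirror the input layout so same-named files in different folders do not collide.
        target = (output_dir / source.relative_to(input_dir)).with_suffix(".md")
        try:
            text = converter.convert(str(source)).text_content
        except Exception as exc:  # keep going; report failures at the end
            failed.append((source, exc))
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        converted.append(target)

    return converted, failed
```

Collecting failures instead of stopping at the first one lets a large batch finish and leaves a list of files to retry.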

## Installation

**Full installation (all features):**
```bash
uv pip install 'markitdown[all]'
```

**Modular installation (specific features):**
```bash
uv pip install 'markitdown[pdf]'           # PDF support
uv pip install 'markitdown[docx]'          # Word support
uv pip install 'markitdown[pptx]'          # PowerPoint support
uv pip install 'markitdown[xlsx]'          # Excel support
uv pip install 'markitdown[audio-transcription]'    # Audio transcription
uv pip install 'markitdown[youtube-transcription]'  # YouTube transcripts
```

**Requirements:**
- Python 3.10 or higher
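
A script that depends on this skill can check the interpreter version up front. A minimal sketch; `meets_python_requirement` is a hypothetical helper name.

```python
import sys


def meets_python_requirement(version_info=None, minimum=(3, 10)):
    """Return True if the interpreter version is at least `minimum`."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) >= minimum
```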

## Output Format

MarkItDown produces clean, token-efficient Markdown optimized for LLM consumption:
- Preserves headings, lists, and tables
- Maintains hyperlinks and formatting
- Includes metadata where relevant (EXIF, document properties)
- No temporary files created (streaming approach)

## Common Workflows

**Preparing documents for RAG:**
```python
from markitdown import MarkItDown

md = MarkItDown()

# Convert knowledge base documents
docs = ["manual.pdf", "guide.docx", "faq.html"]
markdown_content = []

for doc in docs:
    result = md.convert(doc)
    markdown_content.append(result.text_content)

# Now ready for embedding and indexing
```
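
Embedding pipelines usually work on chunks rather than whole documents. One common approach, shown here as a sketch and not as something MarkItDown provides, is to split the converted Markdown at its headings. `split_by_headings` and its `max_level` parameter are hypothetical.

```python
import re


def split_by_headings(markdown, max_level=2):
    """Split Markdown into (heading, body) chunks at headings up to `max_level`."""
    pattern = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    chunks, title, lines = [], None, []
    in_fence = False

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # do not treat '#' comments in code as headings
        match = None if in_fence else pattern.match(line)
        if match:
            if title is not None or any(l.strip() for l in lines):
                chunks.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)

    if title is not None or any(l.strip() for l in lines):
        chunks.append((title, "\n".join(lines).strip()))
    return chunks
```

Text before the first heading is returned with a heading of `None`, and deeper headings stay inside the body of their parent chunk.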

**Document analysis pipeline:**
```bash
# Convert all PDFs in the documents/ directory
mkdir -p markdown  # make sure the output directory exists
for file in documents/*.pdf; do
    markitdown "$file" -o "markdown/$(basename "$file" .pdf).md"
done
```

## Plugin System

MarkItDown supports extensible plugins for custom conversion logic. Plugins are disabled by default for security:

```python
from markitdown import MarkItDown

# Enable plugins if needed
md = MarkItDown(enable_plugins=True)
```

## Resources

This skill includes comprehensive reference documentation for each capability:

- **references/document_conversion.md** - Detailed PDF, DOCX, PPTX, XLSX conversion options
- **references/media_processing.md** - Image OCR and audio transcription details
- **references/web_content.md** - HTML, YouTube, and EPUB extraction
- **references/structured_data.md** - CSV, JSON, XML conversion formats
- **references/advanced_integrations.md** - Azure Document Intelligence and LLM integration
- **scripts/batch_convert.py** - Batch processing utility for directories

Overview

This skill converts a wide range of file types into clean, token-efficient Markdown optimized for LLM workflows. It preserves document structure (headings, lists, tables, links) and includes useful metadata like EXIF and document properties. Use it to prepare content for indexing, retrieval-augmented generation, or human review.

How this skill works

The converter ingests files (PDF, Office formats, HTML, EPUB, images, audio, CSV/JSON/XML, ZIP archives) and extracts text, layout, and metadata. It performs OCR on images, transcribes audio, pulls YouTube transcripts, and transforms structured data into readable Markdown tables. It can optionally integrate with document intelligence services and LLMs for improved table and layout extraction and for image descriptions.

When to use it

- Preparing documents for RAG or embedding pipelines
- Extracting text from PDFs, DOCX, PPTX, XLSX for analysis
- Transcribing audio recordings or extracting OCR text from images
- Converting web pages, EPUBs, or YouTube videos to Markdown
- Processing batches of files or ZIP archives for large-scale conversion

Best practices

- Strip or redact sensitive data before conversion when needed
- Choose modular installs to reduce dependencies (install only the feature extras you need)
- Use batch mode for many files to maintain consistent formatting and metadata capture
- Enable advanced integrations (Azure Document Intelligence or an LLM) only when higher-fidelity table and layout extraction or image descriptions are required
- Validate converted tables and spreadsheets for schema or numeric accuracy before downstream use
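
The last practice, validating converted tables, can be partly automated. The sketch below checks a simple pipe-delimited Markdown table for ragged rows and non-numeric values. `check_markdown_table` and `numeric_columns` are hypothetical names, and the parser deliberately ignores edge cases such as escaped pipes inside cells.

```python
def check_markdown_table(table, numeric_columns=()):
    """Return a list of problems found in a simple pipe-delimited Markdown table.

    Limitation: cells containing escaped pipes are not handled.
    """
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in table.strip().splitlines()
        if line.strip()
    ]
    if len(rows) < 2:
        return ["table needs a header row and a separator row"]

    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    missing = [name for name in numeric_columns if name not in header]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    for number, row in enumerate(body, start=1):
        if len(row) != len(header):
            problems.append(f"row {number}: expected {len(header)} cells, got {len(row)}")
            continue
        for name in numeric_columns:
            value = row[header.index(name)]
            try:
                float(value)
            except ValueError:
                problems.append(f"row {number}: {name!r} is not numeric: {value!r}")
    return problems
```

An empty list means no problems were detected by these checks; it does not prove the extracted values match the source document.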

Example use cases

- Convert a product manual PDF and a set of XLSX specs into Markdown for indexing into a vector DB
- OCR scanned receipts and extract expense tables for accounting review
- Transcribe meeting audio into Markdown for summarization
- Extract HTML guides and YouTube transcripts to assemble a searchable knowledge base
- Batch-convert a ZIP archive of legacy docs into a standardized Markdown corpus for migration

FAQ

Which file formats are supported?

Supported formats include PDF, DOCX, PPTX, XLSX/XLS, HTML, EPUB, CSV, JSON, XML, common image formats (JPEG, PNG, GIF) with OCR, and audio for transcription. ZIP archives and YouTube URLs are also supported.

Can I preserve complex tables and layout?

Yes. The tool preserves headings, lists, tables and links. For complex layouts or better table extraction you can enable Azure Document Intelligence or LLM-assisted processing to improve fidelity.

Is batch processing available?

Yes. There is a batch processing script and ZIP handling to convert multiple files consistently and concatenate outputs if desired.