markitdown-skill skill

This skill helps you convert diverse documents to clean, LLM-ready Markdown, preserving structure for efficient analysis and processing.

npx playbooks add skill julianobarbosa/claude-code-skills --skill markitdown-skill

---
name: markitdown-skill
description: Guide for using Microsoft MarkItDown - a Python utility for converting files to Markdown. Use when converting PDF, Word, PowerPoint, Excel, images, audio, HTML, CSV, JSON, XML, ZIP, YouTube URLs, EPubs, Jupyter notebooks, RSS feeds, or Wikipedia pages to Markdown format. Also use for document processing pipelines, LLM preprocessing, or text extraction tasks.
---

# MarkItDown Skill

Microsoft's Python utility for converting various file formats to Markdown
for LLM and text analysis pipelines.

## Overview

MarkItDown converts documents while preserving structure (headings, lists,
tables, links). Its output is optimized for LLM and text-analysis
consumption rather than for polished, human-facing presentation.

### Supported Formats

| Category | Formats |
|----------|---------|
| Documents | PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX, XLS) |
| Media | Images (EXIF + OCR), Audio (WAV, MP3 transcription) |
| Web | HTML, YouTube URLs, Wikipedia, RSS/Atom feeds |
| Data | CSV, JSON, XML, Jupyter notebooks (.ipynb) |
| Archives | ZIP (iterates contents), EPub |
| Email | Outlook MSG files |

## Quick Start

### Installation

```bash
# Full installation (recommended)
pip install 'markitdown[all]'

# Minimal with specific formats
pip install 'markitdown[pdf,docx,pptx]'

# Using uv
uv pip install 'markitdown[all]'
```

#### Optional Dependencies

| Extra | Description |
|-------|-------------|
| `[all]` | All optional dependencies |
| `[pdf]` | PDF file support |
| `[docx]` | Word documents |
| `[pptx]` | PowerPoint presentations |
| `[xlsx]` | Excel spreadsheets |
| `[xls]` | Legacy Excel files |
| `[outlook]` | Outlook MSG files |
| `[az-doc-intel]` | Azure Document Intelligence |
| `[audio-transcription]` | WAV/MP3 transcription |
| `[youtube-transcription]` | YouTube video transcripts |

### Command-Line Usage

```bash
# Basic conversion
markitdown document.pdf > output.md

# Specify output file
markitdown document.pdf -o output.md

# Pipe input
cat document.pdf | markitdown > output.md

# With Azure Document Intelligence
markitdown document.pdf -o output.md -d -e "<endpoint>"
```

### Python API

```python
from markitdown import MarkItDown

# Basic conversion
md = MarkItDown()
result = md.convert("document.xlsx")
print(result.text_content)

# With LLM for image descriptions
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image in detail"
)
result = md.convert("image.jpg")
print(result.text_content)

# With Azure Document Intelligence
md = MarkItDown(docintel_endpoint="<your-endpoint>")
result = md.convert("complex-document.pdf")
print(result.text_content)
```

## Common Use Cases

### Batch Convert Directory

```python
from markitdown import MarkItDown
from pathlib import Path

md = MarkItDown()
input_dir = Path("./documents")
output_dir = Path("./markdown")
output_dir.mkdir(exist_ok=True)

for file in input_dir.glob("*"):
    if file.is_file():
        try:
            result = md.convert(str(file))
            output_file = output_dir / f"{file.stem}.md"
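            # Write UTF-8 explicitly; the platform default encoding may not cover all characters
            output_file.write_text(result.text_content, encoding="utf-8")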
            print(f"Converted: {file.name}")
        except Exception as e:
            print(f"Failed: {file.name} - {e}")
```
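
One limitation of the loop above is that `report.pdf` and `report.docx` both map to `report.md`, so the later file overwrites the earlier one. The variant below is a sketch of one way to avoid that, and it also recurses into subdirectories. `unique_output_path` and `convert_tree` are hypothetical helpers, and the conversion step is passed in as a plain callable (`convert`, an assumption made so the helper can be tried without MarkItDown installed).

```python
from pathlib import Path
from typing import Callable


def unique_output_path(output_dir: Path, source: Path, taken: set) -> Path:
    """Pick an output path that does not collide with earlier outputs."""
    candidate = output_dir / f"{source.stem}.md"
    if candidate in taken:
        # Fall back to including the source extension, e.g. report_pdf.md
        candidate = output_dir / f"{source.stem}_{source.suffix.lstrip('.')}.md"
    base, counter = candidate, 2
    while candidate in taken:
        # Last resort: append a numeric suffix
        candidate = base.with_name(f"{base.stem}_{counter}.md")
        counter += 1
    taken.add(candidate)
    return candidate


def convert_tree(input_dir: Path, output_dir: Path,
                 convert: Callable[[str], str]) -> list:
    """Convert every file under input_dir; return the paths written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    taken, written = set(), []
    # sorted() fixes the order, so collision handling is deterministic
    for source in sorted(input_dir.rglob("*")):
        if not source.is_file():
            continue
        try:
            text = convert(str(source))
        except Exception as exc:  # keep going when one file fails
            print(f"Failed: {source.name} - {exc}")
            continue
        target = unique_output_path(output_dir, source, taken)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With MarkItDown installed and an `md = MarkItDown()` instance, the callable could be `lambda path: md.convert(path).text_content`.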

### Process for LLM Context

```python
from markitdown import MarkItDown

def prepare_for_llm(file_path: str) -> str:
    """Convert document to LLM-ready markdown."""
    md = MarkItDown()
    result = md.convert(file_path)

    # Add source reference
    content = f"# Source: {file_path}\n\n{result.text_content}"
    return content

# Use with your LLM
context = prepare_for_llm("report.pdf")
```
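
Converted documents can be longer than a model's context window. The sketch below shows one possible way to split the Markdown at heading boundaries and pack whole sections into size-limited chunks. `split_by_headings` and `pack_chunks` are hypothetical helpers, not MarkItDown functions, and the character budget is only a rough stand-in for a token count.

```python
import re

HEADING = re.compile(r"^#{1,6} ")


def split_by_headings(markdown: str) -> list:
    """Split Markdown into sections, each starting at a heading line."""
    sections, current = [], []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        if HEADING.match(line) and not in_fence and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections


def pack_chunks(markdown: str, max_chars: int = 4000) -> list:
    """Greedily pack whole sections into chunks of at most max_chars.

    A single section longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for section in split_by_headings(markdown):
        candidate = f"{current}\n\n{section}" if current else section
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = section
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent to the model separately, for example `pack_chunks(prepare_for_llm("report.pdf"))`.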

### Extract YouTube Transcript

```bash
# CLI
markitdown "https://www.youtube.com/watch?v=VIDEO_ID" > transcript.md
```

```python
# Python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://www.youtube.com/watch?v=VIDEO_ID")
print(result.text_content)
```

### Image OCR with AI Description

```python
from markitdown import MarkItDown
from openai import OpenAI

# Initialize with LLM support
client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o"
)

# Convert image with AI description
result = md.convert("screenshot.png")
print(result.text_content)
```

### Convert Jupyter Notebook

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("analysis.ipynb")
print(result.text_content)  # Code cells, outputs, markdown
```

### Extract Wikipedia Content

```python
from markitdown import MarkItDown

md = MarkItDown()
# /wiki/Python is a disambiguation page, so point at a specific article
result = md.convert("https://en.wikipedia.org/wiki/Python_(programming_language)")
print(result.text_content)  # Main article content only
```

### Parse RSS Feed

```python
from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("https://example.com/feed.xml")
print(result.text_content)  # Feed entries as markdown
```

## Plugin System

MarkItDown supports third-party plugins for extended functionality.

```bash
# List installed plugins
markitdown --list-plugins

# Enable plugins during conversion
markitdown --use-plugins document.pdf
```

```python
# Enable plugins in Python
md = MarkItDown(enable_plugins=True)
result = md.convert("document.pdf")
```

> Search GitHub for `#markitdown-plugin` to find available plugins.
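
Python packages commonly advertise plugins through entry points. The sketch below lists the entry points registered under a group named `markitdown.plugin`. That group name is an assumption here and should be checked against the MarkItDown plugin documentation. `installed_plugins` is a hypothetical helper, and `markitdown --list-plugins` remains the authoritative check.

```python
from importlib.metadata import entry_points

# Assumed entry-point group name; verify against the MarkItDown docs.
PLUGIN_GROUP = "markitdown.plugin"


def installed_plugins(group: str = PLUGIN_GROUP) -> list:
    """Return the sorted names of entry points registered in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ API
        selected = eps.select(group=group)
    else:
        # Older interpreters return a plain dict keyed by group
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


print(installed_plugins())  # an empty list when no plugins are installed
```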

## MCP Server Integration

MarkItDown offers an MCP (Model Context Protocol) server for integration
with LLM applications like Claude Desktop.

```bash
# Install MCP server
pip install markitdown-mcp

# Or from source
git clone https://github.com/microsoft/markitdown.git
cd markitdown/packages/markitdown-mcp
pip install -e .
```

See [markitdown-mcp][mcp-repo] for configuration details.

[mcp-repo]: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-mcp

## Docker Usage

```bash
# Build image
docker build -t markitdown:latest .

# Convert file
docker run --rm -i markitdown:latest < document.pdf > output.md
```

## Troubleshooting

| Issue | Solution |
|-------|----------|
| Missing dependencies | Install with `pip install 'markitdown[all]'` |
| PDF extraction fails | Try Azure Document Intelligence for complex PDFs |
| Image text not extracted | Ensure OCR dependencies are installed, or use LLM mode |
| Large file timeout | Process in chunks or use streaming |
| Plugin not found | Run `markitdown --list-plugins` to verify installation |

### Common Errors

```bash
# ModuleNotFoundError for specific format
pip install 'markitdown[pdf]'  # Install missing dependency

# Azure authentication
export AZURE_DOCUMENT_INTELLIGENCE_ENDPOINT="<endpoint>"
export AZURE_DOCUMENT_INTELLIGENCE_KEY="<key>"
```

## Requirements

- Python >= 3.10
- Virtual environment recommended

```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # Linux/macOS
.venv\Scripts\activate     # Windows

# Install
pip install 'markitdown[all]'
```

## References

- `references/cli-reference.md` - Complete CLI options
- `references/api-reference.md` - Python API details
- `references/examples.md` - Extended examples
- `references/advanced-features.md` - Custom converters, URI handling
- GitHub: <https://github.com/microsoft/markitdown>
- PyPI: <https://pypi.org/project/markitdown/>

Overview

This skill guides using Microsoft MarkItDown, a Python utility that converts many file types into Markdown optimized for LLM consumption and text-analysis pipelines. It preserves document structure (headings, lists, tables, links) and supports optional AI-assisted transcription and image description. Use it to prepare diverse sources for indexing, prompting, or downstream NLP tasks.

How this skill works

MarkItDown inspects input files or URLs, applies format-specific parsers and optional OCR/transcription, and emits structured Markdown suitable for LLMs. It can call external services (Azure Document Intelligence or an LLM client) for complex PDFs, image descriptions, or audio transcription. The tool exposes both a CLI and a Python API for single-file conversion, batch workflows, and plugin extensions.

When to use it

- Converting PDF, Word, PowerPoint, Excel, or email files into LLM-ready Markdown
- Extracting text from images (OCR) or getting AI-generated image descriptions
- Transcribing audio files or extracting YouTube transcripts
- Batch-processing directories, ZIP archives, EPUB files, or Jupyter notebooks
- Preparing web content (HTML, RSS, Wikipedia) or structured data (CSV, JSON, XML) for indexing

Best practices

- Install optional extras for the formats you need (e.g., `[pdf]`, `[docx]`, `[audio-transcription]`) to avoid missing-dependency errors
- Use the Python API for programmatic pipelines and add a source header to track provenance
- For complex PDFs, enable Azure Document Intelligence or chunk large files to avoid timeouts
- Enable an LLM client only when you need descriptions or semantic enrichment to control cost
- Validate converted output by spot-checking headings, tables, and extracted text before indexing
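
The spot-check in the last item can be partly automated. As a rough sketch, the hypothetical helpers below count a few structural markers in converted Markdown and flag output that looks empty. The thresholds and warning messages are illustrative assumptions, and the counts are heuristics (for example, `#` lines inside code fences are counted as headings).

```python
import re


def markdown_stats(markdown: str) -> dict:
    """Count simple structural markers in a Markdown string."""
    lines = markdown.splitlines()
    return {
        "chars": len(markdown),
        "headings": sum(1 for line in lines if re.match(r"#{1,6} ", line)),
        # Counts every line starting with '|', including separator rows
        "table_lines": sum(1 for line in lines if line.lstrip().startswith("|")),
        "links": len(re.findall(r"\[[^\]]+\]\([^)]+\)", markdown)),
    }


def looks_suspicious(stats: dict, min_chars: int = 200) -> list:
    """Return human-readable warnings for output worth a manual look."""
    warnings = []
    if stats["chars"] < min_chars:
        warnings.append("very little text extracted")
    if stats["headings"] == 0:
        warnings.append("no headings found")
    return warnings


sample = "# Title\n\n| a | b |\n|---|---|\n| 1 | 2 |\n\nSee [docs](https://example.com)."
stats = markdown_stats(sample)
print(stats["headings"], stats["table_lines"], stats["links"])  # prints: 1 3 1
print(looks_suspicious(stats))  # prints: ['very little text extracted']
```

A check like this does not replace reading a few converted files, but it can surface scanned PDFs or images where no text came through.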

Example use cases

- Batch convert a folder of reports to Markdown for ingestion into a vector database
- Extract a YouTube video transcript via the CLI to create searchable meeting notes
- Convert scanned receipts with OCR and optionally enrich descriptions with an LLM
- Turn Jupyter notebooks into Markdown that retains code cells and outputs for documentation
- Parse an RSS feed or Wikipedia page into Markdown snippets for prompt context

FAQ

How do I add support for a specific file format?

Install the corresponding extra, e.g. `pip install 'markitdown[pdf]'`, or run `pip install 'markitdown[all]'` to get everything.

Can MarkItDown use an LLM or Azure for better extraction?

Yes. Pass an LLM client and model for descriptions or use Azure Document Intelligence by providing the endpoint and key.