
This skill converts PDFs to Markdown while preserving LaTeX equations and document structure using marker_single.

npx playbooks add skill benchflow-ai/skillsbench --skill marker

---
name: marker
description: Convert PDF documents to Markdown using marker_single. Use when Claude needs to extract text content from PDFs while preserving LaTeX formulas, equations, and document structure. Ideal for academic papers and technical documents containing mathematical notation.
---

# Marker PDF-to-Markdown Converter

Convert PDFs to Markdown while preserving LaTeX formulas and document structure. Uses the `marker_single` CLI from the marker-pdf package.

## Dependencies
- `marker_single` on PATH (`pip install marker-pdf` if missing)
- Python 3.10+ (available in the task image)

## Quick Start

```python
from scripts.marker_to_markdown import pdf_to_markdown

markdown_text = pdf_to_markdown("paper.pdf")
print(markdown_text)
```

## Python API

- `pdf_to_markdown(pdf_path, *, timeout=600, cleanup=True) -> str`
  - Runs `marker_single --output_format markdown --disable_image_extraction`
  - `cleanup=True`: use a temp directory and delete after reading the Markdown
  - `cleanup=False`: keep outputs in `<pdf_stem>_marker/` next to the PDF
  - Exceptions: `FileNotFoundError` if the PDF is missing, `RuntimeError` for marker failures, `TimeoutError` if it exceeds the timeout
- Tips: bump `timeout` for large PDFs; set `cleanup=False` to inspect intermediate files
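The tip about raising `timeout` can be automated. The following is a minimal sketch, not part of the skill: `convert_with_escalating_timeout` is a hypothetical helper that takes the converter as an argument, and the second timeout value of 1800 is an arbitrary illustration. It retries only on `TimeoutError`; `FileNotFoundError` and `RuntimeError` propagate unchanged.

```python
from typing import Callable, Sequence


def convert_with_escalating_timeout(
    convert: Callable[..., str],
    pdf_path: str,
    timeouts: Sequence[int] = (600, 1800),  # illustrative values, not from the skill
) -> str:
    """Call convert(pdf_path, timeout=...) with each timeout in turn.

    Returns the Markdown from the first attempt that finishes in time and
    re-raises the last TimeoutError if every attempt times out.
    """
    if not timeouts:
        raise ValueError("timeouts must contain at least one value")
    last_error = None
    for limit in timeouts:
        try:
            return convert(pdf_path, timeout=limit)
        except TimeoutError as exc:
            # Only timeouts are retried; other exceptions propagate immediately.
            last_error = exc
    raise last_error
```

With the skill importable, a call would look like `convert_with_escalating_timeout(pdf_to_markdown, "paper.pdf")`.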

## Command-Line Usage

```bash
# Basic conversion (prints markdown to stdout)
python scripts/marker_to_markdown.py paper.pdf

# Keep temporary files
python scripts/marker_to_markdown.py paper.pdf --keep-temp

# Raise the timeout for large PDFs (the Python API default is 600)
python scripts/marker_to_markdown.py paper.pdf --timeout 1200
```

## Output Locations
- `cleanup=True`: outputs stored in a temporary directory and removed automatically
- `cleanup=False`: outputs saved to `<pdf_stem>_marker/`; markdown lives at `<pdf_stem>_marker/<pdf_stem>/<pdf_stem>.md` when present (otherwise the first `.md` file is used)
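The lookup order described above can be reproduced when inspecting a retained output folder by hand. This is a sketch under assumptions: `find_marker_markdown` is a hypothetical helper, and "first `.md` file" is interpreted here as the first path in a sorted recursive search, which may differ from the script's own ordering.

```python
from pathlib import Path
from typing import Optional


def find_marker_markdown(pdf_path: str) -> Optional[Path]:
    """Locate the Markdown file in <pdf_stem>_marker/ next to the PDF."""
    pdf = Path(pdf_path)
    out_dir = pdf.parent / f"{pdf.stem}_marker"
    if not out_dir.is_dir():
        return None
    # Preferred location: <pdf_stem>_marker/<pdf_stem>/<pdf_stem>.md
    expected = out_dir / pdf.stem / f"{pdf.stem}.md"
    if expected.is_file():
        return expected
    # Fallback (assumed ordering): first .md file in a sorted recursive search.
    candidates = sorted(out_dir.rglob("*.md"))
    return candidates[0] if candidates else None
```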

## Troubleshooting
- `marker_single` not found: install `marker-pdf` or ensure the CLI is on PATH
- No Markdown output: re-run with `--keep-temp`/`cleanup=False` and check `stdout`/`stderr` saved in the output folder
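A quick pre-flight check can tell a missing CLI apart from a conversion failure. The sketch below uses only the standard library; `require_marker_cli` is a hypothetical helper name, not something the skill provides.

```python
import shutil


def require_marker_cli(cli_name: str = "marker_single") -> str:
    """Return the resolved path of the CLI, or raise with an install hint."""
    resolved = shutil.which(cli_name)
    if resolved is None:
        raise RuntimeError(
            f"{cli_name} not found on PATH; try `pip install marker-pdf`"
        )
    return resolved
```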

Overview

This skill converts PDF documents to Markdown while preserving LaTeX formulas, equations, and document structure. It wraps the marker_single CLI from the marker-pdf package to extract text content and math for academic and technical PDFs. The skill returns a Markdown string and can keep or clean up intermediate files.

How this skill works

The skill runs marker_single with the markdown output format and disables image extraction by default. It creates a temporary output folder unless cleanup is disabled, reads the generated .md file, and returns the Markdown text. It surfaces common errors: FileNotFoundError for missing PDFs, RuntimeError for marker failures, and TimeoutError when conversion exceeds the configured timeout.
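The flow described above might look roughly like the sketch below. It is not the skill's actual source: `build_marker_command` and `pdf_to_markdown_sketch` are hypothetical names, the `--output_dir` flag is an assumption about how the output folder is passed to `marker_single`, and only the `cleanup=True` path is shown.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import List


def build_marker_command(pdf_path: Path, output_dir: Path) -> List[str]:
    """Assemble the marker_single invocation with the documented flags."""
    return [
        "marker_single",
        str(pdf_path),
        "--output_format", "markdown",
        "--disable_image_extraction",
        "--output_dir", str(output_dir),  # assumption: flag not named in the skill docs
    ]


def pdf_to_markdown_sketch(pdf_path: str, *, timeout: int = 600) -> str:
    """Convert a PDF in a temporary directory and return the Markdown text."""
    pdf = Path(pdf_path)
    if not pdf.is_file():
        raise FileNotFoundError(f"PDF not found: {pdf}")
    with tempfile.TemporaryDirectory() as tmp:
        try:
            proc = subprocess.run(
                build_marker_command(pdf, Path(tmp)),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired as exc:
            raise TimeoutError(f"marker_single exceeded {timeout}s") from exc
        if proc.returncode != 0:
            raise RuntimeError(f"marker_single failed: {proc.stderr.strip()}")
        # Read the Markdown before the temporary directory is deleted.
        md_files = sorted(Path(tmp).rglob("*.md"))
        if not md_files:
            raise RuntimeError("marker_single produced no Markdown file")
        return md_files[0].read_text(encoding="utf-8")
```

Note that in this sketch a missing `marker_single` binary would also surface as `FileNotFoundError` from `subprocess.run`, so a real wrapper may want to check for the CLI first.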

When to use it

  • Converting academic papers or preprints to editable Markdown while keeping LaTeX math intact
  • Extracting structured text from technical reports that include equations and numbered sections
  • Preparing content for static site generators, note-taking apps, or further NLP processing
  • Batch-processing PDFs where you want programmatic control over timeout and output retention
  • Debugging conversion issues by preserving intermediate outputs

Best practices

  • Install marker-pdf and ensure marker_single is on PATH before running the skill
  • Increase the timeout for large or image-heavy PDFs to avoid premature TimeoutError
  • Use cleanup=True for one-off conversions and cleanup=False to inspect intermediate files
  • If no Markdown appears, re-run with cleanup=False and check the saved stdout/stderr in the output folder
  • Keep source PDFs organized so resulting <pdf_stem>_marker folders are easy to locate

Example use cases

  • Convert a LaTeX-heavy research paper to Markdown for inclusion in a knowledge base
  • Extract equations and section headings from lecture notes for study cards or summaries
  • Automate conversion of technical manuals to Markdown before running search/indexing jobs
  • Debug corrupted or partially converted PDFs by inspecting the marker_single output kept in the <pdf_stem>_marker/ folder (requires cleanup=False)
  • Preprocess PDFs before applying downstream NLP models that expect plain text with math blocks

FAQ

What if marker_single is not installed?

Install marker-pdf (pip install marker-pdf) or ensure the marker_single CLI is available on PATH.

How do I get intermediate files for debugging?

Set cleanup=False in Python (or pass --keep-temp on the command line) to keep the <pdf_stem>_marker/ folder, then check the saved stdout/stderr and any generated .md files there.

Can images be extracted?

Not by default. The skill passes --disable_image_extraction to marker_single to keep the focus on Markdown and math; remove that flag from the CLI invocation if you need images.