
document-converter-suite skill


This skill converts between eight document formats with best-effort extraction to produce clean, structured outputs suitable for editing and reuse.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill document-converter-suite

SKILL.md
---
name: document-converter-suite
description: Convert between 8 formats (PDF, DOCX, PPTX, XLSX, TXT, CSV, MD, HTML). Best-effort text extraction, batch processing, and document format transformation.
---

# Document Converter Suite

## Overview

Provide a best-effort conversion workflow between **8 document formats**:

**Office Formats**: PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX)
**Text Formats**: Plain Text (TXT), CSV, Markdown (MD), HTML

Uses `pypdf`, `python-docx`, `python-pptx`, `openpyxl`, `reportlab`, `mistune`, `beautifulsoup4`, and `Pillow`.

Prefer **reliable extraction + rebuild** (text, headings, bullets, basic tables) over pixel-perfect layout.

## When to use

Use when the request involves:

- Converting a file between **.pdf / .docx / .pptx / .xlsx / .txt / .csv / .md / .html**
- Making a document **more editable** by moving its content into Office or text formats
- Exporting slide text or spreadsheet cell grids to a different format
- Converting Markdown/HTML documentation to Office formats or vice versa
- Extracting tables from Office documents to CSV/XLSX
- Batch-converting a folder of mixed documents

**Supported conversion paths**: 64 total (8×8 matrix) - see `references/conversion_matrix.md`

Avoid promising visual fidelity. Emphasize that output is **clean and structured**, not identical.

## Workflow decision tree

1. **Identify input and desired output** (extensions matter).
2. **Classify the user's goal**:
   - **Editable content** → proceed with this suite.
   - **Visually identical rendering** → explain limitations; suggest external rendering tools.
3. **Pick conversion mode**:
   - Single file → run `scripts/convert.py`.
   - Folder/batch → run `scripts/batch_convert.py`.
4. **Tune safety caps** if needed:
   - PDF: `--max-pages`, `--max-chars`
   - XLSX: `--max-rows`, `--max-cols`
5. **Run conversion**, then sanity-check output size and structure.
6. **Iterate** (e.g., increase max rows/cols, split large docs, or choose a different target format).
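
Steps 3 and 4 of the decision tree can be sketched as a small helper that picks the script and appends only the caps the caller set. This is a minimal illustration, not part of the suite: `build_command` and its `caps` dictionary are hypothetical names, and the source does not say whether the batch script accepts the same cap flags as `scripts/convert.py`.

```python
from pathlib import Path
from typing import Optional

SUPPORTED = {"pdf", "docx", "pptx", "xlsx", "txt", "csv", "md", "html"}


def build_command(source: str, target: str, caps: Optional[dict] = None) -> list:
    """Return an argv list for the single-file or batch script (hypothetical helper)."""
    target = target.lower().lstrip(".")
    if target not in SUPPORTED:
        raise ValueError(f"unsupported target format: {target}")
    path = Path(source)
    # Step 3: an existing directory goes to the batch script, anything else to convert.py.
    script = "scripts/batch_convert.py" if path.is_dir() else "scripts/convert.py"
    argv = ["python", script, str(path), "--to", target]
    # Step 4: append only the caps that were set, e.g. {"max-rows": 40}.
    for flag, value in (caps or {}).items():
        argv += [f"--{flag}", str(value)]
    return argv
```

The returned list can be passed to `subprocess.run(argv, check=True)`; building a list rather than a shell string avoids quoting problems with file names that contain spaces.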

## Quick start

### Single-file conversion

Run:

```bash
python scripts/convert.py <input-file> --to <pdf|docx|pptx|xlsx|txt|csv|md|html>
```

Examples:

```bash
# Office format conversions
python scripts/convert.py report.pdf --to docx
python scripts/convert.py deck.pptx --to pdf --out deck_export.pdf
python scripts/convert.py data.xlsx --to pptx --max-rows 40 --max-cols 12

# Text format conversions
python scripts/convert.py documentation.md --to docx
python scripts/convert.py data.csv --to xlsx
python scripts/convert.py report.docx --to html
python scripts/convert.py notes.txt --to md
```

### Batch conversion

Run:

```bash
python scripts/batch_convert.py <input-dir> --to <pdf|docx|pptx|xlsx|txt|csv|md|html>
```

Examples:

```bash
python scripts/batch_convert.py ./inbox --to docx --recursive
python scripts/batch_convert.py ./inbox --to pdf --outdir ./out --recursive --overwrite
python scripts/batch_convert.py ./markdown-docs --to html --pattern "*.md"
python scripts/batch_convert.py ./data --to xlsx --pattern "*.csv"
```
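
For intuition about `--pattern` and `--recursive`, the file selection could look like the sketch below. This is an assumption about how such flags are commonly implemented with `pathlib`, not a description of what `scripts/batch_convert.py` does internally; `select_inputs` is a hypothetical name.

```python
from pathlib import Path


def select_inputs(input_dir, pattern="*", recursive=False):
    """List the files under input_dir that match a glob pattern (hypothetical helper)."""
    root = Path(input_dir)
    # rglob descends into subdirectories; glob stays at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    # Keep regular files only and sort for a stable, repeatable order.
    return sorted(p for p in matches if p.is_file())
```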

## Conversion behavior

Follow these defaults (and say them out loud if the user might be expecting magic):

### Office Format Conversions

- **PDF → (DOCX/PPTX/XLSX/TXT/MD/HTML)**: extract text with `pypdf`; no OCR; each page becomes a section/slide block.
- **DOCX → (PDF/PPTX/XLSX/TXT/CSV/MD/HTML)**: export paragraphs, headings, and tables.
  - **Heading detection**: a paragraph counts as a heading by style name, by font size + bold, or by ALL CAPS + bold, not by style name alone.
- **PPTX → (DOCX/PDF/XLSX/TXT/CSV/MD/HTML)**: export slide titles + text frames; export tables.
  - **Multi-table support**: PPTX now creates one slide per table when multiple tables exist.
- **XLSX → (DOCX/PPTX/PDF/TXT/CSV/MD/HTML)**: export bounded value grid per sheet (defaults: 200×50).
  - **Truncation warnings**: printed to stderr when data exceeds limits (e.g., "Sheet 'Data': Truncated 500 rows → 200 rows").
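
The bounded-grid behavior for XLSX can be pictured with a plain list-of-lists sketch. It assumes the 200×50 default means 200 rows by 50 columns (the row figure matches the quoted warning). The function name `bound_grid` and the wording of the column warning are hypothetical; only the row warning format comes from the text above.

```python
import sys


def bound_grid(sheet_name, rows, max_rows=200, max_cols=50):
    """Clip a list-of-lists grid and warn on stderr when data is dropped."""
    n_rows = len(rows)
    n_cols = max((len(r) for r in rows), default=0)
    if n_rows > max_rows:
        # Row warning mirrors the format quoted in the bullet above.
        print(f"Sheet '{sheet_name}': Truncated {n_rows} rows → {max_rows} rows", file=sys.stderr)
    if n_cols > max_cols:
        # Column warning wording is an assumption for illustration.
        print(f"Sheet '{sheet_name}': Truncated {n_cols} cols → {max_cols} cols", file=sys.stderr)
    return [list(r[:max_cols]) for r in rows[:max_rows]]
```

Writing warnings to stderr keeps them out of any converted text that is piped through stdout.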

### Text Format Conversions

- **TXT → (DOCX/PPTX/XLSX/PDF/CSV/MD/HTML)**: lines become paragraphs/bullets; simple structure preservation.
- **CSV → (XLSX/DOCX/PPTX/HTML)**: headers + rows mapped to tables/sheets; auto-delimiter detection.
- **MD → (DOCX/PPTX/XLSX/PDF/TXT/CSV/HTML)**: parsed with `mistune`; headings, lists, tables, code blocks preserved.
  - **High fidelity**: Markdown ↔ HTML and Markdown ↔ DOCX maintain structure well.
- **HTML → (DOCX/PPTX/XLSX/PDF/TXT/CSV/MD)**: parsed with `beautifulsoup4`; semantic structure extracted.
  - **High fidelity**: HTML ↔ Markdown and HTML ↔ DOCX maintain structure well.
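
The CSV auto-delimiter detection mentioned above can be approximated with the standard library's `csv.Sniffer`. Whether the suite uses `Sniffer` is not stated in the source, so treat this as one plausible approach; `read_csv_table` and its candidate delimiter set are hypothetical.

```python
import csv
import io


def read_csv_table(text, candidates=",;\t|"):
    """Split CSV text into (header, rows), guessing the delimiter from a sample."""
    try:
        dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
    except csv.Error:
        # Sniffing can fail on tiny or irregular samples; fall back to commas.
        dialect = csv.excel
    rows = list(csv.reader(io.StringIO(text), dialect))
    if not rows:
        return [], []
    # First row is treated as the header, the rest as data rows.
    return rows[0], rows[1:]
```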

### Quality Improvements

- **Multi-table PPTX**: Creates one slide per table (instead of dropping extra tables)
- **Smart heading detection**: DOCX headings detected by style, font size+bold, or ALL CAPS+bold
- **Data truncation warnings**: XLSX conversions warn when data is truncated
- **Image extraction foundation**: `image_handler.py` provides hash-based deduplication for future image support
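
Hash-based deduplication, as attributed to `image_handler.py`, generally means keying each image by a digest of its bytes so repeated images are stored once. The sketch below shows the idea only; the actual API of `image_handler.py` and its choice of hash are not given in the source, so `dedupe_images` and SHA-256 are assumptions.

```python
import hashlib


def dedupe_images(blobs):
    """Return (unique_blobs, index_map); index_map[i] points into unique_blobs."""
    seen = {}        # digest -> position in unique
    unique = []
    index_map = []
    for blob in blobs:
        # SHA-256 is an assumed choice; any stable content hash works the same way.
        digest = hashlib.sha256(blob).hexdigest()
        if digest not in seen:
            seen[digest] = len(unique)
            unique.append(blob)
        index_map.append(seen[digest])
    return unique, index_map
```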

Load extra detail from:

- `references/conversion_matrix.md` - Full 8×8 conversion matrix
- `references/limitations.md` - Format-specific limitations and edge cases

## Guardrails and honesty rules

- State "best-effort" explicitly for any conversion request.
- Do not claim formatting fidelity (fonts, spacing, images, charts, animations).
- Call out scanned PDFs as a likely failure mode (no OCR).
- For giant spreadsheets, raise the caps gradually and/or limit the conversion to specific sheets when the user says which ones matter.
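
To call out a scanned PDF before promising anything, one option is a rough check on how much text each page yielded. The sketch below works on a list of per-page strings, which could come from calling `extract_text()` on each page of a `pypdf` `PdfReader`. The function name and both thresholds are hypothetical and would need tuning.

```python
def looks_scanned(page_texts, min_chars_per_page=20):
    """Heuristic: True when more than half the pages yielded almost no text."""
    if not page_texts:
        # No pages or no extraction results: nothing usable to convert.
        return True
    sparse = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return sparse / len(page_texts) > 0.5
```

A `True` result is a cue to warn the user that OCR is needed first, not proof that the file is scanned.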

## Bundled scripts

- `scripts/convert.py`: single-file CLI converter
- `scripts/batch_convert.py`: batch converter for directories
- `scripts/lib/*`: internal readers/writers and conversion orchestration

Overview

This skill provides a best-effort conversion workflow between eight common document formats: PDF, DOCX, PPTX, XLSX, TXT, CSV, MD, and HTML. It focuses on reliable text extraction and structured rebuilds (headings, lists, basic tables) rather than pixel-perfect visual fidelity. Batch processing, truncation safeguards, and export options for Office and text formats are included.

How this skill works

The suite classifies the input and desired output, then selects a conversion path that extracts semantic content (text, headings, lists, tables, simple grids) and reconstructs it in the target format. It parses and writes with the Python libraries listed in SKILL.md (`pypdf`, `python-docx`, `python-pptx`, `openpyxl`, `reportlab`, `mistune`, `beautifulsoup4`, `Pillow`), applies heuristics for heading detection, and emits truncation or safety warnings for large spreadsheets or long PDFs. For folders, a batch mode applies the same rules to every file, with controls for the output directory and for overwriting existing files.
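
The heading heuristics mentioned above (style name, font size + bold, ALL CAPS + bold) can be sketched on a simplified paragraph record. This is an illustration under assumptions, not the suite's code: the `Para` class, the 11 pt body size, the 1.2 size ratio, and the 120-character limit are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Para:
    """Simplified stand-in for a DOCX paragraph (hypothetical)."""
    text: str
    style: str = "Normal"
    font_pt: Optional[float] = None
    bold: bool = False


def is_heading(p, body_pt=11.0, min_ratio=1.2, max_len=120):
    text = p.text.strip()
    if not text:
        return False
    # 1) An explicit heading style wins outright.
    if p.style.lower().startswith("heading"):
        return True
    # Long runs of text are treated as body copy, whatever their formatting.
    if len(text) > max_len:
        return False
    # 2) Bold text clearly larger than the assumed body size.
    if p.bold and p.font_pt is not None and p.font_pt >= body_pt * min_ratio:
        return True
    # 3) Bold text written in ALL CAPS.
    return p.bold and text.isupper()
```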

When to use it

  • Convert files between PDF, DOCX, PPTX, XLSX, TXT, CSV, MD, and HTML
  • Make documents more editable by moving content into Office or plain-text formats
  • Export slide text or spreadsheet grids into other formats
  • Turn Markdown/HTML documentation into Office formats or vice versa
  • Extract tables from Office files to CSV/XLSX or batch-convert a folder of mixed documents

Best practices

  • Treat all conversions as best-effort: expect clean, structured output, not an identical visual layout
  • For scanned PDFs, run OCR first; this tool does not perform OCR
  • When converting large spreadsheets, increase max-rows/max-cols incrementally or request specific sheets
  • Use batch mode for consistent, repeatable conversions of many files
  • Sanity-check outputs and tune safety caps (max-pages, max-chars, max-rows, max-cols) if data is truncated

Example use cases

  • Turn a client PDF report into an editable DOCX for revisions
  • Export slide text from a PPTX into Markdown or DOCX for documentation
  • Convert CSV datasets into XLSX or HTML tables for sharing
  • Batch-convert a folder of Markdown docs to HTML for a static site export
  • Extract tables from DOCX or PPTX into CSV for data processing

FAQ

Does the tool preserve exact visual layout and fonts?

No. It preserves semantic structure (text, headings, lists, basic tables). It does not guarantee pixel-perfect visual fidelity, fonts, spacing, images, charts, or animations.

Can it convert scanned PDFs or images with text?

Not directly. Scanned PDFs require OCR before using this converter; the suite relies on text extraction libraries and does not perform OCR.