This skill converts between eight document formats with best-effort extraction to produce clean, structured outputs suitable for editing and reuse.
To add this skill to your agents, run: `npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill document-converter-suite`
---
name: document-converter-suite
description: Convert between 8 formats (PDF, DOCX, PPTX, XLSX, TXT, CSV, MD, HTML). Best-effort text extraction, batch processing, and document format transformation.
---
# Document Converter Suite
## Overview
Provide a best-effort conversion workflow between **8 document formats**:
**Office Formats**: PDF, Word (DOCX), PowerPoint (PPTX), Excel (XLSX)
**Text Formats**: Plain Text (TXT), CSV, Markdown (MD), HTML
Uses `pypdf`, `python-docx`, `python-pptx`, `openpyxl`, `reportlab`, `mistune`, `beautifulsoup4`, and `Pillow`.
Prefer **reliable extraction + rebuild** (text, headings, bullets, basic tables) over pixel-perfect layout.
## When to use
Use when the request involves:
- Converting a file between **.pdf / .docx / .pptx / .xlsx / .txt / .csv / .md / .html**
- Making a document **more editable** by moving its content into Office or text formats
- Exporting slide text or spreadsheet cell grids to a different format
- Converting Markdown/HTML documentation to Office formats or vice versa
- Extracting tables from Office documents to CSV/XLSX
- Batch-converting a folder of mixed documents
**Supported conversion paths**: 64 total (full 8×8 source × target matrix, a count that includes same-format pairs) - see `references/conversion_matrix.md`
Avoid promising visual fidelity. Emphasize that output is **clean and structured**, not identical.
## Workflow decision tree
1. **Identify input and desired output** (extensions matter).
2. **Classify the user's goal**:
   - **Editable content** → proceed with this suite.
   - **Visually identical rendering** → explain limitations; suggest external rendering tools.
3. **Pick conversion mode**:
   - Single file → run `scripts/convert.py`.
   - Folder/batch → run `scripts/batch_convert.py`.
4. **Tune safety caps** if needed:
   - PDF: `--max-pages`, `--max-chars`
   - XLSX: `--max-rows`, `--max-cols`
5. **Run conversion**, then sanity-check output size and structure.
6. **Iterate** (e.g., increase max rows/cols, split large docs, or choose a different target format).
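Steps 3 and 4 of the decision tree can be sketched as a small command builder. This is an illustrative helper, not part of the suite: `build_command` is a hypothetical name, and it assumes that `scripts/batch_convert.py` accepts the same safety-cap flags as `scripts/convert.py`, which the text above does not confirm.

```python
from pathlib import Path

FORMATS = {"pdf", "docx", "pptx", "xlsx", "txt", "csv", "md", "html"}


def build_command(input_path, target, max_pages=None, max_chars=None,
                  max_rows=None, max_cols=None):
    """Return an argv list for the single-file or batch script (hypothetical helper)."""
    target = target.lower().lstrip(".")
    if target not in FORMATS:
        raise ValueError(f"unsupported target format: {target}")

    path = Path(input_path)
    # Step 3: existing folders go to the batch script, everything else to convert.py.
    script = "scripts/batch_convert.py" if path.is_dir() else "scripts/convert.py"
    cmd = ["python", script, str(path), "--to", target]

    # Step 4: pass only the safety caps the caller set explicitly.
    # Assumption: both scripts understand these flags.
    caps = {
        "--max-pages": max_pages,
        "--max-chars": max_chars,
        "--max-rows": max_rows,
        "--max-cols": max_cols,
    }
    for flag, value in caps.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`; after that, steps 5 and 6 (sanity-check the output, then adjust caps) still apply.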
## Quick start
### Single-file conversion
Run:
```bash
python scripts/convert.py <input-file> --to <pdf|docx|pptx|xlsx|txt|csv|md|html>
```
Examples:
```bash
# Office format conversions
python scripts/convert.py report.pdf --to docx
python scripts/convert.py deck.pptx --to pdf --out deck_export.pdf
python scripts/convert.py data.xlsx --to pptx --max-rows 40 --max-cols 12
# Text format conversions
python scripts/convert.py documentation.md --to docx
python scripts/convert.py data.csv --to xlsx
python scripts/convert.py report.docx --to html
python scripts/convert.py notes.txt --to md
```
### Batch conversion
Run:
```bash
python scripts/batch_convert.py <input-dir> --to <pdf|docx|pptx|xlsx|txt|csv|md|html>
```
Examples:
```bash
python scripts/batch_convert.py ./inbox --to docx --recursive
python scripts/batch_convert.py ./inbox --to pdf --outdir ./out --recursive --overwrite
python scripts/batch_convert.py ./markdown-docs --to html --pattern "*.md"
python scripts/batch_convert.py ./data --to xlsx --pattern "*.csv"
```
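How the batch flags might interact can be sketched in a few lines of Python. This is only a guess at the behavior implied by `--pattern`, `--recursive`, `--outdir`, and `--overwrite`; `plan_batch` is a hypothetical function, and the real `scripts/batch_convert.py` may select files and name outputs differently.

```python
from pathlib import Path


def plan_batch(input_dir, target, outdir=None, pattern="*",
               recursive=False, overwrite=False):
    """Return (source, destination) pairs a batch run might process (hypothetical)."""
    root = Path(input_dir)
    out_root = Path(outdir) if outdir else root
    matches = root.rglob(pattern) if recursive else root.glob(pattern)

    plan = []
    for src in sorted(matches):
        if not src.is_file():
            continue  # glob patterns can also match directories
        # Assumption: mirror the relative folder layout under the output directory.
        dest = (out_root / src.relative_to(root)).with_suffix(f".{target}")
        if dest.exists() and not overwrite:
            continue  # leave existing output alone unless overwrite is requested
        plan.append((src, dest))
    return plan
```

Printing the plan before converting is a cheap dry run: it shows which files a pattern actually matches and which outputs would be skipped.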
## Conversion behavior
Follow these defaults (and say them out loud if the user might be expecting magic):
### Office Format Conversions
- **PDF → (DOCX/PPTX/XLSX/TXT/MD/HTML)**: extract text with `pypdf`; no OCR; each page becomes a section/slide block.
- **DOCX → (PDF/PPTX/XLSX/TXT/CSV/MD/HTML)**: export paragraphs, headings (with improved detection), and tables.
  - **Improved heading detection**: detects headings by style name, by font size + bold, or by ALL CAPS + bold, rather than by style name alone.
- **PPTX → (DOCX/PDF/XLSX/TXT/CSV/MD/HTML)**: export slide titles + text frames; export tables.
  - **Multi-table support**: PPTX output gets one slide per table when the source contains multiple tables.
- **XLSX → (DOCX/PPTX/PDF/TXT/CSV/MD/HTML)**: export bounded value grid per sheet (defaults: 200×50).
  - **Truncation warnings**: printed to stderr when data exceeds limits (e.g., "Sheet 'Data': Truncated 500 rows → 200 rows").
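The bounded-grid behavior for XLSX can be illustrated with plain lists. The sketch below is not the suite's code: `bounded_grid` is a hypothetical name, it assumes the 200×50 default means 200 rows by 50 columns, and the column warning wording is invented to match the row warning quoted above.

```python
import sys


def bounded_grid(sheet_name, rows, max_rows=200, max_cols=50, stream=sys.stderr):
    """Clip a list-of-lists grid to max_rows x max_cols and warn when data is cut."""
    total_rows = len(rows)
    total_cols = max((len(row) for row in rows), default=0)

    if total_rows > max_rows:
        print(f"Sheet '{sheet_name}': Truncated {total_rows} rows → {max_rows} rows",
              file=stream)
    if total_cols > max_cols:
        # Assumed wording; the source only shows the row message.
        print(f"Sheet '{sheet_name}': Truncated {total_cols} cols → {max_cols} cols",
              file=stream)

    return [list(row[:max_cols]) for row in rows[:max_rows]]
```

Writing warnings to stderr keeps them out of any converted text sent to stdout, so a caller can surface them to the user separately.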
### Text Format Conversions
- **TXT → (DOCX/PPTX/XLSX/PDF/CSV/MD/HTML)**: lines become paragraphs/bullets; simple structure preservation.
- **CSV → (XLSX/DOCX/PPTX/HTML)**: headers + rows mapped to tables/sheets; auto-delimiter detection.
- **MD → (DOCX/PPTX/XLSX/PDF/TXT/CSV/HTML)**: parsed with `mistune`; headings, lists, tables, code blocks preserved.
  - **High fidelity**: Markdown ↔ HTML and Markdown ↔ DOCX maintain structure well.
- **HTML → (DOCX/PPTX/XLSX/PDF/TXT/CSV/MD)**: parsed with `beautifulsoup4`; semantic structure extracted.
  - **High fidelity**: HTML ↔ Markdown and HTML ↔ DOCX maintain structure well.
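For the CSV path, auto-delimiter detection could look roughly like the following. The suite's actual mechanism is not documented here; this sketch assumes the standard library's `csv.Sniffer` with a comma fallback, and `read_csv_rows` is a hypothetical helper name.

```python
import csv
import io


def read_csv_rows(text, candidates=",;\t|"):
    """Parse CSV text, guessing the delimiter from a leading sample."""
    try:
        dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
        delimiter = dialect.delimiter
    except csv.Error:
        delimiter = ","  # sniffing failed (e.g. single-column data): assume comma

    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    # First row is treated as the header, the rest as table or sheet rows.
    header, body = (rows[0], rows[1:]) if rows else ([], [])
    return delimiter, header, body
```

Sniffing is a heuristic, so for unusual files it is safer to let the user override the delimiter than to trust the guess.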
### Quality Improvements
- **Multi-table PPTX**: Creates one slide per table (instead of dropping extra tables)
- **Smart heading detection**: DOCX headings detected by style, font size+bold, or ALL CAPS+bold
- **Data truncation warnings**: XLSX conversions warn when data is truncated
- **Image extraction foundation**: `image_handler.py` provides hash-based deduplication for future image support
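The smart heading detection rule (style, font size + bold, or ALL CAPS + bold) can be expressed as a small predicate. This is a sketch under stated assumptions: `Para` is a stand-in record rather than a `python-docx` object, and the body size, size ratio, and length cutoff are made-up thresholds, not the suite's values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Para:
    """Minimal stand-in for a paragraph's text and formatting (hypothetical)."""
    text: str
    style: str = "Normal"
    font_pt: Optional[float] = None
    bold: bool = False


def looks_like_heading(p, body_pt=11.0, ratio=1.2, max_len=120):
    """Return True if the paragraph matches one of the three heading signals."""
    text = p.text.strip()
    if not text:
        return False
    # Signal 1: an explicit heading or title style wins outright.
    if p.style.lower().startswith(("heading", "title")):
        return True
    if len(text) > max_len:
        return False  # long bold runs are more likely emphasized body text
    # Signal 2: bold and noticeably larger than body text.
    if p.bold and p.font_pt is not None and p.font_pt >= body_pt * ratio:
        return True
    # Signal 3: bold ALL CAPS; isupper() needs at least one letter, so "2024" fails.
    return p.bold and text.isupper()
```

Checking the style name first keeps well-structured documents accurate; the two formatting heuristics only matter for documents where authors styled headings by hand.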
Load extra detail from:
- `references/conversion_matrix.md` - Full 8×8 conversion matrix
- `references/limitations.md` - Format-specific limitations and edge cases
## Guardrails and honesty rules
- State "best-effort" explicitly for any conversion request.
- Do not claim formatting fidelity (fonts, spacing, images, charts, animations).
- Call out scanned PDFs as a likely failure mode (no OCR).
- For giant spreadsheets, prefer increasing caps gradually and/or limiting to specific sheets (if user provides intent).
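One way to call out a scanned PDF early is to check how much text extraction actually returned. The check below is a hedged heuristic, not something the suite is known to do: `likely_scanned` and both thresholds are hypothetical, and the input is assumed to be one extracted string per page (for example, the results of `pypdf`'s `page.extract_text()`).

```python
def likely_scanned(page_texts, min_chars_per_page=20, sparse_ratio=0.8):
    """Guess whether a PDF is image-only from its per-page extracted text."""
    if not page_texts:
        return True  # nothing extracted at all

    # Count pages whose extracted text is empty or nearly empty.
    sparse = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return sparse / len(page_texts) >= sparse_ratio
```

When this returns `True`, the honest response is to say that the file appears to need OCR first, rather than to deliver a near-empty converted document.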
## Bundled scripts
- `scripts/convert.py`: single-file CLI converter
- `scripts/batch_convert.py`: batch converter for directories
- `scripts/lib/*`: internal readers/writers and conversion orchestration
This skill provides a best-effort conversion workflow between eight common document formats: PDF, DOCX, PPTX, XLSX, TXT, CSV, MD, and HTML. It focuses on reliable text extraction and structured rebuilds (headings, lists, basic tables) rather than pixel-perfect visual fidelity. Batch processing, truncation safeguards, and export options for Office and text formats are included.
The suite classifies the input and desired output, then selects a conversion path that extracts semantic content (text, headings, lists, tables, simple grids) and reconstructs it in the target format. It uses tested Python libraries for parsing and writing, applies heuristics for improved heading detection, and emits truncation or safety warnings for large spreadsheets or long PDFs. For folders, a batch mode processes many files with consistent rules and overwrite or outdir controls.
**Does the tool preserve exact visual layout and fonts?**
No. It preserves semantic structure (text, headings, lists, basic tables). It does not guarantee pixel-perfect visual fidelity, fonts, spacing, images, charts, or animations.
**Can it convert scanned PDFs or images with text?**
Not directly. Scanned PDFs require OCR before using this converter; the suite relies on text extraction libraries and does not perform OCR.