
document-processing skill


/plugins/ltk-product/skills/document-processing

This skill helps you manage office documents by merging, converting, and extracting text across PDF, Word, Excel, and PowerPoint formats.

npx playbooks add skill eyadsibai/ltk --skill document-processing


---
name: document-processing
description: Use when working with "PDF", "Excel", "Word", "PowerPoint", "XLSX", "DOCX", "PPTX", "spreadsheets", "presentations", "extract text", "merge documents", "convert documents", or asking about "office document manipulation"
version: 1.0.0
---

# Document Processing Guide

Work with office documents: PDF, Excel, Word, and PowerPoint.

---

## Format Overview

| Format | Extension | Structure | Best For |
|--------|-----------|-----------|----------|
| **PDF** | .pdf | Binary/text | Reports, forms, archives |
| **Excel** | .xlsx | XML in ZIP | Data, calculations, models |
| **Word** | .docx | XML in ZIP | Text documents, contracts |
| **PowerPoint** | .pptx | XML in ZIP | Presentations, slides |

**Key concept**: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
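
As a minimal sketch of the unzip approach, the standard-library `zipfile` module can list and read the parts of any of these formats. The helper names `list_ooxml_parts` and `read_ooxml_part` are hypothetical, and the sketch assumes the parts are UTF-8 encoded, which is typical but not guaranteed.

```python
import zipfile


def list_ooxml_parts(path):
    """Return the names of the parts stored inside a DOCX, XLSX, or PPTX file."""
    with zipfile.ZipFile(path) as archive:
        return sorted(archive.namelist())


def read_ooxml_part(path, part_name):
    """Return one part (for example word/document.xml) as text.

    Assumes the part is UTF-8 encoded.
    """
    with zipfile.ZipFile(path) as archive:
        return archive.read(part_name).decode("utf-8")
```

Calling `list_ooxml_parts("contract.docx")` on a hypothetical file would typically include entries such as `word/document.xml`.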

---

## PDF Processing

### PDF Tools

| Task | Best Tool |
|------|-----------|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |

### Common Operations

| Operation | Approach |
|-----------|----------|
| **Merge** | Loop through files, add pages to writer |
| **Split** | Create new writer per page |
| **Extract tables** | Use pdfplumber, convert to DataFrame |
| **Rotate** | Call `.rotate(degrees)` on page (multiples of 90 only) |
| **Encrypt** | Use writer's `.encrypt()` method |
| **OCR** | Convert to images, run pytesseract |

---

## Excel Processing

### Excel Tools

| Task | Best Tool |
|------|-----------|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |

### Critical Rule: Use Formulas

| Approach | Result |
|----------|--------|
| **Wrong**: Calculate in Python, write value | Static number, breaks when data changes |
| **Right**: Write Excel formula | Dynamic, recalculates automatically |
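
A minimal sketch of the "right" approach with openpyxl is shown here. The function name `build_sales_sheet` and the single-column layout are assumptions made for illustration. Note that openpyxl stores the formula text without evaluating it; the result is calculated when the workbook is opened in a spreadsheet application.

```python
from openpyxl import Workbook


def build_sales_sheet(path, amounts):
    """Write the inputs as values and the total as an Excel formula."""
    workbook = Workbook()
    sheet = workbook.active
    sheet["A1"] = "Amount"
    for row, amount in enumerate(amounts, start=2):
        sheet.cell(row=row, column=1, value=amount)
    last_row = len(amounts) + 1
    total_cell = f"A{last_row + 1}"
    # A string starting with "=" is stored as a formula, not as text.
    sheet[total_cell] = f"=SUM(A2:A{last_row})"
    workbook.save(path)
    return total_cell
```

Because the total is a formula, changing any amount in Excel later updates the total without rerunning Python.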

### Financial Model Standards

| Convention | Meaning |
|------------|---------|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |

### Common Formula Errors

| Error | Cause |
|-------|-------|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |

---

## Word Processing

### Word Tools

| Task | Best Tool |
|------|-----------|
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |

### Document Structure

| File | Contains |
|------|----------|
| `word/document.xml` | Main content |
| `word/comments.xml` | Comments |
| `word/media/` | Images |

### Tracked Changes (Redlining)

| Element | XML Tag |
|---------|---------|
| Deletion | `<w:del><w:r><w:delText>...</w:delText></w:r></w:del>` |
| Insertion | `<w:ins><w:r><w:t>...</w:t></w:r></w:ins>` |

**Key concept**: For professional/legal documents, use tracked changes XML rather than replacing text directly.
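
To audit existing redlines, the tracked-change elements can be read back out of `word/document.xml` with the standard library. The sketch below assumes the XML has already been extracted to a string; the function name `summarize_tracked_changes` is hypothetical. It skips change markers that carry no text, such as those that only mark a paragraph break.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w:" prefix in DOCX files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def summarize_tracked_changes(document_xml):
    """Return (insertions, deletions) as lists of (author, text) tuples."""
    root = ET.fromstring(document_xml)
    author_attr = f"{{{W_NS}}}author"

    def collect(change_tag, text_tag):
        changes = []
        for change in root.iter(f"{{{W_NS}}}{change_tag}"):
            text = "".join(
                node.text or "" for node in change.iter(f"{{{W_NS}}}{text_tag}")
            )
            if text:
                changes.append((change.get(author_attr), text))
        return changes

    # Inserted text lives in w:t, deleted text in w:delText.
    return collect("ins", "t"), collect("del", "delText")
```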

---

## PowerPoint Processing

### PowerPoint Tools

| Task | Best Tool |
|------|-----------|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |

### Slide Structure

| Path | Contains |
|------|----------|
| `ppt/slides/slide{N}.xml` | Slide content |
| `ppt/notesSlides/` | Speaker notes |
| `ppt/slideMasters/` | Master templates |
| `ppt/media/` | Images |
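
Given this layout, slide text can be pulled directly from the archive with the standard library, as in the sketch below. The function name `extract_slide_text` is hypothetical. One caveat: the number in `slide{N}.xml` does not necessarily match the presentation order, which is defined elsewhere in the package.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# DrawingML namespace; slide text runs are stored in a:t elements.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def extract_slide_text(path):
    """Return {slide_file_number: [text runs]} for every slide in a PPTX."""
    slides = {}
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            match = SLIDE_PART.fullmatch(name)
            if not match:
                continue  # skip relationship files, notes, masters, media
            root = ET.fromstring(archive.read(name))
            runs = [node.text for node in root.iter(f"{{{A_NS}}}t") if node.text]
            slides[int(match.group(1))] = runs
    # Sort numerically so slide10 comes after slide2.
    return dict(sorted(slides.items()))
```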

### Design Principles

| Principle | Guideline |
|-----------|-----------|
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |

---

## Converting Between Formats

| Conversion | Tool |
|------------|------|
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Format-specific extractor (pdftotext, pandoc, markitdown) |
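
The CLI conversions in the table could be wrapped from Python roughly as follows. This is a sketch under assumptions: the function names are hypothetical, the LibreOffice binary is assumed to be on the PATH as `soffice` (some systems name it `libreoffice`), and the pandoc output format is assumed to be plain `markdown`.

```python
import subprocess
from pathlib import Path


def conversion_command(source, target_format, out_dir="."):
    """Build the CLI command for one of the conversions in the table."""
    source = Path(source)
    out_dir = Path(out_dir)
    suffix = source.suffix.lower()
    if target_format == "pdf":
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", str(out_dir), str(source)]
    if target_format == "png" and suffix == ".pdf":
        # pdftoppm writes one numbered PNG per page using this prefix.
        return ["pdftoppm", "-png", str(source), str(out_dir / source.stem)]
    if target_format == "md" and suffix == ".docx":
        return ["pandoc", str(source), "-t", "markdown",
                "-o", str(out_dir / f"{source.stem}.md")]
    raise ValueError(f"No conversion from {suffix} to {target_format}")


def convert(source, target_format, out_dir="."):
    """Run the conversion and raise CalledProcessError if the tool fails."""
    subprocess.run(conversion_command(source, target_format, out_dir), check=True)
```

Separating command construction from execution keeps the mapping easy to inspect without having the tools installed.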

---

## Best Practices

| Practice | Why |
|----------|-----|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |

## Common Packages

| Language | Packages |
|----------|----------|
| **Python** | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| **JavaScript** | docx, pptxgenjs |
| **CLI** | pandoc, qpdf, pdftotext, libreoffice |

Overview

This skill helps you automate and manage office document workflows for PDF, Excel, Word, and PowerPoint. It provides practical guidance and tooling choices for reading, extracting, converting, merging, and editing common office formats. The focus is on reliable, auditable operations like preserving formatting, using Excel formulas, and using tracked changes for legal documents.

How this skill works

The skill inspects file types and recommends libraries and CLI tools best suited for each operation (e.g., pypdf/pdfplumber for PDFs, pandas/openpyxl for Excel, python-docx for Word, python-pptx for presentations). It outlines concrete approaches: unzip DOCX/XLSX/PPTX to access XML, extract tables or text with specialized extractors, run OCR for scanned PDFs, and convert between formats using LibreOffice or pandoc. It also highlights structure locations (word/document.xml, ppt/slides/slideN.xml) to enable precise edits and audits.
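
The file-type inspection described above can be pictured as a simple lookup from extension and task to a tool. The sketch below is illustrative only: the task keys and the function name `recommend_tool` are hypothetical, and the tool names are taken from the tables in this guide.

```python
from pathlib import Path

# Tool names come from the guide; the task keys are illustrative labels.
RECOMMENDED_TOOLS = {
    ".pdf": {"read_write": "pypdf", "extract_text": "pdfplumber",
             "ocr": "pytesseract + pdf2image"},
    ".xlsx": {"analyze": "pandas", "formulas": "openpyxl"},
    ".docx": {"extract_text": "pandoc", "edit": "python-docx"},
    ".pptx": {"extract_text": "markitdown", "edit": "python-pptx"},
}


def recommend_tool(filename, task):
    """Return the suggested tool for a file and task, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    tools = RECOMMENDED_TOOLS.get(suffix)
    if tools is None:
        raise ValueError(f"Unsupported file type: {filename}")
    if task not in tools:
        raise ValueError(f"No recommendation for task {task!r} on {suffix} files")
    return tools[task]
```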

When to use it

- Extract structured data or tables from PDFs or scanned pages.
- Merge, split, rotate, or encrypt PDF files programmatically.
- Generate or update Excel models while preserving formulas and formats.
- Programmatically edit Word documents and preserve tracked changes for legal workflows.
- Create or modify PowerPoint slides, export notes, or extract assets for reuse.

Best practices

- Prefer writing Excel formulas into workbooks instead of flattening computed values to keep models dynamic.
- Preserve document formatting and media when editing to avoid corrupting files or breaking layouts.
- Use tracked changes (edit XML) for contracts and legal documents to maintain an audit trail.
- Test outputs by opening them in target applications early to catch corruption or compatibility issues.
- Use OCR (pytesseract + pdf2image) only when PDFs are scanned images; prefer text extractors for born-digital PDFs.

Example use cases

- Batch-merge monthly PDF reports, add page numbers, and encrypt the final file.
- Extract tables from financial PDF statements into DataFrames for analysis with pandas.
- Inject formulas and formatted inputs into an Excel financial model using openpyxl while keeping formulas intact.
- Convert DOCX to Markdown with pandoc for review, or apply tracked-change edits via XML for legal sign-off.
- Extract slides and speaker notes from PPTX and generate a plain-text briefing or reuse images from ppt/media/.

FAQ

Which Python libraries should I pick for basic PDF tasks?

Use pypdf for reading/writing and pdfplumber for reliable text and table extraction; use pytesseract + pdf2image for OCR of scanned pages.

How do I keep Excel workbooks dynamic when updating values?

Write Excel formulas rather than precomputing results in Python. Use openpyxl to insert formulas so the workbook recalculates in Excel.

When should I edit Word/PPTX via XML instead of high-level libraries?

Use XML edits for tracked changes, complex styles, or when exact control over structure is required. Use python-docx or python-pptx for routine edits to avoid low-level errors.