This skill helps you manage office documents by merging, converting, and extracting text across PDF, Word, Excel, and PowerPoint formats.
Install with `npx playbooks add skill eyadsibai/ltk --skill document-processing`.
---
name: document-processing
description: Use when working with "PDF", "Excel", "Word", "PowerPoint", "XLSX", "DOCX", "PPTX", "spreadsheets", "presentations", "extract text", "merge documents", "convert documents", or asking about "office document manipulation"
version: 1.0.0
---
# Document Processing Guide
Work with office documents: PDF, Excel, Word, and PowerPoint.
---
## Format Overview
| Format | Extension | Structure | Best For |
|--------|-----------|-----------|----------|
| **PDF** | .pdf | Binary/text | Reports, forms, archives |
| **Excel** | .xlsx | XML in ZIP | Data, calculations, models |
| **Word** | .docx | XML in ZIP | Text documents, contracts |
| **PowerPoint** | .pptx | XML in ZIP | Presentations, slides |
**Key concept**: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
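
As a minimal sketch of the unzip approach, the standard-library `zipfile` module can list and read the parts of any of these packages without extracting them to disk. The helper names `list_parts`, `read_part`, and `build_demo_package` are illustrative, and the demo archive is a stand-in, not a valid Word file.

```python
import io
import zipfile


def list_parts(source):
    """Return the sorted names of every part inside an OOXML package."""
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())


def read_part(source, part_name):
    """Return one part of the package decoded as UTF-8 text."""
    with zipfile.ZipFile(source) as package:
        return package.read(part_name).decode("utf-8")


def build_demo_package():
    """Build a tiny stand-in archive in memory (not a valid Word file)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<w:document/>")
    return buffer


if __name__ == "__main__":
    demo = build_demo_package()
    for name in list_parts(demo):
        print(name)
```

With a real file, `list_parts("report.docx")` (a hypothetical path) would show entries such as `word/document.xml` alongside the relationship and style parts.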
---
## PDF Processing
### PDF Tools
| Task | Best Tool |
|------|-----------|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |
### Common Operations
| Operation | Approach |
|-----------|----------|
| **Merge** | Loop through files, add pages to writer |
| **Split** | Create new writer per page |
| **Extract tables** | Use pdfplumber, convert to DataFrame |
| **Rotate** | Call `.rotate(angle)` on a page (pypdf expects a multiple of 90) |
| **Encrypt** | Use writer's `.encrypt()` method |
| **OCR** | Convert to images, run pytesseract |
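
The merge and split rows above could look roughly like this with pypdf. This is a sketch, not the only way to do it: the function names and the `page_001.pdf` naming scheme are assumptions, and pypdf is imported inside each function only so the module loads where the package is not installed.

```python
from pathlib import Path


def merge_pdfs(input_paths, output_path):
    """Append every page of each input PDF, in order, to one output file."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    writer = PdfWriter()
    page_count = 0
    for path in input_paths:
        for page in PdfReader(str(path)).pages:
            writer.add_page(page)
            page_count += 1
    with open(output_path, "wb") as handle:
        writer.write(handle)
    return page_count


def split_pdf(input_path, output_dir):
    """Write each page of the input PDF to its own single-page file."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, page in enumerate(PdfReader(str(input_path)).pages, start=1):
        writer = PdfWriter()  # a fresh writer per page, as in the table
        writer.add_page(page)
        target = output_dir / f"page_{number:03d}.pdf"  # assumed naming scheme
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written
```

Rotation and encryption fit the same pattern: call `page.rotate(90)` before `add_page`, or `writer.encrypt("password")` before `write`.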
---
## Excel Processing
### Excel Tools
| Task | Best Tool |
|------|-----------|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |
### Critical Rule: Use Formulas
| Approach | Result |
|----------|--------|
| **Wrong**: Calculate in Python, write value | Static number, breaks when data changes |
| **Right**: Write Excel formula | Dynamic, recalculates automatically |
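
To illustrate the rule, here is a small sketch using openpyxl that stores a `SUM` formula instead of a precomputed total. The sheet layout (labels in column A, amounts in column B, one header row) and the function names are assumptions for the example, and `rows` is assumed to be non-empty.

```python
def sum_formula(column, first_row, last_row):
    """Return an Excel SUM formula string such as '=SUM(B2:B4)'."""
    return f"=SUM({column}{first_row}:{column}{last_row})"


def write_sales_sheet(path, rows):
    """Write (label, amount) rows plus a live total row to a new workbook."""
    from openpyxl import Workbook  # third-party: pip install openpyxl

    workbook = Workbook()
    sheet = workbook.active
    sheet.append(["Item", "Amount"])  # header occupies row 1
    for label, amount in rows:
        sheet.append([label, amount])
    last_data_row = len(rows) + 1
    total_row = last_data_row + 1
    sheet.cell(row=total_row, column=1, value="Total")
    # Store the formula text, not a number computed in Python.
    sheet.cell(row=total_row, column=2, value=sum_formula("B", 2, last_data_row))
    workbook.save(path)
    return total_row
```

openpyxl writes the formula text but does not evaluate it; the total is calculated when the workbook is opened in Excel or another spreadsheet application.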
### Financial Model Standards
| Convention | Meaning |
|------------|---------|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
### Common Formula Errors
| Error | Cause |
|-------|-------|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
---
## Word Processing
### Word Tools
| Task | Best Tool |
|------|-----------|
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |
### Document Structure
| File | Contains |
|------|----------|
| `word/document.xml` | Main content |
| `word/comments.xml` | Comments |
| `word/media/` | Images |
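
As a rough sketch of reading `word/document.xml` directly, the standard library is enough to pull paragraph text out of a DOCX package. The function name `docx_paragraphs` is illustrative; the namespace URI is the standard WordprocessingML one.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the visible text of each paragraph in word/document.xml."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # w:t holds run text; w:delText (tracked deletions) is not matched.
        text = "".join(node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

This walks every `w:p` element, including those inside tables, and it ignores headers, footers, and comments, which live in separate parts. For anything beyond quick extraction, pandoc or python-docx is likely the safer choice.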
### Tracked Changes (Redlining)
| Element | XML Tag |
|---------|---------|
| Deletion | `<w:del><w:r><w:delText>...</w:delText></w:r></w:del>` |
| Insertion | `<w:ins><w:r><w:t>...</w:t></w:r></w:ins>` |
**Key concept**: For professional/legal documents, use tracked changes XML rather than replacing text directly.
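
A tracked replacement is a deletion followed by an insertion, each wrapping a run (`w:r`) that holds the text. The sketch below builds that pair with `xml.etree.ElementTree`. The helper names, the author "Reviewer", and the sample text are hypothetical; `w:id`, `w:author`, and `w:date` are the standard revision attributes, and `when` is assumed to be a UTC datetime.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)  # serialize with the conventional w: prefix


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def tracked_replacement(old_text, new_text, author, change_id, when=None):
    """Return [w:del, w:ins] elements recording old_text -> new_text."""
    stamp = (when or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    elements = []
    parts = [("del", "delText", old_text), ("ins", "t", new_text)]
    for offset, (wrapper, text_tag, text) in enumerate(parts):
        change = ET.Element(w(wrapper), {
            w("id"): str(change_id + offset),  # each revision gets its own id
            w("author"): author,
            w("date"): stamp,
        })
        run = ET.SubElement(change, w("r"))
        ET.SubElement(run, w(text_tag)).text = text
        elements.append(change)
    return elements


paragraph = ET.Element(w("p"))
paragraph.extend(tracked_replacement(
    "30 days", "45 days", "Reviewer", 1,
    when=datetime(2024, 1, 1, tzinfo=timezone.utc),
))
print(ET.tostring(paragraph, encoding="unicode"))
```

In a real document the new run would normally copy the run properties (`w:rPr`) of the text it replaces so the formatting survives, and text with leading or trailing spaces needs `xml:space="preserve"` on the text element.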
---
## PowerPoint Processing
### PowerPoint Tools
| Task | Best Tool |
|------|-----------|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |
### Slide Structure
| Path | Contains |
|------|----------|
| `ppt/slides/slide{N}.xml` | Slide content |
| `ppt/notesSlides/` | Speaker notes |
| `ppt/slideMasters/` | Master templates |
| `ppt/media/` | Images |
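
Given that layout, slide text can be collected with the standard library alone. This is a sketch under the assumption that plain text runs are all you need: it reads every `a:t` element (the DrawingML text run) from each `ppt/slides/slide{N}.xml` part. The function name `slide_texts` is illustrative.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml")


def slide_texts(source):
    """Map slide file number -> list of text runs found in that slide part."""
    texts = {}
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            match = SLIDE_PART.fullmatch(name)
            if match is None:
                continue  # skips relationship files, notes, masters, media
            root = ET.fromstring(package.read(name))
            runs = [node.text or "" for node in root.iter(f"{{{A_NS}}}t")]
            texts[int(match.group(1))] = runs
    # Sort numerically so slide10 follows slide9 rather than slide1.
    return dict(sorted(texts.items()))
```

The number in the file name is not guaranteed to match the on-screen slide order, which is defined in `ppt/presentation.xml`; python-pptx or markitdown resolve that ordering for you.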
### Design Principles
| Principle | Guideline |
|-----------|-----------|
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |
---
## Converting Between Formats
| Conversion | Tool |
|------------|------|
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Format-specific extractor: pdftotext or pdfplumber (PDF), pandoc (DOCX), markitdown (PPTX), pandas (XLSX) |
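
The conversions in this table are typically driven from the command line. The sketch below builds the argument lists in Python and runs them with `subprocess`. The function names are illustrative, and the LibreOffice binary is assumed to be on `PATH` as `soffice` (on some systems it is `libreoffice`).

```python
import shutil
import subprocess


def to_pdf_command(input_path, output_dir, binary="soffice"):
    """LibreOffice headless: convert an office document to PDF."""
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", str(output_dir), str(input_path)]


def pdf_to_png_command(input_path, output_prefix, dpi=150):
    """pdftoppm: render each PDF page to a PNG named after output_prefix."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(input_path), str(output_prefix)]


def docx_to_markdown_command(input_path, output_path):
    """pandoc: convert a DOCX file to Markdown."""
    return ["pandoc", str(input_path), "-t", "markdown", "-o", str(output_path)]


def run(command):
    """Run a command list, failing early if the tool is not installed."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not on PATH")
    subprocess.run(command, check=True)
```

For example, `run(to_pdf_command("report.docx", "out"))` would write `out/report.pdf`, assuming LibreOffice is installed and `report.docx` exists.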
---
## Best Practices
| Practice | Why |
|----------|-----|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |
## Common Packages
| Language | Packages |
|----------|----------|
| **Python** | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| **JavaScript** | docx, pptxgenjs |
| **CLI** | pandoc, qpdf, pdftotext, libreoffice |
This skill helps you automate and manage office document workflows for PDF, Excel, Word, and PowerPoint. It provides practical guidance and tooling choices for reading, extracting, converting, merging, and editing common office formats. The focus is on reliable, auditable operations like preserving formatting, using Excel formulas, and using tracked changes for legal documents.
The skill inspects file types and recommends libraries and CLI tools best suited for each operation (e.g., pypdf/pdfplumber for PDFs, pandas/openpyxl for Excel, python-docx for Word, python-pptx for presentations). It outlines concrete approaches: unzip DOCX/XLSX/PPTX to access XML, extract tables or text with specialized extractors, run OCR for scanned PDFs, and convert between formats using LibreOffice or pandoc. It also highlights structure locations (word/document.xml, ppt/slides/slideN.xml) to enable precise edits and audits.
**Which Python libraries should I pick for basic PDF tasks?**
Use pypdf for reading/writing and pdfplumber for reliable text and table extraction; use pytesseract + pdf2image for OCR of scanned pages.

**How do I keep Excel workbooks dynamic when updating values?**
Write Excel formulas rather than precomputing results in Python. Use openpyxl to insert formulas so the workbook recalculates in Excel.

**When should I edit Word/PPTX via XML instead of high-level libraries?**
Use XML edits for tracked changes, complex styles, or when exact control over structure is required. Use python-docx or python-pptx for routine edits to avoid low-level errors.