
document-processing skill


This skill helps you build and run robust document pipelines by extracting, generating, and batch-processing PDFs, Word, Excel, and PowerPoint files.

npx playbooks add skill doanchienthangdev/omgkit --skill document-processing


SKILL.md
---
name: Processing Documents
description: Processes PDF, DOCX, XLSX, and PPTX files with extraction, generation, and batch operations. Use when building document pipelines, extracting content from office files, or generating reports.
category: tools
triggers:
  - document processing
  - pdf extraction
  - docx parsing
  - excel manipulation
  - spreadsheet data
  - powerpoint generation
  - office documents
---

# Processing Documents

## Quick Start

```typescript
import pdf from 'pdf-parse';
import ExcelJS from 'exceljs';
import { Document, Packer, Paragraph, TextRun } from 'docx';

// Extract text from a PDF. Note: pdf-lib cannot read text content
// (it has no getTextContent API), so use pdf-parse for extraction.
async function extractPDFText(buffer: Buffer): Promise<string> {
  const { text } = await pdf(buffer);
  return text;
}

// Read Excel spreadsheet
async function readExcel(buffer: Buffer) {
  const workbook = new ExcelJS.Workbook();
  await workbook.xlsx.load(buffer);
  return workbook.worksheets.map(sheet => ({
    name: sheet.name,
    rows: sheet.getSheetValues(),
  }));
}

// Generate Word document
async function generateDOCX(title: string, content: string[]): Promise<Buffer> {
  const doc = new Document({
    sections: [{
      children: [
        new Paragraph({ children: [new TextRun({ text: title, bold: true, size: 48 })] }),
        ...content.map(text => new Paragraph({ children: [new TextRun(text)] })),
      ],
    }],
  });
  return await Packer.toBuffer(doc);
}
```

## Features

| Feature | Description | Guide |
|---------|-------------|-------|
| PDF Extraction | Extract text, tables, images, and metadata from PDFs | Use pdf-parse for text; pdf-lib for metadata and page-level operations |
| PDF Generation | Create PDFs from templates with data binding | Use pdf-lib with text, images, and table elements |
| DOCX Parsing | Parse Word documents preserving structure | Use mammoth to convert DOCX to HTML or plain text |
| DOCX Generation | Generate Word documents with formatting | Use docx package with paragraphs and tables |
| Excel Reading | Read spreadsheets with formulas and formatting | Use exceljs to iterate sheets and cells |
| Excel Generation | Create spreadsheets with charts and styling | Use exceljs with conditional formatting |
| PPTX Generation | Create presentations with slides and charts | Use pptxgenjs for slide creation |
| Batch Processing | Process multiple documents with concurrency | Use p-queue for controlled parallel processing |
| Template Engine | Generate documents from templates with placeholders | Use docxtemplater for DOCX templates |
| Streaming | Handle large files without memory exhaustion | Process files in chunks with streams |
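
The Streaming row above can be sketched with Node's built-in streams; the chunk size and the `processInChunks` helper below are illustrative assumptions, not part of any of the listed libraries:

```typescript
import { createReadStream } from 'fs';

// Process a large file in fixed-size chunks instead of loading it whole.
// highWaterMark controls the chunk size (64 KB here; tune per workload).
async function processInChunks(
  path: string,
  onChunk: (chunk: Buffer) => void | Promise<void>,
): Promise<number> {
  const stream = createReadStream(path, { highWaterMark: 64 * 1024 });
  let bytes = 0;
  for await (const chunk of stream) {
    bytes += (chunk as Buffer).length;
    await onChunk(chunk as Buffer);
  }
  return bytes; // total bytes seen, useful for progress reporting
}
```

For XLSX specifically, ExcelJS also ships a streaming reader (`ExcelJS.stream.xlsx.WorkbookReader`) that avoids materializing the whole workbook.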

## Common Patterns

### Batch Document Processing

```typescript
import PQueue from 'p-queue';
import { promises as fs } from 'fs';

async function processBatch(files: string[], transform: (buffer: Buffer) => Promise<Buffer>) {
  const queue = new PQueue({ concurrency: 4 });
  const results: { file: string; success: boolean; error?: string }[] = [];

  for (const file of files) {
    queue.add(async () => {
      try {
        const buffer = await fs.readFile(file);
        const output = await transform(buffer);
        await fs.writeFile(file.replace(/\.\w+$/, '_processed.pdf'), output);
        results.push({ file, success: true });
      } catch (error) {
        results.push({ file, success: false, error: error instanceof Error ? error.message : String(error) });
      }
    });
  }

  await queue.onIdle();
  return results;
}
```
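
If p-queue is unavailable, the same controlled concurrency can be sketched with a dependency-free limiter. This is a minimal sketch (the 4-worker default mirrors the example above); `runLimited` is an illustrative name, not a library API:

```typescript
// Run async tasks with at most `limit` in flight at once.
// Results come back in task order, regardless of completion order.
async function runLimited<T>(
  tasks: (() => Promise<T>)[],
  limit = 4,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    // Each worker pulls the next unclaimed index until the list is exhausted.
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, worker),
  );
  return results;
}
```

Because index claiming happens synchronously on the single-threaded event loop, no two workers can claim the same task.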

### Excel Report Generation

```typescript
import ExcelJS from 'exceljs';

async function generateReport(data: Record<string, any>[]): Promise<Buffer> {
  const workbook = new ExcelJS.Workbook();
  const sheet = workbook.addWorksheet('Report');

  // Add headers with styling
  const headers = Object.keys(data[0] || {});
  sheet.addRow(headers);
  sheet.getRow(1).font = { bold: true };
  sheet.getRow(1).fill = { type: 'pattern', pattern: 'solid', fgColor: { argb: 'FFE0E0E0' } };

  // Add data rows
  data.forEach(row => sheet.addRow(headers.map(h => row[h])));

  // Set a uniform column width (ExcelJS has no built-in auto-fit)
  sheet.columns.forEach(col => { col.width = 15; });

  return Buffer.from(await workbook.xlsx.writeBuffer());
}
```

### Invoice Generation from Template

```typescript
import Docxtemplater from 'docxtemplater';
import PizZip from 'pizzip';
import { promises as fs } from 'fs';

// Minimal shape for the data bound into the template
interface InvoiceData {
  number: string;
  date: string;
  customer: string;
  items: { description: string; quantity: number; price: number }[];
  total: number;
}

async function generateInvoice(templatePath: string, invoiceData: InvoiceData): Promise<Buffer> {
  const templateBuffer = await fs.readFile(templatePath);
  const zip = new PizZip(templateBuffer);
  const doc = new Docxtemplater(zip, { paragraphLoop: true, linebreaks: true });

  doc.render({
    invoiceNumber: invoiceData.number,
    date: invoiceData.date,
    customer: invoiceData.customer,
    items: invoiceData.items,
    total: invoiceData.total,
  });

  return doc.getZip().generate({ type: 'nodebuffer', compression: 'DEFLATE' });
}
```

### PDF Table Extraction

```typescript
import { PDFDocument } from 'pdf-lib';

// NOTE: pdf-lib exposes page geometry but not text content, so the helpers
// below (extractPageContent, detectTableStructures) are placeholders for a
// text-extraction layer (e.g. pdfjs-dist) plus a table-detection heuristic.
async function extractTables(pdfBuffer: Buffer): Promise<ExtractedTable[]> {
  const pdfDoc = await PDFDocument.load(pdfBuffer);
  const tables: ExtractedTable[] = [];

  for (let i = 0; i < pdfDoc.getPageCount(); i++) {
    const page = pdfDoc.getPage(i);
    const content = await extractPageContent(page); // placeholder helper
    const detectedTables = detectTableStructures(content); // placeholder helper
    tables.push(...detectedTables.map(t => ({ ...t, pageNumber: i + 1 })));
  }

  return tables;
}
```

## Best Practices

| Do | Avoid |
|----|-------|
| Stream large files (>10MB) to prevent memory issues | Loading entire large files into memory |
| Validate file types before processing | Assuming file extensions match content |
| Handle password-protected documents gracefully | Ignoring encrypted document errors |
| Preserve original formatting when transforming | Stripping formatting without user consent |
| Cache parsed results for repeated access | Re-parsing the same document multiple times |
| Use appropriate libraries per format | Building custom parsers for standard formats |
| Set file size limits for uploads | Processing unbounded file sizes |
| Sanitize filenames and paths | Using untrusted paths directly |
| Handle encoding issues (UTF-8, BOM) | Assuming all files use the same encoding |
| Log processing errors with context | Silently failing on corrupt files |
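
The "validate file types" rule above can be sketched by checking magic bytes rather than trusting extensions. The `sniffDocType` helper and its two-way classification are illustrative assumptions covering the four formats this skill targets:

```typescript
// Detect document type from leading bytes, not the file extension.
// PDF files start with "%PDF"; DOCX/XLSX/PPTX are ZIP containers ("PK\x03\x04").
type DocKind = 'pdf' | 'office-zip' | 'unknown';

function sniffDocType(buffer: Buffer): DocKind {
  if (buffer.length >= 4 && buffer.subarray(0, 4).toString('latin1') === '%PDF') {
    return 'pdf';
  }
  if (
    buffer.length >= 4 &&
    buffer[0] === 0x50 && buffer[1] === 0x4b &&
    buffer[2] === 0x03 && buffer[3] === 0x04
  ) {
    return 'office-zip'; // DOCX, XLSX, and PPTX all share the ZIP signature
  }
  return 'unknown';
}
```

Distinguishing DOCX from XLSX or PPTX requires opening the ZIP and inspecting its content types, so treat `office-zip` as a first-pass gate only.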

## Related Skills

- **media-processing** - Video and audio processing
- **image-processing** - Image manipulation with Sharp
- **typescript** - Type-safe document handling

## References

- [pdf-lib Documentation](https://pdf-lib.js.org/)
- [ExcelJS Documentation](https://github.com/exceljs/exceljs)
- [docx Documentation](https://docx.js.org/)
- [PptxGenJS Documentation](https://gitbrent.github.io/PptxGenJS/)

Overview

This skill processes PDF, DOCX, XLSX, and PPTX files to extract content, generate documents, and run batch operations. It combines reliable libraries and streaming patterns to handle extraction of text, tables, images, and metadata, plus generation of reports, invoices, and presentations. Use it to build robust document pipelines that scale and preserve formatting.

How this skill works

The skill uses format-specific libraries (pdf-lib, exceljs, docx, pptxgenjs, docxtemplater) to parse and produce files. It supports streaming for large files, template binding for DOCX/PDF generation, table and text extraction from PDFs, and controlled concurrency for batch jobs with p-queue. Output buffers or streams are returned for downstream storage or delivery.
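
The format-specific dispatch described above can be sketched as a handler map keyed by extension. This is a minimal sketch: `buildDispatcher` and the handler signature are illustrative assumptions, not APIs from the listed libraries:

```typescript
import { extname } from 'path';

// A handler parses one format and returns extracted text (or any payload).
type Handler = (buffer: Buffer) => Promise<string>;

// Build a dispatcher that routes a file to the handler for its extension.
function buildDispatcher(handlers: Record<string, Handler>) {
  return async (filename: string, buffer: Buffer): Promise<string> => {
    const ext = extname(filename).toLowerCase();
    const handler = handlers[ext];
    if (!handler) throw new Error(`Unsupported format: ${ext || '(none)'}`);
    return handler(buffer);
  };
}
```

In use, each entry would wrap one of the real parsers, e.g. `buildDispatcher({ '.pdf': extractPDFText, '.xlsx': readExcelAsText })`, where the handler names come from your own pipeline.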

When to use it

  • Extract structured data from office files for indexing or analysis
  • Generate reports, invoices, or presentation decks from templates and data
  • Process large batches of documents with controlled concurrency
  • Convert and normalize mixed-format inputs for downstream pipelines
  • Stream large files to avoid memory exhaustion during processing

Best practices

  • Stream files larger than ~10MB and set clear upload size limits
  • Validate MIME type and file signature before parsing
  • Cache parsed results for repeated access to reduce re-processing
  • Preserve original formatting unless users opt into transformation
  • Sanitize file names and paths; handle encoding and BOM variations
  • Use template engines for repeatable document generation and testing
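
Two of the practices above, path sanitization and BOM handling, can be sketched with small pure helpers; the character whitelist below is an illustrative choice, not a standard:

```typescript
import { basename } from 'path';

// Strip directory components and any characters outside a safe whitelist,
// so untrusted names cannot traverse directories or inject shell metacharacters.
function sanitizeFilename(name: string): string {
  return basename(name).replace(/[^a-zA-Z0-9._-]/g, '_');
}

// Remove a UTF-8 byte-order mark if present, before decoding text content.
function stripBOM(buffer: Buffer): Buffer {
  if (buffer.length >= 3 && buffer[0] === 0xef && buffer[1] === 0xbb && buffer[2] === 0xbf) {
    return buffer.subarray(3);
  }
  return buffer;
}
```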

Example use cases

  • Invoice generation: fill DOCX templates with customer and line-item data and export PDF invoices
  • Excel report builder: create styled spreadsheets with headers, autofit columns, and charts from JSON datasets
  • Batch conversion: concurrently normalize a folder of DOCX/PDF to a single PDF template for archiving
  • PDF data extraction: detect and extract tables across pages for ingestion into a database
  • Presentation generation: produce PPTX slide decks from structured summaries and images

FAQ

Which file types are supported?

PDF, DOCX, XLSX, and PPTX are supported for extraction and generation; libraries used determine feature depth (tables, images, metadata).

How do you handle large files?

Use streaming APIs and process in chunks; set size limits; use a queue with controlled concurrency to avoid memory exhaustion.
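
The size limits mentioned above can be enforced with a small guard before any parsing begins. A minimal sketch, where the 10 MB default mirrors the streaming threshold used elsewhere in this skill:

```typescript
const DEFAULT_MAX_BYTES = 10 * 1024 * 1024; // ~10 MB, matching the streaming threshold

// Reject oversized inputs early, before handing them to a parser.
function assertWithinLimit(buffer: Buffer, maxBytes = DEFAULT_MAX_BYTES): void {
  if (buffer.length > maxBytes) {
    throw new Error(
      `File is ${buffer.length} bytes; limit is ${maxBytes}. Use a streaming path instead.`,
    );
  }
}
```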