table-extractor skill

/table-extractor

This skill extracts tables from PDFs and images into CSV, Excel, or JSON, including OCR for scanned documents and multi-page support.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill table-extractor

SKILL.md
---
name: table-extractor
description: Extract tables from PDFs and images to CSV, Excel, or JSON. Supports scanned documents with OCR, multi-page PDFs, and complex table structures.
---

# Table Extractor

Extract tables from PDFs and images into structured data formats.

## Features

- **PDF Tables**: Extract tables from digital PDFs
- **Image Tables**: OCR-based extraction from images
- **Multiple Tables**: Extract all tables from a document
- **Format Export**: CSV, Excel, JSON output
- **Table Detection**: Auto-detect table boundaries
- **Column Alignment**: Smart column detection
- **Multi-Page**: Process entire PDF documents

## Quick Start

```python
from table_extractor import TableExtractor

extractor = TableExtractor()

# Extract from PDF
extractor.load_pdf("document.pdf")
tables = extractor.extract_all()

# Save first table to CSV
tables[0].to_csv("table.csv")

# Extract from image
extractor.load_image("scanned_table.png")
table = extractor.extract_table()
print(table)
```

## CLI Usage

```bash
# Extract from PDF
python table_extractor.py --input document.pdf --output tables/

# Extract specific pages
python table_extractor.py --input document.pdf --pages 1-3 --output tables/

# Extract from image
python table_extractor.py --input scan.png --output table.csv

# Export to Excel
python table_extractor.py --input document.pdf --format xlsx --output tables.xlsx

# With OCR for scanned PDFs
python table_extractor.py --input scanned.pdf --ocr --output tables/
```
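
The `--pages 1-3` flag takes a range string, but how the script parses it is not shown here. A minimal sketch of one way to expand such a spec, assuming inclusive ranges, 1-based page numbers, and optional comma-separated parts (all assumptions; `parse_page_spec` is a hypothetical helper, not part of the skill):

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a page spec such as "1-3,5" into a sorted list of page numbers.

    Hypothetical helper: assumes inclusive ranges and 1-based page numbers.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid page range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("Page numbers must be 1 or greater")
    return sorted(pages)


print(parse_page_spec("1-3"))  # [1, 2, 3]
```

The API reference below defaults to `page: int = 0`, which suggests zero-based indexing on the Python side. Whether the CLI counts from 0 or 1 is not stated, so check before converting between the two.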

## API Reference

### TableExtractor Class

```python
# Signatures only; method bodies are omitted.
from typing import Dict, List, Optional

import pandas as pd


class TableExtractor:
    def __init__(self) -> None: ...

    # Loading
    def load_pdf(self, filepath: str, pages: Optional[List[int]] = None) -> 'TableExtractor': ...
    def load_image(self, filepath: str) -> 'TableExtractor': ...

    # Extraction
    def extract_table(self, page: int = 0) -> pd.DataFrame: ...
    def extract_all(self) -> List[pd.DataFrame]: ...
    def extract_page(self, page: int) -> List[pd.DataFrame]: ...

    # Detection
    def detect_tables(self, page: int = 0) -> List[Dict]: ...
    def get_table_count(self) -> int: ...

    # Configuration
    def set_ocr(self, enabled: bool = True, lang: str = "eng") -> 'TableExtractor': ...
    def set_column_detection(self, mode: str = "auto") -> 'TableExtractor': ...

    # Export
    def to_csv(self, tables: List, output_dir: str) -> List[str]: ...
    def to_excel(self, tables: List, output: str) -> str: ...
    def to_json(self, tables: List, output: str) -> str: ...
```

## Supported Formats

### Input
- PDF documents (text-based and scanned)
- Images: PNG, JPEG, TIFF, BMP
- Screenshots with tables

### Output
- CSV (one file per table)
- Excel (multiple sheets)
- JSON (array of tables)
- Pandas DataFrame
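
The exact file layouts the skill writes are not documented here. As a rough illustration of "one CSV file per table" and "JSON array of tables", here is a standard-library sketch; the file naming (`table_0.csv`), the row-object JSON shape, and both helper functions are assumptions made for the example:

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import List

Table = List[List[str]]  # first row is treated as the header


def write_csv_per_table(tables: List[Table], output_dir: Path) -> List[Path]:
    """Write each table to its own CSV file and return the paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, rows in enumerate(tables):
        path = output_dir / f"table_{index}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths


def tables_to_json(tables: List[Table]) -> str:
    """Serialize tables as a JSON array; each table becomes a list of row objects."""
    payload = []
    for rows in tables:
        if not rows:
            payload.append([])
            continue
        header, body = rows[0], rows[1:]
        payload.append([dict(zip(header, row)) for row in body])
    return json.dumps(payload, indent=2)


# Made-up sample data for illustration
sample_tables = [
    [["item", "qty"], ["apple", "3"], ["pear", "5"]],
    [["region", "total"], ["north", "12"]],
]

with tempfile.TemporaryDirectory() as tmp:
    written = write_csv_per_table(sample_tables, Path(tmp))
    print([path.name for path in written])

print(tables_to_json(sample_tables))
```

When the tables are already pandas DataFrames, `DataFrame.to_csv` and `DataFrame.to_json` cover the same ground with less code.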

## Table Detection

```python
# Detect tables without extracting
tables_info = extractor.detect_tables(page=0)
# Returns:
# [
#     {"index": 0, "rows": 10, "cols": 5, "bbox": (x1, y1, x2, y2)},
#     {"index": 1, "rows": 8, "cols": 3, "bbox": (x1, y1, x2, y2)}
# ]
```
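
Detection metadata in this shape can be used to skip small fragments before running a full extraction. A minimal sketch that works on plain dictionaries like the ones above; `select_tables`, `bbox_area`, the thresholds, and the sample coordinates are all hypothetical:

```python
from typing import Dict, List, Tuple


def bbox_area(bbox: Tuple[float, float, float, float]) -> float:
    """Area of an (x1, y1, x2, y2) bounding box; degenerate boxes count as 0."""
    x1, y1, x2, y2 = bbox
    return max(0, x2 - x1) * max(0, y2 - y1)


def select_tables(tables_info: List[Dict], min_rows: int = 2, min_cols: int = 2) -> List[Dict]:
    """Keep only detections that are at least min_rows x min_cols."""
    return [
        info for info in tables_info
        if info["rows"] >= min_rows and info["cols"] >= min_cols
    ]


# Made-up detections in the same shape as the output above
sample_info = [
    {"index": 0, "rows": 10, "cols": 5, "bbox": (50, 100, 550, 400)},
    {"index": 1, "rows": 1, "cols": 3, "bbox": (50, 420, 300, 440)},
]

kept = select_tables(sample_info)
largest = max(kept, key=lambda info: bbox_area(info["bbox"]))
print([info["index"] for info in kept], largest["index"])
```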

## Example Workflows

### PDF Report Tables
```python
extractor = TableExtractor()
extractor.load_pdf("quarterly_report.pdf")

# Extract all tables
tables = extractor.extract_all()

# Export each to CSV
for i, table in enumerate(tables):
    table.to_csv(f"table_{i}.csv", index=False)
```

### Scanned Document
```python
extractor = TableExtractor()
extractor.set_ocr(enabled=True, lang="eng")
extractor.load_image("scanned_form.png")

table = extractor.extract_table()
print(table)
```
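
How the skill turns OCR output into table rows is not described here. One common approach is to group recognized words by vertical position and then order each group left to right. The sketch below takes word positions as plain tuples so it runs without Tesseract; in practice such positions could come from `pytesseract.image_to_data`, which reports `text`, `left`, and `top` for each word. The `Word` type, `group_words_into_rows`, and the tolerance value are assumptions for illustration:

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    left: int  # x position of the word's left edge, in pixels
    top: int   # y position of the word's top edge, in pixels


def group_words_into_rows(words: List[Word], y_tolerance: int = 10) -> List[List[str]]:
    """Cluster words into rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.top):
        # Join the current row if the word is close to that row's first word vertically
        if rows and abs(word.top - rows[-1][0].top) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.left)] for row in rows]


# Made-up word positions for a two-row table
sample_words = [
    Word("Qty", 200, 12), Word("Item", 20, 10),
    Word("3", 205, 41), Word("Apple", 20, 40),
]
print(group_words_into_rows(sample_words))  # [['Item', 'Qty'], ['Apple', '3']]
```

This treats every word as its own cell. Cells containing several words need an additional step that clusters words by horizontal position into columns.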

## Dependencies

- pdfplumber>=0.10.0
- pillow>=10.0.0
- pandas>=2.0.0
- pytesseract>=0.3.10 (for OCR)
- opencv-python>=4.8.0

Overview

This skill extracts tables from PDFs and images into structured formats: CSV, Excel, JSON, or pandas DataFrames. It handles both digital PDFs and scanned documents (via OCR), processes multi-page inputs, and detects table boundaries and column alignment automatically. It is intended for data pipelines and analysis workflows.

How this skill works

The extractor loads a PDF or image, detects table regions on each page, and parses cell content into DataFrame objects. For scanned or raster inputs it runs OCR before table parsing to recover text. It can return multiple tables per page, provide detection metadata (rows, columns, bounding box), and export results to CSV, Excel, or JSON.
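
The skill lists pdfplumber as a dependency, but its internal calls are not shown here. For digital PDFs, the load, detect, and parse steps could look roughly like the sketch below, which uses pdfplumber's `Page.extract_tables()`. The helper names `clean_table` and `extract_tables` are hypothetical, and this is not presented as the skill's actual implementation:

```python
from typing import List, Optional

RawTable = List[List[Optional[str]]]  # pdfplumber returns None for empty cells


def clean_table(raw: RawTable) -> List[List[str]]:
    """Replace missing cells with "", collapse whitespace, and drop fully empty rows."""
    cleaned = []
    for row in raw:
        cells = [" ".join((cell or "").split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def extract_tables(pdf_path: str) -> List[List[List[str]]]:
    """Collect every table pdfplumber finds, page by page."""
    import pdfplumber  # imported here so clean_table can be used without pdfplumber installed

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for raw in page.extract_tables():
                tables.append(clean_table(raw))
    return tables


print(clean_table([["Name", None], ["  Ada \n Lovelace", "36"], [None, ""]]))
# [['Name', ''], ['Ada Lovelace', '36']]
```

If the first row of a cleaned table is its header, `pd.DataFrame(rows[1:], columns=rows[0])` turns it into a DataFrame.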

When to use it

  • Convert report tables from multi-page PDFs into spreadsheets for analysis
  • Digitize tables from scanned forms, receipts, or screenshots using OCR
  • Batch-extract all tables from a document for ETL or data ingestion
  • Export tables from images to CSV/Excel for sharing or downstream processing
  • Validate and inspect detected tables before automated ingestion

Best practices

  • Enable OCR (set_ocr(True)) for scanned PDFs and photos to improve text recovery
  • Preview detected table bounding boxes (detect_tables) when layouts are complex
  • Use multi-page loading to process entire documents in one run and reduce overhead
  • Tune column detection mode for documents with irregular separators (set_column_detection)
  • Validate exported files by sampling a few tables before bulk processing
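
The sampling step in the last point can be sketched as a small quality check over a random subset of extracted tables. The metrics (ragged rows, share of empty cells) and the function names are assumptions chosen for illustration, and tables are represented as plain lists of rows:

```python
import random
from typing import Dict, List

Table = List[List[str]]


def table_report(rows: Table) -> Dict[str, object]:
    """Summarize basic quality signals for one extracted table."""
    widths = {len(row) for row in rows}
    cells = [cell for row in rows for cell in row]
    empty = sum(1 for cell in cells if not cell.strip())
    return {
        "rows": len(rows),
        "ragged": len(widths) > 1,  # True when rows have differing cell counts
        "empty_ratio": empty / len(cells) if cells else 1.0,
    }


def sample_reports(tables: List[Table], k: int = 3, seed: int = 0) -> Dict[int, Dict[str, object]]:
    """Report on up to k randomly chosen tables, keyed by table index."""
    rng = random.Random(seed)
    picked = rng.sample(range(len(tables)), min(k, len(tables)))
    return {index: table_report(tables[index]) for index in sorted(picked)}


# Made-up tables: one clean, one with a short row
tables = [
    [["a", "b"], ["1", ""]],
    [["a", "b"], ["1"]],
]
for index, report in sample_reports(tables).items():
    print(index, report)
```

A high empty-cell ratio or ragged rows usually points to a column detection problem worth fixing before a bulk run.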

Example use cases

  • Financial report extraction: pull balance sheets and other tables from quarterly PDFs into Excel for analysis
  • Invoice and receipt digitization: OCR scanned receipts and export line items to CSV for bookkeeping
  • Survey and form processing: extract tabular responses from scanned forms into a DataFrame
  • Research data collection: harvest tables from academic PDFs and convert to JSON for downstream tools
  • Automated ETL: batch-process multi-page PDFs and save each table as a sheet in a single Excel file

FAQ

Does it work with scanned PDFs?

Yes. Enable OCR with set_ocr(True) and, if needed, set the language, for example set_ocr(True, lang="eng"). The tool then runs OCR before table parsing to recover text from the scan. On the command line, pass --ocr.

Can I extract multiple tables from the same page?

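Yes. extract_page returns every table detected on a page as a separate DataFrame, and extract_all does the same across the whole document. Metadata such as row count, column count, and bounding box comes from detect_tables.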

What output formats are supported?

You can export to CSV (one file per table), Excel (multiple sheets), JSON (array of tables), or use pandas DataFrame objects directly.