home / skills / dkyazzentwatwa / chatgpt-skills / table-extractor
This skill extracts tables from PDFs and images into CSV, Excel, or JSON, including OCR for scanned documents and multi-page support.
npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill table-extractorReview the files below or copy the command above to add this skill to your agents.
---
name: table-extractor
description: Extract tables from PDFs and images to CSV or Excel. Support for scanned documents with OCR, multi-page PDFs, and complex table structures.
---
# Table Extractor
Extract tables from PDFs and images into structured data formats.
## Features
- **PDF Tables**: Extract tables from digital PDFs
- **Image Tables**: OCR-based extraction from images
- **Multiple Tables**: Extract all tables from document
- **Format Export**: CSV, Excel, JSON output
- **Table Detection**: Auto-detect table boundaries
- **Column Alignment**: Smart column detection
- **Multi-Page**: Process entire PDF documents
## Quick Start
```python
from table_extractor import TableExtractor
extractor = TableExtractor()
# Extract from PDF
extractor.load_pdf("document.pdf")
tables = extractor.extract_all()
# Save first table to CSV
tables[0].to_csv("table.csv")
# Extract from image
extractor.load_image("scanned_table.png")
table = extractor.extract_table()
print(table)
```
## CLI Usage
```bash
# Extract from PDF
python table_extractor.py --input document.pdf --output tables/
# Extract specific pages
python table_extractor.py --input document.pdf --pages 1-3 --output tables/
# Extract from image
python table_extractor.py --input scan.png --output table.csv
# Export to Excel
python table_extractor.py --input document.pdf --format xlsx --output tables.xlsx
# With OCR for scanned PDFs
python table_extractor.py --input scanned.pdf --ocr --output tables/
```
## API Reference
### TableExtractor Class
```python
class TableExtractor:
def __init__(self)
# Loading
def load_pdf(self, filepath: str, pages: List[int] = None) -> 'TableExtractor'
def load_image(self, filepath: str) -> 'TableExtractor'
# Extraction
def extract_table(self, page: int = 0) -> pd.DataFrame
def extract_all(self) -> List[pd.DataFrame]
def extract_page(self, page: int) -> List[pd.DataFrame]
# Detection
def detect_tables(self, page: int = 0) -> List[Dict]
def get_table_count(self) -> int
# Configuration
def set_ocr(self, enabled: bool = True, lang: str = "eng") -> 'TableExtractor'
def set_column_detection(self, mode: str = "auto") -> 'TableExtractor'
# Export
def to_csv(self, tables: List, output_dir: str) -> List[str]
def to_excel(self, tables: List, output: str) -> str
def to_json(self, tables: List, output: str) -> str
```
## Supported Formats
### Input
- PDF documents (text-based and scanned)
- Images: PNG, JPEG, TIFF, BMP
- Screenshots with tables
### Output
- CSV (one file per table)
- Excel (multiple sheets)
- JSON (array of tables)
- Pandas DataFrame
## Table Detection
```python
# Detect tables without extracting
tables_info = extractor.detect_tables(page=0)
# Returns:
# [
# {"index": 0, "rows": 10, "cols": 5, "bbox": (x1, y1, x2, y2)},
# {"index": 1, "rows": 8, "cols": 3, "bbox": (x1, y1, x2, y2)}
# ]
```
## Example Workflows
### PDF Report Tables
```python
extractor = TableExtractor()
extractor.load_pdf("quarterly_report.pdf")
# Extract all tables
tables = extractor.extract_all()
# Export each to CSV
for i, table in enumerate(tables):
table.to_csv(f"table_{i}.csv", index=False)
```
### Scanned Document
```python
extractor = TableExtractor()
extractor.set_ocr(enabled=True, lang="eng")
extractor.load_image("scanned_form.png")
table = extractor.extract_table()
print(table)
```
## Dependencies
- pdfplumber>=0.10.0
- pillow>=10.0.0
- pandas>=2.0.0
- pytesseract>=0.3.10 (for OCR)
- opencv-python>=4.8.0
This skill extracts tables from PDFs and images into clean, structured formats like CSV, Excel, JSON, or pandas DataFrame. It supports digital PDFs and scanned documents via OCR, handles multi-page inputs, and detects complex table boundaries and column alignment automatically. The goal is reliable, production-ready table extraction for data pipelines and analysis.
The extractor loads a PDF or image, detects table regions on each page, and parses cell content into DataFrame objects. For scanned or raster inputs it runs OCR before table parsing to recover text. It can return multiple tables per page, provide detection metadata (rows, columns, bounding box), and export results to CSV, Excel, or JSON.
Does it work with scanned PDFs?
Yes. Enable OCR with set_ocr(True) and specify language; the tool runs OCR before table parsing to recover text from scans.
Can I extract multiple tables from the same page?
Yes. extract_all or extract_page returns all detected tables per page as separate DataFrame objects with metadata.
What output formats are supported?
You can export to CSV (one file per table), Excel (multiple sheets), JSON (array of tables), or use pandas DataFrame objects directly.