document-ai skill

This skill helps you extract and structure data from PDFs and other documents using OCR, table and invoice parsing, and multimodal vision models.

npx playbooks add skill omer-metin/skills-for-antigravity --skill document-ai

SKILL.md

---
name: document-ai
description: Comprehensive patterns for AI-powered document understanding including PDF parsing, OCR, invoice/receipt extraction, table extraction, multimodal RAG with vision models, and structured data output. Use when "document parsing, PDF extraction, OCR, invoice processing, receipt extraction, document understanding, LlamaParse, Unstructured, vision document, table extraction, structured output from PDF" is mentioned.
---

# Document AI

## Identity



## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and why they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

## Overview

This skill provides comprehensive patterns and practical implementations for AI-powered document understanding: PDF parsing, OCR, invoice and receipt extraction, table extraction, multimodal retrieval-augmented generation (RAG) with vision models, and structured data output. It focuses on repeatable patterns that produce validated, machine-readable outputs suitable for downstream workflows. Use it when you need robust, auditable document parsing and extraction pipelines.

## How this skill works

The skill prescribes concrete patterns for ingestion, OCR, layout analysis, entity and table extraction, and normalization into structured schemas. It ties each pattern to reference guidance: follow references/patterns.md for construction, references/sharp_edges.md for risks and failure modes, and references/validations.md for strict output constraints and validation rules. Implementations combine OCR engines, layout parsers, LLMs (for interpretation and RAG), and deterministic post-processing so that the final structured outputs are consistent and can be validated.
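
The staged flow described above can be sketched as follows. This is a minimal illustration, not the pattern from references/patterns.md: `PageText`, `ExtractionResult`, `run_pipeline`, and the stand-in functions are hypothetical names, and the fake OCR and interpretation steps exist only so the sketch runs without an OCR engine or an LLM.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PageText:
    page: int
    text: str  # raw OCR text, kept for audit


@dataclass
class ExtractionResult:
    fields: Dict[str, str]
    raw_pages: List[PageText]
    errors: List[str] = field(default_factory=list)


def run_pipeline(
    pages: List[bytes],
    ocr: Callable[[bytes], str],
    interpret: Callable[[List[PageText]], Dict[str, str]],
    validate: Callable[[Dict[str, str]], List[str]],
) -> ExtractionResult:
    # Stage 1: OCR/layout only, no model interpretation yet.
    raw_pages = [PageText(page=i + 1, text=ocr(p)) for i, p in enumerate(pages)]
    # Stage 2: interpretation sees only the extracted text.
    fields = interpret(raw_pages)
    # Stage 3: validation; problems are reported, not silently dropped.
    errors = validate(fields)
    return ExtractionResult(fields=fields, raw_pages=raw_pages, errors=errors)


# Hypothetical stand-ins for a real OCR engine, LLM call, and rule set.
def fake_ocr(page: bytes) -> str:
    return page.decode("utf-8")


def fake_interpret(pages: List[PageText]) -> Dict[str, str]:
    text = "\n".join(p.text for p in pages)
    match = re.search(r"Total:\s*([0-9]+\.[0-9]{2})", text)
    return {"total": match.group(1)} if match else {}


def require_total(fields: Dict[str, str]) -> List[str]:
    return [] if "total" in fields else ["missing field: total"]


result = run_pipeline(
    [b"Invoice 42\nTotal: 12.50"], fake_ocr, fake_interpret, require_total
)
print(result.fields, result.errors)
```

Passing each stage in as a callable keeps the OCR step swappable and makes it easy to see which stage introduced an error.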

## When to use it

* Parsing and extracting structured data from PDFs, scans, and images (invoices, receipts, contracts).
* Performing robust table extraction from complex layouts for analytics or ETL ingestion.
* Building multimodal RAG systems that combine vision models with LLMs for context-aware document QA.
* Converting heterogeneous document outputs into validated JSON/CSV for downstream systems.
* Diagnosing recurring extraction failures or reducing OCR/interpretation error rates.

## Best practices

* Start from references/patterns.md to select the appropriate ingestion and parsing pattern before coding.
* Validate every extraction against references/validations.md rules and surface schema mismatches early.
* Use the failure modes in references/sharp_edges.md to design prechecks and fallback strategies.
* Separate OCR/layout stages from LLM interpretation to limit hallucination and improve traceability.
* Normalize outputs (dates, currencies, addresses) to canonical formats and keep raw text for audit.
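
As an illustration of the normalization practice above, here is a minimal sketch using only the Python standard library. The function names, the list of accepted date formats, and the day-first reading of slash-separated dates are assumptions for the example, not rules taken from references/validations.md. Each helper returns the raw string alongside the normalized value so the original text stays available for audit.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation
from typing import Dict, Optional

# Assumption: slash-separated dates are day-first (DD/MM/YYYY).
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def normalize_date(raw: str) -> Dict[str, Optional[str]]:
    """Return the raw text plus an ISO 8601 date, or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            value = datetime.strptime(raw.strip(), fmt).date().isoformat()
            return {"raw": raw, "value": value}
        except ValueError:
            continue
    return {"raw": raw, "value": None}


def normalize_amount(raw: str) -> Dict[str, Optional[str]]:
    """Return the raw text plus a two-decimal amount string, or None."""
    # Assumption: "$" prefix and "," thousands separators only.
    cleaned = raw.strip().replace("$", "").replace(",", "")
    try:
        value = str(Decimal(cleaned).quantize(Decimal("0.01")))
    except InvalidOperation:
        value = None
    return {"raw": raw, "value": value}


print(normalize_date("March 5, 2024"))
print(normalize_amount("$1,234.5"))
```

Returning `None` rather than a guessed value lets a later validation step flag the field instead of passing a wrong date or amount downstream.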

## Example use cases

* Invoice processing: OCR + table extraction + entity normalization to produce AP-ready JSON.
* Receipt capture for expense workflows: photo ingest, OCR, merchant/date/amount extraction, validation.
* Contract clause extraction: layout parsing + sentence-level QA via multimodal RAG to populate a contract database.
* Tabular data pipeline: extract complex tables from financial reports into analytic-ready CSV/Parquet.
* Document QA assistant: combine vision models and RAG for user questions over scanned manuals or reports.
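
For the invoice use case, a final assembly step might look like the sketch below. The record layout (`vendor`, `line_items`, `total`, `total_matches_line_items`) and the function name are hypothetical, chosen only to show one cross-check that is cheap to run after table extraction: recomputing the total from the line items with exact decimal arithmetic.

```python
import json
from decimal import Decimal
from typing import Dict, List


def build_invoice_record(
    vendor: str, line_items: List[Dict[str, str]], stated_total: str
) -> Dict[str, object]:
    # Amounts stay as strings in the record; Decimal is used only for the check,
    # which avoids float rounding differences.
    computed = sum(
        (Decimal(item["qty"]) * Decimal(item["unit_price"]) for item in line_items),
        Decimal("0"),
    )
    return {
        "vendor": vendor,
        "line_items": line_items,
        "total": stated_total,
        "total_matches_line_items": computed == Decimal(stated_total),
    }


record = build_invoice_record(
    "ACME Supplies",
    [
        {"description": "Paper", "qty": "2", "unit_price": "4.50"},
        {"description": "Pens", "qty": "1", "unit_price": "3.25"},
    ],
    "12.25",
)
print(json.dumps(record, indent=2))
```

A mismatch does not say which value is wrong, only that the extracted table and the extracted total disagree and the document should be reviewed.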

## FAQ

What reference files must I follow when implementing patterns?

Use references/patterns.md for design patterns, references/sharp_edges.md to understand common failures and risks, and references/validations.md to enforce output rules and schema validation.

How do I reduce hallucinations from LLM-based interpretation?

Run LLM interpretation only after the deterministic OCR and layout steps, give the model grounded context (the extracted text rather than open-ended instructions), use strict schema prompts, and validate outputs against references/validations.md.
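
One way to apply this is to check the model's JSON output before accepting it. The sketch below is an assumption-laden example: `REQUIRED_FIELDS` is a hypothetical receipt schema, not the contents of references/validations.md, and the grounding check is a plain substring match of each raw value against the OCR text, which would have to run before any normalization.

```python
import json
from typing import List

# Hypothetical schema for a receipt; real rules would come from the reference files.
REQUIRED_FIELDS = {"merchant": str, "date": str, "amount": str}


def check_llm_output(llm_json: str, ocr_text: str) -> List[str]:
    """Return a list of problems; an empty list means the output passed."""
    try:
        data = json.loads(llm_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for {name}")
        elif data[name] not in ocr_text:
            # The value does not appear in the source text: possible hallucination.
            problems.append(f"ungrounded value for {name}: {data[name]!r}")
    for name in sorted(data.keys() - REQUIRED_FIELDS.keys()):
        problems.append(f"unexpected field: {name}")
    return problems


ocr_text = "ACME Store\n2024-03-05\nTotal 12.50"
good = '{"merchant": "ACME Store", "date": "2024-03-05", "amount": "12.50"}'
bad = '{"merchant": "ACME Store", "date": "2024-03-05", "amount": "99.00"}'
print(check_llm_output(good, ocr_text))
print(check_llm_output(bad, ocr_text))
```

Exact substring matching is strict and will reject values the model reformatted, so a real pipeline may need a looser comparison for some fields.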