pdf-skill skill

/skills/.trae/skills/pdf-skill

This skill extracts text and metadata from PDF files using pypdf, enabling easy retrieval of content, author info, and page count.

npx playbooks add skill feiwanghub/playground --skill pdf-skill

SKILL.md

---
name: pdf-skill
description: "Extract text and metadata from PDF files using pypdf."
---

# PDF Extraction Skill

This skill allows you to extract text and metadata from PDF files programmatically.

## Capabilities

- Extract full text from PDF documents
- Retrieve PDF metadata (author, title, creation date, etc.)
- Get basic document information (page count)

## Usage

Extract text from a PDF:

```bash
python3 .shared/pdf-skill/scripts/extract_pdf.py "document.pdf"
```

Overview

This skill extracts text and metadata from PDF files using pypdf. It provides full-text extraction, basic document information like page count, and common metadata such as author, title, and creation date. It is designed for programmatic use in pipelines, scripts, or automation tasks.

How this skill works

The skill opens a PDF file with pypdf and extracts text page by page, concatenating the results while keeping simple page boundaries. It also reads the PDF's document information dictionary to return metadata fields (author, title, subject, creator, producer, creation and modification dates). Finally, it reports basic document properties such as page count and can return the results in a structured form for downstream processing.
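
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual `extract_pdf.py`, whose source is not shown here. The function names `extract_from_reader` and `extract_pdf`, the form-feed page separator, and the shape of the returned dictionary are all assumptions. The sketch relies on pypdf's `PdfReader`, `reader.pages`, `page.extract_text()`, and `reader.metadata`.

```python
from typing import Any

PAGE_BREAK = "\f"  # assumed page separator; the real script may use something else

# Metadata attributes exposed by pypdf's document information object.
METADATA_FIELDS = (
    "author", "title", "subject", "creator", "producer",
    "creation_date", "modification_date",
)


def extract_from_reader(reader: Any) -> dict:
    """Collect text, metadata, and page count from a pypdf-style reader."""
    # Fall back to "" so a page without extractable text still keeps its slot.
    pages = [page.extract_text() or "" for page in reader.pages]

    meta = reader.metadata  # None when the PDF has no information dictionary
    if meta is None:
        metadata = {}
    else:
        metadata = {name: getattr(meta, name, None) for name in METADATA_FIELDS}

    return {
        "page_count": len(pages),
        "text": PAGE_BREAK.join(pages),
        "metadata": metadata,
    }


def extract_pdf(path: str) -> dict:
    """Open a PDF from disk with pypdf and extract its contents."""
    # Imported here so extract_from_reader() has no hard dependency on pypdf.
    from pypdf import PdfReader

    return extract_from_reader(PdfReader(path))
```

With pypdf installed, `extract_pdf("document.pdf")` would return a dictionary with `page_count`, `text`, and `metadata` keys. Malformed date strings in a PDF may cause the date properties to raise, so a production version would likely guard those two fields separately.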

When to use it

- Indexing or searching large collections of PDFs
- Automating extraction for data pipelines or ETL jobs
- Preprocessing documents for NLP, summarization, or classification
- Harvesting bibliographic metadata for catalogs or inventories
- Quick analysis of PDF contents without manual opening

Best practices

- Run OCR on scanned PDFs before extraction, since pypdf reads only an existing text layer
- Handle inconsistent character encodings and normalize whitespace after extraction
- Check that each metadata field is present and fall back to a default when it is missing
- Process large batches in parallel, but cap the number of concurrent workers to avoid exhausting memory or file handles
- Log page-level extraction errors and skip the affected pages rather than failing the whole job
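
Several of these practices can be combined in a few small helpers. The sketch below is illustrative only: the names `normalize_whitespace`, `with_defaults`, and `extract_pages_safely`, the logger name, and the exact normalization rules are assumptions, not part of the skill. The only pypdf calls assumed are `reader.pages` and `page.extract_text()`.

```python
import logging
import re
from typing import Any

logger = logging.getLogger("pdf_extract")  # hypothetical logger name


def normalize_whitespace(text: str) -> str:
    """Unify line endings, collapse runs of spaces, and cap blank lines at one."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t\u00a0]+", " ", text)   # spaces, tabs, non-breaking spaces
    text = re.sub(r" ?\n ?", "\n", text)        # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)      # at most one empty line in a row
    return text.strip()


def with_defaults(metadata: dict, defaults: dict) -> dict:
    """Fill missing or empty metadata fields with fallback values."""
    merged = dict(defaults)
    merged.update({k: v for k, v in metadata.items() if v not in (None, "")})
    return merged


def extract_pages_safely(reader: Any) -> tuple[list[str], list[int]]:
    """Extract text page by page; log and skip pages that fail."""
    texts: list[str] = []
    failed: list[int] = []
    for number, page in enumerate(reader.pages, start=1):
        try:
            texts.append(normalize_whitespace(page.extract_text() or ""))
        except Exception:  # broad on purpose: one bad page should not stop the job
            logger.exception("Skipping page %d: text extraction failed", number)
            failed.append(number)
            texts.append("")  # keep list positions aligned with page numbers
    return texts, failed
```

Returning the failed page numbers alongside the text lets the caller decide whether a document with many skipped pages should be re-queued, for example for OCR.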

Example use cases

- Extract full text to feed a search index or vector database
- Pull author, title, and dates to populate a document management system
- Preprocess PDFs for NLP tasks like named entity recognition or summarization
- Build a quick report of page counts and metadata for a document collection audit
- Automate ingestion of invoices, contracts, and reports into backend systems
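
As an illustration of the collection-audit use case, the following sketch writes one CSV row per PDF using a bounded thread pool. It is not part of the skill: the function names `summarize` and `audit`, the column set, and the default of four workers are assumptions made for the example.

```python
import csv
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable

FIELDS = ["file", "pages", "author", "title", "error"]  # assumed report columns


def summarize(path: Path) -> dict:
    """Return page count and a few metadata fields for one PDF."""
    from pypdf import PdfReader  # requires pypdf to be installed

    reader = PdfReader(path)
    meta = reader.metadata  # may be None
    return {
        "file": path.name,
        "pages": len(reader.pages),
        "author": getattr(meta, "author", None) or "",
        "title": getattr(meta, "title", None) or "",
    }


def audit(paths: Iterable[Path], out_csv: Path,
          summarize_one: Callable[[Path], dict] = summarize,
          max_workers: int = 4) -> int:
    """Write one CSV row per input file and return the number of rows."""

    def safe(path: Path) -> dict:
        try:
            return {**summarize_one(path), "error": ""}
        except Exception as exc:  # unreadable file: record the error and move on
            return {"file": path.name, "pages": "", "author": "",
                    "title": "", "error": str(exc)}

    # max_workers caps how many files are open at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(safe, paths))  # map() preserves input order

    with out_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A call such as `audit(sorted(Path("docs").glob("*.pdf")), Path("audit.csv"))` would produce the report. Because PDF parsing in pypdf is pure Python, threads mainly overlap file I/O; for CPU-heavy batches a process pool may be the better fit.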

FAQ

Which PDFs work best?

Text-based PDFs yield the best results. Scanned PDFs contain only page images, so they need OCR before text can be extracted reliably; pypdf does not perform OCR itself.
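
One rough way to spot PDFs that probably need OCR is to look for pages that yield almost no text. The heuristic below is an assumption for illustration, not a feature of the skill: the function name `pages_needing_ocr` and the 20-character threshold are hypothetical, and genuinely blank pages will be flagged as well.

```python
from typing import Any


def pages_needing_ocr(reader: Any, min_chars: int = 20) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is nearly empty."""
    suspects: list[int] = []
    for number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").strip()
        if len(text) < min_chars:  # threshold is a guess; tune it for your corpus
            suspects.append(number)
    return suspects
```

Here `reader` is expected to be a pypdf `PdfReader`. If most pages of a document are returned, the file is likely a scan and should go through OCR before extraction.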

Does it preserve formatting?

It extracts plain text and basic page boundaries but does not preserve complex visual layout or styling.