/skills/.trae/skills/pdf-skill
This skill extracts text and metadata from PDF files using pypdf, enabling easy retrieval of content, author info, and page count.
npx playbooks add skill feiwanghub/playground --skill pdf-skill
Copy the command above to add this skill to your agents.
---
name: pdf-skill
description: "Extract text and metadata from PDF files using pypdf."
---
# PDF Extraction Skill
This skill allows you to extract text and metadata from PDF files programmatically.
## Capabilities
- Extract full text from PDF documents
- Retrieve PDF metadata (author, title, creation date, etc.)
- Get basic document information (page count)
## Usage
Extract text from a PDF:
```bash
python3 .shared/pdf-skill/scripts/extract_pdf.py "document.pdf"
```
The skill provides full-text extraction, basic document information such as page count, and common metadata such as author, title, and creation date. It is designed for programmatic use in pipelines, scripts, and automation tasks.
The skill opens a PDF file and reads its pages with pypdf, concatenating the text while preserving simple page boundaries. It also reads the PDF's document information dictionary to return metadata fields (author, title, subject, creator, producer, and creation and modification dates). Finally, it reports basic document properties such as page count and can return structured results for downstream processing.
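The flow described above could be sketched roughly as follows. This is not the contents of `extract_pdf.py`, which is not shown here; the function names (`summarize_reader`, `extract_pdf`), the result keys, and the use of a form feed (`\f`) as the page separator are all assumptions made for illustration. Only `PdfReader`, its `pages` and `metadata` attributes, and `extract_text()` come from pypdf.

```python
# Hypothetical sketch of the extraction flow; names and result shape are assumptions.

METADATA_FIELDS = (
    "author", "title", "subject", "creator", "producer",
    "creation_date", "modification_date",
)


def _to_jsonable(value):
    """Turn datetime-like values into ISO 8601 strings; pass others through."""
    return value.isoformat() if hasattr(value, "isoformat") else value


def summarize_reader(reader):
    """Build a structured result from a pypdf-style reader object."""
    # extract_text() may yield nothing for image-only pages, so fall back to "".
    pages = [page.extract_text() or "" for page in reader.pages]

    # reader.metadata is None when the PDF has no document information dictionary.
    info = reader.metadata
    metadata = {}
    if info is not None:
        metadata = {
            name: _to_jsonable(getattr(info, name, None))
            for name in METADATA_FIELDS
        }

    return {
        "page_count": len(pages),
        "metadata": metadata,
        "pages": pages,
        # Assumed convention: a form feed marks each page boundary.
        "text": "\f".join(pages),
    }


def extract_pdf(path):
    """Open a PDF with pypdf and return the structured summary."""
    # Imported here so summarize_reader can be exercised without pypdf installed.
    from pypdf import PdfReader

    return summarize_reader(PdfReader(path))
```

Calling `extract_pdf("document.pdf")` would then return a dictionary that can be serialized with `json.dumps` for downstream processing. Keeping the per-page list alongside the joined text lets callers choose whichever page-boundary convention suits them.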
Which PDFs work best?
Text-based PDFs yield the best results. Scanned images require OCR before reliable text extraction.
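One way to spot pages that probably need OCR is to flag those whose extracted text is nearly empty. The helper below is a heuristic sketch, not part of the skill: the name `pages_likely_scanned` and the `min_chars` threshold are assumptions, and it takes a plain list of per-page strings as input.

```python
def pages_likely_scanned(page_texts, min_chars=20):
    """Return 1-based numbers of pages with almost no extractable text.

    Heuristic only: min_chars is an assumed threshold, and a page that is
    intentionally blank will be flagged as well.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged
```

If most pages of a document are flagged, it is likely a scanned PDF and should go through OCR before text extraction.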
Does it preserve formatting?
It extracts plain text and basic page boundaries but does not preserve complex visual layout or styling.