
pdf-skill skill

/skills/.trae/skills/pdf-skill

This skill extracts text and metadata from PDF files using pypdf, enabling easy retrieval of content, author info, and page count.

npx playbooks add skill feiwanghub/playground --skill pdf-skill

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: pdf-skill
description: "Extract text and metadata from PDF files using pypdf."
---

# PDF Extraction Skill

This skill allows you to extract text and metadata from PDF files programmatically.

## Capabilities

- Extract full text from PDF documents
- Retrieve PDF metadata (author, title, creation date, etc.)
- Get basic document information (page count)

## Usage

Extract text from a PDF:

```bash
python3 .shared/pdf-skill/scripts/extract_pdf.py "document.pdf"
```

Overview

This skill extracts text and metadata from PDF files using pypdf. It provides full-text extraction, basic document information like page count, and common metadata such as author, title, and creation date. It is designed for programmatic use in pipelines, scripts, or automation tasks.

How this skill works

The skill opens a PDF file and reads its pages with pypdf, concatenating the text content while preserving simple page boundaries. It also reads the PDF's document information dictionary to return metadata fields (author, title, subject, creator, producer, and creation/modification dates). Finally, it reports basic document properties such as page count and can return structured results for downstream processing.
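
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual `extract_pdf.py`, which is not shown here. The function names (`read_metadata`, `extract_text`, `summarize`, `extract_pdf`) and the form-feed page separator are assumptions made for the example; the pypdf calls used are `PdfReader`, `reader.pages`, `page.extract_text()`, and `reader.metadata`.

```python
from typing import Any

# Document-information fields exposed as attributes on pypdf's metadata object.
METADATA_FIELDS = (
    "author", "title", "subject", "creator", "producer",
    "creation_date", "modification_date",
)


def read_metadata(reader: Any) -> dict:
    """Copy common metadata fields into a plain dict (None when absent)."""
    info = reader.metadata  # None when the PDF has no information dictionary
    return {name: getattr(info, name, None) for name in METADATA_FIELDS}


def extract_text(reader: Any, page_separator: str = "\f") -> str:
    """Concatenate page text, marking page boundaries with a separator.

    The form-feed separator is an assumption of this sketch.
    """
    return page_separator.join(
        (page.extract_text() or "") for page in reader.pages
    )


def summarize(reader: Any) -> dict:
    """Build one structured result for downstream processing."""
    return {
        "page_count": len(reader.pages),
        "metadata": read_metadata(reader),
        "text": extract_text(reader),
    }


def extract_pdf(path: str) -> dict:
    """Open a PDF with pypdf and return its text, metadata, and page count."""
    # Imported here so the helpers above can be exercised without pypdf.
    from pypdf import PdfReader

    return summarize(PdfReader(path))
```

Calling `extract_pdf("document.pdf")` would return a dict with `page_count`, `metadata`, and `text` keys, which can be serialized to JSON once any date values are converted to strings.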

When to use it

  • Indexing or searching large collections of PDFs
  • Automating extraction for data pipelines or ETL jobs
  • Preprocessing documents for NLP, summarization, or classification
  • Harvesting bibliographic metadata for catalogs or inventories
  • Quick analysis of PDF contents without manual opening

Best practices

  • Run OCR on scanned PDFs before extraction; pypdf reads only an existing text layer and does not perform OCR itself
  • Handle multiple encodings and normalize whitespace after extraction
  • Validate presence of metadata fields and provide defaults if missing
  • Process large batches in parallel but limit concurrent I/O to avoid resource exhaustion
  • Log page-level extraction errors and skip problematic pages rather than failing the whole job
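
Several of these practices can be combined in a small helper layer. The sketch below is one possible approach, not the skill's own code: `normalize_text`, `with_defaults`, and `extract_pages_safely` are hypothetical names, and the logger name `pdf_skill` is likewise an assumption. The page objects only need an `extract_text()` method, as pypdf pages have.

```python
import logging
import re
import unicodedata

logger = logging.getLogger("pdf_skill")  # hypothetical logger name


def normalize_text(text: str) -> str:
    """NFKC-normalize, collapse spaces and tabs, and cap blank lines at one."""
    text = unicodedata.normalize("NFKC", text)  # e.g. ligatures, non-breaking spaces
    text = re.sub(r"[ \t]+", " ", text)         # runs of spaces/tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)        # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)      # at most one blank line in a row
    return text.strip()


def with_defaults(metadata: dict, defaults: dict) -> dict:
    """Fill missing or empty metadata values from a defaults dict."""
    present = {k: v for k, v in metadata.items() if v not in (None, "")}
    return {**defaults, **present}


def extract_pages_safely(pages) -> tuple[list[str], list[int]]:
    """Return (texts, failed_indexes); a failing page yields "" instead of aborting."""
    texts, failed = [], []
    for index, page in enumerate(pages):
        try:
            texts.append(normalize_text(page.extract_text() or ""))
        except Exception:  # malformed pages can raise a variety of errors
            logger.exception("page %d: text extraction failed, skipping", index)
            texts.append("")
            failed.append(index)
    return texts, failed
```

Keeping an empty string in place of a failed page preserves page indexes, so downstream consumers can still map text back to page numbers.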

Example use cases

  • Extract full text to feed a search index or vector database
  • Pull author, title, and dates to populate a document management system
  • Preprocess PDFs for NLP tasks like named entity recognition or summarization
  • Build a quick report of page counts and metadata for a document collection audit
  • Automate ingestion of invoices, contracts, and reports into backend systems
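
As an illustration of the collection-audit case, the sketch below runs an inspection function over many files with a bounded worker pool and records failures per file instead of stopping. The names `audit_collection` and `inspect_with_pypdf` are hypothetical, and the default of four workers is an arbitrary assumption. Threads mainly cap concurrent file access here; for CPU-heavy extraction a process pool may be a better fit.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def audit_collection(
    paths: Iterable[str],
    inspect: Callable[[str], dict],
    max_workers: int = 4,
) -> list[dict]:
    """Inspect each path with at most max_workers running at once.

    Results come back in input order; a failing file produces an error row.
    """
    def safe_inspect(path: str) -> dict:
        try:
            return {"path": path, "ok": True, **inspect(path)}
        except Exception as exc:  # keep the batch going on bad files
            return {"path": path, "ok": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_inspect, paths))


def inspect_with_pypdf(path: str) -> dict:
    """Collect page count and a few metadata fields for one PDF."""
    from pypdf import PdfReader  # imported here so audit_collection has no hard dependency

    reader = PdfReader(path)
    info = reader.metadata  # may be None
    return {
        "page_count": len(reader.pages),
        "author": getattr(info, "author", None),
        "title": getattr(info, "title", None),
    }
```

A call such as `audit_collection(pdf_paths, inspect_with_pypdf)` would yield one row per file, ready to be written out as CSV or JSON.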

FAQ

Which PDFs work best?

Text-based PDFs yield the best results. Scanned images require OCR before reliable text extraction.
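
One rough way to tell the two kinds apart is to check how much text each page yields. The heuristic below is only a sketch: the function names and the thresholds (`min_chars=20`, `threshold=0.8`) are assumptions chosen for illustration and would need tuning on real documents.

```python
def pages_needing_ocr(page_texts, min_chars: int = 20) -> list[int]:
    """Return indexes of pages whose extracted text is shorter than min_chars."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


def looks_scanned(page_texts, min_chars: int = 20, threshold: float = 0.8) -> bool:
    """Guess that a document is scanned when most pages yield almost no text."""
    texts = list(page_texts)
    if not texts:
        return False
    sparse = len(pages_needing_ocr(texts, min_chars))
    return sparse / len(texts) >= threshold
```

The input is a list of per-page strings, for example the result of calling `extract_text()` on each pypdf page. Documents flagged this way can be routed to an OCR step before extraction is retried.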

Does it preserve formatting?

It extracts plain text and basic page boundaries but does not preserve complex visual layout or styling.