This skill fetches complete text from URLs or PDFs, auto-detects format, and returns content with metadata and page count.
npx playbooks add skill bdambrosio/cognitive_workbench --skill fetch-text
Review the files below or copy the command above to add this skill to your agents.
---
name: fetch-text
type: python
description: "Fetch all text from URL or base64 PDF. Collection-aware (extracts first item if given Collection). Auto-detects format (PDF/HTML/MD/TXT) and extracts complete text content"
---
# fetch-text
Fetch complete text content from URLs or PDFs. Auto-detects format and extracts all text.
## Input
- `target`: URL string, base64-encoded PDF, Note ID, or Collection ID (uses first item's `content` as URL)
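For illustration, each of these `target` forms is accepted (placeholder values; the variable and output names are hypothetical):
```json
{"type":"fetch-text","target":"https://example.com/report.html","out":"$page_text"}
{"type":"fetch-text","target":"<base64-encoded PDF bytes>","out":"$pdf_text"}
{"type":"fetch-text","target":"$note_id","out":"$note_text"}
```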
## Output
Success (`status: "success"`):
- `value`: JSON string with:
  - `text`: Full extracted text
  - `format`: `"pdf"` | `"html"` | `"markdown"` | `"text"`
  - `metadata`: Source URL and format-specific metadata
  - `page_count`: Number of pages (PDF only)
  - `char_count`: Total character count
Failure (`status: "failed"`):
- `reason`: Error description
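A rough sketch of what a decoded `value` might look like for a PDF fetch (in the actual result this object arrives serialized as the `value` string next to `status: "success"`; the numbers and the metadata key name are illustrative, not taken from the skill):
```json
{
  "text": "Attention Is All You Need\nAshish Vaswani, Noam Shazeer, ...",
  "format": "pdf",
  "metadata": {"source_url": "https://arxiv.org/pdf/1706.03762.pdf"},
  "page_count": 15,
  "char_count": 52840
}
```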
## Behavior
- Auto-detects format from content
- Extracts complete text without filtering
- For Collections: extracts first Note's content field as URL
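For instance, a Collection can be passed directly as `target`, in which case its first Note's `content` is treated as the URL (variable names here are hypothetical):
```json
{"type":"fetch-text","target":"$search_results","out":"$page_text"}
```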
## Planning Notes
- Use when you have a specific URL and want complete content
- Use `search-web` when searching for information (returns filtered excerpts)
- For structured search results, extract URLs first with `project`
## Examples
```json
{"type":"fetch-text","target":"https://arxiv.org/pdf/1706.03762.pdf","out":"$paper_text"}
{"type":"project","target":"$papers","fields":["metadata.uri"],"out":"$urls"}
{"type":"fetch-text","target":"$urls","out":"$paper_text"}
```
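The first line fetches a PDF directly by URL; the second and third show pulling URLs out of a search-result Collection with `project` and then fetching text, where `fetch-text` uses the first item of `$urls` as its source.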
The fetch-text skill extracts the complete textual content of a URL or a base64-encoded PDF and returns it with structured metadata. It auto-detects the source format (PDF, HTML, Markdown, or plain text) and supports Collection inputs by using the first item's content as the target URL. The output includes the full text, the detected format, page and character counts, and source metadata for reliable downstream processing.
Provide a target that is a URL, a base64-encoded PDF, a Note ID, or a Collection ID. The skill inspects the input, auto-detects the content format, downloads or decodes the resource, and extracts all readable text without filtering. For Collections it reads the first Note's content field as the URL and then proceeds with extraction, returning JSON with text, format, metadata, page_count (PDF only), and char_count.
What input types are supported?
Supported inputs: HTTP/HTTPS URLs, base64-encoded PDFs, Note IDs, and Collection IDs (it will use the collection’s first item as the source).
How does it report failures?
On failure the skill returns `status: "failed"` with a concise `reason` field describing the error (a download, decode, or extraction problem).
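A failed result might look like this (the reason text is hypothetical):
```json
{"status":"failed","reason":"Could not download https://example.com/missing.pdf (HTTP 404)"}
```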