
fetch-text skill

/src/tools/fetch-text

This skill fetches complete text from URLs or PDFs, auto-detects format, and returns content with metadata and page count.

npx playbooks add skill bdambrosio/cognitive_workbench --skill fetch-text

Review the files below or copy the command above to add this skill to your agents.

Files (2)
Skill.md
1.4 KB
---
name: fetch-text
type: python
description: "Fetch all text from URL or base64 PDF. Collection-aware (extracts first item if given Collection). Auto-detects format (PDF/HTML/MD/TXT) and extracts complete text content"
---

# fetch-text

Fetch complete text content from URLs or PDFs. Auto-detects format and extracts all text.

## Input

- `target`: URL string, base64-encoded PDF, Note ID, or Collection ID (uses first item's `content` as URL)

## Output

Success (`status: "success"`):
- `value`: JSON string with:
  - `text`: Full extracted text
  - `format`: `"pdf"` | `"html"` | `"markdown"` | `"text"`
  - `metadata`: Source URL and format-specific metadata
  - `page_count`: Number of pages (PDF only)
  - `char_count`: Total character count

Failure (`status: "failed"`):
- `reason`: Error description

## Behavior

- Auto-detects format from content
- Extracts complete text without filtering
- For Collections: extracts first Note's content field as URL

## Planning Notes

- Use when you have a specific URL and want complete content
- Use `search-web` when searching for information (returns filtered excerpts)
- For structured search results, extract URLs first with `project`

## Examples

```json
{"type":"fetch-text","target":"https://arxiv.org/pdf/1706.03762.pdf","out":"$paper_text"}
{"type":"project","target":"$papers","fields":["metadata.uri"],"out":"$urls"}
{"type":"fetch-text","target":"$urls","out":"$paper_text"}
```

Overview

The fetch-text skill extracts the complete textual content of a URL or base64-encoded PDF and returns structured metadata. It auto-detects the source format (PDF, HTML, Markdown, or plain text) and supports Collection inputs by using the first item's content as the target URL. The output includes the full text, the detected format, page and character counts, and source metadata for reliable downstream processing.

How this skill works

Provide a target that is a URL, a base64 PDF, a Note ID, or a Collection ID. The skill inspects the input, auto-detects the content format, downloads or decodes the resource, and extracts all readable text without filtering. For Collections it reads the first note’s content field as the URL and then proceeds with extraction, returning JSON with text, format, metadata, page_count (PDF), and char_count.
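Under those assumptions, a caller might consume the success envelope like this (a minimal sketch; the field names follow the Output section above, and the sample payload values are purely illustrative):

```python
import json

# Illustrative result envelope, shaped as described above: a status
# field plus a JSON-encoded string in "value" (not a nested dict).
result = {
    "status": "success",
    "value": json.dumps({
        "text": "Attention Is All You Need ...",
        "format": "pdf",
        "metadata": {"uri": "https://arxiv.org/pdf/1706.03762.pdf"},
        "page_count": 15,
        "char_count": 29,
    }),
}

if result["status"] == "success":
    # "value" arrives as a JSON string and must be decoded first.
    payload = json.loads(result["value"])
    print(payload["format"], payload["char_count"])
```

Note that `value` is a JSON string, so it must be passed through `json.loads` before its fields can be read.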

When to use it

  • You have a specific URL or a base64 PDF and need the complete unfiltered text.
  • You need to ingest full documents for NLP tasks (summarization, indexing, entity extraction).
  • You want accurate format detection across PDF/HTML/Markdown/TXT inputs.
  • You receive a Collection ID and need to extract its first item’s content as the source.

Best practices

  • Pass a canonical URL or direct base64 PDF to avoid redirects and partial captures.
  • For collections, ensure the first item’s content is the intended URL or file reference.
  • If PDFs are large, expect longer extraction time and consider chunked or paginated processing in downstream tasks.
  • Validate returned metadata.uri and format before downstream processing to handle unexpected content types.
  • Use this skill when you require complete text; use `search-web` when you only need snippets or ranked passages.
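The validation advice above can be sketched as a small guard function (a hypothetical helper, not part of the skill; the field names come from the Output section, and the URI check assumes URL-based sources, so relax it for base64 PDF inputs):

```python
import json

# Formats the skill is documented to return.
ALLOWED_FORMATS = {"pdf", "html", "markdown", "text"}

def validate_fetch_result(result):
    """Check a fetch-text result before downstream processing.

    Returns the decoded payload on success, or raises ValueError
    with a reason. Illustrative only; adapt checks to your pipeline.
    """
    if result.get("status") != "success":
        raise ValueError(result.get("reason", "fetch-text failed"))
    # "value" is a JSON string and must be decoded before use.
    payload = json.loads(result["value"])
    if payload.get("format") not in ALLOWED_FORMATS:
        raise ValueError(f"unexpected format: {payload.get('format')}")
    uri = payload.get("metadata", {}).get("uri", "")
    if not uri.startswith(("http://", "https://")):
        raise ValueError("missing or non-HTTP source URI in metadata")
    return payload
```

A failed result raises with the skill's `reason` string, so callers can branch on the exception instead of inspecting the envelope by hand.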

Example use cases

  • Extract full text from academic PDFs (arXiv, publisher PDFs) for corpus creation.
  • Scrape entire web articles or documentation pages for indexing or offline analysis.
  • Ingest legal or compliance documents as base64 PDFs to feed an NLP pipeline for entity and clause extraction.
  • Process a Collection of notes by pointing to the Collection ID to automatically use its first item’s link as the extraction target.
  • Convert Markdown or plain text files hosted at a URL into a single text payload for summarization.

FAQ

What input types are supported?

Supported inputs: HTTP/HTTPS URLs, base64-encoded PDFs, Note IDs, and Collection IDs (it will use the collection’s first item as the source).
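If the source is a local PDF rather than a URL, its bytes could be base64-encoded before being passed as the `target` field (a sketch; the stand-in bytes replace a real `open("paper.pdf", "rb").read()`, and the operation shape mirrors the Examples above):

```python
import base64

# Stand-in for the raw bytes of a real PDF file.
pdf_bytes = b"%PDF-1.4 minimal example bytes"

# base64-encode and decode to str so the value fits in a JSON field.
target = base64.b64encode(pdf_bytes).decode("ascii")

op = {"type": "fetch-text", "target": target, "out": "$paper_text"}
```

The skill then detects the PDF format from the decoded content rather than from a URL extension.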

How does it report failures?

On failure the skill returns status: "failed" with a concise reason field describing the error (download, decode, or extraction issue).