
This skill fetches and extracts content from static web pages using a zero-latency markdown crawler.

npx playbooks add skill git-fg/thecattoolkit --skill crawling-content

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
---
name: crawling-content
description: "High-speed read-only web extraction. Use when fetching documentation, blogs, and static pages. Do not use for apps requiring login or interaction."
allowed-tools: [Bash]
---

# Content Crawler Protocol

## Usage
Use `@just-every/crawl` for zero-latency markdown extraction.

### Single Page (Read)
```bash
npx -y @just-every/crawl "https://example.com"
```

### Site Map (Spider)
```bash
npx -y @just-every/crawl "https://example.com" --pages 20 --output json
```

## Failure Mode
If output contains "JavaScript required" or "Access Denied", **STOP**. Switch to `Skill(browsing-web)` to handle the dynamic rendering.

Overview

This skill performs high-speed, read-only web extraction for static pages like documentation, blogs, and public guides. It focuses on fetching clean content (often converted to markdown) with minimal latency. Do not use it for pages that require login, form interaction, or heavy client-side rendering.
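
As a minimal sketch, assuming the crawler writes its extracted markdown to stdout (the SKILL.md examples suggest this, but it is not spelled out), a single page can be captured for ingestion like so; the URL and output filename are illustrative:

```bash
# Capture one static page as markdown for later ingestion.
# Assumes @just-every/crawl prints the extracted markdown to stdout.
npx -y @just-every/crawl "https://example.com/docs/getting-started" \
  > getting-started.md
```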

How this skill works

The skill issues fast, non-interactive requests to public URLs and extracts textual content and basic metadata. It is optimized for single-page reads and site-spidering runs that return multiple pages in a compact format. If the crawler detects that a page requires JavaScript rendering or that access is blocked, it halts and signals that a browsing-capable skill should be used instead.
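
A minimal wrapper can automate that halt-and-signal behavior by checking the output for the failure markers listed in SKILL.md. The sketch below assumes stdout output; the script name and exit-code convention are illustrative:

```bash
#!/usr/bin/env bash
# crawl-or-bail.sh (illustrative name): fetch one page, stop if the page
# needs dynamic rendering or is access-blocked.
set -euo pipefail

url="$1"
output="$(npx -y @just-every/crawl "$url")"

# Failure markers named in the Content Crawler Protocol.
if grep -qiE "JavaScript required|Access Denied" <<<"$output"; then
  echo "Blocked or JavaScript-dependent page; switch to Skill(browsing-web)." >&2
  exit 2
fi

printf '%s\n' "$output"
```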

When to use it

  • Fetch documentation pages, blog posts, or static help articles for ingestion.
  • Quickly mirror publicly available content to index or summarize.
  • Crawl a limited set of pages from a site for offline analysis (read-only); see the sketch after this list.
  • Generate markdown-friendly extracts from static HTML pages.
  • Run bulk reads of sitemap-style targets where no authentication is needed.
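
For a limited, read-only crawl like the one described above, a capped spider run can be captured as JSON for offline analysis. This is a sketch: the `--pages 20 --output json` flags come from SKILL.md, but the JSON shape consumed by jq is an assumption to verify against real output:

```bash
# Spider up to 20 pages and keep the raw JSON for offline analysis.
npx -y @just-every/crawl "https://example.com/docs" --pages 20 --output json \
  > crawl.json

# Hypothetical post-processing: list the crawled URLs, assuming a top-level
# array of objects with a "url" field. Check the actual structure first.
jq -r '.[].url' crawl.json
```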

Best practices

  • Prefer publicly accessible URLs and avoid pages behind logins or consent walls.
  • Limit pages per run to reasonable counts to avoid overloading servers.
  • Check extracted output immediately; stop and switch tools if you see "JavaScript required" or "Access Denied".
  • Respect robots.txt and rate limits; use polite crawling intervals for large runs (see the sketch after this list).
  • Request JSON or markdown outputs when you need structured downstream processing.
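
For larger runs, a simple batch loop with a fixed pause between requests keeps the crawl polite. This is a sketch under stated assumptions: `urls.txt` is a hypothetical one-URL-per-line file, and the two-second delay is an illustrative starting point:

```bash
# Sketch: fetch a list of public URLs one at a time with a polite delay.
i=0
while IFS= read -r url; do
  i=$((i + 1))
  npx -y @just-every/crawl "$url" > "page-$i.md" || echo "failed: $url" >&2
  sleep 2  # illustrative interval; tune to the target site's rate limits
done < urls.txt
```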

Example use cases

  • Pull the latest API docs pages to create an offline reference set.
  • Extract multiple blog posts from a single site for sentiment or topic analysis.
  • Crawl public tutorial pages to feed into an assistant’s knowledge base.
  • Rapidly capture static landing pages for compliance snapshotting.
  • Harvest how-to guides for summarization and content repurposing.

FAQ

Can this skill handle sites that need login or dynamic rendering?

No. This is read-only and does not perform authentication or client-side rendering. If you encounter JavaScript-dependent pages or access blocks, switch to a browsing-capable skill.

How do I know when to stop a crawl?

If the output contains messages like "JavaScript required" or "Access Denied", stop immediately and use a different tool that supports rendering or authentication.