home / skills / tavily-ai / tavily-plugins / crawl-url

crawl-url skill

not checked

/skills/crawl-url

npx playbooks add skill tavily-ai/tavily-plugins --skill crawl-url

Review the files below or copy the command above to add this skill to your agents.

Files (2)

SKILL.md

3.0 KB

---
name: crawl-url
description: "Crawl any website and save pages as local markdown files. Use when you need to download documentation, knowledge bases, or web content for offline access or analysis. No code required - just provide a URL."
---

# URL Crawler

Crawls websites using Tavily Crawl API and saves each page as a separate markdown file in a flat directory structure.

## Prerequisites

**Tavily API Key Required** - Get your key at https://tavily.com

Add to `~/.claude/settings.json`:
```json
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
```

Restart Claude Code after adding your API key.
## When to Use

Use this skill when the user wants to:
- Crawl and extract content from a website
- Download API documentation, framework docs, or knowledge bases
- Save web content locally for offline access or analysis

## Usage

Execute the crawl script with a URL and optional instruction:

```bash
python scripts/crawl_url.py <URL> [--instruction "guidance text"]
```

### Required Parameters

- **URL**: The website to crawl (e.g., `https://docs.stripe.com/api`)

### Optional Parameters

- `--instruction, -i`: Natural language guidance for the crawler (e.g., "Focus on API endpoints only")
- `--output, -o`: Output directory (default: `<repo_root>/crawled_context/<domain>`)
- `--depth, -d`: Max crawl depth (default: 2, range: 1-5)
- `--breadth, -b`: Max links per level (default: 50)
- `--limit, -l`: Max total pages to crawl (default: 50)

### Output

The script creates a flat directory structure at `<repo_root>/crawled_context/<domain>/` with one markdown file per crawled page. Filenames are derived from URLs (e.g., `docs_stripe_com_api_authentication.md`).

Each markdown file includes:
- Frontmatter with source URL and crawl timestamp
- The extracted content in markdown format

## Examples

### Basic Crawl

```bash
python scripts/crawl_url.py https://docs.anthropic.com
```

Crawls the Anthropic docs with default settings, saves to `<repo_root>/crawled_context/docs_anthropic_com/`.

### With Instruction

```bash
python scripts/crawl_url.py https://react.dev --instruction "Focus on API reference pages and hooks documentation"
```

Uses natural language instruction to guide the crawler toward specific content.

### Custom Output Directory

```bash
python scripts/crawl_url.py https://docs.stripe.com/api -o ./stripe-api-docs
```

Saves results to a custom directory.

### Adjust Crawl Parameters

```bash
python scripts/crawl_url.py https://nextjs.org/docs --depth 3 --breadth 100 --limit 200
```

Increases crawl depth, breadth, and page limit for more comprehensive coverage.

## Important Notes

- **API Key Required**: Set `TAVILY_API_KEY` environment variable (loads from `.env` if available)
- **Crawl Time**: Deeper crawls take longer (depth 3+ may take many minutes)
- **Filename Safety**: URLs are converted to safe filenames automatically
- **Flat Structure**: All files saved in `<repo_root>/crawled_context/<domain>/` directory regardless of original URL hierarchy
- **Duplicate Prevention**: Files are overwritten if URLs generate identical filenames