home / skills / jimliu / baoyu-skills / baoyu-url-to-markdown

baoyu-url-to-markdown skill

needs review

This skill converts any URL to clean markdown using Chrome CDP, with auto or wait-for-user capture modes for saving webpages.

npx playbooks add skill jimliu/baoyu-skills --skill baoyu-url-to-markdown

Review the files below or copy the command above to add this skill to your agents.

Files (6)

SKILL.md

5.7 KB

---
name: baoyu-url-to-markdown
description: Fetch any URL and convert to markdown using Chrome CDP. Supports two modes - auto-capture on page load, or wait for user signal (for pages requiring login). Use when user wants to save a webpage as markdown.
---

# URL to Markdown

Fetches any URL via Chrome CDP and converts HTML to clean markdown.

## Script Directory

**Important**: All scripts are located in the `scripts/` subdirectory of this skill.

**Agent Execution Instructions**:
1. Determine this SKILL.md file's directory path as `SKILL_DIR`
2. Script path = `${SKILL_DIR}/scripts/<script-name>.ts`
3. Replace all `${SKILL_DIR}` in this document with the actual path

**Script Reference**:
| Script | Purpose |
|--------|---------|
| `scripts/main.ts` | CLI entry point for URL fetching |

## Preferences (EXTEND.md)

Use Bash to check EXTEND.md existence (priority order):

```bash
# Check project-level first
test -f .baoyu-skills/baoyu-url-to-markdown/EXTEND.md && echo "project"

# Then user-level (cross-platform: $HOME works on macOS/Linux/WSL)
test -f "$HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md" && echo "user"
```

┌────────────────────────────────────────────────────────┬───────────────────┐
│                          Path                          │     Location      │
├────────────────────────────────────────────────────────┼───────────────────┤
│ .baoyu-skills/baoyu-url-to-markdown/EXTEND.md          │ Project directory │
├────────────────────────────────────────────────────────┼───────────────────┤
│ $HOME/.baoyu-skills/baoyu-url-to-markdown/EXTEND.md    │ User home         │
└────────────────────────────────────────────────────────┴───────────────────┘

┌───────────┬───────────────────────────────────────────────────────────────────────────┐
│  Result   │                                  Action                                   │
├───────────┼───────────────────────────────────────────────────────────────────────────┤
│ Found     │ Read, parse, apply settings                                               │
├───────────┼───────────────────────────────────────────────────────────────────────────┤
│ Not found │ Use defaults                                                              │
└───────────┴───────────────────────────────────────────────────────────────────────────┘

**EXTEND.md Supports**: Default output directory | Default capture mode | Timeout settings

## Features

- Chrome CDP for full JavaScript rendering
- Two capture modes: auto or wait-for-user
- Clean markdown output with metadata
- Handles login-required pages via wait mode

## Usage

```bash
# Auto mode (default) - capture when page loads
npx -y bun ${SKILL_DIR}/scripts/main.ts <url>

# Wait mode - wait for user signal before capture
npx -y bun ${SKILL_DIR}/scripts/main.ts <url> --wait

# Save to specific file
npx -y bun ${SKILL_DIR}/scripts/main.ts <url> -o output.md
```

## Options

| Option | Description |
|--------|-------------|
| `<url>` | URL to fetch |
| `-o <path>` | Output file path (default: auto-generated) |
| `--wait` | Wait for user signal before capturing |
| `--timeout <ms>` | Page load timeout (default: 30000) |

## Capture Modes

| Mode | Behavior | Use When |
|------|----------|----------|
| Auto (default) | Capture on network idle | Public pages, static content |
| Wait (`--wait`) | User signals when ready | Login-required, lazy loading, paywalls |

**Wait mode workflow**:
1. Run with `--wait` → script outputs "Press Enter when ready"
2. Ask user to confirm page is ready
3. Send newline to stdin to trigger capture

## Output Format

YAML front matter with `url`, `title`, `description`, `author`, `published`, `captured_at` fields, followed by converted markdown content.

## Output Directory

```
url-to-markdown/<domain>/<slug>.md
```

- `<slug>`: From page title or URL path (kebab-case, 2-6 words)
- Conflict resolution: Append timestamp `<slug>-YYYYMMDD-HHMMSS.md`

## Environment Variables

| Variable | Description |
|----------|-------------|
| `URL_CHROME_PATH` | Custom Chrome executable path |
| `URL_DATA_DIR` | Custom data directory |
| `URL_CHROME_PROFILE_DIR` | Custom Chrome profile directory |

**Troubleshooting**: Chrome not found → set `URL_CHROME_PATH`. Timeout → increase `--timeout`. Complex pages → try `--wait` mode.

## Extension Support

Custom configurations via EXTEND.md. See **Preferences** section for paths and supported options.

Overview

This skill fetches any public or authenticated webpage using Chrome DevTools Protocol and converts the rendered HTML into clean, readable Markdown with YAML front matter. It supports an automatic capture on page load or a wait-for-user mode for login workflows and lazy-loaded content. Outputs are written to a structured directory with conflict-safe filenames and useful metadata.

How this skill works

The tool launches Chrome (configurable via environment variables) and navigates to the target URL, letting JavaScript fully execute before capturing the DOM. In auto mode it captures on network idle; in wait mode it pauses and waits for a user signal to capture pages that require interaction or login. The script extracts metadata (title, description, author, published date) and converts the page to Markdown, then writes a YAML front matter block followed by the markdown content to a timestamped file path.

When to use it

Save a public article or blog post as Markdown for offline reading or note-taking.
Archive a page that requires authentication by using wait mode to log in first.
Capture pages with heavy client-side rendering or lazy-loaded content.
Automate periodic harvesting of pages while preserving metadata and readable formatting.

Best practices

Use auto mode for static public pages; use --wait for login, paywalls, or infinite-scroll content.
Set URL_CHROME_PATH if the environment cannot find a compatible Chrome/Chromium binary.
Increase --timeout for slow networks or pages with large resources.
Provide an explicit -o path when you need a specific filename or location.
Check and optionally override defaults via the local or user extend file for output locations and modes.

Example use cases

Quickly convert a news article to Markdown for inclusion in a notes repository.
Log into a member-only site, run with --wait, then press Enter to capture the post-login content.
Batch-capture dynamically rendered documentation pages that require running client-side scripts.
Save a marketing landing page with preserved metadata and a readable markdown version for analysis.

FAQ

What if Chrome is not found on my system?

Set the URL_CHROME_PATH environment variable to your Chrome/Chromium executable path.

How do I capture pages behind login or paywalls?

Use --wait mode, perform the login in the opened browser session, then press Enter to trigger capture.