
firecrawl-scrape skill

/skills/firecrawl-scrape

This skill extracts clean, LLM-optimized Markdown from one or more URLs, including JavaScript-rendered pages, for quick content use.

npx playbooks add skill firecrawl/cli --skill firecrawl-scrape

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
3.0 KB
---
name: firecrawl-scrape
description: |
  Extract clean markdown from any URL, including JavaScript-rendered SPAs. Use this skill whenever the user provides a URL and wants its content, says "scrape", "grab", "fetch", "pull", "get the page", "extract from this URL", or "read this webpage". Handles JS-rendered pages, multiple concurrent URLs, and returns LLM-optimized markdown. Use this instead of WebFetch for any webpage content extraction.
allowed-tools:
  - Bash(firecrawl *)
  - Bash(npx firecrawl *)
---

# firecrawl scrape

Scrape one or more URLs. Returns clean, LLM-optimized markdown. Multiple URLs are scraped concurrently.

## When to use

- You have a specific URL and want its content
- The page is static or JS-rendered (SPA)
- Step 2 in the [workflow escalation pattern](firecrawl-cli): search → **scrape** → map → crawl → browser

## Quick start

```bash
# Basic markdown extraction
firecrawl scrape "<url>" -o .firecrawl/page.md

# Main content only, no nav/footer
firecrawl scrape "<url>" --only-main-content -o .firecrawl/page.md

# Wait for JS to render, then scrape
firecrawl scrape "<url>" --wait-for 3000 -o .firecrawl/page.md

# Multiple URLs (each saved to .firecrawl/)
firecrawl scrape https://example.com https://example.com/blog https://example.com/docs

# Get markdown and links together
firecrawl scrape "<url>" --format markdown,links -o .firecrawl/page.json
```

## Options

| Option                   | Description                                                      |
| ------------------------ | ---------------------------------------------------------------- |
| `-f, --format <formats>` | Output formats: markdown, html, rawHtml, links, screenshot, json |
| `-H`                     | Include HTTP headers in output                                   |
| `--only-main-content`    | Strip nav, footer, sidebar — main content only                   |
| `--wait-for <ms>`        | Wait for JS rendering before scraping                            |
| `--include-tags <tags>`  | Only include these HTML tags                                     |
| `--exclude-tags <tags>`  | Exclude these HTML tags                                          |
| `-o, --output <path>`    | Output file path                                                 |

## Tips

- **Try scrape before browser.** Scrape handles static pages and JS-rendered SPAs. Only escalate to browser when you need interaction (clicks, form fills, pagination).
- Multiple URLs are scraped concurrently — check `firecrawl --status` for your concurrency limit.
- Single format outputs raw content. Multiple formats (e.g., `--format markdown,links`) output JSON.
- Always quote URLs — shell interprets `?` and `&` as special characters.
- Naming convention: `.firecrawl/{site}-{path}.md`
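The naming convention above can be derived with plain shell string operations. A minimal sketch (no network calls; the URL is a placeholder):

```bash
# Derive .firecrawl/{site}-{path}.md from a URL with shell string ops.
url="https://example.com/blog/post?id=1"
trimmed="${url#*://}"                        # drop scheme -> example.com/blog/post?id=1
trimmed="${trimmed%%\?*}"                    # drop query  -> example.com/blog/post
site="${trimmed%%/*}"                        # host        -> example.com
path="${trimmed#"$site"}"; path="${path#/}"  # path        -> blog/post
slug=$(printf '%s' "$path" | tr '/' '-')     # slugify     -> blog-post
out=".firecrawl/${site}-${slug:-index}.md"
echo "$out"                                  # .firecrawl/example.com-blog-post.md
```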

## See also

- [firecrawl-search](../firecrawl-search/SKILL.md) — find pages when you don't have a URL
- [firecrawl-browser](../firecrawl-browser/SKILL.md) — when scrape can't get the content (interaction needed)
- [firecrawl-download](../firecrawl-download/SKILL.md) — bulk download an entire site to local files

Overview

This skill extracts clean, LLM-optimized Markdown from one or more URLs, including JavaScript-rendered single-page applications. It runs concurrent scrapes, supports selective tag inclusion/exclusion, and can wait for client-side rendering before capturing content. Use it whenever you need readable, structured page content for downstream analysis or indexing.

How this skill works

The skill loads pages (headless browser when needed) and renders JavaScript so dynamic content is captured. It strips navigation, sidebars, and footers when requested and can include multiple output formats (markdown, HTML, links, screenshots). Multiple URLs are fetched concurrently and returned as concise, LLM-friendly Markdown or JSON when multiple formats are selected.
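Concurrent fetching here just means each URL runs as its own job. A rough illustration with a mocked scrape function (`mock_scrape` is a stand-in for the CLI's internal behavior, not a real command):

```bash
# Illustration only: several URLs scraped as parallel jobs, then joined.
# mock_scrape stands in for a real scrape; firecrawl manages this internally.
mock_scrape() { echo "scraped: $1"; }
urls="https://example.com https://example.com/blog https://example.com/docs"
results=$(for u in $urls; do mock_scrape "$u" & done; wait)
echo "$results"
```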

When to use it

  • You provide a URL and ask to “scrape”, “fetch”, “grab”, “get the page”, “extract from this URL”, or “read this webpage”
  • Page content is needed from a static site or an SPA that requires JS rendering
  • You need clean, readable Markdown for summarization, indexing, or ingestion into an LLM pipeline
  • You want to scrape multiple pages concurrently and save outputs to files
  • You need main content only (no nav/footer) or selective HTML tag extraction

Best practices

  • Try scrape before a full browser automation step — it handles most static and JS-rendered pages without interaction
  • Quote URLs in shells to avoid issues with ? and & characters
  • Use --only-main-content to remove nav, footer, and sidebars for focused text extraction
  • Use --wait-for <ms> when pages rely on delayed JS rendering or client-side data loading
  • Request multiple formats (markdown,links) to get content and link metadata in a single JSON output
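Putting these practices together, a dry-run sketch that assembles (but does not execute) a scrape invocation for a JS-heavy page; the URL and output path are placeholders:

```bash
# Assemble the command line as a string and print it instead of running it,
# so the flag combination can be inspected before use.
url="https://app.example.com/dashboard"
cmd="firecrawl scrape \"$url\" --only-main-content --wait-for 3000 -o .firecrawl/app.example.com-dashboard.md"
echo "$cmd"
```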

Example use cases

  • Extract article text as Markdown for summarization or content analysis
  • Pull documentation pages into a knowledge base with consistent Markdown formatting
  • Batch-scrape multiple landing pages to build a link index or sitemap snapshot
  • Capture SPA-rendered product pages for price monitoring or feature extraction
  • Save clean Markdown versions of search results pages for downstream LLM processing

FAQ

Can it handle pages that require JavaScript to render?

Yes — the skill can wait for client-side rendering (use --wait-for) and uses a headless renderer to capture dynamic content.

What output formats are available?

Outputs include markdown, html, rawHtml, links, screenshot, and json. Request multiple formats to receive a JSON object containing each format.
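As a quick sanity check on a multi-format result, you can grep the saved JSON for each requested key. The field names below mirror the format names but are an assumption; the real payload shape may differ:

```bash
# Fake a multi-format result file, then verify both keys are present.
# The JSON shape is assumed (keys named after the formats), not guaranteed.
cat > /tmp/page.json <<'EOF'
{"markdown": "# Example\nBody text", "links": ["https://example.com/a"]}
EOF
if grep -q '"markdown"' /tmp/page.json && grep -q '"links"' /tmp/page.json; then
  status="both formats present"
fi
echo "$status"   # both formats present
```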

How do I extract only the main article text?

Use the --only-main-content option to strip navigation, sidebars, and footers and return the primary content only.