Crawl4AI MCP server

Integrates with Crawl4AI to provide web crawling, content extraction, screenshot capture, PDF generation, and batch processing capabilities with intelligent content detection for sitemaps and RSS feeds.
Provider: omgwtfwow
Release date: Jul 31, 2025
Language: JavaScript
Stats: 9 stars

This MCP server provides a robust interface for web crawling, content extraction, and browser automation through the Crawl4AI platform. It allows you to easily integrate powerful web scraping capabilities into any MCP-compatible client.

Installation

Prerequisites

  • Node.js 18+ and npm
  • A running Crawl4AI server

Starting the Crawl4AI Server

You'll need a running Crawl4AI server to use this MCP server. The simplest way is to use Docker:

docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:0.7.4
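
Once the container is up, you can sanity-check it before wiring it into an MCP client. This is a quick check assuming the default /health endpoint exposed by the Crawl4AI Docker image; adjust the port if your deployment differs:

curl http://localhost:11235/health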

Adding to Your MCP Client

Using npx (Recommended)

Add this configuration to your MCP client's settings:

{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}

Using Local Installation

If you prefer a local installation:

{
  "mcpServers": {
    "crawl4ai": {
      "command": "node",
      "args": ["/path/to/mcp-crawl4ai-ts/dist/index.js"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235"
      }
    }
  }
}

Full Configuration with Optional Variables

{
  "mcpServers": {
    "crawl4ai": {
      "command": "npx",
      "args": ["mcp-crawl4ai-ts"],
      "env": {
        "CRAWL4AI_BASE_URL": "http://localhost:11235",
        "CRAWL4AI_API_KEY": "your-api-key",
        "SERVER_NAME": "custom-name",
        "SERVER_VERSION": "1.0.0"
      }
    }
  }
}

Configuration

Environment Variables

# Required
CRAWL4AI_BASE_URL=http://localhost:11235

# Optional - Server Configuration
CRAWL4AI_API_KEY=          # If your server requires auth
SERVER_NAME=crawl4ai-mcp   # Custom name for the MCP server
SERVER_VERSION=1.0.0       # Custom version

Client-Specific Setup

Claude Desktop

Add the configuration shown above to ~/Library/Application Support/Claude/claude_desktop_config.json.

Claude Code

claude mcp add crawl4ai -e CRAWL4AI_BASE_URL=http://localhost:11235 -- npx mcp-crawl4ai-ts

Other MCP Clients

Consult your client's documentation for MCP server configuration. The key details:

  • Command: npx mcp-crawl4ai-ts or node /path/to/dist/index.js
  • Required env: CRAWL4AI_BASE_URL
  • Optional env: CRAWL4AI_API_KEY, SERVER_NAME, SERVER_VERSION

Available Tools

Extract Content as Markdown

{ 
  url: string,                              // Required: URL to extract markdown from
  filter?: 'raw'|'fit'|'bm25'|'llm',       // Filter type (default: 'fit')
  query?: string,                           // Query for bm25/llm filters
  cache?: string                            // Cache-bust parameter (default: '0')
}

Extracts content as markdown with various filtering options. Use 'bm25' or 'llm' filters with a query for specific content extraction.
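
For example, a call using the 'bm25' filter to return only content relevant to a query (the URL and query are placeholders):

{
  "url": "https://example.com/blog",
  "filter": "bm25",
  "query": "pricing and plans"
}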

Capture Webpage Screenshot

{ 
  url: string,                   // Required: URL to capture
  screenshot_wait_for?: number   // Seconds to wait before screenshot (default: 2)
}

Returns base64-encoded PNG. For screenshots after JavaScript execution, use crawl with screenshot: true.

Convert Webpage to PDF

{ 
  url: string  // Required: URL to convert to PDF
}

Returns base64-encoded PDF. For PDFs after JavaScript execution, use crawl with pdf: true.

Execute JavaScript

{ 
  url: string,                    // Required: URL to load
  scripts: string | string[]      // Required: JavaScript to execute
}

Executes JavaScript and returns results. Each script can use 'return' to get values back.
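
A sketch of a call that runs two scripts and returns their values (the URL is a placeholder):

{
  "url": "https://example.com",
  "scripts": [
    "return document.title",
    "return document.querySelectorAll('a').length"
  ]
}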

Batch Crawl Multiple URLs

{ 
  urls: string[],           // Required: List of URLs to crawl
  max_concurrent?: number,  // Parallel request limit (default: 5)
  remove_images?: boolean,  // Remove images from output (default: false)
  bypass_cache?: boolean,   // Bypass cache for all URLs (default: false)
  configs?: Array<{         // Optional: Per-URL configurations
    url: string,
    [key: string]: any      // Any crawl parameters for this specific URL
  }>
}

Efficiently crawls multiple URLs in parallel. Each URL gets a fresh browser instance.
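
A minimal batch call without per-URL configs (URLs are placeholders; see Example 2 further below for per-URL configuration):

{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "max_concurrent": 3,
  "remove_images": true
}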

Smart Crawl with Auto-Detection

{ 
  url: string,            // Required: URL to crawl
  max_depth?: number,     // Maximum depth for recursive crawling (default: 2)
  follow_links?: boolean, // Follow links in content (default: true)
  bypass_cache?: boolean  // Bypass cache (default: false)
}

Intelligently detects content type (HTML/sitemap/RSS) and processes accordingly.
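
For example, pointing it at a sitemap (placeholder URL) lets it detect the XML and crawl the listed pages:

{
  "url": "https://example.com/sitemap.xml",
  "max_depth": 1,
  "bypass_cache": true
}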

Get Sanitized HTML

{ 
  url: string  // Required: URL to extract HTML from
}

Returns preprocessed HTML optimized for structure analysis.

Extract and Categorize Links

{ 
  url: string,          // Required: URL to extract links from
  categorize?: boolean  // Group by type (default: true)
}

Extracts all links and groups them by type: internal, external, social media, documents, images.

Deep Recursive Crawling

{ 
  url: string,              // Required: Starting URL
  max_depth?: number,       // Maximum depth to crawl (default: 3)
  max_pages?: number,       // Maximum pages to crawl (default: 50)
  include_pattern?: string, // Regex pattern for URLs to include
  exclude_pattern?: string  // Regex pattern for URLs to exclude
}

Crawls a website following internal links up to specified depth.
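
A sketch that restricts the crawl to documentation pages while skipping URLs with query strings (the patterns are illustrative):

{
  "url": "https://example.com/docs",
  "max_depth": 2,
  "max_pages": 20,
  "include_pattern": ".*/docs/.*",
  "exclude_pattern": ".*\\?.*"
}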

Parse XML Sitemaps

{ 
  url: string,              // Required: Sitemap URL (e.g., /sitemap.xml)
  filter_pattern?: string   // Optional: Regex pattern to filter URLs
}

Extracts all URLs from XML sitemaps with optional regex filtering.
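
For example, extracting only blog URLs from a sitemap (URL and pattern are placeholders):

{
  "url": "https://example.com/sitemap.xml",
  "filter_pattern": ".*/blog/.*"
}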

Advanced Web Crawling

{
  url: string,                              // URL to crawl
  // Browser Configuration
  browser_type?: 'chromium'|'firefox'|'webkit'|'undetected',  // Browser engine
  viewport_width?: number,                  // Browser width (default: 1080)
  viewport_height?: number,                 // Browser height (default: 600)
  user_agent?: string,                      // Custom user agent
  proxy_server?: string | {                 // Proxy URL (string or object format)
    server: string,
    username?: string,
    password?: string
  },
  cookies?: Array<{name, value, domain}>,   // Pre-set cookies
  headers?: Record<string,string>,          // Custom headers
  
  // Crawler Configuration
  word_count_threshold?: number,            // Min words per block (default: 200)
  excluded_tags?: string[],                 // HTML tags to exclude
  remove_overlay_elements?: boolean,        // Remove popups/modals
  js_code?: string | string[],              // JavaScript to execute
  wait_for?: string,                        // Wait condition (selector or JS)
  wait_for_timeout?: number,                // Wait timeout (default: 30000)
  delay_before_scroll?: number,             // Pre-scroll delay
  scroll_delay?: number,                    // Between-scroll delay
  process_iframes?: boolean,                // Include iframe content
  exclude_external_links?: boolean,         // Remove external links
  screenshot?: boolean,                     // Capture screenshot
  pdf?: boolean,                           // Generate PDF
  session_id?: string,                      // Reuse browser session
  cache_mode?: 'ENABLED'|'BYPASS'|'DISABLED',  // Cache control
  
  css_selector?: string,                    // CSS selector to filter content
  delay_before_return_html?: number,        // Delay before returning HTML
  include_links?: boolean,                  // Include extracted links in response
  resolve_absolute_urls?: boolean,          // Convert relative URLs to absolute
  
  // LLM Extraction
  extraction_type?: 'llm',                  // Only 'llm' extraction is supported
  extraction_schema?: object,               // Schema for structured extraction
  extraction_instruction?: string,          // Natural language extraction prompt
  extraction_strategy?: {                   // Advanced extraction configuration
    provider?: string,
    api_key?: string,
    model?: string,
    [key: string]: any
  },
  
  timeout?: number,                         // Overall timeout (default: 60000)
  verbose?: boolean                         // Detailed logging
}
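
Performs a single crawl with full control over browser and crawler behavior. A sketch combining a reused session, a click via js_code, a wait condition, and a screenshot (the selectors and session name are hypothetical):

{
  "url": "https://example.com/dashboard",
  "session_id": "my-session",
  "js_code": "document.querySelector('#load-more')?.click();",
  "wait_for": ".results-loaded",
  "wait_for_timeout": 10000,
  "screenshot": true
}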

Browser Session Management

{ 
  action: 'create' | 'clear' | 'list',    // Required: Action to perform
  session_id?: string,                    // For 'create' and 'clear' actions
  initial_url?: string,                   // For 'create' action: URL to load
  browser_type?: 'chromium' | 'firefox' | 'webkit' | 'undetected'  // For 'create' action
}

Unified tool for managing browser sessions with three actions:

  • create: Start a persistent browser session
  • clear: Remove a session from local tracking
  • list: Show all active sessions

Examples:

// Create a new session
{ action: 'create', session_id: 'my-session', initial_url: 'https://example.com' }

// Clear a session
{ action: 'clear', session_id: 'my-session' }

// List all sessions
{ action: 'list' }

Extract Structured Data with AI

{ 
  url: string,          // URL to extract data from
  query: string         // Natural language extraction instructions
}

Uses AI to extract structured data from webpages. This is the recommended way to extract specific information.

Advanced Usage Examples

Example 1: Extract Content with LLM Filtering

{
  "url": "https://example.com/article",
  "filter": "llm",
  "query": "Extract only the sections about machine learning"
}

Example 2: Crawl Multiple URLs with Different Configurations

{
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "max_concurrent": 2,
  "configs": [
    {
      "url": "https://example.com/page1",
      "browser_type": "undetected",
      "screenshot": true
    },
    {
      "url": "https://example.com/page2",
      "wait_for": ".content-loaded",
      "js_code": "window.scrollTo(0, document.body.scrollHeight);"
    }
  ]
}

Example 3: AI-Powered Data Extraction

{
  "url": "https://example.com/product-page",
  "query": "Extract the product name, price, description, and available colors. Format as JSON with these exact fields."
}

How to install this MCP server

For Claude Code

To add this MCP server to Claude Code, run this command in your terminal:

claude mcp add-json "crawl4ai" '{"command":"npx","args":["mcp-crawl4ai-ts"],"env":{"CRAWL4AI_BASE_URL":"http://localhost:11235"}}'

See the official Claude Code MCP documentation for more details.

For Cursor

There are two ways to add an MCP server to Cursor. The most common way is to add the server globally in the ~/.cursor/mcp.json file so that it is available in all of your projects.

If you only need the server in a single project, you can instead add it to that project by creating (or editing) its .cursor/mcp.json file.

Adding an MCP server to Cursor globally

To add a global MCP server, go to Cursor Settings > Tools & Integrations and click "New MCP Server".

When you click that button, the ~/.cursor/mcp.json file will open and you can add your server like this:

{
    "mcpServers": {
        "crawl4ai": {
            "command": "npx",
            "args": [
                "mcp-crawl4ai-ts"
            ],
            "env": {
                "CRAWL4AI_BASE_URL": "http://localhost:11235"
            }
        }
    }
}

Adding an MCP server to a project

To add an MCP server to a project, create a new .cursor/mcp.json file in the project root or add the entry to the existing one. The configuration looks exactly the same as the global example above.

How to use the MCP server

Once the server is installed, you might need to head back to Settings > MCP and click the refresh button.

The Cursor agent will then be able to see the tools the MCP server provides and will call them when it needs to.

You can also explicitly ask the agent to use a tool by mentioning its name and describing what it does.

For Claude Desktop

To add this MCP server to Claude Desktop:

1. Find your configuration file:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json

2. Add this to your configuration file:

{
    "mcpServers": {
        "crawl4ai": {
            "command": "npx",
            "args": [
                "mcp-crawl4ai-ts"
            ],
            "env": {
                "CRAWL4AI_BASE_URL": "http://localhost:11235"
            }
        }
    }
}

3. Restart Claude Desktop for the changes to take effect
