
Web Crawler MCP Server

Provides a configurable web crawling MCP server that follows links, respects delays, and handles concurrent requests for scalable data collection.

You can deploy and run a Web Crawler MCP Server that exposes a crawl tool via MCP for configurable web crawling tasks. This server lets you perform controlled crawls with adjustable depth, delays, timeouts, and concurrency, all managed through an MCP client.

How to use

To perform a crawl, connect with your MCP client and invoke the crawl capability exposed by the server. You can specify the target URL and the crawl depth, and the server will handle requests in a controlled, asynchronous way. Use it to collect pages, follow links within configured constraints, and respect rate limits and timeouts. You can adjust how aggressively the crawler behaves by setting environment variables or MCP configuration values that control depth, delays, timeouts, and concurrency. Ensure you have the appropriate permissions and follow any site policies before crawling.
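At the protocol level, a crawl invocation is a standard MCP tools/call request. The request body below is illustrative: the argument names url and depth are assumptions, so check the server's tool listing for the exact schema.

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "crawl",
    "arguments": {
      "url": "https://example.com",
      "depth": 2
    }
  }
}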

How to install

Prerequisites: Node.js v18 or newer and npm v9 or newer.

Step 1: Clone the repository, install dependencies, and build the project.

git clone https://github.com/jitsmaster/web-crawler-mcp.git
cd web-crawler-mcp
npm install
npm run build

Step 2: Create runtime configuration for the server using environment variables.

CRAWL_LINKS=false
MAX_DEPTH=3
REQUEST_DELAY=1000
TIMEOUT=5000
MAX_CONCURRENT=5
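
The server reads these variables at startup. As a rough sketch of how such values are typically parsed in a Node.js server (illustrative only, not the project's actual source; the defaults mirror the values above):

// Illustrative sketch of environment parsing; not taken from the server's source.
const config = {
  crawlLinks: process.env.CRAWL_LINKS === "true",            // follow discovered links when "true"
  maxDepth: Number(process.env.MAX_DEPTH ?? "3"),            // maximum link depth
  requestDelay: Number(process.env.REQUEST_DELAY ?? "1000"), // pause between requests, in ms
  timeout: Number(process.env.TIMEOUT ?? "5000"),            // per-request timeout, in ms
  maxConcurrent: Number(process.env.MAX_CONCURRENT ?? "5"),  // cap on simultaneous requests
};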

Step 3: Start the MCP server.

npm start

Configuration and MCP setup

The server can be wired into your MCP workflow using a local (stdio) runtime configuration. This runs the server as a local process started by a command like node and points to the built entry file. The following configuration snippet shows the exact structure used to run the server locally and pass through the environment variables shown above.

{
  "mcpServers": {
    "web_crawler": {
      "command": "node",
      "args": ["/path/to/web-crawler/build/index.js"],
      "env": {
        "CRAWL_LINKS": "false",
        "MAX_DEPTH": "3",
        "REQUEST_DELAY": "1000",
        "TIMEOUT": "5000",
        "MAX_CONCURRENT": "5"
      }
    }
  }
}

Notes and best practices

Environment variables control crawling behavior:

CRAWL_LINKS - whether to follow links found on crawled pages
MAX_DEPTH - maximum link depth to crawl
REQUEST_DELAY - pacing delay between requests, in milliseconds
TIMEOUT - per-request timeout, in milliseconds
MAX_CONCURRENT - maximum number of simultaneous requests

When deploying in production, consider securing access to the MCP endpoint and applying appropriate rate limiting on the client side to stay within target site policies.
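
For client-side rate limiting, simply spacing out successive tool calls is often enough. A minimal sketch, where each task is a function that issues one MCP tool call:

// Run crawl tasks one at a time, pausing between them to respect target sites.
async function paced<T>(tasks: Array<() => Promise<T>>, delayMs: number): Promise<T[]> {
  const results: T[] = [];
  for (const task of tasks) {
    results.push(await task()); // execute the next crawl call
    await new Promise((resolve) => setTimeout(resolve, delayMs)); // wait before continuing
  }
  return results;
}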

Usage example with MCP client

To initiate a crawl via your MCP client, provide the target URL and optional depth parameter. The server executes the crawl according to the configured environment and MCP settings.
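
A minimal sketch using the MCP TypeScript SDK is shown below. The tool argument names (url, depth) and the server path are assumptions; adjust them to match your build and the server's actual schema.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

// Launch the server over stdio, mirroring the runtime configuration above.
const transport = new StdioClientTransport({
  command: "node",
  args: ["/path/to/web-crawler/build/index.js"],
  env: { CRAWL_LINKS: "true", MAX_DEPTH: "2" },
});

const client = new Client({ name: "crawl-example", version: "1.0.0" });
await client.connect(transport);

// Invoke the crawl tool; "url" and "depth" are assumed argument names.
const result = await client.callTool({
  name: "crawl",
  arguments: { url: "https://example.com", depth: 2 },
});
console.log(result.content);

await client.close();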

Available tools

crawl

Initiates a crawl via MCP with configurable URL and depth; respects server-side settings such as delay, timeout, and concurrency.
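
The crawl tool's input schema is likely along these lines (an illustrative sketch, not taken from the server's source; query the server's tool listing for the authoritative definition):

{
  "name": "crawl",
  "description": "Crawl a website starting from the given URL",
  "inputSchema": {
    "type": "object",
    "properties": {
      "url": { "type": "string", "description": "Starting URL to crawl" },
      "depth": { "type": "number", "description": "Maximum crawl depth" }
    },
    "required": ["url"]
  }
}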