Web Crawler Data Bridge MCP server for AI agents

MCP Server for Web Crawl Data (mcp-server-webcrawl) lets AI clients filter and analyze web crawler data through a full-text search interface with boolean support. The server acts as a bridge between several web crawlers and AI systems, so a client can search and retrieve crawled content for analysis.

Installation

To install mcp-server-webcrawl, you'll need:

Python 3.10 or higher
Claude Desktop

Install the package using pip:

pip install mcp-server-webcrawl
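
After installing, it can help to confirm that the package is visible to the same Python interpreter your MCP client will launch. The following is a minimal sketch using only the standard library. The module name mcp_server_webcrawl is taken from the client configuration examples in this document; the helper name is_installed is illustrative and not part of the package.

```python
import importlib.util


def is_installed(module_name: str = "mcp_server_webcrawl") -> bool:
    """Return True if the current interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed():
        print("mcp_server_webcrawl is importable")
    else:
        print("mcp_server_webcrawl not found; try: pip install mcp-server-webcrawl")
```

If the check fails even though pip reported success, pip and python may point at different environments.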

Supported Crawlers

The server works with data from the following crawlers and archive formats:

WARC - Standard web archive format
wget - Command-line website mirroring tool for macOS/Linux
InterroBot - GUI crawler and analyzer for macOS/Windows
Katana - Security-focused crawler for macOS/Windows/Linux
SiteOne - GUI crawler and analyzer for macOS/Windows/Linux

Basic Usage

After installing, you'll need to set up a crawler and index your web content. The workflow typically follows these steps:

1. Crawl a website using one of the supported crawlers
2. Start the MCP server to analyze the crawled data
3. Connect Claude Desktop to interact with the indexed content

Using a Crawler

For example, to use wget for crawling:

# Install wget if not already available
# On macOS with Homebrew:
brew install wget

# Basic website mirroring
wget --mirror --convert-links --adjust-extension --page-requisites --no-parent https://example.com/

Refer to the specific setup guides for detailed instructions on each crawler.
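
If you would rather drive the crawl from a script, the wget invocation above can be wrapped in Python. This is a sketch under stated assumptions: the function names build_wget_command and mirror_site are hypothetical, and the output directory layout is whatever wget produces by default in the working directory.

```python
import shutil
import subprocess


def build_wget_command(url: str) -> list[str]:
    """Return the wget mirroring command shown above as an argument list."""
    return [
        "wget",
        "--mirror",
        "--convert-links",
        "--adjust-extension",
        "--page-requisites",
        "--no-parent",
        url,
    ]


def mirror_site(url: str, output_dir: str) -> int:
    """Run wget inside output_dir and return its exit code."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget is not installed or not on PATH")
    # check=False: wget can exit non-zero (for example when a single page
    # returns a server error) even though most of the mirror succeeded.
    completed = subprocess.run(build_wget_command(url), cwd=output_dir, check=False)
    return completed.returncode
```

Passing the arguments as a list avoids shell quoting problems with URLs that contain characters such as & or ?.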

Search Capabilities

Boolean Search Syntax

The search engine supports complex queries with field-specific searches:

# Basic keyword search
privacy

# Exact phrase matching
"privacy policy"

# Wildcard search
boundar*

# Field-specific search
url: example.com/somedir
type: html
status: 200
content: h1
headers: text/xml

# Boolean operators
privacy AND policy
privacy OR policy
policy NOT privacy
(login OR signin) AND form
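
When queries are assembled programmatically, small helpers can keep quoting and grouping consistent. The sketch below only builds query strings in the syntax shown above; it does not call the server, and the helper names (phrase, field, all_of, any_of) are hypothetical.

```python
def phrase(text: str) -> str:
    """Wrap text in double quotes for exact phrase matching."""
    return f'"{text}"'


def field(name: str, value: str) -> str:
    """Build a field-specific term such as 'status: 200'."""
    return f"{name}: {value}"


def all_of(*terms: str) -> str:
    """Join terms with AND."""
    return " AND ".join(terms)


def any_of(*terms: str) -> str:
    """Join terms with OR, parenthesized so the group can be nested."""
    return "(" + " OR ".join(terms) + ")"


if __name__ == "__main__":
    print(all_of(any_of("login", "signin"), "form"))
```

Parenthesizing every OR group makes the intended precedence explicit when the group is combined with AND.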

Content Type Filtering

Filter by specific content types:

# Find only HTML pages
type: html

# Find only images
type: img

# Find HTML pages without login text
type: html NOT content: login

Available types include: html, iframe, img, audio, video, font, style, script, rss, text, pdf, doc, and other.
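
A client-side check against the list above can catch a mistyped type before a query is sent. This is a minimal sketch; the type names come from the list above, while the function name type_filter and the choice to reject unknown values locally are assumptions, not documented server behavior.

```python
CONTENT_TYPES = frozenset({
    "html", "iframe", "img", "audio", "video", "font", "style",
    "script", "rss", "text", "pdf", "doc", "other",
})


def type_filter(content_type: str) -> str:
    """Return a 'type: ...' query term, rejecting unknown type names."""
    normalized = content_type.strip().lower()
    if normalized not in CONTENT_TYPES:
        raise ValueError(f"unknown content type: {content_type!r}")
    return f"type: {normalized}"
```

For example, type_filter("HTML") yields the term type: html, while type_filter("png") raises ValueError because images are covered by img.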

Advanced Features

Using Extras for Efficient Processing

The extras parameter provides additional processing options:

thumbnails: Generates base64-encoded images for AI analysis
markdown: Converts HTML to concise Markdown
snippets: Returns contextual keyword usage
xpath: Extracts specific HTML elements using XPath selectors

Example usage (through the API):

# Request HTML content as Markdown
?extras=markdown

# Get contextual snippets showing search term usage
?extras=snippets

# Extract specific elements using XPath
?extras=xpath&extrasXpath=//h1

Using Prompt Routines

MCP Server includes several pre-built prompt routines for common tasks:

SEO Audit: Technical SEO analysis
404 Audit: Broken link detection
Performance Audit: Website speed analysis
File Audit: File organization analysis
Gopher Interface: Search interface for website exploration

To use a prompt routine:

1. Download the prompt file (e.g., auditseo.md)
2. Paste the markdown into Claude Desktop
3. Type "run pasted for [site name or URL]" to start the analysis

Optimizing Token Usage

When working with large websites, consider:

Using the markdown extra to reduce token usage (about 1/3 the size of HTML)
Using snippets for initial searches before retrieving full content
Using xpath to extract only needed elements
Filtering by type and status to narrow down results

How to install this MCP server

For Claude Code

To add this MCP server to Claude Code, run this command in your terminal:

claude mcp add-json "mcp-server-webcrawl" '{"command":"python","args":["-m","mcp_server_webcrawl"]}'

See the official Claude Code MCP documentation for more details.
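
Hand-writing JSON inside shell quotes is easy to get wrong. The sketch below generates the same command string from Python data, assuming a POSIX shell (shlex quoting does not apply to Windows cmd). The function name claude_add_json_command is hypothetical.

```python
import json
import shlex


def claude_add_json_command(name: str, command: str, args: list[str]) -> str:
    """Return a shell-quoted 'claude mcp add-json' command line."""
    payload = json.dumps({"command": command, "args": args}, separators=(",", ":"))
    return shlex.join(["claude", "mcp", "add-json", name, payload])


if __name__ == "__main__":
    print(claude_add_json_command(
        "mcp-server-webcrawl", "python", ["-m", "mcp_server_webcrawl"]
    ))
```

The printed line can be pasted into a terminal as-is; the JSON payload is single-quoted so the shell leaves the inner double quotes alone.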

For Cursor

There are two ways to add an MCP server to Cursor. The most common way is to add the server globally in the ~/.cursor/mcp.json file so that it is available in all of your projects.

If you only need the server in a single project, you can instead create a .cursor/mcp.json file in that project, or add the server to the file if it already exists.

Adding an MCP server to Cursor globally

To add a global MCP server, go to Cursor Settings > Tools & Integrations and click "New MCP Server".

Clicking that button opens the ~/.cursor/mcp.json file, where you can add your server like this:

{
    "mcpServers": {
        "mcp-server-webcrawl": {
            "command": "python",
            "args": [
                "-m",
                "mcp_server_webcrawl"
            ]
        }
    }
}

Adding an MCP server to a project

To add an MCP server to a project you can create a new .cursor/mcp.json file or add it to the existing one. This will look exactly the same as the global MCP server example above.
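
If the mcp.json file already lists other servers, the new entry should be merged in rather than overwriting the file. The following sketch does that with the standard library; the function names merge_server and update_config_file are hypothetical, and the entry mirrors the JSON shown in this document.

```python
import json
from pathlib import Path

SERVER_ENTRY = {"command": "python", "args": ["-m", "mcp_server_webcrawl"]}


def merge_server(config: dict, name: str, entry: dict) -> dict:
    """Return a copy of config with entry added under mcpServers[name]."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path, name: str = "mcp-server-webcrawl") -> None:
    """Create or update an mcp.json file, keeping any existing servers."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    updated = merge_server(existing, name, SERVER_ENTRY)
    path.write_text(json.dumps(updated, indent=4) + "\n", encoding="utf-8")
```

Calling update_config_file(Path(".cursor/mcp.json")) from the project directory would cover the project-level case; pointing it at the file in your home directory would cover the global one.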

How to use the MCP server

Once the server is added, you might need to return to the MCP section of Cursor Settings and click the refresh button.

The Cursor agent can then see the tools the MCP server provides and will call them when it needs to.

You can also explicitly ask the agent to use the tool by mentioning the tool name and describing what the function does.

For Claude Desktop

To add this MCP server to Claude Desktop:

1. Find your configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

2. Add this to your configuration file:

{
    "mcpServers": {
        "mcp-server-webcrawl": {
            "command": "python",
            "args": [
                "-m",
                "mcp_server_webcrawl"
            ]
        }
    }
}

3. Restart Claude Desktop for the changes to take effect