Documentation Scraper & MCP Server

A domain-agnostic toolkit for scraping documentation and integrating it with AI assistants. It scrapes a documentation website, stores the pages in a structured database, and exposes them to Claude Desktop via MCP (Model Context Protocol) for AI-assisted documentation lookup.

Installation
Add the following to your MCP client configuration file.

Configuration

{
  "mcpServers": {
    "dragomirweb-crawl4claude": {
      "command": "python",
      "args": [
        "path/to/mcp_docs_server.py"
      ],
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db",
        "DOCS_BASE_URL": "https://docs.example.com/"
      }
    }
  }
}

The MCP server runs on your machine and exposes your scraped documentation to Claude Desktop. Through a small set of MCP tools it lets you search, browse, and fetch content from the scraped docs, which makes it a convenient base for AI-assisted documentation workflows.

How to use

Connect Claude Desktop to the MCP server you run locally. Once connected, you can use the built-in tools to search across all scraped documentation, list sections, fetch full page content, browse by section, and view database statistics. Use these tools to build AI-assisted documentation queries, generate summaries, or export content for downstream workflows.

How to install

You need Python 3.8 or higher and an active internet connection. Allow about 500 MB of free disk space per documentation site for the scraped data.
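The prerequisites can be checked before scraping. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and the `sites` parameter are hypothetical and not part of the project.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)
# Roughly 500 MB per documentation site, as stated in the prerequisites.
REQUIRED_BYTES_PER_SITE = 500 * 1024 * 1024


def check_prerequisites(path=".", sites=1):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper, not part of the project.
    """
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = "%d.%d" % sys.version_info[:2]
        problems.append("Python 3.8 or higher is required, found " + found)
    # Free space on the filesystem that contains `path`.
    free = shutil.disk_usage(path).free
    needed = REQUIRED_BYTES_PER_SITE * sites
    if free < needed:
        problems.append(
            "not enough free disk space: need %d bytes, have %d" % (needed, free)
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

The check is advisory only: the 500 MB figure is an estimate, so a large site may need more.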

# Clone the project
git clone <repository-url>
cd documentation-scraper

# Install dependencies
pip install -r requirements.txt

Additional setup and usage notes

Set the target documentation site and the MCP server options in the project's configuration file. You can override settings with environment variables if needed. The following sections show concrete steps and example configurations.

Configuration

The main settings live in the project's central Python configuration file. The example below shows the base URL, output directory, crawl depth, and page limit, along with the URL filtering rules and the MCP server settings.

# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}

# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}

# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}
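To illustrate how the URL filtering rules might be applied, here is a minimal sketch. It assumes the skip patterns are regular expressions matched against the URL path and that `allowed_domains` is compared with the exact hostname; the project's actual matching logic may differ, and `should_crawl` is a hypothetical name.

```python
import re
from urllib.parse import urlparse

URL_FILTER_CONFIG = {
    "skip_patterns": [r"/api/", r"\.pdf$"],
    "allowed_domains": ["docs.example.com"],
}


def should_crawl(url, config=URL_FILTER_CONFIG):
    """Return True if `url` passes the domain and skip-pattern checks.

    Hypothetical helper: assumes patterns are matched against the URL path.
    """
    parsed = urlparse(url)
    # Reject anything outside the allowed domains (exact hostname match).
    if parsed.hostname not in config["allowed_domains"]:
        return False
    # Reject any path that matches at least one skip pattern.
    return not any(
        re.search(pattern, parsed.path) for pattern in config["skip_patterns"]
    )


if __name__ == "__main__":
    for candidate in (
        "https://docs.example.com/guide/intro",
        "https://docs.example.com/api/v1/users",
        "https://docs.example.com/files/manual.pdf",
    ):
        print(candidate, should_crawl(candidate))
```

Matching against the path rather than the full URL keeps `\.pdf$` working when a query string is present; that is a design choice of this sketch.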

Environment overrides

Any setting can be overridden with environment variables set before the server starts. The example below changes the database path and the base URL.

export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
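On the Python side, an override of this kind usually comes down to reading the environment with a fallback. The sketch below assumes the two variable names shown above; the default values and the `resolve_settings` helper are illustrative, not taken from the project.

```python
import os

# Illustrative defaults; the project's real defaults may differ.
DEFAULTS = {
    "DOCS_DB_PATH": "docs_db/documentation.db",
    "DOCS_BASE_URL": "https://docs.example.com/",
}


def resolve_settings(environ=None):
    """Return settings where non-empty environment variables win over defaults."""
    environ = os.environ if environ is None else environ
    return {
        name: environ.get(name) or default
        for name, default in DEFAULTS.items()
    }


if __name__ == "__main__":
    for name, value in resolve_settings().items():
        print(name, "=", value)
```

Passing a plain dictionary as `environ` makes the function easy to test without touching the real environment.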

Claude Desktop integration (manual setup)

If you prefer manual setup, add the MCP server configuration to Claude Desktop as shown. This connects Claude to the local MCP server that serves your scraped docs.

{
  "mcpServers": {
    "docs": {
      "command": "python",
      "args": ["path/to/mcp_docs_server.py"],
      "cwd": "path/to/project",
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db"
      }
    }
  }
}
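The placeholder paths in the configuration can be filled in by hand or generated. The following sketch builds the same structure with absolute paths, which may be more robust if the client launches the server from a different working directory. The helper `build_server_entry` is hypothetical, and the sketch only prints the JSON; it does not write to any Claude Desktop file.

```python
import json
from pathlib import Path


def build_server_entry(project_dir, db_path, name="docs"):
    """Build an mcpServers entry with absolute paths (hypothetical helper)."""
    project = Path(project_dir).resolve()
    return {
        "mcpServers": {
            name: {
                "command": "python",
                "args": [str(project / "mcp_docs_server.py")],
                "cwd": str(project),
                "env": {"DOCS_DB_PATH": str(Path(db_path).resolve())},
            }
        }
    }


# Print the entry so it can be pasted into the client configuration.
print(json.dumps(build_server_entry(".", "docs_db/documentation.db"), indent=2))
```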

Available MCP tools

When connected, Claude can use these tools to interact with your scraped documentation.

search_documentation

Search for content across all scraped documentation using query terms to retrieve relevant pages, sections, or metadata.
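As a rough illustration of how such a search could work against a SQLite database, the sketch below runs a substring query with a clamped result limit. The `pages` table, its columns, and both function names are assumptions made for this example; the project's real schema and search logic are not documented here. The limits mirror `default_search_limit` and `max_search_limit` from the MCP settings.

```python
import sqlite3

MCP_CONFIG = {"default_search_limit": 10, "max_search_limit": 50}


def clamp_limit(limit=None, config=MCP_CONFIG):
    """Use the default when no limit is given; otherwise keep it in [1, max]."""
    if limit is None:
        return config["default_search_limit"]
    return max(1, min(int(limit), config["max_search_limit"]))


def search_pages(conn, query, limit=None):
    """Substring search over a hypothetical `pages` table.

    SQLite's LIKE is case-insensitive for ASCII by default. The wildcards
    % and _ inside `query` are not escaped in this sketch.
    """
    pattern = "%" + query + "%"
    cursor = conn.execute(
        "SELECT url, title FROM pages "
        "WHERE title LIKE ? OR content LIKE ? "
        "ORDER BY url LIMIT ?",
        (pattern, pattern, clamp_limit(limit)),
    )
    return cursor.fetchall()


# Demo on an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("https://docs.example.com/install", "Installation", "Run pip install to get started."),
        ("https://docs.example.com/config", "Configuration", "Set the base URL and depth."),
    ],
)
print(search_pages(conn, "install"))
```

Clamping the limit on the server side keeps a single request from returning an unbounded number of rows, whatever value the client asks for.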

get_documentation_sections

List all available sections and subsections in the scraped documentation database.

get_page_content

Retrieve the full content of a specific page by URL or identifier, including the markdown export when available.

browse_section

Browse pages within a specific section to understand structure and content organization.

get_documentation_stats

Return database statistics such as page count, word totals, and index health for performance monitoring.
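Page and word counts of this kind can be derived directly from the stored content. The sketch below assumes a hypothetical `pages` table with a `content` column and counts whitespace-separated words; the real tool's schema, its metrics, and its notion of index health may differ.

```python
import sqlite3


def documentation_stats(conn):
    """Compute page and word counts from a hypothetical `pages` table."""
    rows = conn.execute("SELECT content FROM pages").fetchall()
    # Count whitespace-separated words; NULL content counts as zero words.
    word_counts = [len((content or "").split()) for (content,) in rows]
    total_words = sum(word_counts)
    return {
        "page_count": len(rows),
        "total_words": total_words,
        "average_words_per_page": total_words / len(rows) if rows else 0.0,
    }


# Demo on an in-memory database with made-up rows.
stats_conn = sqlite3.connect(":memory:")
stats_conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, content TEXT)")
stats_conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("https://docs.example.com/install", "Run pip install to get started."),
        ("https://docs.example.com/config", "Set the base URL and depth."),
    ],
)
print(documentation_stats(stats_conn))
```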