Documentation Scraper & MCP Server
A comprehensive, domain-agnostic documentation scraping and AI integration toolkit. Scrape any documentation website, create structured databases, and integrate with Claude Desktop via MCP (Model Context Protocol) for seamless AI-powered documentation assistance.
Configuration
{
  "mcpServers": {
    "dragomirweb-crawl4claude": {
      "command": "python",
      "args": [
        "path/to/mcp_docs_server.py"
      ],
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db",
        "DOCS_BASE_URL": "https://docs.example.com/"
      }
    }
  }
}
This MCP server exposes your scraped documentation to Claude Desktop. It lets you search, browse, and fetch content from the scraped docs through MCP tools, which makes it easy to build AI-assisted documentation workflows.
Connect Claude Desktop to the MCP server you run locally. Once connected, you can use the built-in tools to search across all scraped documentation, list sections, fetch full page content, browse by section, and view database statistics. Use these tools to build AI-assisted documentation queries, generate summaries, or export content for downstream workflows.
Prerequisites: Python 3.8 or higher and an active internet connection. Allow about 500 MB of free disk space per documentation site for the scraped data.
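The prerequisites above can be checked before you start scraping. The following is a minimal sketch, not part of the project: the function name `check_prerequisites` and the constants are hypothetical, and the 500 MB figure is treated as a rough threshold rather than a hard requirement.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)               # minimum version named in the prerequisites
MIN_FREE_BYTES = 500 * 1024 ** 2  # roughly 500 MB per documentation site


def check_prerequisites(target_dir="."):
    """Return a list of human-readable problems; an empty list means all good.

    Hypothetical helper for illustration only.
    """
    problems = []
    if sys.version_info < MIN_PYTHON:
        found = sys.version.split()[0]
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, found {found}"
        )
    # disk_usage reports on the filesystem that contains target_dir
    free = shutil.disk_usage(target_dir).free
    if free < MIN_FREE_BYTES:
        problems.append(
            f"Only {free // 1024 ** 2} MB free in {target_dir!r}, about 500 MB recommended"
        )
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Pass the directory where you plan to store the database so that the free-space check looks at the right filesystem.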
# Clone the project
git clone <repository-url>
cd documentation-scraper
# Install dependencies
pip install -r requirements.txt
Configure the target documentation site and the MCP server settings in the configuration file. You can override settings with environment variables if needed. The following sections show concrete steps and example configurations to get you up and running.
The main configuration lives in the central Python configuration file. The example below shows the base URL, output directory, crawl depth, and page limit, along with the URL filtering rules and MCP server settings.
# Basic scraping settings
SCRAPER_CONFIG = {
    "base_url": "https://docs.example.com/",
    "output_dir": "docs_db",
    "max_depth": 3,
    "max_pages": 200,
    "delay_between_requests": 0.5,
}
# URL filtering rules
URL_FILTER_CONFIG = {
    "skip_patterns": [r'/api/', r'\.pdf$'],
    "allowed_domains": ["docs.example.com"],
}
# MCP server settings
MCP_CONFIG = {
    "server_name": "docs-server",
    "default_search_limit": 10,
    "max_search_limit": 50,
}
You can override any setting with environment variables to tailor the server's behavior for quick experiments.
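On the Python side, such an override could be resolved roughly as follows. This is a sketch under stated assumptions, not the server's actual code: `DOCS_DB_PATH` and `DOCS_BASE_URL` are the variable names used in this document, while `resolve_settings`, the `DEFAULTS` dictionary, and the default database path are hypothetical.

```python
import os

# Assumed defaults for illustration; the real project may use different values.
DEFAULTS = {
    "DOCS_DB_PATH": "docs_db/documentation.db",
    "DOCS_BASE_URL": "https://docs.example.com/",
}


def resolve_settings(environ=None):
    """Return settings where an environment variable wins over its default.

    Hypothetical helper; `environ` is injectable so the logic can be tested
    without touching the real process environment.
    """
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, default) for key, default in DEFAULTS.items()}


if __name__ == "__main__":
    for name, value in resolve_settings().items():
        print(f"{name}={value}")
```

The shell commands that follow set these two variables before starting the server.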
export DOCS_DB_PATH="/custom/path/documentation.db"
export DOCS_BASE_URL="https://different-docs.com/"
python mcp_docs_server.py
If you prefer manual setup, add the MCP server configuration to Claude Desktop as shown. This connects Claude to the local MCP server that serves your scraped docs.
{
  "mcpServers": {
    "docs": {
      "command": "python",
      "args": ["path/to/mcp_docs_server.py"],
      "cwd": "path/to/project",
      "env": {
        "DOCS_DB_PATH": "path/to/docs_db/documentation.db"
      }
    }
  }
}
When connected, Claude can use these tools to interact with your scraped documentation.
Search for content across all scraped documentation using query terms to retrieve relevant pages, sections, or metadata.
List all available sections and subsections in the scraped documentation database.
Retrieve the full content of a specific page by URL or identifier, including the markdown export when available.
Browse pages within a specific section to understand structure and content organization.
Return database statistics such as page count, word totals, and index health for performance monitoring.
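To make the search tool more concrete, here is a minimal sketch of a search over a SQLite database with a clamped result limit. It is an illustration under assumptions, not the server's implementation: the `pages(url, title, content)` schema, `search_pages`, and `clamp_limit` are hypothetical, and a plain `LIKE` match stands in for whatever search the real server uses. Only the two limit values come from the `MCP_CONFIG` example in this document.

```python
import sqlite3

# Limit values taken from the MCP_CONFIG example in this document.
MCP_CONFIG = {"default_search_limit": 10, "max_search_limit": 50}


def clamp_limit(limit=None):
    """Use the default when no limit is given, otherwise keep it in 1..max."""
    if limit is None:
        return MCP_CONFIG["default_search_limit"]
    return max(1, min(int(limit), MCP_CONFIG["max_search_limit"]))


def search_pages(conn, query, limit=None):
    """Return (url, title) rows whose title or content contains `query`.

    Assumes a hypothetical table: pages(url TEXT, title TEXT, content TEXT).
    LIKE wildcards (% and _) inside `query` are not escaped in this sketch.
    """
    pattern = f"%{query}%"
    cursor = conn.execute(
        "SELECT url, title FROM pages "
        "WHERE title LIKE ? OR content LIKE ? "
        "ORDER BY url LIMIT ?",
        (pattern, pattern, clamp_limit(limit)),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    # Throwaway in-memory database so the sketch runs without a scraped site.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (url TEXT, title TEXT, content TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?, ?)",
        [
            ("https://docs.example.com/install", "Install", "Run pip install."),
            ("https://docs.example.com/usage", "Usage", "Start the server."),
        ],
    )
    print(search_pages(conn, "install"))
```

Clamping the limit on the server side keeps a single request from returning an unbounded number of rows, whatever value the client passes.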