Webcrawler MCP Server
Extract website content, map links, and generate Markdown content for multiple URLs.
Configuration
{
"mcpServers": {
"jmh108-md-webcrawl-mcp": {
"command": "fastmcp",
"args": [
"dev",
"server.py",
"--with-editable",
"."
],
"env": {
"OUTPUT_PATH": "./output",
"REQUEST_TIMEOUT": "30",
"MAX_CONCURRENT_REQUESTS": "5"
}
}
}
}
You can run a Python-based MCP web crawler that extracts website content and saves it as Markdown files, maps site structure, and processes multiple URLs in batches. This server is useful for building searchable content catalogs from web pages and quickly generating offline documentation-like markdown from live sites.
To use this MCP server, you interact with an MCP client to run tools that crawl web pages, extract content, and build indexes of linked content. You can extract content from a URL and save it as Markdown, then create an index of the content map for easy navigation. Typical workflows let you batch process several URLs and store the resulting Markdown files in a chosen output directory.
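To make the "extract content and save it as Markdown" step concrete, here is a minimal sketch using only the Python standard library. It is not the server's actual implementation: the class `MarkdownExtractor`, the function `html_to_markdown`, and the choice of handled tags (headings, paragraphs, links) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    """Convert a small subset of HTML (h1-h3, p, a) to Markdown. Hypothetical helper."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIPPED = {"script", "style", "nav", "footer"}  # assumed to be non-content

    def __init__(self):
        super().__init__()
        self.blocks = []     # finished Markdown blocks
        self.current = []    # text fragments of the block being built
        self.prefix = ""     # heading marker for the current block
        self.href = None     # target of the link being read, if any
        self.skip_depth = 0  # > 0 while inside a skipped element

    def _flush(self):
        # Collapse whitespace and close the current block, if it has any text.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        if self.skip_depth:
            return
        if tag == "p" or tag in self.HEADINGS:
            self._flush()
            self.prefix = self.HEADINGS.get(tag, "")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            if self.href:
                self.current.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "a" and self.href:
            self.current.append(f"]({self.href})")
            self.href = None
        elif tag == "p" or tag in self.HEADINGS:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)


def html_to_markdown(html: str) -> str:
    """Return the Markdown rendering of the supported tags in `html`."""
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text that was not inside a closed block
    return "\n\n".join(parser.blocks) + "\n"
```

Writing the result to disk is then a single call such as `pathlib.Path("example.md").write_text(html_to_markdown(html), encoding="utf-8")`. A real extractor would also need to handle lists, code blocks, images, and main-content detection, which this sketch leaves out.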
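Prerequisites: The project lists Python 3.7 or newer and a working Python environment. You also need FastMCP installed to manage MCP servers. Recent FastMCP releases require Python 3.10 or newer, so check the requirement of the version you install.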
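If you want to verify these prerequisites from a script, a small check along the following lines may help. The function name `check_prerequisites` is hypothetical, and the sketch assumes that FastMCP is importable under the module name `fastmcp`.

```python
import importlib.util
import sys


def check_prerequisites(min_version=(3, 7), package="fastmcp"):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info < min_version:
        wanted = ".".join(str(part) for part in min_version)
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python {wanted}+ is required, found {found}")
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(package) is None:
        problems.append(f"Package '{package}' is not installed")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```

Pass a different `min_version` if the FastMCP release you install requires a newer interpreter.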
# 1) Clone the project repository
git clone https://github.com/yourusername/webcrawler.git
cd webcrawler
# 2) Install Python dependencies
pip install -r requirements.txt
# 3) Optional: Configure environment variables
export OUTPUT_PATH=./output # Set your preferred output directory
Install the MCP runtime tooling and start the server using FastMCP. The following commands install the MCP server package and then set up a development instance. Use a path you control for server.py as needed.
# 4) Install FastMCP (Python-based MCP runtime)
pip install fastmcp
# 5) Install the MCP server entry point
fastmcp install server.py
# 6) Run the server in development mode
fastmcp dev server.py --with-editable .
You can customize where crawled content is saved by setting OUTPUT_PATH. You can control concurrency with MAX_CONCURRENT_REQUESTS and adjust request timeouts with REQUEST_TIMEOUT. These variables help tune performance and resource usage when crawling many URLs.
In summary, OUTPUT_PATH is the output directory, MAX_CONCURRENT_REQUESTS is the number of parallel requests, and REQUEST_TIMEOUT is the per-request timeout in seconds. The sample configuration above sets them to ./output, 5, and 30 respectively.
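The following sketch shows one way these three variables could drive a crawler: it reads them from the environment and uses them to bound concurrency and apply a per-request timeout. It is an illustration under stated assumptions, not the server's code. `CrawlSettings`, `load_settings`, and `crawl_all` are hypothetical names, the fallback values are copied from the sample configuration and may not match the server's real defaults, and `fetch` stands for whatever coroutine actually downloads a page.

```python
import asyncio
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlSettings:
    output_path: str
    request_timeout: float        # seconds
    max_concurrent_requests: int


def load_settings(env=None) -> CrawlSettings:
    """Read the three variables; fallbacks mirror the sample configuration (assumed)."""
    env = os.environ if env is None else env
    return CrawlSettings(
        output_path=env.get("OUTPUT_PATH", "./output"),
        request_timeout=float(env.get("REQUEST_TIMEOUT", "30")),
        max_concurrent_requests=int(env.get("MAX_CONCURRENT_REQUESTS", "5")),
    )


async def crawl_all(urls, fetch, settings):
    """Run fetch(url) for each URL with bounded concurrency and a per-request timeout."""
    semaphore = asyncio.Semaphore(settings.max_concurrent_requests)

    async def fetch_one(url):
        # At most max_concurrent_requests coroutines are inside this block at once.
        async with semaphore:
            try:
                return url, await asyncio.wait_for(fetch(url), settings.request_timeout)
            except asyncio.TimeoutError:
                return url, None  # timed out; the caller decides how to report it

    return dict(await asyncio.gather(*(fetch_one(url) for url in urls)))
```

A caller would run it with `asyncio.run(crawl_all(urls, fetch, load_settings()))`. Raising MAX_CONCURRENT_REQUESTS speeds up large batches at the cost of more simultaneous connections, while a lower REQUEST_TIMEOUT gives up on slow pages sooner.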
# Extract content from a single URL and save as Markdown
mcp call extract_content --url "https://example.com" --output_path "example.md"
# Scan linked content from a URL and create an index of the content map
mcp call scan_linked_content --url "https://example.com" | \
mcp call create_index --content_map - --output_path "index.md"
extract_content: Fetches a webpage and extracts its main content, saving it as a Markdown file at the specified output path.
scan_linked_content: Crawls a URL and enumerates its linked pages, producing a content map that can be used to build indexes.
create_index: Generates a Markdown index file from a content map, linking to the extracted content.
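To illustrate how a content map and an index relate, the sketch below collects same-site links from a page and renders them as a Markdown list. The format of the server's content map is not documented here, so the dictionary of URL to link text is an assumption, as are the names `LinkCollector`, `build_content_map`, and `render_index`.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags. Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def build_content_map(base_url, html):
    """Map each same-site http(s) URL linked from the page to its link text."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    site = urlparse(base_url).netloc
    content_map = {}
    for href, text in collector.links:
        # Resolve relative links and drop "#fragment" so duplicates collapse.
        url, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(url)
        if parts.scheme in ("http", "https") and parts.netloc == site:
            content_map.setdefault(url, text or url)
    return content_map


def render_index(content_map, title="Index"):
    """Render the content map as a Markdown bullet list sorted by URL."""
    lines = [f"# {title}", ""]
    lines += [f"- [{text}]({url})" for url, text in sorted(content_map.items())]
    return "\n".join(lines) + "\n"
```

In this sketch the index links to the original URLs. An index that points at the extracted Markdown files instead would map each URL to its output filename before rendering.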