This project provides a toolset that crawls websites (wikis, tool and library documentation sites), generates Markdown documentation from them, and makes that documentation searchable via a Model Context Protocol (MCP) server, designed for integration with clients like Cursor.
Configuration
{
  "mcpServers": {
    "alizdavoodi-mcpdocsearch": {
      "command": "python",
      "args": [
        "-m",
        "mcp_server.main"
      ],
      "env": {
        "ENV": "<unused>"
      }
    }
  }
}

The MCP server loads the crawled Markdown content, breaks it into semantic chunks, generates embeddings for fast semantic search, and exposes tools to query your documentation from clients like Cursor. It turns a collected body of documentation into a searchable knowledge base that you can query with natural language or by precise headings.
You run the MCP server locally and connect a client such as Cursor to access its tools. The server reads Markdown files stored in the storage folder, builds semantic chunks from the content, and pre-computes embeddings so you can search with natural language queries. Use the client to list available documents, inspect their headings, or perform semantic searches across the content.
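To make the listing step concrete, here is a minimal sketch of what a document-listing tool could do under the hood. This is illustrative only, not the project's actual implementation; the function name and the assumption that documents are simply Markdown files under a storage directory mirror the description above.

```python
from pathlib import Path

def list_documents(storage_dir: str) -> list[str]:
    """Return relative paths of all Markdown files under storage_dir.

    Hypothetical sketch: the real server's tool may return richer
    metadata, but the core idea is a scan of the storage folder.
    """
    root = Path(storage_dir)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.md"))
```

A client would then pick one of the returned paths and ask for its headings or search within it.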
Prerequisites for running this server are Python and the uv tool for dependency management. Install uv, then install the project dependencies, and finally start the MCP server.
# Install UV (follow the official instructions for your system)
# Then clone the project and install dependencies
git clone https://github.com/alizdavoodi/MCPDocSearch.git
cd MCPDocSearch
uv sync

Crawled content is placed under the storage directory as Markdown files. The MCP server automatically loads these files, chunks the content by headings, and creates embeddings with a sentence-transformers model. A cache file stores the processed chunks and embeddings to speed up startup on subsequent runs.
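Heading-based chunking like the server describes can be sketched in a few lines. This is a simplified stand-in, assuming ATX-style `#` headings; the project's real chunker may handle more cases.

```python
import re

def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading section.

    Each chunk records the heading text, its level (number of '#'
    characters), and the body up to the next heading.
    """
    heading_re = re.compile(r"^(#{1,6})\s+(.*)$", re.MULTILINE)
    matches = list(heading_re.finditer(markdown))
    chunks = []
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        chunks.append({
            "heading": m.group(2).strip(),
            "level": len(m.group(1)),
            "text": markdown[start:end].strip(),
        })
    return chunks
```

Chunking at heading boundaries keeps each embedded unit topically coherent, which tends to improve semantic-search precision over fixed-size windows.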
The server uses a local cache to speed up startup. If you share the storage directory with untrusted sources, be cautious about loading cached data. Only trusted users should modify the Markdown files in storage.
The server exposes the following MCP tools via fastmcp: list_documents, get_document_headings, search_documentation.
The first startup after crawling content may take several minutes as the server parses, chunks, and embeds all content. Subsequent startups are faster if the cache is valid and the Markdown sources have not changed.
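The server's exact cache-invalidation scheme is internal, but one common approach, shown here as a hypothetical sketch, is to fingerprint the Markdown sources and reuse the cache only while the fingerprint matches.

```python
import hashlib
from pathlib import Path

def storage_fingerprint(storage_dir: str) -> str:
    """Hash every Markdown file's path and contents.

    Illustrative sketch: if this digest equals the one recorded
    alongside the cache, the cached chunks and embeddings can be
    reused; otherwise the content must be re-chunked and re-embedded.
    """
    h = hashlib.sha256()
    for p in sorted(Path(storage_dir).rglob("*.md")):
        h.update(str(p).encode())
        h.update(p.read_bytes())
    return h.hexdigest()
```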
To integrate with Cursor, run the server via stdio transport and connect the doc-query-server to Cursor as your MCP backend.
{
  "mcpServers": {
    "doc_query": {
      "type": "stdio",
      "name": "doc_query",
      "command": "python",
      "args": ["-m", "mcp_server.main"]
    }
  },
  "envVars": []
}

list_documents: Lists the crawled documents available in storage.
get_document_headings: Retrieves the heading structure for a document.
search_documentation: Performs semantic search over document chunks using embeddings.
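The core of an embedding-based search like search_documentation is ranking chunks by similarity to the query embedding. The sketch below uses plain cosine similarity over toy vectors; the real server computes vectors with a sentence-transformers model, and the function names here are hypothetical.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def search(query_vec: list[float],
           chunk_vecs: list[list[float]],
           top_k: int = 3) -> list[int]:
    """Return indices of the top_k chunks most similar to the query."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(chunk_vecs)]
    scored.sort(reverse=True)
    return [i for _, i in scored[:top_k]]
```

In practice the chunk embeddings are precomputed at startup (or loaded from the cache), so each query only embeds the query text and scores it against the stored vectors.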