
Document Crawler MCP Server

This project provides a toolset to crawl websites (wikis, tool/library documentation, and similar sources), generate Markdown documentation from them, and make that documentation searchable via a Model Context Protocol (MCP) server, designed for integration with tools like Cursor.

Installation
Add the following to your MCP client configuration file.

{
  "mcpServers": {
    "alizdavoodi-mcpdocsearch": {
      "command": "python",
      "args": [
        "-m",
        "mcp_server.main"
      ],
      "env": {
        "ENV": "<unused>"
      }
    }
  }
}

The MCP server loads crawled Markdown content, breaks it into semantic chunks, generates embeddings for fast semantic search, and exposes tools to query your documentation from clients like Cursor. It turns a collected body of documentation into a searchable knowledge base that you can query with natural language or by precise headings.

How to use

You run the MCP server locally and connect a client such as Cursor to access its tools. The server reads Markdown files stored in the storage folder, builds semantic chunks from the content, and pre-computes embeddings so you can search with natural language queries. Use the client to list available documents, inspect their headings, or perform semantic searches across the content.

How to install

Prerequisites for running this server are Python and the UV tool for dependency management. Install UV, then install the project dependencies, and finally start the MCP server.

# Install UV (follow the official instructions for your system)
# Then clone the project and install dependencies

git clone https://github.com/alizdavoodi/MCPDocSearch.git
cd MCPDocSearch

uv sync

Configuration and notes

Crawl content is placed under the storage directory as Markdown files. The MCP server automatically loads these Markdown files, chunks content using headings, and creates embeddings with a sentence-transformers model. A cache file stores processed chunks and embeddings to speed up startup on subsequent runs.
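The heading-based chunking step can be sketched in pure Python. This is a minimal illustration only: the function name and chunk format are hypothetical, and the real server's chunker and its sentence-transformers embedding call may differ.

```python
import re


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading section (illustrative sketch)."""
    chunks = []
    current = {"heading": None, "text": []}
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            # A Markdown heading starts a new chunk.
            if current["heading"] or current["text"]:
                chunks.append(current)
            current = {"heading": line.lstrip("#").strip(), "text": []}
        else:
            current["text"].append(line)
    chunks.append(current)
    return [
        {"heading": c["heading"], "text": "\n".join(c["text"]).strip()}
        for c in chunks
        if c["heading"] or "".join(c["text"]).strip()
    ]
```

Each resulting chunk pairs a heading with its body text, which is the unit that would then be embedded and cached.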

Security notes

The server uses a local cache to speed up startup. If you share the storage directory with untrusted sources, be cautious about loading cached data. Only trusted users should modify the Markdown files in storage.

Tools exposed by the server

The server exposes the following MCP tools via fastmcp: list_documents, get_document_headings, search_documentation.

Embedding time and performance tips

The first startup after crawling content may take several minutes as the server parses, chunks, and embeds all content. Subsequent startups are faster if the cache is valid and the Markdown sources have not changed.
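One common way to implement that cache-validity check is to hash the Markdown sources and re-embed only when the digest changes. The sketch below is illustrative, assuming a `storage` directory of `.md` files; the project's actual invalidation scheme may differ.

```python
import hashlib
from pathlib import Path


def sources_digest(storage_dir: str) -> str:
    """Hash all Markdown sources so a cache can be invalidated when they change."""
    h = hashlib.sha256()
    for path in sorted(Path(storage_dir).rglob("*.md")):
        h.update(path.name.encode())   # include the filename so renames invalidate too
        h.update(path.read_bytes())
    return h.hexdigest()
```

On startup, compare the stored digest against the current one and rebuild chunks and embeddings only on a mismatch.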

Cursor integration

To integrate with Cursor, run the server over the stdio transport and register it as the doc_query MCP server in Cursor's configuration (see the appendix below).

Appendix: MCP runtime command

{
  "mcpServers": {
    "doc_query": {
      "type": "stdio",
      "name": "doc_query",
      "command": "python",
      "args": ["-m", "mcp_server.main"]
    }
  },
  "envVars": []
}

Available tools

list_documents

Lists crawled documents available in storage.
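A minimal sketch of what such a tool might do, assuming the crawled files live under a `storage` directory; the real tool's name resolution and output format may differ.

```python
from pathlib import Path


def list_documents(storage_dir: str = "storage") -> list[str]:
    """Return relative paths of all crawled Markdown files (illustrative sketch)."""
    root = Path(storage_dir)
    return sorted(str(p.relative_to(root)) for p in root.rglob("*.md"))
```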

get_document_headings

Retrieves the heading structure for a document.
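Extracting a heading outline from Markdown can be sketched with a regular expression. This is a hypothetical re-implementation; the actual tool may return a richer structure.

```python
import re


def get_document_headings(markdown: str) -> list[tuple[int, str]]:
    """Return (level, title) pairs for each ATX-style Markdown heading."""
    headings = []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            headings.append((len(m.group(1)), m.group(2).strip()))
    return headings
```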

search_documentation

Performs semantic search over document chunks using embeddings.
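At its core, semantic search ranks chunks by cosine similarity between the query embedding and each chunk embedding. The sketch below uses toy vectors in place of real sentence-transformers embeddings; the function names and chunk format are assumptions, not the project's actual API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_vec: list[float], chunks: list[dict], top_k: int = 3) -> list[dict]:
    """Return the top_k chunks most similar to the query embedding."""
    ranked = sorted(
        chunks,
        key=lambda c: cosine(query_vec, c["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]
```

In the real server the query string would first be embedded with the same model used for the chunks, so both live in the same vector space.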