Provides a production-ready MCP server for PDF processing with intelligent caching and specialized read, search, and extract tools.
Configuration
{
  "mcpServers": {
    "jztan-pdf-mcp": {
      "command": "pdf-mcp",
      "args": [],
      "env": {
        "PDF_MCP_CACHE_DIR": "path to cache directory (default: ~/.cache/pdf-mcp)",
        "PDF_MCP_CACHE_TTL": "TTL in hours (default: 24)"
      }
    }
  }
}

You can run the pdf-mcp server locally to process PDF documents with intelligent caching, enabling fast reads, searches, and image extraction for AI agents and applications. This MCP server provides specialized tools to read, navigate, and analyze PDFs efficiently, even when dealing with large files or repeated access.
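The env values in the configuration entry are descriptive placeholders rather than usable settings. As a minimal sketch, the entry could be generated with concrete values like this; the helper name `build_server_config` and the example cache path are illustrative assumptions, not part of pdf-mcp.

```python
import json


def build_server_config(cache_dir: str, ttl_hours: int) -> dict:
    """Return an MCP client config entry for pdf-mcp (hypothetical helper)."""
    return {
        "mcpServers": {
            "jztan-pdf-mcp": {
                "command": "pdf-mcp",
                "args": [],
                "env": {
                    # Environment variables are strings, so the TTL is
                    # serialized as text rather than as a JSON number.
                    "PDF_MCP_CACHE_DIR": cache_dir,
                    "PDF_MCP_CACHE_TTL": str(ttl_hours),
                },
            }
        }
    }


if __name__ == "__main__":
    # Example values only; pick a directory that suits your system.
    print(json.dumps(build_server_config("/tmp/pdf-mcp-cache", 48), indent=2))
```

The printed JSON can be pasted into your MCP client's configuration file; where that file lives depends on the client.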
You interact with pdf-mcp through an MCP client, using the server to access a suite of PDF-focused tools. Start the server locally, then connect your client to the server using the standard MCP connection method described for your client. Once connected, you can inspect a document, read specific page ranges in chunks, search for content before loading, extract images, and leverage a persistent cache to speed up subsequent accesses.
Prerequisites: Python and pip must be available on your system.
pip install pdf-mcp

The server uses a persistent SQLite cache to speed up repeated access; the cache survives server restarts. The default cache location is ~/.cache/pdf-mcp. You can change the cache directory and the time-to-live (TTL) of cached items with environment variables.
The cache is invalidated automatically when a document changes, can be cleared manually, and uses a configurable TTL to balance freshness against speed.
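To make these cache behaviors concrete, here is a toy sketch of a SQLite-backed page cache with change detection and a TTL. It is not pdf-mcp's actual implementation: the `PageCache` class, its table schema, and the use of the file modification time to detect changes are all assumptions made for illustration.

```python
import os
import sqlite3
import time


class PageCache:
    """Toy page-text cache keyed by (path, page); NOT pdf-mcp's real schema."""

    def __init__(self, db_path: str, ttl_hours: float = 24.0):
        self.ttl_seconds = ttl_hours * 3600
        # A file-backed database persists across process restarts.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "path TEXT, page INTEGER, mtime REAL, stored_at REAL, text TEXT, "
            "PRIMARY KEY (path, page))"
        )

    def put(self, path: str, page: int, text: str) -> None:
        # Record the file's modification time alongside the cached text.
        self.db.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?)",
            (path, page, os.path.getmtime(path), time.time(), text),
        )
        self.db.commit()

    def get(self, path: str, page: int):
        row = self.db.execute(
            "SELECT mtime, stored_at, text FROM pages WHERE path = ? AND page = ?",
            (path, page),
        ).fetchone()
        if row is None:
            return None
        mtime, stored_at, text = row
        # Treat the entry as a miss if the file changed or the TTL elapsed.
        changed = mtime != os.path.getmtime(path)
        expired = time.time() - stored_at > self.ttl_seconds
        return None if changed or expired else text

    def clear_expired(self) -> int:
        # Manual cleanup: delete entries older than the TTL, return the count.
        cur = self.db.execute(
            "DELETE FROM pages WHERE stored_at < ?",
            (time.time() - self.ttl_seconds,),
        )
        self.db.commit()
        return cur.rowcount
```

A real server might detect changes with a content hash instead of the modification time; the modification time is simply the cheapest check to demonstrate.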
Set PDF_MCP_CACHE_DIR to change the cache directory (default: ~/.cache/pdf-mcp) and PDF_MCP_CACHE_TTL to change the TTL in hours (default: 24).
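A small sketch of how the two settings and their documented defaults could be resolved from the environment. The function name `cache_settings` is hypothetical, and parsing the TTL as a float is an assumption; the server may parse it differently.

```python
import os
from pathlib import Path


def cache_settings(env=os.environ):
    """Resolve cache directory and TTL from a mapping (hypothetical helper)."""
    # Fall back to the documented defaults when a variable is unset.
    cache_dir = Path(env.get("PDF_MCP_CACHE_DIR", "~/.cache/pdf-mcp")).expanduser()
    ttl_hours = float(env.get("PDF_MCP_CACHE_TTL", "24"))
    return cache_dir, ttl_hours
```

Passing a plain dict instead of `os.environ` makes it easy to check which values a given configuration would produce.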
The server exposes eight specialized tools to work with PDFs. You typically start by inspecting the document, then read specific pages, search within the document, and optionally extract images or inspect the table of contents. Each tool is designed to help you build concise, chunked workflows that keep context within reasonable limits.
- pdf_info: Get document metadata, page count, and table of contents to plan reads. Call this first to understand the document.
- pdf_read_pages: Read defined page ranges or specific pages in manageable chunks.
- pdf_read_all: Read the entire document when it is small enough to fit within the server's safety limit.
- pdf_search: Find relevant sections before loading full content.
- pdf_get_toc: Retrieve the table of contents for quick navigation.
- pdf_extract_images: Extract images from specified pages as base64-encoded PNGs.
- pdf_cache_stats: View statistics about the cache.
- pdf_cache_clear: Clear expired or undesirable cache entries.
For a large document, start by inspecting the PDF, then read relevant page ranges in batches and finally synthesize a response from the gathered chunks.
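The batching step of this workflow can be sketched as a small helper that turns the page count reported by pdf_info into page ranges to request one at a time. The function `page_batches`, the default batch size of 10, and the 1-based inclusive page numbering are assumptions for illustration; check the tool's own parameter description for the numbering it expects.

```python
def page_batches(page_count: int, batch_size: int = 10):
    """Yield inclusive (start, end) page ranges covering the whole document.

    Assumes 1-based page numbers (hypothetical; verify against the tool).
    """
    if page_count < 0 or batch_size < 1:
        raise ValueError("page_count must be >= 0 and batch_size >= 1")
    for start in range(1, page_count + 1, batch_size):
        # The final batch is clipped to the last page of the document.
        yield start, min(start + batch_size - 1, page_count)


if __name__ == "__main__":
    for start, end in page_batches(25, batch_size=10):
        print(f"read pages {start}-{end}")
```

Smaller batches keep each response well within context limits at the cost of more tool calls.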
If you plan to contribute or build locally, you can install the package in editable mode and run tests and checks as part of your workflow.
- pdf_info: Get document information including page count, metadata, table of contents, file size, and estimated tokens. Call this first to understand the document before reading.
- pdf_read_pages: Read specific page ranges or individual pages in manageable chunks to control context size during processing.
- pdf_read_all: Read the entire document when it fits within the safety limit defined by the server.
- pdf_search: Search within the PDF to locate relevant pages before loading the content.
- pdf_get_toc: Retrieve the table of contents to navigate the document structure quickly.
- pdf_extract_images: Extract images from specified pages and encode them as base64 PNGs.
- pdf_cache_stats: Show statistics about the PDF cache, including hit rate and size.
- pdf_cache_clear: Clear expired or undesired cache entries to free space and refresh data.
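Because extracted images arrive as base64-encoded PNGs, a client needs to decode them before saving or displaying them. The sketch below assumes you have already pulled the base64 string out of the tool response (the response layout is not described here); `decode_png` is a hypothetical helper name.

```python
import base64

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_png(b64_data: str) -> bytes:
    """Decode a base64 string and verify that the result looks like a PNG."""
    raw = base64.b64decode(b64_data, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data does not start with the PNG signature")
    return raw
```

The returned bytes can be written to a file opened in binary mode, for example `open("page1.png", "wb")`.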