
MCP vLLM Benchmarking Tool MCP Server

A proof-of-concept MCP server to interactively benchmark vLLM endpoints and compare results.

Installation
Add the following to your MCP client configuration file.

Configuration

{
    "mcpServers": {
        "mcp_vllm": {
            "command": "uv",
            "args": [
                "run",
                "/Path/TO/mcp-vllm-benchmarking-tool/server.py"
            ]
        }
    }
}

You can run an MCP server that interactively benchmarks vLLM endpoints and compares performance across configurations. This makes it easy to run repeatable benchmarks, capture results, and analyze how different models or endpoints perform under your workload.
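To give a sense of the shape of such a server, here is a minimal sketch built on the Python MCP SDK's FastMCP helper. The tool name matches this page, but the parameter names, the OpenAI-compatible request, and the returned fields are illustrative assumptions rather than the actual server.py implementation.

# Minimal sketch, not the real server.py: parameter names, the /v1/completions
# request shape, and the returned fields are assumptions.
import time
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("mcp_vllm")

@mcp.tool()
def benchmark(base_url: str, model: str, prompt: str, iterations: int = 5) -> list[dict]:
    """Send the same prompt to a vLLM endpoint several times and record latency."""
    results = []
    for i in range(iterations):
        start = time.perf_counter()
        resp = requests.post(
            f"{base_url}/v1/completions",
            json={"model": model, "prompt": prompt, "max_tokens": 128},
            timeout=120,
        )
        elapsed = time.perf_counter() - start
        results.append({
            "iteration": i + 1,
            "latency_s": round(elapsed, 3),
            "status_code": resp.status_code,
            "warmup": i == 0,  # first iteration is treated as warmup
        })
    return results

if __name__ == "__main__":
    mcp.run()  # serve over stdio so an MCP client can launch it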

How to use

Install and run the MCP benchmarking server, then use an MCP client to initiate benchmarks. You can prompt the tool to run a vLLM benchmark against a specific endpoint, specify the model, and run multiple iterations to gather stable results. The server runs the benchmark and presents a comparison of the results, automatically treating the first iteration as a warmup and excluding it from the final comparison.
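For a programmatic example of that flow, the sketch below drives the server with the Python MCP SDK's stdio client. The benchmark tool name comes from this page; the endpoint URL, model name, and argument keys are assumptions you would adapt to your setup.

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Launch the server the same way the client configuration above does.
    params = StdioServerParameters(
        command="uv",
        args=["run", "/Path/TO/mcp-vllm-benchmarking-tool/server.py"],
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # Argument names are illustrative; check the server's tool schema.
            result = await session.call_tool(
                "benchmark",
                arguments={
                    "base_url": "http://localhost:8000",
                    "model": "your-model",
                    "prompt": "Hello, world",
                    "iterations": 5,
                },
            )
            print(result.content)

asyncio.run(main())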

How to install

Prerequisites: you need a way to launch MCP stdio servers (such as the uv command runner) and the Python script from this repository that implements the benchmarking logic (server.py).

1. Clone the benchmarking tool repository.

2. Add the MCP server configuration to your MCP setup using the following snippet.

{
  "mcpServers": {
    "mcp_vllm": {
      "command": "uv",
      "args": [
        "run",
        "/Path/TO/mcp-vllm-benchmarking-tool/server.py"
      ]
    }
  }
}

Additional notes

Be aware that vLLM output can vary between runs and may occasionally produce JSON that fails to parse. If this happens, re-run the benchmark to confirm consistency and track any fluctuations. The flow described here is intended to provide an interactive, repeatable benchmarking mechanism within MCP.
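If you want to guard against that in your own tooling, one option is a small parse-and-retry helper around whatever call produces the JSON; this is a generic sketch, not part of the benchmarking server itself.

import json
import time

def parse_json_with_retry(fetch, attempts=3, delay=1.0):
    """Call fetch() for a JSON string and retry briefly if it fails to parse."""
    last_error = None
    for _ in range(attempts):
        raw = fetch()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            time.sleep(delay)  # pause, then re-run the request
    raise ValueError(f"response never parsed as valid JSON: {last_error}")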

Available tools

benchmark

Runs a benchmark against a vLLM endpoint, collecting timing and related per-iteration metrics across multiple iterations to produce a comparative report.

ignore_warmup

Excludes the first iteration from the final results to remove warmup bias when comparing benchmarks.
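To make the warmup handling concrete, the snippet below shows one way to aggregate per-iteration latencies while dropping the first run; the field names are illustrative, not the tool's actual output format.

def summarize(latencies_s: list[float], ignore_warmup: bool = True) -> dict:
    """Aggregate per-iteration latencies, optionally dropping the first (warmup) run."""
    measured = latencies_s[1:] if ignore_warmup and len(latencies_s) > 1 else latencies_s
    return {
        "iterations_measured": len(measured),
        "avg_latency_s": sum(measured) / len(measured),
        "min_latency_s": min(measured),
        "max_latency_s": max(measured),
    }

# Example: the 2.4 s first run is the warmup and is excluded from the averages.
print(summarize([2.4, 1.1, 1.0, 1.05]))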