Spark SQL MCP Server
Provides read-only access to Spark SQL data via Thrift/HiveServer2 for AI assistants, with schema discovery and multiple authentication options.
Configuration
{
  "mcpServers": {
    "aidancorrell-spark-sql-mcp-server": {
      "command": "uvx",
      "args": [
        "spark-sql-mcp-server"
      ],
      "env": {
        "SPARK_AUTH": "NONE",
        "SPARK_HOST": "your-spark-host.example.com",
        "SPARK_PORT": "10000",
        "SPARK_DATABASE": "default",
        "SPARK_PASSWORD": "YOUR_PASSWORD",
        "SPARK_USERNAME": "YOUR_USERNAME",
        "SPARK_KERBEROS_SERVICE_NAME": "hive"
      }
    }
  }
}

You can run a lightweight MCP server that lets AI assistants query a Spark SQL cluster using the Thrift/HiveServer2 protocol. It supports read-only queries, schema discovery, and multiple authentication methods, making it easy to integrate Spark data into conversational workflows while keeping queries safe and scoped to read operations.
To use this MCP server, you run it locally or in your environment and connect your MCP client (such as Claude) to it. You can perform read-only SQL operations against your Spark cluster, discover databases and tables, and fetch table schemas. Begin by starting the MCP server, then configure your client to reference the server by its MCP connection details. Typical usage patterns include listing available databases, listing tables in a database, describing a table’s schema, and executing read-only queries that return results in a readable format.
Prerequisites: You need Python installed on your system. You can either install the MCP server package with pip or run it directly with uvx.
# Install the MCP server package
pip install spark-sql-mcp-server
# Or run directly with uvx
uvx spark-sql-mcp-server

Before starting, set environment variables to point the MCP server at your Spark cluster and to control how you authenticate. Common variables include the host, port, database, and authentication method. You can run the server with these environment settings and then configure your client to connect using the same values.
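As a sketch of how a server like this might resolve those settings, the snippet below reads the documented variable names from the environment with the defaults shown in the configuration above. The helper function itself is hypothetical, not part of the package:

```python
import os

def load_spark_settings():
    """Collect Spark Thrift connection settings from the environment.

    Hypothetical helper for illustration; the real package may resolve
    its configuration differently. Variable names match the docs above.
    """
    return {
        "host": os.environ.get("SPARK_HOST", "localhost"),
        "port": int(os.environ.get("SPARK_PORT", "10000")),
        "database": os.environ.get("SPARK_DATABASE", "default"),
        "auth": os.environ.get("SPARK_AUTH", "NONE"),
        "username": os.environ.get("SPARK_USERNAME"),
        "password": os.environ.get("SPARK_PASSWORD"),
        "kerberos_service_name": os.environ.get(
            "SPARK_KERBEROS_SERVICE_NAME", "hive"
        ),
    }
```

Whatever values you export before launching the server should match the `env` block in your client's MCP configuration, so both sides agree on host, port, and authentication.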
The server enforces read-only query execution for safety. Only statements such as SELECT, SHOW, DESCRIBE, EXPLAIN, and WITH are allowed. If a query would modify data or alter schema, it will be blocked before reaching the Spark cluster. Passwords and sensitive details are masked in logs and error messages to reduce exposure.
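The exact enforcement logic is internal to the server, but a first-keyword allowlist captures the idea. This is a simplified sketch, not the package's actual implementation:

```python
# Statement types the docs list as allowed; anything else is rejected.
READ_ONLY_PREFIXES = ("SELECT", "SHOW", "DESCRIBE", "EXPLAIN", "WITH")

def is_read_only(sql: str) -> bool:
    """Return True if the statement begins with an allowed read-only keyword."""
    stripped = sql.strip().rstrip(";").strip()
    if not stripped:
        return False
    first_keyword = stripped.split(None, 1)[0].upper()
    return first_keyword in READ_ONLY_PREFIXES
```

A real implementation would also need to handle comments, multi-statement input, and CTEs that end in a write, which is why the check runs server-side before anything reaches the cluster.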
Steps you would follow in a typical workflow include: setting up the MCP server with the appropriate SPARK_HOST, SPARK_PORT, and SPARK_AUTH values; starting the MCP server; and configuring your AI assistant client to use the server as a data source. Then you can ask questions like which databases exist, what the schema of a specific table is, or to run a read-only query to fetch the top records.
If you cannot connect, verify that the Spark Thrift Server is accessible from the environment where the MCP server runs, and check that the host, port, and authentication settings match your Spark cluster configuration. If queries fail due to permissions, review your Spark user rights and ensure you are using an authentication method that your cluster accepts.
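A quick TCP reachability probe from the machine running the MCP server can rule out network problems before you start debugging authentication. The host and port below are placeholders for your own SPARK_HOST and SPARK_PORT values:

```python
import socket

def can_reach(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: can_reach("your-spark-host.example.com", 10000)
```

If this returns False, the issue is connectivity (firewalls, security groups, wrong host/port), not Spark permissions; if it returns True but queries still fail, look at authentication and user rights instead.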
The MCP server is compatible with HiveServer2-compatible systems, including Apache Spark, AWS EMR, Hive, Impala, and Presto. When using EMR, ensure that security groups allow access to the Thrift port and consider using an SSH tunnel to protect credentials in transit.
Environment variables shown in the examples include SPARK_HOST, SPARK_PORT, SPARK_DATABASE, SPARK_AUTH, SPARK_USERNAME, SPARK_PASSWORD, and SPARK_KERBEROS_SERVICE_NAME. The server can be started with a command such as uvx spark-sql-mcp-server, with the authentication and host details supplied through these environment variables.
If you want to contribute or run tests locally, install the project in editable mode and run the test suite. You can also run a local Docker-based Spark Thrift Server for integration tests. Follow the project’s testing steps to ensure your setup works with Claude or your MCP client.
Tools
List all available databases on the connected Spark cluster.
List tables within a specified database.
Describe the schema of a specific table, including column names and types.
Run read-only SQL queries with results formatted for display.