Spark Optimizer MCP server for AI agents

This MCP server optimizes Apache Spark code using Claude. It uses a client-server architecture built on the Model Context Protocol (MCP): the client sends PySpark code to the server, and the server returns an optimized version together with a performance analysis.

Installation Requirements

To get started with the Spark MCP Optimizer, you'll need:

Python 3.8+
PySpark 3.2.0+
Anthropic API Key (for Claude AI access)

Setting Up the Environment

Install all required dependencies:

pip install -r requirements.txt

Make sure to set your Anthropic API key as an environment variable:

export ANTHROPIC_API_KEY="your_api_key_here"
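Before starting the server, it can help to confirm that the key is actually visible to Python. The check below is a minimal sketch; the helper name has_anthropic_key is an assumption made for illustration and is not part of this project.

```python
import os


def has_anthropic_key(env=None):
    """Return True if ANTHROPIC_API_KEY is set to a non-empty value."""
    # Accept an explicit mapping so the check can be exercised without
    # touching the real environment (hypothetical helper, not project code).
    env = os.environ if env is None else env
    return bool(env.get("ANTHROPIC_API_KEY", "").strip())


if not has_anthropic_key():
    print("ANTHROPIC_API_KEY is not set; export it before starting the server.")
```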

Basic Usage

The Spark MCP Optimizer follows a simple workflow for optimizing your Apache Spark code:

Step 1: Prepare Your Code

Place your Spark code to be optimized in the input directory:

# Add your code to this file
vi input/spark_code_input.py

Step 2: Start the MCP Server

Launch the MCP server that will handle optimization requests:

python v1/run_server.py

The server will start and listen for incoming client connections.

Step 3: Run the Client

With the server running, execute the client to optimize your code:

python v1/run_client.py

This generates two output files:

output/optimized_spark_example.py: Your optimized Spark code
output/performance_analysis.md: A detailed analysis of the optimizations

Step 4: Compare Performance

Run a comparison between the original and optimized code:

python v1/run_optimized.py

This executes both versions, compares execution times, and updates the performance analysis with metrics.
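Conceptually, the comparison amounts to timing two callables under the same conditions. The sketch below illustrates that idea only; it is not the contents of v1/run_optimized.py, and time_run, original_job, and optimized_job are hypothetical names with placeholder workloads.

```python
import time


def time_run(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls to fn."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Placeholder workloads (assumptions). In practice each function would run one
# version of the Spark job and end with an action such as collect(), because
# Spark transformations are lazy and do no work until an action is called.
def original_job():
    sum(i * i for i in range(200_000))


def optimized_job():
    sum(i * i for i in range(20_000))


original_s = time_run(original_job)
optimized_s = time_run(optimized_job)
print(f"original: {original_s:.4f}s, optimized: {optimized_s:.4f}s")
```

Taking the best of several runs reduces noise from JVM warm-up and other one-off startup costs, which can otherwise dominate short Spark jobs.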

Advanced Usage

For more control, you can integrate the Spark MCP client into your own applications:

import asyncio

from spark_mcp.client import SparkMCPClient

async def main():
    # Connect to the MCP server
    client = SparkMCPClient()
    await client.connect()

    # Your Spark code to optimize
    spark_code = '''
    # Your PySpark code here
    '''

    # Get optimized code with performance analysis
    optimized_code = await client.optimize_spark_code(
        code=spark_code,
        optimization_level="advanced",
        save_to_file=True  # Save to output/optimized_spark_example.py
    )

    # Analyze performance differences
    analysis = await client.analyze_performance(
        original_code=spark_code,
        optimized_code=optimized_code,
        save_to_file=True  # Save to output/performance_analysis.md
    )

    await client.close()

# Run the coroutine; defining main() alone does not execute it
if __name__ == "__main__":
    asyncio.run(main())

Available Optimization Tools

The MCP server provides two primary tools:

1. optimize_spark_code

This tool analyzes and optimizes your PySpark code:

# Example parameters
optimized_code = await client.optimize_spark_code(
    code=your_spark_code,
    optimization_level="advanced",  # "basic" or "advanced"
    save_to_file=True
)

2. analyze_performance

Compare and analyze original vs. optimized code:

analysis = await client.analyze_performance(
    original_code=your_spark_code,
    optimized_code=optimized_code,
    save_to_file=True
)
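The two tools are typically called back to back, so a small wrapper can validate the level and chain them. This is a sketch under assumptions: optimize_and_analyze is a hypothetical helper, it assumes save_to_file also accepts False, and EchoClient is a stand-in that only mimics the two method signatures shown above so the example runs without a server.

```python
import asyncio

VALID_LEVELS = ("basic", "advanced")


async def optimize_and_analyze(client, code, level="advanced"):
    """Run both tools in sequence and return (optimized_code, analysis)."""
    if level not in VALID_LEVELS:
        raise ValueError(f"optimization_level must be one of {VALID_LEVELS}")
    optimized = await client.optimize_spark_code(
        code=code, optimization_level=level, save_to_file=False
    )
    analysis = await client.analyze_performance(
        original_code=code, optimized_code=optimized, save_to_file=False
    )
    return optimized, analysis


class EchoClient:
    """Stand-in client (assumption) used only so this sketch runs offline."""

    async def optimize_spark_code(self, code, optimization_level, save_to_file):
        return code.strip()

    async def analyze_performance(self, original_code, optimized_code, save_to_file):
        return f"{len(original_code)} -> {len(optimized_code)} characters"


optimized, analysis = asyncio.run(
    optimize_and_analyze(EchoClient(), "  df.count()  ", level="basic")
)
print(optimized)
print(analysis)
```

With a real SparkMCPClient, the same wrapper would be awaited between client.connect() and client.close().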

Example Results

Input Code

# Create DataFrames and join
emp_df = spark.createDataFrame(employees, ["id", "name", "age", "dept", "salary"])
dept_df = spark.createDataFrame(departments, ["dept", "location", "budget"])

# Join and analyze
result = emp_df.join(dept_df, "dept") \
    .groupBy("dept", "location") \
    .agg({"salary": "avg", "age": "avg", "id": "count"}) \
    .orderBy("dept")

Optimized Code

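# Performance-optimized version with caching and improved configurations
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, count

# Note: 200 is also Spark's default for spark.sql.shuffle.partitions;
# tune it to the data size and cluster
spark = SparkSession.builder \
    .appName("EmployeeAnalysis") \
    .config("spark.sql.shuffle.partitions", 200) \
    .getOrCreate()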

# Create and cache DataFrames
emp_df = spark.createDataFrame(employees, ["id", "name", "age", "dept", "salary"]).cache()
dept_df = spark.createDataFrame(departments, ["dept", "location", "budget"]).cache()

# Optimized join and analysis
result = emp_df.join(dept_df, "dept") \
    .groupBy("dept", "location") \
    .agg(
        avg("salary").alias("avg_salary"),
        avg("age").alias("avg_age"),
        count("id").alias("employee_count")
    ) \
    .orderBy("dept")

Performance Analysis

The system will generate a detailed analysis showing improvements:

## Execution Results Comparison

### Timing Comparison
- Original Code: 5.18 seconds
- Optimized Code: 0.65 seconds
- Performance Improvement: 87.5%

### Optimization Details
- Caching frequently used DataFrames
- Optimized shuffle partitions
- Improved column expressions
- Better memory management
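The improvement figure follows from the two timings as (original - optimized) / original. The sketch below reproduces that arithmetic for the sample timings; the function names are chosen for illustration and are not taken from the project.

```python
def improvement_percent(original_seconds, optimized_seconds):
    """Percentage reduction in run time relative to the original run."""
    if original_seconds <= 0:
        raise ValueError("original_seconds must be positive")
    return (original_seconds - optimized_seconds) / original_seconds * 100


def speedup(original_seconds, optimized_seconds):
    """How many times faster the optimized run is."""
    return original_seconds / optimized_seconds


# Sample timings from the analysis above
pct = improvement_percent(5.18, 0.65)
factor = speedup(5.18, 0.65)
print(f"{pct:.2f}% less time, about {factor:.1f}x faster")
```

For the sample timings this works out to roughly 87.45% less time, or close to an 8x speedup.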

How to install this MCP server

For Claude Code

To add this MCP server to Claude Code, run this command in your terminal:

claude mcp add-json "spark-optimizer" '{"command":"python","args":["v1/run_server.py"]}'

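The relative path v1/run_server.py resolves only when the server is launched from the repository root; otherwise, use the absolute path to run_server.py. See the official Claude Code MCP documentation for more details.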
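One way to avoid depending on the working directory is to generate the command with an absolute script path. The sketch below does that; build_add_json_command is a hypothetical helper, and it only prints the command shown above rather than running it. The quoting follows POSIX shell rules, so the output is meant for macOS and Linux terminals.

```python
import json
import shlex
from pathlib import Path


def build_add_json_command(repo_root, name="spark-optimizer"):
    """Return the `claude mcp add-json` command with an absolute server path."""
    server_script = Path(repo_root).resolve() / "v1" / "run_server.py"
    config = {"command": "python", "args": [str(server_script)]}
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact
    parts = ["claude", "mcp", "add-json", shlex.quote(name), shlex.quote(json.dumps(config))]
    return " ".join(parts)


# Run from the repository root, or pass the path to the checkout explicitly
print(build_add_json_command("."))
```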

For Cursor

There are two ways to add an MCP server to Cursor. The most common way is to add the server globally in the ~/.cursor/mcp.json file so that it is available in all of your projects.

If you only need the server in a single project, add it to that project's .cursor/mcp.json file instead, creating the file if it does not exist.

Adding an MCP server to Cursor globally

To add a global MCP server, go to Cursor Settings > Tools & Integrations and click "New MCP Server".

Clicking that button opens the ~/.cursor/mcp.json file, where you can add your server like this:

{
    "mcpServers": {
        "spark-optimizer": {
            "command": "python",
            "args": [
                "v1/run_server.py"
            ]
        }
    }
}

Adding an MCP server to a project

To add an MCP server to a project you can create a new .cursor/mcp.json file or add it to the existing one. This will look exactly the same as the global MCP server example above.
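When the file already lists other servers, the new entry has to be merged in rather than pasted over the whole file. The sketch below shows one way to do that; add_server is a hypothetical helper, "other-server" is a made-up existing entry, and the demo writes to a temporary directory instead of a real project.

```python
import json
import tempfile
from pathlib import Path


def add_server(config_path, name, command, args):
    """Add or replace one entry under "mcpServers", keeping the other entries."""
    path = Path(config_path)
    text = path.read_text().strip() if path.exists() else ""
    config = json.loads(text) if text else {}
    config.setdefault("mcpServers", {})[name] = {"command": command, "args": list(args)}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4) + "\n")
    return config


# Demo in a throwaway directory: an existing (made-up) server survives the merge
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / ".cursor" / "mcp.json"
    add_server(cfg_path, "other-server", "node", ["server.js"])
    merged = add_server(cfg_path, "spark-optimizer", "python", ["v1/run_server.py"])
    print(sorted(merged["mcpServers"]))
```

In a real project the first argument would be .cursor/mcp.json relative to the project root.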

How to use the MCP server

Once the server is installed, you might need to head back to Settings > MCP and click the refresh button.

The Cursor agent can then see the tools the MCP server exposes and will call them when needed.

You can also explicitly ask the agent to use the tool by mentioning the tool name and describing what the function does.

For Claude Desktop

To add this MCP server to Claude Desktop:

1. Find your configuration file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

2. Add this to your configuration file:

{
    "mcpServers": {
        "spark-optimizer": {
            "command": "python",
            "args": [
                "v1/run_server.py"
            ]
        }
    }
}

3. Restart Claude Desktop for the changes to take effect.
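The configuration file location from step 1 can also be resolved in code, for example by a setup script. The sketch below mirrors the three paths listed above; claude_desktop_config_path is a hypothetical helper, and the fallback for a missing APPDATA variable is an assumption.

```python
import os
import platform
from pathlib import Path

CONFIG_NAME = "claude_desktop_config.json"


def claude_desktop_config_path(system=None, home=None, appdata=None):
    """Return the Claude Desktop config path for the given (or current) platform."""
    system = system or platform.system()
    home = Path(home) if home else Path.home()
    if system == "Darwin":  # macOS
        return home / "Library" / "Application Support" / "Claude" / CONFIG_NAME
    if system == "Windows":
        # Assumed fallback if %APPDATA% is not set
        base = Path(appdata or os.environ.get("APPDATA") or home / "AppData" / "Roaming")
        return base / "Claude" / CONFIG_NAME
    return home / ".config" / "Claude" / CONFIG_NAME  # Linux


# Explicit arguments shown here; with no arguments the current platform
# and home directory are used.
print(claude_desktop_config_path(system="Linux", home="/home/alice"))
```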