
spark-sql-optimizer skill

/skills/11-data-pipelines/spark-sql-optimizer

This skill automates Spark SQL optimization guidance, producing production-ready configurations, best-practice patterns, and validation results for efficient data pipelines.

npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill spark-sql-optimizer

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (2.1 KB)
---
name: "spark-sql-optimizer"
description: |
  Optimize Spark SQL operations. Auto-activating skill for Data Pipelines.
  Triggers on: spark sql optimizer
  Part of the Data Pipelines skill category. Use when working with Spark SQL optimizer functionality. Trigger with phrases like "spark sql optimizer", "spark optimizer", "spark".
allowed-tools: "Read, Write, Edit, Bash(cmd:*), Grep"
version: 1.0.0
license: MIT
author: "Jeremy Longshore <[email protected]>"
---

# Spark SQL Optimizer

## Overview

This skill provides automated assistance for Spark SQL optimization tasks within the Data Pipelines domain.

## When to Use

This skill activates automatically when you:
- Mention "spark sql optimizer" in your request
- Ask about Spark SQL optimization patterns or best practices
- Need help with data pipeline tasks such as ETL, data transformation, workflow orchestration, or streaming data processing

## Instructions

1. Provides step-by-step guidance for Spark SQL optimization
2. Follows industry best practices and patterns
3. Generates production-ready code and configurations
4. Validates outputs against common standards

## Examples

**Example: Basic Usage**
Request: "Help me with spark sql optimizer"
Result: Provides step-by-step guidance and generates appropriate configurations


## Prerequisites

- Relevant development environment configured
- Access to necessary tools and services
- Basic understanding of data pipelines concepts


## Output

- Generated configurations and code
- Best practice recommendations
- Validation results


## Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| Configuration invalid | Missing required fields | Check documentation for required parameters |
| Tool not found | Dependency not installed | Install required tools per prerequisites |
| Permission denied | Insufficient access | Verify credentials and permissions |


## Resources

- Official documentation for related tools
- Best practices guides
- Community examples and tutorials

## Related Skills

Part of the **Data Pipelines** skill category.
Tags: etl, airflow, spark, streaming, data-engineering

Overview

This skill automates optimization tasks for Spark SQL within data pipelines, offering actionable guidance, code snippets, and configuration templates. It auto-activates when Spark SQL optimizer topics are detected and focuses on improving query performance, resource usage, and reliability. Use it to translate best practices into production-ready changes and validation checks.

How this skill works

The skill inspects query plans, execution metrics, and configuration settings to identify bottlenecks such as shuffle-heavy operations, skew, and suboptimal joins. It suggests rewrite patterns, partitioning strategies, caching options, and Spark configuration tweaks, and can generate code examples and validation steps. Outputs include optimized SQL transforms, Spark job configs, and testable validation checks against common standards.
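The plan-inspection step described above can be sketched without a cluster: scan the text of a physical plan (as printed by `spark.sql(query).explain(mode="formatted")`) for operators that typically imply a shuffle. The operator names below come from Spark's physical plans, but the sample plan and the "shuffle-heavy" classification are illustrative assumptions:

```python
# Minimal sketch: flag shuffle-heavy operators in a Spark physical plan.
# Operator names are from Spark's physical-plan output; treating them
# as "costly" is a heuristic, not a Spark rule.

SHUFFLE_OPERATORS = ("Exchange", "SortMergeJoin", "SortAggregate")

def find_shuffle_operators(plan_text: str) -> list[str]:
    """Return physical-plan lines that typically imply a shuffle."""
    hits = []
    for line in plan_text.splitlines():
        if any(op in line for op in SHUFFLE_OPERATORS):
            hits.append(line.strip())
    return hits

# Hypothetical plan text for a grouped aggregation over a parquet scan.
plan = """
== Physical Plan ==
* HashAggregate (5)
+- Exchange hashpartitioning(user_id, 200)
   +- * Project (3)
      +- * Filter (2)
         +- Scan parquet events (1)
"""
print(find_shuffle_operators(plan))
```

In practice the skill would pair such a scan with stage metrics (shuffle read/write sizes, task durations) before recommending a rewrite.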

When to use it

  • You need to speed up slow Spark SQL jobs or reduce cluster costs
  • You want automated recommendations for join strategy, partitioning, or caching
  • You are preparing pipelines for production and need validation and configs
  • You see skewed stages, excessive shuffles, or high GC pressure and task retries
  • You need standard patterns for ETL, streaming SQL, or complex transformations

Best practices

  • Analyze physical plans (EXPLAIN FORMATTED) and metrics before changing code
  • Prefer broadcast joins for small tables and avoid wide shuffles when possible
  • Use partition pruning and predicate pushdown to reduce input size
  • Cache intermediate results selectively and monitor memory pressure
  • Tune shuffle partitions and executor memory based on workload profile
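The shuffle-partition advice above can be reduced to a simple heuristic: size each partition to a byte target and round up. A minimal sketch in plain Python; the 128 MB target and the 4000-partition cap are illustrative assumptions, not Spark defaults:

```python
import math

def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_bytes: int = 128 * 1024**2,
                               min_partitions: int = 1,
                               max_partitions: int = 4000) -> int:
    """Suggest a spark.sql.shuffle.partitions value from observed shuffle size.

    Heuristic: one partition per ~target_partition_bytes of shuffle data,
    clamped to [min_partitions, max_partitions]. Tune both for your cluster.
    """
    n = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(min_partitions, min(n, max_partitions))

# A 10 GB shuffle at a 128 MB target -> 80 partitions.
print(suggest_shuffle_partitions(10 * 1024**3))
```

The result would be applied, after review, as `spark.conf.set("spark.sql.shuffle.partitions", n)`; on Spark 3.x, adaptive query execution can coalesce partitions dynamically instead.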

Example use cases

  • Rewrite a SQL query to replace a large shuffle join with a broadcast join and show generated Spark conf changes
  • Detect data skew on a grouping key and suggest salting or repartition strategies with code examples
  • Generate a production-ready Spark SQL job config that balances parallelism and memory for a given cluster size
  • Validate that a refactored transformation still produces the same results while showing improved metrics
  • Recommend checkpointing and state management changes for streaming SQL pipelines
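The skew-salting use case above can be illustrated without a cluster: spread a hot grouping key across N sub-keys, aggregate per sub-key, then combine in a second pass. A pure-Python sketch of the key transformation; the salt count of 8 and the `key#salt` format are assumptions for illustration:

```python
import hashlib

NUM_SALTS = 8  # illustrative; choose based on observed skew

def salted_key(key: str, row_id: int, hot_keys: set[str]) -> str:
    """Append a deterministic salt to hot keys so one grouping key is
    spread across NUM_SALTS sub-keys; cold keys pass through unchanged."""
    if key not in hot_keys:
        return key
    salt = int(hashlib.md5(str(row_id).encode()).hexdigest(), 16) % NUM_SALTS
    return f"{key}#{salt}"

hot = {"user_42"}  # hypothetical hot key found via value counts
keys = [salted_key("user_42", i, hot) for i in range(100)]
print(len(set(keys)))  # at most NUM_SALTS distinct salted keys
```

In Spark this would typically be a generated salt column plus a two-stage aggregation (partial aggregate on the salted key, final aggregate on the original key).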

FAQ

Does the skill modify my cluster automatically?

No. It provides recommended configuration and code; you apply changes in your environment after review.

What inputs does it need to analyze a job?

Provide the SQL query, EXPLAIN output or physical plan, and job metrics (task durations, shuffle sizes) for best recommendations.

Can it help with streaming jobs?

Yes. It suggests state management, watermarking, checkpointing, and micro-batch sizing for streaming SQL.
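As a concrete anchor for the streaming answer, a hedged configuration sketch. Paths and values are placeholders, not recommendations; `spark.sql.streaming.checkpointLocation` sets a default checkpoint root (it can also be set per query on `writeStream`), and the RocksDB state store provider requires Spark 3.2+:

```properties
# Illustrative spark-defaults.conf fragment for a streaming SQL job.
spark.sql.streaming.checkpointLocation        hdfs:///checkpoints/my-stream
spark.sql.shuffle.partitions                  64
spark.sql.streaming.stateStore.providerClass  org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
```

Watermarking and micro-batch sizing are set in code (`withWatermark`, `trigger`) rather than in configuration, so the skill would recommend those alongside a fragment like this.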