
spark-sql-optimizer skill

/skills/11-data-pipelines/spark-sql-optimizer

This skill automates Spark SQL optimization guidance, producing production-ready configurations, best-practice patterns, and validation results for efficient data pipelines.

npx playbooks add skill jeremylongshore/claude-code-plugins-plus-skills --skill spark-sql-optimizer

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (2.1 KB)
---
name: "spark-sql-optimizer"
description: |
  Optimize Spark SQL operations. Auto-activating skill for Data Pipelines.
  Triggers on: spark sql optimizer
  Part of the Data Pipelines skill category. Use when working with Spark SQL optimizer functionality. Trigger with phrases like "spark sql optimizer", "spark optimizer", "spark".
allowed-tools: "Read, Write, Edit, Bash(cmd:*), Grep"
version: 1.0.0
license: MIT
author: "Jeremy Longshore <[email protected]>"
---

# Spark SQL Optimizer

## Overview

This skill provides automated assistance for Spark SQL optimization tasks within the Data Pipelines domain.

## When to Use

This skill activates automatically when you:
- Mention "spark sql optimizer" in your request
- Ask about Spark SQL optimization patterns or best practices
- Need help with data pipeline tasks such as ETL, data transformation, workflow orchestration, or streaming data processing

## Instructions

1. Provides step-by-step guidance for Spark SQL optimization
2. Follows industry best practices and patterns
3. Generates production-ready code and configurations
4. Validates outputs against common standards

## Examples

**Example: Basic Usage**
Request: "Help me with spark sql optimizer"
Result: Provides step-by-step guidance and generates appropriate configurations


## Prerequisites

- Relevant development environment configured
- Access to necessary tools and services
- Basic understanding of data pipelines concepts


## Output

- Generated configurations and code
- Best practice recommendations
- Validation results


## Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| Configuration invalid | Missing required fields | Check documentation for required parameters |
| Tool not found | Dependency not installed | Install required tools per prerequisites |
| Permission denied | Insufficient access | Verify credentials and permissions |


## Resources

- Official documentation for related tools
- Best practices guides
- Community examples and tutorials

## Related Skills

Part of the **Data Pipelines** skill category.
Tags: etl, airflow, spark, streaming, data-engineering

Overview

This skill automates optimization tasks for Spark SQL within data pipelines, offering actionable guidance, code snippets, and configuration templates. It auto-activates when Spark SQL optimizer topics are detected and focuses on improving query performance, resource usage, and reliability. Use it to translate best practices into production-ready changes and validation checks.

How this skill works

The skill inspects query plans, execution metrics, and configuration settings to identify bottlenecks such as shuffle-heavy operations, skew, and suboptimal joins. It suggests rewrite patterns, partitioning strategies, caching options, and Spark configuration tweaks, and can generate code examples and validation steps. Outputs include optimized SQL transforms, Spark job configs, and testable validation checks against common standards.
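The plan-inspection step described above can be sketched without a cluster: scan the text of a physical plan (as printed by `spark.sql(query).explain(mode="formatted")`) for operators that typically imply a shuffle. The operator names below come from Spark's physical plans, but the sample plan and the "shuffle-heavy" classification are illustrative assumptions:

```python
# Minimal sketch: flag shuffle-heavy operators in a Spark physical plan.
# Operator names are from Spark's physical-plan output; treating them
# as "costly" is a heuristic, not a Spark rule.

SHUFFLE_OPERATORS = ("Exchange", "SortMergeJoin", "SortAggregate")

def find_shuffle_operators(plan_text: str) -> list[str]:
    """Return physical-plan lines that typically imply a shuffle."""
    hits = []
    for line in plan_text.splitlines():
        if any(op in line for op in SHUFFLE_OPERATORS):
            hits.append(line.strip())
    return hits

# Hypothetical plan text for a grouped aggregation over a parquet scan.
plan = """
== Physical Plan ==
* HashAggregate (5)
+- Exchange hashpartitioning(user_id, 200)
   +- * Project (3)
      +- * Filter (2)
         +- Scan parquet events (1)
"""
print(find_shuffle_operators(plan))
```

In practice the skill would pair such a scan with stage metrics (shuffle read/write sizes, task durations) before recommending a rewrite.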

When to use it

  • You need to speed up slow Spark SQL jobs or reduce cluster costs
  • You want automated recommendations for join strategy, partitioning, or caching
  • You are preparing pipelines for production and need validation and configs
  • You see skewed stages, excessive shuffles, or high GC pressure and task retries
  • You need standard patterns for ETL, streaming SQL, or complex transformations

Best practices

  • Analyze physical plans (EXPLAIN FORMATTED) and metrics before changing code
  • Prefer broadcast joins for small tables and avoid wide shuffles when possible
  • Use partition pruning and predicate pushdown to reduce input size
  • Cache intermediate results selectively and monitor memory pressure
  • Tune shuffle partitions and executor memory based on workload profile
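The shuffle-partition advice above can be reduced to a simple heuristic: size each partition to a byte target and round up. A minimal sketch in plain Python; the 128 MB target and the 4000-partition cap are illustrative assumptions, not Spark defaults:

```python
import math

def suggest_shuffle_partitions(shuffle_bytes: int,
                               target_partition_bytes: int = 128 * 1024**2,
                               min_partitions: int = 1,
                               max_partitions: int = 4000) -> int:
    """Suggest a spark.sql.shuffle.partitions value from observed shuffle size.

    Heuristic: one partition per ~target_partition_bytes of shuffle data,
    clamped to [min_partitions, max_partitions]. Tune both for your cluster.
    """
    n = math.ceil(shuffle_bytes / target_partition_bytes)
    return max(min_partitions, min(n, max_partitions))

# A 10 GB shuffle at a 128 MB target -> 80 partitions.
print(suggest_shuffle_partitions(10 * 1024**3))
```

The result would be applied, after review, as `spark.conf.set("spark.sql.shuffle.partitions", n)`; on Spark 3.x, adaptive query execution can coalesce partitions dynamically instead.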

Example use cases

  • Rewrite a SQL query to replace a large shuffle join with a broadcast join and show generated Spark conf changes
  • Detect data skew on a grouping key and suggest salting or repartition strategies with code examples
  • Generate a production-ready Spark SQL job config that balances parallelism and memory for a given cluster size
  • Validate that a refactored transformation still produces the same results while showing improved metrics
  • Recommend checkpointing and state management changes for streaming SQL pipelines
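The skew-salting use case above can be illustrated without a cluster: spread a hot grouping key across N sub-keys, aggregate per sub-key, then combine in a second pass. A pure-Python sketch of the key transformation; the salt count of 8 and the `key#salt` format are assumptions for illustration:

```python
import hashlib

NUM_SALTS = 8  # illustrative; choose based on observed skew

def salted_key(key: str, row_id: int, hot_keys: set[str]) -> str:
    """Append a deterministic salt to hot keys so one grouping key is
    spread across NUM_SALTS sub-keys; cold keys pass through unchanged."""
    if key not in hot_keys:
        return key
    salt = int(hashlib.md5(str(row_id).encode()).hexdigest(), 16) % NUM_SALTS
    return f"{key}#{salt}"

hot = {"user_42"}  # hypothetical hot key found via value counts
keys = [salted_key("user_42", i, hot) for i in range(100)]
print(len(set(keys)))  # at most NUM_SALTS distinct salted keys
```

In Spark this would typically be a generated salt column plus a two-stage aggregation (partial aggregate on the salted key, final aggregate on the original key).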

FAQ

Does the skill modify my cluster automatically?

No. It provides recommended configuration and code; you apply changes in your environment after review.

What inputs does it need to analyze a job?

Provide the SQL query, EXPLAIN output or physical plan, and job metrics (task durations, shuffle sizes) for best recommendations.

Can it help with streaming jobs?

Yes. It suggests state management, watermarking, checkpointing, and micro-batch sizing for streaming SQL.
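As a concrete anchor for the streaming answer, a hedged configuration sketch. Paths and values are placeholders, not recommendations; `spark.sql.streaming.checkpointLocation` sets a default checkpoint root (it can also be set per query on `writeStream`), and the RocksDB state store provider requires Spark 3.2+:

```properties
# Illustrative spark-defaults.conf fragment for a streaming SQL job.
spark.sql.streaming.checkpointLocation        hdfs:///checkpoints/my-stream
spark.sql.shuffle.partitions                  64
spark.sql.streaming.stateStore.providerClass  org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
```

Watermarking and micro-batch sizing are set in code (`withWatermark`, `trigger`) rather than in configuration, so the skill would recommend those alongside a fragment like this.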