
correlation-explorer skill

/correlation-explorer

This skill helps you explore and visualize dataset correlations, identify strong relationships, and prioritize features using multiple methods and heatmaps.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill correlation-explorer

Review the files below or copy the command above to add this skill to your agents.

---
name: correlation-explorer
description: Find and visualize correlations between variables in datasets. Use for data exploration, feature selection, or identifying relationships between columns.
---

# Correlation Explorer

Analyze correlations between variables in CSV/Excel datasets.

## Features

- **Correlation Matrix**: Compute all pairwise correlations
- **Heatmap Visualization**: Color-coded correlation display
- **Significance Testing**: P-values for correlations
- **Multiple Methods**: Pearson, Spearman, Kendall
- **Strong Correlations**: Find highly correlated pairs
- **Target Analysis**: Correlations with a specific variable

## Quick Start

```python
from correlation_explorer import CorrelationExplorer

explorer = CorrelationExplorer()

# Load and analyze
explorer.load_csv("sales_data.csv")
matrix = explorer.correlation_matrix()

# Find strong correlations
strong = explorer.find_strong_correlations(threshold=0.7)
print(strong)

# Generate heatmap
explorer.plot_heatmap("correlation_heatmap.png")
```

## CLI Usage

```bash
# Compute correlation matrix
python correlation_explorer.py --input data.csv --output correlations.csv

# Generate heatmap
python correlation_explorer.py --input data.csv --heatmap heatmap.png

# Find strong correlations
python correlation_explorer.py --input data.csv --strong --threshold 0.7

# Correlations with target variable
python correlation_explorer.py --input data.csv --target sales

# Use Spearman correlation
python correlation_explorer.py --input data.csv --method spearman

# Include p-values
python correlation_explorer.py --input data.csv --pvalues
```

## API Reference

### CorrelationExplorer Class

```python
class CorrelationExplorer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'CorrelationExplorer'
    def load_dataframe(self, df: pd.DataFrame) -> 'CorrelationExplorer'

    # Analysis
    def correlation_matrix(self, method: str = "pearson") -> pd.DataFrame
    def correlation_with_pvalues(self, method: str = "pearson") -> tuple
    def correlate_with_target(self, target: str, method: str = "pearson") -> pd.Series

    # Discovery
    def find_strong_correlations(self, threshold: float = 0.7) -> list
    def find_weak_correlations(self, threshold: float = 0.3) -> list

    # Visualization
    def plot_heatmap(self, output: str, **kwargs) -> str
    def plot_scatter(self, var1: str, var2: str, output: str) -> str

    # Export
    def to_csv(self, output: str) -> str
    def to_json(self, output: str) -> str
```

## Correlation Methods

| Method | Best For |
|--------|----------|
| `pearson` | Linear relationships, normally distributed data |
| `spearman` | Monotonic (possibly non-linear) relationships, ordinal data |
| `kendall` | Small samples, ordinal data |

```python
# Pearson (default) - parametric
matrix = explorer.correlation_matrix(method="pearson")

# Spearman - rank-based, non-parametric
matrix = explorer.correlation_matrix(method="spearman")

# Kendall - rank-based, non-parametric; suits small samples
matrix = explorer.correlation_matrix(method="kendall")
```
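
To make the difference between the methods concrete, here is a minimal sketch that uses plain pandas rather than the skill itself. The data is synthetic and the column names `x` and `y` are made up for illustration: `y` rises monotonically with `x` but not along a straight line, so the rank-based methods report a perfect association while Pearson does not.

```python
import pandas as pd

# Synthetic data (illustrative only): y grows monotonically with x, but not linearly.
df = pd.DataFrame({"x": range(1, 11)})
df["y"] = df["x"] ** 3

# DataFrame.corr accepts the same three method names listed in the table above.
results = {
    method: df.corr(method=method).loc["x", "y"]
    for method in ("pearson", "spearman", "kendall")
}

for method, value in results.items():
    print(f"{method:>8}: {value:.3f}")
# Pearson stays below 1 because the points do not lie on a line;
# Spearman and Kendall reach 1 because the rank order is fully preserved.
```

If the rank-based coefficients are much higher than Pearson on real data, that is often a hint that the relationship is monotonic but curved.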

## Output Format

### Correlation Matrix
```python
           sales  marketing  customers
sales      1.000      0.854      0.723
marketing  0.854      1.000      0.612
customers  0.723      0.612      1.000
```

### Strong Correlations
```python
[
    {"var1": "sales", "var2": "marketing", "correlation": 0.854, "abs_corr": 0.854},
    {"var1": "sales", "var2": "customers", "correlation": 0.723, "abs_corr": 0.723}
]
```
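
A list in this shape can be derived from any correlation matrix by walking its upper triangle. The sketch below is one way to do that with pandas and NumPy; it is not the skill's actual implementation. The helper name `strong_pairs` is hypothetical, and whether the real threshold comparison is `>=` or `>` is an assumption.

```python
import numpy as np
import pandas as pd


def strong_pairs(corr: pd.DataFrame, threshold: float = 0.7) -> list:
    """Hypothetical helper: list pairs whose |r| meets the threshold."""
    names = corr.columns
    # k=1 selects the strict upper triangle: each pair once, diagonal excluded.
    rows, cols = np.triu_indices(len(names), k=1)
    pairs = []
    for i, j in zip(rows, cols):
        r = float(corr.iloc[i, j])
        if abs(r) >= threshold:  # assumption: inclusive comparison
            pairs.append(
                {"var1": names[i], "var2": names[j], "correlation": r, "abs_corr": abs(r)}
            )
    # Strongest relationships first.
    return sorted(pairs, key=lambda p: p["abs_corr"], reverse=True)


# The example matrix from the "Correlation Matrix" output above.
names = ["sales", "marketing", "customers"]
corr = pd.DataFrame(
    [[1.000, 0.854, 0.723], [0.854, 1.000, 0.612], [0.723, 0.612, 1.000]],
    index=names,
    columns=names,
)

pairs = strong_pairs(corr, threshold=0.7)
for pair in pairs:
    print(pair)
```

With a threshold of 0.7 the marketing/customers pair (0.612) is filtered out, leaving the two sales pairs.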

### With P-Values
```python
{
    "correlations": DataFrame,
    "pvalues": DataFrame,
    "significant": [...],  # p < 0.05
}
```
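
As a rough illustration of how such a structure could be assembled, the sketch below computes Pearson coefficients and p-values pair by pair with `scipy.stats.pearsonr`. The function name `correlations_with_pvalues`, the `alpha` parameter, and the synthetic columns are assumptions made for this example; the skill's own code may differ.

```python
import numpy as np
import pandas as pd
from scipy import stats


def correlations_with_pvalues(df: pd.DataFrame, alpha: float = 0.05) -> dict:
    """Hypothetical sketch: pairwise Pearson r and p-values for numeric columns."""
    numeric = df.select_dtypes(include="number")
    names = numeric.columns
    n = len(names)
    corr = pd.DataFrame(np.eye(n), index=names, columns=names)
    # Diagonal p-values stay at 0.0: a column is perfectly correlated with itself.
    pvals = pd.DataFrame(np.zeros((n, n)), index=names, columns=names)
    significant = []
    for i in range(n):
        for j in range(i + 1, n):
            # Drop rows where either value is missing before testing the pair.
            pair = numeric[[names[i], names[j]]].dropna()
            r, p = stats.pearsonr(pair.iloc[:, 0], pair.iloc[:, 1])
            corr.iloc[i, j] = corr.iloc[j, i] = r
            pvals.iloc[i, j] = pvals.iloc[j, i] = p
            if p < alpha:
                significant.append((names[i], names[j]))
    return {"correlations": corr, "pvalues": pvals, "significant": significant}


# Synthetic data (illustrative only): y depends on x, noise is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = pd.DataFrame(
    {"x": x, "y": 2 * x + rng.normal(scale=0.5, size=200), "noise": rng.normal(size=200)}
)

result = correlations_with_pvalues(data)
print(result["pvalues"].round(4))
print(result["significant"])
```

Testing many pairs at a fixed 0.05 level will flag some by chance, so the `significant` list is best read as a set of candidates to inspect rather than confirmed findings.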

## Example Workflows

### Feature Selection
```python
explorer = CorrelationExplorer()
explorer.load_csv("features.csv")

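# Find features correlated with target
target_corr = explorer.correlate_with_target("target")
# Exclude the target itself in case the returned Series includes it (r = 1.0)
target_corr = target_corr.drop("target", errors="ignore")
important_features = target_corr[target_corr.abs() > 0.3].index.tolist()
print(f"Important features: {important_features}")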

# Find multicollinear features (to potentially drop)
strong = explorer.find_strong_correlations(threshold=0.9)
print("Highly correlated pairs (consider dropping one):")
for pair in strong:
    print(f"  {pair['var1']} <-> {pair['var2']}: {pair['correlation']:.3f}")
```

### Sales Analysis
```python
explorer = CorrelationExplorer()
explorer.load_csv("sales_data.csv")

# What drives sales?
sales_corr = explorer.correlate_with_target("revenue")
print("Factors correlated with revenue:")
for var, corr in sales_corr.sort_values(ascending=False).items():
    if var != "revenue":
        print(f"  {var}: {corr:.3f}")

# Visualize
explorer.plot_heatmap("sales_correlations.png")
```

### Data Exploration
```python
explorer = CorrelationExplorer()
explorer.load_csv("dataset.csv")

# Get full picture
corr, pvals = explorer.correlation_with_pvalues()

# Find all significant correlations
significant = []
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if pvals.iloc[i, j] < 0.05:
            significant.append({
                'var1': corr.columns[i],
                'var2': corr.columns[j],
                'r': corr.iloc[i, j],
                'p': pvals.iloc[i, j]
            })
```

## Heatmap Options

```python
explorer.plot_heatmap(
    output="heatmap.png",
    cmap="coolwarm",      # Color scheme
    annot=True,           # Show values
    figsize=(12, 10),     # Figure size
    vmin=-1, vmax=1,      # Color scale
    title="Correlation Matrix"
)
```

## Dependencies

- pandas>=2.0.0
- numpy>=1.24.0
- scipy>=1.10.0
- matplotlib>=3.7.0
- seaborn>=0.12.0

Overview

This skill finds and visualizes correlations between variables in CSV/Excel datasets to accelerate data exploration and feature selection. It computes correlation matrices and p-values with a choice of method (Pearson, Spearman, or Kendall) and generates heatmaps and scatter plots. Use it to detect multicollinearity, identify predictors for a target, or surface unexpected relationships quickly.

How this skill works

Load a dataset from CSV or a pandas DataFrame, then compute pairwise correlations using the selected method. The skill can return correlation matrices, p-value matrices, and lists of strong or weak pairs. Visual outputs include annotated heatmaps and pairwise scatter plots, and results can be exported to CSV or JSON for downstream use.

When to use it

  • Exploratory data analysis to summarize variable relationships
  • Feature selection and multicollinearity detection before modeling
  • Identifying variables strongly associated with a target outcome
  • Producing visual correlation reports for stakeholders
  • Checking robustness with parametric and nonparametric correlation methods

Best practices

  • Pre-clean data: handle missing values and convert categorical variables before correlation
  • Choose the method by data type: Pearson for linear relationships in roughly normal data, Spearman or Kendall for ordinal data or monotonic, non-linear relationships
  • Inspect p-values to avoid overinterpreting noisy correlations
  • Flag and review high absolute correlations (>0.7) for multicollinearity issues
  • Export results and visuals to include in model documentation or EDA reports
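
The pre-cleaning step in the first bullet can be sketched with plain pandas before the data is handed to the skill (for example through `load_dataframe`). The column names and values below are invented for illustration, and dropping incomplete rows is only one option; imputation may suit some datasets better.

```python
import pandas as pd

# Invented example data: one missing value and one categorical column.
raw = pd.DataFrame(
    {
        "revenue": [120.0, 150.0, None, 210.0, 180.0, 260.0],
        "ad_spend": [10.0, 14.0, 12.0, 22.0, 18.0, 27.0],
        "region": ["north", "south", "south", "north", "west", "west"],
    }
)

# 1. Handle missing values: here, drop rows with a missing numeric field.
clean = raw.dropna(subset=["revenue", "ad_spend"])

# 2. Convert the categorical column into numeric 0/1 indicator columns.
clean = pd.get_dummies(clean, columns=["region"], dtype=float)

# Every column is now numeric, so a correlation matrix can be computed.
corr = clean.corr()
print(corr.round(3))
```

Correlations with indicator columns measure how strongly membership in a category is associated with a variable, so they should be interpreted differently from correlations between two continuous measurements.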

Example use cases

  • Feature selection: rank features by absolute correlation with the target and remove redundant predictors
  • Sales analysis: find which marketing or product metrics correlate with revenue and visualize with a heatmap
  • Data auditing: surface unexpected correlations that may indicate data leakage or merged columns
  • Pre-model pipeline: detect and list highly correlated variable pairs to guide dimensionality reduction
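
For the pre-model pipeline case, a common follow-up is to keep one column from each highly correlated pair. The sketch below shows a widely used pandas idiom for that. The helper name `redundant_columns` and the toy data are hypothetical, and the greedy rule (drop the later column of each pair) is a simplification that ignores which feature is more useful for the model.

```python
import numpy as np
import pandas as pd


def redundant_columns(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Hypothetical helper: columns that correlate above the threshold
    with any column that appears earlier in the frame."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is examined once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data: b is nearly a multiple of a, c is unrelated to both.
features = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5, 6],
        "b": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],
        "c": [5, 1, 4, 2, 6, 3],
    }
)

to_drop = redundant_columns(features, threshold=0.9)
print(to_drop)  # ['b']
reduced = features.drop(columns=to_drop)
```

Because the result depends on column order, it can be worth ordering columns by their correlation with the target first so that the more informative member of each pair is the one retained.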

FAQ

Which correlation method should I pick?

Use Pearson for linear relationships and normally distributed data; use Spearman or Kendall for ordinal data or for monotonic but non-linear relationships. Kendall is a reasonable choice for small samples.

Can it test significance of correlations?

Yes. The tool can return p-value matrices and flag statistically significant pairs (commonly p < 0.05).