
correlation-explorer skill


This skill helps you explore and visualize dataset correlations, identify strong relationships, and prioritize features using multiple methods and heatmaps.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill correlation-explorer


---
name: correlation-explorer
description: Find and visualize correlations between variables in datasets. Use for data exploration, feature selection, or identifying relationships between columns.
---

# Correlation Explorer

Analyze correlations between variables in CSV/Excel datasets.

## Features

- **Correlation Matrix**: Compute all pairwise correlations
- **Heatmap Visualization**: Color-coded correlation display
- **Significance Testing**: P-values for correlations
- **Multiple Methods**: Pearson, Spearman, Kendall
- **Strong Correlations**: Find highly correlated pairs
- **Target Analysis**: Correlations with a specific variable

## Quick Start

```python
from correlation_explorer import CorrelationExplorer

explorer = CorrelationExplorer()

# Load and analyze
explorer.load_csv("sales_data.csv")
matrix = explorer.correlation_matrix()

# Find strong correlations
strong = explorer.find_strong_correlations(threshold=0.7)
print(strong)

# Generate heatmap
explorer.plot_heatmap("correlation_heatmap.png")
```

## CLI Usage

```bash
# Compute correlation matrix
python correlation_explorer.py --input data.csv --output correlations.csv

# Generate heatmap
python correlation_explorer.py --input data.csv --heatmap heatmap.png

# Find strong correlations
python correlation_explorer.py --input data.csv --strong --threshold 0.7

# Correlations with target variable
python correlation_explorer.py --input data.csv --target sales

# Use Spearman correlation
python correlation_explorer.py --input data.csv --method spearman

# Include p-values
python correlation_explorer.py --input data.csv --pvalues
```

## API Reference

### CorrelationExplorer Class

```python
class CorrelationExplorer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'CorrelationExplorer'
    def load_dataframe(self, df: pd.DataFrame) -> 'CorrelationExplorer'

    # Analysis
    def correlation_matrix(self, method: str = "pearson") -> pd.DataFrame
    def correlation_with_pvalues(self, method: str = "pearson") -> tuple
    def correlate_with_target(self, target: str, method: str = "pearson") -> pd.Series

    # Discovery
    def find_strong_correlations(self, threshold: float = 0.7) -> list
    def find_weak_correlations(self, threshold: float = 0.3) -> list

    # Visualization
    def plot_heatmap(self, output: str, **kwargs) -> str
    def plot_scatter(self, var1: str, var2: str, output: str) -> str

    # Export
    def to_csv(self, output: str) -> str
    def to_json(self, output: str) -> str
```

## Correlation Methods

| Method | Best For |
|--------|----------|
| `pearson` | Linear relationships, normal data |
| `spearman` | Monotonic relationships (linear or not), ordinal data |
| `kendall` | Small samples, ordinal data |

```python
# Pearson (default) - parametric
matrix = explorer.correlation_matrix(method="pearson")

# Spearman - rank-based, non-parametric
matrix = explorer.correlation_matrix(method="spearman")

# Kendall - rank-based, non-parametric; suited to small samples
matrix = explorer.correlation_matrix(method="kendall")
```
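To make the difference between the methods concrete, here is a minimal sketch that calls pandas' `DataFrame.corr` directly rather than the skill's wrapper. The data is synthetic and invented for illustration: `y` increases monotonically with `x`, but not linearly.

```python
import numpy as np
import pandas as pd

# Synthetic data (made up for illustration): y grows exponentially with x,
# so the relationship is perfectly monotonic but far from linear.
x = np.arange(1, 21, dtype=float)
df = pd.DataFrame({"x": x, "y": np.exp(x / 3.0)})

# Compute the x-y coefficient with each method via pandas.
results = {
    method: df.corr(method=method).loc["x", "y"]
    for method in ("pearson", "spearman", "kendall")
}

for method, r in results.items():
    print(f"{method}: {r:.3f}")
```

The rank-based methods report a perfect association because the ordering of `y` matches the ordering of `x` exactly, while Pearson comes out noticeably lower because it measures only the linear component.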

## Output Format

### Correlation Matrix
```text
           sales  marketing  customers
sales      1.000      0.854      0.723
marketing  0.854      1.000      0.612
customers  0.723      0.612      1.000
```

### Strong Correlations
```python
[
    {"var1": "sales", "var2": "marketing", "correlation": 0.854, "abs_corr": 0.854},
    {"var1": "sales", "var2": "customers", "correlation": 0.723, "abs_corr": 0.723}
]
```
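As a rough sketch of how a list like this could be derived from a correlation matrix, the helper below scans the upper triangle so each pair appears once. The function name `strong_pairs` and the `>=` comparison are assumptions for illustration; the skill's own `find_strong_correlations` may differ in both.

```python
import pandas as pd

def strong_pairs(corr: pd.DataFrame, threshold: float = 0.7) -> list:
    """Hypothetical helper: pairs above the diagonal with |r| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips self-pairs
            r = float(corr.iloc[i, j])
            if abs(r) >= threshold:
                pairs.append(
                    {"var1": cols[i], "var2": cols[j], "correlation": r, "abs_corr": abs(r)}
                )
    # Strongest relationships first, regardless of sign.
    return sorted(pairs, key=lambda p: p["abs_corr"], reverse=True)

# The example matrix from the Correlation Matrix section above.
names = ["sales", "marketing", "customers"]
corr = pd.DataFrame(
    [[1.000, 0.854, 0.723], [0.854, 1.000, 0.612], [0.723, 0.612, 1.000]],
    index=names,
    columns=names,
)

pairs = strong_pairs(corr, threshold=0.7)
for p in pairs:
    print(p)
```

Sorting by `abs_corr` keeps strong negative correlations near the top instead of burying them at the end.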

### With P-Values
```python
{
    "correlations": DataFrame,
    "pvalues": DataFrame,
    "significant": [...],  # p < 0.05
}
```
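For readers who want to see where the p-values could come from, here is a minimal sketch built on `scipy.stats.pearsonr`. The function `corr_with_pvalues` is a hypothetical stand-in, not the skill's implementation, and it returns a `(correlations, pvalues)` pair in the shape the API reference suggests. The data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy import stats

def corr_with_pvalues(df: pd.DataFrame):
    """Hypothetical helper: pairwise Pearson r and two-sided p-values."""
    numeric = df.select_dtypes(include="number")
    cols = list(numeric.columns)
    corr = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    # Diagonal p-values are left at 0.0: a column is trivially correlated with itself.
    pvals = pd.DataFrame(np.zeros((len(cols), len(cols))), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            pair = numeric[[a, b]].dropna()  # pairwise deletion of missing rows
            r, p = stats.pearsonr(pair[a], pair[b])
            corr.loc[a, b] = corr.loc[b, a] = r
            pvals.loc[a, b] = pvals.loc[b, a] = p
    return corr, pvals

# Synthetic data: y depends on x, z is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame({"x": x, "y": 2 * x + rng.normal(size=200), "z": rng.normal(size=200)})

corr, pvals = corr_with_pvalues(df)
significant = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if pvals.loc[a, b] < 0.05
]
print(significant)
```

With many columns, some pairs will fall below 0.05 by chance alone, so a significant p-value is a prompt for closer inspection rather than proof of a relationship.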

## Example Workflows

### Feature Selection
```python
explorer = CorrelationExplorer()
explorer.load_csv("features.csv")

# Find features correlated with target (exclude the target's own entry, if present)
target_corr = explorer.correlate_with_target("target").drop("target", errors="ignore")
important_features = target_corr[target_corr.abs() > 0.3].index.tolist()
print(f"Important features: {important_features}")

# Find multicollinear features (to potentially drop)
strong = explorer.find_strong_correlations(threshold=0.9)
print("Highly correlated pairs (consider dropping one):")
for pair in strong:
    print(f"  {pair['var1']} <-> {pair['var2']}: {pair['correlation']:.3f}")
```
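One possible way to act on the "consider dropping one" hint is sketched below using plain pandas. The helper `drop_redundant` and its tie-breaking rule (keep whichever feature of the pair correlates more strongly with the target) are assumptions for illustration, not part of the skill. The data is synthetic.

```python
import numpy as np
import pandas as pd

def drop_redundant(df: pd.DataFrame, target: str, threshold: float = 0.9) -> list:
    """Hypothetical greedy filter: from each highly correlated feature pair,
    drop the feature that is less correlated with the target."""
    corr = df.corr(numeric_only=True).abs()
    features = [c for c in corr.columns if c != target]
    dropped = set()
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            if a in dropped or b in dropped:
                continue  # this pair is already resolved
            if corr.loc[a, b] >= threshold:
                weaker = a if corr.loc[a, target] < corr.loc[b, target] else b
                dropped.add(weaker)
    return [f for f in features if f not in dropped]

# Synthetic data: "b" is a near-copy of "a"; "c" is independent of both.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
c = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "b": a + 0.05 * rng.normal(size=300),
    "c": c,
    "target": a + c + 0.1 * rng.normal(size=300),
})

kept = drop_redundant(df, target="target", threshold=0.9)
print(kept)
```

This greedy rule is simple but order-dependent; for larger feature sets, domain knowledge or a model-based importance measure may be a better basis for choosing which feature to keep.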

### Sales Analysis
```python
explorer = CorrelationExplorer()
explorer.load_csv("sales_data.csv")

# What drives sales?
sales_corr = explorer.correlate_with_target("revenue")
print("Factors correlated with revenue:")
for var, corr in sales_corr.sort_values(ascending=False).items():
    if var != "revenue":
        print(f"  {var}: {corr:.3f}")

# Visualize
explorer.plot_heatmap("sales_correlations.png")
```

### Data Exploration
```python
explorer = CorrelationExplorer()
explorer.load_csv("dataset.csv")

# Get full picture
corr, pvals = explorer.correlation_with_pvalues()

# Find all significant correlations
significant = []
for i in range(len(corr.columns)):
    for j in range(i+1, len(corr.columns)):
        if pvals.iloc[i, j] < 0.05:
            significant.append({
                'var1': corr.columns[i],
                'var2': corr.columns[j],
                'r': corr.iloc[i, j],
                'p': pvals.iloc[i, j]
            })
```

## Heatmap Options

```python
explorer.plot_heatmap(
    output="heatmap.png",
    cmap="coolwarm",      # Color scheme
    annot=True,           # Show values
    figsize=(12, 10),     # Figure size
    vmin=-1, vmax=1,      # Color scale
    title="Correlation Matrix"
)
```

## Dependencies

- pandas>=2.0.0
- numpy>=1.24.0
- scipy>=1.10.0
- matplotlib>=3.7.0
- seaborn>=0.12.0

## Overview

This skill finds and visualizes correlations between variables in CSV/Excel datasets to accelerate data exploration and feature selection. It computes correlation matrices and p-values using the Pearson, Spearman, or Kendall method, and generates heatmaps and scatter plots. Use it to detect multicollinearity, identify predictors for a target, or surface unexpected relationships quickly.

## How this skill works

Load a dataset from CSV or a pandas DataFrame, then compute pairwise correlations using the selected method. It can return correlation matrices, p-value matrices, and lists of strong or weak pairs. Visual outputs include annotated heatmaps and pairwise scatter plots; results can be exported to CSV or JSON for downstream use.

## When to use it

- Exploratory data analysis to summarize variable relationships
- Feature selection and multicollinearity detection before modeling
- Identifying variables strongly associated with a target outcome
- Producing visual correlation reports for stakeholders
- Checking robustness with parametric and nonparametric correlation methods

## Best practices

- Pre-clean data: handle missing values and encode categorical variables as numbers before computing correlations
- Choose the method by data type: Pearson for linear relationships between roughly normal variables, Spearman or Kendall for ordinal data or monotonic relationships
- Inspect p-values to avoid overinterpreting noisy correlations
- Flag and review high absolute correlations (|r| > 0.7) for multicollinearity issues
- Export results and visuals to include in model documentation or EDA reports
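
The pre-cleaning practice above can be sketched with plain pandas before handing the frame to `load_dataframe`. The column names and values here are invented for illustration, and one-hot encoding is only one of several reasonable ways to encode a categorical column.

```python
import pandas as pd

# Made-up example data: one missing revenue value and one categorical column.
raw = pd.DataFrame({
    "revenue": [120.0, 95.0, None, 180.0, 150.0, 110.0],
    "ad_spend": [30.0, 20.0, 25.0, 50.0, 40.0, 22.0],
    "region": ["north", "south", "north", "west", "west", "south"],
})

# One-hot encode the categorical column so it can take part in the correlation.
encoded = pd.get_dummies(raw, columns=["region"], dtype=float)

# Drop rows with missing values (imputation is an alternative).
clean = encoded.dropna()

matrix = clean.corr(method="pearson")
print(matrix.round(3))
```

pandas' `DataFrame.corr` already skips missing values pair by pair, so dropping incomplete rows up front is a choice: it keeps every coefficient computed on the same set of rows, at the cost of discarding some data.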

## Example use cases

- Feature selection: rank features by absolute correlation with the target and remove redundant predictors
- Sales analysis: find which marketing or product metrics correlate with revenue and visualize with a heatmap
- Data auditing: surface unexpected correlations that may indicate data leakage or merged columns
- Pre-model pipeline: detect and list highly correlated variable pairs to guide dimensionality reduction

## FAQ

### Which correlation method should I pick?

Use Pearson for linear relationships between roughly normally distributed variables. Use Spearman or Kendall for ordinal data or for monotonic relationships that are not linear; Kendall is often preferred for small samples.

### Can it test significance of correlations?

Yes. The tool can return p-value matrices and flag statistically significant pairs (commonly p < 0.05).