
This skill helps you perform PCA with varimax rotation to reduce dimensionality and reveal underlying factors for better data interpretation.

npx playbooks add skill benchflow-ai/skillsbench --skill pca-decomposition

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (3.9 KB)
---
name: pca-decomposition
description: Reduce dimensionality of multivariate data using PCA with varimax rotation. Use when you have many correlated variables and need to identify underlying factors or reduce collinearity.
license: MIT
---

# PCA Decomposition Guide

## Overview

Principal Component Analysis (PCA) reduces many correlated variables into fewer uncorrelated components. Varimax rotation makes the components easier to interpret by maximizing the variance of the squared loadings, so each variable loads strongly on a few components and weakly on the rest.

## When to Use PCA

- Many correlated predictor variables
- Need to identify underlying factor groups
- Reduce multicollinearity before regression
- Exploratory data analysis

## Basic PCA with Varimax Rotation
```python
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer

# Standardize data first
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# PCA-style extraction with varimax rotation.
# method='principal' uses principal components; the default ('minres')
# fits a common factor model instead.
fa = FactorAnalyzer(n_factors=4, rotation='varimax', method='principal')
fa.fit(X_scaled)

# Factor loadings (rows = variables, columns = factors)
loadings = fa.loadings_

# Factor scores for each observation
scores = fa.transform(X_scaled)
```

## Workflow for Attribution Analysis

When using PCA for contribution analysis with predefined categories:

1. **Combine ALL variables first**, then do PCA together:
```python
# Include all variables from all categories in one matrix
all_vars = ['AirTemp', 'NetRadiation', 'Precip', 'Inflow', 'Outflow',
            'WindSpeed', 'DevelopedArea', 'AgricultureArea']
X = df[all_vars].values

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# One PCA on ALL variables together
fa = FactorAnalyzer(n_factors=4, rotation='varimax', method='principal')
fa.fit(X_scaled)
scores = fa.transform(X_scaled)
```

2. **Interpret loadings** to map factors onto categories (optional, but it aids interpretation)

3. **Use factor scores directly** for R² decomposition

**Important**: Do NOT run separate PCA for each category. Run one global PCA on all variables, then use the resulting factor scores for contribution analysis.
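Because factor scores from one global fit with an orthogonal rotation are mutually uncorrelated, the regression R² splits cleanly into one piece per factor: each factor's share is simply its squared correlation with the response. A minimal NumPy sketch of that decomposition, using synthetic stand-in scores and invented coefficients (not real FactorAnalyzer output):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for factor scores: orthogonal, zero-mean columns
# (varimax-rotated scores from one global PCA are mutually uncorrelated).
raw = rng.normal(size=(200, 4))
Q, _ = np.linalg.qr(raw - raw.mean(axis=0))
scores = Q * np.sqrt(len(Q))  # rescale columns to roughly unit variance

# Synthetic response built from the scores (coefficients are invented)
beta = np.array([2.0, -1.0, 0.5, 0.0])
y = scores @ beta + 0.1 * rng.normal(size=200)
y = y - y.mean()

# Total R^2 from regressing y on all factor scores at once
coef, *_ = np.linalg.lstsq(scores, y, rcond=None)
r2_total = 1 - np.sum((y - scores @ coef) ** 2) / np.sum(y ** 2)

# Per-factor share: squared correlation of each score column with y
r2_parts = np.array([np.corrcoef(scores[:, j], y)[0, 1] ** 2
                     for j in range(scores.shape[1])])

print(r2_total, r2_parts.sum())  # equal because score columns are uncorrelated
```

This additivity is what separate per-category PCAs would break: scores from different fits are generally correlated, so their R² shares overlap and no longer sum to the total.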

## Interpreting Factor Loadings

Loadings are the correlations between the original variables and the rotated components; judge strength by absolute value:

| \|Loading\| | Interpretation |
|-------------|----------------|
| > 0.7 | Strong association |
| 0.4 - 0.7 | Moderate association |
| < 0.4 | Weak association |

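The thresholds above are easy to apply programmatically. A small sketch with a hypothetical loadings matrix (variable and factor names are invented for illustration):

```python
import pandas as pd

# Hypothetical loadings for illustration (not real model output)
loadings = pd.DataFrame(
    [[0.85, 0.10], [0.72, -0.05], [0.12, -0.91], [0.45, 0.38]],
    index=['AirTemp', 'NetRadiation', 'Precip', 'WindSpeed'],
    columns=['RC1', 'RC2'],
)

def strength(x):
    # Judge by absolute value; sign only indicates direction
    a = abs(x)
    return 'strong' if a > 0.7 else 'moderate' if a >= 0.4 else 'weak'

labels = loadings.apply(lambda col: col.map(strength))
print(labels)
```

Note that Precip's loading of -0.91 is still labeled strong: the sign tells you the direction of the relationship, not its importance.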
## Example: Economic Indicators
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from factor_analyzer import FactorAnalyzer

# Variables: gdp, unemployment, inflation, interest_rate, exports, imports
df = pd.read_csv('economic_data.csv')
variables = ['gdp', 'unemployment', 'inflation',
             'interest_rate', 'exports', 'imports']

X = df[variables].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

fa = FactorAnalyzer(n_factors=3, rotation='varimax', method='principal')
fa.fit(X_scaled)

# View loadings
loadings_df = pd.DataFrame(
    fa.loadings_,
    index=variables,
    columns=['RC1', 'RC2', 'RC3']
)
print(loadings_df.round(2))
```

## Choosing Number of Factors

### Option 1: Kaiser Criterion
```python
# Fit an unrotated model first, then inspect the eigenvalues
fa = FactorAnalyzer(rotation=None)
fa.fit(X_scaled)
eigenvalues, _ = fa.get_eigenvalues()

# Keep factors with eigenvalue > 1
n_factors = int((eigenvalues > 1).sum())
```
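The same check works without factor_analyzer: for standardized data, the PCA eigenvalues are the eigenvalues of the correlation matrix. A sketch on synthetic data with two latent factors (the data-generating setup is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 6 observed variables driven by 2 latent factors
latent = rng.normal(size=(500, 2))
X = np.column_stack([latent[:, j // 3] + 0.3 * rng.normal(size=500)
                     for j in range(6)])

# Eigenvalues of the correlation matrix = PCA eigenvalues of standardized data
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

n_factors = int((eigenvalues > 1).sum())
print(n_factors)  # Kaiser criterion recovers the 2 latent factors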

### Option 2: Domain Knowledge

If you know how many categories your variables should group into, specify directly:
```python
# Example: health data with 3 expected categories (lifestyle, genetics, environment)
fa = FactorAnalyzer(n_factors=3, rotation='varimax', method='principal')
```

## Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Loadings all similar | Too few factors | Increase n_factors |
| Negative loadings | Inverse relationship | Normal, interpret direction |
| Low variance explained | Data not suitable for PCA | Check correlations first |
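The "check correlations first" advice from the table can be automated: if few variable pairs correlate strongly, there is little shared variance for PCA to compress. A quick screen (the 0.4 threshold is a common rule of thumb, not a hard cutoff):

```python
import numpy as np

def pca_suitability(X, threshold=0.4):
    """Fraction of variable pairs whose absolute correlation exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)
    iu = np.triu_indices_from(corr, k=1)  # upper triangle, excluding diagonal
    return float(np.mean(np.abs(corr[iu]) > threshold))

rng = np.random.default_rng(0)
correlated = rng.normal(size=(300, 1)) + 0.5 * rng.normal(size=(300, 4))
independent = rng.normal(size=(300, 4))

print(pca_suitability(correlated))   # high share of strongly correlated pairs
print(pca_suitability(independent))  # near zero: PCA would not help much here
```

A low score suggests reconsidering PCA; formal checks such as Bartlett's sphericity test or the KMO measure serve the same purpose.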

## Best Practices

- Always standardize data before PCA
- Use varimax rotation for interpretability
- Check factor loadings to name components
- Use Kaiser criterion or domain knowledge for n_factors
- For attribution analysis, run ONE global PCA on all variables

Overview

This skill performs PCA-based dimensionality reduction with varimax rotation to produce interpretable components and factor scores. It is designed for datasets with many correlated variables where you need to reveal underlying factors or reduce multicollinearity. The output includes factor loadings and component scores ready for downstream modeling or attribution analysis.

How this skill works

The skill standardizes input variables, fits a PCA/Factor Analysis model, and applies varimax rotation to maximize interpretability of component loadings. It returns loadings that link original variables to components and transformed factor scores for each observation. Users can choose the number of factors via eigenvalues (Kaiser criterion) or by specifying domain-driven expectations.

When to use it

  • You have many correlated predictors and want fewer uncorrelated components.
  • You need to identify latent factor groups from observed variables.
  • You want to reduce multicollinearity before regression or predictive modeling.
  • You need component scores for contribution or R² decomposition analysis.
  • You require interpretable components for exploratory data analysis.

Best practices

  • Standardize all variables before running PCA or factor analysis.
  • Run a single global PCA on the full variable set when doing attribution or contribution analysis — do not run separate PCAs by category.
  • Use varimax rotation to improve interpretability of loadings and simplify component structure.
  • Choose n_factors using eigenvalues (keep >1) plus domain knowledge; inspect scree and loadings to validate choices.
  • Interpret loadings by magnitude: >0.7 strong, 0.4–0.7 moderate, <0.4 weak. Name components based on dominant loadings.

Example use cases

  • Hydrology: combine meteorological and land-use variables to extract drivers for runoff and use factor scores in attribution.
  • Economics: reduce many macro indicators into 3–4 components representing growth, labor, and trade for regression controls.
  • Health analytics: group lifestyle, genetic, and environmental measures into interpretable factors for outcome modeling.
  • Feature engineering: replace correlated predictors with a few orthogonal factor scores to stabilize regression coefficients.
  • R² decomposition: compute factor scores from a global PCA and use them directly to apportion explained variance across predefined categories.

FAQ

Should I standardize variables before PCA?

Yes. Standardization ensures variables contribute equally regardless of scale; without it, high-variance variables dominate the components. Standardize before PCA or factor analysis unless all variables are already on a comparable scale.

Can I run PCA separately on each category for contribution analysis?

No. For attribution and consistent factor scores, run a single global PCA on all variables, then map loadings or aggregate scores to categories.