This skill quantifies each factor's contribution to outcome variance using R² decomposition, enabling clear prioritization of drivers.

npx playbooks add skill benchflow-ai/skillsbench --skill contribution-analysis

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (2.9 KB)
---
name: contribution-analysis
description: Calculate the relative contribution of different factors to a response variable using R² decomposition. Use when you need to quantify how much each factor explains the variance of an outcome.
license: MIT
---

# Contribution Analysis Guide

## Overview

Contribution analysis quantifies how much each factor contributes to explaining the variance of a response variable. This skill focuses on the R² decomposition method.

## Complete Workflow

When you have multiple correlated variables that belong to different categories:
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from factor_analyzer import FactorAnalyzer

# Step 1: Combine ALL variables into one matrix
# (df is assumed to be a pandas DataFrame holding the predictors and the response)
pca_vars = ['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Var6', 'Var7', 'Var8']
X = df[pca_vars].values
y = df['ResponseVariable'].values

# Step 2: Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Fit ONE global factor analysis (varimax-rotated) on all variables together
fa = FactorAnalyzer(n_factors=4, rotation='varimax')
fa.fit(X_scaled)
scores = fa.transform(X_scaled)

# Step 4: R² decomposition on factor scores
def calc_r2(X, y):
    model = LinearRegression()
    model.fit(X, y)
    y_pred = model.predict(X)
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - (ss_res / ss_tot)

full_r2 = calc_r2(scores, y)

# Step 5: Calculate contribution of each factor
contrib_0 = full_r2 - calc_r2(scores[:, [1, 2, 3]], y)
contrib_1 = full_r2 - calc_r2(scores[:, [0, 2, 3]], y)
contrib_2 = full_r2 - calc_r2(scores[:, [0, 1, 3]], y)
contrib_3 = full_r2 - calc_r2(scores[:, [0, 1, 2]], y)
```

## R² Decomposition Method

The contribution of each factor is calculated by comparing the full model R² with the R² when that factor is removed:
```
Contribution_i = R²_full - R²_without_i
```
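
The hardcoded indices in the workflow above generalize to any number of factors. A minimal sketch of that generalization, reusing the `calc_r2` helper and the `scores` matrix defined earlier:

```python
# Drop-one-factor contributions for every factor column, assuming `scores`,
# `y`, and `calc_r2` from the workflow above.
def factor_contributions(scores, y):
    n_factors = scores.shape[1]
    full_r2 = calc_r2(scores, y)
    contribs = []
    for i in range(n_factors):
        keep = [j for j in range(n_factors) if j != i]
        contribs.append(full_r2 - calc_r2(scores[:, keep], y))
    return contribs
```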

## Output Format
```python
contributions = {
    'Category1': contrib_0 * 100,
    'Category2': contrib_1 * 100,
    'Category3': contrib_2 * 100,
    'Category4': contrib_3 * 100
}

dominant = max(contributions, key=contributions.get)
dominant_pct = round(contributions[dominant])

# Write only the dominant driver and its rounded percentage contribution
with open('output.csv', 'w') as f:
    f.write('variable,contribution\n')
    f.write(f'{dominant},{dominant_pct}\n')
```

## Common Issues

| Issue | Cause | Solution |
|-------|-------|----------|
| Negative contribution | Suppressor effect | Check for multicollinearity (see the sketch below) |
| Contributions don't sum to R² | Normal behavior | R² decomposition is approximate |
| Very small contributions | Factor not important | May be negligible driver |
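
If a contribution comes out negative, one quick diagnostic is to look for strongly correlated predictor pairs before the factor step. A minimal sketch, assuming `X_scaled` and `pca_vars` from the workflow above (the 0.8 threshold is only illustrative):

```python
# Quick multicollinearity check on the standardized predictors.
import numpy as np
import pandas as pd

corr = pd.DataFrame(np.corrcoef(X_scaled, rowvar=False),
                    index=pca_vars, columns=pca_vars)
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strong_pairs = upper.stack()
print(strong_pairs[strong_pairs.abs() > 0.8])  # variable pairs with |r| > 0.8
```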

## Best Practices

- Run ONE global factor analysis (or PCA) on all variables together, not a separate model per category
- Use factor_analyzer with varimax rotation
- Map factors to category names based on loadings interpretation (see the sketch after this list)
- Report contribution as percentage
- Identify the dominant (largest) factor
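
Mapping factors to categories is a manual interpretation step based on the loading matrix. A minimal sketch, assuming the `fa` model and `pca_vars` list from the workflow above (the category labels are illustrative placeholders):

```python
# Inspect loadings to see which variables drive each factor, then label factors.
import pandas as pd

loadings = pd.DataFrame(
    fa.loadings_,
    index=pca_vars,
    columns=[f'Factor{i}' for i in range(fa.loadings_.shape[1])],
)
print(loadings.round(2))  # which variables load most strongly on each factor

# Name each factor after the category of its highest-loading variables (manual step)
factor_labels = {0: 'Category1', 1: 'Category2', 2: 'Category3', 3: 'Category4'}
```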

Overview

This skill calculates the relative contribution of different factors to a response variable using R² decomposition on factor scores. It combines all candidate variables into a single factor analysis, then quantifies how much each derived factor explains outcome variance. The result is a percentage contribution per factor and identification of the dominant driver.

How this skill works

All variables are standardized and a single global factor/PCA model is estimated to produce orthogonal factor scores. A linear regression of the outcome on all factor scores produces a full-model R². Each factor's contribution is computed as the drop in R² when that factor's score is omitted (R²_full - R²_without_i). Contributions are reported as percentages and the largest value is flagged as dominant.

When to use it

  • You need to quantify how groups of correlated variables explain variance in an outcome.
  • You want a simple, interpretable decomposition of explained variance by latent factors.
  • Variables are numerous and naturally group into categories but are correlated across groups.
  • You need a reproducible method to report the dominant driver of an outcome.
  • You prefer factor-based summaries (PCA/FA) rather than single-variable importance.

Best practices

  • Run one global PCA or factor analysis on all variables together, not separate per category.
  • Standardize predictors before factor analysis so loadings are comparable.
  • Use an orthogonal rotation (e.g., varimax) to aid interpretation of factor loadings.
  • Map factors to category names based on loadings, then report contributions as percentages.
  • Check for multicollinearity and suppressor effects if any contribution is negative.

Example use cases

  • Assess which behavioral, demographic, or environmental latent factor most explains customer churn.
  • Decompose drivers of test scores when many correlated cognitive and socio-economic variables exist.
  • Compare how product features, pricing, and marketing latent factors contribute to revenue variance.
  • Summarize contributions of physiological, lifestyle, and genetic factors to a health outcome.

FAQ

What if contributions don't sum to the full R²?

R² decomposition by subtraction is approximate; contributions need not sum exactly to R² due to overlap and model geometry.

Why could a contribution be negative?

Negative values can arise from suppressor effects or multicollinearity; inspect loadings and correlations and consider re-specifying factors.