
This skill helps you design experiments, build predictive models, and perform causal analysis with Python-based data science tools.

npx playbooks add skill eyadsibai/ltk --skill data-science

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
---
name: data-science
description: Use when "statistical modeling", "A/B testing", "experiment design", "causal inference", "predictive modeling", or asking about "hypothesis testing", "feature engineering", "data analysis", "pandas", "scikit-learn"
version: 1.0.0
---

<!-- Adapted from: claude-skills/engineering-team/senior-data-scientist -->

# Data Science Guide

Statistical modeling, experimentation, and advanced analytics.

## When to Use

- Designing A/B tests and experiments
- Building predictive models
- Performing causal analysis
- Feature engineering
- Statistical hypothesis testing

## Tech Stack

| Category | Tools |
|----------|-------|
| Languages | Python, SQL, R |
| Analysis | NumPy, Pandas, SciPy |
| ML | Scikit-learn, XGBoost, LightGBM |
| Visualization | Matplotlib, Seaborn, Plotly |
| Statistics | Statsmodels, PyMC |
| Notebooks | Jupyter, VS Code |

## Experiment Design

### A/B Test Framework

```python
import scipy.stats as stats
import numpy as np
from statsmodels.stats.power import TTestIndPower  # power analysis lives in statsmodels, not scipy

def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    """Calculate required sample size per group for an A/B test."""
    effect_size = mde / np.sqrt(baseline_rate * (1 - baseline_rate))
    analysis = TTestIndPower()
    return int(analysis.solve_power(
        effect_size=effect_size,
        alpha=alpha,
        power=power,
        alternative='two-sided'
    ))

# Example: 5% baseline, 10% relative lift
n = calculate_sample_size(0.05, 0.005)
print(f"Required sample size per group: {n}")
```
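
The calculation above treats the conversion rate like a continuous outcome; for binary metrics a proportions-specific route is available. A sketch using statsmodels' `NormalIndPower` with `proportion_effectsize` (the 0.05 → 0.055 rates mirror the example above):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Cohen's h between the 5.5% target rate and the 5% baseline
effect_size = proportion_effectsize(0.055, 0.05)
n_alt = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.8,
    alternative='two-sided'
)
print(f"Required sample size per group: {int(np.ceil(n_alt))}")
```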

### Statistical Significance

```python
def analyze_ab_test(control, treatment):
    """Analyze A/B test results (control and treatment are arrays of 0/1 outcomes)."""
    # Two-proportion z-test
    n1, n2 = len(control), len(treatment)
    p1, p2 = control.mean(), treatment.mean()
    p_pool = (control.sum() + treatment.sum()) / (n1 + n2)

    se = np.sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
    z = (p2 - p1) / se
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))

    return {
        'control_rate': p1,
        'treatment_rate': p2,
        'lift': (p2 - p1) / p1,
        'p_value': p_value,
        'significant': p_value < 0.05
    }
```
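
The function above reports only a p-value; since the workflow below also calls for confidence intervals, here is a minimal sketch of a Wald-style interval for the absolute lift, plus a usage example on simulated data (the helper name and simulated rates are illustrative):

```python
def lift_confidence_interval(control, treatment, alpha=0.05):
    """Approximate (1 - alpha) CI for the absolute difference in conversion rates."""
    n1, n2 = len(control), len(treatment)
    p1, p2 = control.mean(), treatment.mean()
    # Unpooled standard error of the difference in proportions
    se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z_crit = stats.norm.ppf(1 - alpha / 2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se

# Simulated binary outcomes for illustration
rng = np.random.default_rng(42)
control = rng.binomial(1, 0.05, size=20_000)
treatment = rng.binomial(1, 0.055, size=20_000)
print(analyze_ab_test(control, treatment))
print(lift_confidence_interval(control, treatment))
```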

## Feature Engineering

### Common Patterns

```python
import pandas as pd

def engineer_features(df):
    """Feature engineering pipeline."""
    # Temporal features
    df['hour'] = df['timestamp'].dt.hour
    df['day_of_week'] = df['timestamp'].dt.dayofweek
    df['is_weekend'] = df['day_of_week'].isin([5, 6])

    # Aggregations
    df['user_avg_spend'] = df.groupby('user_id')['amount'].transform('mean')
    df['user_transaction_count'] = df.groupby('user_id')['amount'].transform('count')

    # Ratios
    df['spend_vs_avg'] = df['amount'] / df['user_avg_spend']

    return df
```
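
Scaling and encoding are best fitted inside a pipeline rather than applied to the raw feature frame, so transforms learned on training folds are reapplied consistently at prediction time. A minimal sketch (the column names are assumptions based on the frame built above):

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column groups from the engineered frame
numeric_cols = ['amount', 'user_avg_spend', 'spend_vs_avg']
categorical_cols = ['day_of_week']

preprocess = ColumnTransformer([
    ('scale', StandardScaler(), numeric_cols),
    ('encode', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])
```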

### Feature Selection

```python
from sklearn.feature_selection import mutual_info_classif

def select_features(X, y, k=10):
    """Select top k features by mutual information (X is a feature DataFrame)."""
    mi_scores = mutual_info_classif(X, y, random_state=42)
    top_k = np.argsort(mi_scores)[-k:]
    return X.columns[top_k].tolist()
```

## Model Evaluation

### Cross-Validation

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

def evaluate_model(model, X, y):
    """Robust model evaluation."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

    scores = {
        'accuracy': cross_val_score(model, X, y, cv=cv, scoring='accuracy'),
        'precision': cross_val_score(model, X, y, cv=cv, scoring='precision'),
        'recall': cross_val_score(model, X, y, cv=cv, scoring='recall'),
        'auc': cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    }

    return {k: f"{v.mean():.3f} (+/- {v.std()*2:.3f})" for k, v in scores.items()}
```
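
A usage sketch with synthetic data standing in for the engineered feature frame (the dataset and model choice are illustrative):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary-classification data for illustration
X_arr, y = make_classification(n_samples=2_000, n_features=10, random_state=42)
X = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(10)])

rf = RandomForestClassifier(n_estimators=200, random_state=42)
print(evaluate_model(rf, X, y))
```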

## Causal Inference

### Propensity Score Matching

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_matching(df, treatment_col, features):
    """Match treatment and control using propensity scores."""
    # Estimate propensity scores
    ps_model = LogisticRegression()
    ps_model.fit(df[features], df[treatment_col])
    df['propensity'] = ps_model.predict_proba(df[features])[:, 1]

    # 1:1 nearest-neighbor matching on propensity score (with replacement)
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]

    nn = NearestNeighbors(n_neighbors=1)
    nn.fit(control[['propensity']])
    distances, indices = nn.kneighbors(treated[['propensity']])

    return treated, control.iloc[indices.flatten()]
```
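
Once matched, the pairs can be turned into an effect estimate. A hedged sketch (the `outcome_col` argument and helper name are illustrative), with a quick covariate-balance check since matching quality should always be verified:

```python
def matched_att(df, treatment_col, outcome_col, features):
    """ATT estimate: mean outcome gap between treated units and their matched controls."""
    treated, matched_control = propensity_matching(df, treatment_col, features)
    att = treated[outcome_col].mean() - matched_control[outcome_col].mean()
    # Balance check: matched groups should have similar covariate means
    balance = treated[features].mean() - matched_control[features].mean()
    return att, balance
```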

## Best Practices

### Analysis Workflow

1. Define hypothesis clearly
2. Calculate required sample size
3. Design experiment (randomization)
4. Collect data with quality checks
5. Analyze with appropriate tests
6. Report with confidence intervals

### Common Pitfalls

- Multiple comparisons without correction (see the correction sketch after this list)
- Peeking at results before sample size reached
- Simpson's paradox in aggregations
- Survivorship bias in cohort analysis
- Correlation vs causation confusion
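
For the first pitfall, a minimal sketch of adjusting a family of p-values with statsmodels (the p-values are made up for illustration):

```python
from statsmodels.stats.multitest import multipletests

# p-values from several metrics or segments tested in the same experiment
p_values = [0.012, 0.034, 0.049, 0.21, 0.003]
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
for raw, adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw={raw:.3f}  adjusted={adj:.3f}  significant={sig}")
```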

Overview

This skill provides practical guidance and reusable patterns for statistical modeling, experiment design, causal inference, and predictive modeling in Python. It focuses on clear hypothesis definition, robust experiment setup, feature engineering, model evaluation, and common pitfalls to avoid. It is tailored for analysts and data scientists working with pandas, scikit-learn, and standard scientific Python tools.

How this skill works

The skill inspects the problem type (A/B testing, causal analysis, predictive modeling) and recommends concrete steps: power/sample-size calculations, test selection, feature engineering patterns, and evaluation pipelines. It includes code patterns for sample-size calculation, two-proportion testing, feature creation and selection, cross-validation evaluation, and propensity score matching for causal work. It emphasizes data quality checks, randomization, and reporting with confidence intervals.

When to use it

  • Designing or sizing A/B tests and online experiments
  • Analyzing randomized experiments and running significance tests
  • Building and validating predictive models with scikit-learn or gradient boosters
  • Performing causal inference (propensity scores, matching) to estimate treatment effects
  • Engineering and selecting features from transactional or time-stamped data

Best practices

  • Start every analysis with a clear, falsifiable hypothesis and the primary metric defined
  • Calculate required sample size and pre-specify stopping rules to avoid peeking bias
  • Use stratified cross-validation and multiple metrics (AUC, precision, recall) for robust evaluation
  • Apply feature engineering patterns: temporal features, user-level aggregations, and ratio features; scale/encode before modeling
  • Correct for multiple comparisons when running many tests and report confidence intervals, not just p-values

Example use cases

  • Estimate sample size and analyze lift for a website conversion test
  • Create user-level features from transaction logs for a churn prediction model
  • Run propensity score matching to estimate causal effect of an intervention in observational data
  • Select top predictive features using mutual information and validate with stratified CV
  • Compute and report significance and confidence intervals for a two-proportion A/B test

FAQ

How do I choose between t-test and two-proportion z-test?

Use a two-proportion z-test for binary outcomes such as conversion rates, and a t-test for continuous outcomes. Check sample sizes and assumptions (approximate normality, variance equality for the pooled t-test), and fall back to a nonparametric test if the assumptions do not hold.
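
A sketch of the three options (the counts and values are illustrative):

```python
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

# Binary outcome (conversions): two-proportion z-test
successes = [120, 150]        # conversions in control, treatment
nobs = [2400, 2450]           # visitors in control, treatment
z_stat, p_binary = proportions_ztest(successes, nobs)

# Continuous outcome (e.g. revenue per user): Welch's t-test
control_rev = [12.1, 0.0, 3.4, 7.8]
treatment_rev = [14.2, 1.1, 0.0, 9.5]
t_stat, p_cont = stats.ttest_ind(treatment_rev, control_rev, equal_var=False)

# Nonparametric fallback when normality is doubtful
u_stat, p_np = stats.mannwhitneyu(treatment_rev, control_rev, alternative='two-sided')
```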

When should I use propensity score matching instead of regression adjustment?

Use propensity matching when you want explicit covariate balance and transparent matched comparisons; use regression adjustment or doubly robust methods when you can model the outcome reliably and need more statistical efficiency with larger samples.
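
A minimal regression-adjustment sketch with statsmodels (the outcome, treatment, and covariate column names are hypothetical):

```python
import statsmodels.formula.api as smf

# Outcome regressed on a treatment indicator plus confounders
model = smf.ols('outcome ~ treated + age + prior_spend + tenure', data=df).fit()
effect = model.params['treated']            # adjusted treatment effect estimate
conf_int = model.conf_int().loc['treated']  # 95% confidence interval
```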