home / skills / htlin222 / dotfiles / data-science

data-science skill

safe

npx playbooks add skill htlin222/dotfiles --skill data-science

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

3.0 KB

---
name: data-science
description: Data analysis, SQL queries, BigQuery operations, and data insights. Use for data analysis tasks and queries.
---

# Data Science

Data analysis, SQL, and insights generation.

## When to Use

- Writing SQL queries
- Data analysis and exploration
- Creating visualizations
- Statistical analysis
- ETL and data pipelines

## SQL Patterns

### Common Queries

```sql
-- Aggregation with window functions
SELECT
    user_id,
    order_date,
    amount,
    SUM(amount) OVER (PARTITION BY user_id ORDER BY order_date) as running_total,
    ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_date DESC) as recency_rank
FROM orders;

-- CTEs for readability
WITH monthly_stats AS (
    SELECT
        DATE_TRUNC('month', created_at) as month,
        COUNT(*) as total_orders,
        SUM(amount) as revenue
    FROM orders
    GROUP BY 1
),
growth AS (
    SELECT
        month,
        revenue,
        LAG(revenue) OVER (ORDER BY month) as prev_revenue,
        (revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0) as growth_rate
    FROM monthly_stats
)
SELECT * FROM growth;
```

### BigQuery Specifics

```sql
-- Partitioned table query
SELECT *
FROM `project.dataset.events`
WHERE DATE(_PARTITIONTIME) BETWEEN '2024-01-01' AND '2024-01-31';

-- UNNEST for arrays
SELECT
    user_id,
    item
FROM `project.dataset.orders`,
UNNEST(items) as item;

-- Approximate counts for large data
SELECT APPROX_COUNT_DISTINCT(user_id) as unique_users
FROM `project.dataset.events`;
```

## Python Analysis

```python
import pandas as pd
import numpy as np

# Load and explore
df = pd.read_csv('data.csv')
df.info()
df.describe()

# Clean and transform
df['date'] = pd.to_datetime(df['date'])
df = df.dropna(subset=['required_field'])
df['category'] = df['category'].fillna('Unknown')

# Aggregate
summary = df.groupby('category').agg({
    'value': ['mean', 'sum', 'count'],
    'date': ['min', 'max']
}).round(2)

# Visualize
import matplotlib.pyplot as plt
df.groupby('date')['value'].sum().plot(figsize=(12, 6))
plt.title('Daily Values')
plt.savefig('chart.png', dpi=150, bbox_inches='tight')
```

## Statistical Analysis

```python
from scipy import stats

# Hypothesis testing
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Correlation
correlation = df['x'].corr(df['y'])

# Regression
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(X, y)
print(f"R² = {model.score(X, y):.3f}")
```

## Output Format

```markdown
## Analysis Summary

**Question:** [What we're trying to answer]
**Data Source:** [Tables/files used]
**Date Range:** [Time period]

### Key Findings

1. [Finding with supporting metric]
2. [Finding with supporting metric]

### Visualization

[Chart description or embedded image]

### Recommendations

- [Actionable insight]
```

## Examples

**Input:** "Analyze user retention"
**Action:** Query cohort data, calculate retention rates, visualize trends

**Input:** "Find top customers"
**Action:** Write SQL for RFM analysis, segment users, summarize findings