home / skills / pluginagentmarketplace / custom-plugin-sql / data-analyst

data-analyst skill

/skills/data-analyst

This skill helps you perform exploratory data analysis and advanced SQL analytics to derive business insights from complex datasets.

npx playbooks add skill pluginagentmarketplace/custom-plugin-sql --skill data-analyst

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
7.8 KB
---
name: data-analysis-sql
description: SQL for data analysis with exploratory analysis, advanced aggregations, statistical functions, outlier detection, and business insights. 50+ real-world analytics queries.
sasmp_version: "1.3.0"
bonded_agent: 05-data-analyst
bond_type: PRIMARY_BOND
---

# SQL for Data Analysis

## Exploratory Data Analysis (EDA)

### Data Profiling

```sql
-- Understand data structure and quality
SELECT COUNT(*) as record_count FROM employees;
SELECT COUNT(DISTINCT department) as unique_departments FROM employees;
SELECT COUNT(*) - COUNT(email) as missing_emails FROM employees;

-- Column value distribution
SELECT salary, COUNT(*) as frequency
FROM employees
GROUP BY salary
ORDER BY frequency DESC;

-- Missing data analysis
SELECT
  COUNT(*) as total_records,
  COUNT(phone) as non_null_phone,
  COUNT(*) - COUNT(phone) as missing_phone,
  ROUND(100.0 * (COUNT(*) - COUNT(phone)) / COUNT(*), 2) as missing_percentage
FROM employees;

-- Data type and range checks
SELECT
  MIN(salary) as min_salary,
  MAX(salary) as max_salary,
  ROUND(AVG(salary), 2) as avg_salary,
  ROUND(STDDEV(salary), 2) as salary_stddev
FROM employees;
```

### Distribution Analysis

```sql
-- Value frequency distribution
SELECT
  department,
  COUNT(*) as emp_count,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percentage
FROM employees
GROUP BY department
ORDER BY emp_count DESC;

-- Salary ranges and distribution
SELECT
  CASE
    WHEN salary < 50000 THEN 'Under 50K'
    WHEN salary < 75000 THEN '50K-75K'
    WHEN salary < 100000 THEN '75K-100K'
    ELSE '100K+'
  END as salary_range,
  COUNT(*) as emp_count,
  MIN(salary) as min_sal,
  MAX(salary) as max_sal,
  ROUND(AVG(salary), 2) as avg_sal
FROM employees
GROUP BY salary_range
ORDER BY MIN(salary);

-- Distribution visualization data
SELECT
  salary,
  COUNT(*) as frequency,
  ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as pct,
  RPAD('*', COUNT(*) / 10, '*') as bar_chart
FROM employees
GROUP BY salary
ORDER BY salary;
```

## Statistical Analysis

### Summary Statistics

```sql
-- Comprehensive statistics by group
SELECT
  department,
  COUNT(*) as count,
  ROUND(AVG(salary), 2) as mean_salary,
  ROUND(MIN(salary), 2) as min_salary,
  ROUND(MAX(salary), 2) as max_salary,
  ROUND(STDDEV(salary), 2) as stddev_salary,
  ROUND(AVG(ABS(salary - (SELECT AVG(salary) FROM employees WHERE department = e.department))), 2) as avg_deviation
FROM employees e
GROUP BY department
ORDER BY mean_salary DESC;

-- Percentile analysis
SELECT
  department,
  ROUND(PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary), 2) as q1,
  ROUND(PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY salary), 2) as median,
  ROUND(PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary), 2) as q3,
  ROUND(PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY salary), 2) as p95
FROM employees
GROUP BY department;
```

### Outlier Detection

```sql
-- Find outliers using standard deviation
SELECT
  emp_id,
  first_name,
  salary,
  ROUND(AVG(salary) OVER (), 2) as avg_salary,
  ROUND(STDDEV(salary) OVER (), 2) as stddev_salary,
  ROUND(ABS(salary - AVG(salary) OVER ()) / NULLIF(STDDEV(salary) OVER (), 0), 2) as z_score
FROM employees
HAVING ABS(salary - AVG(salary) OVER ()) / NULLIF(STDDEV(salary) OVER (), 0) > 3
ORDER BY z_score DESC;

-- IQR method for outliers
WITH salary_stats AS (
  SELECT
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY salary) as q1,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY salary) as q3
  FROM employees
)
SELECT
  emp_id,
  salary,
  CASE
    WHEN salary < (SELECT q1 FROM salary_stats) - 1.5 * ((SELECT q3 FROM salary_stats) - (SELECT q1 FROM salary_stats))
    OR salary > (SELECT q3 FROM salary_stats) + 1.5 * ((SELECT q3 FROM salary_stats) - (SELECT q1 FROM salary_stats))
    THEN 'Outlier'
    ELSE 'Normal'
  END as outlier_status
FROM employees;
```

## Comparative Analysis

### Period-over-Period Comparison

```sql
-- Year-over-year sales comparison
SELECT
  EXTRACT(QUARTER FROM order_date) as quarter,
  EXTRACT(YEAR FROM order_date) as year,
  ROUND(SUM(amount), 2) as total_sales,
  ROUND(LAG(SUM(amount)) OVER (ORDER BY EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date)), 2) as prev_period,
  ROUND(SUM(amount) - LAG(SUM(amount)) OVER (ORDER BY EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date)), 2) as yoy_change,
  ROUND(100.0 * (SUM(amount) - LAG(SUM(amount)) OVER (ORDER BY EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date))) / LAG(SUM(amount)) OVER (ORDER BY EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date)), 2) as yoy_pct_change
FROM orders
GROUP BY EXTRACT(YEAR FROM order_date), EXTRACT(QUARTER FROM order_date)
ORDER BY year, quarter;
```

### Cohort Analysis

```sql
-- User cohort analysis
WITH user_cohorts AS (
  SELECT
    DATE_TRUNC('month', first_order_date)::DATE as cohort_month,
    user_id,
    DATE_TRUNC('month', order_date)::DATE as order_month
  FROM users u
  LEFT JOIN orders o ON u.id = o.user_id
)
SELECT
  cohort_month,
  DATE_PART('month', order_month - cohort_month) / 1 as months_since_cohort,
  COUNT(DISTINCT user_id) as users,
  ROUND(100.0 * COUNT(DISTINCT user_id) /
    (SELECT COUNT(DISTINCT user_id) FROM user_cohorts WHERE order_month = cohort_month), 2) as retention_rate
FROM user_cohorts
WHERE order_month >= cohort_month
GROUP BY cohort_month, months_since_cohort
ORDER BY cohort_month, months_since_cohort;
```

## Correlation & Relationship Analysis

```sql
-- Correlation between variables
WITH salary_data AS (
  SELECT
    years_experience,
    salary,
    AVG(salary) OVER () as avg_salary,
    AVG(years_experience) OVER () as avg_experience,
    STDDEV(salary) OVER () as stddev_salary,
    STDDEV(years_experience) OVER () as stddev_experience
  FROM employees
)
SELECT
  ROUND(
    SUM((years_experience - avg_experience) * (salary - avg_salary)) /
    (COUNT(*) * stddev_salary * stddev_experience),
    4
  ) as correlation
FROM salary_data;

-- Segment analysis
SELECT
  CASE
    WHEN years_experience < 2 THEN 'Junior'
    WHEN years_experience < 5 THEN 'Mid-level'
    WHEN years_experience < 10 THEN 'Senior'
    ELSE 'Expert'
  END as experience_level,
  COUNT(*) as count,
  ROUND(AVG(salary), 2) as avg_salary,
  ROUND(AVG(performance_rating), 2) as avg_rating
FROM employees
GROUP BY experience_level
ORDER BY COUNT(*) DESC;
```

## Data Quality Validation

```sql
-- Check for invalid values
SELECT
  CASE
    WHEN salary < 0 THEN 'Negative salary'
    WHEN salary > 1000000 THEN 'Unusually high salary'
    WHEN email NOT LIKE '%@%' THEN 'Invalid email'
    WHEN hire_date > CURRENT_DATE THEN 'Future hire date'
    WHEN years_experience > 70 THEN 'Impossible experience'
    ELSE NULL
  END as data_quality_issue,
  COUNT(*) as count
FROM employees
WHERE salary < 0
  OR salary > 1000000
  OR email NOT LIKE '%@%'
  OR hire_date > CURRENT_DATE
  OR years_experience > 70
GROUP BY data_quality_issue;

-- Duplicate detection
SELECT
  email,
  COUNT(*) as occurrence_count,
  STRING_AGG(DISTINCT emp_id::text, ', ') as emp_ids
FROM employees
WHERE email IS NOT NULL
GROUP BY email
HAVING COUNT(*) > 1
ORDER BY occurrence_count DESC;
```

## Trend Analysis

```sql
-- Moving average
SELECT
  order_date,
  amount,
  ROUND(AVG(amount) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ), 2) as moving_avg_7day,
  ROUND(AVG(amount) OVER (
    ORDER BY order_date
    ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  ), 2) as moving_avg_30day
FROM daily_orders
ORDER BY order_date;

-- Growth rate
SELECT
  DATE_TRUNC('month', order_date)::DATE as month,
  ROUND(SUM(amount), 2) as monthly_revenue,
  ROUND((SUM(amount) - LAG(SUM(amount)) OVER (ORDER BY DATE_TRUNC('month', order_date))) /
    LAG(SUM(amount)) OVER (ORDER BY DATE_TRUNC('month', order_date)) * 100, 2) as growth_rate_pct
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;
```

## Next Steps

Learn advanced SQL concepts and optimization techniques in the `advanced-sql` skill.

Overview

This skill teaches SQL for data analysis focused on exploratory analysis, advanced aggregations, statistical functions, outlier detection, and actionable business insights. It includes 50+ real-world analytics queries and patterns to help you profile data, validate quality, run cohort and trend analyses, and extract insights for decisions.

How this skill works

The skill provides ready-to-run SQL patterns that inspect table structure, value distributions, missingness, and statistical summaries. It implements window functions, percentiles, IQR and z-score outlier detection, cohort retention, period-over-period comparisons, correlation estimators, and moving averages to reveal trends and anomalies. Examples are written to be adapted across common SQL dialects and production workflows.

When to use it

  • Perform initial exploratory data analysis on a new dataset to assess quality and distributions
  • Detect and investigate outliers before feeding data to models or reports
  • Compare sales or KPIs across periods and compute growth and percent changes
  • Build cohort retention and user behavior analyses for product or marketing teams
  • Validate data quality rules and find duplicates or impossible values

Best practices

  • Start with data profiling queries to understand missingness and value ranges before deep analysis
  • Use window functions and grouped percentiles for robust summary stats by segment
  • Apply both IQR and z-score methods to cross-check outlier detection results
  • Materialize expensive intermediate results (CTEs or temp tables) when reusing stats across queries
  • Adapt date-truncation and percentile functions to your SQL dialect for consistent results

Example use cases

  • Generate department-level salary summaries and detect compensation anomalies
  • Compute 7- and 30-day moving averages for daily revenue monitoring
  • Run cohort analysis to measure monthly user retention after acquisition campaigns
  • Identify invalid records such as future hire dates or malformed emails for data cleaning
  • Produce period-over-period sales reports with absolute and percentage change

FAQ

Can these queries run on any SQL database?

Most patterns are standard SQL but functions like PERCENTILE_CONT, STDDEV, or date truncation vary by dialect; adjust function names or use equivalent windowed expressions for compatibility.

How should I handle large tables when computing percentiles or windows?

Use sampled calculations for exploratory work, create pre-aggregated summaries, or materialize intermediate results; leverage database-specific analytics extensions if available.