
outlier-detective skill

/outlier-detective

This skill detects anomalies in numeric data using statistical and ML methods, enabling data cleaning, fraud detection, and quality control analyses.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill outlier-detective

---
name: outlier-detective
description: Detect anomalies and outliers in datasets using statistical and ML methods. Use for data cleaning, fraud detection, or quality control analysis.
---

# Outlier Detective

Detect anomalies and outliers in numeric data using multiple methods.

## Features

- **Statistical Methods**: Z-score, IQR, Modified Z-score
- **ML Methods**: Isolation Forest, LOF, DBSCAN
- **Visualization**: Box plots, scatter plots
- **Multi-Column**: Analyze multiple variables
- **Reports**: Detailed outlier reports
- **Flexible Thresholds**: Configurable sensitivity

## Quick Start

```python
from outlier_detective import OutlierDetective

detective = OutlierDetective()
detective.load_csv("sales_data.csv")

# Detect outliers in a column
outliers = detective.detect("revenue", method="iqr")
print(f"Found {len(outliers)} outliers")

# Get full report
report = detective.analyze("revenue")
print(report)
```

## CLI Usage

```bash
# Detect outliers using IQR method
python outlier_detective.py --input data.csv --column sales --method iqr

# Use Z-score with custom threshold
python outlier_detective.py --input data.csv --column price --method zscore --threshold 3

# Analyze all numeric columns
python outlier_detective.py --input data.csv --all

# Generate visualization
python outlier_detective.py --input data.csv --column revenue --plot boxplot.png

# Export outliers to CSV
python outlier_detective.py --input data.csv --column value --output outliers.csv

# Use Isolation Forest (ML)
python outlier_detective.py --input data.csv --method isolation_forest
```

## API Reference

### OutlierDetective Class

```python
class OutlierDetective:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'OutlierDetective'
    def load_dataframe(self, df: pd.DataFrame) -> 'OutlierDetective'

    # Detection (single column)
    def detect(self, column: str, method: str = "iqr", **kwargs) -> pd.DataFrame
    def analyze(self, column: str) -> dict

    # Detection (multi-column)
    def detect_multivariate(self, columns: list = None, method: str = "isolation_forest", **kwargs) -> pd.DataFrame
    def analyze_all(self) -> dict

    # Visualization
    def plot_boxplot(self, column: str, output: str) -> str
    def plot_scatter(self, col1: str, col2: str, output: str) -> str
    def plot_distribution(self, column: str, output: str) -> str

    # Export
    def get_outliers(self, column: str, method: str = "iqr") -> pd.DataFrame
    def get_clean_data(self, column: str, method: str = "iqr") -> pd.DataFrame
```

## Detection Methods

### Statistical Methods

#### IQR (Interquartile Range)
- Default method; robust to extreme values because quartiles are barely affected by the tails
- Outliers: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
- Multiplier configurable (default: 1.5)

```python
outliers = detective.detect("price", method="iqr", multiplier=1.5)
```
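
The fences described above can be computed directly with NumPy. This is a minimal sketch of the rule itself, not the skill's actual implementation: `iqr_bounds` and `iqr_outlier_mask` are hypothetical helper names, the data is made up, and the quartiles use NumPy's default linear interpolation, which may differ slightly from what the skill uses.

```python
import numpy as np


def iqr_bounds(values, multiplier=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])  # default linear interpolation
    iqr = q3 - q1
    return float(q1 - multiplier * iqr), float(q3 + multiplier * iqr)


def iqr_outlier_mask(values, multiplier=1.5):
    """Boolean mask that is True where a value falls outside the fences."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_bounds(values, multiplier)
    return (values < lower) | (values > upper)


# Illustrative data (made up): seven ordinary prices and one extreme one.
prices = [10, 12, 11, 13, 12, 11, 10, 95]
lower, upper = iqr_bounds(prices)
mask = iqr_outlier_mask(prices)
print(f"fences: [{lower}, {upper}]")
print("flagged:", [p for p, m in zip(prices, mask) if m])
```

For this sample, Q1 = 10.75 and Q3 = 12.25, so the fences are 8.5 and 14.5 and only 95 is flagged. A larger multiplier widens the fences and flags fewer rows.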

#### Z-Score
- Based on standard deviations from mean
- Assumes roughly normal data; the mean and standard deviation are themselves inflated by extreme values
- Threshold configurable (default: 3)

```python
outliers = detective.detect("price", method="zscore", threshold=3)
```
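
As a rough illustration of the Z-score rule, the sketch below flags values whose distance from the mean exceeds `threshold` standard deviations. `zscore_outlier_mask` is a hypothetical helper, the readings are made up, and the use of the population standard deviation (`ddof=0`) is an assumption; the skill may compute it differently.

```python
import numpy as np


def zscore_outlier_mask(values, threshold=3.0):
    """Boolean mask that is True where |z| exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()  # ASSUMPTION: population std (ddof=0)
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return np.zeros(values.shape, dtype=bool)
    z = (values - values.mean()) / std
    return np.abs(z) > threshold


# Illustrative data (made up): twenty identical readings and one spike.
readings = [10.0] * 20 + [100.0]
mask = zscore_outlier_mask(readings)
print("flagged:", [r for r, m in zip(readings, mask) if m])
```

One caveat worth knowing: with the population standard deviation, no value in a sample of n rows can have |z| above sqrt(n - 1), so a threshold of 3 cannot flag anything in a column with 10 or fewer rows.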

#### Modified Z-Score
- Uses median instead of mean
- More robust to existing outliers
- Based on MAD (Median Absolute Deviation)

```python
outliers = detective.detect("price", method="modified_zscore", threshold=3.5)
```
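
The sketch below shows one common formulation of the modified Z-score, 0.6745 × (x − median) / MAD, where 0.6745 is the 0.75 quantile of the standard normal distribution. `modified_zscore` is a hypothetical helper and the data is made up; whether the skill uses exactly this constant, or how it handles a zero MAD, is an assumption.

```python
import numpy as np


def modified_zscore(values):
    """Return 0.6745 * (x - median) / MAD for each value."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # At least half the values equal the median, so the score is
        # undefined. ASSUMPTION: return zeros, which flags nothing.
        return np.zeros(values.shape)
    return 0.6745 * (values - median) / mad


# Illustrative data (made up): seven ordinary prices and one extreme one.
prices = np.array([10, 12, 11, 13, 12, 11, 10, 95], dtype=float)
scores = modified_zscore(prices)
mask = np.abs(scores) > 3.5

# Plain Z-scores on the same data, for comparison.
plain_z = (prices - prices.mean()) / prices.std()

print("modified score of 95:", round(float(scores[-1]), 2))
print("plain z-score of 95:", round(float(plain_z[-1]), 2))
```

On this sample the plain Z-score of 95 is about 2.64, below the usual threshold of 3, because the extreme value inflates both the mean and the standard deviation. The modified score is about 56.3, so the median-based method flags it.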

### ML Methods

#### Isolation Forest
- Ensemble method, good for high-dimensional data
- Contamination parameter sets expected outlier fraction

```python
outliers = detective.detect_multivariate(
    method="isolation_forest",
    contamination=0.1
)
```

#### Local Outlier Factor (LOF)
- Density-based method
- Compares local density to neighbors

```python
outliers = detective.detect_multivariate(
    method="lof",
    n_neighbors=20
)
```

## Output Format

### detect() Result
```text
# Returns DataFrame of outlier rows with additional columns:
#   - outlier_score: How extreme the value is
#   - outlier_reason: Description of why it's an outlier

   index  value  outlier_score  outlier_reason
0     15   5000          4.2    Above Q3 + 1.5×IQR
1     42  -1000         -3.8    Below Q1 - 1.5×IQR
```
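
To make the layout above concrete, the following sketch builds a similar table with pandas for the IQR method. It is not the skill's code: `iqr_outlier_report` is a hypothetical helper, the data is made up, and the score definition (signed distance from the median in IQR units) is an assumption, since this document does not say how `outlier_score` is computed.

```python
import pandas as pd


def iqr_outlier_report(df, column, multiplier=1.5):
    """Return the rows flagged by the IQR rule, with a score and a reason."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - multiplier * iqr, q3 + multiplier * iqr

    above = df[column] > upper
    below = df[column] < lower
    flagged = df[above | below].copy()

    # ASSUMPTION: score = signed distance from the median in IQR units.
    flagged["outlier_score"] = (flagged[column] - df[column].median()) / iqr
    flagged["outlier_reason"] = above.loc[flagged.index].map(
        {True: f"Above Q3 + {multiplier}×IQR", False: f"Below Q1 - {multiplier}×IQR"}
    )
    # Keep the original row labels in an "index" column, as in the table above.
    return flagged.reset_index()


# Illustrative data (made up).
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 10, 95, -40]})
report = iqr_outlier_report(df, "value")
print(report)
```

Here the fences are 7 and 15, so rows 7 and 8 are returned, one flagged as above the upper fence and one as below the lower fence.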

### analyze() Result
```python
{
    "column": "revenue",
    "total_rows": 1000,
    "outlier_count": 23,
    "outlier_percent": 2.3,
    "methods": {
        "iqr": {"count": 23, "indices": [...]},
        "zscore": {"count": 18, "indices": [...]},
        "modified_zscore": {"count": 20, "indices": [...]}
    },
    "stats": {
        "mean": 5432.10,
        "median": 4890.00,
        "std": 1234.56,
        "min": -1000.00,
        "max": 15000.00,
        "q1": 3500.00,
        "q3": 6200.00,
        "iqr": 2700.00
    },
    "bounds": {
        "lower": -550.00,
        "upper": 10250.00
    }
}
```

## Example Workflows

### Data Cleaning Pipeline
```python
detective = OutlierDetective()
detective.load_csv("raw_data.csv")

# Analyze and summarize
report = detective.analyze("price")
print(f"Found {report['outlier_count']} outliers ({report['outlier_percent']:.1f}%)")

# Get clean data
clean_data = detective.get_clean_data("price", method="iqr")
clean_data.to_csv("clean_data.csv", index=False)  # omit the row index so the file reloads with the original columns
```

### Fraud Detection
```python
detective = OutlierDetective()
detective.load_csv("transactions.csv")

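# Use multiple methods for consensus.
# Compare original row labels via the "index" column shown under Output Format;
# in that layout the result's own index restarts at 0 for every method.
iqr_outliers = set(detective.detect("amount", method="iqr")["index"])
zscore_outliers = set(detective.detect("amount", method="zscore")["index"])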

# Transactions flagged by both methods
high_confidence = iqr_outliers & zscore_outliers
print(f"High-confidence anomalies: {len(high_confidence)}")
```

### Multi-Variable Analysis
```python
detective = OutlierDetective()
detective.load_csv("sensors.csv")

# Detect multivariate outliers
outliers = detective.detect_multivariate(
    columns=["temp", "pressure", "humidity"],
    method="isolation_forest",
    contamination=0.05
)
print(f"Anomalous readings: {len(outliers)}")
```

## Visualization Examples

```python
# Box plot with outliers highlighted
detective.plot_boxplot("revenue", "revenue_boxplot.png")

# Distribution with bounds
detective.plot_distribution("price", "price_dist.png")

# Scatter plot (2D outliers)
detective.plot_scatter("feature1", "feature2", "scatter.png")
```

## Dependencies

- pandas>=2.0.0
- numpy>=1.24.0
- scipy>=1.10.0
- scikit-learn>=1.3.0
- matplotlib>=3.7.0

Overview

This skill detects anomalies and outliers in numeric datasets using a mix of statistical rules and machine learning models. It provides single-column and multivariate detection, visualizations, and exportable reports to support data cleaning, fraud detection, and quality control. Sensitivity and algorithms are configurable for practical workflows.

How this skill works

The skill inspects numeric columns and returns rows flagged as outliers along with an outlier score and human-readable reason. Built-in statistical methods include IQR, Z-score, and Modified Z-score; ML methods include Isolation Forest, Local Outlier Factor, and DBSCAN for high-dimensional and density-based detection. It can analyze one column, run multivariate detection across multiple features, generate summary statistics and bounds, and produce boxplots, scatter plots, and distribution charts.

When to use it

  • Remove extreme values from training data before fitting a machine learning model
  • Flag suspicious transactions or sensor readings for fraud and QA
  • Explore data distributions and discover data entry errors
  • Run multivariate checks when anomalies arise from feature combinations
  • Generate repeatable reports for audits or data pipelines

Best practices

  • Start with IQR for robust single-column detection and tune the multiplier
  • Use Modified Z-score when median-based robustness is needed
  • Apply Isolation Forest or LOF for high-dimensional or contextual anomalies
  • Compare methods and use the intersection of their results for higher-confidence flags
  • Visualize flagged points to validate model results before automated removal

Example use cases

  • Data cleaning pipeline: analyze columns, export a cleaned CSV without outliers
  • Fraud detection: run IQR and Z-score then inspect transactions flagged by both
  • Quality control: monitor sensor data and alert on multivariate anomalies
  • EDA: create boxplots and distributions to communicate outlier impact
  • Production monitoring: schedule periodic multivariate isolation forest scans

FAQ

Which method should I pick first?

Use IQR as a robust default for single columns; switch to Isolation Forest or LOF for multivariate cases or complex feature interactions.

Can I tune sensitivity?

Yes. Statistical thresholds (multiplier, z-score threshold) and ML parameters (contamination, n_neighbors) are configurable to increase or decrease sensitivity.
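
As a small illustration of how a statistical threshold changes sensitivity, the sketch below counts IQR outliers on the same data for three multipliers. `count_iqr_outliers` is a hypothetical helper written with NumPy, not part of the skill, and the data is made up.

```python
import numpy as np


def count_iqr_outliers(values, multiplier):
    """Count values outside Q1 - k*IQR and Q3 + k*IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    outside = (values < q1 - multiplier * iqr) | (values > q3 + multiplier * iqr)
    return int(outside.sum())


# Illustrative data (made up): the integers 1..20 plus two larger values.
values = list(range(1, 21)) + [35, 45]
counts = {k: count_iqr_outliers(values, k) for k in (1.5, 2.0, 3.0)}
print(counts)  # {1.5: 2, 2.0: 1, 3.0: 0}
```

Raising the multiplier from 1.5 to 3.0 moves the upper fence from 32.5 to 48.25, so the same two values go from both flagged to neither flagged.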