This skill detects anomalies in numeric data using statistical and ML methods, enabling data cleaning, fraud detection, and quality control analyses.
To add this skill to your agents, run: `npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill outlier-detective`
---
name: outlier-detective
description: Detect anomalies and outliers in datasets using statistical and ML methods. Use for data cleaning, fraud detection, or quality control analysis.
---
# Outlier Detective
Detect anomalies and outliers in numeric data using multiple methods.
## Features
- **Statistical Methods**: Z-score, IQR, Modified Z-score
- **ML Methods**: Isolation Forest, LOF, DBSCAN
- **Visualization**: Box plots, scatter plots
- **Multi-Column**: Analyze multiple variables
- **Reports**: Detailed outlier reports
- **Flexible Thresholds**: Configurable sensitivity
## Quick Start
```python
from outlier_detective import OutlierDetective
detective = OutlierDetective()
detective.load_csv("sales_data.csv")
# Detect outliers in a column
outliers = detective.detect("revenue", method="iqr")
print(f"Found {len(outliers)} outliers")
# Get full report
report = detective.analyze("revenue")
print(report)
```
## CLI Usage
```bash
# Detect outliers using IQR method
python outlier_detective.py --input data.csv --column sales --method iqr
# Use Z-score with custom threshold
python outlier_detective.py --input data.csv --column price --method zscore --threshold 3
# Analyze all numeric columns
python outlier_detective.py --input data.csv --all
# Generate visualization
python outlier_detective.py --input data.csv --column revenue --plot boxplot.png
# Export outliers to CSV
python outlier_detective.py --input data.csv --column value --output outliers.csv
# Use Isolation Forest (ML)
python outlier_detective.py --input data.csv --method isolation_forest
```
## API Reference
### OutlierDetective Class
```python
import pandas as pd

class OutlierDetective:
    def __init__(self): ...

    # Data loading
    def load_csv(self, filepath: str, **kwargs) -> 'OutlierDetective': ...
    def load_dataframe(self, df: pd.DataFrame) -> 'OutlierDetective': ...

    # Detection (single column)
    def detect(self, column: str, method: str = "iqr", **kwargs) -> pd.DataFrame: ...
    def analyze(self, column: str) -> dict: ...

    # Detection (multi-column); **kwargs carries method options such as
    # contamination or n_neighbors, as used in the examples in this document
    def detect_multivariate(self, columns: list = None, method: str = "isolation_forest", **kwargs) -> pd.DataFrame: ...
    def analyze_all(self) -> dict: ...

    # Visualization
    def plot_boxplot(self, column: str, output: str) -> str: ...
    def plot_scatter(self, col1: str, col2: str, output: str) -> str: ...
    def plot_distribution(self, column: str, output: str) -> str: ...

    # Export
    def get_outliers(self, column: str, method: str = "iqr") -> pd.DataFrame: ...
    def get_clean_data(self, column: str, method: str = "iqr") -> pd.DataFrame: ...
```
## Detection Methods
### Statistical Methods
#### IQR (Interquartile Range)
- Default method; robust because quartiles are barely affected by a few extreme values
- Outliers: values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR
- Multiplier configurable (default: 1.5)
```python
outliers = detective.detect("price", method="iqr", multiplier=1.5)
```
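The IQR rule above can be sketched directly with NumPy. This is a minimal illustration of the fence calculation, not the skill's own implementation; the helper name `iqr_outlier_mask` and the sample prices are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, multiplier=1.5):
    """Return a boolean mask marking values outside the IQR fences.

    Hypothetical helper for illustration; not part of the skill's API.
    """
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower = q1 - multiplier * iqr  # lower fence: Q1 - multiplier * IQR
    upper = q3 + multiplier * iqr  # upper fence: Q3 + multiplier * IQR
    return (x < lower) | (x > upper)

# Made-up sample data with one obviously extreme price
prices = [10, 12, 11, 13, 12, 11, 10, 250]
mask = iqr_outlier_mask(prices)
flagged = [p for p, is_outlier in zip(prices, mask) if is_outlier]
print(flagged)  # [250]
```

Here Q1 is 10.75 and Q3 is 12.25, so the fences sit at 8.5 and 14.5 and only 250 falls outside them.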
#### Z-Score
- Based on standard deviations from mean
- Assumes roughly normally distributed data; the mean and standard deviation are themselves pulled by extreme values
- Threshold configurable (default: 3)
```python
outliers = detective.detect("price", method="zscore", threshold=3)
```
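As a rough sketch of what the Z-score rule computes, the snippet below standardizes each value and flags those beyond the threshold. It assumes the population standard deviation (NumPy's default); the skill may use a different convention. The helper name `zscore_outlier_mask` and the data are hypothetical.

```python
import numpy as np

def zscore_outlier_mask(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold.

    Hypothetical helper for illustration; uses the population std (ddof=0).
    """
    x = np.asarray(values, dtype=float)
    std = x.std()
    if std == 0:
        # All values identical: nothing can be an outlier
        return np.zeros(x.shape, dtype=bool)
    z = (x - x.mean()) / std
    return np.abs(z) > threshold

# Made-up data: nineteen ordinary values and one extreme value
amounts = [10] * 19 + [100]
mask = zscore_outlier_mask(amounts, threshold=3)
print(int(mask.sum()))  # 1
```

In this sketch the largest possible |z| in a sample of n values is sqrt(n - 1), so a threshold of 3 cannot fire on 10 or fewer values. For small samples, IQR or Modified Z-score may be the better choice.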
#### Modified Z-Score
- Uses median instead of mean
- More robust to existing outliers
- Based on MAD (Median Absolute Deviation)
```python
outliers = detective.detect("price", method="modified_zscore", threshold=3.5)
```
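A common formulation of the Modified Z-score is `0.6745 * (x - median) / MAD`, with 3.5 as the usual cutoff. The sketch below assumes that formulation; whether the skill uses exactly this constant is not stated in the source. The helper name `modified_zscore_mask` and the data are hypothetical.

```python
import numpy as np

def modified_zscore_mask(values, threshold=3.5):
    """Flag values using the median/MAD-based modified z-score.

    Hypothetical helper for illustration; assumes the 0.6745 scaling constant.
    """
    x = np.asarray(values, dtype=float)
    median = np.median(x)
    mad = np.median(np.abs(x - median))  # Median Absolute Deviation
    if mad == 0:
        # More than half the values equal the median: score is undefined
        return np.zeros(x.shape, dtype=bool)
    m = 0.6745 * (x - median) / mad
    return np.abs(m) > threshold

# Made-up sample data with one extreme price
prices = [10, 12, 11, 13, 12, 11, 10, 250]
mask = modified_zscore_mask(prices)
flagged = [p for p, is_outlier in zip(prices, mask) if is_outlier]
print(flagged)  # [250]
```

The median here is 11.5 and the MAD is 1.0, so 250 scores about 160.9 while every other value stays near 1. Because the median and MAD ignore the magnitude of the extreme value, the outlier cannot hide itself by inflating the spread estimate.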
### ML Methods
#### Isolation Forest
- Ensemble method, good for high-dimensional data
- Contamination parameter sets expected outlier fraction
```python
outliers = detective.detect_multivariate(
    method="isolation_forest",
    contamination=0.1
)
```
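To show what the `contamination` parameter does, here is a standalone sketch that calls scikit-learn's `IsolationForest` directly. It assumes the skill wraps this estimator, which the source does not confirm; the synthetic data is made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 ordinary points plus two far-away points
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extreme = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([normal, extreme])

# contamination=0.1 tells the model to flag roughly 10% of rows
model = IsolationForest(contamination=0.1, random_state=0)
labels = model.fit_predict(X)  # -1 = outlier, 1 = inlier
outlier_rows = np.where(labels == -1)[0]
print(f"Flagged {len(outlier_rows)} of {len(X)} rows")
```

Because `contamination` fixes the flagged fraction, a value that is too high will flag ordinary rows along with the genuinely extreme ones, so it is worth setting it from a realistic estimate of the outlier rate.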
#### Local Outlier Factor (LOF)
- Density-based method
- Compares local density to neighbors
```python
outliers = detective.detect_multivariate(
    method="lof",
    n_neighbors=20
)
```
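The density comparison can be illustrated with scikit-learn's `LocalOutlierFactor`, used here directly as an assumption about what sits behind `method="lof"`. The synthetic data is made up.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: one dense cluster plus a single isolated point
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
isolated = np.array([[10.0, 10.0]])
X = np.vstack([cluster, isolated])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier

# scikit-learn stores the negated factor; flip the sign so that
# larger values mean "less dense than the neighbors"
scores = -lof.negative_outlier_factor_
print(labels[-1], round(float(scores[-1]), 2))
```

A score near 1 means a point is about as dense as its neighbors; the isolated point scores far higher. A smaller `n_neighbors` makes the comparison more local, and a larger one smooths it.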
## Output Format
### detect() Result
```python
# Returns DataFrame of outlier rows with additional columns:
# - outlier_score: How extreme the value is
# - outlier_reason: Description of why it's an outlier
#    index  value  outlier_score      outlier_reason
# 0     15   5000            4.2  Above Q3 + 1.5×IQR
# 1     42  -1000           -3.8  Below Q1 - 1.5×IQR
```
### analyze() Result
```python
{
    "column": "revenue",
    "total_rows": 1000,
    "outlier_count": 23,
    "outlier_percent": 2.3,
    "methods": {
        "iqr": {"count": 23, "indices": [...]},
        "zscore": {"count": 18, "indices": [...]},
        "modified_zscore": {"count": 20, "indices": [...]}
    },
    "stats": {
        "mean": 5432.10,
        "median": 4890.00,
        "std": 1234.56,
        "min": -1000.00,
        "max": 15000.00,
        "q1": 3500.00,
        "q3": 6200.00,
        "iqr": 2700.00
    },
    "bounds": {
        "lower": -550.00,
        "upper": 10250.00
    }
}
```
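The `bounds` in the sample report follow from its `q1` and `q3` values under the default 1.5 multiplier. A quick check of that arithmetic, using only the numbers shown above:

```python
# Values copied from the "stats" section of the sample report
stats = {"q1": 3500.00, "q3": 6200.00}
multiplier = 1.5  # default IQR multiplier

iqr = stats["q3"] - stats["q1"]
lower = stats["q1"] - multiplier * iqr
upper = stats["q3"] + multiplier * iqr
print(iqr, lower, upper)  # 2700.0 -550.0 10250.0
```

The sample `min` of -1000.00 and `max` of 15000.00 both fall outside these bounds, which is consistent with the report listing IQR outliers.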
## Example Workflows
### Data Cleaning Pipeline
```python
detective = OutlierDetective()
detective.load_csv("raw_data.csv")
# Analyze the column and summarize outliers
report = detective.analyze("price")
print(f"Found {report['outlier_count']} outliers ({report['outlier_percent']:.1f}%)")
# Get clean data
clean_data = detective.get_clean_data("price", method="iqr")
clean_data.to_csv("clean_data.csv")
```
### Fraud Detection
```python
detective = OutlierDetective()
detective.load_csv("transactions.csv")
# Use multiple methods for consensus
iqr_outliers = set(detective.detect("amount", method="iqr").index)
zscore_outliers = set(detective.detect("amount", method="zscore").index)
# Transactions flagged by both methods
high_confidence = iqr_outliers & zscore_outliers
print(f"High-confidence anomalies: {len(high_confidence)}")
```
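The two-method intersection generalizes to a vote across any number of methods. The sketch below is one way to do that with the standard library; the helper name `consensus` and the index sets are hypothetical stand-ins for the `.index` values returned by `detect()`.

```python
from collections import Counter

def consensus(flag_sets, min_votes=2):
    """Return indices flagged by at least `min_votes` of the given sets.

    Hypothetical helper for illustration; not part of the skill's API.
    """
    votes = Counter(idx for flags in flag_sets for idx in set(flags))
    return {idx for idx, count in votes.items() if count >= min_votes}

# Made-up row indices, standing in for the results of three methods
iqr_flags = {3, 17, 42}
zscore_flags = {17, 42, 88}
modified_flags = {17, 99}

agreed = consensus([iqr_flags, zscore_flags, modified_flags], min_votes=2)
print(sorted(agreed))  # [17, 42]
```

Setting `min_votes` to the number of methods reproduces the strict intersection; lowering it trades precision for recall.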
### Multi-Variable Analysis
```python
detective = OutlierDetective()
detective.load_csv("sensors.csv")
# Detect multivariate outliers
outliers = detective.detect_multivariate(
    columns=["temp", "pressure", "humidity"],
    method="isolation_forest",
    contamination=0.05
)
print(f"Anomalous readings: {len(outliers)}")
```
## Visualization Examples
```python
# Box plot with outliers highlighted
detective.plot_boxplot("revenue", "revenue_boxplot.png")
# Distribution with bounds
detective.plot_distribution("price", "price_dist.png")
# Scatter plot (2D outliers)
detective.plot_scatter("feature1", "feature2", "scatter.png")
```
## Dependencies
- pandas>=2.0.0
- numpy>=1.24.0
- scipy>=1.10.0
- scikit-learn>=1.3.0
- matplotlib>=3.7.0
## Summary
This skill detects anomalies and outliers in numeric datasets using a mix of statistical rules and machine learning models. It provides single-column and multivariate detection, visualizations, and exportable reports to support data cleaning, fraud detection, and quality control. Sensitivity and algorithms are configurable for practical workflows.
The skill inspects numeric columns and returns rows flagged as outliers along with an outlier score and human-readable reason. Built-in statistical methods include IQR, Z-score, and Modified Z-score; ML methods include Isolation Forest, Local Outlier Factor, and DBSCAN for high-dimensional and density-based detection. It can analyze one column, run multivariate detection across multiple features, generate summary statistics and bounds, and produce boxplots, scatter plots, and distribution charts.
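DBSCAN is listed among the ML methods, but this document does not show how the skill exposes it. As an assumption, the sketch below uses scikit-learn's `DBSCAN` directly to show the underlying idea: points that belong to no dense cluster get the noise label `-1`. The `eps` and `min_samples` values and the data are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic data: one tight cluster plus a single stray point
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.3, size=(100, 2))
stray = np.array([[5.0, 5.0]])
X = np.vstack([cluster, stray])

# eps: neighborhood radius; min_samples: points needed to form a dense region
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise_rows = np.where(labels == -1)[0]  # -1 marks noise, i.e. outliers
print(f"Noise rows: {len(noise_rows)}")
```

Unlike Isolation Forest, DBSCAN has no contamination setting; the number of flagged rows depends entirely on `eps` and `min_samples`, which usually need tuning to the scale of the data.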
## FAQ
### Which method should I pick first?
Use IQR as a robust default for single columns; switch to Isolation Forest or LOF for multivariate cases or complex feature interactions.
### Can I tune sensitivity?
Yes. Statistical thresholds (multiplier, z-score threshold) and ML parameters (contamination, n_neighbors) are configurable to increase or decrease sensitivity.
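As an illustration of how the IQR multiplier changes sensitivity, the sketch below counts flagged values at two settings. It reimplements the fence rule with NumPy rather than calling the skill; the helper name `count_iqr_outliers` and the data are hypothetical.

```python
import numpy as np

def count_iqr_outliers(values, multiplier):
    """Count values outside Q1 - m*IQR and Q3 + m*IQR (hypothetical helper)."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    outside = (x < q1 - multiplier * iqr) | (x > q3 + multiplier * iqr)
    return int(outside.sum())

# Made-up data with one mild (17) and one extreme (250) value
values = [10, 12, 11, 13, 12, 11, 10, 17, 250]
counts = {m: count_iqr_outliers(values, m) for m in (1.5, 3.0)}
print(counts)  # {1.5: 2, 3.0: 1}
```

With a multiplier of 1.5 the fences are 8 and 16, so both 17 and 250 are flagged; at 3.0 the fences widen to 5 and 19 and only 250 remains.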