
clustering-analyzer skill

/clustering-analyzer

This skill analyzes and clusters data using multiple algorithms, with built-in visualization and evaluation, to reveal patterns and segment customers.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill clustering-analyzer

Review the files below or copy the command above to add this skill to your agents.

---
name: clustering-analyzer
description: Cluster data using K-Means, DBSCAN, hierarchical clustering. Use for customer segmentation, pattern discovery, or data grouping.
---

# Clustering Analyzer

Analyze and cluster data using multiple algorithms with visualization and evaluation.

## Features

- **K-Means**: Partition-based clustering with elbow method
- **DBSCAN**: Density-based clustering for arbitrary shapes
- **Hierarchical**: Agglomerative clustering with dendrograms
- **Evaluation**: Silhouette scores, cluster statistics
- **Visualization**: 2D/3D plots, dendrograms, elbow curves
- **Export**: Labeled data, cluster summaries

## Quick Start

```python
from clustering_analyzer import ClusteringAnalyzer

analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv")

# K-Means clustering
result = analyzer.kmeans(n_clusters=3)
print(f"Silhouette Score: {result['silhouette_score']:.3f}")

# Visualize
analyzer.plot_clusters("clusters.png")
```

## CLI Usage

```bash
# K-Means clustering
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3

# Find optimal clusters (elbow method)
python clustering_analyzer.py --input data.csv --method kmeans --find-optimal

# DBSCAN clustering
python clustering_analyzer.py --input data.csv --method dbscan --eps 0.5 --min-samples 5

# Hierarchical clustering
python clustering_analyzer.py --input data.csv --method hierarchical --clusters 4

# Generate plots
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --plot clusters.png

# Export labeled data
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --output labeled.csv

# Select specific columns
python clustering_analyzer.py --input data.csv --columns age,income,spending --method kmeans --clusters 3
```

## API Reference

### ClusteringAnalyzer Class

```python
class ClusteringAnalyzer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, columns: list = None) -> 'ClusteringAnalyzer'
    def load_dataframe(self, df: pd.DataFrame, columns: list = None) -> 'ClusteringAnalyzer'

    # Clustering methods
    def kmeans(self, n_clusters: int, **kwargs) -> dict
    def dbscan(self, eps: float = 0.5, min_samples: int = 5) -> dict
    def hierarchical(self, n_clusters: int, linkage: str = "ward") -> dict

    # Optimal clusters
    def find_optimal_clusters(self, max_k: int = 10) -> dict
    def elbow_plot(self, output: str, max_k: int = 10) -> str

    # Evaluation
    def silhouette_score(self) -> float
    def cluster_statistics(self) -> dict

    # Visualization
    def plot_clusters(self, output: str, dimensions: list = None) -> str
    def plot_dendrogram(self, output: str) -> str
    def plot_silhouette(self, output: str) -> str

    # Export
    def get_labels(self) -> list
    def to_dataframe(self) -> pd.DataFrame
    def save_labeled(self, output: str) -> str
```

## Clustering Methods

### K-Means

Best for compact, roughly spherical clusters when the number of groups is known:

```python
result = analyzer.kmeans(n_clusters=3)

# Returns:
{
    "labels": [0, 1, 2, 0, ...],
    "n_clusters": 3,
    "silhouette_score": 0.65,
    "inertia": 1234.56,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "centroids": [[...], [...], [...]]
}
```

### DBSCAN

Best for arbitrary-shaped clusters:

```python
result = analyzer.dbscan(eps=0.5, min_samples=5)

# Returns:
{
    "labels": [0, 0, 1, -1, ...],  # -1 = noise
    "n_clusters": 3,
    "n_noise": 15,
    "silhouette_score": 0.58,
    "cluster_sizes": {0: 150, 1: 200, 2: 100}
}
```

### Hierarchical (Agglomerative)

Best for understanding cluster hierarchy:

```python
result = analyzer.hierarchical(n_clusters=4, linkage="ward")

# Returns:
{
    "labels": [0, 1, 2, 3, ...],
    "n_clusters": 4,
    "silhouette_score": 0.62,
    "cluster_sizes": {0: 100, 1: 150, 2: 120, 3: 80}
}
```

## Finding Optimal Clusters

### Elbow Method

```python
optimal = analyzer.find_optimal_clusters(max_k=10)

# Returns:
{
    "optimal_k": 4,
    "inertias": [1000, 800, 500, 300, 280, ...],
    "silhouettes": [0.5, 0.55, 0.6, 0.65, 0.63, ...]
}
```

### Elbow Plot

```python
analyzer.elbow_plot("elbow.png", max_k=10)
```

Generates a plot of inertia versus the number of clusters.
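Under the hood, the elbow method amounts to fitting K-Means for a range of k values and comparing inertias. A minimal standalone sketch with scikit-learn, using synthetic data (variable names are illustrative, not part of this skill's API):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-Means for each candidate k and record inertia
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; the "elbow" is the k where
# the marginal improvement drops off sharply (here, around k=4)
print(inertias)
```

Plotting `inertias` against `range(1, 11)` reproduces the curve that `elbow_plot` saves to disk.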

## Cluster Statistics

```python
stats = analyzer.cluster_statistics()

# Returns:
{
    "n_clusters": 3,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "cluster_means": {
        0: {"age": 25.5, "income": 45000, ...},
        1: {"age": 45.2, "income": 75000, ...},
        2: {"age": 35.1, "income": 55000, ...}
    },
    "cluster_std": {
        0: {"age": 5.2, "income": 8000, ...},
        ...
    },
    "overall_silhouette": 0.65
}
```

## Visualization

### Cluster Plot

```python
# 2D plot (uses first 2 features or PCA)
analyzer.plot_clusters("clusters_2d.png")

# Specify dimensions
analyzer.plot_clusters("clusters.png", dimensions=["age", "income"])
```

### Dendrogram

```python
# For hierarchical clustering
analyzer.hierarchical(n_clusters=4)
analyzer.plot_dendrogram("dendrogram.png")
```

### Silhouette Plot

```python
analyzer.plot_silhouette("silhouette.png")
```

Shows silhouette coefficient for each sample.
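The per-sample coefficients that such a plot displays can also be computed directly with scikit-learn's `silhouette_samples`; this standalone sketch (not this skill's internal code) shows how the overall score relates to them:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One coefficient per sample, each in [-1, 1]; values near 1 mean the
# sample sits well inside its cluster, negative values suggest misassignment
per_sample = silhouette_samples(X, labels)

# The overall silhouette score is the mean of the per-sample coefficients
overall = silhouette_score(X, labels)
print(round(overall, 3))
```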

## Export Results

### Get Labels

```python
labels = analyzer.get_labels()
# [0, 1, 2, 0, 1, ...]
```

### Save Labeled Data

```python
analyzer.save_labeled("labeled_data.csv")
# Original data + cluster_label column
```

### Get Full DataFrame

```python
df = analyzer.to_dataframe()
# DataFrame with cluster_label column
```

## Example Workflows

### Customer Segmentation

```python
analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv", columns=["age", "income", "spending_score"])

# Find optimal number of segments
optimal = analyzer.find_optimal_clusters(max_k=8)
print(f"Optimal segments: {optimal['optimal_k']}")

# Cluster with optimal k
result = analyzer.kmeans(n_clusters=optimal['optimal_k'])

# Get segment characteristics
stats = analyzer.cluster_statistics()
for cluster_id, means in stats["cluster_means"].items():
    print(f"\nSegment {cluster_id}:")
    for feature, value in means.items():
        print(f"  {feature}: {value:.2f}")

# Save segmented data
analyzer.save_labeled("customer_segments.csv")
```

### Anomaly Detection with DBSCAN

```python
analyzer = ClusteringAnalyzer()
analyzer.load_csv("transactions.csv", columns=["amount", "frequency"])

# DBSCAN identifies noise points as potential anomalies
result = analyzer.dbscan(eps=0.3, min_samples=10)

print(f"Found {result['n_noise']} potential anomalies")

# Get anomalous records
df = analyzer.to_dataframe()
anomalies = df[df["cluster_label"] == -1]
```

### Document Clustering

```python
# After TF-IDF transformation
analyzer = ClusteringAnalyzer()
analyzer.load_dataframe(tfidf_matrix)

# Hierarchical clustering to see document relationships
result = analyzer.hierarchical(n_clusters=5)
analyzer.plot_dendrogram("doc_dendrogram.png")
```

## Data Preprocessing

The analyzer automatically:
- Handles missing values (imputation)
- Scales features (standardization)
- Reduces dimensions for visualization (PCA)

For custom preprocessing:

```python
from sklearn.preprocessing import StandardScaler

# Preprocess manually
df = pd.read_csv("data.csv")
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Load preprocessed data
analyzer.load_dataframe(df_scaled)
```

## Dependencies

- scikit-learn>=1.3.0
- pandas>=2.0.0
- numpy>=1.24.0
- matplotlib>=3.7.0
- scipy>=1.10.0

Overview

This skill analyzes and clusters tabular data using K-Means, DBSCAN, and hierarchical (agglomerative) algorithms with built-in visualization and evaluation. It provides end-to-end workflows: data loading, preprocessing, clustering, quality metrics, plots, and exportable labeled datasets. Use it to discover patterns, segment customers, or detect anomalies quickly.

How this skill works

Load CSV files or pandas DataFrames, optionally selecting columns, then run K-Means, DBSCAN, or hierarchical clustering. The skill automatically imputes missing values, scales features, and applies PCA for 2D/3D plots when needed. Results include labels, silhouette scores, cluster statistics, and saved plots or labeled CSVs for downstream analysis.

When to use it

  • Customer segmentation to create actionable groups for marketing or product targeting
  • Pattern discovery in sales, operations, or sensor data without predefined labels
  • Anomaly detection via DBSCAN noise points in transaction or log datasets
  • Exploratory analysis to understand cluster hierarchy with dendrograms
  • Quick prototyping of clustering pipelines with visual validation

Best practices

  • Standardize or preprocess features when mixing scales; built-in scaling runs automatically but manual preprocessing is supported
  • Use find_optimal_clusters and elbow_plot to pick K for K-Means, then validate with silhouette scores
  • Tune DBSCAN eps and min_samples on a data sample, or with k-distance plots, before full runs
  • For high-dimensional data, apply TF-IDF or other feature transforms and confirm interpretability of cluster means
  • Export labeled data and cluster_statistics to reproduce segments and feed downstream models
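
The k-distance heuristic mentioned above can be sketched with scikit-learn's NearestNeighbors: sort each point's distance to its k-th neighbor and look for the knee of the curve, which suggests a starting eps. This is a standalone sketch, not this skill's built-in behavior:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)

min_samples = 5
# Distance from each point to its min_samples-th nearest neighbor
# (kneighbors counts the point itself as its own first neighbor)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Plotting k_dist reveals a "knee" where distances jump; a crude
# numeric stand-in for eyeballing the knee is a high percentile
eps_guess = float(np.percentile(k_dist, 90))
print(round(eps_guess, 3))
```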

Example use cases

  • Segment e-commerce customers by age, income, and spending score, then save segments for targeted campaigns
  • Detect anomalous transactions by running DBSCAN and extracting noise points as potential fraud
  • Cluster documents after TF-IDF and visualize relationships with a dendrogram to guide topic labeling
  • Explore product usage patterns by clustering user metrics and exporting labeled datasets for A/B testing
  • Run elbow_plot to determine optimal K, then generate cluster visualizations and silhouette plots for stakeholder presentations

FAQ

Which algorithm should I pick?

Use K-Means for compact, roughly spherical groups and when you know the expected number of clusters; DBSCAN for irregular shapes and anomaly detection; hierarchical to inspect cluster relationships and generate dendrograms.
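The contrast is easy to see on non-convex data: on scikit-learn's two-moons dataset, K-Means tends to cut straight across the moons, while DBSCAN typically recovers both shapes. A standalone sketch with illustrative parameters (eps and min_samples are assumptions tuned for this synthetic data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: dense, non-spherical clusters
X, _ = make_moons(n_samples=300, noise=0.03, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# DBSCAN follows the density and typically labels each moon as one
# cluster; K-Means partitions by distance to centroids instead
print(sorted(set(db_labels)))
```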

Can I use preprocessed data?

Yes. You can load a pre-scaled DataFrame via load_dataframe to skip built-in imputation and scaling when you want full control.