---
name: clustering-analyzer
description: Cluster data using K-Means, DBSCAN, hierarchical clustering. Use for customer segmentation, pattern discovery, or data grouping.
---
# Clustering Analyzer
Analyze and cluster data using multiple algorithms with visualization and evaluation.
## Features
- **K-Means**: Partition-based clustering with elbow method
- **DBSCAN**: Density-based clustering for arbitrary shapes
- **Hierarchical**: Agglomerative clustering with dendrograms
- **Evaluation**: Silhouette scores, cluster statistics
- **Visualization**: 2D/3D plots, dendrograms, elbow curves
- **Export**: Labeled data, cluster summaries
## Quick Start
```python
from clustering_analyzer import ClusteringAnalyzer
analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv")
# K-Means clustering
result = analyzer.kmeans(n_clusters=3)
print(f"Silhouette Score: {result['silhouette_score']:.3f}")
# Visualize
analyzer.plot_clusters("clusters.png")
```
## CLI Usage
```bash
# K-Means clustering
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3
# Find optimal clusters (elbow method)
python clustering_analyzer.py --input data.csv --method kmeans --find-optimal
# DBSCAN clustering
python clustering_analyzer.py --input data.csv --method dbscan --eps 0.5 --min-samples 5
# Hierarchical clustering
python clustering_analyzer.py --input data.csv --method hierarchical --clusters 4
# Generate plots
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --plot clusters.png
# Export labeled data
python clustering_analyzer.py --input data.csv --method kmeans --clusters 3 --output labeled.csv
# Select specific columns
python clustering_analyzer.py --input data.csv --columns age,income,spending --method kmeans --clusters 3
```
## API Reference
### ClusteringAnalyzer Class
```python
class ClusteringAnalyzer:
    def __init__(self)

    # Data loading
    def load_csv(self, filepath: str, columns: list = None) -> 'ClusteringAnalyzer'
    def load_dataframe(self, df: pd.DataFrame, columns: list = None) -> 'ClusteringAnalyzer'

    # Clustering methods
    def kmeans(self, n_clusters: int, **kwargs) -> dict
    def dbscan(self, eps: float = 0.5, min_samples: int = 5) -> dict
    def hierarchical(self, n_clusters: int, linkage: str = "ward") -> dict

    # Optimal clusters
    def find_optimal_clusters(self, max_k: int = 10) -> dict
    def elbow_plot(self, output: str, max_k: int = 10) -> str

    # Evaluation
    def silhouette_score(self) -> float
    def cluster_statistics(self) -> dict

    # Visualization
    def plot_clusters(self, output: str, dimensions: list = None) -> str
    def plot_dendrogram(self, output: str) -> str
    def plot_silhouette(self, output: str) -> str

    # Export
    def get_labels(self) -> list
    def to_dataframe(self) -> pd.DataFrame
    def save_labeled(self, output: str) -> str
```
## Clustering Methods
### K-Means
Best for compact, roughly spherical clusters when the number of groups is known:
```python
result = analyzer.kmeans(n_clusters=3)
# Returns:
{
    "labels": [0, 1, 2, 0, ...],
    "n_clusters": 3,
    "silhouette_score": 0.65,
    "inertia": 1234.56,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "centroids": [[...], [...], [...]]
}
```
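A result in this shape maps directly onto scikit-learn primitives. As a hedged sketch of the equivalent computation (synthetic blob data and seed are illustrative, not part of the skill):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs stand in for real data
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[0.0, 5.0], scale=0.5, size=(50, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
result = {
    "labels": km.labels_.tolist(),
    "n_clusters": 3,
    "silhouette_score": silhouette_score(X, km.labels_),
    "inertia": km.inertia_,                      # sum of squared distances to centroids
    "cluster_sizes": dict(zip(*np.unique(km.labels_, return_counts=True))),
    "centroids": km.cluster_centers_.tolist(),
}
print(f"Silhouette: {result['silhouette_score']:.3f}")  # high for well-separated blobs
```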
### DBSCAN
Best for arbitrary-shaped clusters:
```python
result = analyzer.dbscan(eps=0.5, min_samples=5)
# Returns:
{
    "labels": [0, 0, 1, -1, ...],  # -1 = noise
    "n_clusters": 3,
    "n_noise": 15,
    "silhouette_score": 0.58,
    "cluster_sizes": {0: 150, 1: 200, 2: 100}
}
```
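The cluster and noise counts fall out of sklearn's DBSCAN labels directly; a minimal sketch on synthetic data (the two blobs and scattered outliers are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus a few scattered points that should come out as noise
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.2, size=(60, 2)),
    rng.normal(4.0, 0.2, size=(60, 2)),
    rng.uniform(-10, 10, size=(5, 2)),
])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is not a cluster
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```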
### Hierarchical (Agglomerative)
Best for understanding cluster hierarchy:
```python
result = analyzer.hierarchical(n_clusters=4, linkage="ward")
# Returns:
{
    "labels": [0, 1, 2, 3, ...],
    "n_clusters": 4,
    "silhouette_score": 0.62,
    "cluster_sizes": {0: 100, 1: 150, 2: 120, 3: 80}
}
```
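The same result shape can be produced with sklearn's AgglomerativeClustering; a sketch assuming four synthetic groups:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Four well-separated 1D-ish groups embedded in 2D
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (0.0, 3.0, 6.0, 9.0)])

agg = AgglomerativeClustering(n_clusters=4, linkage="ward").fit(X)
sizes = dict(zip(*np.unique(agg.labels_, return_counts=True)))
print(sizes)
```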
## Finding Optimal Clusters
### Elbow Method
```python
optimal = analyzer.find_optimal_clusters(max_k=10)
# Returns:
{
    "optimal_k": 4,
    "inertias": [1000, 800, 500, 300, 280, ...],
    "silhouettes": [0.5, 0.55, 0.6, 0.65, 0.63, ...]
}
```
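find_optimal_clusters presumably sweeps k and records both metrics; one plausible sketch of that loop (here picking k by best silhouette, since detecting the elbow programmatically is itself a heuristic choice):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three synthetic blobs, so the sweep should prefer k = 3
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.4, size=(50, 2)) for c in (0.0, 4.0, 8.0)])

ks = range(2, 8)
inertias, silhouettes = [], []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)            # always decreases with k
    silhouettes.append(silhouette_score(X, km.labels_))

optimal_k = ks[int(np.argmax(silhouettes))]  # peak silhouette as the pick
print(optimal_k)
```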
### Elbow Plot
```python
analyzer.elbow_plot("elbow.png", max_k=10)
```
Generates a plot of inertia versus the number of clusters.
## Cluster Statistics
```python
stats = analyzer.cluster_statistics()
# Returns:
{
    "n_clusters": 3,
    "cluster_sizes": {0: 150, 1: 200, 2: 100},
    "cluster_means": {
        0: {"age": 25.5, "income": 45000, ...},
        1: {"age": 45.2, "income": 75000, ...},
        2: {"age": 35.1, "income": 55000, ...}
    },
    "cluster_std": {
        0: {"age": 5.2, "income": 8000, ...},
        ...
    },
    "overall_silhouette": 0.65
}
```
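Per-cluster means and standard deviations amount to a groupby over the labeled data; an equivalent pandas sketch (the tiny DataFrame and column names are hypothetical):

```python
import pandas as pd

# Toy labeled data: two obvious groups of three rows each
df = pd.DataFrame({
    "age":    [23, 25, 27, 44, 46, 48],
    "income": [40000, 42000, 44000, 72000, 74000, 76000],
    "cluster_label": [0, 0, 0, 1, 1, 1],
})

grouped = df.groupby("cluster_label")
cluster_means = grouped.mean().to_dict(orient="index")
cluster_std = grouped.std().to_dict(orient="index")   # sample std (ddof=1)
cluster_sizes = grouped.size().to_dict()
print(cluster_means[0]["age"])  # 25.0
```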
## Visualization
### Cluster Plot
```python
# 2D plot (uses first 2 features or PCA)
analyzer.plot_clusters("clusters_2d.png")
# Specify dimensions
analyzer.plot_clusters("clusters.png", dimensions=["age", "income"])
```
### Dendrogram
```python
# For hierarchical clustering
analyzer.hierarchical(n_clusters=4)
analyzer.plot_dendrogram("dendrogram.png")
```
### Silhouette Plot
```python
analyzer.plot_silhouette("silhouette.png")
```
Shows the silhouette coefficient for each sample.
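Per-sample coefficients of this kind can be computed with sklearn's silhouette_samples; a hedged sketch of the underlying values the plot would draw (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

# Two clean blobs, so most per-sample coefficients should be high
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in (0.0, 5.0)])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

coeffs = silhouette_samples(X, labels)  # one value per sample, in [-1, 1]
print(coeffs.mean())
```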
## Export Results
### Get Labels
```python
labels = analyzer.get_labels()
# [0, 1, 2, 0, 1, ...]
```
### Save Labeled Data
```python
analyzer.save_labeled("labeled_data.csv")
# Original data + cluster_label column
```
### Get Full DataFrame
```python
df = analyzer.to_dataframe()
# DataFrame with cluster_label column
```
## Example Workflows
### Customer Segmentation
```python
analyzer = ClusteringAnalyzer()
analyzer.load_csv("customers.csv", columns=["age", "income", "spending_score"])
# Find optimal number of segments
optimal = analyzer.find_optimal_clusters(max_k=8)
print(f"Optimal segments: {optimal['optimal_k']}")
# Cluster with optimal k
result = analyzer.kmeans(n_clusters=optimal['optimal_k'])
# Get segment characteristics
stats = analyzer.cluster_statistics()
for cluster_id, means in stats["cluster_means"].items():
    print(f"\nSegment {cluster_id}:")
    for feature, value in means.items():
        print(f"  {feature}: {value:.2f}")
# Save segmented data
analyzer.save_labeled("customer_segments.csv")
```
### Anomaly Detection with DBSCAN
```python
analyzer = ClusteringAnalyzer()
analyzer.load_csv("transactions.csv", columns=["amount", "frequency"])
# DBSCAN identifies noise points as potential anomalies
result = analyzer.dbscan(eps=0.3, min_samples=10)
print(f"Found {result['n_noise']} potential anomalies")
# Get anomalous records
df = analyzer.to_dataframe()
anomalies = df[df["cluster_label"] == -1]
```
### Document Clustering
```python
# After TF-IDF transformation (tfidf_matrix: a dense DataFrame of TF-IDF features)
analyzer = ClusteringAnalyzer()
analyzer.load_dataframe(tfidf_matrix)
# Hierarchical clustering to see document relationships
result = analyzer.hierarchical(n_clusters=5)
analyzer.plot_dendrogram("doc_dendrogram.png")
```
## Data Preprocessing
The analyzer automatically:
- Handles missing values (imputation)
- Scales features (standardization)
- Reduces dimensions for visualization (PCA)
For custom preprocessing:
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
# Preprocess manually
df = pd.read_csv("data.csv")
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
# Load preprocessed data
analyzer.load_dataframe(df_scaled)
```
## Dependencies
- scikit-learn>=1.3.0
- pandas>=2.0.0
- numpy>=1.24.0
- matplotlib>=3.7.0
- scipy>=1.10.0
## Overview
This skill analyzes and clusters tabular data using K-Means, DBSCAN, and hierarchical (agglomerative) algorithms with built-in visualization and evaluation. It provides end-to-end workflows: data loading, preprocessing, clustering, quality metrics, plots, and exportable labeled datasets. Use it to discover patterns, segment customers, or detect anomalies quickly.

Load CSV files or pandas DataFrames, optionally selecting columns, then run K-Means, DBSCAN, or hierarchical clustering. The skill automatically imputes missing values, scales features, and applies PCA for 2D/3D plots when needed. Results include labels, silhouette scores, cluster statistics, and saved plots or labeled CSVs for downstream analysis.

## FAQ
**Which algorithm should I pick?**
Use K-Means for compact, roughly spherical groups when you know the expected number of clusters; DBSCAN for irregular shapes and anomaly detection; hierarchical clustering to inspect cluster relationships and generate dendrograms.

**Can I use preprocessed data?**
Yes. Load a pre-scaled DataFrame via `load_dataframe` to skip the built-in imputation and scaling when you want full control.