home / skills / benchflow-ai / skillsbench / speaker-clustering
/tasks/speaker-diarization-subtitles/environment/skills/speaker-clustering
This skill clusters speaker embeddings by choosing hierarchical, kmeans, or agglomerative methods with normalization to identify distinct speakers.
npx playbooks add skill benchflow-ai/skillsbench --skill speaker-clusteringReview the files below or copy the command above to add this skill to your agents.
---
name: Speaker Clustering Methods
description: Choose and implement clustering algorithms for grouping speaker embeddings after VAD and embedding extraction. Compare Hierarchical clustering (auto-tunes speaker count), KMeans (fast, requires known count), and Agglomerative clustering (fixed clusters). Use Hierarchical clustering when speaker count is unknown, KMeans when count is known, and always normalize embeddings before clustering.
---
# Speaker Clustering Methods
## Overview
After extracting speaker embeddings from audio segments, you need to cluster them to identify unique speakers. Different clustering methods have different strengths.
## When to Use
- After extracting speaker embeddings from VAD segments
- Need to group similar speakers together
- Determining number of speakers automatically or manually
## Available Clustering Methods
### 1. Hierarchical Clustering (Recommended for Auto-tuning)
**Best for:** Automatically determining number of speakers, flexible threshold tuning
```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
import numpy as np
# Prepare embeddings
embeddings_array = np.array(embeddings_list)
n_segments = len(embeddings_array)
# Compute distance matrix
distances = pdist(embeddings_array, metric='cosine')
# Create linkage matrix
linkage_matrix = linkage(distances, method='average')
# Auto-tune threshold to get reasonable speaker count
min_speakers = 2
max_speakers = max(2, min(10, n_segments // 2))
threshold = 0.7
labels = fcluster(linkage_matrix, t=threshold, criterion='distance')
n_speakers = len(set(labels))
# Adjust threshold if needed
if n_speakers > max_speakers:
for t in [0.8, 0.9, 1.0, 1.1, 1.2]:
labels = fcluster(linkage_matrix, t=t, criterion='distance')
n_speakers = len(set(labels))
if n_speakers <= max_speakers:
threshold = t
break
elif n_speakers < min_speakers:
for t in [0.6, 0.5, 0.4]:
labels = fcluster(linkage_matrix, t=t, criterion='distance')
n_speakers = len(set(labels))
if n_speakers >= min_speakers:
threshold = t
break
print(f"Selected: t={threshold}, {n_speakers} speakers")
```
**Advantages:**
- Automatically determines speaker count
- Flexible threshold tuning
- Good for unknown number of speakers
- Can visualize dendrogram
### 2. KMeans Clustering
**Best for:** Known number of speakers, fast clustering
```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import numpy as np
# Normalize embeddings
embeddings_array = np.array(embeddings_list)
norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True)
embeddings_normalized = embeddings_array / np.clip(norms, 1e-9, None)
# Try different k values and choose best
best_k = 2
best_score = -1
best_labels = None
for k in range(2, min(7, len(embeddings_normalized))):
kmeans = KMeans(n_clusters=k, random_state=0, n_init=10)
labels = kmeans.fit_predict(embeddings_normalized)
if len(set(labels)) < 2:
continue
score = silhouette_score(embeddings_normalized, labels, metric='cosine')
if score > best_score:
best_score = score
best_k = k
best_labels = labels
print(f"Best k={best_k}, silhouette score={best_score:.3f}")
```
**Advantages:**
- Fast and efficient
- Works well with known speaker count
- Simple to implement
**Disadvantages:**
- Requires specifying number of clusters
- May get stuck in local minima
### 3. Agglomerative Clustering
**Best for:** Similar to hierarchical but with fixed number of clusters
```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import numpy as np
# Normalize embeddings
embeddings_array = np.array(embeddings_list)
norms = np.linalg.norm(embeddings_array, axis=1, keepdims=True)
embeddings_normalized = embeddings_array / np.clip(norms, 1e-9, None)
# Try different numbers of clusters
best_n = 2
best_score = -1
best_labels = None
for n_clusters in range(2, min(6, len(embeddings_normalized))):
clustering = AgglomerativeClustering(n_clusters=n_clusters)
labels = clustering.fit_predict(embeddings_normalized)
if len(set(labels)) < 2:
continue
score = silhouette_score(embeddings_normalized, labels, metric='cosine')
if score > best_score:
best_score = score
best_n = n_clusters
best_labels = labels
print(f"Best n_clusters={best_n}, silhouette score={best_score:.3f}")
```
**Advantages:**
- Deterministic results
- Good for fixed number of clusters
- Can use different linkage methods
## Comparison Table
| Method | Auto Speaker Count | Speed | Best For |
|--------|-------------------|-------|----------|
| Hierarchical | ✅ Yes | Medium | Unknown speaker count |
| KMeans | ❌ No | Fast | Known speaker count |
| Agglomerative | ❌ No | Medium | Fixed clusters needed |
## Embedding Normalization
Always normalize embeddings before clustering:
```python
# L2 normalization
embeddings_normalized = embeddings_array / np.clip(
np.linalg.norm(embeddings_array, axis=1, keepdims=True),
1e-9, None
)
```
## Distance Metrics
- **Cosine**: Best for speaker embeddings (default)
- **Euclidean**: Can work but less ideal for normalized embeddings
## Choosing Number of Speakers
### Method 1: Silhouette Score (for KMeans/Agglomerative)
```python
from sklearn.metrics import silhouette_score
best_k = 2
best_score = -1
for k in range(2, min(7, len(embeddings))):
labels = clusterer.fit_predict(embeddings)
score = silhouette_score(embeddings, labels, metric='cosine')
if score > best_score:
best_score = score
best_k = k
```
### Method 2: Threshold Tuning (for Hierarchical)
```python
# Start with reasonable threshold
threshold = 0.7
labels = fcluster(linkage_matrix, t=threshold, criterion='distance')
n_speakers = len(set(labels))
# Adjust based on constraints
if n_speakers > max_speakers:
# Increase threshold to merge more
threshold = 0.9
elif n_speakers < min_speakers:
# Decrease threshold to split more
threshold = 0.5
```
## Post-Clustering: Merging Segments
After clustering, merge adjacent segments with same speaker:
```python
def merge_speaker_segments(labeled_segments, gap_threshold=0.15):
"""
labeled_segments: list of (start, end, speaker_label)
gap_threshold: merge if gap <= this (seconds)
"""
labeled_segments.sort(key=lambda x: (x[0], x[1]))
merged = []
cur_s, cur_e, cur_spk = labeled_segments[0]
for s, e, spk in labeled_segments[1:]:
if spk == cur_spk and s <= cur_e + gap_threshold:
cur_e = max(cur_e, e)
else:
merged.append((cur_s, cur_e, cur_spk))
cur_s, cur_e, cur_spk = s, e, spk
merged.append((cur_s, cur_e, cur_spk))
return merged
```
## Common Issues
1. **Too many speakers**: Increase threshold (hierarchical) or decrease k (KMeans)
2. **Too few speakers**: Decrease threshold (hierarchical) or increase k (KMeans)
3. **Poor clustering**: Check embedding quality, try different normalization
4. **Over-segmentation**: Increase gap_threshold when merging segments
## Best Practices
1. **Normalize embeddings** before clustering
2. **Use cosine distance** for speaker embeddings
3. **Try multiple methods** and compare results
4. **Validate speaker count** with visual features if available
5. **Merge adjacent segments** after clustering
6. **After diarization, use high-quality ASR**: Use Whisper `small` or `large-v3` model for transcription (see automatic-speech-recognition skill)This skill guides selection and implementation of clustering algorithms for speaker embeddings after voice activity detection (VAD) and embedding extraction. It compares Hierarchical (auto-tunes speaker count), KMeans (fast, requires known count), and Agglomerative (deterministic, fixed clusters) approaches and highlights practical defaults and outcomes. The goal is reliable speaker grouping, sensible speaker-count selection, and clean post-clustering segment merging.
Normalize embeddings (L2) and compute pairwise distances using cosine similarity. Use Hierarchical clustering with threshold tuning when the speaker count is unknown; it derives clusters from a linkage matrix and can be auto-tuned to meet min/max speaker constraints. Use KMeans when you know or can estimate the number of speakers and optionally select k via silhouette score. Use Agglomerative clustering when you want deterministic results with a fixed number of clusters. Finally, merge adjacent segments with the same label to produce contiguous speaker turns.
How do I choose between Hierarchical and KMeans?
Use Hierarchical when speaker count is unknown or variable; use KMeans when you have a reliable estimate of speaker count and need speed.
What distance metric should I use?
Cosine distance is recommended for speaker embeddings; normalize embeddings with L2 before computing distances.