---
name: hierarchical-taxonomy-clustering
description: Build a unified multi-level category taxonomy from hierarchical product category paths across e-commerce sources, using embedding-based recursive clustering with intelligent category naming via weighted word frequency analysis.
---
# Hierarchical Taxonomy Clustering
Create a unified multi-level taxonomy from hierarchical category paths by clustering similar paths and automatically generating meaningful category names.
## Problem
Given category paths from multiple sources (e.g., "electronics -> computers -> laptops"), create a unified taxonomy that groups similar paths across sources, generates meaningful category names, and produces a clean N-level hierarchy (typically 5 levels). The unified taxonomy can then be used for analysis or metric tracking on products from different platforms.
## Methodology
1. **Hierarchical Weighting**: Convert paths to embeddings with exponentially decaying weights (level i gets weight 0.6^(i-1)), so broader levels contribute more to the path representation
2. **Recursive Clustering**: Hierarchically cluster at each level (10-20 clusters at L1, 3-20 at L2-L5) using cosine distance
3. **Intelligent Naming**: Generate category names via weighted word frequency + lemmatization + bundle word logic
4. **Quality Control**: Exclude all ancestor words (parent, grandparent, etc.), avoid ancestor path duplicates, clean special characters
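The decay schedule in step 1 can be sketched in a few lines (a minimal illustration, not the pipeline's actual code; the helper name is hypothetical):

```python
# Level i gets weight 0.6**(i-1); rounding keeps the printed values tidy.
def level_weights(n_levels: int, decay: float = 0.6) -> list[float]:
    return [round(decay ** i, 4) for i in range(n_levels)]

print(level_weights(5))  # [1.0, 0.6, 0.36, 0.216, 0.1296]
```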
## Output
DataFrame with added columns:
- `unified_level_1`: Top-level category (e.g., "electronic | device")
- `unified_level_2`: Second-level category (e.g., "computer | laptop")
- `unified_level_3` through `unified_level_N`: Deeper levels
Category names use the ` | ` separator, contain at most 5 words, and cover 70%+ of records in each cluster.
## Installation
```bash
pip install pandas numpy scipy sentence-transformers nltk tqdm
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
```
## 4-Step Pipeline
### Step 1: Load, Standardize, Filter and Merge (`step1_preprocessing_and_merge.py`)
- **Input**: List of (DataFrame, source_name) tuples, each with a `category_path` column
- **Process**: Per-source deduplication; text cleaning (strip special characters such as `&`, `,`, `'`, `-`, and quotes, drop connector words like "and", lemmatize words as nouns); normalize the delimiter to ` > `; depth filtering; prefix removal; then merge all sources. The `source_level_*` columns should reflect the cleaned path segments.
- **Output**: Merged DataFrame with `category_path`, `source`, `depth`, `source_level_1` through `source_level_N`
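A minimal sketch of the cleaning and delimiter normalization described above. The real step uses NLTK's `WordNetLemmatizer`; here a naive plural strip stands in so the snippet runs without NLTK data, and both function names are hypothetical:

```python
import re

def naive_lemma(word: str) -> str:
    # Stand-in for NLTK's WordNetLemmatizer in noun mode: strip a trailing plural "s".
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def clean_segment(segment: str) -> str:
    text = re.sub(r"[&/,'\"-]", " ", segment.lower())    # strip special characters
    words = [w for w in text.split() if w != "and"]      # drop connector words
    return " ".join(naive_lemma(w) for w in words)

def normalize_path(path: str) -> str:
    # Split on common source delimiters and rejoin with the canonical " > ".
    parts = re.split(r"\s*(?:->|>|\|)\s*", path)
    return " > ".join(clean_segment(p) for p in parts if p.strip())

print(normalize_path("Electronics -> Computers & Laptops"))
# electronic > computer laptop
```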
### Step 2: Weighted Embeddings (`step2_weighted_embedding_generation.py`)
- **Input**: DataFrame from Step 1
- **Output**: Numpy embedding matrix (n_records × 384)
- Weights: L1=1.0, L2=0.6, L3=0.36, L4=0.216, L5=0.1296 (exponential decay 0.6^(n-1))
- **Performance**: For ~10,000 records, expect 2-5 minutes; a progress bar shows encoding status.
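The weighted combination can be sketched as follows, assuming a 384-dimensional encoder (e.g. `all-MiniLM-L6-v2` from sentence-transformers); the `embed` stub uses fixed-seed random vectors so the sketch runs without downloading a model:

```python
import numpy as np

DECAY = 0.6
DIM = 384

def embed(texts: list[str]) -> np.ndarray:
    # Stand-in for SentenceTransformer("all-MiniLM-L6-v2").encode(texts);
    # fixed-seed random vectors keep the sketch runnable without the model.
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(texts), DIM))

def weighted_path_embedding(levels: list[str]) -> np.ndarray:
    # Combine per-level embeddings with weights 0.6**(i-1), then L2-normalize.
    vecs = embed(levels)
    weights = np.array([DECAY ** i for i in range(len(levels))])
    combined = (weights[:, None] * vecs).sum(axis=0)
    return combined / np.linalg.norm(combined)

emb = weighted_path_embedding(["electronics", "computers", "laptops"])
print(emb.shape)  # (384,)
```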
### Step 3: Recursive Clustering (`step3_recursive_clustering_naming.py`)
- **Input**: DataFrame + embeddings from Step 2
- **Output**: Assignments dict {index → {level_1: ..., level_5: ...}}
- Average linkage + cosine distance, 10-20 clusters at L1, 3-20 at L2-L5
- Word-based naming: weighted frequency + lemmatization + coverage ≥70%
- **Performance**: For ~10,000 records, expect 1-3 minutes for hierarchical clustering and naming across the recursive levels.
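One level of this clustering can be sketched with SciPy (random vectors stand in for the Step 2 embedding matrix; the real pipeline then recurses into each resulting cluster for levels 2-5):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def cluster_level(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    # Average linkage over cosine distance, cut to at most n_clusters groups.
    Z = linkage(embeddings, method="average", metric="cosine")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# Random vectors stand in for the Step 2 embedding matrix.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 384))
labels = cluster_level(X, n_clusters=10)
print(labels.shape)  # (100,)
```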
### Step 4: Export Results (`step4_result_assignments.py`)
- **Input**: DataFrame + assignments from Step 3
- **Output**:
- `unified_taxonomy_full.csv` - all records with unified categories
- `unified_taxonomy_hierarchy.csv` - unique taxonomy structure
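A sketch of the export step, assuming the Step 3 assignments dict shape `{index -> {"level_1": ..., ...}}`; the column and file names follow the spec above, while the helper itself is hypothetical:

```python
import pandas as pd

def export_results(df: pd.DataFrame, assignments: dict, n_levels: int = 5) -> None:
    # Attach unified_level_* columns from the Step 3 assignments dict,
    # then write the full per-record CSV and the deduplicated hierarchy CSV.
    for lvl in range(1, n_levels + 1):
        df[f"unified_level_{lvl}"] = df.index.map(
            lambda i, lvl=lvl: assignments.get(i, {}).get(f"level_{lvl}")
        )
    df.to_csv("unified_taxonomy_full.csv", index=False)
    level_cols = [f"unified_level_{lvl}" for lvl in range(1, n_levels + 1)]
    df[level_cols].drop_duplicates().to_csv("unified_taxonomy_hierarchy.csv", index=False)
```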
## Usage
**Use `scripts/pipeline.py` to run the complete 4-step workflow.**
See `scripts/pipeline.py` for:
- Complete implementation of all 4 steps
- Example code for processing multiple sources
- Command-line interface
- Individual step usage (for advanced control)
## Overview
This skill builds a unified multi-level product category taxonomy from hierarchical category paths collected across e-commerce sources. It clusters semantically similar paths using embedding-based recursive clustering and generates compact, human-readable category names. The result is a clean N-level hierarchy (typically up to 5 levels) suitable for analysis and metric alignment across platforms.
The pipeline standardizes and merges source category paths, converts them to weighted embeddings with exponentially decaying level weights, and performs hierarchical clustering per level using cosine distance and average linkage. It then names clusters via weighted word frequency, lemmatization, and bundle-word logic while excluding ancestor words and cleaning noise. Final outputs are per-record unified level assignments and a deduplicated taxonomy hierarchy.
## FAQ
**How many levels does the algorithm support?**
The pipeline is designed for up to 5 levels by default but can be adapted to other depths by adjusting preprocessing and clustering parameters.
**What embedding model and weights are used?**
Embeddings are 384-dimensional, and levels are weighted exponentially by 0.6^(level-1), giving L1=1.0, L2=0.6, L3=0.36, etc., to emphasize higher-level semantics.
**How are cluster names generated reliably?**
Names use weighted word frequency, lemmatization, and bundle-word logic; they exclude ancestor words, are limited to about five words, and must meet a coverage threshold (≥70%) to ensure representative, concise labels.
**What outputs should I expect?**
A full per-record CSV with `unified_level_1`..`unified_level_N` columns and a separate CSV of the unique taxonomy hierarchy, suitable for downstream reporting and integrations.