
This skill generates a unified multi-level taxonomy from hierarchical product paths by embedding, clustering, and intelligent naming, for consistent category analysis across sources.

npx playbooks add skill benchflow-ai/skillsbench --skill hierarchical-taxonomy-clustering

Review the files below or copy the command above to add this skill to your agents.

Files (6)
SKILL.md
3.8 KB
---
name: hierarchical-taxonomy-clustering
description: Build a unified multi-level category taxonomy from hierarchical product category paths across e-commerce companies using embedding-based recursive clustering with intelligent category naming via weighted word frequency analysis.
---

# Hierarchical Taxonomy Clustering

Create a unified multi-level taxonomy from hierarchical category paths by clustering similar paths and automatically generating meaningful category names.

## Problem

Given category paths from multiple sources (e.g., "electronics -> computers -> laptops"), create a unified taxonomy that groups similar paths across sources, generates meaningful category names, and produces a clean N-level hierarchy (typically 5 levels). The unified taxonomy can then be used for analysis or metric tracking on products from different platforms.

## Methodology

1. **Hierarchical Weighting**: Convert paths to embeddings with exponentially decaying weights (level i gets weight 0.6^(i-1)), so broader, higher-level categories carry more weight than fine-grained ones (the weights are worked out after this list)
2. **Recursive Clustering**: Hierarchically cluster at each level (10-20 clusters at L1, 3-20 at L2-L5) using cosine distance
3. **Intelligent Naming**: Generate category names via weighted word frequency + lemmatization + bundle word logic
4. **Quality Control**: Exclude all ancestor words (parent, grandparent, etc.), avoid ancestor path duplicates, clean special characters
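
Worked out, the decay gives the per-level weights below (matching the values listed under Step 2):

```python
# Level i gets weight 0.6 ** (i - 1)
weights = [0.6 ** (i - 1) for i in range(1, 6)]
# -> [1.0, 0.6, 0.36, 0.216, 0.1296]: L1 dominates, deeper levels contribute less
```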

## Output

DataFrame with added columns:
- `unified_level_1`: Top-level category (e.g., "electronic | device")
- `unified_level_2`: Second-level category (e.g., "computer | laptop")
- `unified_level_3` through `unified_level_N`: Deeper levels

Category names use ` | ` separator, max 5 words, covering 70%+ of records in each cluster.
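
For example (illustrative, combining the sample path and category names above), a laptop record gains columns like:

| category_path | unified_level_1 | unified_level_2 |
| --- | --- | --- |
| electronics > computers > laptops | electronic \| device | computer \| laptop |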

## Installation

```bash
pip install pandas numpy scipy sentence-transformers nltk tqdm
python -c "import nltk; nltk.download('wordnet'); nltk.download('omw-1.4')"
```

## 4-Step Pipeline

### Step 1: Load, Standardize, Filter and Merge (`step1_preprocessing_and_merge.py`)
- **Input**: List of (DataFrame, source_name) tuples, each with a `category_path` column
- **Process**: Per-source deduplication, text cleaning (remove special characters such as `&`, `,`, `'`, `-`, and quotes, drop filler words like "and", lemmatize words as nouns), normalize the delimiter to ` > `, depth filtering, prefix removal, then merge all sources. The `source_level_*` columns should reflect the processed version of the source level names (a sketch follows this step)
- **Output**: Merged DataFrame with `category_path`, `source`, `depth`, `source_level_1` through `source_level_N`
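
A minimal sketch of this step, assuming the cleaning rules described above (the function names, delimiter regex, and exact filler-word list are illustrative; `step1_preprocessing_and_merge.py` is authoritative):

```python
import re

import pandas as pd
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def clean_segment(segment: str) -> str:
    """Lowercase, strip special characters and filler words, lemmatize as nouns."""
    segment = re.sub(r"[&,'\"\-]", " ", segment.lower())
    words = [w for w in segment.split() if w != "and"]
    return " ".join(lemmatizer.lemmatize(w, pos="n") for w in words)

def preprocess_source(df: pd.DataFrame, source_name: str, max_depth: int = 5) -> pd.DataFrame:
    df = df.drop_duplicates(subset="category_path").copy()
    # Split on common delimiters ("->", ">", "/"), clean each segment, cap the depth
    parts = (df["category_path"]
             .str.split(r"\s*(?:->|>|/)\s*", regex=True)
             .apply(lambda segs: [clean_segment(s) for s in segs if s][:max_depth]))
    df["category_path"] = parts.str.join(" > ")
    df["source"] = source_name
    df["depth"] = parts.str.len()
    for i in range(max_depth):
        df[f"source_level_{i + 1}"] = parts.apply(lambda p, i=i: p[i] if len(p) > i else None)
    return df
```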

### Step 2: Weighted Embeddings (`step2_weighted_embedding_generation.py`)
- **Input**: DataFrame from Step 1
- **Output**: Numpy embedding matrix (n_records × 384)
- Weights: L1=1.0, L2=0.6, L3=0.36, L4=0.216, L5=0.1296 (exponential decay 0.6^(n-1))
- **Performance**: For ~10,000 records, expect 2-5 minutes; a progress bar shows encoding status.
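
A minimal sketch of the weighting, assuming a weighted average over the levels each record has; the model name is an assumption chosen only because it matches the documented 384-dim output:

```python
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 outputs 384-dim embeddings, matching the documented shape;
# the actual model used by step2 may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

def weighted_path_embeddings(df: pd.DataFrame, max_depth: int = 5, decay: float = 0.6) -> np.ndarray:
    emb = np.zeros((len(df), 384))
    total = np.zeros(len(df))
    for level in range(1, max_depth + 1):
        col = df[f"source_level_{level}"]
        mask = col.notna().to_numpy()
        if not mask.any():
            continue
        vecs = model.encode(col[mask].tolist(), show_progress_bar=True)
        w = decay ** (level - 1)  # L1=1.0, L2=0.6, L3=0.36, L4=0.216, L5=0.1296
        emb[mask] += w * vecs
        total[mask] += w
    # Weighted average over the levels actually present (every record has at least L1)
    return emb / total[:, None]
```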

### Step 3: Recursive Clustering (`step3_recursive_clustering_naming.py`)
- **Input**: DataFrame + embeddings from Step 2
- **Output**: Assignments dict {index → {level_1: ..., level_5: ...}}
- Average linkage + cosine distance, 10-20 clusters at L1, 3-20 at L2-L5
- Word-based naming: weighted frequency + lemmatization + coverage ≥70%
- **Performance**: For ~10,000 records, expect 1-3 minutes for hierarchical clustering and naming. Be patient - the system is working through recursive levels.
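
A minimal sketch of the recursive pass; the cluster-count heuristic (kept inside the documented ranges) and `name_cluster` are placeholders for the logic in `step3_recursive_clustering_naming.py`:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_labels(emb: np.ndarray, n_clusters: int) -> np.ndarray:
    """Average-linkage agglomerative clustering on cosine distance."""
    Z = linkage(pdist(emb, metric="cosine"), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

def recurse(indices: np.ndarray, emb: np.ndarray, level: int,
            assignments: dict, max_level: int = 5) -> None:
    if level > max_level or len(indices) < 3:
        return
    # Illustrative cluster counts inside the 10-20 (L1) / 3-20 (L2-L5) ranges
    k = min(len(indices), 15 if level == 1 else 10)
    labels = cluster_labels(emb[indices], k)
    for c in np.unique(labels):
        members = indices[labels == c]
        # name_cluster stands in for the weighted word-frequency naming with
        # lemmatization, ancestor-word exclusion, and the >=70% coverage check
        name = name_cluster(members)
        for i in members:
            assignments.setdefault(int(i), {})[f"level_{level}"] = name
        recurse(members, emb, level + 1, assignments)
```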

### Step 4: Export Results (`step4_result_assignments.py`)
- **Input**: DataFrame + assignments from Step 3
- **Output**:
  - `unified_taxonomy_full.csv` - all records with unified categories
  - `unified_taxonomy_hierarchy.csv` - unique taxonomy structure
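
A minimal sketch of the export, using the column and file names documented above:

```python
import pandas as pd

def export_results(df: pd.DataFrame, assignments: dict, max_level: int = 5) -> None:
    out = df.reset_index(drop=True).copy()
    # Per-record unified level columns pulled from the step 3 assignments dict
    for level in range(1, max_level + 1):
        out[f"unified_level_{level}"] = [
            assignments.get(i, {}).get(f"level_{level}") for i in range(len(out))
        ]
    out.to_csv("unified_taxonomy_full.csv", index=False)
    # Deduplicated taxonomy structure
    level_cols = [f"unified_level_{l}" for l in range(1, max_level + 1)]
    out[level_cols].drop_duplicates().to_csv("unified_taxonomy_hierarchy.csv", index=False)
```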

## Usage

**Use `scripts/pipeline.py` to run the complete 4-step workflow.**

See `scripts/pipeline.py` for:
- Complete implementation of all 4 steps
- Example code for processing multiple sources
- Command-line interface
- Individual step usage (for advanced control)
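
For orientation, the sketches above chain together roughly like this (illustrative only; `amazon_df` and `ebay_df` are hypothetical inputs, and the real entry point with its arguments lives in `scripts/pipeline.py`):

```python
import numpy as np
import pandas as pd

sources = [(amazon_df, "amazon"), (ebay_df, "ebay")]  # hypothetical source frames

merged = pd.concat(
    [preprocess_source(df, name) for df, name in sources], ignore_index=True
)
embeddings = weighted_path_embeddings(merged)

assignments: dict = {}
recurse(np.arange(len(merged)), embeddings, level=1, assignments=assignments)
export_results(merged, assignments)
```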

Overview

This skill builds a unified multi-level product category taxonomy from hierarchical category paths collected across e-commerce sources. It clusters semantically similar paths using embedding-based recursive clustering and generates compact, human-readable category names. The result is a clean N-level hierarchy (typically up to 5 levels) suitable for analysis and metric alignment across platforms.

How this skill works

The pipeline standardizes and merges source category paths, converts them to weighted embeddings with exponentially decaying level weights, and performs hierarchical clustering per level using cosine distance and average linkage. It then names clusters via weighted word frequency, lemmatization, and bundle-word logic while excluding ancestor words and cleaning noise. Final outputs are per-record unified level assignments and a deduplicated taxonomy hierarchy.

When to use it

  • You need a single taxonomy to compare metrics across marketplaces or vendors.
  • Raw category paths come from many different sources and have inconsistent depth or phrasing.
  • You want automated, scalable creation of a multi-level hierarchy (up to 5 levels).
  • You need interpretable category labels generated from actual source terms.
  • You want a repeatable pipeline to refresh taxonomy as source data changes.

Best practices

  • Pre-clean source data: remove non-informative tokens and normalize delimiters before merging.
  • Limit path depth to the intended maximum (commonly 5) to control embedding complexity.
  • Tune top-level cluster count (10–20) and subsequent ranges (3–20) to match catalog breadth.
  • Verify cluster coverage: ensure generated names cover ≥70% of records in each cluster and adjust naming thresholds if necessary.
  • Run quality checks to remove ancestor duplicates, special-character noise, and low-coverage labels.

Example use cases

  • Consolidate product categories from multiple marketplaces into one taxonomy for cross-platform sales analysis.
  • Normalize partner feeds so downstream analytics and recommendations use consistent category IDs.
  • Create a reporting taxonomy for product performance dashboards spanning diverse vendors.
  • Build a training set of consistent category labels for downstream classification models.
  • Regularly re-cluster and rename categories as new product types emerge in source data.

FAQ

How many levels does the algorithm support?

The pipeline is designed for up to 5 levels by default but can be adapted to other N levels by adjusting preprocessing and clustering parameters.

What embedding model and weights are used?

Embeddings are 384-dimensional sentence embeddings, and levels are weighted exponentially with 0.6^(level-1), giving L1=1.0, L2=0.6, L3=0.36, etc., to emphasize higher-level semantics.

How are cluster names generated reliably?

Names use weighted word frequency, lemmatization, and bundle-word logic, exclude ancestor words, limit to about five words, and require coverage thresholds (≥70%) to ensure representative, concise labels.

What outputs should I expect?

A full per-record CSV with unified_level_1..N columns and a separate CSV of the unique taxonomy hierarchy suitable for downstream reporting and integrations.