
This skill helps you define and apply custom distance metrics for clustering and ML tasks, enabling tailored similarity measures in sklearn and scipy.

`npx playbooks add skill benchflow-ai/skillsbench --skill custom-distance-metrics`

Review `SKILL.md` below or copy the command above to add this skill to your agents.

---
name: custom-distance-metrics
description: Define custom distance/similarity metrics for clustering and ML algorithms. Use when working with DBSCAN, sklearn, or scipy distance functions with application-specific metrics.
---

# Custom Distance Metrics

Custom distance metrics allow you to define application-specific notions of similarity or distance between data points.

## Defining Custom Metrics for sklearn

sklearn's DBSCAN accepts a callable as the `metric` parameter:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def my_distance(point_a, point_b):
    """Custom distance between two points (each a 1D array)."""
    # Replace with your own calculation; plain Euclidean distance is
    # shown here so the snippet runs as-is. Return a non-negative float.
    return np.sqrt(np.sum((point_a - point_b)**2))

db = DBSCAN(eps=5, min_samples=3, metric=my_distance)
```
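
Continuing the snippet above (it assumes the `db` and `my_distance` just defined), here is a hypothetical fit; the array `X` is made up for illustration:

```python
import numpy as np

# Hypothetical data: three nearby points and one distant outlier
X = np.array([[0.0, 0.0], [1.0, 1.0], [1.5, 0.5], [50.0, 50.0]])

db.fit(X)          # my_distance is evaluated on pairs of rows from X
print(db.labels_)  # -> [ 0  0  0 -1]; the outlier is labeled noise (-1)
```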

## Parameterized Distance Functions

To configure a distance function with parameters, wrap the logic in a closure or factory function:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def create_weighted_distance(weight_x, weight_y):
    """Create a distance function with specific weights."""
    def distance(a, b):
        # Weighted Euclidean distance over two dimensions
        dx = a[0] - b[0]
        dy = a[1] - b[1]
        return np.sqrt((weight_x * dx)**2 + (weight_y * dy)**2)
    return distance

# Create distances with different weights
dist_equal = create_weighted_distance(1.0, 1.0)
dist_x_heavy = create_weighted_distance(2.0, 0.5)

# Use with DBSCAN
db = DBSCAN(eps=10, min_samples=3, metric=dist_x_heavy)
```
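
If you prefer the standard library over a hand-rolled closure, `functools.partial` achieves the same binding; a sketch equivalent to the factory above:

```python
from functools import partial

import numpy as np
from sklearn.cluster import DBSCAN

def weighted_distance(a, b, weight_x=1.0, weight_y=1.0):
    """Weighted Euclidean distance; the weights are bound via partial."""
    dx = a[0] - b[0]
    dy = a[1] - b[1]
    return np.sqrt((weight_x * dx)**2 + (weight_y * dy)**2)

# Bind the weights up front; the result is a plain two-argument callable
dist_x_heavy = partial(weighted_distance, weight_x=2.0, weight_y=0.5)
db = DBSCAN(eps=10, min_samples=3, metric=dist_x_heavy)
```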

## Example: Manhattan Distance with Parameter

Manhattan distance (the L1 norm) can be parameterized with a scale factor:

```python
def create_manhattan_distance(scale=1.0):
    """
    Manhattan distance with optional scaling.
    Measures distance as sum of absolute differences.
    This is just one example - you can design custom metrics for your specific needs.
    """
    def distance(a, b):
        return scale * (abs(a[0] - b[0]) + abs(a[1] - b[1]))
    return distance

# Use with DBSCAN
manhattan_metric = create_manhattan_distance(scale=1.5)
db = DBSCAN(eps=10, min_samples=3, metric=manhattan_metric)
```


## Using scipy.spatial.distance

For computing distance matrices efficiently:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

# Custom distance for cdist (called with pairs of 1D arrays)
def custom_metric(u, v):
    return np.sqrt(np.sum((u - v)**2))

# Distance matrix between two sets of points (rows are points)
dist_matrix = cdist(points_a, points_b, metric=custom_metric)

# Condensed pairwise distances within one set...
pairwise = pdist(points, metric=custom_metric)
# ...expanded into a square symmetric matrix
dist_matrix = squareform(pairwise)
```

## Performance Considerations

- Custom Python metrics are slower than built-in metrics because every point pair triggers a Python-level call
- For large datasets, vectorize the operations inside the metric
- Pre-compute distance matrices when doing multiple lookups, as shown below
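
A sketch of the precompute-and-reuse pattern (the point array is hypothetical): DBSCAN accepts a square distance matrix directly via `metric='precomputed'`, so the Python-level metric is paid only once.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

points = np.random.rand(200, 2)  # hypothetical data

# Pay the custom-metric cost once...
dist_matrix = squareform(pdist(points, metric=lambda u, v: np.abs(u - v).sum()))

# ...then reuse the matrix across multiple clustering runs
for eps in (0.05, 0.1, 0.2):
    labels = DBSCAN(eps=eps, min_samples=3, metric="precomputed").fit_predict(dist_matrix)
```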

## Overview

This skill defines custom distance and similarity metrics for clustering and machine learning workflows. It provides patterns for plug-in metrics in scikit-learn (e.g., DBSCAN) and for efficient distance matrix computation with SciPy. Use parameterized factories to customize behavior for application-specific similarity notions.

## How this skill works

You supply a callable that computes the distance between two 1D data points and pass it to algorithms that accept custom metrics (for example, DBSCAN's `metric` parameter). For reusable configuration, the skill shows how to build factory functions or closures that capture weights or scaling factors. For batch operations, it demonstrates SciPy functions like `cdist`/`pdist` and `squareform` to compute pairwise matrices efficiently.

## When to use it

- When default metrics (Euclidean, cosine) do not reflect domain similarity.
- When clustering with DBSCAN or other algorithms that accept a callable metric.
- When you need parameterized metrics (weighted dimensions, scaling).
- When precomputing pairwise distances improves repeated queries.
- When combining heterogeneous features that require custom weighting.

## Best practices

- Prefer vectorized NumPy operations inside the metric to reduce Python overhead.
- Wrap configurable logic in a factory/closure so parameters are fixed at algorithm construction time.
- For large datasets, compute and reuse a distance matrix rather than calling the Python metric repeatedly.
- Benchmark custom metrics against built-ins; use C/Cython or numba if performance becomes critical.
- Validate metric properties (non-negativity, symmetry) if required by downstream algorithms; see the sketch below.
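
A minimal validation sketch (`check_metric` is illustrative, not a library helper): it spot-checks zero self-distance, non-negativity, and symmetry on random sample points.

```python
import numpy as np

def check_metric(metric, points, tol=1e-9):
    """Spot-check identity, non-negativity, and symmetry on sample points."""
    for a in points:
        assert abs(metric(a, a)) <= tol, "d(x, x) should be 0"
        for b in points:
            d = metric(a, b)
            assert d >= -tol, "distances should be non-negative"
            assert abs(d - metric(b, a)) <= tol, "metric should be symmetric"

rng = np.random.default_rng(0)
check_metric(lambda a, b: float(np.abs(a - b).sum()), rng.random((10, 2)))
```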

## Example use cases

- DBSCAN clustering on geospatial data with longitude/latitude scaled differently from altitude.
- Custom similarity for mixed-type records where numeric fields have domain-specific weights.
- Precomputing a pairwise distance matrix for repeated nearest-neighbor queries in a production pipeline.
- Creating a Manhattan (L1) metric with a tunable scale factor for outlier-sensitive clustering.
- Using SciPy `cdist` with a custom kernel for hybrid datasets before running hierarchical clustering.

## FAQ

**Can I pass a Python function directly to scikit-learn algorithms?**

Yes. Algorithms like DBSCAN accept a callable metric that takes two 1D arrays and returns a scalar distance.

**How do I tune parameters inside a custom metric?**

Use a factory function or closure that captures parameter values and returns a metric function to pass to the algorithm.

**Will a custom Python metric be slow on large datasets?**

Yes, pure Python callables add per-call overhead. Vectorize calculations, precompute distance matrices, or JIT-compile the metric with numba or a C extension (see the sketch below).
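
For example, a sketch assuming numba is installed: JIT-compiling the metric reduces the Python overhead of each call when SciPy invokes it pairwise.

```python
import numpy as np
from numba import njit
from scipy.spatial.distance import cdist

@njit
def fast_manhattan(u, v):
    # Compiled loop: no Python-object overhead per element
    total = 0.0
    for i in range(u.shape[0]):
        total += abs(u[i] - v[i])
    return total

points = np.random.rand(1000, 2)
dist_matrix = cdist(points, points, metric=fast_manhattan)
```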