fuzzy-match skill

/tasks/invoice-fraud-detection/environment/skills/fuzzy-match

This skill helps you reconcile data by applying fuzzy matching techniques to identify near matches across datasets.

npx playbooks add skill benchflow-ai/skillsbench --skill fuzzy-match

---
name: fuzzy-match
description: A toolkit for fuzzy string matching and data reconciliation. Useful for matching entity names (companies, people) across different datasets where spelling variations, typos, or formatting differences exist.
license: MIT
---

# Fuzzy Matching Guide

## Overview

This skill provides methods to compare strings and find the best matches using sequence-matching ratios (`difflib`) and edit-distance-based similarity metrics (`rapidfuzz`). It is essential when joining datasets on string keys that are not identical.

## Quick Start

```python
from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similarity("Apple Inc.", "Apple Incorporated"))
# Output: 0.6428...
```

## Python Libraries

### difflib (Standard Library)

The `difflib` module provides classes and functions for comparing sequences.

#### Basic Similarity

```python
from difflib import SequenceMatcher

def get_similarity(str1, str2):
    """Returns a ratio between 0 and 1."""
    return SequenceMatcher(None, str1, str2).ratio()

# Example
s1 = "Acme Corp"
s2 = "Acme Corporation"
print(f"Similarity: {get_similarity(s1, s2)}")
# Output: Similarity: 0.72
```

#### Finding Best Match in a List

```python
from difflib import get_close_matches

word = "appel"
possibilities = ["ape", "apple", "peach", "puppy"]
matches = get_close_matches(word, possibilities, n=1, cutoff=0.6)
print(matches)
# Output: ['apple']
```

### rapidfuzz (Recommended for Performance)

If `rapidfuzz` is available (`pip install rapidfuzz`), it is much faster than `difflib` and offers more metrics. Its scores range from 0 to 100 rather than 0 to 1.

```python
from rapidfuzz import fuzz, process, utils

# Simple Ratio
score = fuzz.ratio("this is a test", "this is a test!")
print(score)

# Partial Ratio (good for substrings)
score = fuzz.partial_ratio("this is a test", "this is a test!")
print(score)

# Extraction: returns a (match, score, index) tuple.
# rapidfuzz 3.x no longer preprocesses strings by default, so pass a
# processor to lowercase and strip punctuation before comparing.
choices = ["Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys"]
best_match = process.extractOne("new york jets", choices, processor=utils.default_process)
print(best_match)
# Output: ('New York Jets', 100.0, 1)
```

## Common Patterns

### Normalization before Matching

Always normalize strings before comparing to improve accuracy.

```python
import re

def normalize(text):
    # Convert to lowercase
    text = text.lower()
    # Remove special characters
    text = re.sub(r'[^\w\s]', '', text)
    # Normalize whitespace
    text = " ".join(text.split())
    # Standardize common suffixes (whole words only, so "unlimited" is left alone)
    text = re.sub(r'\blimited\b', 'ltd', text)
    text = re.sub(r'\bcorporation\b', 'corp', text)
    return text

s1 = "Acme  Corporation, Inc."
s2 = "acme corp inc"
print(normalize(s1) == normalize(s2))
# Output: True
```

### Entity Resolution

When matching a list of dirty names to a clean database:

```python
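from difflib import get_close_matches

clean_names = ["Google LLC", "Microsoft Corp", "Apple Inc"]
dirty_names = ["google", "Microsft", "Apple"]

results = {}
for dirty in dirty_names:
    # Simple case-insensitive containment check first
    match = None
    for clean in clean_names:
        if dirty.lower() in clean.lower():
            match = clean
            break

    # Fall back to fuzzy matching when no clean name contains the dirty one
    if match is None:
        matches = get_close_matches(dirty, clean_names, n=1, cutoff=0.6)
        if matches:
            match = matches[0]

    # Unmatched names map to None
    results[dirty] = match

print(results)
# Output: {'google': 'Google LLC', 'Microsft': 'Microsoft Corp', 'Apple': 'Apple Inc'}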
```

Overview

This skill is a toolkit for fuzzy string matching and data reconciliation that helps match entity names across datasets despite typos, abbreviations, or formatting differences. It exposes simple similarity measures and extraction utilities to find best matches and reconcile messy keys. The goal is reliable, scalable matching for record linkage and entity resolution workflows.

How this skill works

The skill computes similarity scores between strings using edit-distance-based ratios and sequence matching, and can extract the closest match from a list. It encourages normalization steps (lowercasing, punctuation removal, abbreviation mapping) before comparison to boost accuracy. For performance, it recommends `rapidfuzz` when it is installed and falls back to the standard library's `difflib` otherwise.
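
The fallback behavior described above could be sketched as follows. This is a minimal illustration rather than the skill's own code: the `similarity` wrapper is a hypothetical name, and rescaling `fuzz.ratio` from 0-100 to 0-1 is an assumption made so that both backends return comparable values.

```python
from difflib import SequenceMatcher

try:
    # Optional fast backend; its scores are on a 0-100 scale.
    from rapidfuzz import fuzz

    def similarity(a, b):
        """Hypothetical wrapper: similarity in [0, 1] via rapidfuzz."""
        return fuzz.ratio(a, b) / 100.0
except ImportError:
    def similarity(a, b):
        """Hypothetical wrapper: similarity in [0, 1] via difflib."""
        return SequenceMatcher(None, a, b).ratio()

print(similarity("Acme Corp", "Acme Corporation"))
```

The two backends use different algorithms, so their scores can differ for the same pair. A threshold tuned with one backend should be rechecked before it is used with the other.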

When to use it

- Joining datasets on name fields that are not exact matches
- Cleaning CRM or vendor lists that contain typos and variants
- Resolving multiple spelling variants of person or company names
- Preprocessing data before deduplication or merging
- Creating a best-effort mapping between a dirty list and a canonical database

Best practices

- Normalize strings first: lowercase, strip punctuation, normalize whitespace, and expand or standardize common abbreviations
- Prefer a fast library such as `rapidfuzz` for large datasets and batch matching; use the standard library's `difflib` for small datasets or prototypes
- Combine simple containment checks and domain-specific rules with fuzzy scoring to reduce false positives
- Tune similarity thresholds based on sample data and consider human review for borderline matches
- Log scores and candidate matches so reconciliation rules can be audited and improved
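
One way to act on the last two practices is to keep the top few scored candidates for each query and sort every match into an accept, review, or reject band. The sketch below uses only `difflib`. The helper names `rank_candidates` and `classify` and the band boundaries 0.85 and 0.6 are illustrative assumptions, not values taken from this skill.

```python
from difflib import SequenceMatcher

def rank_candidates(query, choices, k=3):
    """Return the k highest-scoring (choice, score) pairs, best first."""
    scored = [
        (choice, SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        for choice in choices
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

def classify(score, accept=0.85, review=0.6):
    """Map a score to a decision band (thresholds are assumed, tune on your data)."""
    if score >= accept:
        return "accept"
    if score >= review:
        return "review"
    return "reject"

clean_names = ["Google LLC", "Microsoft Corp", "Apple Inc"]
for dirty in ["Microsft", "Apple"]:
    candidates = rank_candidates(dirty, clean_names)
    best_name, best_score = candidates[0]
    # Log every candidate with its score so the decision can be audited later.
    print(dirty, classify(best_score), [(n, round(s, 2)) for n, s in candidates])
```

Keeping the runner-up candidates in the log makes it easier to see whether a borderline match was a close call between two plausible names.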

Example use cases

- Match a list of customer-entered company names to a master vendor table to populate vendor IDs
- Detect duplicate person records in a CRM where names have typos or nickname variations
- Map product names from supplier catalogs to internal SKUs when naming conventions differ
- Auto-suggest best matches in a data-cleaning UI with confidence scores
- Pre-filter candidate matches with quick checks, then apply fuzzy scoring for final selection
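
As an illustration of the first use case, the sketch below populates vendor IDs from free-text names. The vendor table, the IDs, and the `lookup_vendor_id` helper are all hypothetical, and the 0.6 cutoff is only a starting assumption.

```python
import re
from difflib import get_close_matches

# Hypothetical master table: canonical vendor name -> vendor ID.
vendor_ids = {"Acme Corp": "V001", "Globex Ltd": "V002", "Initech Inc": "V003"}

def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

# Index canonical names by their normalized form.
normalized_index = {normalize(name): name for name in vendor_ids}

def lookup_vendor_id(raw_name, cutoff=0.6):
    """Return the ID of the closest vendor, or None if nothing clears the cutoff."""
    key = normalize(raw_name)
    matches = get_close_matches(key, list(normalized_index), n=1, cutoff=cutoff)
    if not matches:
        return None
    return vendor_ids[normalized_index[matches[0]]]

for entered in ["ACME Corp.", "Initech", "Zzzz Qqqq"]:
    print(entered, "->", lookup_vendor_id(entered))
```

Returning `None` for names that do not clear the cutoff keeps unmatched rows visible, so they can be routed to manual review instead of receiving a wrong ID.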

FAQ

Which algorithm should I use?

Start with a simple sequence ratio (`difflib.SequenceMatcher`) for small datasets; switch to an optimized library such as `rapidfuzz` for speed and additional metrics on larger collections.

How do I choose a cutoff threshold?

Evaluate on a labeled sample: pick a threshold that balances precision and recall for your use case and add review for borderline scores.
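
The evaluation described above can be sketched as a sweep over candidate thresholds, reporting precision and recall at each one. The labeled pairs below are made-up examples and `precision_recall` is a hypothetical helper. Returning 1.0 when a denominator is zero is a convention chosen for this sketch.

```python
from difflib import SequenceMatcher

# Hypothetical labeled sample: (name_a, name_b, is_true_match).
labeled_pairs = [
    ("Acme Corp", "Acme Corporation", True),
    ("Microsft", "Microsoft", True),
    ("Apple Inc", "Apple Incorporated", True),
    ("Apple Inc", "Applied Materials", False),
    ("Google LLC", "Goodyear", False),
]

def score(a, b):
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def precision_recall(pairs, threshold):
    """Treat score >= threshold as a predicted match and compare to the labels."""
    tp = fp = fn = 0
    for a, b, is_match in pairs:
        predicted = score(a, b) >= threshold
        if predicted and is_match:
            tp += 1
        elif predicted and not is_match:
            fp += 1
        elif not predicted and is_match:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

for threshold in (0.5, 0.6, 0.7, 0.8, 0.9):
    p, r = precision_recall(labeled_pairs, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
```

A real sample should be much larger than five pairs and should include the near-miss cases that matter for the use case, since those are the ones that move the threshold.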