
content-similarity-checker skill

/registry/terminal_bench_2.0/full_batch_reviewed/terminal_bench_2_0_mteb-retrieve/environment/skills/content-similarity-checker

This skill helps you detect content similarity across documents using multiple algorithms to identify duplicates, plagiarism, or related material.

npx playbooks add skill benchflow-ai/skillsbench --skill content-similarity-checker

SKILL.md
---
name: content-similarity-checker
description: Compare document similarity using TF-IDF, cosine similarity, and Jaccard index. Use for plagiarism detection, duplicate finding, or content matching.
---

# Content Similarity Checker

Compare documents and text for similarity using multiple algorithms.

## Features

- **Cosine Similarity**: TF-IDF based comparison
- **Jaccard Similarity**: Set-based comparison
- **Levenshtein Distance**: Edit distance for short texts
- **Batch Comparison**: Compare multiple documents
- **Similarity Matrix**: Pairwise comparison of all documents
- **Reports**: Detailed similarity reports

## Quick Start

```python
from similarity_checker import SimilarityChecker

checker = SimilarityChecker()

# Compare two texts
score = checker.compare(
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog"
)
print(f"Similarity: {score:.2%}")

# Compare documents
score = checker.compare_files("doc1.txt", "doc2.txt")
```

## CLI Usage

```bash
# Compare two texts
python similarity_checker.py --text1 "Hello world" --text2 "Hello there world"

# Compare two files
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt

# Compare all files in folder
python similarity_checker.py --folder ./documents/ --output matrix.csv

# Use specific algorithm
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --method jaccard

# Find similar documents (threshold)
python similarity_checker.py --folder ./documents/ --threshold 0.7

# JSON output
python similarity_checker.py --file1 doc1.txt --file2 doc2.txt --json
```

## API Reference

### SimilarityChecker Class

```python
class SimilarityChecker:
    def __init__(self, method: str = "cosine")

    # Text comparison
    def compare(self, text1: str, text2: str) -> float
    def compare_files(self, file1: str, file2: str) -> float
    def compare_with_details(self, text1: str, text2: str) -> dict

    # Multiple algorithms
    def compare_all_methods(self, text1: str, text2: str) -> dict

    # Batch comparison
    def compare_to_corpus(self, text: str, corpus: list) -> list
    def similarity_matrix(self, documents: list) -> pd.DataFrame
    def find_duplicates(self, documents: list, threshold: float = 0.8) -> list

    # Folder operations
    def compare_folder(self, folder: str, threshold: float = None) -> dict
    def find_most_similar(self, text: str, folder: str, top_n: int = 5) -> list

    # Report
    def generate_report(self, output: str) -> str
```

## Similarity Methods

### Cosine Similarity (Default)

Best for comparing documents of different lengths:

```python
checker = SimilarityChecker(method="cosine")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
```
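
As a rough illustration of why cosine similarity tolerates length differences, here is a minimal standard-library sketch. It is not the `similarity_checker` implementation: the `tokenize` helper and the use of raw term counts (rather than whatever weighting the module applies internally) are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text1: str, text2: str) -> float:
    """Cosine of the angle between two raw term-count vectors (illustrative only)."""
    a, b = Counter(tokenize(text1)), Counter(tokenize(text2))
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is treated as dissimilar to everything
    return dot / (norm_a * norm_b)
```

Because the score depends on the direction of the count vectors and not their magnitude, a text compared with a repeated copy of itself still scores close to 1.0.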

### Jaccard Similarity

Good for comparing sets of words/tokens:

```python
checker = SimilarityChecker(method="jaccard")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0
```
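
For reference, the Jaccard index is the size of the intersection of two token sets divided by the size of their union. The sketch below shows that calculation with the standard library; the tokenization rule and the handling of two empty texts are assumptions, not behavior confirmed for `similarity_checker`.

```python
import re


def jaccard_similarity(text1: str, text2: str) -> float:
    """Jaccard index over lowercase word sets (illustrative only)."""
    set1 = set(re.findall(r"[a-z0-9]+", text1.lower()))
    set2 = set(re.findall(r"[a-z0-9]+", text2.lower()))
    if not set1 and not set2:
        return 1.0  # assumption: two empty texts count as identical
    return len(set1 & set2) / len(set1 | set2)
```

Applied to the two Quick Start sentences, this sketch finds 4 shared words out of 12 distinct words. Word frequency and word order play no part in the result.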

### Levenshtein (Edit Distance)

Best for short texts and typo detection:

```python
checker = SimilarityChecker(method="levenshtein")
score = checker.compare(text1, text2)
# Returns: 0.0 to 1.0 (normalized)
```
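
One common way to turn an edit distance into a 0.0 to 1.0 score is to divide by the length of the longer string and subtract from 1. The sketch below assumes that normalization; the function names are hypothetical and the module may normalize differently.

```python
def levenshtein_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (c1 != c2),  # substitution (free if chars match)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(s1: str, s2: str) -> float:
    """Normalize the distance by the longer length (assumed normalization)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein_distance(s1, s2) / longest
```

The dynamic-programming table costs time proportional to the product of the two lengths, which is one reason this method suits short strings better than whole documents.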

### TF-IDF + Cosine

Weights each term by TF-IDF before computing cosine similarity, so rare, distinctive terms count for more than common ones:

```python
checker = SimilarityChecker(method="tfidf")
score = checker.compare(text1, text2)
```
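
To make the effect of term weighting concrete, here is a small standard-library sketch of TF-IDF followed by cosine similarity. The smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is an assumption chosen for the example, and `tfidf_vectors` and `cosine` are hypothetical helper names, not part of `similarity_checker`.

```python
import math
import re
from collections import Counter


def tfidf_vectors(documents: list[str]) -> list[dict[str, float]]:
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (assumed formula): rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [
        {t: count * idf[t] for t, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["the cat sat", "the cat ran", "stock prices fell"]
vectors = tfidf_vectors(docs)
related = cosine(vectors[0], vectors[1])    # share "the" and "cat"
unrelated = cosine(vectors[0], vectors[2])  # share no terms
```

In this toy corpus the first two documents share only terms that also appear elsewhere, so their TF-IDF score comes out lower than the raw-count cosine of 2/3 would be.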

## Batch Comparison

### Compare to Corpus

```python
checker = SimilarityChecker()

target = "Machine learning is a subset of artificial intelligence."
corpus = [
    "AI includes machine learning and deep learning.",
    "Python is a programming language.",
    "Neural networks power deep learning systems."
]

results = checker.compare_to_corpus(target, corpus)

# Returns:
[
    {"index": 0, "similarity": 0.65, "text": "AI includes..."},
    {"index": 2, "similarity": 0.42, "text": "Neural networks..."},
    {"index": 1, "similarity": 0.12, "text": "Python is..."}
]
```

### Similarity Matrix

```python
documents = [
    "Document one content...",
    "Document two content...",
    "Document three content..."
]

matrix = checker.similarity_matrix(documents)

# Returns DataFrame:
#          doc_0    doc_1    doc_2
# doc_0    1.000    0.750    0.320
# doc_1    0.750    1.000    0.410
# doc_2    0.320    0.410    1.000
```

### Find Duplicates

```python
documents = [...]  # List of texts

duplicates = checker.find_duplicates(documents, threshold=0.85)

# Returns:
[
    {"doc1_index": 0, "doc2_index": 3, "similarity": 0.92},
    {"doc1_index": 2, "doc2_index": 7, "similarity": 0.88}
]
```
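
Conceptually, duplicate finding scores every unordered pair of documents and keeps the pairs that meet the threshold. The sketch below illustrates that loop with a simple word-set Jaccard scorer standing in for the configured method; whether the real `find_duplicates` treats the threshold as inclusive, and how it orders results, are assumptions here.

```python
from itertools import combinations
from typing import Callable


def word_jaccard(a: str, b: str) -> float:
    """Stand-in scorer: Jaccard index over whitespace-separated lowercase words."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0


def find_duplicates_sketch(
    documents: list[str],
    threshold: float = 0.8,
    similarity: Callable[[str, str], float] = word_jaccard,
) -> list[dict]:
    """Return every pair whose score is at or above the threshold, best first."""
    pairs = []
    for (i, doc_i), (j, doc_j) in combinations(enumerate(documents), 2):
        score = similarity(doc_i, doc_j)
        if score >= threshold:  # assumption: threshold is inclusive
            pairs.append({"doc1_index": i, "doc2_index": j, "similarity": score})
    return sorted(pairs, key=lambda p: p["similarity"], reverse=True)
```

For n documents this performs n(n-1)/2 comparisons, so the cost grows quadratically with corpus size.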

## Compare All Methods

Get similarity scores from all algorithms:

```python
checker = SimilarityChecker()
results = checker.compare_all_methods(text1, text2)

# Returns:
{
    "cosine": 0.82,
    "jaccard": 0.65,
    "levenshtein": 0.71,
    "tfidf": 0.78,
    "average": 0.74
}
```

## Folder Operations

### Compare All Files in Folder

```python
checker = SimilarityChecker()
results = checker.compare_folder("./documents/")

# Returns:
{
    "files": ["doc1.txt", "doc2.txt", "doc3.txt"],
    "comparisons": 3,
    "similar_pairs": [
        {"file1": "doc1.txt", "file2": "doc3.txt", "similarity": 0.87}
    ],
    "matrix": <DataFrame>
}
```

### Find Most Similar to Query

```python
query = "Your search text here..."
results = checker.find_most_similar(query, "./documents/", top_n=5)

# Returns:
[
    {"file": "doc3.txt", "similarity": 0.89},
    {"file": "doc1.txt", "similarity": 0.72},
    ...
]
```

## Output Format

### Comparison Result

```python
result = checker.compare_with_details(text1, text2)

# Returns:
{
    "similarity": 0.82,
    "method": "cosine",
    "text1_length": 150,
    "text2_length": 180,
    "common_words": 25,
    "unique_words_text1": 10,
    "unique_words_text2": 15,
    "interpretation": "High similarity - likely related content"
}
```

## Example Workflows

### Plagiarism Check

```python
checker = SimilarityChecker()

# Read the submission and rank every source file by similarity to it
with open("student_paper.txt", encoding="utf-8") as f:
    submission = f.read()
results = checker.find_most_similar(submission, "./source_materials/", top_n=10)

# Keep only the sources above the plagiarism threshold
suspicious = [r for r in results if r["similarity"] > 0.6]

if suspicious:
    print(f"Warning: Found {len(suspicious)} potentially similar sources")
    for r in suspicious:
        print(f"  student_paper.txt matches {r['file']}: {r['similarity']:.0%}")
```

### Document Deduplication

```python
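from pathlib import Path

checker = SimilarityChecker()

# Load all documents, keyed by file name
docs = {}
for file in Path("./articles/").glob("*.txt"):
    docs[file.name] = file.read_text()

# Find near-duplicates; returned indices follow the insertion order of docs
names = list(docs)
duplicates = checker.find_duplicates(list(docs.values()), threshold=0.9)

print(f"Found {len(duplicates)} duplicate pairs")
for d in duplicates:
    print(f"  {names[d['doc1_index']]} ~ {names[d['doc2_index']]}: {d['similarity']:.0%}")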
```

### Content Matching

```python
checker = SimilarityChecker()

query = "Best practices for Python web development"
results = checker.find_most_similar(query, "./blog_posts/", top_n=10)

print("Most relevant articles:")
for r in results:
    print(f"  {r['file']}: {r['similarity']:.0%} match")
```

## Dependencies

- scikit-learn>=1.3.0
- nltk>=3.8.0
- numpy>=1.24.0
- pandas>=2.0.0

Overview

This skill compares documents and text using multiple similarity algorithms to detect duplicates, plagiarism, or related content. It offers TF-IDF + cosine, pure cosine, Jaccard, and Levenshtein methods, plus batch operations and reporting for larger corpora. Results include pairwise scores, similarity matrices, duplicate lists, and JSON-friendly outputs.

How this skill works

The skill vectorizes text with TF-IDF for cosine comparisons and compares sets of tokens for Jaccard similarity. Levenshtein distance handles short or noisy strings and is normalized to a 0–1 score. The skill can run single comparisons, compute pairwise matrices across many documents, search a folder for the most similar files, and produce structured reports or JSON output.

When to use it

  • Plagiarism detection across student submissions or web sources
  • Finding near-duplicate articles in a content repository
  • Matching user queries to the most relevant documents
  • Batch-processing folders to create similarity matrices or reports
  • Quick sanity checks for short text variations and typos

Best practices

  • Preprocess consistently: lowercase, remove stopwords, and normalize whitespace
  • Use TF-IDF/cosine for longer documents, Jaccard for set-based matching, and Levenshtein for short strings
  • Set a clear similarity threshold based on your use case (e.g., 0.8 for duplicates, 0.6 for potential plagiarism)
  • Run multiple methods (compare_all_methods) to get a balanced view before acting
  • Limit corpus size or use approximate nearest neighbors for very large collections to keep performance manageable
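
The preprocessing advice above can be sketched as a single normalization step applied to every text before comparison. This is a minimal example, not part of the skill: the `preprocess` helper is hypothetical, the stopword list is a tiny illustrative set, and the tokenization also drops punctuation as a side effect.

```python
import re

# Tiny illustrative stopword list (assumption); real projects usually use a fuller one.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "over"})


def preprocess(text: str, stopwords: frozenset = STOPWORDS) -> str:
    """Lowercase, drop stopwords, and collapse all whitespace to single spaces."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in stopwords)


cleaned = preprocess("  The QUICK   brown fox\n")
```

Applying the same function to both sides of every comparison matters more than the exact rules chosen, since inconsistent preprocessing lowers scores for texts that are otherwise identical.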

Example use cases

  • Scan a submissions folder to flag papers with similarity > 0.6 against a sources directory
  • Generate a similarity matrix for a set of news articles to cluster related coverage
  • Deduplicate an article archive by finding pairs above 0.9 similarity and removing duplicates
  • Find the top 5 blog posts most relevant to a user query using TF-IDF + cosine
  • Produce a JSON report of pairwise comparisons for downstream auditing or visualization

FAQ

Which method should I pick for long documents?

Use TF-IDF with cosine similarity: it accounts for term importance and handles variable document lengths well.

Can I compare many files at once?

Yes. Use batch functions to compute a similarity matrix, find duplicates, or search a folder for similar files; consider performance tuning for very large corpora.