mteb-retrieve skill

/letta/benchmarks/trajectory-only/mteb-retrieve

This skill guides semantic similarity retrieval workflows, from data cleaning through embedding and similarity scoring to top-k document ranking.

npx playbooks add skill letta-ai/skills --skill mteb-retrieve

---
name: mteb-retrieve
description: This skill provides guidance for semantic similarity retrieval tasks using embedding models (e.g., MTEB benchmarks, document ranking). It should be used when computing embeddings for documents/queries, ranking documents by similarity, or identifying top-k similar items. Covers data preprocessing, model selection, similarity computation, and result verification.
---

# MTEB Retrieve

## Overview

This skill guides semantic similarity retrieval tasks where documents must be ranked by their similarity to a query using embedding models. These tasks typically involve loading documents, computing embeddings, calculating similarity scores, and identifying documents at specific ranks.

## Workflow

### Step 1: Data Inspection and Preprocessing

Before computing embeddings, thoroughly inspect the input data format:

1. **Examine raw file contents** - Read a sample of lines to understand the actual format
2. **Identify formatting artifacts** - Look for:
   - Line number prefixes (e.g., `1→`, `2→`, `11→`)
   - Index markers or delimiters
   - Whitespace padding or alignment characters
   - Header rows or metadata lines
3. **Clean the data** - Remove any non-semantic content:
   - Strip line numbers and prefixes using regex (e.g., `re.sub(r'^\s*\d+→', '', line)`)
   - Remove leading/trailing whitespace
   - Filter empty lines
4. **Validate preprocessing** - Print sample cleaned documents to verify they contain only semantic content

Example preprocessing pattern:
```python
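import re

# Matches a leading line-number prefix such as "  1→" or "11\t"
PREFIX_PATTERN = re.compile(r'^\s*\d+[→\t]')

def clean_line(line):
    """Remove a leading line-number prefix and surrounding whitespace."""
    return PREFIX_PATTERN.sub('', line).strip()

# Illustrative input; in practice, read the lines from the data file with encoding='utf-8'
raw_lines = ["  1→First document", "2→Second document", "", "11→Third document"]

# Clean each line once and drop lines that end up empty
documents = [doc for doc in map(clean_line, raw_lines) if doc]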
```

### Step 2: Model Selection

Select an appropriate embedding model for the content language and domain:

1. **Check model language** - Models often have language indicators in their names:
   - `zh` = Chinese (e.g., `bge-small-zh-v1.5`)
   - `en` = English (e.g., `bge-small-en-v1.5`)
   - No suffix often means multilingual or English; check the model card to confirm before relying on it
2. **Match model to content** - Using a Chinese-optimized model for English text (or vice versa) produces suboptimal embeddings
3. **Consider model size** - Larger models generally produce better embeddings but are slower
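
The language check above can be roughed out in code. This is a minimal sketch, not part of the skill itself: `cjk_fraction` and `suggest_model` are hypothetical helpers, the 0.3 threshold is an arbitrary assumption, and the heuristic only separates Chinese-heavy text from everything else. The two model names are the ones mentioned in the list above.

```python
def cjk_fraction(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def suggest_model(documents, threshold=0.3):
    """Hypothetical helper: suggest a zh or en model from a sample of documents.

    The threshold is an assumption chosen for illustration, not a tuned value.
    """
    sample = " ".join(documents[:50])  # inspect only the first 50 documents
    if cjk_fraction(sample) >= threshold:
        return "bge-small-zh-v1.5"
    return "bge-small-en-v1.5"


print(suggest_model(["The cat sat on the mat."]))
print(suggest_model(["今天天气很好"]))
```

Treat the suggestion as a prompt to inspect the data, not as a final decision; mixed-language collections may need a multilingual model instead.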

### Step 3: Embedding Computation

When computing embeddings:

1. **Normalize embeddings** - Use `normalize_embeddings=True` to enable cosine similarity via dot product
2. **Batch processing** - For large document sets, process in batches to manage memory
3. **Verify dimensions** - Confirm embedding dimensions match expectations for the model

```python
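from sentence_transformers import SentenceTransformer

# 'model-name' is a placeholder; substitute the model chosen in Step 2
model = SentenceTransformer('model-name')

# normalize_embeddings=True returns unit-length vectors
doc_embeddings = model.encode(documents, normalize_embeddings=True)  # shape: (num_docs, dim)
query_embedding = model.encode(query, normalize_embeddings=True)     # shape: (dim,)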
```
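
Batching and the dimension check can be illustrated without loading a real model. The sketch below assumes a generic `encode_fn` callable that maps a list of texts to a list of vectors; `encode_in_batches` and `toy_encoder` are hypothetical names introduced here for illustration. With sentence-transformers, `encode` already accepts a `batch_size` argument, so an explicit loop like this is mainly useful when embeddings must be streamed to disk or when wrapping a different encoder.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def encode_in_batches(
    texts: Sequence[str],
    encode_fn: Callable[[Sequence[str]], List[Vector]],
    batch_size: int = 32,
) -> List[Vector]:
    """Encode texts in fixed-size batches and check that dimensions agree."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    embeddings: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors = encode_fn(batch)
        if len(vectors) != len(batch):
            raise ValueError("encoder returned a different number of vectors than texts")
        embeddings.extend(vectors)
    # Every embedding should have the same dimensionality
    dims = {len(v) for v in embeddings}
    if len(dims) > 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    return embeddings


def toy_encoder(batch: Sequence[str]) -> List[Vector]:
    """Stand-in encoder (not a real model): [length, vowel count, 1.0] per text."""
    return [[float(len(t)), float(sum(c in "aeiou" for c in t)), 1.0] for t in batch]


vectors = encode_in_batches(["alpha", "beta", "gamma", "delta", "epsilon"], toy_encoder, batch_size=2)
print(len(vectors), len(vectors[0]))
```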

### Step 4: Similarity Computation and Ranking

1. **Compute similarities** - Use dot product for normalized embeddings (equivalent to cosine similarity)
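2. **Handle ties** - Identical similarity scores have no guaranteed order under `np.argsort`'s default sort; add a tie-breaker such as the original document index when a deterministic ranking is required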
3. **Use correct indexing** - For k-th highest, use index `k-1` after sorting in descending order

```python
import numpy as np

similarities = np.dot(doc_embeddings, query_embedding)
sorted_indices = np.argsort(similarities)[::-1]  # Descending order

# For 5th highest: index 4 (0-indexed)
fifth_highest_idx = sorted_indices[4]
fifth_highest_doc = documents[fifth_highest_idx]
```
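
When ties matter, one option is to sort on a compound key. The following is a minimal pure-Python sketch, assuming ties should be broken by original document index; `rank_descending` and `kth_highest` are hypothetical helper names, and the scores are made-up illustration values.

```python
def rank_descending(similarities):
    """Return document indices sorted by score (descending), ties by original index (ascending)."""
    return sorted(range(len(similarities)), key=lambda i: (-similarities[i], i))


def kth_highest(similarities, k):
    """Return the original index of the k-th highest score, where k is 1-based."""
    if not 1 <= k <= len(similarities):
        raise IndexError(f"k must be between 1 and {len(similarities)}")
    return rank_descending(similarities)[k - 1]


scores = [0.42, 0.91, 0.42, 0.77]  # documents 0 and 2 are tied
order = rank_descending(scores)
print(order)                   # [1, 3, 0, 2]
print(kth_highest(scores, 3))  # 0
```

With NumPy arrays, `np.argsort(-similarities, kind="stable")` should give the same ordering, since a stable sort keeps tied elements in their original order.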

### Step 5: Result Verification

Before writing final results, verify correctness:

1. **Print document count** - Confirm expected number of documents were loaded
2. **Show sample documents** - Display first few cleaned documents to verify preprocessing
3. **Display top-k results** - Print at least the top 5-10 documents with their similarity scores
4. **Cross-check output format** - Ensure the output contains only the semantic content, not formatting artifacts

```python
# Verification checklist
print(f"Total documents: {len(documents)}")
print(f"Sample document: {documents[0][:100]}...")
print("\nTop 10 by similarity:")
for i in range(min(10, len(sorted_indices))):
    idx = sorted_indices[i]
    print(f"  {i+1}. [{similarities[idx]:.4f}] {documents[idx][:50]}...")
```
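
The output-format cross-check can also be automated before the result is written. This is a sketch under the assumption that artifacts look like the line-number prefixes from Step 1; `assert_clean_output` is a hypothetical helper, and the pattern would need adjusting for other artifact types.

```python
import re

# Same prefix shape as in Step 1: optional whitespace, digits, then "→" or a tab
ARTIFACT_PATTERN = re.compile(r'^\s*\d+[→\t]')


def assert_clean_output(text):
    """Raise ValueError if the text is empty, padded, or still carries a line-number prefix."""
    if not text.strip():
        raise ValueError("output is empty")
    if text != text.strip():
        raise ValueError("output has leading or trailing whitespace")
    if ARTIFACT_PATTERN.match(text):
        raise ValueError("output still starts with a line-number prefix")
    return text


print(assert_clean_output("A clean sentence."))
```

Calling this on the selected document just before writing it out catches the case where the ranking was computed on cleaned text but the raw line is exported by mistake.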

## Common Pitfalls

### Data Format Issues
- **Line number prefixes** - Input files often include line numbers (e.g., `1→Text`) that corrupt embeddings if not removed
- **Invisible characters** - Watch for tabs, non-breaking spaces, or Unicode formatting characters
- **Mixed encodings** - Explicitly specify file encoding (`encoding='utf-8'`)
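
Invisible characters can be handled with the standard library. The sketch below is one possible approach, not the skill's prescribed method: `strip_invisible` is a hypothetical helper, and NFKC normalization is an assumption that may be too aggressive for some data, since it also folds full-width forms and ligatures. Dropping all format characters also removes zero-width joiners, which matter in some scripts and emoji sequences.

```python
import unicodedata


def strip_invisible(text):
    """Normalize compatibility characters, drop format characters, collapse whitespace."""
    # NFKC maps a non-breaking space (U+00A0) to a regular space
    text = unicodedata.normalize("NFKC", text)
    # Category "Cf" covers format characters such as zero-width space and BOM
    text = "".join(c for c in text if unicodedata.category(c) != "Cf")
    # Collapse tabs and runs of whitespace into single spaces
    return " ".join(text.split())


sample = "\ufeffHello\u00a0world\u200b\t!"
print(repr(strip_invisible(sample)))  # 'Hello world !'
```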

### Model Mismatches
- **Language mismatch** - Using language-specific models on wrong-language content
- **Version confusion** - Pin the exact model name and revision so results are reproducible and match the expected behavior

### Indexing Errors
- **Off-by-one errors** - k-th highest uses index `k-1` in 0-indexed arrays
- **Original vs sorted indices** - Track the mapping between sorted positions and original document indices

### Verification Gaps
- **No sanity checks** - Always verify document count, sample content, and score distribution
- **Missing tie handling** - Document when ties exist and how they affect results
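
A score summary and tie report can back up these checks. This is a minimal sketch: `find_ties` and `summarize_scores` are hypothetical helpers, and grouping by rounding to 6 decimals is an assumption about what counts as a tie; two nearly equal scores on either side of a rounding boundary would not be grouped.

```python
import statistics
from collections import defaultdict


def find_ties(similarities, decimals=6):
    """Map each rounded score that occurs more than once to the document indices sharing it."""
    groups = defaultdict(list)
    for idx, score in enumerate(similarities):
        groups[round(score, decimals)].append(idx)
    return {score: idxs for score, idxs in groups.items() if len(idxs) > 1}


def summarize_scores(similarities):
    """Basic distribution summary for a quick sanity check."""
    return {
        "min": min(similarities),
        "max": max(similarities),
        "mean": statistics.mean(similarities),
    }


scores = [0.5, 0.9, 0.5, 0.1]  # made-up illustration values
print(find_ties(scores))       # {0.5: [0, 2]}
print(summarize_scores(scores))
```

A maximum well below what the model usually produces for clearly related text, or a very narrow spread, can hint at a preprocessing or model-mismatch problem.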

Overview

This skill guides semantic similarity retrieval using embedding models to rank documents by relevance to a query. It focuses on practical steps: data inspection and cleaning, model selection, embedding computation, similarity scoring, and result verification. The goal is reliable top-k retrieval for benchmarks and production ranking tasks.

How this skill works

The skill inspects raw inputs to remove non-semantic artifacts (line numbers, headers, invisible characters) and validates cleaned samples. It recommends choosing a model that matches the content language and domain, computing normalized embeddings in batches, and using dot-product on normalized vectors to get cosine similarities. Finally it sorts scores to identify top-k items and provides checks to verify correctness before exporting results.

When to use it

  • Preparing data for semantic search or document ranking tasks
  • Computing embeddings for queries and document collections
  • Evaluating retrieval quality on MTEB-style benchmarks
  • Identifying top-k similar items for recommendation or deduplication
  • Troubleshooting unexpected retrieval results or ranking ties

Best practices

  • Inspect raw file samples to detect prefixes, headers, or invisible characters before encoding
  • Clean data with targeted regex and strip whitespace; validate by printing sample cleaned lines
  • Choose a model that matches the content language and domain; prefer larger models when quality matters
  • Encode in batches and normalize embeddings to use dot product as cosine similarity
  • Sort similarities in descending order and use zero-based indices (k-th highest = index k-1)
  • Always verify: document count, sample content, and top-k with similarity scores

Example use cases

  • Ranking news articles by relevance to a search query after removing line-number artifacts
  • Computing embeddings for a multilingual dataset using a language-matched model for better retrieval
  • Finding the 5th most similar paragraph to a query in a large document collection
  • Running MTEB retrieval experiments with batched, normalized embeddings and explicit verification steps
  • Detecting near-duplicate documents by retrieving top-k similar items and inspecting similarity scores

FAQ

How do I remove line number prefixes reliably?

Use a simple regex to strip leading digits and delimiters (e.g., re.sub(r'^\s*\d+[→\t]', '', line)) and then trim whitespace. Test on several samples before batch processing.

Why normalize embeddings?

Normalization scales every vector to unit length, so the dot product of two embeddings equals their cosine similarity. Ranking then needs only one dot product per document, with no norm computation per comparison.
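
The equivalence is easy to check on small vectors. This is an illustrative sketch in pure Python with made-up two-dimensional vectors; real embeddings would typically be NumPy arrays.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
direct = cosine(a, b)
via_normalized = dot(normalize(a), normalize(b))
print(math.isclose(direct, via_normalized))  # True
```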

How do I handle ties in similarity scores?

Document ties explicitly in results. If deterministic ordering is required, add a secondary sort key such as original index or timestamp.