
This skill helps evaluate AI model outputs using exact match, semantic similarity, and AI judge methods to build robust eval pipelines.

npx playbooks add skill doanchienthangdev/omgkit --skill evaluation-methodology

SKILL.md
---
name: evaluation-methodology
description: Methods for evaluating AI model outputs - exact match, semantic similarity, LLM-as-judge, comparative evaluation, ELO ranking. Use when measuring AI quality, building eval pipelines, or comparing models.
---

# Evaluation Methodology

Methods for evaluating Foundation Model outputs.

## Evaluation Approaches

### 1. Exact Evaluation

| Method | Use Case | Example |
|--------|----------|---------|
| Exact Match | QA, Math | `"5" == "5"` |
| Functional Correctness | Code | Pass test cases |
| BLEU/ROUGE | Translation | N-gram overlap |
| Semantic Similarity | Open-ended | Embedding cosine |
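
The first two rows of the table can be sketched directly; `exact_match`, `passes_tests`, and the sample test cases below are illustrative helpers, not part of any library:

```python
# Exact match: normalize before comparing to avoid trivial mismatches
def exact_match(generated: str, reference: str) -> bool:
    return generated.strip().lower() == reference.strip().lower()

# Functional correctness: run a generated function against test cases
def passes_tests(func, test_cases) -> bool:
    return all(func(*args) == expected for args, expected in test_cases)

exact_match("5", " 5 ")  # True after normalization
passes_tests(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)])  # True
```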

```python
# Semantic similarity via embedding cosine
# (pip install sentence-transformers scikit-learn)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

generated = "The cat sat on the mat."
reference = "A cat was sitting on the mat."

emb1 = model.encode([generated])
emb2 = model.encode([reference])
similarity = cosine_similarity(emb1, emb2)[0][0]  # closer to 1.0 = more similar
```

### 2. AI as Judge

```python
# Braces in the JSON example are doubled so str.format() leaves them intact
JUDGE_PROMPT = """Rate the response on a scale of 1-5.

Criteria:
- Accuracy: Is information correct?
- Helpfulness: Does it address the need?
- Clarity: Is it easy to understand?

Query: {query}
Response: {response}

Return JSON: {{"score": N, "reasoning": "..."}}"""

# Multi-judge for reliability; get_score(judge, response) is assumed to
# call the judge model with JUDGE_PROMPT and parse the returned JSON score
judges = ["gpt-4", "claude-3"]
scores = [get_score(judge, response) for judge in judges]
final_score = sum(scores) / len(scores)
```

### 3. Comparative Evaluation (ELO)

```python
COMPARE_PROMPT = """Compare these responses.

Query: {query}
A: {response_a}
B: {response_b}

Which is better? Return: A, B, or tie"""

def update_elo(rating_a, rating_b, winner, k=32):
    """Return updated (rating_a, rating_b) after one A-vs-B comparison."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if winner == "A" else 0.0 if winner == "B" else 0.5
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Judge each pair twice with A/B swapped to control for position bias
```

## Evaluation Pipeline

```
1. Define Criteria (accuracy, helpfulness, safety)
   ↓
2. Create Scoring Rubric with Examples
   ↓
3. Select Methods (exact + AI judge + human)
   ↓
4. Create Evaluation Dataset
   ↓
5. Run Evaluation
   ↓
6. Analyze & Iterate
```
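
The steps above can be sketched as a minimal runner; the dataset format and the scorer signature are illustrative assumptions, not a fixed API:

```python
def run_evaluation(dataset, scorers):
    """Score each dataset item with every scorer and return per-metric averages."""
    totals = {name: 0.0 for name in scorers}
    for item in dataset:
        for name, scorer in scorers.items():
            totals[name] += scorer(item)
    return {name: total / len(dataset) for name, total in totals.items()}

dataset = [
    {"query": "2+3?", "response": "5", "reference": "5"},
    {"query": "2+4?", "response": "7", "reference": "6"},
]
scorers = {"exact": lambda it: float(it["response"] == it["reference"])}
run_evaluation(dataset, scorers)  # {"exact": 0.5}
```

In practice each metric from the sections above (exact match, semantic similarity, AI judge) becomes one entry in `scorers`, so new methods can be added without changing the runner.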

## Best Practices

1. Use multiple evaluation methods
2. Calibrate AI judges with human data
3. Include both automatic and human evaluation
4. Version your evaluation datasets
5. Track metrics over time
6. Test for position bias in comparisons
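
Practice 6 can be checked by judging each pair twice with the order swapped; a minimal sketch, assuming a `judge(query, first, second)` callable that returns "A", "B", or "tie":

```python
def position_consistent(judge, query, response_a, response_b):
    """Judge the pair in both orders; a position-robust judge should pick
    the same underlying response (or tie) both times."""
    verdict = judge(query, response_a, response_b)  # response_a shown first
    swapped = judge(query, response_b, response_a)  # response_b shown first
    flip = {"A": "B", "B": "A", "tie": "tie"}
    return verdict == flip[swapped]

# A judge that always answers "A" fails the check:
position_consistent(lambda q, x, y: "A", "q", "r1", "r2")  # False
```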

Overview

This skill describes practical methods for evaluating AI model outputs, covering exact match, semantic similarity, LLM-as-judge, comparative evaluation, and ELO ranking. It explains how to mix automatic and human-centered techniques to measure quality and drive iteration. Use it to build repeatable evaluation pipelines and compare models or versions reliably.

How this skill works

It inspects model responses against defined criteria (accuracy, helpfulness, clarity, safety) using a mix of exact checks, embedding-based semantic similarity, and automated judge prompts. For ranking and head-to-head comparisons it supports pairwise assessments and ELO-style rating updates. The skill also describes assembling datasets, scoring rubrics, and aggregating multi-judge results into final metrics.

When to use it

  • Measuring model quality for QA, summarization, translation, or code generation
  • Building an automated evaluation pipeline with repeatable metrics
  • Comparing two or more model versions or prompts head-to-head
  • Calibrating LLM judges with human-labeled examples
  • Tracking model performance across releases and datasets

Best practices

  • Combine multiple methods: exact matches, semantic similarity, LLM-as-judge, and human review for robustness
  • Define clear criteria and a scoring rubric with examples before running evaluations
  • Calibrate automated judges against a human-labeled baseline to reduce bias
  • Version your evaluation datasets and record inputs, outputs, and metrics for reproducibility
  • Monitor position and selection bias in pairwise comparisons and use multi-judge aggregation
  • Track metrics over time and iterate on prompts, training data, and evaluation design

Example use cases

  • Exact match for fact-based QA and math problems to flag regressions
  • Embedding cosine similarity to score open-ended generation against reference summaries
  • LLM-as-judge prompts to rate clarity, helpfulness, and safety at scale
  • Pairwise comparisons with ELO to rank competing model variants or prompts
  • Hybrid pipelines: automatic checks first, then human review for borderline or risky outputs

FAQ

How do I combine scores from multiple judges?

Aggregate by averaging numeric scores, or by majority vote for discrete choices; weight judges that have been calibrated against human labels more heavily.
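
As a sketch, averaging for numeric ratings and majority vote for discrete verdicts:

```python
from collections import Counter
from statistics import mean

def aggregate_numeric(scores):
    """Average 1-5 ratings from multiple judges."""
    return mean(scores)

def aggregate_discrete(verdicts):
    """Majority vote over discrete verdicts such as "A" / "B" / "tie"."""
    return Counter(verdicts).most_common(1)[0][0]

aggregate_numeric([4, 5, 3])         # 4
aggregate_discrete(["A", "A", "B"])  # "A"
```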

When is exact match insufficient?

Exact match fails for paraphrases and open-ended tasks; use semantic similarity or human/LLM judgment instead.