
This skill helps evaluate AI model outputs using exact match, semantic similarity, and AI judge methods to build robust eval pipelines.

npx playbooks add skill doanchienthangdev/omgkit --skill evaluation-methodology

SKILL.md
---
name: evaluation-methodology
description: Methods for evaluating AI model outputs - exact match, semantic similarity, LLM-as-judge, comparative evaluation, ELO ranking. Use when measuring AI quality, building eval pipelines, or comparing models.
---

# Evaluation Methodology

Methods for evaluating Foundation Model outputs.

## Evaluation Approaches

### 1. Exact Evaluation

| Method | Use Case | Example |
|--------|----------|---------|
| Exact Match | QA, Math | `"5" == "5"` |
| Functional Correctness | Code | Pass test cases |
| BLEU/ROUGE | Translation | N-gram overlap |
| Semantic Similarity | Open-ended | Embedding cosine |
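
The first two rows of the table can be sketched directly; `exact_match`, `passes_tests`, and the sample test cases below are illustrative helpers, not part of any library:

```python
# Exact match: normalize before comparing to avoid trivial mismatches
def exact_match(generated: str, reference: str) -> bool:
    return generated.strip().lower() == reference.strip().lower()

# Functional correctness: run a generated function against test cases
def passes_tests(func, test_cases) -> bool:
    return all(func(*args) == expected for args, expected in test_cases)

exact_match("5", " 5 ")  # True after normalization
passes_tests(lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)])  # True
```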

```python
# Semantic similarity via embedding cosine
# (pip install sentence-transformers scikit-learn)
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

generated = "The cat sat on the mat."
reference = "A cat was sitting on the mat."

emb1 = model.encode([generated])
emb2 = model.encode([reference])
similarity = cosine_similarity(emb1, emb2)[0][0]  # closer to 1.0 = more similar
```

### 2. AI as Judge

```python
# Braces in the JSON example are doubled so str.format() leaves them intact
JUDGE_PROMPT = """Rate the response on a scale of 1-5.

Criteria:
- Accuracy: Is information correct?
- Helpfulness: Does it address the need?
- Clarity: Is it easy to understand?

Query: {query}
Response: {response}

Return JSON: {{"score": N, "reasoning": "..."}}"""

# Multi-judge for reliability; get_score(judge, response) is assumed to
# call the judge model with JUDGE_PROMPT and parse the returned JSON score
judges = ["gpt-4", "claude-3"]
scores = [get_score(judge, response) for judge in judges]
final_score = sum(scores) / len(scores)
```

### 3. Comparative Evaluation (ELO)

```python
COMPARE_PROMPT = """Compare these responses.

Query: {query}
A: {response_a}
B: {response_b}

Which is better? Return: A, B, or tie"""

def update_elo(rating_a, rating_b, winner, k=32):
    """Return updated (rating_a, rating_b) after one A-vs-B comparison."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    score_a = 1.0 if winner == "A" else 0.0 if winner == "B" else 0.5
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Judge each pair twice with A/B swapped to control for position bias
```

## Evaluation Pipeline

```
1. Define Criteria (accuracy, helpfulness, safety)
   ↓
2. Create Scoring Rubric with Examples
   ↓
3. Select Methods (exact + AI judge + human)
   ↓
4. Create Evaluation Dataset
   ↓
5. Run Evaluation
   ↓
6. Analyze & Iterate
```
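
The steps above can be sketched as a minimal runner; the dataset format and the scorer signature are illustrative assumptions, not a fixed API:

```python
def run_evaluation(dataset, scorers):
    """Score each dataset item with every scorer and return per-metric averages."""
    totals = {name: 0.0 for name in scorers}
    for item in dataset:
        for name, scorer in scorers.items():
            totals[name] += scorer(item)
    return {name: total / len(dataset) for name, total in totals.items()}

dataset = [
    {"query": "2+3?", "response": "5", "reference": "5"},
    {"query": "2+4?", "response": "7", "reference": "6"},
]
scorers = {"exact": lambda it: float(it["response"] == it["reference"])}
run_evaluation(dataset, scorers)  # {"exact": 0.5}
```

In practice each metric from the sections above (exact match, semantic similarity, AI judge) becomes one entry in `scorers`, so new methods can be added without changing the runner.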

## Best Practices

1. Use multiple evaluation methods
2. Calibrate AI judges with human data
3. Include both automatic and human evaluation
4. Version your evaluation datasets
5. Track metrics over time
6. Test for position bias in comparisons
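
Practice 6 can be checked by judging each pair twice with the order swapped; a minimal sketch, assuming a `judge(query, first, second)` callable that returns "A", "B", or "tie":

```python
def position_consistent(judge, query, response_a, response_b):
    """Judge the pair in both orders; a position-robust judge should pick
    the same underlying response (or tie) both times."""
    verdict = judge(query, response_a, response_b)  # response_a shown first
    swapped = judge(query, response_b, response_a)  # response_b shown first
    flip = {"A": "B", "B": "A", "tie": "tie"}
    return verdict == flip[swapped]

# A judge that always answers "A" fails the check:
position_consistent(lambda q, x, y: "A", "q", "r1", "r2")  # False
```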

Overview

This skill describes practical methods for evaluating AI model outputs, covering exact match, semantic similarity, LLM-as-judge, comparative evaluation, and ELO ranking. It explains how to mix automatic and human-centered techniques to measure quality and drive iteration. Use it to build repeatable evaluation pipelines and compare models or versions reliably.

How this skill works

It inspects model responses against defined criteria (accuracy, helpfulness, clarity, safety) using a mix of exact checks, embedding-based semantic similarity, and automated judge prompts. For ranking and head-to-head comparisons it supports pairwise assessments and ELO-style rating updates. The skill also describes assembling datasets, scoring rubrics, and aggregating multi-judge results into final metrics.

When to use it

  • Measuring model quality for QA, summarization, translation, or code generation
  • Building an automated evaluation pipeline with repeatable metrics
  • Comparing two or more model versions or prompts head-to-head
  • Calibrating LLM judges with human-labeled examples
  • Tracking model performance across releases and datasets

Best practices

  • Combine multiple methods: exact matches, semantic similarity, LLM-as-judge, and human review for robustness
  • Define clear criteria and a scoring rubric with examples before running evaluations
  • Calibrate automated judges against a human-labeled baseline to reduce bias
  • Version your evaluation datasets and record inputs, outputs, and metrics for reproducibility
  • Monitor position and selection bias in pairwise comparisons and use multi-judge aggregation
  • Track metrics over time and iterate on prompts, training data, and evaluation design

Example use cases

  • Exact match for fact-based QA and math problems to flag regressions
  • Embedding cosine similarity to score open-ended generation against reference summaries
  • LLM-as-judge prompts to rate clarity, helpfulness, and safety at scale
  • Pairwise comparisons with ELO to rank competing model variants or prompts
  • Hybrid pipelines: automatic checks first, then human review for borderline or risky outputs

FAQ

How do I combine scores from multiple judges?

Aggregate by averaging numeric scores, or by majority vote for discrete choices; weight judges that have been calibrated against human labels more heavily.
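
As a sketch, averaging for numeric ratings and majority vote for discrete verdicts:

```python
from collections import Counter
from statistics import mean

def aggregate_numeric(scores):
    """Average 1-5 ratings from multiple judges."""
    return mean(scores)

def aggregate_discrete(verdicts):
    """Majority vote over discrete verdicts such as "A" / "B" / "tie"."""
    return Counter(verdicts).most_common(1)[0][0]

aggregate_numeric([4, 5, 3])         # 4
aggregate_discrete(["A", "A", "B"])  # "A"
```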

When is exact match insufficient?

Exact match fails for paraphrases and open-ended tasks; use semantic similarity or human/LLM judgment instead.