
llm-evaluation skill

/plugins/ork/skills/llm-evaluation

This skill helps you evaluate and validate LLM outputs using multi-dimensional scoring, quality gates, and hallucination checks for reliable AI pipelines.

npx playbooks add skill yonatangross/orchestkit --skill llm-evaluation


SKILL.md
---
name: llm-evaluation
description: LLM output evaluation and quality assessment. Use when implementing LLM-as-judge patterns, quality gates for AI outputs, or automated evaluation pipelines.
context: fork
agent: llm-integrator
version: 2.0.0
tags: [evaluation, llm, quality, ragas, langfuse, 2026]
author: OrchestKit
user-invocable: false
---

# LLM Evaluation

Evaluate and validate LLM outputs for quality assurance using RAGAS and LLM-as-judge patterns.

## Quick Reference

### LLM-as-Judge Pattern

```python
async def evaluate_quality(input_text: str, output_text: str, dimension: str) -> float:
    # Ask the judge model for a 1-10 score, then normalize to 0.0-1.0
    response = await llm.chat([{
        "role": "user",
        "content": f"""Evaluate for {dimension}. Score 1-10.
Input: {input_text[:500]}
Output: {output_text[:1000]}
Respond with just the number."""
    }])
    score = float(response.content.strip())  # float tolerates "8" as well as "8.5"
    return min(max(score / 10, 0.0), 1.0)    # clamp in case the judge goes off-scale
```

### Quality Gate

```python
QUALITY_THRESHOLD = 0.7

async def quality_gate(state: dict) -> dict:
    scores = await full_quality_assessment(state["input"], state["output"])
    passed = scores["average"] >= QUALITY_THRESHOLD
    return {**state, "quality_passed": passed}
```
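
The gate above calls `full_quality_assessment`, which this file does not define. A minimal sketch, assuming the `evaluate_quality` judge from the previous snippet and an illustrative dimension list:

```python
DIMENSIONS = ["relevance", "accuracy", "completeness"]  # pick 3-5 per use case

async def full_quality_assessment(input_text: str, output_text: str) -> dict:
    # Score each dimension with the judge model, then average for the gate
    scores = {
        dim: await evaluate_quality(input_text, output_text, dim)
        for dim in DIMENSIONS
    }
    scores["average"] = sum(scores.values()) / len(DIMENSIONS)
    return scores
```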

### Hallucination Detection

```python
async def detect_hallucination(context: str, output: str) -> dict:
    # Check whether the output makes claims the context does not support
    # (see the judge-based sketch below for one implementation)
    unsupported_claims: list[str] = []
    return {"has_hallucinations": len(unsupported_claims) > 0, "unsupported_claims": unsupported_claims}
```
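
The stub above only shows the return shape. One hedged, judge-based sketch, assuming the same generic `llm` client as the snippets above and a judge that reliably returns JSON (both assumptions, not part of the skill):

```python
import json

async def detect_hallucination(context: str, output: str) -> dict:
    # Ask the judge model to extract output claims the context does not support
    response = await llm.chat([{
        "role": "user",
        "content": f"""List every claim in the Output that is NOT supported by the Context.
Context: {context[:2000]}
Output: {output[:1000]}
Respond with a JSON array of strings ([] if everything is supported)."""
    }])
    try:
        unsupported = json.loads(response.content.strip())
    except json.JSONDecodeError:
        unsupported = []  # treat unparsable judge output as "no finding" rather than crash
    return {"has_hallucinations": len(unsupported) > 0, "unsupported_claims": unsupported}
```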

## RAGAS Metrics (2026)

| Metric | Use Case | Threshold |
|--------|----------|-----------|
| Faithfulness | RAG grounding | ≥ 0.8 |
| Answer Relevancy | Q&A systems | ≥ 0.7 |
| Context Precision | Retrieval quality | ≥ 0.7 |
| Context Recall | Retrieval completeness | ≥ 0.7 |
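
A minimal sketch of scoring one RAG sample against these thresholds with the `ragas` library. The exact imports and required column names vary between ragas versions, and a judge LLM (OpenAI by default) must be configured, so treat the details below as assumptions:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation sample; ragas expects question/answer/contexts/ground_truth columns
data = {
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["Refunds are accepted within 30 days of purchase."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores to compare against the thresholds above
```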

## Anti-Patterns (FORBIDDEN)

```python
# ❌ NEVER use same model as judge and evaluated
output = await gpt4.complete(prompt)
score = await gpt4.evaluate(output)  # Same model!

# ❌ NEVER use single dimension
if relevance_score > 0.7:  # Only checking one thing
    return "pass"

# ❌ NEVER set threshold too high
THRESHOLD = 0.95  # Blocks most content

# ✅ ALWAYS use different judge model
score = await gpt4_mini.evaluate(claude_output)

# ✅ ALWAYS use multiple dimensions
scores = await evaluate_all_dimensions(output)
if scores["average"] > 0.7:
    return "pass"
```

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Judge model | GPT-4o-mini or Claude Haiku |
| Threshold | 0.7 for production, 0.6 for drafts |
| Dimensions | 3-5 most relevant to use case |
| Sample size | 50+ for reliable metrics |
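
One way to keep these decisions in a single place is a small config object; the names below are illustrative, not part of the skill:

```python
from dataclasses import dataclass, field

@dataclass
class EvalConfig:
    judge_model: str = "gpt-4o-mini"   # must differ from the evaluated model
    threshold: float = 0.7             # 0.6 for drafts, 0.7 for production
    dimensions: list[str] = field(
        default_factory=lambda: ["relevance", "accuracy", "completeness"]
    )
    min_samples: int = 50              # minimum sample size before trusting metrics
```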

## Detailed Documentation

| Resource | Description |
|----------|-------------|
| [references/evaluation-metrics.md](references/evaluation-metrics.md) | RAGAS & LLM-as-judge metrics |
| [examples/evaluation-patterns.md](examples/evaluation-patterns.md) | Complete evaluation examples |
| [checklists/evaluation-checklist.md](checklists/evaluation-checklist.md) | Setup and review checklists |
| [scripts/evaluator-template.py](scripts/evaluator-template.py) | Starter evaluation template |

## Related Skills

- `quality-gates` - Workflow quality control
- `langfuse-observability` - Tracking evaluation scores
- `agent-loops` - Self-correcting with evaluation

## Capability Details

### llm-as-judge
**Keywords:** LLM judge, judge model, evaluation model, grader LLM
**Solves:**
- Use LLM to evaluate other LLM outputs
- Implement judge prompts for quality
- Configure evaluation criteria

### ragas-metrics
**Keywords:** RAGAS, faithfulness, answer relevancy, context precision
**Solves:**
- Evaluate RAG with RAGAS metrics
- Measure faithfulness and relevancy
- Assess context precision and recall

### hallucination-detection
**Keywords:** hallucination, factuality, grounded, verify facts
**Solves:**
- Detect hallucinations in LLM output
- Verify factual accuracy
- Implement grounding checks

### quality-gates
**Keywords:** quality gate, threshold, pass/fail, evaluation gate
**Solves:**
- Implement quality thresholds
- Block low-quality outputs
- Configure multi-metric gates

### batch-evaluation
**Keywords:** batch eval, dataset evaluation, bulk scoring, eval suite
**Solves:**
- Evaluate over golden datasets
- Run batch evaluation pipelines
- Generate evaluation reports

### pairwise-comparison
**Keywords:** pairwise, A/B comparison, side-by-side, preference
**Solves:**
- Compare two model outputs
- Implement preference ranking
- Run A/B evaluations
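
A hedged sketch of a pairwise judge, reusing the generic `llm` client from the Quick Reference; the prompt format and tie handling are assumptions:

```python
async def pairwise_compare(prompt: str, output_a: str, output_b: str) -> str:
    # Ask the judge which output better answers the prompt; returns "A", "B", or "TIE"
    response = await llm.chat([{
        "role": "user",
        "content": f"""You are comparing two answers to the same prompt.
Prompt: {prompt[:500]}
Answer A: {output_a[:1000]}
Answer B: {output_b[:1000]}
Which answer is more accurate and helpful? Respond with exactly "A", "B", or "tie"."""
    }])
    verdict = response.content.strip().upper()
    return verdict if verdict in ("A", "B", "TIE") else "TIE"
```

To reduce position bias, run each comparison twice with the answers swapped and keep only consistent verdicts.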

Overview

This skill provides a production-ready toolkit for evaluating LLM outputs using LLM-as-judge patterns and RAGAS metrics. It helps teams implement automated quality gates, hallucination detection, and batch evaluation pipelines. The goal is reliable, repeatable quality assessment for retrieval-augmented generation (RAG) workflows and general LLM outputs.

How this skill works

The skill runs multi-dimensional assessments (faithfulness, relevancy, precision, recall) and aggregates scores into pass/fail quality gates. It uses a separate judge model to score outputs, supports per-dimension prompts, and can run batch evaluations against golden datasets. Hallucination checks compare output claims to provided context and flag unsupported assertions.
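
A minimal sketch of such a batch run over a golden dataset, reusing the full_quality_assessment helper sketched earlier; the JSONL format and field names are assumptions:

```python
import asyncio
import json

async def run_batch_eval(dataset_path: str, threshold: float = 0.7) -> dict:
    # Golden dataset as JSONL, one {"input": ..., "output": ...} object per line
    with open(dataset_path) as f:
        samples = [json.loads(line) for line in f if line.strip()]

    results = await asyncio.gather(
        *(full_quality_assessment(s["input"], s["output"]) for s in samples)
    )
    passed = sum(1 for r in results if r["average"] >= threshold)
    return {
        "samples": len(samples),
        "pass_rate": passed / len(samples) if samples else 0.0,
        "mean_score": sum(r["average"] for r in results) / len(results) if results else 0.0,
    }
```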

When to use it

  • Implementing automated quality gates before publishing AI-generated content
  • Running batch evaluations over model outputs or golden datasets
  • Detecting hallucinations and verifying factual claims in RAG systems
  • Setting up LLM-as-judge workflows where a different model grades outputs
  • Comparing model variants with pairwise or A/B evaluation

Best practices

  • Always use a different judge model than the evaluated model to avoid bias
  • Score multiple dimensions (3–5) and use the average for gating decisions
  • Set realistic thresholds (0.7 for production, 0.6 for drafts) and tune per use case
  • Sample sufficiently (50+ examples) for reliable metrics before enforcing gates
  • Avoid single-dimension pass/fail checks; combine relevance, faithfulness, and retrieval metrics

Example use cases

  • Quality gate that blocks responses with average score < 0.7 before returning to users
  • Batch evaluation pipeline that scores model outputs on faithfulness and relevancy over a test set
  • Hallucination detector that flags unsupported claims in answers produced by a RAG system
  • LLM-as-judge microservice using a lightweight judge model to rate production outputs
  • Pairwise A/B comparison flow that ranks two outputs by user-centric preference and factuality

FAQ

Which judge model should I use?

Use a smaller, different model than the evaluated model (e.g., GPT-4o-mini or Claude Haiku) to reduce bias and cost.

What thresholds are recommended?

Use ~0.7 for production gates and ~0.6 for drafts; tune by dimension and business risk.