
llm-evaluation skill

/plugins/ork/skills/llm-evaluation

This skill helps you evaluate and validate LLM outputs using multi-dimensional scoring, quality gates, and hallucination checks for reliable AI pipelines.

npx playbooks add skill yonatangross/orchestkit --skill llm-evaluation


SKILL.md
---
name: llm-evaluation
description: LLM output evaluation and quality assessment. Use when implementing LLM-as-judge patterns, quality gates for AI outputs, or automated evaluation pipelines.
context: fork
agent: llm-integrator
version: 2.0.0
tags: [evaluation, llm, quality, ragas, langfuse, 2026]
author: OrchestKit
user-invocable: false
---

# LLM Evaluation

Evaluate and validate LLM outputs for quality assurance using RAGAS and LLM-as-judge patterns.

## Quick Reference

### LLM-as-Judge Pattern

```python
async def evaluate_quality(input_text: str, output_text: str, dimension: str) -> float:
    # Ask the judge model for a 1-10 score, then normalize to 0.0-1.0
    response = await llm.chat([{
        "role": "user",
        "content": f"""Evaluate for {dimension}. Score 1-10.
Input: {input_text[:500]}
Output: {output_text[:1000]}
Respond with just the number."""
    }])
    score = float(response.content.strip())  # float tolerates "8" as well as "8.5"
    return min(max(score / 10, 0.0), 1.0)    # clamp in case the judge goes off-scale
```

### Quality Gate

```python
QUALITY_THRESHOLD = 0.7

async def quality_gate(state: dict) -> dict:
    scores = await full_quality_assessment(state["input"], state["output"])
    passed = scores["average"] >= QUALITY_THRESHOLD
    return {**state, "quality_passed": passed}
```
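
The gate above calls `full_quality_assessment`, which this file does not define. A minimal sketch, assuming the `evaluate_quality` judge from the previous snippet and an illustrative dimension list:

```python
DIMENSIONS = ["relevance", "accuracy", "completeness"]  # pick 3-5 per use case

async def full_quality_assessment(input_text: str, output_text: str) -> dict:
    # Score each dimension with the judge model, then average for the gate
    scores = {
        dim: await evaluate_quality(input_text, output_text, dim)
        for dim in DIMENSIONS
    }
    scores["average"] = sum(scores.values()) / len(DIMENSIONS)
    return scores
```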

### Hallucination Detection

```python
async def detect_hallucination(context: str, output: str) -> dict:
    # Check whether the output makes claims the context does not support
    # (see the judge-based sketch below for one implementation)
    unsupported_claims: list[str] = []
    return {"has_hallucinations": len(unsupported_claims) > 0, "unsupported_claims": unsupported_claims}
```
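
The stub above only shows the return shape. One hedged, judge-based sketch, assuming the same generic `llm` client as the snippets above and a judge that reliably returns JSON (both assumptions, not part of the skill):

```python
import json

async def detect_hallucination(context: str, output: str) -> dict:
    # Ask the judge model to extract output claims the context does not support
    response = await llm.chat([{
        "role": "user",
        "content": f"""List every claim in the Output that is NOT supported by the Context.
Context: {context[:2000]}
Output: {output[:1000]}
Respond with a JSON array of strings ([] if everything is supported)."""
    }])
    try:
        unsupported = json.loads(response.content.strip())
    except json.JSONDecodeError:
        unsupported = []  # treat unparsable judge output as "no finding" rather than crash
    return {"has_hallucinations": len(unsupported) > 0, "unsupported_claims": unsupported}
```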

## RAGAS Metrics (2026)

| Metric | Use Case | Threshold |
|--------|----------|-----------|
| Faithfulness | RAG grounding | ≥ 0.8 |
| Answer Relevancy | Q&A systems | ≥ 0.7 |
| Context Precision | Retrieval quality | ≥ 0.7 |
| Context Recall | Retrieval completeness | ≥ 0.7 |
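
A minimal sketch of scoring one RAG sample against these thresholds with the `ragas` library. The exact imports and required column names vary between ragas versions, and a judge LLM (OpenAI by default) must be configured, so treat the details below as assumptions:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One evaluation sample; ragas expects question/answer/contexts/ground_truth columns
data = {
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
    "ground_truth": ["Refunds are accepted within 30 days of purchase."],
}

result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores to compare against the thresholds above
```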

## Anti-Patterns (FORBIDDEN)

```python
# ❌ NEVER use same model as judge and evaluated
output = await gpt4.complete(prompt)
score = await gpt4.evaluate(output)  # Same model!

# ❌ NEVER use single dimension
if relevance_score > 0.7:  # Only checking one thing
    return "pass"

# ❌ NEVER set threshold too high
THRESHOLD = 0.95  # Blocks most content

# ✅ ALWAYS use different judge model
score = await gpt4_mini.evaluate(claude_output)

# ✅ ALWAYS use multiple dimensions
scores = await evaluate_all_dimensions(output)
if scores["average"] > 0.7:
    return "pass"
```

## Key Decisions

| Decision | Recommendation |
|----------|----------------|
| Judge model | GPT-4o-mini or Claude Haiku |
| Threshold | 0.7 for production, 0.6 for drafts |
| Dimensions | 3-5 most relevant to use case |
| Sample size | 50+ for reliable metrics |
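
One way to keep these decisions in a single place is a small config object; the names below are illustrative, not part of the skill:

```python
from dataclasses import dataclass, field

@dataclass
class EvalConfig:
    judge_model: str = "gpt-4o-mini"   # must differ from the evaluated model
    threshold: float = 0.7             # 0.6 for drafts, 0.7 for production
    dimensions: list[str] = field(
        default_factory=lambda: ["relevance", "accuracy", "completeness"]
    )
    min_samples: int = 50              # minimum sample size before trusting metrics
```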

## Detailed Documentation

| Resource | Description |
|----------|-------------|
| [references/evaluation-metrics.md](references/evaluation-metrics.md) | RAGAS & LLM-as-judge metrics |
| [examples/evaluation-patterns.md](examples/evaluation-patterns.md) | Complete evaluation examples |
| [checklists/evaluation-checklist.md](checklists/evaluation-checklist.md) | Setup and review checklists |
| [scripts/evaluator-template.py](scripts/evaluator-template.py) | Starter evaluation template |

## Related Skills

- `quality-gates` - Workflow quality control
- `langfuse-observability` - Tracking evaluation scores
- `agent-loops` - Self-correcting with evaluation

## Capability Details

### llm-as-judge
**Keywords:** LLM judge, judge model, evaluation model, grader LLM
**Solves:**
- Use LLM to evaluate other LLM outputs
- Implement judge prompts for quality
- Configure evaluation criteria

### ragas-metrics
**Keywords:** RAGAS, faithfulness, answer relevancy, context precision
**Solves:**
- Evaluate RAG with RAGAS metrics
- Measure faithfulness and relevancy
- Assess context precision and recall

### hallucination-detection
**Keywords:** hallucination, factuality, grounded, verify facts
**Solves:**
- Detect hallucinations in LLM output
- Verify factual accuracy
- Implement grounding checks

### quality-gates
**Keywords:** quality gate, threshold, pass/fail, evaluation gate
**Solves:**
- Implement quality thresholds
- Block low-quality outputs
- Configure multi-metric gates

### batch-evaluation
**Keywords:** batch eval, dataset evaluation, bulk scoring, eval suite
**Solves:**
- Evaluate over golden datasets
- Run batch evaluation pipelines
- Generate evaluation reports

### pairwise-comparison
**Keywords:** pairwise, A/B comparison, side-by-side, preference
**Solves:**
- Compare two model outputs
- Implement preference ranking
- Run A/B evaluations
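
A hedged sketch of a pairwise judge, reusing the generic `llm` client from the Quick Reference; the prompt format and tie handling are assumptions:

```python
async def pairwise_compare(prompt: str, output_a: str, output_b: str) -> str:
    # Ask the judge which output better answers the prompt; returns "A", "B", or "TIE"
    response = await llm.chat([{
        "role": "user",
        "content": f"""You are comparing two answers to the same prompt.
Prompt: {prompt[:500]}
Answer A: {output_a[:1000]}
Answer B: {output_b[:1000]}
Which answer is more accurate and helpful? Respond with exactly "A", "B", or "tie"."""
    }])
    verdict = response.content.strip().upper()
    return verdict if verdict in ("A", "B", "TIE") else "TIE"
```

To reduce position bias, run each comparison twice with the answers swapped and keep only consistent verdicts.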

Overview

This skill provides a production-ready toolkit for evaluating LLM outputs using LLM-as-judge patterns and RAGAS metrics. It helps teams implement automated quality gates, hallucination detection, and batch evaluation pipelines. The goal is reliable, repeatable quality assessment for retrieval-augmented generation (RAG) workflows and general LLM outputs.

How this skill works

The skill runs multi-dimensional assessments (faithfulness, relevancy, precision, recall) and aggregates scores into pass/fail quality gates. It uses a separate judge model to score outputs, supports per-dimension prompts, and can run batch evaluations against golden datasets. Hallucination checks compare output claims to provided context and flag unsupported assertions.
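
A minimal sketch of such a batch run over a golden dataset, reusing the full_quality_assessment helper sketched earlier; the JSONL format and field names are assumptions:

```python
import asyncio
import json

async def run_batch_eval(dataset_path: str, threshold: float = 0.7) -> dict:
    # Golden dataset as JSONL, one {"input": ..., "output": ...} object per line
    with open(dataset_path) as f:
        samples = [json.loads(line) for line in f if line.strip()]

    results = await asyncio.gather(
        *(full_quality_assessment(s["input"], s["output"]) for s in samples)
    )
    passed = sum(1 for r in results if r["average"] >= threshold)
    return {
        "samples": len(samples),
        "pass_rate": passed / len(samples) if samples else 0.0,
        "mean_score": sum(r["average"] for r in results) / len(results) if results else 0.0,
    }
```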

When to use it

  • Implementing automated quality gates before publishing AI-generated content
  • Running batch evaluations over model outputs or golden datasets
  • Detecting hallucinations and verifying factual claims in RAG systems
  • Setting up LLM-as-judge workflows where a different model grades outputs
  • Comparing model variants with pairwise or A/B evaluation

Best practices

  • Always use a different judge model than the evaluated model to avoid bias
  • Score multiple dimensions (3–5) and use the average for gating decisions
  • Set realistic thresholds (0.7 for production, 0.6 for drafts) and tune per use case
  • Sample sufficiently (50+ examples) for reliable metrics before enforcing gates
  • Avoid single-dimension pass/fail checks; combine relevance, faithfulness, and retrieval metrics

Example use cases

  • Quality gate that blocks responses with average score < 0.7 before returning to users
  • Batch evaluation pipeline that scores model outputs on faithfulness and relevancy over a test set
  • Hallucination detector that flags unsupported claims in answers produced by a RAG system
  • LLM-as-judge microservice using a lightweight judge model to rate production outputs
  • Pairwise A/B comparison flow that ranks two outputs by user-centric preference and factuality

FAQ

Which judge model should I use?

Use a smaller, different model than the evaluated model (e.g., GPT-4o-mini or Claude Haiku) to reduce bias and cost.

What thresholds are recommended?

Use ~0.7 for production gates and ~0.6 for drafts; tune by dimension and business risk.