
This skill evaluates LLM outputs using multi-dimensional rubrics to ensure quality, handle non-determinism, and apply LLM-as-judge patterns for robust testing.

npx playbooks add skill greyhaven-ai/claude-code-config --skill evaluation

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: grey-haven-evaluation
description: "Evaluate LLM outputs with multi-dimensional rubrics, handle non-determinism, and implement LLM-as-judge patterns. Essential for production LLM systems. Use when testing prompts, validating outputs, comparing models, or when user mentions 'evaluation', 'testing LLM', 'rubric', 'LLM-as-judge', 'output quality', 'prompt testing', or 'model comparison'."
# v2.0.43: Skills to auto-load for evaluation work
skills:
  - grey-haven-prompt-engineering
  - grey-haven-testing-strategy
# v2.0.74: Tools for evaluation work
allowed-tools:
  - Read
  - Write
  - MultiEdit
  - Bash
  - Grep
  - Glob
  - TodoWrite
---

# Evaluation Skill

Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.

## Core Insight: The 95% Variance Finding

Research shows **95% of output variance** comes from just two sources:
- **80%** from prompt tokens (wording, structure, examples)
- **15%** from random seed/sampling

Temperature, model version, and other factors account for only 5%.

**Implication**: Focus evaluation on prompt quality, not model tweaking.

## What's Included

### Examples (`examples/`)
- **Prompt comparison** - A/B testing prompts with rubrics
- **Model evaluation** - Comparing outputs across models
- **Regression testing** - Detecting output degradation

### Reference Guides (`reference/`)
- **Rubric design** - Multi-dimensional evaluation criteria
- **LLM-as-judge** - Using LLMs to evaluate LLM outputs
- **Statistical methods** - Handling non-determinism

### Templates (`templates/`)
- **Rubric templates** - Ready-to-use evaluation criteria
- **Judge prompts** - LLM-as-judge prompt templates
- **Test case format** - Structured test case templates

### Checklists (`checklists/`)
- **Evaluation setup** - Before running evaluations
- **Rubric validation** - Ensuring rubric quality

## Key Concepts

### 1. Multi-Dimensional Rubrics

Don't use single scores. Break down evaluation into dimensions:

| Dimension | Weight | Criteria |
|-----------|--------|----------|
| Accuracy | 30% | Factually correct, no hallucinations |
| Completeness | 25% | Addresses all requirements |
| Clarity | 20% | Well-organized, easy to understand |
| Conciseness | 15% | No unnecessary content |
| Format | 10% | Follows specified structure |
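
As a minimal sketch, the table above translates into a weighted-sum calculation like the one below. The `DimensionScore` shape and `overallScore` helper are illustrative names, not part of this skill's templates; scores are assumed to use the 1-5 scale shown elsewhere in this skill.

```typescript
// Illustrative weighted rubric scoring; the weights mirror the table above.
interface DimensionScore {
  name: string
  weight: number  // fraction of the overall score; weights sum to 1.0
  score: number   // 1-5 rating for this dimension
}

function overallScore(dimensions: DimensionScore[]): number {
  const totalWeight = dimensions.reduce((sum, d) => sum + d.weight, 0)
  if (Math.abs(totalWeight - 1.0) > 1e-6) {
    throw new Error(`Rubric weights must sum to 1.0, got ${totalWeight}`)
  }
  return dimensions.reduce((sum, d) => sum + d.weight * d.score, 0)
}

// Example: 4/5 accuracy, 5/5 completeness, 4/5 clarity, 3/5 conciseness, 5/5 format
const score = overallScore([
  { name: 'accuracy',     weight: 0.30, score: 4 },
  { name: 'completeness', weight: 0.25, score: 5 },
  { name: 'clarity',      weight: 0.20, score: 4 },
  { name: 'conciseness',  weight: 0.15, score: 3 },
  { name: 'format',       weight: 0.10, score: 5 },
])
console.log(score.toFixed(2)) // 4.20 on the 1-5 scale
```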

### 2. Handling Non-Determinism

LLMs are non-deterministic. Handle with:

```
Strategy 1: Multiple Runs
- Run same prompt 3-5 times
- Report mean and variance
- Flag high-variance cases

Strategy 2: Seed Control
- Set temperature=0 for reproducibility
- Document seed for debugging
- Accept that some variation is normal

Strategy 3: Statistical Significance
- Use paired comparisons
- Require 70%+ win rate for "better"
- Report confidence intervals
```
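
A minimal sketch of Strategies 1 and 3, assuming each run has already been scored against the rubric. The 70% win-rate bar comes from Strategy 3; the 0.5 variance threshold is an arbitrary example value.

```typescript
// Strategy 1: summarize repeated runs of the same prompt.
function summarizeRuns(scores: number[], varianceThreshold = 0.5) {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length
  const variance =
    scores.reduce((a, b) => a + (b - mean) ** 2, 0) / scores.length
  // Flag high-variance cases for closer inspection.
  return { mean, variance, highVariance: variance > varianceThreshold }
}

// Strategy 3: paired comparison between prompt A and prompt B on the same
// test cases. Require a 70%+ win rate before calling B "better".
function winRate(scoresA: number[], scoresB: number[]): number {
  const decided = scoresA
    .map((a, i) => scoresB[i] - a)
    .filter(diff => diff !== 0) // ignore ties
  if (decided.length === 0) return 0.5
  const wins = decided.filter(diff => diff > 0).length
  return wins / decided.length
}
```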

### 3. LLM-as-Judge Pattern

Use a judge LLM to evaluate outputs:

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│  Test LLM   │────▶│   Output    │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   Rubric    │────▶│ Judge LLM   │
                    └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │   Score     │
                                        └─────────────┘
```

**Best Practice**: Use a stronger model as the judge (e.g., Opus judging Sonnet's outputs).
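
A minimal sketch of that judge step, assuming a placeholder `callJudgeModel` helper rather than any particular SDK. The prompt wording and the 1-5 integer protocol are illustrative; the skill's actual judge prompt templates live in `templates/`.

```typescript
// Placeholder for whichever client you use to reach the judge model.
declare function callJudgeModel(prompt: string): Promise<string>

async function judgeOutput(
  taskInput: string,
  candidateOutput: string,
  rubric: string,
): Promise<number> {
  // Give the judge the task, the candidate output, and the rubric,
  // and constrain it to return a single score.
  const judgePrompt = [
    'You are evaluating an LLM output against a rubric.',
    `Task input:\n${taskInput}`,
    `Candidate output:\n${candidateOutput}`,
    `Rubric:\n${rubric}`,
    'Respond with only a single integer score from 1 to 5.',
  ].join('\n\n')

  const raw = await callJudgeModel(judgePrompt)
  const score = parseInt(raw.trim(), 10)
  if (Number.isNaN(score) || score < 1 || score > 5) {
    throw new Error(`Judge returned an unparseable score: ${raw}`)
  }
  return score
}
```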

### 4. Test Case Design

Structure test cases with:

```typescript
interface TestCase {
  id: string
  input: string              // User message or context
  expectedBehavior: string   // What output should do
  rubric: RubricItem[]       // Evaluation criteria
  groundTruth?: string       // Optional gold standard
  metadata: {
    category: string
    difficulty: 'easy' | 'medium' | 'hard'
    createdAt: string
  }
}
```
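
For illustration, a filled-in instance of this interface. The values echo the Step 2 YAML example later in this document; `rubric` is left empty because `RubricItem` is not defined in this excerpt.

```typescript
const reverseStringCase: TestCase = {
  id: 'code-gen-001',
  input: 'Write a function to reverse a string',
  expectedBehavior: 'Returns working reverse function',
  rubric: [], // populated from the rubric template in practice
  groundTruth:
    "function reverse(s: string): string {\n" +
    "  return s.split('').reverse().join('')\n" +
    "}",
  metadata: {
    category: 'code-generation',
    difficulty: 'easy',
    createdAt: '2025-01-15',
  },
}
```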

## Evaluation Workflow

### Step 1: Define Rubric

```yaml
rubric:
  dimensions:
    - name: accuracy
      weight: 0.3
      criteria:
        5: "Completely accurate, no errors"
        4: "Minor errors, doesn't affect correctness"
        3: "Some errors, partially correct"
        2: "Significant errors, mostly incorrect"
        1: "Completely incorrect or hallucinated"
```

### Step 2: Create Test Cases

```yaml
test_cases:
  - id: "code-gen-001"
    input: "Write a function to reverse a string"
    expected_behavior: "Returns working reverse function"
    ground_truth: |
      function reverse(s: string): string {
        return s.split('').reverse().join('')
      }
```

### Step 3: Run Evaluation

```bash
# Run test suite
python evaluate.py --suite code-generation --runs 3

# Output
# ┌─────────────────────────────────────────────┐
# │ Test Suite: code-generation                 │
# │ Total: 50 | Pass: 47 | Fail: 3              │
# │ Accuracy: 94% (±2.1%)                       │
# │ Avg Score: 4.2/5.0                          │
# └─────────────────────────────────────────────┘
```

### Step 4: Analyze Results

Look for:
- **Low-scoring dimensions** - Target for improvement
- **High-variance cases** - Prompt needs clarification
- **Regression from baseline** - Investigate changes
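
As a sketch of this analysis, assuming each test case's results have been aggregated into a simple per-case record (the `CaseResult` shape below is an assumption, not the skill's actual output format):

```typescript
interface CaseResult {
  id: string
  mean: number                            // mean rubric score across runs
  variance: number                        // variance across runs
  dimensionMeans: Record<string, number>  // e.g. { accuracy: 3.1, clarity: 4.4 }
}

function analyze(results: CaseResult[], varianceThreshold = 0.5) {
  // High-variance cases: the prompt likely needs clarification.
  const highVariance = results.filter(r => r.variance > varianceThreshold)

  // Average each dimension across cases and sort ascending: the lowest
  // dimensions are the targets for prompt improvement.
  const totals = new Map<string, { sum: number; n: number }>()
  for (const r of results) {
    for (const [dim, score] of Object.entries(r.dimensionMeans)) {
      const t = totals.get(dim) ?? { sum: 0, n: 0 }
      t.sum += score
      t.n += 1
      totals.set(dim, t)
    }
  }
  const weakestDimensions = [...totals.entries()]
    .map(([dim, t]) => ({ dim, mean: t.sum / t.n }))
    .sort((a, b) => a.mean - b.mean)

  return { weakestDimensions, highVariance }
}
```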

## Grey Haven Integration

### With TDD Workflow

```
1. Write test cases (expected behavior)
2. Run baseline evaluation
3. Modify prompt/implementation
4. Run evaluation again
5. Compare: new scores ≥ baseline?
```
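
A minimal sketch of step 5's comparison, assuming per-test mean scores keyed by test id; the 0.1 tolerance is an illustrative allowance for run-to-run noise.

```typescript
// Fail the comparison if any test case's new mean score drops more than
// `tolerance` below its stored baseline.
function meetsBaseline(
  baseline: Record<string, number>,  // test id -> baseline mean score
  current: Record<string, number>,   // test id -> new mean score
  tolerance = 0.1,
): boolean {
  return Object.entries(baseline).every(
    ([id, base]) => (current[id] ?? 0) >= base - tolerance,
  )
}
```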

### With Pipeline Architecture

```
acquire → prepare → process → parse → render → EVALUATE
                                                  │
                                          ┌───────┴───────┐
                                          │ Compare to    │
                                          │ ground truth  │
                                          │ or rubric     │
                                          └───────────────┘
```

### With Prompt Engineering

```
Current prompt → Evaluate → Score: 3.2
Apply principles → Improve prompt
New prompt → Evaluate → Score: 4.1 ✓
```

## Use This Skill When

- Testing new prompts before production
- Comparing prompt variations (A/B testing)
- Validating model outputs meet quality bar
- Detecting regressions after changes
- Building evaluation datasets
- Implementing automated quality gates

## Related Skills

- `prompt-engineering` - Improve prompts based on evaluation
- `testing-strategy` - Overall testing approaches
- `llm-project-development` - Pipeline with evaluation stage

## Quick Start

```bash
# Design your rubric
cat templates/rubric-template.yaml

# Create test cases
cat templates/test-case-template.yaml

# Learn LLM-as-judge
cat reference/llm-as-judge-guide.md

# Run evaluation checklist
cat checklists/evaluation-setup-checklist.md
```

---

**Skill Version**: 1.0
**Key Finding**: 95% variance from prompts (80%) + sampling (15%)
**Last Updated**: 2025-01-15

Overview

This skill evaluates LLM outputs with multi-dimensional rubrics, accounts for non-determinism, and implements the LLM-as-judge pattern for repeatable quality checks. It focuses evaluation effort on prompt design and structured test cases so teams can reliably compare prompts, models, and changes. Use it to create automated evaluation pipelines and enforce quality gates before production deployment.

How this skill works

Define a rubric with weighted dimensions (accuracy, completeness, clarity, conciseness, format) and author structured test cases that include expected behavior and optional ground truth. Perform several inference runs per test case to measure mean and variance, then aggregate scores per dimension and overall. Optionally use a stronger model as an automated judge: feed each output plus the rubric to the judge LLM to produce scored evaluations and written feedback.

When to use it

  • Before shipping prompts or model changes to production
  • When A/B testing prompt variations or model versions
  • During regression testing to detect output degradation
  • When building evaluation datasets or automated quality gates
  • Any time you need reproducible, multi-dimensional output assessments

Best practices

  • Design multi-dimensional rubrics; avoid single overall scores
  • Run each prompt 3–5 times and report mean and variance
  • Prefer prompt improvements over model hyperparameter tweaks
  • Use a stronger model as judge and validate judge consistency
  • Require statistical thresholds (e.g., ≥70% win rate) for claims of improvement

Example use cases

  • Compare two prompt rewrites for instruction clarity using rubric scores
  • Validate that a model update did not regress accuracy on a core test suite
  • Automate quality gates in CI: fail build if average score falls below threshold
  • Use judge LLM to scale human evaluation for large result sets
  • Create labeled failure cases for prompt engineering and remediation

FAQ

How many runs should I use per prompt to handle non-determinism?

Run each prompt 3–5 times, report mean and variance, and flag high-variance cases for closer inspection.

Should I tune the model or the prompt first?

Focus on prompts first: the research cited above attributes roughly 80% of output variance to prompt wording and structure. Only tweak model settings once prompts are stable.