
advanced-evaluation skill

/bundles/ai-agents/skills/advanced-evaluation

This skill helps you build robust LLM evaluation pipelines by applying direct scoring, pairwise comparison, and bias mitigation to ensure consistent quality.

npx playbooks add skill shipshitdev/library --skill advanced-evaluation

Review the files below or copy the command above to add this skill to your agents.

Files (7): SKILL.md (3.0 KB)
---
name: advanced-evaluation
description: Master LLM-as-a-Judge evaluation techniques including direct scoring, pairwise comparison, rubric generation, and bias mitigation. Use when building evaluation systems, comparing model outputs, or establishing quality standards for AI-generated content.
version: 1.0.0
tags:
  - evaluation
  - llm-as-judge
  - quality
  - bias-mitigation
---

# Advanced Evaluation

LLM-as-a-Judge techniques for evaluating AI outputs. This is not a single technique but a family of approaches; choosing the right one and mitigating its known biases is the core competency.

## When to Activate

- Building automated evaluation pipelines for LLM outputs
- Comparing multiple model responses to select the best one
- Establishing consistent quality standards
- Debugging inconsistent evaluation results
- Designing A/B tests for prompt or model changes
- Creating rubrics for human or automated evaluation

## Core Concepts

### Evaluation Taxonomy

**Direct Scoring**: Single LLM rates one response on a defined scale.

- Best for: Objective criteria (factual accuracy, instruction following, toxicity)
- Reliability: Moderate to high for well-defined criteria

**Pairwise Comparison**: LLM compares two responses and selects the better one.

- Best for: Subjective preferences (tone, style, persuasiveness)
- Reliability: Higher than direct scoring for preferences

### Known Biases

| Bias | Description | Mitigation |
|------|-------------|------------|
| Position | First-position preference | Swap positions, check consistency |
| Length | Longer = higher scores | Explicit prompting, length-normalized scoring |
| Self-Enhancement | Models rate own outputs higher | Use different model for evaluation |
| Verbosity | Unnecessary detail rated higher | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
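
To make the length row actionable, one possible diagnostic is to check whether judge scores track response length across a batch; a strong positive correlation suggests length bias. This is a sketch under assumptions: you already have (response, score) pairs from a prior scoring run, and the 0.5 cutoff is an arbitrary illustration, not an established standard.

```python
from statistics import correlation  # Python 3.10+

def detect_length_bias(responses: list[str], scores: list[float],
                       threshold: float = 0.5) -> bool:
    """Flag probable length bias: scores that rise with response length.

    `threshold` is an illustrative cutoff, not a calibrated standard.
    """
    lengths = [len(r.split()) for r in responses]
    r = correlation(lengths, scores)  # Pearson correlation
    print(f"length-score correlation: {r:.2f}")
    return r > threshold

# Example: scores that grow with word count would trip the check.
responses = [
    "Short answer.",
    "A somewhat longer answer with more words.",
    "A very long answer that pads its reasoning with extra detail. " * 3,
]
print(detect_length_bias(responses, [2.0, 3.5, 5.0]))
```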

### Decision Framework

```
Is there an objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Pairwise Comparison (tone, style, creativity)
```
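
A trivial encoding of this rule, assuming the caller can answer the ground-truth question as a boolean:

```python
def choose_method(has_ground_truth: bool) -> str:
    """Map the decision framework onto an evaluation method name."""
    return "direct_scoring" if has_ground_truth else "pairwise_comparison"

assert choose_method(True) == "direct_scoring"        # factual accuracy, format compliance
assert choose_method(False) == "pairwise_comparison"  # tone, style, creativity
```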

## Quick Reference

### Direct Scoring Requirements

1. Clear criteria definitions
2. Calibrated scale (1-5 recommended)
3. Chain-of-thought: justification BEFORE score (improves reliability 15-25%)
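
A minimal sketch that ties these three requirements together. It assumes a caller-supplied `call_llm(prompt: str) -> str` function (any client); the criterion wording, scale anchors, and JSON output format are illustrative choices, not part of the skill.

```python
import json

SCORING_PROMPT = """You are evaluating a response for: {criterion}

Scale (1-5):
1 = fails the criterion entirely
3 = partially satisfies the criterion
5 = fully satisfies the criterion

Response to evaluate:
{response}

First write a short justification, THEN give the score.
Return JSON: {{"justification": "...", "score": <1-5>}}"""

def direct_score(response: str, criterion: str, call_llm) -> dict:
    """Direct scoring with a justification required before the score."""
    raw = call_llm(SCORING_PROMPT.format(criterion=criterion, response=response))
    result = json.loads(raw)  # in practice, wrap in retry/repair for malformed JSON
    assert 1 <= result["score"] <= 5, "judge returned an out-of-range score"
    return result
```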

### Pairwise Comparison Protocol

1. First pass: A in first position
2. Second pass: B in first position (swap)
3. Consistency check: If passes disagree → TIE
4. Final verdict: Consistent winner with averaged confidence
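
The sketch below follows the four steps, again assuming a caller-supplied `call_llm`; the prompt wording and verdict letters are illustrative, and confidence averaging is omitted for brevity.

```python
PAIRWISE_PROMPT = """Compare the two responses below on: {criterion}

Response 1:
{first}

Response 2:
{second}

Answer with exactly "1" or "2" for the better response."""

def pairwise_compare(a: str, b: str, criterion: str, call_llm) -> str:
    """Position-swapped pairwise comparison with a consistency check."""
    # Pass 1: A in first position.
    v1 = call_llm(PAIRWISE_PROMPT.format(criterion=criterion, first=a, second=b)).strip()
    # Pass 2: B in first position (swap).
    v2 = call_llm(PAIRWISE_PROMPT.format(criterion=criterion, first=b, second=a)).strip()
    winner1 = "A" if v1 == "1" else "B"
    winner2 = "B" if v2 == "1" else "A"  # positions are swapped in pass 2
    # Consistency check: disagreement between passes is recorded as a tie.
    return winner1 if winner1 == winner2 else "TIE"
```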

### Rubric Components

- Level descriptions with clear boundaries
- Observable characteristics per level
- Edge case guidance
- Strictness calibration (lenient/balanced/strict)
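
One way to represent these components as data so rubrics can be versioned and injected into judge prompts; the field names and the example rubric are assumptions, not a required schema.

```python
from dataclasses import dataclass, field

@dataclass
class RubricLevel:
    score: int
    description: str               # clear boundary for this level
    observable_markers: list[str]  # characteristics a grader can point to

@dataclass
class Rubric:
    criterion: str
    levels: list[RubricLevel]
    edge_cases: list[str] = field(default_factory=list)
    strictness: str = "balanced"   # "lenient" | "balanced" | "strict"

factual_accuracy = Rubric(
    criterion="factual accuracy",
    levels=[
        RubricLevel(1, "Contains major factual errors", ["contradicts cited source"]),
        RubricLevel(3, "Mostly accurate with minor slips", ["one unsupported claim"]),
        RubricLevel(5, "Fully accurate and supported", ["every claim traceable to evidence"]),
    ],
    edge_cases=["Treat unverifiable claims as unsupported, not false"],
)
```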

## Integration

Works with:

- **context-fundamentals** - Effective context structure
- **tool-design** - Evaluation tool schemas
- **evaluation** (foundational) - Core evaluation concepts

---

**For detailed implementation patterns, prompt templates, examples, and metrics:** `references/full-guide.md`

See also: `references/implementation-patterns.md`, `references/bias-mitigation.md`, `references/metrics-guide.md`

Overview

This skill teaches advanced LLM-as-a-Judge evaluation techniques for scoring and comparing AI-generated outputs. It covers direct scoring, pairwise comparison, rubric generation, and bias mitigation to help teams build reliable, repeatable evaluation systems. Use it to standardize quality checks, run A/B tests, and reduce common evaluator biases.

How this skill works

The skill provides decision rules and protocols that select the right evaluation method based on whether objective ground truth exists. It defines calibrated scales, chain-of-thought justifications, and swap-based pairwise comparison to improve reliability. It also documents common biases (position, length, self-enhancement, verbosity, authority) and practical mitigations to reduce false positives and evaluator drift.

When to use it

  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to choose a winner
  • Creating rubrics for human or automated graders
  • Designing A/B tests for prompt or model changes
  • Debugging inconsistent or noisy evaluation results

Best practices

  • Choose direct scoring for objective criteria and pairwise comparison for subjective preferences
  • Define clear, observable rubric levels and calibrate on examples before full runs
  • Require justification (chain-of-thought) before giving scores to improve reliability
  • Swap positions in pairwise tests and perform consistency checks; mark ties when inconsistent
  • Use separate models for evaluation and generation to reduce self-enhancement bias
  • Normalize for length and demand evidence for confident claims to avoid authority bias

Example use cases

  • Automated factual accuracy checks: direct-score factual claims with a 1–5 calibrated scale
  • Style preference testing: run pairwise comparisons with position swaps and consistency rules
  • Rubric creation: produce level descriptions, observable markers, and edge-case guidance for human graders
  • Model selection: compare outputs from multiple models across scaled criteria and aggregate winners
  • Evaluation pipeline debugging: identify bias sources by running bias-mitigation experiments
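
For the model-selection use case, one simple way to aggregate pairwise verdicts into per-model win rates is sketched below; counting a tie as half a win is a common convention, not a requirement, and the model names are placeholders.

```python
from collections import Counter

def win_rates(verdicts: list[tuple[str, str, str]]) -> dict[str, float]:
    """Aggregate pairwise verdicts (model_a, model_b, winner) into win rates.

    `winner` is "A", "B", or "TIE"; a tie credits each model with half a win.
    """
    wins: Counter = Counter()
    games: Counter = Counter()
    for model_a, model_b, winner in verdicts:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "A":
            wins[model_a] += 1
        elif winner == "B":
            wins[model_b] += 1
        else:  # TIE
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}

# Placeholder model names for illustration.
print(win_rates([("model-x", "model-y", "A"), ("model-x", "model-y", "TIE")]))
```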

FAQ

When should I prefer pairwise comparison over direct scoring?

Prefer pairwise comparison when there is no objective ground truth and preference, tone, or creativity matter; it yields more consistent judgments for subjective criteria.

How do I handle inconsistent outcomes between swapped pairwise passes?

Treat inconsistent outcomes as ties. Optionally run a third adjudication pass or average confidence scores after addressing prompt ambiguities.
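
A sketch of that tie-break logic, under assumptions: each pass returns a (winner, confidence) tuple, and `run_third_pass` is a callable you supply that re-runs the comparison, ideally after clarifying the prompt.

```python
def adjudicate(pass1: tuple[str, float], pass2: tuple[str, float],
               run_third_pass) -> tuple[str, float]:
    """Resolve swapped pairwise passes; escalate to a third pass on disagreement."""
    (w1, c1), (w2, c2) = pass1, pass2
    if w1 == w2:
        return w1, (c1 + c2) / 2   # consistent winner with averaged confidence
    w3, c3 = run_third_pass()      # adjudication pass
    if w3 in (w1, w2):
        return w3, c3              # third pass breaks the disagreement
    return "TIE", 0.0              # still inconsistent: record a tie
```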