
agent-evaluation skill

/agent-evaluation

This skill helps evaluate and improve Claude Code agents by applying multi-dimensional rubrics, accounting for non-determinism between runs, and validating context engineering choices.

npx playbooks add skill zpankz/mcp-skillset --skill agent-evaluation

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
---
name: agent-evaluation
description: Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.
---

# Evaluation Methods for Claude Code Agents

Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.

## Core Concepts

Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.

The key insight is that agents may find alternative paths to the same goal; evaluation should judge whether they achieve the right outcomes while following reasonable processes.
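
A minimal sketch of such a rubric in Python, using the dimensions listed above. The weights and the 0-5 scale are illustrative assumptions, not values the skill prescribes.

```python
from dataclasses import dataclass, field

# Dimensions from the rubric above; the weights are illustrative assumptions.
RUBRIC_WEIGHTS = {
    "factual_accuracy": 0.30,
    "completeness": 0.25,
    "citation_accuracy": 0.15,
    "source_quality": 0.15,
    "tool_efficiency": 0.15,
}

@dataclass
class RubricScore:
    """Per-dimension scores for a single agent run, each on a 0-5 scale."""
    scores: dict[str, float] = field(default_factory=dict)

    def weighted_total(self) -> float:
        """Collapse the dimensions into one 0-1 quality score."""
        return round(sum(
            RUBRIC_WEIGHTS[dim] * (self.scores.get(dim, 0.0) / 5.0)
            for dim in RUBRIC_WEIGHTS
        ), 3)

# Example: accurate and fairly complete, but weak citations and inefficient tool use.
run = RubricScore(scores={
    "factual_accuracy": 5, "completeness": 4,
    "citation_accuracy": 2, "source_quality": 3, "tool_efficiency": 2,
})
print(run.weighted_total())  # 0.71
```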

**Performance Drivers: The 95% Finding**
Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:

| Factor | Variance Explained | Implication |
|--------|-------------------|-------------|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |

Implications for Claude Code development:

- **Token budgets matter**: Evaluate with realistic token constraints (a minimal instrumentation sketch follows this list)
- **Model upgrades beat token increases**: Upgrading to a stronger model typically yields larger gains than raising the token budget on the current one
- **Multi-agent architectures are supported**: Because token capacity dominates, distributing work across subagents with separate context windows effectively expands the available budget
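
To make these drivers measurable, one option is to record them per run next to the rubric score. The record shape and summary below are a hedged sketch; the field names are assumptions, not an interface the skill defines.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class RunRecord:
    """Observable performance drivers from the table above, one per agent run."""
    task_id: str
    model: str          # model identifier used for the run
    tokens_used: int    # total input + output tokens consumed
    tool_calls: int     # number of tool invocations made
    score: float        # rubric score for the run, 0-1

def summarize(runs: list[RunRecord]) -> dict:
    """Aggregate drivers so token- and tool-budget regressions are easy to spot."""
    return {
        "runs": len(runs),
        "mean_score": round(mean(r.score for r in runs), 3),
        "mean_tokens": round(mean(r.tokens_used for r in runs)),
        "mean_tool_calls": round(mean(r.tool_calls for r in runs), 1),
        "models": sorted({r.model for r in runs}),
    }
```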

## Progressive Loading

**L2 Content** (loaded when methodology details needed):
- See: [references/methodologies.md](./references/methodologies.md)
  - Evaluation Challenges
  - Evaluation Rubric Design
  - Evaluation Methodologies
  - Test Set Design
  - Context Engineering Evaluation

**L3 Content** (loaded when advanced techniques or examples needed):
- See: [references/advanced.md](./references/advanced.md)
  - Advanced Evaluation: LLM-as-Judge
  - Evaluation Metrics Reference
  - Bias Mitigation Techniques
  - LLM-as-Judge Implementation Patterns
  - Metric Selection Guide
  - Practical Examples

Overview

This skill evaluates and improves Claude Code commands, skills, and agents by measuring outcome quality and identifying actionable improvements. It focuses on testing prompt effectiveness, validating context engineering choices, and tracking measurable gains over iterations. The skill is designed for continuous validation and regression detection in agent-driven systems.

How this skill works

The skill runs multi-dimensional evaluations that judge outcomes rather than single canonical answers, accounting for non-determinism and multiple valid solution paths. It uses rubrics covering factual accuracy, completeness, citation and source quality, and tool efficiency, and supports LLM-as-judge workflows alongside human review for edge cases. It also measures performance drivers such as token usage, tool-call patterns, and model choice to explain variance and prioritize fixes.
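
A rough sketch of the LLM-as-judge step with low-confidence routing to human review. The judge prompt, the JSON response shape, and the call_judge_model callable are assumptions for illustration; plug in whatever model client and rubric you actually use.

```python
import json

# Hypothetical judge prompt; the JSON shape it requests is an assumption.
JUDGE_PROMPT = """You are grading an agent's answer against a rubric.
Task: {task}
Agent answer: {answer}
Score each dimension 0-5 (factual_accuracy, completeness, citation_accuracy,
source_quality, tool_efficiency) and state your confidence from 0 to 1.
Respond with JSON only: {{"scores": {{...}}, "confidence": 0.0, "rationale": ""}}"""

def judge(task: str, answer: str, call_judge_model, confidence_floor: float = 0.7) -> dict:
    """Score one answer; route low-confidence judgments to a human review queue."""
    raw = call_judge_model(JUDGE_PROMPT.format(task=task, answer=answer))
    verdict = json.loads(raw)
    if verdict.get("confidence", 0.0) < confidence_floor:
        return {"status": "needs_human_review", "verdict": verdict}
    return {"status": "auto_scored", "verdict": verdict}
```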

When to use it

  • When testing new Claude Code prompts or command templates for effectiveness
  • When validating context engineering changes that alter agent memory or tool access
  • When comparing model or token-budget trade-offs to prioritize upgrades
  • When running regression checks after changes to agent orchestration or subagent flows
  • When building test sets to measure real-world task success and reliability

Best practices

  • Evaluate against realistic token budgets to reflect production constraints
  • Use multi-dimensional rubrics to capture accuracy, completeness, and source quality
  • Combine automated LLM-as-judge scoring with targeted human reviews for edge cases
  • Track token usage and tool-call patterns to identify inefficiencies and improvement opportunities
  • Run multi-seed, multi-run tests to capture non-deterministic behavior and variance (see the sketch after this list)
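
A minimal multi-run harness along those lines, assuming you supply a run_agent(prompt, seed) callable and a score(output) function; the seed range and statistics shown are illustrative.

```python
from statistics import mean, median, pstdev

def evaluate_variant(prompt: str, run_agent, score, seeds=range(5)) -> dict:
    """Run the same prompt under several seeds and report the score spread."""
    scores = [score(run_agent(prompt, seed=s)) for s in seeds]
    return {
        "runs": len(scores),
        "mean": round(mean(scores), 3),
        "median": round(median(scores), 3),
        "stdev": round(pstdev(scores), 3),  # spread surfaces non-determinism
    }
```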

Example use cases

  • Compare two prompt variants across 50 runs to determine which yields higher factual completeness under a fixed token budget
  • Validate a new context window strategy by measuring downstream task success and token consumption
  • Use LLM-as-judge to rapidly score large test suites while routing low-confidence cases to human reviewers
  • Benchmark a model upgrade versus increasing token budget to decide the most cost-effective performance uplift
  • Test multi-agent orchestration by evaluating whether distributed subagents achieve the same outcomes with lower peak tokens

FAQ

How do I account for non-determinism in agent outputs?

Run multiple seeds per test case, aggregate scores (mean/median), and inspect variance; use rubrics to accept multiple valid paths.

Should I prioritize token budget or model upgrades?

Model upgrades often provide larger efficiency gains, but test both: measure token-driven gains and compare against model changes using the same test set.
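
As a concrete way to "test both", the sketch below scores two configurations over the same test set and reports the delta. The Config fields and the run_and_score helper are hypothetical placeholders for whatever harness you already have.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Config:
    model: str       # model identifier
    max_tokens: int  # token budget per run

def compare(test_set, run_and_score, baseline: Config, candidate: Config) -> dict:
    """Score both configurations on the same tasks and report the mean delta."""
    base = [run_and_score(task, baseline) for task in test_set]
    cand = [run_and_score(task, candidate) for task in test_set]
    return {
        "baseline_mean": round(mean(base), 3),
        "candidate_mean": round(mean(cand), 3),
        "delta": round(mean(cand) - mean(base), 3),
    }

# Run this twice: once holding the token budget fixed and swapping in a stronger
# model, once keeping the current model and raising the budget, then compare deltas.
```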