---
name: agent-evaluation
description: Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.
---
# Evaluation Methods for Claude Code Agents
Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.
## Core Concepts
Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.
The key insight is that agents may find alternative paths to the same goal; evaluation should judge whether they achieve the right outcome while following a reasonable process.
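A multi-dimensional rubric can be expressed as a set of weighted scoring dimensions. The sketch below is one possible shape for such a rubric, assuming a simple dataclass-based harness; the dimension names come from the list above, but the weights and descriptions are illustrative, not prescribed by this skill.

```python
from dataclasses import dataclass, field

@dataclass
class RubricDimension:
    name: str
    weight: float        # relative importance; weights should sum to 1.0
    description: str     # what the judge looks for on this dimension

@dataclass
class Rubric:
    dimensions: list = field(default_factory=list)

    def score(self, dimension_scores: dict) -> float:
        """Weighted aggregate of per-dimension scores, each in [0, 1]."""
        return sum(d.weight * dimension_scores.get(d.name, 0.0) for d in self.dimensions)

# Illustrative dimensions and weights; adjust per task.
AGENT_RUBRIC = Rubric([
    RubricDimension("factual_accuracy", 0.30, "Claims match the underlying sources."),
    RubricDimension("completeness", 0.25, "Every part of the task is addressed."),
    RubricDimension("citation_accuracy", 0.15, "Citations point to content that supports them."),
    RubricDimension("source_quality", 0.15, "Sources are authoritative and relevant."),
    RubricDimension("tool_efficiency", 0.15, "Tool calls are purposeful, not redundant."),
])
```

Keeping the aggregation separate from the per-dimension scores lets the same rubric be filled in by either an LLM judge or a human reviewer.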
**Performance Drivers: The 95% Finding**
Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:
| Factor | Variance Explained | Implication |
|--------|-------------------|-------------|
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications for Claude Code development:
- **Token budgets matter**: Evaluate with realistic token constraints
- **Model upgrades beat token increases**: Upgrading models provides larger gains than increasing token budgets
- **Multi-agent designs gain support**: the finding validates architectures that distribute work across subagents, since separate context windows effectively expand the usable token budget (a sketch for recording these drivers per run follows below)
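To connect these findings to your own evaluations, record the three drivers alongside each score. A minimal sketch, assuming your harness can report token counts and tool-call counts per run; the field names and grouping are hypothetical, not part of this skill.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    test_case: str
    model: str
    tokens_used: int     # total input + output tokens for the run
    tool_calls: int      # number of tool invocations
    score: float         # rubric score in [0, 1]

def summarize(records: list) -> dict:
    """Group runs by model so token-driven and model-driven gains can be compared."""
    by_model: dict = {}
    for r in records:
        by_model.setdefault(r.model, []).append(r)
    return {
        model: {
            "mean_score": sum(r.score for r in rs) / len(rs),
            "mean_tokens": sum(r.tokens_used for r in rs) / len(rs),
            "mean_tool_calls": sum(r.tool_calls for r in rs) / len(rs),
        }
        for model, rs in by_model.items()
    }
```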
## Progressive Loading
**L2 Content** (loaded when methodology details needed):
- See: [references/methodologies.md](./references/methodologies.md)
- Evaluation Challenges
- Evaluation Rubric Design
- Evaluation Methodologies
- Test Set Design
- Context Engineering Evaluation
**L3 Content** (loaded when advanced techniques or examples needed):
- See: [references/advanced.md](./references/advanced.md)
- Advanced Evaluation: LLM-as-Judge
- Evaluation Metrics Reference
- Bias Mitigation Techniques
- LLM-as-Judge Implementation Patterns
- Metric Selection Guide
- Practical Examples
This skill evaluates and improves Claude Code commands, skills, and agents by measuring outcome quality and identifying actionable improvements. It focuses on testing prompt effectiveness, validating context engineering choices, and tracking measurable gains over iterations. The skill is designed for continuous validation and regression detection in agent-driven systems.
The skill runs multi-dimensional evaluations that judge outcomes rather than single canonical answers, accounting for non-determinism and multiple valid solution paths. It uses rubrics covering factual accuracy, completeness, citation and source quality, and tool efficiency, and supports LLM-as-judge workflows alongside human review for edge cases. It also measures performance drivers such as token usage, tool-call patterns, and model choice to explain variance and prioritize fixes.
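One way to implement the LLM-as-judge step is to ask a model to score a transcript against the rubric and return structured JSON. A minimal sketch using the Anthropic Python SDK; the model name, prompt wording, and JSON contract are placeholders, not something this skill specifies.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are grading an agent transcript against a rubric.
Rubric dimensions: factual_accuracy, completeness, citation_accuracy, source_quality, tool_efficiency.
Return only JSON: {{"scores": {{"<dimension>": <0.0-1.0>, ...}}, "rationale": "<one paragraph>"}}

Task given to the agent:
{task}

Agent transcript:
{transcript}
"""

def judge(task: str, transcript: str, model: str = "claude-sonnet-4-5") -> dict:
    """Ask the judge model for per-dimension scores; the caller aggregates via the rubric."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(task=task, transcript=transcript)}],
    )
    # Assumes the model returns bare JSON; a production harness would validate and retry.
    return json.loads(response.content[0].text)
```

Bias mitigation for judge models (position, verbosity, self-preference) is covered in the L3 reference above.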
**How do I account for non-determinism in agent outputs?**
Run each test case multiple times, aggregate the scores (mean/median), and inspect the variance; use rubrics that accept multiple valid paths.
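A sketch of that multi-run aggregation, assuming each run returns a rubric score in [0, 1]; the `run_agent` callable is hypothetical.

```python
import statistics

def evaluate_with_repeats(run_agent, test_case: str, runs: int = 5) -> dict:
    """Run the same test case several times and report central tendency and spread."""
    scores = [run_agent(test_case) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "scores": scores,
    }
```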
**Should I prioritize token budget or model upgrades?**
Model upgrades often provide larger efficiency gains, but test both: measure token-driven gains and compare them against model changes on the same test set.
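A sketch of that comparison, reusing the `evaluate_with_repeats` helper above on a shared test set; the configuration labels and the `run_agent_factory` callable are illustrative assumptions.

```python
def compare_configs(run_agent_factory, test_cases: list, configs: dict) -> dict:
    """Evaluate every configuration on the same test set so gains are directly comparable."""
    results = {}
    for label, config in configs.items():
        run_agent = run_agent_factory(**config)   # e.g. model name, token budget
        per_case = [evaluate_with_repeats(run_agent, case) for case in test_cases]
        results[label] = sum(r["mean"] for r in per_case) / len(per_case)
    return results

# Illustrative usage: a larger token budget versus a model upgrade.
# results = compare_configs(make_agent, TEST_CASES, {
#     "baseline":      {"model": "model-a", "max_tokens": 8_000},
#     "bigger_budget": {"model": "model-a", "max_tokens": 16_000},
#     "better_model":  {"model": "model-b", "max_tokens": 8_000},
# })
```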