
This skill helps you evaluate agent performance, design test sets, and apply multi-dimensional rubrics to improve quality and reliability.

npx playbooks add skill eyadsibai/ltk --skill agent-evaluation

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
3.6 KB
---
name: agent-evaluation
description: Use when evaluating agent performance, building test frameworks, measuring quality, or asking about "agent evaluation", "LLM-as-judge", "agent testing", "quality metrics", "evaluation rubrics", "agent benchmarks"
version: 1.0.0
---

# Agent Evaluation Methods

Agent evaluation requires different approaches than traditional software testing: agents are non-deterministic, may take different valid paths to the same goal, and often have no single correct answer.

## Key Finding: 95% Performance Drivers

Research on the BrowseComp benchmark found that three factors explain roughly 95% of the variance in agent performance:

| Factor | Variance explained | Implication |
|--------|--------------------|-------------|
| Token usage | ~80% | More tokens generally means better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply the value of each token |

**Implications**: Upgrading the model tends to pay off more than simply raising the token budget, and the strong token-usage effect validates multi-agent architectures that spend additional tokens in parallel.
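
To check whether these drivers hold for your own agent, you can regress evaluation scores on logged token counts and tool calls. This is a minimal sketch, assuming per-run logs with hypothetical `tokens`, `tool_calls`, and `score` fields:

```python
import numpy as np

def variance_explained(runs):
    """R^2 of a linear fit of judge score on token usage and tool calls.

    `runs` is a list of dicts with "tokens", "tool_calls", and "score" keys
    (hypothetical field names -- adapt to your own evaluation logs).
    """
    X = np.array([[r["tokens"], r["tool_calls"]] for r in runs], dtype=float)
    y = np.array([r["score"] for r in runs], dtype=float)
    A = np.column_stack([X, np.ones(len(X))])      # add an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
    residual = y - A @ coef
    return 1.0 - residual.var() / y.var()
```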

## Multi-Dimensional Rubric

| Dimension | Excellent | Good | Acceptable | Failed |
|-----------|-----------|------|------------|--------|
| Factual accuracy | All claims correct | Minor errors | Some errors | Major errors |
| Completeness | All aspects covered | Most aspects | Key aspects only | Key aspects missing |
| Citation accuracy | All citations match | Most match | Some match | Wrong or missing |
| Tool efficiency | Optimal | Good | Adequate | Wasteful |
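
To score against this rubric programmatically, each level can be mapped to a number per dimension. The 1.0/0.75/0.5/0.0 values below are an assumption, not part of the rubric itself:

```python
# Hypothetical numeric mapping for the rubric levels above.
RUBRIC_LEVELS = {"excellent": 1.0, "good": 0.75, "acceptable": 0.5, "failed": 0.0}

DIMENSIONS = ["factual_accuracy", "completeness", "citation_accuracy", "tool_efficiency"]

def score_rubric(ratings):
    """Convert per-dimension level ratings (e.g. {"completeness": "good", ...}) to numbers."""
    scores = {dim: RUBRIC_LEVELS[ratings[dim]] for dim in DIMENSIONS}
    scores["overall"] = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return scores
```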

## LLM-as-Judge

```python
# Prompt template for an LLM-as-judge pass; fill the placeholders with str.format().
evaluation_prompt = """
Task: {task_description}
Agent Output: {agent_output}
Ground Truth: {ground_truth}

Evaluate on:
1. Factual accuracy (0-1)
2. Completeness (0-1)
3. Citation accuracy (0-1)
4. Tool efficiency (0-1)

Provide scores and reasoning.
"""
```
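
A judge can wrap this prompt, call a model, and parse the result. The sketch below assumes a placeholder `call_llm` client and a JSON response format (both assumptions, not part of the prompt above); it returns per-dimension scores plus an `overall` average, matching the pipeline further down.

```python
import json

def llm_judge(agent_output, test):
    """Score one agent output against the rubric using the prompt above.

    `call_llm` is a placeholder for your model client: any callable that takes
    a prompt string and returns the judge model's text response.
    """
    prompt = evaluation_prompt.format(
        task_description=test["input"],
        agent_output=agent_output,
        ground_truth=test.get("ground_truth", "none provided"),
    )
    # Asking for JSON on top of the template keeps the scores machine-readable.
    response = call_llm(prompt + "\nRespond with only a JSON object containing the four scores.")
    scores = json.loads(response)                    # e.g. {"factual_accuracy": 0.9, ...}
    scores["overall"] = sum(scores.values()) / len(scores)
    return scores
```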

## Test Set Design

```python
# Stratified test set: one case per complexity level. A "ground_truth" field
# can be added per case when a reference answer exists.
test_set = [
    {"name": "simple", "complexity": "simple",
     "input": "What is the capital of France?"},
    {"name": "medium", "complexity": "medium",
     "input": "Compare Apple and Microsoft revenue"},
    {"name": "complex", "complexity": "complex",
     "input": "Analyze Q1-Q4 sales trends"},
    {"name": "very_complex", "complexity": "very_complex",
     "input": "Research AI tech, evaluate impact, recommend strategy"}
]
```

## Evaluation Pipeline

```python
def evaluate_agent(agent, test_set):
    """Run each test case through the agent and score the output with an LLM judge."""
    results = []
    for test in test_set:
        output = agent.run(test["input"])
        scores = llm_judge(output, test)  # per-dimension scores plus "overall"
        results.append({
            "test": test["name"],
            "scores": scores,
            "passed": scores["overall"] >= 0.7,  # pass/fail threshold
        })
    return results
```
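
Results can then be rolled up by complexity level to show where quality degrades. This sketch assumes the `test_set` structure above and an `agent` object with a `.run()` method:

```python
from collections import defaultdict

def summarize_by_complexity(results, test_set):
    """Pass rate per complexity level, joining results back to the test set by name."""
    complexity_by_name = {t["name"]: t["complexity"] for t in test_set}
    buckets = defaultdict(list)
    for r in results:
        buckets[complexity_by_name[r["test"]]].append(r["passed"])
    return {level: sum(passed) / len(passed) for level, passed in buckets.items()}

results = evaluate_agent(agent, test_set)        # `agent` is your agent instance
print(summarize_by_complexity(results, test_set))
```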

## Complexity Stratification

| Level | Characteristics |
|-------|-----------------|
| Simple | Single tool call |
| Medium | Multiple tool calls |
| Complex | Many tool calls, ambiguous requirements |
| Very Complex | Extended interaction, deep reasoning |

## Context Engineering Evaluation

Test context strategies systematically (a comparison sketch follows this list):

1. Run agents with different strategies on same tests
2. Compare quality scores, token usage, efficiency
3. Identify degradation cliffs at different context sizes
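
A minimal harness for steps 1 and 2 can reuse the pipeline above. The strategy names and agent factories below are hypothetical; plug in however your system builds an agent for a given context strategy:

```python
def compare_strategies(strategies, test_set):
    """Run the same tests under each context strategy and compare scores and pass rates.

    `strategies` maps a strategy name to a factory that builds an agent
    configured with that strategy (a hypothetical interface).
    """
    report = {}
    for name, make_agent in strategies.items():
        results = evaluate_agent(make_agent(), test_set)
        overall = [r["scores"]["overall"] for r in results]
        report[name] = {
            "avg_score": sum(overall) / len(overall),
            "pass_rate": sum(r["passed"] for r in results) / len(results),
        }
    return report

# e.g. compare_strategies({"full_history": build_full_history_agent,
#                          "summarized": build_summarizing_agent}, test_set)
```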

## Continuous Evaluation

- Run evaluations on all agent changes
- Track metrics over time
- Set alerts for quality drops (see the tracking sketch below)
- Sample production interactions
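
One lightweight way to track runs and alert on drops is to append each run's pass rate to a history file and flag large regressions. The storage format, file name, and 10-point threshold below are all assumptions:

```python
import json
import time
from pathlib import Path

HISTORY_FILE = Path("eval_history.jsonl")   # hypothetical storage location

def record_run(results, alert_drop=0.10):
    """Append this run's pass rate to a JSONL history and flag large drops."""
    pass_rate = sum(r["passed"] for r in results) / len(results)
    history = []
    if HISTORY_FILE.exists():
        history = [json.loads(line) for line in HISTORY_FILE.read_text().splitlines()]
    if history and pass_rate < history[-1]["pass_rate"] - alert_drop:
        print(f"ALERT: pass rate dropped from {history[-1]['pass_rate']:.2f} to {pass_rate:.2f}")
    with HISTORY_FILE.open("a") as f:
        f.write(json.dumps({"timestamp": time.time(), "pass_rate": pass_rate}) + "\n")
    return pass_rate
```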

## Avoiding Pitfalls

| Pitfall | Solution |
|---------|----------|
| Path overfitting | Evaluate outcomes, not steps |
| Ignoring edge cases | Include diverse scenarios |
| Single metric | Multi-dimensional rubrics |
| Ignoring context | Test realistic context sizes |
| No human review | Supplement automated eval |

## Best Practices

1. Use multi-dimensional rubrics
2. Evaluate outcomes, not specific paths
3. Cover complexity levels
4. Test with realistic context sizes
5. Run evaluations continuously
6. Supplement LLM judges with human review
7. Track metrics over time to spot trends
8. Set clear pass/fail thresholds

Overview

This skill helps teams evaluate AI agents using structured, repeatable methods and rubrics. It focuses on building test sets, running LLM-as-judge evaluations, tracking multi-dimensional quality metrics, and monitoring performance over time. The goal is measurable, defensible agent quality assessment across simple to very complex tasks.

How this skill works

Define a stratified test set covering simple to very complex scenarios, run the agent on each case, and have an LLM or automated judge score the outputs on dimensions like factual accuracy, completeness, citation accuracy, and tool efficiency. Aggregate per-test scores, compute pass/fail against thresholds, and track token and tool usage to explain performance variance. Run evaluations continuously and compare context-engineering strategies or model versions over time.

When to use it

  • Validating new agent designs or model upgrades
  • Building an automated agent test framework or CI job
  • Diagnosing regressions after code or prompt changes
  • Comparing context strategies or tool-usage policies
  • Measuring agent readiness before production rollout

Best practices

  • Use a multi-dimensional rubric (accuracy, completeness, citations, efficiency)
  • Stratify tests by complexity: simple, medium, complex, very complex
  • Evaluate outcomes rather than enforcing exact internal paths
  • Track tokens and tool calls; they explain most performance variance
  • Run continuous evaluations and alert on metric drops
  • Supplement automated judgments with periodic human review

Example use cases

  • Create a test set that ranges from factual lookups to multi-step research tasks and run nightly evaluations
  • Use an LLM-as-judge prompt to score agent outputs across four rubric dimensions and store results in a dashboard
  • Compare two context-engineering strategies by running the same tests and measuring score deltas and token usage
  • Set pass/fail thresholds (e.g., overall >= 0.7) and gate deployments on evaluation results
  • Investigate a performance regression by correlating token counts, tool calls, and model versions

FAQ

What explains most agent performance variance?

Token usage accounts for the largest share, followed by tool calls; model choice has smaller but multiplicative effects.

Should I rely only on automated LLM judges?

Automated judges scale evaluation, but periodic human review is important to catch edge cases, citation errors, and subtle quality issues.