
agent-evaluation skill

/.agent-skills/agent-evaluation

This skill helps you design, implement, and maintain comprehensive AI agent evaluation systems across coding, conversational, and research domains.

npx playbooks add skill supercent-io/skills-template --skill agent-evaluation

---
name: agent-evaluation
description: Design and implement comprehensive evaluation systems for AI agents. Use when building evals for coding agents, conversational agents, research agents, or computer-use agents. Covers grader types, benchmarks, 8-step roadmap, and production integration.
allowed-tools: [Read, Write, Shell, Grep, Glob]
tags: [agent-evaluation, evals, AI-agents, benchmarks, graders, testing, quality-assurance]
platforms: [Claude, ChatGPT, Gemini]
---

# Agent Evaluation (AI Agent Evals)

> Based on Anthropic's "Demystifying evals for AI agents"

## When to use this skill

- Designing evaluation systems for AI agents
- Building benchmarks for coding, conversational, or research agents
- Creating graders (code-based, model-based, human)
- Implementing production monitoring for AI systems
- Setting up CI/CD pipelines with automated evals
- Debugging agent performance issues
- Measuring agent improvement over time

## Core Concepts

### Eval Evolution: Single-turn → Multi-turn → Agentic

| Type | Turns | State | Grading | Complexity |
|------|-------|-------|---------|------------|
| **Single-turn** | 1 | None | Simple | Low |
| **Multi-turn** | N | Conversation | Per-turn | Medium |
| **Agentic** | N | World + History | Outcome | High |

### 7 Key Terms

| Term | Definition |
|------|------------|
| **Task** | Single test case (prompt + expected outcome) |
| **Trial** | One agent run on a task |
| **Grader** | Scoring function (code/model/human) |
| **Transcript** | Full record of agent actions |
| **Outcome** | Final state for grading |
| **Harness** | Infrastructure running evals |
| **Suite** | Collection of related tasks |
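
These concepts map onto simple data structures. A minimal sketch, assuming a dict/dataclass layout; the field names here are illustrative, not a required schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Single test case: prompt plus expected outcome."""
    id: str
    prompt: str
    expected_outcome: dict
    tags: list[str] = field(default_factory=list)

@dataclass
class Trial:
    """One agent run on a task."""
    task_id: str
    transcript: list[dict]  # full record of agent actions (messages, tool calls)
    outcome: dict           # final state handed to the grader
    score: float = 0.0      # filled in by a grader

# A suite is simply a collection of related tasks
suite: list[Task] = [
    Task(id="fizzbuzz-001",
         prompt="Write a fizzbuzz function in Python",
         expected_outcome={"tests_passed": 4, "total_tests": 4}),
]
```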

## Instructions

### Step 1: Understand Grader Types

#### Code-based Graders (Recommended for Coding Agents)
- **Pros**: Fast, objective, reproducible
- **Cons**: Requires clear success criteria
- **Best for**: Coding agents, structured outputs

```python
# Example: Code-based grader
import subprocess

def grade_task(outcome: dict) -> float:
    """Grade coding task by test passage."""
    tests_passed = outcome.get("tests_passed", 0)
    total_tests = outcome.get("total_tests", 1)
    return tests_passed / total_tests

# SWE-bench style grader
def grade_swe_bench(repo_path: str, test_spec: dict) -> bool:
    """Run tests and check if patch resolves issue."""
    result = subprocess.run(
        ["pytest", test_spec["test_file"]],
        cwd=repo_path,
        capture_output=True
    )
    return result.returncode == 0
```

#### Model-based Graders (LLM-as-Judge)
- **Pros**: Flexible, handles nuance
- **Cons**: Requires calibration, can be inconsistent
- **Best for**: Conversational agents, open-ended tasks

```yaml
# Example: LLM Rubric for Customer Support Agent
rubric:
  dimensions:
    - name: empathy
      weight: 0.3
      scale: 1-5
      criteria: |
        5: Acknowledges emotions, uses warm language
        3: Polite but impersonal
        1: Cold or dismissive

    - name: resolution
      weight: 0.5
      scale: 1-5
      criteria: |
        5: Fully resolves issue
        3: Partial resolution
        1: No resolution

    - name: efficiency
      weight: 0.2
      scale: 1-5
      criteria: |
        5: Resolved in minimal turns
        3: Reasonable turns
        1: Excessive back-and-forth
```
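
The dimension weights are meant to be combined into a single score. A minimal sketch of that aggregation, assuming the judge returns a 1-5 score per dimension and that the final score is normalized to 0-1 (one possible convention):

```python
def aggregate_rubric_score(dimension_scores: dict[str, float],
                           weights: dict[str, float]) -> float:
    """Combine per-dimension 1-5 judge scores into a weighted 0-1 score."""
    total = 0.0
    for name, weight in weights.items():
        raw = dimension_scores[name]       # 1-5 from the LLM judge
        total += weight * (raw - 1) / 4    # normalize 1-5 to 0-1
    return total

weights = {"empathy": 0.3, "resolution": 0.5, "efficiency": 0.2}
aggregate_rubric_score({"empathy": 5, "resolution": 3, "efficiency": 4}, weights)
# -> 0.3 * 1.0 + 0.5 * 0.5 + 0.2 * 0.75 = 0.7
```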

#### Human Graders
- **Pros**: Highest accuracy, catches edge cases
- **Cons**: Expensive, slow, not scalable
- **Best for**: Final validation, ambiguous cases
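
Because human grading is slow and expensive, a common pattern is to reserve it for ambiguous or low-confidence trials. A minimal sketch of such routing, assuming each trial carries a model-grader `score` and a `grader_confidence` field (both illustrative names):

```python
def select_for_human_review(trials: list[dict],
                            confidence_threshold: float = 0.7,
                            borderline_band: tuple[float, float] = (0.4, 0.6)) -> list[dict]:
    """Queue trials where automated grading is least trustworthy."""
    queue = []
    for trial in trials:
        low_confidence = trial.get("grader_confidence", 1.0) < confidence_threshold
        lo, hi = borderline_band
        borderline = lo <= trial.get("score", 0.0) <= hi
        if low_confidence or borderline:
            queue.append(trial)
    return queue
```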

### Step 2: Choose Strategy by Agent Type

#### 2.1 Coding Agents

**Benchmarks**:
- SWE-bench Verified: Real GitHub issues; reported resolve rates have climbed from roughly 40% to 80%+ as agents have improved
- Terminal-Bench: Complex terminal tasks
- Custom test suites with your codebase

**Grading Strategy**:
```python
def grade_coding_agent(task: dict, outcome: dict) -> dict:
    return {
        "tests_passed": run_test_suite(outcome["code"]),
        "lint_score": run_linter(outcome["code"]),
        "builds": check_build(outcome["code"]),
        "matches_spec": compare_to_reference(task["spec"], outcome["code"])
    }
```

**Key Metrics**:
- Test passage rate
- Build success
- Lint/style compliance
- Diff size (smaller is better)
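
If a single scalar is needed for tracking progress, these metrics can be folded into a weighted sum. A minimal sketch, reusing the keys from the grading strategy above and assuming each value is normalized to 0-1; the weights and the `diff_lines` field are illustrative:

```python
def aggregate_coding_metrics(metrics: dict) -> float:
    """Collapse per-trial coding metrics into a single 0-1 score."""
    score = (
        0.5 * metrics["tests_passed"]              # assumed: fraction of tests passing
        + 0.2 * (1.0 if metrics["builds"] else 0.0)
        + 0.2 * metrics["lint_score"]              # assumed: normalized to 0-1
        + 0.1 * metrics["matches_spec"]            # assumed: normalized to 0-1
    )
    # Penalize unusually large diffs (smaller is better)
    penalty = 0.1 * min(metrics.get("diff_lines", 0) / 1000, 1.0)
    return max(score - penalty, 0.0)
```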

#### 2.2 Conversational Agents

**Benchmarks**:
- τ2-Bench: Multi-domain conversation
- Custom domain-specific suites

**Grading Strategy** (Multi-dimensional):
```yaml
success_criteria:
  - empathy_score: >= 4.0
  - resolution_rate: >= 0.9
  - avg_turns: <= 5
  - escalation_rate: <= 0.1
```
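
A thin checker can turn criteria like these into a per-run verdict. A minimal sketch, assuming the conversation metrics have already been computed under the same names as in the YAML above:

```python
def check_success_criteria(metrics: dict) -> dict:
    """Compare measured conversation metrics against the thresholds above."""
    checks = {
        "empathy_score": metrics["empathy_score"] >= 4.0,
        "resolution_rate": metrics["resolution_rate"] >= 0.9,
        "avg_turns": metrics["avg_turns"] <= 5,
        "escalation_rate": metrics["escalation_rate"] <= 0.1,
    }
    return {"passed": all(checks.values()), "checks": checks}
```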

**Key Metrics**:
- Task resolution rate
- Customer satisfaction proxy
- Turn efficiency
- Escalation rate

#### 2.3 Research Agents

**Grading Dimensions**:
1. **Grounding**: Claims backed by sources
2. **Coverage**: All aspects addressed
3. **Source Quality**: Authoritative sources used

```python
def grade_research_agent(task: dict, outcome: dict) -> dict:
    return {
        "grounding": check_citations(outcome["report"]),
        "coverage": check_topic_coverage(task["topics"], outcome["report"]),
        "source_quality": score_sources(outcome["sources"]),
        "factual_accuracy": verify_claims(outcome["claims"])
    }
```

#### 2.4 Computer Use Agents

**Benchmarks**:
- WebArena: Web navigation tasks
- OSWorld: Desktop environment tasks

**Grading Strategy**:
```python
def grade_computer_use(task: dict, outcome: dict) -> dict:
    return {
        "ui_state": verify_ui_state(outcome["screenshot"]),
        "db_state": verify_database(task["expected_db_state"]),
        "file_state": verify_files(task["expected_files"]),
        "success": all_conditions_met(task, outcome)
    }
```

### Step 3: Follow the 8-Step Roadmap

#### Step 0: Start Early (20-50 Tasks)
```bash
# Create initial eval suite structure
mkdir -p evals/{tasks,results,graders}

# Start with representative tasks
# - Common use cases (60%)
# - Edge cases (20%)
# - Failure modes (20%)
```

#### Step 1: Convert Manual Tests
```python
# Transform existing QA tests into eval tasks
def convert_qa_to_eval(qa_case: dict) -> dict:
    return {
        "id": qa_case["id"],
        "prompt": qa_case["input"],
        "expected_outcome": qa_case["expected"],
        "grader": "code" if qa_case["has_tests"] else "model",
        "tags": qa_case.get("tags", [])
    }
```

#### Step 2: Ensure Clarity + Reference Solutions
```yaml
# Good task definition
task:
  id: "api-design-001"
  prompt: |
    Design a REST API for user management with:
    - CRUD operations
    - Authentication via JWT
    - Rate limiting
  reference_solution: "./solutions/api-design-001/"
  success_criteria:
    - "All endpoints documented"
    - "Auth middleware present"
    - "Rate limit config exists"
```

#### Step 3: Balance Positive/Negative Cases
```python
# Ensure eval suite balance
suite_composition = {
    "positive_cases": 0.5,    # Should succeed
    "negative_cases": 0.3,    # Should fail gracefully
    "edge_cases": 0.2         # Boundary conditions
}
```

#### Step 4: Isolate Environments
```yaml
# Docker-based isolation for coding evals
eval_environment:
  type: docker
  image: "eval-sandbox:latest"
  timeout: 300s
  resources:
    memory: "4g"
    cpu: "2"
  network: isolated
  cleanup: always
```

#### Step 5: Focus on Outcomes, Not Paths
```python
# GOOD: Outcome-focused grader
def grade_outcome(expected: dict, actual: dict) -> float:
    return compare_final_states(expected, actual)

# BAD: Path-focused grader (too brittle)
def grade_path(expected_steps: list, actual_steps: list) -> float:
    return step_by_step_match(expected_steps, actual_steps)
```
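
As a concrete illustration, `compare_final_states` can be as simple as checking which expected final-state entries the agent actually produced. A minimal sketch, assuming outcomes are flat dicts of final state (files written, records created, and so on); the agent's intermediate steps never enter the score:

```python
def compare_final_states(expected: dict, actual: dict) -> float:
    """Score the fraction of expected final-state entries the agent produced."""
    if not expected:
        return 1.0
    matched = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return matched / len(expected)
```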

#### Step 6: Always Read Transcripts
```python
# Transcript analysis for debugging
def analyze_transcript(transcript: list) -> dict:
    return {
        "total_steps": len(transcript),
        "tool_usage": count_tool_calls(transcript),
        "errors": extract_errors(transcript),
        "decision_points": find_decision_points(transcript),
        "recovery_attempts": find_recovery_patterns(transcript)
    }
```

#### Step 7: Monitor Eval Saturation
```python
# Detect when evals are no longer useful
def check_saturation(results: list, window: int = 10) -> dict:
    recent = results[-window:]
    is_saturated = all(r["passed"] for r in recent)
    return {
        "pass_rate": sum(r["passed"] for r in recent) / len(recent),
        "variance": calculate_variance(recent),
        "is_saturated": is_saturated,
        "recommendation": "Add harder tasks" if is_saturated else "Continue"
    }
```

#### Step 8: Long-term Maintenance
```yaml
# Eval suite maintenance checklist
maintenance:
  weekly:
    - Review failed evals for false negatives
    - Check for flaky tests
  monthly:
    - Add new edge cases from production issues
    - Retire saturated evals
    - Update reference solutions
  quarterly:
    - Full benchmark recalibration
    - Team contribution review
```
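
For the weekly "check for flaky tests" item, one option is to rerun tasks several times and flag those that both pass and fail. A minimal sketch, assuming results are stored as `(task_id, passed)` pairs across repeated trials:

```python
from collections import defaultdict

def find_flaky_tasks(results: list[tuple[str, bool]], min_trials: int = 3) -> list[str]:
    """Flag tasks that both pass and fail across repeated trials."""
    by_task: dict[str, list[bool]] = defaultdict(list)
    for task_id, passed in results:
        by_task[task_id].append(passed)
    return [
        task_id
        for task_id, outcomes in by_task.items()
        if len(outcomes) >= min_trials and 0 < sum(outcomes) < len(outcomes)
    ]
```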

### Step 4: Integrate with Production

#### CI/CD Integration
```yaml
# GitHub Actions example
name: Agent Evals
on: [push, pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run Evals
        run: |
          python run_evals.py --suite=core --mode=compact
      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results/
```
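
The workflow assumes a `run_evals.py` entry point. Its implementation is up to you; a minimal sketch might look like the following, where `load_tasks` and `run_trial` are hypothetical helpers and the pass-rate threshold is only illustrative:

```python
# run_evals.py -- minimal sketch of the CI entry point (load_tasks/run_trial are hypothetical)
import argparse
import json
from pathlib import Path

def main() -> int:
    parser = argparse.ArgumentParser(description="Run an agent eval suite")
    parser.add_argument("--suite", default="core")
    parser.add_argument("--mode", choices=["full", "compact"], default="full")
    args = parser.parse_args()

    tasks = load_tasks(args.suite)                     # hypothetical: read evals/tasks/
    if args.mode == "compact":
        tasks = tasks[: max(1, len(tasks) // 5)]       # small subset for fast PR checks

    results = [run_trial(task) for task in tasks]      # hypothetical: one trial per task
    Path("results").mkdir(exist_ok=True)
    Path("results/summary.json").write_text(json.dumps(results, indent=2))

    pass_rate = sum(r["passed"] for r in results) / len(results)
    print(f"pass rate: {pass_rate:.2%}")
    return 0 if pass_rate >= 0.9 else 1                # illustrative CI threshold

if __name__ == "__main__":
    raise SystemExit(main())
```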

#### Production Monitoring
```python
# Real-time eval sampling
import random

class ProductionMonitor:
    def __init__(self, sample_rate: float = 0.1, threshold: float = 0.7):
        self.sample_rate = sample_rate
        self.threshold = threshold

    async def monitor(self, request, response):
        if random.random() < self.sample_rate:
            eval_result = await self.run_eval(request, response)
            self.log_result(eval_result)
            if eval_result["score"] < self.threshold:
                self.alert("Low quality response detected")
```

#### A/B Testing
```python
# Compare agent versions
def run_ab_test(suite: str, versions: list) -> dict:
    results = {}
    for version in versions:
        results[version] = run_eval_suite(suite, agent_version=version)
    return {
        "comparison": compare_results(results),
        "winner": determine_winner(results),
        "confidence": calculate_confidence(results)
    }
```
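
One way to implement `calculate_confidence` is a two-proportion comparison of pass rates between versions. A minimal sketch using a normal approximation, assuming you can count per-task passes for each version (one possible statistic, not a prescribed method):

```python
import math

def two_proportion_z(passes_a: int, n_a: int, passes_b: int, n_b: int) -> float:
    """Z-statistic for the difference in pass rates between two agent versions."""
    p_a, p_b = passes_a / n_a, passes_b / n_b
    pooled = (passes_a + passes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (p_a - p_b) / se

# |z| > 1.96 roughly corresponds to 95% confidence that the pass rates differ
```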

## Best Practices

### Do's ✅
1. **Start with 20-50 representative tasks**
2. **Use code-based graders when possible**
3. **Focus on outcomes, not paths**
4. **Read transcripts for debugging**
5. **Monitor for eval saturation**
6. **Balance positive/negative cases**
7. **Isolate eval environments**
8. **Version your eval suites**

### Don'ts ❌
1. **Don't over-rely on model-based graders without calibration**
2. **Don't ignore failed evals (false negatives exist)**
3. **Don't grade on intermediate steps**
4. **Don't skip transcript analysis**
5. **Don't use production data without sanitization**
6. **Don't let eval suites become stale**

## Success Patterns

### Pattern 1: Graduated Eval Complexity
```
Level 1: Unit evals (single capability)
Level 2: Integration evals (combined capabilities)
Level 3: End-to-end evals (full workflows)
Level 4: Adversarial evals (edge cases)
```

### Pattern 2: Eval-Driven Development
```
1. Write eval task for new feature
2. Run eval (expect failure)
3. Implement feature
4. Run eval (expect pass)
5. Add to regression suite
```

### Pattern 3: Continuous Calibration
```
Weekly: Review grader accuracy
Monthly: Update rubrics based on feedback
Quarterly: Full grader audit with human baseline
```

## Troubleshooting

### Problem: Eval scores at 100%
**Solution**: Add harder tasks, check for eval saturation (Step 7)

### Problem: Inconsistent model-based grader scores
**Solution**: Add more examples to rubric, use structured output, ensemble graders
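
For grader ensembling, one option is to run several judges (or the same judge several times) and take the median. A minimal sketch, assuming `judges` is any list of callables that score a transcript:

```python
import statistics
from typing import Callable

def ensemble_grade(transcript: str, judges: list[Callable[[str], float]]) -> dict:
    """Aggregate several model-based graders to reduce single-judge noise."""
    scores = [judge(transcript) for judge in judges]
    return {
        "score": statistics.median(scores),
        "spread": max(scores) - min(scores),  # large spread -> candidate for human review
        "raw_scores": scores,
    }
```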

### Problem: Evals too slow for CI
**Solution**: Run in compact mode (as in the CI example above), parallelize trials, sample a subset of tasks for PR checks

### Problem: Agent passes evals but fails in production
**Solution**: Add production failure cases to eval suite, increase diversity

## References

- [Anthropic: Demystifying evals for AI agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents)
- [SWE-bench](https://www.swebench.com/)
- [WebArena](https://webarena.dev/)
- [τ2-Bench](https://github.com/sierra-research/tau2-bench)

## Examples

### Example 1: Simple Coding Agent Eval

```python
# Task definition
task = {
    "id": "fizzbuzz-001",
    "prompt": "Write a fizzbuzz function in Python",
    "test_cases": [
        {"input": 3, "expected": "Fizz"},
        {"input": 5, "expected": "Buzz"},
        {"input": 15, "expected": "FizzBuzz"},
        {"input": 7, "expected": "7"}
    ]
}

# Grader (execute generated code only inside an isolated sandbox)
def grade(task, outcome):
    namespace = {}
    exec(outcome["code"], namespace)       # defines fizzbuzz inside namespace
    fizzbuzz = namespace["fizzbuzz"]
    for tc in task["test_cases"]:
        if fizzbuzz(tc["input"]) != tc["expected"]:
            return 0.0
    return 1.0
```

### Example 2: Conversational Agent Eval with LLM Rubric

```yaml
task:
  id: "support-refund-001"
  scenario: |
    Customer wants refund for damaged product.
    Product: Laptop, Order: #12345, Damage: Screen crack
  expected_actions:
    - Acknowledge issue
    - Verify order
    - Offer resolution options
  max_turns: 5

grader:
  type: model
  model: claude-3-5-sonnet-20241022
  rubric: |
    Score 1-5 on each dimension:
    - Empathy: Did agent acknowledge customer frustration?
    - Resolution: Was a clear solution offered?
    - Efficiency: Was issue resolved in reasonable turns?
```

Overview

This skill guides the design and implementation of comprehensive evaluation systems for AI agents. It covers grader types (code-, model-, human-based), benchmarks for different agent classes, an 8-step roadmap for building eval suites, and how to integrate evals into production and CI/CD. Use it to create repeatable, outcome-focused, and maintainable agent evaluations.

How this skill works

The skill explains what to inspect: tasks, trials, transcripts, outcomes, and grader outputs. It prescribes grader selection and rubric design per agent type, test harness and isolation patterns, and metrics to collect (pass rates, variance, resolution rate, etc.). It also describes workflows for running suites in CI, sampling in production, A/B testing, and monitoring eval saturation and maintenance.

When to use it

  • Building benchmarks for coding, conversational, research, or computer-use agents
  • Designing graders and rubrics for objective scoring
  • Implementing CI/CD pipelines with automated eval runs
  • Setting up production monitoring and sampling for model quality
  • Debugging agent behavior via transcript analysis
  • Measuring improvement and detecting eval saturation

Best practices

  • Start small: 20–50 representative tasks covering common, edge, and failure cases
  • Prefer code-based graders for structured outputs; use model graders for nuance and humans for final validation
  • Focus grading on final outcomes rather than intermediate steps
  • Isolate eval environments (containers, timeouts, resource limits) to avoid flakiness
  • Regularly read transcripts to identify failure modes and tool usage patterns
  • Monitor saturation and retire or harden tasks when pass rates are consistently high

Example use cases

  • Coding agent: run test suites, lint, build checks, and compute test-passage rate for each trial
  • Conversational agent: apply a multi-dimensional rubric (empathy, resolution, efficiency) with a model-based grader and aggregate the scores
  • Research agent: score grounding, coverage, source quality, and factual accuracy with a blend of automated checks and human review
  • Computer-use agent: verify UI state, file/database state, screenshots, and outcome conditions in an isolated sandbox
  • CI integration: run compact eval subsets on PRs and full suites on nightly builds with artifact upload

FAQ

How do I choose between code-based and model-based graders?

Use code-based graders when you can define clear, testable success criteria (fast, reproducible). Use model-based graders when outputs are open-ended or require nuance; calibrate rubrics and validate with humans. Combine graders when appropriate.

When should I sample production traffic for evals?

Sample continuously but sparsely (e.g., 5–10%) to detect regressions without excessive cost. Sanitize data, anonymize user content, and trigger alerts for scores below thresholds so you can add failing cases to the eval suite.