home / skills / sickn33 / antigravity-awesome-skills / agent-evaluation

agent-evaluation skill

/skills/agent-evaluation

This skill helps you evaluate and benchmark LLM agents with behavioral tests, reliability metrics, and production-focused insights to reduce real-world

This is most likely a fork of the agent-evaluation skill from openclaw
npx playbooks add skill sickn33/antigravity-awesome-skills --skill agent-evaluation

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
2.0 KB
---
name: agent-evaluation
description: "Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent."
source: vibeship-spawner-skills (Apache 2.0)
---

# Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
production. You've learned that evaluating LLM agents is fundamentally different from
testing traditional software—the same input can produce different outputs, and "correct"
often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression
tests, capability assessments, and reliability metrics. You understand that the goal isn't
100% test pass rate—it

## Capabilities

- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing

## Requirements

- testing-fundamentals
- llm-fundamentals

## Patterns

### Statistical Test Evaluation

Run tests multiple times and analyze result distributions

### Behavioral Contract Testing

Define and test agent behavioral invariants

### Adversarial Testing

Actively try to break agent behavior

## Anti-Patterns

### ❌ Single-Run Testing

### ❌ Only Happy Path Tests

### ❌ Output String Matching

## ⚠️ Sharp Edges

| Issue | Severity | Solution |
|-------|----------|----------|
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |

## Related Skills

Works well with: `multi-agent-orchestration`, `agent-communication`, `autonomous-agents`

Overview

This skill helps teams test and benchmark LLM agents with a focus on real-world reliability. It provides behavioral tests, capability assessments, and reliability metrics so you can catch failures that only appear in production. Use it to move beyond single-run benchmarks and measure agent behavior statistically and adversarially.

How this skill works

The skill runs test suites repeatedly to collect result distributions and flaky-test signals. It defines behavioral contracts and invariants to validate agent actions, then applies adversarial and regression tests to expose failure modes. Finally, it aggregates metrics (accuracy, stability, latency, failure modes) and surfaces gaps between benchmark performance and production behavior.

When to use it

  • Validating agents before production deployment
  • Designing benchmarks for new agent capabilities
  • Detecting behavioral regressions after model updates
  • Measuring agent reliability and stability over time
  • Benchmarking multiple agents under realistic conditions

Best practices

  • Run tests multiple times and analyze distributions rather than single-pass outcomes
  • Define clear behavioral contracts and invariants for expected agent behavior
  • Include adversarial and negative tests to reveal edge-case failures
  • Track multi-dimensional metrics to avoid optimizing for a single metric
  • Isolate test data from training and prompts to prevent leakage
  • Prioritize tests that reflect production inputs and user workflows

Example use cases

  • Behavioral regression suite that runs daily and flags increased flakiness
  • Capability assessment matrix comparing agents across reasoning, retrieval, and tool use
  • Adversarial campaign to find prompts that trigger unsafe or incorrect outputs
  • Production monitoring that correlates benchmark scores with real-world error rates
  • Benchmarking multiple agent versions to guide rollouts and canarying

FAQ

How many runs are enough for statistical testing?

Start with 30–100 runs per test to estimate distribution shape and stability; increase for low-frequency failures or high-variance tasks.

How do I prevent tests from being gamed by model tuning?

Use multi-dimensional metrics, rotate and expand test sets, include adversarial/negative examples, and monitor production behavior to detect metric gaming.