home / skills / sickn33 / antigravity-awesome-skills / agent-evaluation

agent-evaluation skill

safe

This skill helps you evaluate and benchmark LLM agents with behavioral tests, reliability metrics, and production-focused insights to reduce real-world

This is most likely a fork of the agent-evaluation skill from openclaw

npx playbooks add skill sickn33/antigravity-awesome-skills --skill agent-evaluation

Review the files below or copy the command above to add this skill to your agents.

Files (1)

SKILL.md

2.0 KB

---
name: agent-evaluation
description: "Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent."
source: vibeship-spawner-skills (Apache 2.0)
---

# Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
production. You've learned that evaluating LLM agents is fundamentally different from
testing traditional software—the same input can produce different outputs, and "correct"
often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression
tests, capability assessments, and reliability metrics. You understand that the goal isn't
100% test pass rate—it

## Capabilities

- agent-testing
- benchmark-design
- capability-assessment
- reliability-metrics
- regression-testing

## Requirements

- testing-fundamentals
- llm-fundamentals

## Patterns

### Statistical Test Evaluation

Run tests multiple times and analyze result distributions

### Behavioral Contract Testing

Define and test agent behavioral invariants

### Adversarial Testing

Actively try to break agent behavior

## Anti-Patterns

### ❌ Single-Run Testing

### ❌ Only Happy Path Tests

### ❌ Output String Matching

## ⚠️ Sharp Edges

| Issue | Severity | Solution |
|-------|----------|----------|
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
| Test data accidentally used in training or prompts | critical | // Prevent data leakage in agent evaluation |

## Related Skills

Works well with: `multi-agent-orchestration`, `agent-communication`, `autonomous-agents`

Overview

This skill helps teams test and benchmark LLM agents with a focus on real-world reliability. It provides behavioral tests, capability assessments, and reliability metrics so you can catch failures that only appear in production. Use it to move beyond single-run benchmarks and measure agent behavior statistically and adversarially.

How this skill works

The skill runs test suites repeatedly to collect result distributions and flaky-test signals. It defines behavioral contracts and invariants to validate agent actions, then applies adversarial and regression tests to expose failure modes. Finally, it aggregates metrics (accuracy, stability, latency, failure modes) and surfaces gaps between benchmark performance and production behavior.

When to use it

Validating agents before production deployment
Designing benchmarks for new agent capabilities
Detecting behavioral regressions after model updates
Measuring agent reliability and stability over time
Benchmarking multiple agents under realistic conditions

Best practices

Run tests multiple times and analyze distributions rather than single-pass outcomes
Define clear behavioral contracts and invariants for expected agent behavior
Include adversarial and negative tests to reveal edge-case failures
Track multi-dimensional metrics to avoid optimizing for a single metric
Isolate test data from training and prompts to prevent leakage
Prioritize tests that reflect production inputs and user workflows

Example use cases

Behavioral regression suite that runs daily and flags increased flakiness
Capability assessment matrix comparing agents across reasoning, retrieval, and tool use
Adversarial campaign to find prompts that trigger unsafe or incorrect outputs
Production monitoring that correlates benchmark scores with real-world error rates
Benchmarking multiple agent versions to guide rollouts and canarying

FAQ

How many runs are enough for statistical testing?

Start with 30–100 runs per test to estimate distribution shape and stability; increase for low-frequency failures or high-variance tasks.

How do I prevent tests from being gamed by model tuning?

Use multi-dimensional metrics, rotate and expand test sets, include adversarial/negative examples, and monitor production behavior to detect metric gaming.