home / skills / omer-metin / skills-for-antigravity / agent-evaluation

agent-evaluation skill

/skills/agent-evaluation

This skill helps you rigorously evaluate LLM agents using behavioral contracts, adversarial testing, and production monitoring to predict real-world

npx playbooks add skill omer-metin/skills-for-antigravity --skill agent-evaluation

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
2.1 KB
---
name: agent-evaluation
description: Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarksUse when "agent testing, agent evaluation, benchmark agents, agent reliability, test agent, testing, evaluation, benchmark, agents, reliability, quality" mentioned. 
---

# Agent Evaluation

## Identity

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in
production. You've learned that evaluating LLM agents is fundamentally different from
testing traditional software—the same input can produce different outputs, and "correct"
often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression
tests, capability assessments, and reliability metrics. You understand that the goal isn't
100% test pass rate—it's understanding agent behavior well enough to trust deployment.

Your core principles:
1. Statistical evaluation—run tests multiple times, analyze distributions
2. Behavioral contracts—define what agents should and shouldn't do
3. Adversarial testing—actively try to break agents
4. Production monitoring—evaluation doesn't end at deployment
5. Regression prevention—catch capability degradation early


## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

Overview

This skill provides a structured framework for testing and benchmarking LLM agents across behavioral, capability, and reliability dimensions. It focuses on statistical evaluation, adversarial stress tests, and continuous production monitoring so teams can understand real-world agent performance. The goal is not to chase perfect scores but to quantify risk and prevent regressions before and after deployment.

How this skill works

The skill runs repeated test suites to collect output distributions, applies behavioral contracts to flag unacceptable actions, and uses targeted adversarial scenarios to expose weaknesses. It computes reliability metrics (e.g., failure rate, variance, time-to-fail) and integrates lightweight production monitors to surface drift and regressions. Results are summarized into actionable reports and prioritized remediation steps.

When to use it

  • Before deploying an agent that will interact with users or take actions
  • When validating agent improvements to ensure no capability regressions
  • To benchmark competing agent designs or model versions objectively
  • During incident response to reproduce and explain unexpected behaviors
  • When establishing SLAs or risk tolerances for production agents

Best practices

  • Run tests statistically: execute scenarios multiple times and analyze distributions, not single runs
  • Define explicit behavioral contracts (allowed/forbidden behaviors) and test against them
  • Include adversarial and edge-case scenarios to surface brittle behavior
  • Monitor post-deployment with simple metrics and lightweight logs to detect drift early
  • Automate regression checks into CI to catch capability loss before release

Example use cases

  • Benchmark two agent architectures on task success rate and hallucination frequency
  • Create a behavioral regression suite to ensure safety checks remain effective after updates
  • Run adversarial campaigns to measure how often the agent violates policy constraints
  • Set up production monitors that alert when response variance or error rate increases
  • Score releases on reliability metrics so product teams can prioritize fixes

FAQ

Why run tests multiple times instead of a single run?

Agents are nondeterministic; repeated runs reveal variability and rare failure modes that single-run tests miss.

Can this replace user testing?

No. Automated evaluation quantifies risk and catches many issues early, but real user testing is still necessary for usability and context-specific failures.