This skill refreshes evaluation harnesses to maintain live and fallback reliability under instability, with drift detection, timeout handling, and regression checks.
```bash
npx playbooks add skill oimiragieo/agent-studio --skill eval-harness-updater
```
---
name: eval-harness-updater
description: Refresh evaluation harnesses with live/fallback parser reliability, SLO gates, and regression checks.
version: 1.0.0
model: sonnet
invoked_by: both
user_invocable: true
tools: [Read, Write, Edit, Glob, Grep, Bash, Skill, MemoryRecord, WebSearch, WebFetch]
args: '--harness <path-or-name> [--trigger reflection|evolve|manual]'
error_handling: graceful
streaming: supported
---
# Eval Harness Updater
Refresh eval harnesses to keep live and fallback modes actionable in unstable environments.
## Focus Areas
- Prompt and parser drift
- Timeout/partial-stream handling
- SLO and regression gates
- Dual-run fallback consistency
## Workflow
1. Resolve harness path.
2. Research test/eval best practices.
3. Add RED regressions for parsing and timeout edge cases (see the sketch after this list).
4. Patch minimal harness logic.
5. Validate eval outputs and CI gates.
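Step 3 usually lands as a failing (RED) test first. Below is a minimal Vitest sketch; `parseEvalOutput`, its options, and the result shape are hypothetical stand-ins for whatever parser the target harness actually exposes:

```typescript
// RED regressions for parser drift: these encode the desired behavior and
// are expected to fail until the harness patch lands. All names here are
// illustrative, not part of any real harness API.
import { describe, expect, it } from "vitest";
import { parseEvalOutput } from "../src/parser"; // hypothetical helper

describe("parser drift regressions", () => {
  it("returns a partial result instead of throwing on a truncated stream", () => {
    const truncated = '{"score": 0.87, "verdict": "pa'; // stream cut mid-token
    const result = parseEvalOutput(truncated);
    expect(result.partial).toBe(true);
    expect(result.score).toBeCloseTo(0.87);
  });

  it("flags a timeout sentinel rather than scoring silence as zero", () => {
    const result = parseEvalOutput("", { timedOut: true });
    expect(result.status).toBe("timeout");
    expect(result.score).toBeUndefined();
  });
});
```

Writing assertions against the desired behavior first keeps the patch in step 4 minimal, and the tests stay in place afterward as the regression gate.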
This skill refreshes evaluation harnesses to keep live and fallback modes reliable in unstable environments. It focuses on parser reliability, SLO gates, regression checks, and consistent dual-run fallbacks. The goal is minimal, targeted patches that restore actionable eval outputs and CI validation.
The updater resolves the target harness path, researches current test and evaluation best practices, and injects targeted regression tests for parsing and timeout edge cases. It patches minimal harness logic to improve timeout/partial-stream handling and adds SLO and regression gates so CI can detect degradations. Finally, it validates outputs and ensures both live and fallback runs remain consistent.
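For the dual-run step, the consistency check might look like the sketch below, assuming a `runHarness` entry point with a `mode` option (both hypothetical):

```typescript
// Dual-run consistency: execute the same case in live and fallback modes and
// compare normalized verdicts. runHarness and its options are illustrative.
import { runHarness } from "./harness"; // assumed harness entry point

export async function checkDualRunConsistency(casePath: string): Promise<boolean> {
  const live = await runHarness(casePath, { mode: "live" });
  const fallback = await runHarness(casePath, { mode: "fallback" });
  // Compare verdicts rather than raw transcripts: fallback output is often terser.
  if (live.verdict !== fallback.verdict) {
    console.error(
      `dual-run mismatch on ${casePath}: live=${live.verdict} fallback=${fallback.verdict}`,
    );
    return false;
  }
  return true;
}
```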
**Will this change alter evaluation semantics?**
Patches are designed to be minimal and focused on reliability; they should not change core evaluation semantics but may normalize ambiguous outputs for consistency.
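One shape such normalization could take is an alias table that maps ambiguous verdict strings onto a canonical set, so live and fallback runs compare cleanly without touching scores; the `Verdict` type and aliases below are illustrative:

```typescript
// Map ambiguous verdict strings from different run modes onto one canonical
// set. The alias table is a placeholder, not a real harness contract.
type Verdict = "pass" | "fail" | "timeout" | "unknown";

const ALIASES: Record<string, Verdict> = {
  pass: "pass", passed: "pass", ok: "pass",
  fail: "fail", failed: "fail", error: "fail",
  timeout: "timeout", timed_out: "timeout",
};

function normalizeVerdict(raw: string): Verdict {
  return ALIASES[raw.trim().toLowerCase()] ?? "unknown";
}
```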
**How do SLO gates interact with CI?**
SLO gates become CI checks that fail the pipeline when latency or correctness metrics fall below defined thresholds, preventing regressions from merging.
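A minimal sketch of such a gate as a standalone CI step, assuming the harness writes an `eval-metrics.json` file; the metric names and thresholds are placeholders:

```typescript
// SLO gate: read harness metrics and exit nonzero when a threshold is
// breached, which fails the CI pipeline. File name, metric names, and
// thresholds are assumptions for illustration.
import { readFileSync } from "node:fs";

interface Metrics {
  p95LatencyMs: number;
  passRate: number;
}

const SLO = { p95LatencyMs: 2000, passRate: 0.95 };
const metrics: Metrics = JSON.parse(readFileSync("eval-metrics.json", "utf8"));

const violations: string[] = [];
if (metrics.p95LatencyMs > SLO.p95LatencyMs) {
  violations.push(`p95 latency ${metrics.p95LatencyMs}ms exceeds ${SLO.p95LatencyMs}ms`);
}
if (metrics.passRate < SLO.passRate) {
  violations.push(`pass rate ${metrics.passRate} below ${SLO.passRate}`);
}

if (violations.length > 0) {
  console.error("SLO gate failed:\n" + violations.join("\n"));
  process.exit(1); // nonzero exit fails the CI job
}
console.log("SLO gate passed");
```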