
eval-harness-updater skill

/.claude/skills/eval-harness-updater

This skill refreshes evaluation harnesses to keep live and fallback modes reliable under instability, with drift detection, timeout handling, and regression checks.

npx playbooks add skill oimiragieo/agent-studio --skill eval-harness-updater

Review the files below or copy the command above to add this skill to your agents.

Files (10)
SKILL.md
881 B
---
name: eval-harness-updater
description: Refresh evaluation harnesses with live/fallback parser reliability, SLO gates, and regression checks.
version: 1.0.0
model: sonnet
invoked_by: both
user_invocable: true
tools: [Read, Write, Edit, Glob, Grep, Bash, Skill, MemoryRecord, WebSearch, WebFetch]
args: '--harness <path-or-name> [--trigger reflection|evolve|manual]'
error_handling: graceful
streaming: supported
---

# Eval Harness Updater

Refresh eval harnesses to keep live + fallback modes actionable in unstable environments.

## Focus Areas

- Prompt and parser drift
- Timeout/partial-stream handling
- SLO and regression gates
- Dual-run fallback consistency

## Workflow

1. Resolve harness path.
2. Research test/eval best practices.
3. Add RED regressions for parsing and timeout edge cases.
4. Patch minimal harness logic.
5. Validate eval outputs and CI gates.

Overview

This skill refreshes evaluation harnesses to keep live and fallback modes reliable in unstable environments. It focuses on parser reliability, SLO gates, regression checks, and consistent dual-run fallbacks. The goal is minimal, targeted patches that restore actionable eval outputs and CI validation.

How this skill works

The updater resolves the target harness path, researches current test and evaluation best practices, and injects targeted regression tests for parsing and timeout edge cases. It patches minimal harness logic to improve timeout/partial-stream handling and adds SLO and regression gates so CI can detect degradations. Finally, it validates outputs and ensures both live and fallback runs remain consistent.
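The "RED regressions" step above can be sketched as a small failing-first test. This is an illustrative example, not the skill's actual implementation: `parse_eval_output` and the truncated-payload fixture are hypothetical names standing in for whatever parser the target harness uses.

```python
# Hypothetical RED regression for a parser edge case: a truncated
# (partial-stream) payload must raise a clear error instead of being
# silently accepted. All names here are illustrative.
import json


def parse_eval_output(raw: str) -> dict:
    """Parse a harness response; reject truncated JSON explicitly."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"partial or malformed eval payload: {exc}") from exc


def test_truncated_payload_fails_loudly():
    truncated = '{"score": 0.9, "verdict": "pa'  # stream cut mid-token
    try:
        parse_eval_output(truncated)
    except ValueError:
        return  # the regression passes once the parser fails loudly
    raise AssertionError("truncated payload was silently accepted")
```

A test like this starts RED against a parser that swallows decode errors, then goes green once the minimal patch lands, which is exactly the gate CI needs.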

When to use it

  • Eval harnesses fail intermittently due to prompt or parser drift.
  • Timeouts or partial-stream responses cause flaky or missing results.
  • You need SLO-based gating to prevent regressions from reaching production.
  • Dual-run (live + fallback) outputs diverge and require alignment.
  • CI pipelines lack tests for parser edge cases or time-based failures.

Best practices

  • Add focused RED (regression) tests for known parser failures and timeouts.
  • Keep harness patches minimal and reversible; prefer guardrails over sweeping changes.
  • Use SLO gates that reflect user-facing latency and correctness thresholds.
  • Validate both live and fallback runs in CI to surface divergence early.
  • Log partial-stream and timeout events separately to enable targeted fixes.
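The last two practices can be combined in one guarded stream reader. A minimal sketch, assuming the harness consumes an iterator of text chunks; the function name, terminator convention, and log event names are assumptions, not part of the skill's interface:

```python
# Sketch of a guarded stream read that logs timeout and partial-stream
# events under distinct names, so each failure mode can be fixed (and
# alerted on) separately. Names are illustrative.
import logging
import time

log = logging.getLogger("harness.stream")


def read_stream(chunks, timeout_s: float = 5.0, terminator: str = "\n") -> str:
    """Accumulate chunks until the terminator arrives or the budget expires."""
    buf = []
    deadline = time.monotonic() + timeout_s
    for chunk in chunks:
        if time.monotonic() > deadline:
            log.warning("stream_timeout after %.1fs, %d chunks kept", timeout_s, len(buf))
            return "".join(buf)
        buf.append(chunk)
        if chunk.endswith(terminator):
            return "".join(buf)  # clean completion
    log.warning("partial_stream: terminator never seen (%d chunks)", len(buf))
    return "".join(buf)  # caller decides whether a partial result is usable
```

Returning the partial buffer instead of raising keeps the fallback run able to salvage whatever arrived, while the distinct log events make the flakiness visible in CI.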

Example use cases

  • Inject a regression that reproduces a prompt-induced parser error and fail CI until fixed.
  • Wrap stream reads with guarded timeouts and add tests for partial payloads.
  • Create an SLO gate that blocks merges when eval latency exceeds a defined threshold.
  • Patch fallback parsing to mirror live-run normalization so dual-run outputs match.
  • Add a small harness shim that converts ambiguous tokens into a stable parse format.
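The last two use cases share one idea: normalize both runs onto a stable vocabulary before comparing them. A hedged sketch follows; the token table and function names are assumptions for illustration, not a defined spec:

```python
# Illustrative shim that collapses ambiguous verdict tokens from either
# run mode onto one canonical vocabulary, so live and fallback outputs
# can be compared directly. The token table is an assumption.
_CANONICAL = {
    "pass": "pass", "passed": "pass", "ok": "pass",
    "fail": "fail", "failed": "fail", "error": "fail",
}


def normalize_verdict(token: str) -> str:
    """Collapse verdict spellings to 'pass' / 'fail' / 'unknown'."""
    return _CANONICAL.get(token.strip().lower(), "unknown")


def runs_agree(live: str, fallback: str) -> bool:
    """Dual-run consistency: compare normalized verdicts, not raw text."""
    return normalize_verdict(live) == normalize_verdict(fallback)
```

Routing both run modes through the same shim is what makes the dual-run divergence check meaningful: raw-string comparison would flag cosmetic differences as regressions.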

FAQ

Will this change alter evaluation semantics?

Patches are designed to be minimal and focused on reliability; they should not change core evaluation semantics but may normalize ambiguous outputs for consistency.

How do SLO gates interact with CI?

SLO gates become CI checks that fail the pipeline when latency exceeds, or correctness metrics fall below, their defined thresholds, preventing regressions from merging.
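In practice such a gate can be a tiny script run as a CI step that exits non-zero on any violation. A minimal sketch, with thresholds and metric names as illustrative placeholders:

```python
# Minimal SLO gate sketch for a CI step: exit non-zero when p95 eval
# latency or pass rate breaches its threshold. Thresholds and metric
# names are illustrative placeholders, not a defined contract.
import sys


def check_slo(metrics: dict, max_p95_ms: float = 2000.0, min_pass_rate: float = 0.95) -> list:
    """Return human-readable SLO violations; an empty list means the gate passes."""
    violations = []
    if metrics.get("p95_latency_ms", 0.0) > max_p95_ms:
        violations.append(f"p95 latency {metrics['p95_latency_ms']}ms > {max_p95_ms}ms")
    if metrics.get("pass_rate", 1.0) < min_pass_rate:
        violations.append(f"pass rate {metrics['pass_rate']} < {min_pass_rate}")
    return violations


if __name__ == "__main__":
    sample = {"p95_latency_ms": 1800.0, "pass_rate": 0.97}  # stand-in metrics
    problems = check_slo(sample)
    if problems:
        print("\n".join(problems))
        sys.exit(1)  # a non-zero exit fails the CI job and blocks the merge
```

The non-zero exit code is the whole mechanism: any CI system treats it as a failed check, so no special integration is needed.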