
This skill validates other skills with fresh subagent instances using a three-phase TDD approach to prevent priming bias.

npx playbooks add skill athola/claude-night-market --skill subagent-testing

---
name: subagent-testing
description: TDD-style testing methodology for skills using fresh subagent instances
  to prevent priming bias and validate skill effectiveness. Use when validating skill
  improvements, testing skill effectiveness, preventing priming bias, measuring skill
  impact on behavior. Do not use when implementing skills (use skill-authoring instead),
  creating hooks (use hook-authoring instead).
category: testing
tags:
- testing
- validation
- TDD
- subagents
- fresh-instances
token_budget: 30
progressive_loading: true
---

# Subagent Testing - TDD for Skills

Test skills with fresh subagent instances to prevent priming bias and validate effectiveness.

## Table of Contents

1. [Overview](#overview)
2. [Why Fresh Instances Matter](#why-fresh-instances-matter)
3. [Testing Methodology](#testing-methodology)
4. [Quick Start](#quick-start)
5. [Detailed Testing Guide](#detailed-testing-guide)
6. [Success Criteria](#success-criteria)

## Overview

**Fresh instances prevent priming:** Each test runs in a new Claude conversation so that
the skill's impact is measured, not the effects of conversation history.

## Why Fresh Instances Matter

### The Priming Problem
Running tests in the same conversation creates bias:
- Prior context influences responses
- Skill effects get mixed with conversation history
- Can't isolate skill's true impact

### Fresh Instance Benefits
- **Isolation**: Each test starts clean
- **Reproducibility**: Consistent baseline state
- **Measurement**: Clear before/after comparison
- **Validation**: Proves skill effectiveness, not priming

## Testing Methodology

Three-phase TDD-style approach:

### Phase 1: Baseline Testing (RED)
Test without skill to establish baseline behavior.

### Phase 2: With-Skill Testing (GREEN)
Test with skill loaded to measure improvements.

### Phase 3: Rationalization Testing (REFACTOR)
Test skill's anti-rationalization guardrails.
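The three phases above can be sketched as a single harness loop. This is a minimal sketch, not part of the skill itself: `spawn` is an assumed, caller-supplied function that starts a brand-new agent session per call (optionally preloading a skill) and returns the response text.

```python
def run_three_phase(scenarios, adversarial, skill, spawn):
    """Sketch of the three-phase loop.

    `spawn(prompt, skill=None)` is a hypothetical harness callable that
    starts a fresh agent session per invocation and returns its response.
    """
    results = {"baseline": [], "with_skill": [], "rationalization": []}
    for prompt in scenarios:
        # Phase 1 (RED): fresh instance, no skill loaded.
        results["baseline"].append(spawn(prompt))
        # Phase 2 (GREEN): fresh instance, skill loaded, identical prompt.
        results["with_skill"].append(spawn(prompt, skill=skill))
    for prompt in adversarial:
        # Phase 3 (REFACTOR): prompts that invite rationalization.
        results["rationalization"].append(spawn(prompt, skill=skill))
    return results
```

Because every call to `spawn` opens a clean session, any difference between the `baseline` and `with_skill` responses can be attributed to the skill alone.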

## Quick Start

```bash
# 1. Create baseline tests (without skill)
# Use 5 diverse scenarios
# Document full responses

# 2. Create with-skill tests (fresh instances)
# Load skill explicitly
# Use identical prompts
# Compare to baseline

# 3. Create rationalization tests
# Test anti-rationalization patterns
# Verify guardrails work
```
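Step "Document full responses" above benefits from an append-only log. A possible sketch, assuming a local `test-results/` directory (the function name and layout are illustrative, not prescribed by the skill):

```python
import datetime
import json
import pathlib


def record_run(phase, prompt, response, out_dir="test-results"):
    """Append one test run to a per-phase JSONL log for auditability."""
    path = pathlib.Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    entry = {
        "phase": phase,  # "baseline", "with_skill", or "rationalization"
        "prompt": prompt,
        "response": response,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    with open(path / f"{phase}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```

One JSONL file per phase keeps baseline and with-skill runs separable while preserving the full response text for later scoring.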

## Detailed Testing Guide

For complete testing patterns, examples, and templates:
- **[Testing Patterns](modules/testing-patterns.md)** - Full TDD methodology
- **[Test Examples](modules/testing-patterns.md)** - Baseline, with-skill, rationalization tests
- **[Analysis Templates](modules/testing-patterns.md)** - Scoring and comparison frameworks

## Success Criteria

- **Baseline**: Document 5+ diverse baseline scenarios
- **Improvement**: ≥50% improvement in skill-related metrics
- **Consistency**: Results reproducible across fresh instances
- **Rationalization Defense**: Guardrails prevent ≥80% of rationalization attempts
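The improvement and defense thresholds above can be checked mechanically. A minimal sketch, assuming per-scenario numeric scores (the scoring rubric itself is up to the tester):

```python
def improvement_pct(baseline, with_skill):
    """Percent improvement of the mean with-skill score over baseline."""
    base = sum(baseline) / len(baseline)
    return (sum(with_skill) / len(with_skill) - base) / abs(base) * 100


def meets_criteria(baseline, with_skill, blocked, attempts):
    """Check the >=50% improvement and >=80% rationalization-defense bars."""
    return (
        improvement_pct(baseline, with_skill) >= 50
        and blocked / attempts >= 0.80
    )
```

For example, mean scores rising from 2 to 3 is a 50% improvement, so 8 of 10 blocked rationalization attempts would pass while 7 of 10 would fail.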

## See Also

- **skill-authoring**: Creating effective skills
- **bulletproof-skill**: Anti-rationalization patterns
- **test-skill**: Automated skill testing command

Overview

This skill provides a TDD-style testing methodology that runs each test in a fresh subagent instance to avoid priming bias and accurately measure a skill's effect. It focuses on repeatable baseline vs. with-skill comparisons and includes explicit rationalization tests to verify guardrails. Use it to validate improvements, measure impact, and ensure results are reproducible.

How this skill works

Tests run in three phases: baseline (RED) without the skill, with-skill (GREEN) using fresh subagent instances, and rationalization/refactor checks to confirm guardrails. Each scenario uses identical prompts across fresh sessions so differences reflect the skill alone. Results are scored and compared to success criteria like improvement percentage and consistency across runs.

When to use it

  • Validating that a new or changed skill actually improves behavior versus baseline.
  • Measuring skill impact on specific metrics (accuracy, safety, brevity, etc.).
  • Detecting and preventing priming bias from prior conversation context.
  • Testing anti-rationalization and guardrail effectiveness.
  • Regression testing skills after iterative changes or refactors.

Best practices

  • Always create 5+ diverse baseline scenarios to establish a stable baseline.
  • Spawn a fresh subagent instance for every test case to ensure isolation.
  • Use identical prompts and scoring rules for baseline and with-skill runs.
  • Automate comparison scoring and record full responses for auditability.
  • Include explicit rationalization prompts to validate guardrails and anti-bias logic.

Example use cases

  • Compare answer accuracy with and without a summarizer skill across 10 prompts.
  • Validate a safety filter by running adversarial prompts on fresh instances to measure rejection rate.
  • Measure reduction in verbosity after enabling a brevity skill using identical prompts.
  • Run regression checks after refactoring a skill to confirm no drop in effectiveness.
  • Test anti-rationalization logic by prompting for policy-violating workaround attempts.

FAQ

Why use fresh subagent instances for every test?

Fresh instances remove prior conversation context so measured differences reflect the skill, not priming or history.

How many scenarios should I run for a valid baseline?

Document at least five diverse scenarios; more are better for statistical confidence and coverage.