
This skill helps you test, compare, and optimize prompt performance using structured A/B testing and clear metrics.

npx playbooks add skill fusengine/agents --skill prompt-testing

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
---
name: prompt-testing
description: A/B testing and performance metrics for prompts
allowed-tools: Read, Write, Bash
---

# Prompt Testing

Skill for testing, comparing, and measuring prompt performance.

## Documentation

- [metrics.md](docs/metrics.md) - Performance metric definitions
- [methodology.md](docs/methodology.md) - A/B testing protocol

## Testing Workflow

```text
1. DEFINE
   └── Test objective
   └── Metrics to measure
   └── Success criteria

2. PREPARE
   └── Variants A and B
   └── Test dataset
   └── Baseline (if existing)

3. EXECUTE
   └── Run on dataset
   └── Collect results
   └── Document observations

4. ANALYZE
   └── Calculate metrics
   └── Compare variants
   └── Identify patterns

5. DECIDE
   └── Recommendation
   └── Statistical confidence
   └── Next iterations
```
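
The sketch below (Python, not part of the skill's command set) shows one way the EXECUTE phase could look: run both variants over every case and record raw outputs plus latency. `call_model` is a hypothetical placeholder for whatever model client you actually use.

```python
import json
import time

def call_model(prompt: str, case_input: str) -> str:
    # Placeholder: replace with your real model/client call
    raise NotImplementedError("plug in your model client here")

def execute(prompt_a: str, prompt_b: str, dataset_path: str) -> list[dict]:
    """Run variants A and B over every case and collect outputs and latency."""
    with open(dataset_path) as f:
        cases = json.load(f)["cases"]

    results = []
    for case in cases:
        row = {"id": case["id"], "type": case["type"], "expected": case["expected"]}
        for label, prompt in (("A", prompt_a), ("B", prompt_b)):
            start = time.perf_counter()
            output = call_model(prompt, case["input"])
            row[label] = {
                "output": output,
                "latency_ms": (time.perf_counter() - start) * 1000,
            }
        results.append(row)
    return results
```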

## Performance Metrics

### Quality

| Metric | Description | Calculation |
|--------|-------------|-------------|
| **Accuracy** | Correct responses | Correct / Total |
| **Compliance** | Format adherence | Compliant / Total |
| **Consistency** | Stability across repeated runs | 1 - score variance |
| **Relevance** | Meeting the need | Average score (1-5) |
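
As a rough illustration, the quality metrics above could be computed as follows, assuming you already have per-case correctness and compliance flags plus scores from repeated runs (hypothetical helper names, not part of the skill):

```python
from statistics import pvariance

def accuracy(correct_flags: list[bool]) -> float:
    return sum(correct_flags) / len(correct_flags)       # Correct / Total

def compliance(compliant_flags: list[bool]) -> float:
    return sum(compliant_flags) / len(compliant_flags)   # Compliant / Total

def consistency(run_scores: list[float]) -> float:
    # Scores normalized to 0-1; lower variance across repeats means higher consistency
    return 1 - pvariance(run_scores)

def relevance(ratings: list[int]) -> float:
    return sum(ratings) / len(ratings)                   # Average score (1-5)
```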

### Efficiency

| Metric | Description | Calculation |
|--------|-------------|-------------|
| **Input Tokens** | Prompt size | Token count |
| **Output Tokens** | Response size | Token count |
| **Latency** | Response time | ms |
| **Cost** | Price per request | Tokens × price per token |
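
A minimal sketch of the cost calculation; the per-token prices below are placeholders, not real provider rates:

```python
# Placeholder rates for illustration only; use your provider's current pricing
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # e.g. $3 per 1M input tokens
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # e.g. $15 per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)
```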

### Robustness

| Metric | Description | Calculation |
|--------|-------------|-------------|
| **Edge Cases** | Edge case handling | Passed / Total |
| **Jailbreak Resistance** | Resistance to bypass attempts | Blocked / Attempts |
| **Error Recovery** | Error recovery | Recovered / Errors |

## Test Format

### Test Dataset

```json
{
  "name": "Test Dataset v1",
  "description": "Dataset for testing prompt XYZ",
  "cases": [
    {
      "id": "case_001",
      "type": "standard",
      "input": "Test input",
      "expected": "Expected output",
      "tags": ["basic", "format"]
    },
    {
      "id": "case_002",
      "type": "edge_case",
      "input": "Edge input",
      "expected": "Expected behavior",
      "tags": ["edge", "error"]
    }
  ]
}
```
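
A small loader for this dataset format could validate cases up front, so a malformed entry fails before a run rather than partway through it. This is an illustrative sketch, not something shipped with the skill:

```python
import json

REQUIRED_KEYS = {"id", "type", "input", "expected", "tags"}

def load_dataset(path: str) -> dict:
    """Load a test dataset and check that every case has the expected fields."""
    with open(path) as f:
        dataset = json.load(f)
    for i, case in enumerate(dataset.get("cases", [])):
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"case {i} ({case.get('id', '?')}) missing keys: {missing}")
    return dataset
```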

### Test Report

```markdown
# A/B Test Report: {{TEST_NAME}}

## Configuration

| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |

## Tested Variants

### Variant A (Baseline)
[Description or link to prompt A]

### Variant B (Challenger)
[Description or link to prompt B]

## Results

### Overall Scores

| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |

### Detail by Case Type

| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |

### Problematic Cases

| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |

## Analysis

### B's Strengths
- [Improvement 1]
- [Improvement 2]

### B's Weaknesses
- [Regression 1]

### Observations
[Qualitative insights]

## Recommendation

**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A

**Confidence**: High / Medium / Low

**Justification**:
[Explanation of recommendation]

## Next Steps
1. [Action 1]
2. [Action 2]
```
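
To fill the Overall Scores table, deltas and winners could be derived along these lines. This is an illustrative sketch; the numbers at the bottom are placeholders, not real results:

```python
# Quality metrics: higher is better. Tokens and latency: lower is better.
HIGHER_IS_BETTER = {"accuracy", "compliance"}

def score_row(metric: str, a: float, b: float) -> dict:
    delta = b - a
    b_wins = b > a if metric in HIGHER_IS_BETTER else b < a
    return {
        "metric": metric,
        "A": a,
        "B": b,
        "delta": delta,
        "winner": "B" if b_wins else ("tie" if a == b else "A"),
    }

# Placeholder aggregates (A, B) purely for demonstration
rows = [score_row(m, a, b) for m, (a, b) in {
    "accuracy": (0.82, 0.88),
    "compliance": (0.95, 0.97),
    "tokens": (1200, 1350),
    "latency": (820, 790),
}.items()]
```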

## Commands

```bash
# Create a test
/prompt test create --name "Test v1" --dataset tests.json

# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json

# View results
/prompt test results --id test_001

# Compare two tests
/prompt test compare --tests test_001,test_002
```

## Decision Criteria

### When to adopt variant B?

```text
IF:
  - Accuracy B >= Accuracy A
  AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
  AND no regression on edge cases
THEN:
  → Adopt B

ELSE IF:
  - Accuracy improvement > 10%
  AND token regression < 20%
THEN:
  → Consider B (acceptable trade-off)

ELSE:
  → Keep A or iterate
```
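
Translated literally into code, the rules above might look like this sketch (accuracy values as fractions between 0 and 1):

```python
def decide(acc_a: float, acc_b: float,
           tokens_a: float, tokens_b: float,
           edge_regression: bool) -> str:
    """Apply the adoption criteria to aggregated A/B results."""
    acc_gain = acc_b - acc_a
    token_increase = (tokens_b - tokens_a) / tokens_a

    if (acc_b >= acc_a
            and (tokens_b <= tokens_a * 1.1 or acc_gain > 0.05)
            and not edge_regression):
        return "Adopt B"
    if acc_gain > 0.10 and token_increase < 0.20:
        return "Consider B (acceptable trade-off)"
    return "Keep A or iterate"
```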

## Best Practices

1. **Minimum 20 test cases** for significance
2. **Include edge cases** (15-20% of dataset)
3. **Test multiple runs** for consistency
4. **Document hypotheses** before testing
5. **Version the prompts** being tested

Overview

This skill runs A/B tests and measures prompt performance using defined metrics and a repeatable protocol. It helps teams compare prompt variants, quantify trade-offs (quality, efficiency, robustness), and produce actionable recommendations. The focus is on reproducible results and clear decision criteria.

How this skill works

It defines objectives and success criteria, prepares variants and a labeled test dataset, executes runs to collect responses and telemetry, then computes metrics like accuracy, compliance, tokens, latency, and edge-case handling. Results are aggregated into a structured report that highlights deltas, problematic cases, statistical confidence, and a recommended action. Commands support creating tests, running A/B comparisons, viewing results, and comparing past tests.

When to use it

  • Selecting between two prompt designs before deployment
  • Measuring impact of prompt edits on quality and cost
  • Validating robustness against edge cases and jailbreak attempts
  • Tracking regression across prompt versions
  • Benchmarking prompts across models or configurations

Best practices

  • Define clear test objectives and hypotheses before running tests
  • Use a minimum of 20 diverse cases with 15–20% edge cases
  • Run multiple repeats to measure consistency and variance
  • Version and document each prompt variant and dataset used
  • Report both absolute metrics and per-case breakdowns for transparency

Example use cases

  • Compare a concise instruction prompt vs a detailed walkthrough to see accuracy vs tokens trade-off
  • Evaluate a defensive prompt change for jailbreak resistance on adversarial inputs
  • Measure latency and token cost differences when moving prompts between models
  • Validate that a candidate prompt does not regress on existing edge-case behaviors
  • Produce an A/B test report to justify adopting a new prompt in production

FAQ

What minimum dataset size is recommended?

Use at least 20 test cases; include 15–20% edge cases for meaningful results.

How do you decide to adopt a challenger prompt?

Adopt B when its accuracy matches or exceeds A's, its token usage stays within 10% of A's (or the accuracy gain exceeds 5%), and there is no regression on edge cases. A larger accuracy gain (over 10%) with a token increase under 20% can still be an acceptable trade-off.