This skill helps you test, compare, and optimize prompt performance using structured A/B testing and clear metrics.
To add it to your agents, run `npx playbooks add skill fusengine/agents --skill prompt-testing`.
---
name: prompt-testing
description: A/B testing and performance metrics for prompts
allowed-tools: Read, Write, Bash
---
# Prompt Testing
Skill for testing, comparing, and measuring prompt performance.
## Documentation
- [metrics.md](docs/metrics.md) - Performance metrics definition
- [methodology.md](docs/methodology.md) - A/B testing protocol
## Testing Workflow
```text
1. DEFINE
   ├── Test objective
   ├── Metrics to measure
   └── Success criteria
2. PREPARE
   ├── Variants A and B
   ├── Test dataset
   └── Baseline (if any)
3. EXECUTE
   ├── Run on dataset
   ├── Collect results
   └── Document observations
4. ANALYZE
   ├── Calculate metrics
   ├── Compare variants
   └── Identify patterns
5. DECIDE
   ├── Recommendation
   ├── Statistical confidence
   └── Next iterations
```
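As a rough sketch of what comes out of the DEFINE and PREPARE steps, a test definition can be captured in a small structure like the one below; the `PromptTest` class and its field names are illustrative assumptions, not a format this skill prescribes.

```python
from dataclasses import dataclass

@dataclass
class PromptTest:
    """Illustrative test definition for the DEFINE and PREPARE steps (all field names are assumptions)."""
    name: str
    objective: str         # what the test is meant to establish
    metrics: list[str]     # metrics to measure, e.g. ["accuracy", "tokens_output"]
    success_criteria: str  # how the winner is decided
    variant_a: str         # path to the baseline prompt
    variant_b: str         # path to the challenger prompt
    dataset: str           # path to the test dataset (JSON)

test = PromptTest(
    name="Concise answers v2",
    objective="Reduce output tokens without losing accuracy",
    metrics=["accuracy", "compliance", "tokens_output", "latency"],
    success_criteria="accuracy B >= accuracy A and tokens B <= tokens A * 1.1",
    variant_a="prompt_a.md",
    variant_b="prompt_b.md",
    dataset="tests.json",
)
```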
## Performance Metrics
### Quality
| Metric | Description | Calculation |
|--------|-------------|-------------|
| **Accuracy** | Correct responses | Correct / Total |
| **Compliance** | Format adherence | Compliant / Total |
| **Consistency** | Response stability | 1 - Variance |
| **Relevance** | How well the response meets the need | Average score (1-5) |
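A minimal sketch of how these quality metrics could be computed from graded results; the per-case fields (`correct`, `compliant`, `relevance`) and the choice to measure Consistency as one minus the variance of normalized relevance scores are assumptions, since the table only gives the formulas.

```python
from statistics import pvariance

def quality_metrics(results: list[dict]) -> dict:
    """Aggregate per-case grades into the quality metrics (field names are assumptions)."""
    n = len(results)
    relevance = [r["relevance"] for r in results]                 # per-case scores on a 1-5 scale
    return {
        "accuracy":    sum(r["correct"] for r in results) / n,    # Correct / Total
        "compliance":  sum(r["compliant"] for r in results) / n,  # Compliant / Total
        "consistency": 1 - pvariance(s / 5 for s in relevance),   # 1 - Variance (scores rescaled to 0-1)
        "relevance":   sum(relevance) / n,                        # average score (1-5)
    }

print(quality_metrics([
    {"correct": True,  "compliant": True,  "relevance": 4},
    {"correct": True,  "compliant": False, "relevance": 5},
    {"correct": False, "compliant": True,  "relevance": 2},
]))
```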
### Efficiency
| Metric | Description | Calculation |
|--------|-------------|-------------|
| **Input Tokens** | Prompt size | Token count |
| **Output Tokens** | Response size | Token count |
| **Latency** | Response time | Time in ms |
| **Cost** | Price per request | Tokens × Price |
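The cost line (Tokens × Price) works out as below; the per-million-token prices are placeholders to replace with the actual pricing of the model under test.

```python
def request_cost(tokens_in: int, tokens_out: int,
                 price_in_per_mtok: float, price_out_per_mtok: float) -> float:
    """Cost of a single request, pricing input and output tokens separately (prices are placeholders)."""
    return (tokens_in * price_in_per_mtok + tokens_out * price_out_per_mtok) / 1_000_000

# Example: 1,200 input and 350 output tokens at hypothetical $3 / $15 per million tokens.
print(f"${request_cost(1200, 350, 3.0, 15.0):.5f}")  # $0.00885
```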
### Robustness
| Metric | Description | Calculation |
|--------|-------------|-------------|
| **Edge Cases** | Edge case handling | Passed / Total |
| **Jailbreak Resistance** | Resistance to bypass attempts | Blocked / Attempts |
| **Error Recovery** | Recovery from error cases | Recovered / Errors |
## Test Format
### Test Dataset
```json
{
"name": "Test Dataset v1",
"description": "Dataset for testing prompt XYZ",
"cases": [
{
"id": "case_001",
"type": "standard",
"input": "Test input",
"expected": "Expected output",
"tags": ["basic", "format"]
},
{
"id": "case_002",
"type": "edge_case",
"input": "Edge input",
"expected": "Expected behavior",
"tags": ["edge", "error"]
}
]
}
```
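A small sketch of loading a dataset in this format and checking that every case carries the fields shown above; the validation itself is an assumption about what a test runner would enforce.

```python
import json

REQUIRED_KEYS = {"id", "type", "input", "expected"}

def load_dataset(path: str) -> list[dict]:
    """Load a test dataset and verify each case has the fields used in the example above."""
    with open(path, encoding="utf-8") as f:
        dataset = json.load(f)
    for case in dataset["cases"]:
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"case {case.get('id', '?')} is missing {sorted(missing)}")
    return dataset["cases"]

cases = load_dataset("tests.json")
edge_cases = [c for c in cases if c["type"] == "edge_case"]
print(f"{len(cases)} cases loaded, {len(edge_cases)} edge cases")
```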
### Test Report
```markdown
# A/B Test Report: {{TEST_NAME}}
## Configuration
| Parameter | Value |
|-----------|-------|
| Date | {{DATE}} |
| Dataset | {{DATASET}} |
| Cases tested | {{N_CASES}} |
| Model | {{MODEL}} |
## Tested Variants
### Variant A (Baseline)
[Description or link to prompt A]
### Variant B (Challenger)
[Description or link to prompt B]
## Results
### Overall Scores
| Metric | A | B | Delta | Winner |
|--------|---|---|-------|--------|
| Accuracy | X% | Y% | +/-Z% | A/B |
| Compliance | X% | Y% | +/-Z% | A/B |
| Tokens | X | Y | +/-Z | A/B |
| Latency | Xms | Yms | +/-Zms | A/B |
### Detail by Case Type
| Type | A | B | Notes |
|------|---|---|-------|
| Standard | X% | Y% | |
| Edge cases | X% | Y% | |
| Error cases | X% | Y% | |
### Problematic Cases
| Case ID | Expected | A | B | Analysis |
|---------|----------|---|---|----------|
| case_XXX | ... | ❌ | ✅ | [Explanation] |
## Analysis
### B's Strengths
- [Improvement 1]
- [Improvement 2]
### B's Weaknesses
- [Regression 1]
### Observations
[Qualitative insights]
## Recommendation
**Verdict**: ✅ Adopt B / ⚠️ Iterate / ❌ Keep A
**Confidence**: High / Medium / Low
**Justification**:
[Explanation of recommendation]
## Next Steps
1. [Action 1]
2. [Action 2]
```
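One simple way to fill the `{{...}}` placeholders in this template is plain string substitution; the file name and the value mapping below are illustrative.

```python
from datetime import date

def render_report(template: str, values: dict[str, str]) -> str:
    """Replace {{KEY}} placeholders in the report template with concrete values."""
    for key, value in values.items():
        template = template.replace("{{" + key + "}}", value)
    return template

with open("report_template.md", encoding="utf-8") as f:  # hypothetical template file
    report = render_report(f.read(), {
        "TEST_NAME": "Concise answers v2",
        "DATE": date.today().isoformat(),
        "DATASET": "tests.json",
        "N_CASES": "24",
        "MODEL": "model-under-test",
    })
print(report)
```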
## Commands
```bash
# Create a test
/prompt test create --name "Test v1" --dataset tests.json
# Run an A/B test
/prompt test run --a prompt_a.md --b prompt_b.md --dataset tests.json
# View results
/prompt test results --id test_001
# Compare two tests
/prompt test compare --tests test_001,test_002
```
## Decision Criteria
### When to adopt variant B?
```text
IF:
    Accuracy B >= Accuracy A
    AND (Tokens B <= Tokens A * 1.1 OR accuracy improvement > 5%)
    AND no regression on edge cases
THEN:
    → Adopt B
ELSE IF:
    Accuracy improvement > 10%
    AND token regression < 20%
THEN:
    → Consider B (acceptable trade-off)
ELSE:
    → Keep A or iterate
```
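The rule above maps directly to code; this sketch assumes per-variant metric dictionaries with `accuracy`, `tokens`, and `edge_case_pass_rate` keys, which are illustrative names rather than fields defined by the skill.

```python
def decide(a: dict, b: dict) -> str:
    """Apply the decision criteria above to metrics for variant A (baseline) and B (challenger)."""
    accuracy_gain = b["accuracy"] - a["accuracy"]
    token_growth = (b["tokens"] - a["tokens"]) / a["tokens"]
    no_edge_regression = b["edge_case_pass_rate"] >= a["edge_case_pass_rate"]

    if (b["accuracy"] >= a["accuracy"]
            and (b["tokens"] <= a["tokens"] * 1.1 or accuracy_gain > 0.05)
            and no_edge_regression):
        return "Adopt B"
    if accuracy_gain > 0.10 and token_growth < 0.20:
        return "Consider B (acceptable trade-off)"
    return "Keep A or iterate"

print(decide(
    {"accuracy": 0.82, "tokens": 400, "edge_case_pass_rate": 0.75},
    {"accuracy": 0.88, "tokens": 430, "edge_case_pass_rate": 0.80},
))  # -> Adopt B
```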
## Best Practices
1. **Minimum 20 test cases** for statistical significance (see the sketch below)
2. **Include edge cases** (15-20% of dataset)
3. **Test multiple runs** for consistency
4. **Document hypotheses** before testing
5. **Version the prompts** being tested
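For the "statistical confidence" entry in the report, one generic option (not prescribed by this skill) is a two-proportion z-test on accuracy; the sketch below assumes both variants were run on the same number of cases, and with only ~20 cases the result should be read as a rough indication rather than proof.

```python
from math import erf, sqrt

def accuracy_p_value(correct_a: int, correct_b: int, n: int) -> float:
    """Two-sided p-value for the difference in accuracy between variants A and B,
    each evaluated on the same n cases (two-proportion z-test, a generic choice)."""
    p_a, p_b = correct_a / n, correct_b / n
    pooled = (correct_a + correct_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * (2 / n))
    if se == 0:
        return 1.0
    z = abs(p_b - p_a) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))  # two-sided normal tail probability

# Example: B answers 21/24 cases correctly vs. 17/24 for A.
print(f"p = {accuracy_p_value(17, 21, 24):.3f}")
```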
This skill runs A/B tests and measures prompt performance using defined metrics and a repeatable protocol. It helps teams compare prompt variants, quantify trade-offs (quality, efficiency, robustness), and produce actionable recommendations. The focus is on reproducible results and clear decision criteria.
It defines objectives and success criteria, prepares variants and a labeled test dataset, executes runs to collect responses and telemetry, then computes metrics like accuracy, compliance, tokens, latency, and edge-case handling. Results are aggregated into a structured report that highlights deltas, problematic cases, statistical confidence, and a recommended action. Commands support creating tests, running A/B comparisons, viewing results, and comparing past tests.
**What minimum dataset size is recommended?**
Use at least 20 test cases, with 15-20% of them edge cases, for meaningful results.

**How do you decide to adopt a challenger prompt?**
Adopt the challenger when its accuracy matches or exceeds the baseline, its token usage stays within 10% of the baseline (or the accuracy gain exceeds 5%), and there is no edge-case regression. A larger accuracy gain with a moderate token increase can still justify adoption as a trade-off.