
retrospective-validation skill

/retrospective-validation

This skill validates methodologies using historical data to deliver high-confidence results without live deployment, saving time and money.

npx playbooks add skill zpankz/mcp-skillset --skill retrospective-validation

Review the files below or copy the command above to add this skill to your agents.

Files (5)
SKILL.md
8.0 KB
---
name: Retrospective Validation
description: Validate methodology effectiveness using historical data without live deployment. Use when rich historical data exists (100+ instances), methodology targets observable patterns (error prevention, test strategy, performance optimization), pattern matching is feasible with clear detection rules, and live deployment has high friction (CI/CD integration effort, user study time, deployment risk). Enables 40-60% time reduction vs prospective validation, 60-80% cost reduction. Confidence calculation model provides statistical rigor. Validated in error recovery (1,336 errors, 23.7% prevention, 0.79 confidence).
allowed-tools: Read, Grep, Glob, Bash
---

# Retrospective Validation

**Validate methodologies with historical data, not live deployment.**

> When you have 1,000 past errors, you don't need to wait for 1,000 future errors to prove your methodology works.

---

## When to Use This Skill

Use this skill when:
- 📊 **Rich historical data**: 100+ instances (errors, test failures, performance issues)
- 🎯 **Observable patterns**: Methodology targets detectable issues
- 🔍 **Pattern matching feasible**: Clear detection heuristics, measurable false positive rate
- ⚡ **High deployment friction**: CI/CD integration costly, user studies time-consuming
- 📈 **Statistical rigor needed**: Want confidence intervals, not just hunches
- ⏰ **Time constrained**: Need validation in hours, not weeks

**Don't use when**:
- ❌ Insufficient data (<50 instances)
- ❌ Emergent effects (human behavior change, UX improvements)
- ❌ Pattern matching unreliable (>20% false positive rate)
- ❌ Low deployment friction (1-2 hour CI/CD integration)

---

## Quick Start (30 minutes)

### Step 1: Check Historical Data (5 min)

```bash
# Example: Error data for meta-cc
meta-cc query-tools --status error | jq '. | length'
# Output: 1336 errors ✅ (>100 threshold)

# Example: Test failures from CI logs
grep "FAILED" ci-logs/*.txt | wc -l
# Output: 427 failures ✅
```

**Threshold**: ≥100 instances for statistical confidence

### Step 2: Define Detection Rule (10 min)

```yaml
Tool: validate-path.sh
Prevents: "File not found" errors
Detection:
  - Error message matches: "no such file or directory"
  - OR "cannot read file"
  - OR "file does not exist"
Confidence: High (90%+) - deterministic check
```

### Step 3: Apply Rule to Historical Data (10 min)

```bash
# Count matches
grep -E "(no such file|cannot read|does not exist)" errors.log | wc -l
# Output: 163 errors (12.2% of total)

# Sample manual validation (30 errors)
# True positives: 28/30 (93.3%)
# Adjusted: 163 * 0.933 = 152 preventable ✅
```

### Step 4: Calculate Confidence (5 min)

```
Confidence = Data Quality × Accuracy × Logical Correctness
           = 0.85 × 0.933 × 1.0
           = 0.79 (High confidence)
```

**Result**: Tool would have prevented 152 errors with 79% confidence.
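
The same arithmetic as a quick shell check (a sketch; the counts come from Steps 3-4 above):

```bash
# Adjusted prevention count and confidence from the numbers above
awk 'BEGIN {
  matched = 163          # grep matches from Step 3
  tpr     = 28 / 30      # true-positive rate from the 30-error sample
  printf "Preventable: %d errors\n", matched * tpr
  printf "Confidence:  %.2f\n", 0.85 * tpr * 1.0   # D x A x L
}'
```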

---

## Four-Phase Process

### Phase 1: Data Collection

**1. Identify Data Sources**

For Claude Code / meta-cc:
```bash
# Error history
meta-cc query-tools --status error

# User pain points
meta-cc query-user-messages --pattern "error|fail|broken"

# Error context
meta-cc query-context --error-signature "..."
```

For other projects:
- Git history (commits, diffs, blame)
- CI/CD logs (test failures, build errors)
- Application logs (runtime errors)
- Issue trackers (bug reports)
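
A minimal sketch for assembling a dataset outside Claude Code, assuming CI logs live under `ci-logs/` (as in the Quick Start) and using bug-fix commits as a rough proxy for past incidents:

```bash
# Pull historical failure instances from CI logs
grep -h "FAILED" ci-logs/*.txt > failures.log
wc -l < failures.log      # volume check: >=100 instances?

# Bug-fix commits as a proxy for past incidents (multiple --grep are OR'd)
git log --oneline --grep=fix --grep=bug --since="1 year ago" | wc -l
```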

**2. Quantify Baseline**

Metrics needed:
- **Volume**: Total instances (e.g., 1,336 errors)
- **Rate**: Frequency (e.g., 5.78% error rate)
- **Distribution**: Category breakdown (e.g., file-not-found: 12.2%)
- **Impact**: Cost (e.g., MTTD: 15 min, MTTR: 30 min)
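
A quick way to compute volume, rate, and distribution from a flat error log (a sketch; `total_calls` is hypothetical, chosen to match the example figures above):

```bash
# Volume, rate, and one category's share of the distribution
total_calls=23114                     # hypothetical total tool invocations
errors=$(wc -l < errors.log)
awk -v e="$errors" -v t="$total_calls" \
  'BEGIN { printf "Volume: %d, Rate: %.2f%%\n", e, e * 100 / t }'

file_not_found=$(grep -ciE "no such file|cannot read|does not exist" errors.log)
awk -v c="$file_not_found" -v e="$errors" \
  'BEGIN { printf "file-not-found: %.1f%%\n", c * 100 / e }'
```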

### Phase 2: Pattern Definition

**1. Create Detection Rules**

For each tool/methodology:
```yaml
what_it_prevents: Error type or failure mode
detection_rule: Pattern matching heuristic
confidence: Estimated accuracy (high/medium/low)
```

**2. Define Success Criteria**

```yaml
prevention: Message matches AND tool would catch it
speedup: Tool faster than manual debugging
reliability: No false positives/negatives in sample
```

### Phase 3: Validation Execution

**1. Apply Rules to Historical Data**

```bash
# Runnable single-rule sketch; with several tools, classify each instance
# first and dispatch to the applicable tool's detection rule
total=0
prevented=0

would_have_prevented() {
  # Detection rule for validate-path.sh (file-not-found errors)
  grep -qiE "no such file|cannot read|does not exist" <<<"$1"
}

while IFS= read -r instance; do
  total=$((total + 1))
  would_have_prevented "$instance" && prevented=$((prevented + 1))
done < errors.log

awk -v p="$prevented" -v t="$total" \
  'BEGIN { printf "Prevention rate: %.1f%%\n", p * 100 / t }'
```

**2. Sample Manual Validation**

```
Sample size: 30 instances (95% confidence)
For each: "Would tool have prevented this?"
Calculate: True positive rate, False positive rate
Adjust: prevention_claim * true_positive_rate
```
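
Drawing the sample is a one-liner, assuming the claimed-prevented instances were written to a `prevented.log` file:

```bash
# Random sample of 30 claimed-prevented instances for manual review
shuf -n 30 prevented.log > sample.log
```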

**Example** (Bootstrap-003):
```
Sample: 30/317 claimed prevented
True positives: 28 (93.3%)
Adjusted: 317 * 0.933 = 296 errors
Confidence: High (93%+)
```

**3. Measure Performance**

```bash
# Tool time
time tool.sh < test_input
# Output: 0.05s

# Manual time (estimate from historical data)
# Average debug time: 15 min = 900s

# Speedup: 900 / 0.05 = 18,000x
```

### Phase 4: Confidence Assessment

**Confidence Formula**:

```
Confidence = D × A × L

Where:
D = Data Quality (0.5-1.0)
A = Accuracy (True Positive Rate, 0.5-1.0)
L = Logical Correctness (0.5-1.0)
```

**Data Quality** (D):
- 1.0: Complete, accurate, representative
- 0.8-0.9: Minor gaps or biases
- 0.6-0.7: Significant gaps
- <0.6: Unreliable data

**Accuracy** (A):
- 1.0: 100% true positive rate (verified)
- 0.8-0.95: High (sample validation 80-95%)
- 0.6-0.8: Medium (60-80%)
- <0.6: Low (unreliable pattern matching)

**Logical Correctness** (L):
- 1.0: Deterministic (tool directly addresses root cause)
- 0.8-0.9: High correlation (strong evidence)
- 0.6-0.7: Moderate correlation
- <0.6: Weak or speculative

**Example** (Bootstrap-003):
```
D = 0.85 (Complete error logs, minor gaps in context)
A = 0.933 (93.3% true positive rate from sample)
L = 1.0 (File validation is deterministic)

Confidence = 0.85 × 0.933 × 1.0 = 0.79 (High)
```

**Interpretation**:
- ≥0.75: High confidence (publishable)
- 0.60-0.74: Medium confidence (needs caveats)
- 0.45-0.59: Low confidence (suggestive, not conclusive)
- <0.45: Insufficient confidence (need prospective validation)
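
These thresholds are easy to encode; a sketch that computes a score and maps it to the bands above:

```bash
# Compute Confidence = D x A x L and map it to the interpretation bands
D=0.85; A=0.933; L=1.0
awk -v d="$D" -v a="$A" -v l="$L" 'BEGIN {
  c = d * a * l
  band = (c >= 0.75) ? "High" : (c >= 0.60) ? "Medium" : (c >= 0.45) ? "Low" : "Insufficient"
  printf "Confidence: %.2f (%s)\n", c, band
}'
# Output: Confidence: 0.79 (High)
```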

---

## Comparison: Retrospective vs Prospective

| Aspect | Retrospective | Prospective |
|--------|--------------|-------------|
| **Time** | Hours-days | Weeks-months |
| **Cost** | Low (queries) | High (deployment) |
| **Risk** | Zero | May introduce issues |
| **Confidence** | 0.60-0.95 | 0.90-1.0 |
| **Data** | Historical | New |
| **Scope** | Full history | Limited window |
| **Bias** | Hindsight | None |

**When to use each**:
- **Retrospective**: Fast validation, high data volume, observable patterns
- **Prospective**: Behavioral effects, UX, emergent properties
- **Hybrid**: Retrospective first, limited prospective for edge cases

---

## Success Criteria

Retrospective validation succeeds when:

1. **Sufficient data**: ≥100 instances analyzed
2. **High confidence**: ≥0.75 overall confidence score
3. **Sample validated**: ≥80% true positive rate
4. **Impact quantified**: Prevention % or speedup measured
5. **Time savings**: 40-60% faster than prospective validation

**Bootstrap-003 Validation**:
- ✅ Data: 1,336 errors analyzed
- ✅ Confidence: 0.79 (high)
- ✅ Sample: 93.3% true positive rate
- ✅ Impact: 23.7% error prevention
- ✅ Time: 3 hours vs 2+ weeks (prospective)

---

## Related Skills

**Parent framework**:
- [methodology-bootstrapping](../methodology-bootstrapping/SKILL.md) - Core OCA cycle

**Complementary acceleration**:
- [rapid-convergence](../rapid-convergence/SKILL.md) - Fast iteration (uses retrospective)
- [baseline-quality-assessment](../baseline-quality-assessment/SKILL.md) - Strong iteration 0

---

## References

**Core guide**:
- [Four-Phase Process](reference/process.md) - Detailed methodology
- [Confidence Calculation](reference/confidence.md) - Statistical rigor
- [Detection Rules](reference/detection-rules.md) - Pattern matching guide

**Examples**:
- [Error Recovery Validation](examples/error-recovery-1336-errors.md) - Bootstrap-003

---

**Status**: ✅ Validated | Bootstrap-003 | 0.79 confidence | 40-60% time reduction

Overview

This skill validates methodology effectiveness using existing historical data instead of deploying changes live. It shows where a rule or tool would have prevented past incidents, quantifies the prevention rate, and computes a confidence score against clear criteria. Use it for fast, low-risk validation when you have rich logs or CI history.

How this skill works

The skill collects historical instances (errors, test failures, performance records), defines detection rules that map a methodology to observable patterns, and applies those rules across the dataset. It uses sampling to estimate true positive/false positive rates, adjusts prevention claims accordingly, and computes a confidence score as Data Quality × Accuracy × Logical Correctness. Results include prevention percentage, speedup estimates, and a publishable confidence level.

When to use it

  • You have ≥100 historical instances (errors, failures, incidents)
  • Methodology targets observable, pattern-matchable issues
  • CI/CD or user-study deployment is costly or slow
  • You need statistical rigor (confidence intervals) quickly
  • You want a low-risk, fast validation alternative to live trials

Best practices

  • Start by enumerating sources: CI logs, runtime logs, issue trackers, commit history
  • Define clear, testable detection rules and record expected false positive ranges
  • Use a 30-instance sample for manual validation to estimate accuracy reliably
  • Quantify data quality and logical fit before trusting high prevention claims
  • Report prevention adjusted by true positive rate and include confidence band

Example use cases

  • Validate a file-validation tool against historical "file not found" errors and estimate preventable count
  • Measure how many past CI failures a proposed test-flakiness handler would have avoided
  • Estimate performance optimization impact by detecting past timeouts matching the optimization’s profile
  • Assess an error-recovery script’s coverage over historical runtime exceptions to prioritize deployment
  • Run retrospective checks to reduce costly CI/CD integration work by proving value first

FAQ

What minimum data volume is required?

Aim for at least 100 instances; under 50 is usually insufficient for reliable retrospective claims.

When is prospective validation still necessary?

Use prospective validation for behavioral effects, UX changes, human workflow shifts, or when pattern matching is unreliable.

How is the confidence score interpreted?

Confidence = Data Quality × Accuracy × Logical Correctness. A score of ≥0.75 is high confidence (publishable), 0.60–0.74 is medium (needs caveats), 0.45–0.59 is low (suggestive, not conclusive), and below 0.45 calls for prospective validation.