This skill helps you design rigorous experiments by detailing hypotheses, variables, baselines, ablations, confounds, and statistical plans.
To add this skill to your agents, run:
```
npx playbooks add skill ghostscientist/skills --skill experiment-design-checklist
```
---
name: experiment-design-checklist
description: Generates a rigorous experiment design given a hypothesis. Use when asked to design experiments, plan experiments, create an experimental setup, or figure out how to test a research hypothesis. Covers controls, baselines, ablations, metrics, statistical tests, and compute estimates.
---
# Experiment Design Checklist
Prevent the "I ran experiments for 3 months and they're meaningless" disaster through rigorous upfront design.
## The Core Principle
Before running ANY experiment, you should be able to answer:
1. What specific claim will this experiment support or refute?
2. What would convince a skeptical reviewer?
3. What could go wrong that would invalidate the results?
## Process
### Step 1: State the Hypothesis Precisely
Convert your research question into falsifiable predictions:
**Template:**
```
If [intervention/method], then [measurable outcome], because [mechanism].
```
**Examples:**
- "If we add auxiliary contrastive loss, then downstream task accuracy increases by >2%, because representations become more separable."
- "If we use learned positional encodings, then performance on sequences >4096 tokens improves, because the model can extrapolate beyond training length."
**Null hypothesis:** What does "no effect" look like? This is what you're trying to reject.
### Step 2: Identify Variables
**Independent Variables (what you manipulate):**
| Variable | Levels | Rationale |
|----------|--------|-----------|
| [Var 1] | [Level A, B, C] | [Why these levels] |
**Dependent Variables (what you measure):**
| Metric | How Measured | Why This Metric |
|--------|--------------|-----------------|
| [Metric 1] | [Procedure] | [Justification] |
**Control Variables (what you hold constant):**
| Variable | Fixed Value | Why Fixed |
|----------|-------------|-----------|
| [Var 1] | [Value] | [Prevents confound X] |
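To make the variable tables operational, it can help to encode them in a single run configuration so every run records what was manipulated, measured, and held fixed. The sketch below is illustrative Python; field names such as `aux_loss_weight` are placeholders, not part of this skill:
```python
# Hypothetical sketch: one config object per run, mirroring the
# independent / control / dependent variable tables above.
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    # Independent variable (manipulated): levels 0.0, 0.1, 0.5
    aux_loss_weight: float = 0.0
    # Control variables (held constant to prevent confounds)
    learning_rate: float = 3e-4
    batch_size: int = 256
    seed: int = 0
    # Dependent variables are measured, not configured; listed for logging
    metrics: tuple = ("val_accuracy", "val_loss")

# Full grid: every level of the independent variable, several seeds each
configs = [ExperimentConfig(aux_loss_weight=w, seed=s)
           for w in (0.0, 0.1, 0.5) for s in (0, 1, 2)]
```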
### Step 3: Choose Baselines
Every experiment needs comparisons. No result is meaningful in isolation.
**Baseline Hierarchy:**
1. **Random/Trivial Baseline**
- What does random chance achieve?
- Sanity check that the task isn't trivial
2. **Simple Baseline**
- Simplest reasonable approach
- Often embarrassingly effective
3. **Standard Baseline**
- Well-known method from literature
- Apples-to-apples comparison
4. **State-of-the-Art Baseline**
- Current best published result
- Only if you're claiming SOTA
5. **Ablated Self**
- Your method minus key components
- Shows each component contributes
**For each baseline, document:**
- Source (paper, implementation)
- Hyperparameters used
- Whether you re-ran or used reported numbers
- Any modifications made
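As a concrete example of the random/trivial baseline, here is a minimal sketch assuming a classification task and scikit-learn; swap in your own task setup and metric:
```python
# Majority-class baseline: the floor any real method must beat.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

def trivial_baseline(X_train, y_train, X_test, y_test):
    """Fit a majority-class predictor and score it on the test set."""
    clf = DummyClassifier(strategy="most_frequent")
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```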
### Step 4: Design Ablations
Ablations answer: "Is each component necessary?"
**Ablation Template:**
| Variant | What's Removed/Changed | Expected Effect | If No Effect... |
|---------|----------------------|-----------------|-----------------|
| Full Model | Nothing | Best performance | - |
| w/o Component A | Remove A | Performance drops X% | A isn't helping |
| w/o Component B | Remove B | Performance drops Y% | B isn't helping |
| Component A only | Only A, no B | Shows A's isolated contribution | - |
**Good ablations are:**
- Surgical (one change at a time)
- Interpretable (clear what was changed)
- Informative (result tells you something)
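One way to keep ablations surgical is to generate variants programmatically, toggling exactly one component per variant. A hypothetical sketch (component names are placeholders for your own modules):
```python
# Enumerate ablations relative to the full model, one change at a time.
FULL_MODEL = {"contrastive_loss": True, "learned_pos_enc": True}

def ablation_variants(full: dict) -> dict:
    variants = {"full_model": dict(full)}
    for component in full:
        variant = dict(full)
        variant[component] = False      # remove exactly one component
        variants[f"no_{component}"] = variant
    return variants

for name, cfg in ablation_variants(FULL_MODEL).items():
    print(name, cfg)
```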
### Step 5: Address Confounds
Things that could explain your results OTHER than your hypothesis:
**Common Confounds:**
| Confound | How to Check | How to Control |
|----------|--------------|----------------|
| Hyperparameter tuning advantage | Compare tuning budgets across methods | Equal tuning budget for all; report the procedure |
| Compute advantage | Matched FLOPs/params | Report compute used |
| Data leakage | Check train/test overlap | Strict separation |
| Random seed luck | Multiple seeds | Report variance |
| Implementation bugs (baseline) | Verify baseline numbers | Use official implementations |
| Cherry-picked examples | Random or systematic selection | Pre-register selection criteria |
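For the data-leakage row in particular, a quick overlap check is cheap to run before any training. A minimal sketch, assuming your examples can be serialized to strings:
```python
# Hash every example and measure train/test overlap; anything above 0.0
# deserves investigation before running experiments.
import hashlib

def overlap_fraction(train_examples, test_examples):
    digest = lambda x: hashlib.sha256(str(x).encode()).hexdigest()
    train_hashes = {digest(x) for x in train_examples}
    leaked = sum(digest(x) in train_hashes for x in test_examples)
    return leaked / max(len(test_examples), 1)
```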
### Step 6: Statistical Rigor
**Sample Size:**
- How many random seeds? (Minimum: 3, better: 5+)
- How many data splits? (If applicable)
- Power analysis: can you detect the expected effect size with the planned number of runs?
**What to Report:**
- Mean ± standard deviation (or standard error)
- Confidence intervals where appropriate
- Statistical significance tests if claiming "better"
**Appropriate Tests:**
| Comparison | Test | Assumptions |
|------------|------|-------------|
| Two methods, normal data | t-test | Normality, equal variance |
| Two methods, unknown dist. | Mann-Whitney U | Independent samples, at least ordinal data |
| Multiple methods | ANOVA + post-hoc tests | Normality, equal variance |
| Multiple methods, unknown dist. | Kruskal-Wallis | Independent samples, at least ordinal data |
| Paired comparisons | Wilcoxon signed-rank | Paired observations (same test instances) |
**Avoid:**
- p-hacking (running until significant)
- Uncorrected multiple comparisons (apply a Bonferroni or similar correction)
- Reporting only favorable metrics
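A minimal sketch of the reporting and testing above, assuming per-seed scores are collected as arrays and NumPy/SciPy are available; the scores shown are placeholder values for illustration:
```python
import numpy as np
from scipy import stats

# Placeholder per-seed scores (5 seeds per method) for illustration only
method_a = np.array([0.842, 0.851, 0.839, 0.848, 0.845])
method_b = np.array([0.831, 0.829, 0.836, 0.827, 0.833])

# Report mean ± standard deviation across seeds
print(f"A: {method_a.mean():.3f} ± {method_a.std(ddof=1):.3f}")
print(f"B: {method_b.mean():.3f} ± {method_b.std(ddof=1):.3f}")

# Welch's t-test (does not assume equal variances)
t, p = stats.ttest_ind(method_a, method_b, equal_var=False)

# Mann-Whitney U as the distribution-free alternative
u, p_mw = stats.mannwhitneyu(method_a, method_b, alternative="two-sided")

# With k planned comparisons, apply a Bonferroni-corrected threshold
k = 3
alpha_corrected = 0.05 / k
print(f"t-test p={p:.4f}, Mann-Whitney p={p_mw:.4f}, "
      f"corrected alpha={alpha_corrected:.4f}")
```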
### Step 7: Compute Budget
Before running, estimate:
| Component | Estimate | Notes |
|-----------|----------|-------|
| Single training run | X GPU-hours | [Details] |
| Hyperparameter search | Y runs × X hours | [Search strategy] |
| Baselines | Z runs × W hours | [Which baselines] |
| Ablations | N variants × X hours | [Which ablations] |
| Seeds | M seeds × above | [How many seeds] |
| **Total** | **T GPU-hours** | Buffer: 1.5-2x |
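A back-of-envelope version of the budget table; every count below is a placeholder to replace with numbers from your own profiling:
```python
# Rough compute estimate mirroring the table above.
single_run_gpu_hours = 8
search_runs, baseline_runs, n_ablations, seeds = 20, 4, 3, 5

per_seed = (search_runs + baseline_runs + n_ablations) * single_run_gpu_hours
total = per_seed * seeds
buffered = total * 1.5   # 1.5-2x buffer for failures and re-runs

print(f"Estimated: {total} GPU-hours; budget with buffer: {buffered:.0f}")
```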
**Go/No-Go Decision:** Is this feasible with available resources?
### Step 8: Pre-Registration (Optional but Recommended)
Write down BEFORE running:
- Exact hypotheses
- Primary metrics (not chosen post-hoc)
- Analysis plan
- What would constitute "success"
This prevents unconscious goal-post moving.
## Output: Experiment Design Document
```markdown
# Experiment Design: [Title]
## Hypothesis
[Precise statement]
## Variables
### Independent
[Table]
### Dependent
[Table]
### Controls
[Table]
## Baselines
1. [Baseline 1]: [Source, details]
2. [Baseline 2]: [Source, details]
## Ablations
[Table]
## Confound Mitigation
[Table]
## Statistical Plan
- Seeds: [N]
- Tests: [Which tests for which comparisons]
- Significance threshold: [α level]
## Compute Budget
[Table with total estimate]
## Success Criteria
- Primary: [What must be true]
- Secondary: [Nice to have]
## Timeline
- Phase 1: [What, when]
- Phase 2: [What, when]
## Known Risks
1. [Risk 1]: [Mitigation]
2. [Risk 2]: [Mitigation]
```
## Red Flags in Experiment Design
🚩 "We'll figure out the metrics later"
🚩 "One run should be enough"
🚩 "We don't need baselines, it's obviously better"
🚩 "Let's just see what happens"
🚩 "We can always run more if it's not significant"
🚩 No compute estimate before starting
🚩 Vague success criteria
Used end to end, this skill turns a stated hypothesis into a falsifiable prediction, enumerates independent, dependent, and control variables, proposes a hierarchy of baselines and surgical ablations, lists common confounds with mitigations, recommends statistical tests and sample sizes, and estimates a compute budget with go/no-go criteria. The result is an experiment design document you can pre-register and execute, avoiding wasted runs and producing defensible, publishable results.
## FAQ

**What sample size or seeds should I use?**
Use at least 3 seeds as an absolute minimum; 5+ is better for variance estimates. Run a power analysis if you expect small effect sizes, to determine how many runs you need.

**How do I choose appropriate baselines?**
Include trivial/random and simple baselines first, then standard methods from the literature. Only compare against SOTA if you are claiming state-of-the-art results, and document sources and hyperparameters for every baseline.

**What statistical tests are recommended?**
Choose tests based on the data and pairing: t-test or ANOVA for normally distributed metrics, Mann-Whitney U or Kruskal-Wallis for unknown distributions, and paired tests (Wilcoxon signed-rank) for matched test instances.

**How do I prevent compute-advantage confounds?**
Match FLOPs, parameter counts, or training steps across comparisons, document the compute used, and report any unavoidable differences explicitly.