---
name: exp-driven-dev
description: Builds features with A/B testing in mind using Ronny Kohavi's frameworks and Netflix/Airbnb experimentation culture. Use when implementing feature flags, choosing metrics, designing experiments, or building for fast iteration. Focuses on guardrail metrics, statistical significance, and experiment-driven development.
---
# Experimentation-Driven Development
## When This Skill Activates
Claude uses this skill when:
- Building new features that affect core metrics
- Implementing A/B testing infrastructure
- Making data-driven decisions
- Setting up feature flags for gradual rollouts
- Choosing which metrics to track
## Core Frameworks
### 1. Experiment Design (Source: Ronny Kohavi, Microsoft/Airbnb)
**The HITS Framework:**
**H - Hypothesis:**
> "We believe that [change] will cause [metric] to [increase/decrease] because [reason]"
**I - Implementation:**
- Feature flag setup
- Treatment vs control
- Sample size calculation
**T - Test:**
- Run for statistical significance
- Monitor guardrail metrics
- Watch for unexpected effects
**S - Ship or Stop:**
- Ship if positive
- Stop if negative
- Iterate if inconclusive
**Example:**
```markdown
Hypothesis:
"We believe that adding social proof ('X people bought this')
will increase conversion rate by 10%
because it reduces purchase anxiety."
Implementation:
- Control: No social proof
- Treatment: Show "X people bought"
- Sample size: 10,000 users per variant
- Duration: 2 weeks
Test:
- Primary metric: Conversion rate
- Guardrails: Cart abandonment, return rate
Ship or Stop:
- If conversion improves by 5% or more → Ship
- If conversion drops by 2% or more → Stop
- If inconclusive → Iterate and retest
```
---
### 2. Metric Selection
**Primary Metric:**
- ONE metric you're trying to move
- Directly tied to business value
- Clear success threshold
**Guardrail Metrics:**
- Metrics that shouldn't degrade
- Prevent gaming the system
- Ensure quality maintained
**Example:**
```
Feature: Streamlined checkout
Primary Metric:
✅ Purchase completion rate (+10%)
Guardrail Metrics:
⚠️ Cart abandonment (don't increase)
⚠️ Return rate (don't increase)
⚠️ Support tickets (don't increase)
⚠️ Load time (stay <2s)
```
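One way to keep the primary/guardrail split explicit is to encode it in the experiment config itself. The shape below is a hypothetical sketch for the streamlined-checkout example (field names like `guardrails` and `rule`, and the baseline/target values, are illustrative, not any specific platform's API):

```typescript
// Hypothetical experiment config; values are placeholders for illustration.
const streamlinedCheckoutExperiment = {
  name: 'streamlined-checkout',
  primaryMetric: {
    name: 'purchase_completion_rate',
    baseline: 0.40,            // assumed current rate
    target: 0.44,              // +10% relative improvement
  },
  guardrails: [
    { name: 'cart_abandonment_rate', rule: 'must_not_increase' },
    { name: 'return_rate',           rule: 'must_not_increase' },
    { name: 'support_tickets',       rule: 'must_not_increase' },
    { name: 'p95_load_time_ms',      rule: 'must_stay_below', threshold: 2000 },
  ],
} as const;
```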
---
### 3. Statistical Significance
**The Math:**
```
Minimum sample size = f(baseline rate, minimum detectable effect, confidence, power)
Typical settings:
- Confidence: 95% (p < 0.05)
- Power: 80% (80% chance of detecting a true effect at least as large as the MDE)
- Effect size: minimum detectable change worth acting on
Example:
- Baseline conversion: 10%
- Minimum detectable effect: +1% (to 11%)
- Required: ~15,000 users per variant
```
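As a quick sanity check, the standard two-proportion approximation behind these numbers can be written in a few lines. This is a simplified sketch (two-sided test, equal-size variants), not a substitute for proper stats tooling:

```typescript
// Approximate sample size per variant for a two-proportion test.
// Default z values assume 95% confidence (two-sided) and 80% power.
function sampleSizePerVariant(
  baseline: number,   // current rate, e.g. 0.10 for 10% conversion
  mde: number,        // absolute minimum detectable effect, e.g. 0.01 (+1 point)
  zAlpha = 1.96,
  zBeta = 0.84
): number {
  const p2 = baseline + mde;
  const variance = baseline * (1 - baseline) + p2 * (1 - p2);
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (mde * mde));
}

// Example from above: 10% baseline, +1 point MDE
console.log(sampleSizePerVariant(0.10, 0.01)); // ≈ 14,732 per variant (~15,000)
```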
**Common Mistakes:**
- ❌ Stopping test early (peeking bias)
- ❌ Running too short (seasonal effects)
- ❌ Too many variants (dilutes sample)
- ❌ Changing test mid-flight
---
### 4. Feature Flag Architecture
**Implementation:**
```javascript
// Feature flag pattern
// (getFeatureRollout, hashUserId, consistentHash, and track are assumed to be
//  provided by your flagging/analytics layer)
function checkoutFlow(user) {
  if (isFeatureEnabled(user, 'new-checkout')) {
    return newCheckoutExperience();
  } else {
    return oldCheckoutExperience();
  }
}

// Gradual rollout: deterministic bucketing, so a user's decision is stable
// for a given rollout percentage
function isFeatureEnabled(user, feature) {
  const rolloutPercent = getFeatureRollout(feature);
  const userBucket = hashUserId(user.id) % 100;
  return userBucket < rolloutPercent;
}

// Experiment assignment: log every assignment so exposure can be joined to
// outcomes during analysis
function assignExperiment(user, experiment) {
  const variant = consistentHash(user.id, experiment);
  track('experiment_assigned', {
    userId: user.id,
    experiment: experiment,
    variant: variant
  });
  return variant;
}
```
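The pattern above leans on `hashUserId` / `consistentHash` helpers. A minimal deterministic sketch (FNV-1a style; real deployments typically use the hashing built into their flagging platform) could look like this:

```typescript
// Deterministic string hash (FNV-1a, 32-bit), illustrative only.
function hashString(input: string): number {
  let hash = 0x811c9dc5;                  // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);   // FNV prime
  }
  return hash >>> 0;                      // force unsigned 32-bit
}

// Same user + same feature/experiment key -> same bucket on every call,
// so a 10% rollout always contains the same 10% of users.
function bucket(userId: string, key: string): number {
  return hashString(`${key}:${userId}`) % 100;
}
```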
---
## Decision Tree: Should We Experiment?
```
NEW FEATURE
    │
    ├─ Affects core metrics? ──────YES──→ EXPERIMENT REQUIRED
    │              NO ↓
    │
    ├─ Risky change? ──────────────YES──→ EXPERIMENT RECOMMENDED
    │              NO ↓
    │
    ├─ Uncertain impact? ──────────YES──→ EXPERIMENT USEFUL
    │              NO ↓
    │
    ├─ Easy to A/B test? ──────────YES──→ WHY NOT EXPERIMENT?
    │              NO ↓
    │
    └─ SHIP WITHOUT TEST
       (But still feature flag for rollback)
```
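If it helps to make the routing explicit, the tree can be encoded as a small helper. This is a hypothetical sketch; the parameter names and return values are just labels for the branches above, and the inputs are judgment calls rather than measurements:

```typescript
type Recommendation =
  | 'experiment required'
  | 'experiment recommended'
  | 'experiment useful'
  | 'why not experiment?'
  | 'ship without test (behind a flag)';

// Encodes the decision tree above, top to bottom.
function shouldExperiment(change: {
  affectsCoreMetrics: boolean;
  risky: boolean;
  uncertainImpact: boolean;
  easyToTest: boolean;
}): Recommendation {
  if (change.affectsCoreMetrics) return 'experiment required';
  if (change.risky) return 'experiment recommended';
  if (change.uncertainImpact) return 'experiment useful';
  if (change.easyToTest) return 'why not experiment?';
  return 'ship without test (behind a flag)';
}
```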
## Action Templates
### Template 1: Experiment Spec
````markdown
# Experiment: [Name]
## Hypothesis
**We believe:** [change]
**Will cause:** [metric] to [increase/decrease]
**Because:** [reasoning]
## Variants
### Control (50%)
[Current experience]
### Treatment (50%)
[New experience]
## Metrics
### Primary Metric
- **What:** [metric name]
- **Current:** [baseline]
- **Target:** [goal]
- **Success:** [threshold]
### Guardrail Metrics
- **Metric 1:** [name] - Don't decrease
- **Metric 2:** [name] - Don't increase
- **Metric 3:** [name] - Maintain
## Sample Size
- **Users needed:** [X per variant]
- **Duration:** [Y days]
- **Confidence:** 95%
- **Power:** 80%
## Implementation
```javascript
if (experiment('feature-name') === 'treatment') {
// New experience
} else {
// Old experience
}
```
## Success Criteria
- [ ] Primary metric improved by [X]%
- [ ] No guardrail degradation
- [ ] Statistical significance reached
- [ ] No unexpected negative effects
## Decision
- **If positive:** Ship to 100%
- **If negative:** Rollback, iterate
- **If inconclusive:** Extend or redesign
````
### Template 2: Feature Flag Implementation
```typescript
// features.ts
type FeatureConfig = {
  rollout: number;      // % of users, 0-100
  enabled: boolean;
  description: string;
};

export const FEATURES: Record<string, FeatureConfig> = {
  'new-checkout': {
    rollout: 10, // 10% of users
    enabled: true,
    description: 'New streamlined checkout flow'
  },
  'ai-recommendations': {
    rollout: 0, // Not live yet
    enabled: false,
    description: 'AI-powered product recommendations'
  }
};

// feature-flags.ts
export function isEnabled(userId: string, feature: string): boolean {
  const config = FEATURES[feature];
  if (!config || !config.enabled) return false;
  const bucket = consistentHash(userId) % 100;
  return bucket < config.rollout;
}

// usage in code (JSX, so this file would be .tsx)
if (isEnabled(user.id, 'new-checkout')) {
  return <NewCheckout />;
} else {
  return <OldCheckout />;
}
```
### Template 3: Experiment Dashboard
```markdown
# Experiment Dashboard
## Active Experiments
### Experiment 1: [Name]
- **Status:** Running
- **Started:** [date]
- **Progress:** [X]% sample size reached
- **Primary metric:** [current result]
- **Guardrails:** ✅ All healthy
### Experiment 2: [Name]
- **Status:** Complete
- **Result:** Treatment won (+15% conversion)
- **Decision:** Ship to 100%
- **Shipped:** [date]
## Key Metrics
### Experiment Velocity
- **Experiments launched:** [X per month]
- **Win rate:** [Y]%
- **Average duration:** [Z] days
### Impact
- **Revenue impact:** +$[X]
- **Conversion improvement:** +[Y]%
- **User satisfaction:** +[Z] NPS
## Learnings
- [Key insight 1]
- [Key insight 2]
- [Key insight 3]
```
## Quick Reference
### 🧪 Experiment Checklist
**Before Starting:**
- [ ] Hypothesis written (believe → cause → because)
- [ ] Primary metric defined
- [ ] Guardrails identified
- [ ] Sample size calculated
- [ ] Feature flag implemented
- [ ] Tracking instrumented
**During Experiment:**
- [ ] Don't peek early (wait for significance)
- [ ] Monitor guardrails daily
- [ ] Watch for unexpected effects
- [ ] Log any external factors (holidays, outages)
**After Experiment:**
- [ ] Statistical significance reached
- [ ] Guardrails not degraded
- [ ] Decision made (ship/stop/iterate)
- [ ] Learning documented
---
## Real-World Examples
### Example 1: Netflix Experimentation
**Volume:** 250+ experiments running at once
**Approach:** Everything is an experiment
**Culture:** "Strong opinions, weakly held - let data decide"
**Example Test:**
- Hypothesis: Bigger thumbnails increase engagement
- Result: No improvement, actually hurt browse time
- Decision: Rollback
- Learning: Saved $$ by not shipping
---
### Example 2: Airbnb's Experiments
**Test:** New search ranking algorithm
**Primary:** Bookings per search
**Guardrails:**
- Search quality (ratings of bookings)
- Host earnings (don't concentrate bookings)
- Guest satisfaction
**Result:** +3% bookings, all guardrails healthy → Ship
---
### Example 3: Stripe's Feature Flags
**Approach:** Every feature behind flag
**Benefits:**
- Instant rollback (flip flag)
- Gradual rollout (1% → 5% → 25% → 100%)
- Test in production safely
**Example:**
```javascript
if (experiments.isEnabled('instant-payouts')) {
  return <InstantPayouts />;
}
```
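One way to make the 1% → 5% → 25% → 100% ramp explicit is to keep it as data next to the flag. This is a hypothetical sketch of such a schedule, not Stripe's actual tooling; the field names are assumptions:

```typescript
// Illustrative ramp schedule for a flagged feature.
const instantPayoutsRamp = [
  { percent: 1,   minHoldDays: 1, watch: ['error rate', 'latency'] },
  { percent: 5,   minHoldDays: 2, watch: ['guardrail metrics'] },
  { percent: 25,  minHoldDays: 3, watch: ['primary metric trend'] },
  { percent: 100, minHoldDays: 0, watch: ['post-launch monitoring'] },
];

// Advance a stage only after it has been held long enough with healthy guardrails;
// roll back by setting the rollout percent back to 0.
```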
---
## Common Pitfalls
### ❌ Mistake 1: Peeking Too Early
**Problem:** Stopping test before statistical significance
**Fix:** Calculate sample size upfront, wait for it
### ❌ Mistake 2: No Guardrails
**Problem:** Gaming the metric (increase clicks but hurt quality)
**Fix:** Always define guardrails
### ❌ Mistake 3: Too Many Variants
**Problem:** Not enough users per variant
**Fix:** Limit to 2-3 variants max
### ❌ Mistake 4: Ignoring External Factors
**Problem:** Holiday spike looks like treatment effect
**Fix:** Note external events, extend duration
---
## Related Skills
- **metrics-frameworks** - For choosing right metrics
- **growth-embedded** - For growth experiments
- **ship-decisions** - For when to ship vs test more
- **strategic-build** - For deciding what to test
---
## Key Quotes
**Ronny Kohavi:**
> "The best way to predict the future is to run an experiment."
**Netflix Culture:**
> "Strong opinions, weakly held. Let data be the tie-breaker."
**Airbnb:**
> "We trust our intuition to generate hypotheses, and we trust data to make decisions."
---
## Further Learning
- **references/experiment-design-guide.md** - Complete methodology
- **references/statistical-significance.md** - Sample size calculations
- **references/feature-flags-implementation.md** - Code examples
- **references/guardrail-metrics.md** - Choosing guardrails