
This skill identifies flaky tests by analyzing CI history and test patterns to improve reliability and CI stability.

`npx playbooks add skill jmagly/aiwg --skill flaky-detect`

---
name: flaky-detect
description: Identify flaky tests from CI history and test execution patterns. Use when debugging intermittent test failures, auditing test reliability, or improving CI stability.
version: 1.0.0
---

# Flaky Detect Skill

## Purpose

Identify flaky tests (tests that pass and fail non-deterministically) by analyzing CI history, execution patterns, and test characteristics. Google has reported that roughly 4.56% of its tests exhibit flakiness, a significant drain on developer productivity and CI throughput.

## Research Foundation

| Finding | Source | Reference |
|---------|--------|-----------|
| 4.56% flaky rate | Google (2016) | [Flaky Tests at Google](https://testing.googleblog.com/2016/05/flaky-tests-at-google-and-how-we.html) |
| ML Classification | FlaKat (2024) | [arXiv:2403.01003](https://arxiv.org/abs/2403.01003) - 85%+ accuracy |
| LLM Auto-repair | FlakyFix (2023) | [arXiv:2307.00012](https://arxiv.org/html/2307.00012v4) |
| Flaky Taxonomy | Luo et al. (2014) | "An Empirical Analysis of Flaky Tests" |

## When This Skill Applies

- User reports "tests sometimes fail" or "intermittent failures"
- CI has been unstable or unreliable
- User wants to audit test suite reliability
- Pre-release quality assessment
- Debugging non-deterministic behavior

## Trigger Phrases

| Natural Language | Action |
|------------------|--------|
| "Find flaky tests" | Analyze CI history for flaky patterns |
| "Why does CI keep failing?" | Identify flaky tests causing failures |
| "Test suite is unreliable" | Full flaky test audit |
| "This test sometimes passes" | Analyze specific test for flakiness |
| "Audit test reliability" | Comprehensive flaky detection |
| "Quarantine flaky tests" | Identify and isolate flaky tests |

## Flaky Test Taxonomy (adapted from Luo et al., 2014)

| Category | Percentage | Root Causes |
|----------|------------|-------------|
| **Async/Timing** | 45% | Race conditions, insufficient waits, timeouts |
| **Test Order** | 20% | Shared state, execution order dependencies |
| **Environment** | 15% | File system, network, configuration differences |
| **Resource Limits** | 10% | Memory, threads, connection pools |
| **Non-deterministic** | 10% | Random values, timestamps, UUIDs |
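
For reporting, the scanner categories used in the Detection Methods below can be rolled up into these taxonomy buckets. A minimal sketch; the mapping itself is an illustrative assumption, not part of the cited research:

```python
# Illustrative roll-up from scanner categories to the taxonomy buckets above
TAXONOMY_BUCKETS = {
    "timing": "Async/Timing",
    "async": "Async/Timing",
    "order": "Test Order",
    "environment": "Environment",
    "network": "Environment",
    "resource": "Resource Limits",
    "random": "Non-deterministic",
}

def summarize_by_taxonomy(matches):
    """Count pattern matches per taxonomy bucket for the summary table."""
    counts = {}
    for match in matches:
        bucket = TAXONOMY_BUCKETS.get(match["category"], "Unclassified")
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts
```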

## Detection Methods

### 1. CI History Analysis

Parse GitHub Actions / CI logs to find inconsistent results:

```python
def analyze_ci_history(repo, days=30):
    """Analyze CI runs for flaky patterns"""
    runs = get_ci_runs(repo, days)
    test_results = {}

    for run in runs:
        for test in run.tests:
            if test.name not in test_results:
                test_results[test.name] = {"pass": 0, "fail": 0}

            if test.passed:
                test_results[test.name]["pass"] += 1
            else:
                test_results[test.name]["fail"] += 1

    # Identify flaky tests (pass rate between 5% and 95%)
    flaky = []
    for test, results in test_results.items():
        total = results["pass"] + results["fail"]
        if total >= 5:  # Enough data
            pass_rate = results["pass"] / total
            if 0.05 < pass_rate < 0.95:
                flaky.append({
                    "test": test,
                    "pass_rate": pass_rate,
                    "total_runs": total
                })

    return sorted(flaky, key=lambda x: x["pass_rate"])
```
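
`get_ci_runs` above is a CI-provider-specific helper. A minimal sketch for GitHub Actions is shown below; it assumes a `GITHUB_TOKEN` environment variable and the `requests` library. Note that the Actions API only exposes run- and job-level conclusions, so per-test outcomes still need to come from report artifacts (e.g., JUnit XML or a JSON reporter) uploaded by the workflow:

```python
import datetime
import os

import requests

def get_workflow_runs(repo, days=30, token=None):
    """Fetch recent GitHub Actions workflow runs for `repo` ("owner/name")."""
    token = token or os.environ.get("GITHUB_TOKEN")
    since = (datetime.date.today() - datetime.timedelta(days=days)).isoformat()
    url = f"https://api.github.com/repos/{repo}/actions/runs"
    headers = {
        "Authorization": f"Bearer {token}",
        "Accept": "application/vnd.github+json",
    }
    runs, page = [], 1
    while True:
        resp = requests.get(
            url,
            headers=headers,
            params={"created": f">={since}", "per_page": 100, "page": page},
        )
        resp.raise_for_status()
        batch = resp.json().get("workflow_runs", [])
        if not batch:
            break
        runs.extend(batch)
        page += 1
    return runs
```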

### 2. Code Pattern Analysis

Scan test code for flaky patterns:

```python
import re

# (pattern, category, description) heuristics; a match means *potential* flakiness,
# not proof -- confirm with CI history or reruns
FLAKY_PATTERNS = [
    # Timing issues
    (r'setTimeout|sleep|delay', "timing", "Uses explicit delays"),
    (r'Date\.now\(\)|new Date\(\)', "timing", "Uses current time"),

    # Async issues
    (r'\.then\([^)]*\)', "async", "Promise chain (verify it is awaited or returned)"),
    (r'\basync\s+(?:function\b|\()', "async", "Async function (verify awaits are present)"),

    # Non-determinism
    (r'Math\.random\(\)', "random", "Uses random values"),
    (r'uuid|nanoid', "random", "Uses generated IDs"),

    # Environment
    (r'process\.env', "environment", "Environment-dependent"),
    (r'fs\.(read|write)', "environment", "File system access"),
    (r'fetch\(|axios\.|http\.', "network", "Network calls"),
]

def scan_for_flaky_patterns(test_file):
    """Scan a test file for patterns associated with flaky behavior."""
    with open(test_file, encoding="utf-8") as f:
        content = f.read()

    matches = []
    for pattern, category, description in FLAKY_PATTERNS:
        if re.search(pattern, content):
            matches.append({
                "category": category,
                "description": description,
                "pattern": pattern
            })

    return matches
```
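
A usage sketch that applies `scan_for_flaky_patterns` across a test directory; the glob pattern and output format are illustrative assumptions:

```python
from pathlib import Path

def scan_test_directory(root="test"):
    """Scan every test file under `root` and group findings by file."""
    findings = {}
    for test_file in Path(root).rglob("*.test.*"):  # e.g., *.test.ts, *.test.js
        matches = scan_for_flaky_patterns(str(test_file))
        if matches:
            findings[str(test_file)] = matches
    return findings

if __name__ == "__main__":
    for path, matches in scan_test_directory().items():
        categories = sorted({m["category"] for m in matches})
        print(f"{path}: {', '.join(categories)}")
```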

### 3. Re-run Analysis

Run tests multiple times to detect flakiness:

```bash
# Run the suite 10 times, writing one JSON report per run
for i in {1..10}; do
  npm test -- --reporter=json --outputFile="test-results-$i.json"
done

# Analyze the reports for inconsistent outcomes
python analyze_reruns.py test-results-*.json
```
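
If you need a reference implementation of `analyze_reruns.py`, a minimal sketch is shown below. It assumes a Jest/Vitest-style JSON report (a `testResults` array whose entries carry `assertionResults` with `fullName` and `status`):

```python
import json
import sys
from collections import defaultdict

def load_outcomes(report_path):
    """Yield (test_name, passed) pairs from one Jest/Vitest-style JSON report."""
    with open(report_path, encoding="utf-8") as f:
        report = json.load(f)
    for suite in report.get("testResults", []):
        for result in suite.get("assertionResults", []):
            yield result["fullName"], result["status"] == "passed"

def main(report_paths):
    outcomes = defaultdict(lambda: {"pass": 0, "fail": 0})
    for path in report_paths:
        for name, passed in load_outcomes(path):
            outcomes[name]["pass" if passed else "fail"] += 1

    for name, counts in sorted(outcomes.items()):
        if counts["pass"] and counts["fail"]:  # inconsistent across reruns => flaky
            total = counts["pass"] + counts["fail"]
            print(f"FLAKY  {name}: {counts['pass']}/{total} passed")

if __name__ == "__main__":
    main(sys.argv[1:])
```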

## Output Format

````markdown
## Flaky Test Report

**Analysis Period**: Last 30 days
**Total Tests**: 450
**Flaky Tests Found**: 12 (2.7%)

### Critical Flaky Tests (< 50% pass rate)

#### 1. `test/api/login.test.ts:45`
**Pass Rate**: 42% (21/50 runs)
**Category**: Timing
**Pattern**: Uses `Date.now()` for token expiry

```typescript
// Flaky code
it('should expire token after 1 hour', () => {
  const token = createToken();
  const expiry = Date.now() + 3600000;  // Flaky!
  expect(token.expiresAt).toBe(expiry);
});
```

**Root Cause**: Test creates token and checks expiry in same millisecond sometimes, different millisecond other times.

**Recommended Fix**: Use mocked time
```typescript
it('should expire token after 1 hour', () => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date('2024-01-01T00:00:00Z'));
  const token = createToken();
  expect(token.expiresAt).toBe(new Date('2024-01-01T01:00:00Z').getTime());
  vi.useRealTimers();
});
```

### High Flaky Tests (50-80% pass rate)

#### 2. `test/db/connection.test.ts:23`
**Pass Rate**: 68% (34/50 runs)
**Category**: Resource
**Pattern**: Connection pool exhaustion

[... more tests ...]

### Summary by Category

| Category | Count | Impact |
|----------|-------|--------|
| Timing | 5 | HIGH |
| Async | 3 | HIGH |
| Environment | 2 | MEDIUM |
| Order | 1 | MEDIUM |
| Network | 1 | LOW |

### Recommendations

1. **Quick Win**: Fix 5 timing tests with `vi.setSystemTime()` (+0.5% stability)
2. **Medium Effort**: Add proper async handling (+0.3% stability)
3. **Infrastructure**: Add test isolation for DB tests (+0.2% stability)

### Quarantine Candidates

These tests should be skipped in CI until fixed:

```javascript
// vitest.config.ts
export default {
  test: {
    exclude: [
      'test/api/login.test.ts',       // Timing flaky
      'test/db/connection.test.ts',   // Resource flaky
    ]
  }
}
```

**Note**: Track quarantined tests in `.aiwg/testing/flaky-quarantine.md`
````

## Quarantine Process

### 1. Identify

```bash
# Run flaky detection
python scripts/flaky_detect.py --repo owner/repo --days 30 --threshold 95
```

### 2. Quarantine

```javascript
// Mark test as flaky
describe.skip('flaky: login expiry', () => {
  // FLAKY: https://github.com/org/repo/issues/123
  // Root cause: timing-dependent
  // Fix in progress: PR #456
});
```

### 3. Track

Create tracking issue:
```markdown
## Flaky Test: test/api/login.test.ts:45

- **Pass Rate**: 42%
- **Category**: Timing
- **Root Cause**: Uses real system time
- **Quarantined**: 2024-12-12
- **Fix PR**: #456
- **Target Unquarantine**: 2024-12-15
```
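
The tracking entry can be generated from a flaky-report record. A minimal sketch; the field names mirror the template above and are otherwise assumptions:

```python
from datetime import date

def tracking_issue_body(record):
    """Render a quarantine tracking entry from one flaky-report record."""
    return "\n".join([
        f"## Flaky Test: {record['test']}",
        "",
        f"- **Pass Rate**: {record['pass_rate']:.0%}",
        f"- **Category**: {record['category']}",
        f"- **Root Cause**: {record['root_cause']}",
        f"- **Quarantined**: {date.today().isoformat()}",
        f"- **Fix PR**: {record.get('fix_pr', 'TBD')}",
        f"- **Target Unquarantine**: {record.get('target_unquarantine', 'TBD')}",
    ])

print(tracking_issue_body({
    "test": "test/api/login.test.ts:45",
    "pass_rate": 0.42,
    "category": "Timing",
    "root_cause": "Uses real system time",
}))
```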

### 4. Fix and Unquarantine

After fix:
```bash
# Verify the fix by re-running the test repeatedly
for i in {1..20}; do
  npm test -- test/api/login.test.ts || { echo "Run $i failed"; break; }
done

# Remove from quarantine only if every run passes
```
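
If you want an explicit pass count rather than eyeballing the loop output, here is a small sketch using `subprocess`; the `npm test` invocation mirrors the loop above and is otherwise an assumption about your test command:

```python
import subprocess

def verify_fix(test_path, runs=20):
    """Re-run a single test file repeatedly and report how many runs passed."""
    passed = 0
    for i in range(runs):
        result = subprocess.run(["npm", "test", "--", test_path], capture_output=True)
        ok = result.returncode == 0
        passed += ok
        print(f"run {i + 1}/{runs}: {'pass' if ok else 'FAIL'}")
    print(f"{passed}/{runs} runs passed")
    return passed == runs  # only unquarantine on a clean sweep

# verify_fix("test/api/login.test.ts")
```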

## Integration Points

- Works with `flaky-fix` skill for automated repairs
- Reports to CI dashboard
- Feeds into `/flow-gate-check` for release decisions
- Tracks in `.aiwg/testing/flaky-registry.md`

## Script Reference

### flaky_detect.py
Analyze CI history for flaky tests:
```bash
python scripts/flaky_detect.py --repo owner/repo --days 30
```

### flaky_scanner.py
Scan code for flaky patterns:
```bash
python scripts/flaky_scanner.py --target test/
```

## Overview

This skill identifies flaky tests by analyzing CI history, test reruns, and test-code patterns to surface intermittent failures that reduce CI reliability. It produces prioritized reports with pass rates, root-cause categories, and concrete remediation or quarantine actions. Use it to improve release stability, speed up debugging, and track flaky-test fixes over time.

## How this skill works

The skill parses CI run history (GitHub Actions, other CI logs) to compute pass/fail rates per test and flags tests with inconsistent outcomes. It scans test source code for common flaky patterns (timing, async misuse, random values, environment or network access). It also supports automated re-run campaigns to verify flakiness, then produces a ranked report with recommendations and quarantine suggestions.
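
Put together, the flow is roughly: history analysis to find suspects, a pattern scan for corroborating signals, then targeted reruns where data is thin. A sketch reusing the functions defined earlier in this document; the test-ID format and rerun threshold are assumptions:

```python
def flaky_audit(repo, test_root="test", days=30):
    """End-to-end sketch: CI history, then pattern scan, then rerun candidates."""
    suspects = analyze_ci_history(repo, days=days)      # inconsistent CI outcomes
    scans = scan_test_directory(test_root)              # flaky code patterns per file
    report = []
    for suspect in suspects:
        test_file = suspect["test"].split("::")[0]      # assumes "file::name" test IDs
        report.append({
            **suspect,
            "patterns": scans.get(test_file, []),       # corroborating pattern hits
            "needs_rerun": suspect["total_runs"] < 20,  # confirm thin data via reruns
        })
    return report
```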

## When to use it

- Intermittent test failures reported by developers or CI
- CI stability audits before releases or post-incident
- When triaging failing pipelines to separate flaky tests from real regressions
- To build a prioritized remediation plan for test reliability
- When adding test quarantine or tracking to CI workflows

## Best practices

- Require a minimum number of runs (e.g., >= 5) before classifying a test as flaky
- Combine CI-history metrics with code-pattern scans and reruns for higher confidence (see the scoring sketch after this list)
- Prioritize fixes by impact: timing and async issues first, then environment and resource problems
- Quarantine only when a fix is blocked; track quarantined tests and set unquarantine targets
- Use mocking for time and network, and add isolation for stateful resources
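
A sketch of combining the three signals into a single confidence score, per the practices above; the weights and thresholds are illustrative assumptions, not calibrated values:

```python
def flakiness_confidence(history, pattern_matches, rerun_inconsistent, min_runs=5):
    """Heuristic confidence (0..1) that a test is flaky, combining three signals."""
    total = history["pass"] + history["fail"]
    if total < min_runs:
        return 0.0                       # not enough data to classify
    pass_rate = history["pass"] / total
    score = 0.0
    if 0.05 < pass_rate < 0.95:
        score += 0.5                     # inconsistent CI history
    if pattern_matches:
        score += 0.2                     # known flaky patterns in the test code
    if rerun_inconsistent:
        score += 0.3                     # confirmed by a rerun campaign
    return score
```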

## Example use cases

- Run a 30-day CI history analysis to find the top flaky tests and their pass rates
- Scan the test suite to list files that use Date.now, Math.random, or real network calls
- Automate reruns of a suspicious test 10–20 times to confirm intermittent behavior
- Generate a quarantine manifest and CI exclude list for temporary release stabilization
- Produce a remediation plan mapping each flaky test to recommended fixes and PRs

## FAQ

### How do you decide a test is flaky?

A test is flagged as flaky when historical runs show inconsistent outcomes (pass rate between configured thresholds) and there is sufficient data to avoid false positives.

### Can it fix flaky tests automatically?

The skill provides concrete repair suggestions (e.g., mock time, await promises) and integrates with automated-fix tooling, but human review is recommended before applying fixes.

### Which CI systems are supported?

The analysis works with GitHub Actions out of the box and with any CI that produces standard test reports (e.g., JUnit XML or JSON reporter output). Custom adapters can be added for other CI providers, as sketched below.
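
An adapter only needs to return runs with per-test outcomes. A minimal interface sketch; the type names and shape are assumptions, not a published API:

```python
from dataclasses import dataclass
from typing import Iterable, Protocol

@dataclass
class TestOutcome:
    name: str
    passed: bool

@dataclass
class CIRun:
    run_id: str
    tests: list[TestOutcome]

class CIAdapter(Protocol):
    """Anything that can return recent runs with per-test outcomes."""
    def get_ci_runs(self, repo: str, days: int) -> Iterable[CIRun]: ...
```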