home / skills / dyad-sh / dyad / deflake-e2e
This skill identifies and fixes flaky E2E tests by running them repeatedly and investigating failures to stabilize CI outcomes.
npx playbooks add skill dyad-sh/dyad --skill deflake-e2eReview the files below or copy the command above to add this skill to your agents.
---
name: dyad:deflake-e2e
description: Identify and fix flaky E2E tests by running them repeatedly and investigating failures.
---
# Deflake E2E Tests
Identify and fix flaky E2E tests by running them repeatedly and investigating failures.
## Arguments
- `$ARGUMENTS`: (Optional) Specific E2E test file(s) to deflake (e.g., `main.spec.ts` or `e2e-tests/main.spec.ts`). If not provided, will prompt to deflake the entire test suite.
## Instructions
1. **Check if specific tests are provided:**
If `$ARGUMENTS` is empty or not provided, ask the user:
> "No specific tests provided. Do you want to deflake the entire E2E test suite? This can take a very long time as each test will be run 10 times."
Wait for user confirmation before proceeding. If they decline, ask them to provide specific test files.
2. **Install dependencies:**
```
npm install
```
3. **Build the app binary:**
```
npm run build
```
**IMPORTANT:** This step is required before running E2E tests. E2E tests run against the built binary. If you make any changes to application code (anything outside of `e2e-tests/`), you MUST re-run `npm run build` before running E2E tests again, otherwise you'll be testing the old version.
4. **Run tests repeatedly to detect flakiness:**
For each test file, run it 10 times:
```
PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
```
**IMPORTANT:** `PLAYWRIGHT_RETRIES=0` is required to disable automatic retries. Without it, CI environments (where `CI=true`) default to 2 retries, causing flaky tests to pass on retry and be incorrectly skipped as "not flaky."
Notes:
- If `$ARGUMENTS` is provided without the `e2e-tests/` prefix, add it
- If `$ARGUMENTS` is provided without the `.spec.ts` suffix, add it
- A test is considered **flaky** if it fails at least once out of 10 runs
5. **For each flaky test, investigate with debug logs:**
Run the failing test with Playwright browser debugging enabled:
```
DEBUG=pw:browser PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts
```
Analyze the debug output to understand:
- Timing issues (race conditions, elements not ready)
- Animation/transition interference
- Network timing variability
- State leaking between tests
- Snapshot comparison differences
6. **Fix the flaky test:**
Common fixes following Playwright best practices:
- Use `await expect(locator).toBeVisible()` before interacting with elements
- Use `await page.waitForLoadState('networkidle')` for network-dependent tests
- Use stable selectors (data-testid, role, text) instead of fragile CSS selectors
- Add explicit waits for animations: `await page.waitForTimeout(300)` (use sparingly)
- Use `await expect(locator).toHaveScreenshot()` options like `maxDiffPixelRatio` for visual tests
- Ensure proper test isolation (clean state before/after tests)
**IMPORTANT:** Do NOT change any application code. Assume the application code is correct. Only modify test files and snapshot baselines.
7. **Update snapshot baselines if needed:**
If the flakiness is due to legitimate visual differences:
```
PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --update-snapshots
```
8. **Verify the fix:**
Re-run the test 10 times to confirm it's no longer flaky:
```
PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
```
The test should pass all 10 runs consistently.
9. **Summarize results:**
Report to the user:
- Which tests were identified as flaky
- What was causing the flakiness
- What fixes were applied
- Verification results (all 10 runs passing)
- Any tests that could not be fixed and need further investigation
This skill identifies and fixes flaky end-to-end (E2E) tests by running them repeatedly, inspecting failures, and applying Playwright best practices to stabilize tests. It guides you through running tests 10 times, diagnosing intermittent failures, applying test-side fixes, updating snapshots when appropriate, and verifying the fixes.
The skill runs specified E2E test files (or prompts to run the entire suite) 10 times with automatic retries disabled to expose intermittent failures. For any test that fails at least once, it enables browser debug logs to surface timing, animation, network, or state-leak issues, then recommends and applies test-level fixes and snapshot updates. Finally, it re-runs the tests to confirm stability.
How do you decide a test is flaky?
A test is considered flaky if it fails at least once out of 10 runs with Playwright retries disabled.
When should I update snapshot baselines?
Update snapshots only when visual differences are legitimate changes; run with --update-snapshots and then re-run the 10 repeats to verify stability.