
deflake-e2e skill

This skill identifies and fixes flaky E2E tests by running them repeatedly and investigating failures to stabilize CI outcomes.

npx playbooks add skill dyad-sh/dyad --skill deflake-e2e

---
name: dyad:deflake-e2e
description: Identify and fix flaky E2E tests by running them repeatedly and investigating failures.
---

# Deflake E2E Tests

Identify and fix flaky E2E tests by running them repeatedly and investigating failures.

## Arguments

- `$ARGUMENTS`: (Optional) Specific E2E test file(s) to deflake (e.g., `main.spec.ts` or `e2e-tests/main.spec.ts`). If omitted, ask the user whether to deflake the entire test suite.

## Instructions

1. **Check if specific tests are provided:**

   If `$ARGUMENTS` is empty or not provided, ask the user:

   > "No specific tests provided. Do you want to deflake the entire E2E test suite? This can take a very long time as each test will be run 10 times."

   Wait for user confirmation before proceeding. If they decline, ask them to provide specific test files.

2. **Install dependencies:**

   ```
   npm install
   ```

3. **Build the app binary:**

   ```
   npm run build
   ```

   **IMPORTANT:** This step is required before running E2E tests. E2E tests run against the built binary. If you make any changes to application code (anything outside of `e2e-tests/`), you MUST re-run `npm run build` before running E2E tests again, otherwise you'll be testing the old version.

4. **Run tests repeatedly to detect flakiness:**

   For each test file, run it 10 times:

   ```
   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
   ```

   **IMPORTANT:** `PLAYWRIGHT_RETRIES=0` is required to disable automatic retries. Without it, CI environments (where `CI=true`) default to 2 retries, so a flaky test can pass on retry and be misclassified as "not flaky."

   Notes:
   - If `$ARGUMENTS` is provided without the `e2e-tests/` prefix, add it
   - If `$ARGUMENTS` is provided without the `.spec.ts` suffix, add it
   - A test is considered **flaky** if it fails at least once out of 10 runs
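
   The prefix and suffix rules in the notes above could be applied with a small shell helper. This is only a sketch: `normalize_spec` is a hypothetical name, not something the repository provides.

   ```shell
   # Hypothetical helper: turn a user-supplied test name into a full spec path.
   normalize_spec() {
     spec="$1"
     # Add the .spec.ts suffix when it is missing.
     case "$spec" in
       *.spec.ts) ;;
       *) spec="$spec.spec.ts" ;;
     esac
     # Add the e2e-tests/ prefix when it is missing.
     case "$spec" in
       e2e-tests/*) ;;
       *) spec="e2e-tests/$spec" ;;
     esac
     printf '%s\n' "$spec"
   }
   ```

   With this helper, `normalize_spec main`, `normalize_spec main.spec.ts`, and `normalize_spec e2e-tests/main` all print `e2e-tests/main.spec.ts`.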

5. **For each flaky test, investigate with debug logs:**

   Run the failing test with Playwright browser debugging enabled:

   ```
   DEBUG=pw:browser PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts
   ```

   Analyze the debug output to understand:
   - Timing issues (race conditions, elements not ready)
   - Animation/transition interference
   - Network timing variability
   - State leaking between tests
   - Snapshot comparison differences
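
   Debug output can be long, so it may help to save it to a file and search it afterwards. The sketch below assumes a hypothetical helper, `run_with_log`, and an arbitrary log file name; neither is part of the repository.

   ```shell
   # Hypothetical helper: run a command and write both stdout and stderr
   # to a log file, preserving the command's exit status.
   run_with_log() {
     log_file="$1"
     shift
     "$@" >"$log_file" 2>&1
   }

   # Example (assumes the e2e script used in this skill; <testfile> is a placeholder):
   # run_with_log debug.log env DEBUG=pw:browser PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never \
   #   npm run e2e -- e2e-tests/<testfile>.spec.ts
   # grep -n -i -E 'timeout|error' debug.log
   ```

   Capturing both streams in one file keeps the debug lines interleaved with the test runner's own output, which makes it easier to see what the browser was doing when a failure was reported.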

6. **Fix the flaky test:**

   Common fixes following Playwright best practices:
   - Use `await expect(locator).toBeVisible()` before interacting with elements
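   - Use `await page.waitForLoadState('networkidle')` for network-dependent tests only as a last resort; Playwright's documentation discourages `networkidle` in favor of assertions on the state the test actually needs
   - Use stable selectors (`data-testid`, role, text) instead of fragile CSS selectors
   - Add explicit waits for animations only when nothing else works: `await page.waitForTimeout(300)` (fixed delays are themselves a source of flakiness, so prefer asserting on the post-animation state)
   - For visual tests, pass tolerance options such as `maxDiffPixelRatio` to `await expect(locator).toHaveScreenshot()`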
   - Ensure proper test isolation (clean state before/after tests)

   **IMPORTANT:** Do NOT change any application code. Assume the application code is correct. Only modify test files and snapshot baselines.

7. **Update snapshot baselines if needed:**

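   If the failures come from legitimate visual differences (for example, an intentional UI change) rather than nondeterministic rendering, regenerate the baselines: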

   ```
   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --update-snapshots
   ```

8. **Verify the fix:**

   Re-run the test 10 times to confirm it's no longer flaky:

   ```
   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<testfile>.spec.ts --repeat-each=10
   ```

   The test should pass all 10 runs consistently.

9. **Summarize results:**

   Report to the user:
   - Which tests were identified as flaky
   - What was causing the flakiness
   - What fixes were applied
   - Verification results (all 10 runs passing)
   - Any tests that could not be fixed and need further investigation

Overview

This skill identifies and fixes flaky end-to-end (E2E) tests by running them repeatedly, inspecting failures, and applying Playwright best practices to stabilize tests. It guides you through running tests 10 times, diagnosing intermittent failures, applying test-side fixes, updating snapshots when appropriate, and verifying the fixes.

How this skill works

The skill runs specified E2E test files (or prompts to run the entire suite) 10 times with automatic retries disabled to expose intermittent failures. For any test that fails at least once, it enables browser debug logs to surface timing, animation, network, or state-leak issues, then recommends and applies test-level fixes and snapshot updates. Finally, it re-runs the tests to confirm stability.
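
The detection pass described above can be sketched as a small shell helper that labels a spec from the exit status of the repeated run. The name `classify_spec` is hypothetical, and the sketch assumes that the `e2e` npm script exits non-zero when any repetition fails.

```shell
# Hypothetical helper (not part of the skill): run the given command and
# label the spec by exit status. A non-zero exit means at least one
# repetition failed, which this skill treats as flaky.
classify_spec() {
  spec="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "stable: $spec"
  else
    echo "flaky: $spec"
  fi
}

# Example (assumes the e2e npm script used elsewhere in this skill):
# for spec in e2e-tests/main.spec.ts; do
#   classify_spec "$spec" env PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never \
#     npm run e2e -- "$spec" --repeat-each=10
# done
```

The exit status alone cannot distinguish a test that fails occasionally from one that fails on every run, so the failure count in the test report is still worth checking before treating a spec as flaky rather than broken.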

When to use it

- You see intermittent E2E failures in CI or locally that pass on retry.
- You want to determine whether a failing test is genuinely broken or merely flaky.
- You need a reproducible process to stabilize visual or timing-sensitive tests.
- You maintain a test suite where tests share global state and interfere with one another.
- You need to update outdated snapshot baselines after intentional UI changes.

Best practices

- Always run `npm install` and `npm run build` before E2E runs; E2E tests exercise the built binary.
- Disable Playwright automatic retries (`PLAYWRIGHT_RETRIES=0`) so flakiness is visible.
- Run each candidate test 10 times (`--repeat-each=10`); any failure marks the test as flaky.
- Prefer stable selectors (`data-testid`, role, text) and explicit `expect` checks (`toBeVisible`) before interactions.
- Use `page.waitForLoadState('networkidle')` and fixed waits only when necessary, since Playwright's documentation discourages both; isolate tests and reset state.
- Do not change application code; limit changes to test files and snapshot baselines.

Example use cases

- Investigating a UI test that occasionally times out in CI but passes locally.
- Stabilizing a visual snapshot test that shows minor pixel diffs across runs.
- Finding race conditions where an element is interacted with before it becomes ready.
- Confirming a suspected flaky test no longer fails after applying waits or selector fixes.
- Documenting which tests were flaky and summarizing fixes for team review.

FAQ

How do you decide a test is flaky?

A test is considered flaky if it fails at least once out of 10 runs with Playwright retries disabled.

When should I update snapshot baselines?

Update snapshots only when the visual differences reflect legitimate changes. Run the test with `--update-snapshots`, then repeat the 10-run check with `--repeat-each=10` to verify stability.