deflake-e2e-recent-commits skill

/.claude/skills/deflake-e2e-recent-commits

This skill gathers flaky E2E test results from recent CI runs on main and from recent PRs, attempts test-only fixes, and reports the outcome.

npx playbooks add skill dyad-sh/dyad --skill deflake-e2e-recent-commits

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
7.3 KB
---
name: dyad:deflake-e2e-recent-commits
description: Automatically gather flaky E2E tests from recent CI runs on the main branch and from recent PRs by wwwillchen/wwwillchen-bot, then deflake them.
---

# Deflake E2E Tests from Recent Commits

Automatically gather flaky E2E tests from recent CI runs on the main branch and from recent PRs by wwwillchen/wwwillchen-bot, then deflake them.

## Arguments

- `$ARGUMENTS`: (Optional) Number of recent commits to scan (default: 10)

## Task Tracking

**You MUST use the TodoWrite tool to track your progress.** At the start, create todos for each major step below. Mark each todo as `in_progress` when you start it and `completed` when you finish.

## Instructions

1. **Gather flaky tests from recent CI runs on main:**

   List recent CI workflow runs triggered by pushes to main:

   ```
   gh api "repos/{owner}/{repo}/actions/workflows/ci.yml/runs?branch=main&event=push&per_page=<COMMIT_COUNT * 3>&status=completed" --jq '.workflow_runs[] | select(.conclusion == "success" or .conclusion == "failure") | {id, head_sha, conclusion}'
   ```

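   **Note:** Request 3x the desired commit count because many runs may be `cancelled` (due to concurrency groups). Keep only runs whose conclusion is `success` or `failure`, since those actually completed and can have artifacts, then process the most recent `COMMIT_COUNT` of them. GitHub caps `per_page` at 100, so larger counts need pagination.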

   Use `$ARGUMENTS` as the commit count, defaulting to 10 if not provided.

   For each completed run, download the `html-report` artifact which contains `results.json` with the full Playwright test results:

   a. Find the html-report artifact for the run:

   ```
   gh api "repos/{owner}/{repo}/actions/runs/<run_id>/artifacts?per_page=30" --jq '.artifacts[] | select(.name | startswith("html-report")) | select(.expired == false) | .name'
   ```

   b. Download it using `gh run download`:

   ```
   gh run download <run_id> --name <artifact_name> --dir /tmp/playwright-report-<run_id>
   ```

   c. Parse `/tmp/playwright-report-<run_id>/results.json` to extract flaky tests. Write a Node.js script inside the `.claude/` directory to do this parsing. Flaky tests are those where the final result status is `"passed"` but a prior result has status `"failed"`, `"timedOut"`, or `"interrupted"`. The test title is built by joining parent suite titles (including the spec file path) and the test title, separated by `>`.

   d. Clean up the downloaded artifact directory after parsing.

   **Note:** Some runs may not have an html-report artifact (e.g., if they were cancelled early, the merge-reports job didn't complete, or artifacts have expired past the 3-day retention period). Skip these runs and continue to the next one.
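
   The parsing in step c could look roughly like the sketch below. It assumes `results.json` follows the Playwright JSON reporter layout (nested `suites`, each with `specs`, whose `tests` carry a `results` array, and a top-level suite title that is the spec file path). The function names `isFlaky`, `collectFlaky`, and `flakyTestsFromFile` are hypothetical, so check the field names against a real report before relying on them.

   ```javascript
   const fs = require("node:fs");

   // Statuses that count as a failed attempt, per the definition in step c.
   const FAILED_STATUSES = new Set(["failed", "timedOut", "interrupted"]);

   // A test is flaky when its final result passed but an earlier result did not.
   function isFlaky(results) {
     if (!Array.isArray(results) || results.length < 2) return false;
     const final = results[results.length - 1];
     return (
       final.status === "passed" &&
       results.slice(0, -1).some((r) => FAILED_STATUSES.has(r.status))
     );
   }

   // Walk the suite tree, carrying the chain of parent suite titles.
   // Assumption: the top-level suite title is the spec file path.
   function collectFlaky(suites, parents = [], out = []) {
     for (const suite of suites ?? []) {
       const chain = suite.title ? [...parents, suite.title] : parents;
       for (const spec of suite.specs ?? []) {
         // Count a spec once per run, even if it is flaky in several projects.
         if ((spec.tests ?? []).some((t) => isFlaky(t.results))) {
           out.push([...chain, spec.title].join(" > "));
         }
       }
       collectFlaky(suite.suites, chain, out);
     }
     return out;
   }

   // Hypothetical entry point: read one downloaded report and list its flaky tests.
   function flakyTestsFromFile(resultsPath) {
     const report = JSON.parse(fs.readFileSync(resultsPath, "utf8"));
     return collectFlaky(report.suites);
   }
   ```

   Calling `flakyTestsFromFile("/tmp/playwright-report-<run_id>/results.json")` for each run yields one list of titles per run, which can be concatenated for the ranking step.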

2. **Gather flaky tests from recent PRs by wwwillchen and wwwillchen-bot:**

   In addition to main branch CI runs, scan recent open PRs authored by `wwwillchen` or `wwwillchen-bot` for flaky tests reported in Playwright report comments.

   a. List recent open PRs by these authors:

   ```
   gh pr list --author wwwillchen --state open --limit 10 --json number,title
   gh pr list --author wwwillchen-bot --state open --limit 10 --json number,title
   ```

   b. For each PR, find the most recent Playwright Test Results comment (posted by a bot, containing "🎭 Playwright Test Results"):

   ```
   gh api "repos/{owner}/{repo}/issues/<pr_number>/comments?per_page=100" --jq '[.[] | select(.user.type == "Bot" and (.body | contains("Playwright Test Results")))] | last'
   ```

   c. Parse the comment body to extract flaky tests. The comment format includes a "⚠️ Flaky Tests" section with test names in backticks:
   - Look for lines matching the pattern: ``- `<test_title>` (passed after N retries)``
   - Extract the test title from within the backticks
   - The test title format is: `<spec_file.spec.ts> > <Suite Name> > <Test Name>`

   d. Add these flaky tests to the overall collection, recording the PR number each one came from so the summary can cite its source.
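
   As a rough illustration of step c, the sketch below pulls test titles out of a comment body. The helper name `parseFlakyFromComment` is hypothetical, and it assumes the flaky section ends at the next Markdown heading, which may not match the real comment layout.

   ```javascript
   // Matches lines such as: - `file.spec.ts > Suite > test` (passed after 2 retries)
   const FLAKY_LINE = /^\s*-\s+`([^`]+)`\s+\(passed after \d+ retr(?:y|ies)\)/;

   // Hypothetical helper: extract flaky test titles from one PR comment body.
   function parseFlakyFromComment(body, prNumber) {
     const lines = String(body ?? "").split(/\r?\n/);
     const start = lines.findIndex((line) => line.includes("Flaky Tests"));
     if (start === -1) return [];

     const found = [];
     for (const line of lines.slice(start + 1)) {
       // Assumption: the next Markdown heading closes the flaky section.
       if (/^#{1,6}\s/.test(line)) break;
       const match = FLAKY_LINE.exec(line);
       if (match) {
         found.push({ title: match[1].trim(), source: `PR #${prNumber}` });
       }
     }
     return found;
   }
   ```

   Keeping the `source` next to each title makes it straightforward to report which PR a flaky test came from in the final summary.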

3. **Deduplicate and rank by frequency:**

   Count how many times each test appears as flaky across all CI runs and PR comments. Sort by frequency (most flaky first). Group tests by their spec file.

   Print a summary table:

   ```
   Flaky test summary:
   - setup_flow.spec.ts > Setup Flow > setup banner shows correct state... (7 occurrences)
   - select_component.spec.ts > select component next.js (5 occurrences)
   ...
   ```
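
   One way to do the counting and grouping is sketched below. The names `rankFlaky` and `formatSummary` are hypothetical, and the sketch assumes every title starts with the spec file followed by ` > `, as in the title format described in step 2.

   ```javascript
   // Count occurrences per test title, then total them per spec file.
   function rankFlaky(titles) {
     const counts = new Map();
     for (const title of titles) {
       counts.set(title, (counts.get(title) ?? 0) + 1);
     }

     const tests = [...counts]
       .map(([title, count]) => ({ title, count, specFile: title.split(" > ")[0] }))
       // Most flaky first; ties broken alphabetically for stable output.
       .sort((a, b) => b.count - a.count || a.title.localeCompare(b.title));

     const totals = new Map();
     for (const t of tests) {
       totals.set(t.specFile, (totals.get(t.specFile) ?? 0) + t.count);
     }
     const specFiles = [...totals]
       .map(([specFile, total]) => ({ specFile, total }))
       .sort((a, b) => b.total - a.total || a.specFile.localeCompare(b.specFile));

     return { tests, specFiles };
   }

   // Render the summary in the shape shown above.
   function formatSummary(tests) {
     const rows = tests.map(
       (t) => `- ${t.title} (${t.count} occurrence${t.count === 1 ? "" : "s"})`
     );
     return ["Flaky test summary:", ...rows].join("\n");
   }
   ```

   The `specFiles` list is already in the order step 6 asks for (total flaky occurrences, most flaky first).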

4. **Skip if no flaky tests found:**

   If no flaky tests are found, report "No flaky tests found in recent commits or PRs" and stop.

5. **Install dependencies and build:**

   ```
   npm install
   npm run build
   ```

   **IMPORTANT:** This build step is required before running E2E tests. If you make any changes to application code (anything outside of `e2e-tests/`), you MUST re-run `npm run build`.

6. **Deflake each flaky test spec file (sequentially):**

   For each unique spec file that has flaky tests (ordered by total flaky occurrences, most flaky first):

   a. Run the spec file 10 times to confirm flakiness (note: `<spec_file>` already includes the `.spec.ts` extension from parsing):

   ```
   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file> --repeat-each=10
   ```

   **IMPORTANT:** `PLAYWRIGHT_RETRIES=0` is required to disable automatic retries. Without it, CI environments (where `CI=true`) default to 2 retries, causing flaky tests to pass on retry and be incorrectly skipped.

   b. If every test in the spec file passes all 10 runs, skip it (it may have been fixed already).

   c. If the test fails at least once, investigate with debug logs:

   ```
   DEBUG=pw:browser PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file>
   ```

   d. Fix the flaky test following Playwright best practices:
   - Use `await expect(locator).toBeVisible()` before interacting with elements
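   - Use `await page.waitForLoadState('networkidle')` for network-dependent tests only when no assertion on visible state is available (Playwright's documentation discourages `networkidle` in favor of web-first assertions)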
   - Use stable selectors (data-testid, role, text) instead of fragile CSS selectors
   - Add explicit waits for animations: `await page.waitForTimeout(300)` (use sparingly)
   - For visual tests, pass tolerance options such as `maxDiffPixelRatio` to `await expect(locator).toHaveScreenshot()`
   - Ensure proper test isolation (clean state before/after tests)

   **IMPORTANT:** Do NOT change any application code. Only modify test files and snapshot baselines.

   e. Update snapshot baselines if needed:

   ```
   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file> --update-snapshots
   ```

   f. Verify the fix by running 10 times again:

   ```
   PLAYWRIGHT_RETRIES=0 PLAYWRIGHT_HTML_OPEN=never npm run e2e -- e2e-tests/<spec_file> --repeat-each=10
   ```

   g. If the test still fails after your fix attempt, revert any changes to that spec file and move on to the next one. Do not spend more than 2 attempts fixing a single spec file.
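
   The branching in steps b, c, f, and g can be summarized as a small decision function. This is only a sketch: `nextAction`, its phase names, and its return values are hypothetical, and it reads "no more than 2 attempts" as allowing a second fix attempt before reverting.

   ```javascript
   const MAX_FIX_ATTEMPTS = 2;

   // phase: "confirm" (step a) or "verify" (step f)
   // failedRuns: failing runs observed in the --repeat-each=10 invocation
   // attempts: fix attempts already made on this spec file
   function nextAction({ phase, failedRuns, attempts }) {
     if (phase === "confirm") {
       // Step b: nothing failed, so the spec may already be fixed.
       return failedRuns === 0 ? "skip" : "investigate-and-fix";
     }
     // Verification after a fix attempt.
     if (failedRuns === 0) return "keep-fix";
     // Step g: stop after the attempt budget is used up.
     return attempts < MAX_FIX_ATTEMPTS ? "retry-fix" : "revert-and-move-on";
   }
   ```

   For example, a spec that still fails once after its second fix attempt maps to `"revert-and-move-on"`.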

7. **Summarize results:**

   Report:
   - Total flaky tests found across main branch commits and PRs
   - Sources of flaky tests (main branch CI runs vs. PR comments from wwwillchen/wwwillchen-bot)
   - Which tests were successfully deflaked
   - What fixes were applied to each
   - Which tests could not be fixed (and why)
   - Verification results

8. **Create PR with fixes:**

   If any fixes were made, run `/dyad:pr-push` to commit, lint, test, and push the changes as a PR.

   Use a branch name like `deflake-e2e-<date>` (e.g., `deflake-e2e-2025-01-15`).

   The PR title should be: `fix: deflake E2E tests (<list of spec files>)`
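
   A minimal sketch of building the branch name and PR title from the list of fixed spec files follows. The helper `prMetadata` is hypothetical, the date is taken in UTC, and joining the spec files with commas is an assumption about what `<list of spec files>` should look like.

   ```javascript
   // Hypothetical helper: derive the branch name and PR title for the fixes.
   function prMetadata(specFiles, date = new Date()) {
     // toISOString() is always UTC, so the first 10 characters are YYYY-MM-DD.
     const day = date.toISOString().slice(0, 10);
     return {
       branch: `deflake-e2e-${day}`,
       title: `fix: deflake E2E tests (${specFiles.join(", ")})`,
     };
   }
   ```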

Overview

This skill automatically finds flaky end-to-end tests from recent CI runs on the main branch and from recent PRs authored by wwwillchen/wwwillchen-bot, then attempts to deflake them. It collects Playwright results artifacts and bot comments, ranks flaky tests by frequency, and runs a controlled deflaking workflow. The goal is targeted, repeatable fixes to tests without changing application code.

How this skill works

It scans recent completed CI workflow runs on main (configurable commit count) and downloads Playwright html-report artifacts to extract flaky tests from results.json. It also scans Playwright bot comments on recent open PRs to extract reported flaky tests. Tests are deduplicated and ranked by occurrence, then each spec file is run repeatedly with retries disabled to confirm flakiness before test-only fixes are applied and verified.

When to use it

  • After a series of CI failures where tests intermittently fail on main
  • When Playwright bot comments in PRs report flaky tests
  • As part of a regular maintenance pass to reduce CI flakiness
  • Before a release to improve E2E stability and reduce noise
  • When you need a prioritized list of highest-impact flaky tests

Best practices

  • Always use the TodoWrite tool to track progress and mark steps in_progress/completed
  • Use PLAYWRIGHT_RETRIES=0 when reproducing flakiness to avoid masking issues
  • Do not change application code; only modify tests and snapshot baselines
  • Run npm install and npm run build before running E2E tests and after any app-code changes
  • Limit fix attempts to two per spec file; revert and move on if unresolved

Example use cases

  • Scan the last 10 commits on main, gather flaky tests, and create PRs with test fixes
  • Investigate flaky tests reported in recent PR comments by wwwillchen-bot and deflake them
  • Run targeted deflake workflow for the top 5 most frequently flaky spec files
  • Automate a weekly deflake sweep to keep CI noise low
  • Confirm suspected flaky tests by running a spec 10 times with retries disabled

FAQ

How many commits does it scan by default?

It scans 10 recent commits by default, but the commit count is configurable via an argument.

What counts as a flaky test?

A test whose final result in the Playwright results is passed but which has at least one earlier result with status failed, timedOut, or interrupted.

Can this change application code to fix flakiness?

No. The workflow only edits test files and snapshot baselines; changing application code is explicitly disallowed.