
check-workflows skill

/.claude/skills/check-workflows

This skill analyzes the past day's GitHub Actions workflow runs to detect actionable failures and open issues for remediation.

npx playbooks add skill dyad-sh/dyad --skill check-workflows

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (5.8 KB)
---
name: dyad:check-workflows
description: Check GitHub Actions workflow runs from the past day, identify severe or consistent failures, and file an issue if actionable problems are found.
---

# Check Workflows

Check GitHub Actions workflow runs from the past day for severe or consistent failures and file a GitHub issue if actionable problems are found.

## Arguments

- `$ARGUMENTS`: (Optional) Number of hours to look back (default: 24)

## Instructions

### 1. Gather recent workflow runs

Fetch all workflow runs from the past N hours (default 24):

```
gh run list --limit 100 --json workflowName,status,conclusion,event,headBranch,createdAt,databaseId,url,name
```

Keep only the runs created within the lookback window, then group them by workflow name.
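
A minimal sketch of that filtering with `jq` (assumes `jq` is available; `HOURS` stands in for the optional `$ARGUMENTS` value):

```
HOURS="${ARGUMENTS:-24}"
gh run list --limit 100 \
  --json workflowName,status,conclusion,event,headBranch,createdAt,databaseId,url,name \
  | jq --argjson hours "$HOURS" '
      (now - $hours * 3600) as $cutoff
      # keep only runs created inside the lookback window, then group by workflow
      | [ .[] | select((.createdAt | fromdateiso8601) >= $cutoff) ]
      | group_by(.workflowName)'
```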

### 2. Classify each failure

For each failed run, determine whether it is **expected** or **actionable** by checking these rules (a filtering sketch follows the lists below):

#### Expected failures (IGNORE these):

1. **Nightly Runner Cleanup**: This workflow intentionally reboots self-hosted macOS runners, which kills the runner process mid-job. It will almost always show as "failed" even when working correctly. **Always skip this workflow entirely.**

2. **Cascading failures from CI**: When the main CI workflow fails, these downstream workflows will also fail because they depend on CI artifacts (e.g. `html-report`, blob reports). This is noise, not an independent problem:
   - Playwright Report Comment (fails with "artifact not found")
   - Upload to Flakiness.io (fails when no flakiness reports exist)
   - Merge PR when ready (skipped/fails when CI hasn't passed)

3. **CLA Assistant**: Failures just mean a contributor hasn't signed the CLA yet. This resolves on its own.

4. **Cancelled runs**: Runs cancelled due to concurrency groups (newer push cancels older run) are normal.

5. **`action_required` / `neutral` conclusions**: Standard GitHub behavior for fork PRs or first-time contributors needing manual approval.

6. **CI failures on non-main branches**: Individual PR CI failures are expected — contributors may have formatting issues, lockfile mismatches, test failures, etc. These are the contributor's responsibility.

7. **Claude Deflake E2E**: This workflow is expected to sometimes have long runs or partial failures as it investigates flaky tests.

#### Actionable failures (FLAG these):

1. **Permission errors**: Workflow can't access secrets, missing `GITHUB_TOKEN`, 403/401 errors on API calls that should be authenticated, `Resource not accessible by integration` errors.

2. **Consistent CI failures on main branch**: If the CI workflow fails on 2+ consecutive pushes to main, something is likely broken. Check if different commits are failing for the same reason.

3. **Infrastructure failures**: Self-hosted runners not coming back online (check if Nightly Runner Cleanup's verify steps are failing), runners consistently unavailable, disk space issues.

4. **Repeated rate limiting**: If GitHub API rate limiting is causing the same workflow to fail across multiple runs (not just a one-off).

5. **Action version issues**: Deprecated or broken GitHub Action versions causing failures.

6. **Workflow configuration errors**: YAML syntax errors, invalid inputs, missing required secrets (distinct from permission issues).

7. **Scheduled workflow failures**: If a scheduled/cron workflow (other than Nightly Runner Cleanup) fails consistently, it likely indicates a systemic issue.
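
As a rough illustration of these rules, the filter below drops the expected categories before deeper triage. It is a sketch only: the workflow names are taken from the rules above and may not match the repository's actual workflow names, and `grouped-runs.json` is a hypothetical file holding the grouped output from step 1.

```
jq '
  [ .[][]                                                    # flatten the per-workflow groups
    | select(.conclusion == "failure")                       # drops cancelled / action_required / neutral
    | select(.workflowName != "Nightly Runner Cleanup")      # rule 1: always skip
    | select(.workflowName != "CLA Assistant")               # rule 3: resolves on its own
    | select(.headBranch == "main" or .event == "schedule")  # rule 6: PR-branch CI is the contributor's responsibility
  ]' grouped-runs.json
```

Whatever survives this pass still needs the per-rule judgment calls above (cascading failures, rate limiting, action versions) before it is treated as actionable.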

### 3. Investigate actionable failures

For each potentially actionable failure, get more details:

```
gh run view <run_id> --log-failed 2>/dev/null | head -100
```

Look for:

- The specific error message
- Whether the failure is in a setup step (infrastructure) vs. a test/build step (code)
- Whether the same failure appears across multiple runs
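
A sketch of that cross-run check, where `SUSPECT_RUN_IDS` is a hypothetical array of the `databaseId` values flagged in step 2:

```
for id in "${SUSPECT_RUN_IDS[@]}"; do
  echo "=== run $id ==="
  # surface likely error signatures from the first 100 failing-log lines
  gh run view "$id" --log-failed 2>/dev/null | head -100 \
    | grep -iE "error|denied|forbidden|rate limit" || true
done
```

If the same signature shows up across runs, treat it as one recurring problem rather than several one-offs.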

### 4. Determine severity

After investigation, categorize actionable failures:

- **SEVERE**: Permission errors, infrastructure down, main branch consistently broken, workflow configuration errors
- **MODERATE**: Repeated rate limiting, deprecated action warnings, intermittent infrastructure issues
- **LOW**: One-off transient failures that resolved on retry

Only proceed to file an issue if there are SEVERE or MODERATE findings.
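
One possible way to encode this mapping; the category names are illustrative labels assigned during investigation, not values returned by `gh`:

```
severity_for() {
  case "$1" in
    permission-error|infra-down|main-ci-broken|workflow-config-error) echo "SEVERE" ;;
    rate-limited|deprecated-action|intermittent-infra)                 echo "MODERATE" ;;
    *)                                                                 echo "LOW" ;;
  esac
}
```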

### 5. Check for existing issues

Before creating a new issue, check if there's already an open issue about workflow problems:

```
gh issue list --label "workflow-health" --state open --json number,title,body
```

If an existing issue covers the same problems, do not create a duplicate. Instead, add a comment to the existing issue with the latest findings.
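
A minimal sketch of that deduplication step, with `FINDINGS_SUMMARY` standing in for the report text assembled during investigation. It treats any open workflow-health issue as a match; confirming that the issue actually covers the same problems is still a judgment call based on its title and body.

```
EXISTING=$(gh issue list --label "workflow-health" --state open --json number --jq '.[0].number // empty')
if [ -n "$EXISTING" ]; then
  # an open workflow-health issue already exists: append the latest findings instead of filing a new one
  gh issue comment "$EXISTING" --body "$FINDINGS_SUMMARY"
else
  echo "No open workflow-health issue; proceed to step 6."
fi
```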

### 6. File a GitHub issue

If there are actionable findings (SEVERE or MODERATE), create a GitHub issue:

```
gh issue create --title "Workflow issues: <X>, <Y>, and <Z>" --label "workflow-health" --body "$(cat <<'EOF'
## Workflow Health Report

**Period:** <start_time> to <end_time>
**Total runs checked:** <N>
**Failures found:** <N actionable> actionable, <N expected> expected (ignored)

## Issues Found

### <Issue 1 Title>
- **Workflow:** <workflow name>
- **Severity:** SEVERE / MODERATE
- **Failed runs:**
  - [Run #<id>](<url>) — <date>
  - [Run #<id>](<url>) — <date>
- **Error:** <brief error description>
- **Suggested fix:** <how to resolve>

### <Issue 2 Title>
...

## Expected Failures (Ignored)
<Brief summary of expected failures that were skipped and why>

---
*This issue was automatically created by the daily workflow health check.*
EOF
)"
```

The issue title should list the specific problems found (e.g., "Workflow issues: CI permissions error, flakiness upload rate-limited"). Keep it concise but descriptive.

### 7. Report results

Summarize:

- How many workflow runs were checked
- How many were expected failures (and which categories)
- How many were actionable (and what was found)
- Whether an issue was filed (with link) or if everything looks healthy
- If no actionable issues were found, report "All workflows healthy" and do not create an issue

Overview

This skill checks GitHub Actions workflow runs from the past day (or a configurable lookback) to detect severe or recurring failures and files a GitHub issue when actionable problems are found. It focuses on separating expected noise from real incidents, triaging failures, and creating a concise, outcome-oriented issue to drive remediation.

How this skill works

The skill lists recent workflow runs and groups them by workflow name. It classifies each failure as expected or actionable using a set of explicit rules, then investigates actionable failures by fetching failed logs and identifying error patterns. If findings are SEVERE or MODERATE and not already reported, it files a structured GitHub issue with examples, severity, and suggested fixes.

When to use it

  • Daily automated health check of repository CI and scheduled workflows.
  • After noticing an uptick in failing workflows to determine systemic vs. expected noise.
  • When main-branch CI appears unstable across multiple pushes.
  • To detect permission, runner, or rate-limiting issues early.
  • Before opening an incident, to confirm a duplicate issue isn't already open.

Best practices

  • Ignore known non-actionable workflows (nightly runner cleanup, CLA bot, contributor PR CI on non-main branches).
  • Require at least two consecutive failures on main for CI-consistency alerts to avoid false positives.
  • Fetch at most 100 recent runs and filter them to the configured lookback window to keep the check fast.
  • Search existing issues with a workflow-health label before filing a new one to prevent duplicates.
  • Include log snippets, run links, severity, and a suggested fix in every filed issue.

Example use cases

  • Detecting a missing GITHUB_TOKEN causing 401/403 errors across scheduled workflows and filing a SEVERE issue.
  • Noticing the main CI fails on two successive pushes with the same test error and opening a consistency incident.
  • Identifying repeated GitHub API rate-limit errors across runs and creating a MODERATE issue recommending retries or backoff.
  • Skipping noise like Playwright report upload failures that cascade from CI and only reporting the root CI failure.
  • Spotting self-hosted runner disk-space or availability problems and reporting an infrastructure outage.

FAQ

How far back does the skill look for runs?

By default it checks the past 24 hours; you can pass a different lookback in hours as an optional argument.

When will it create a GitHub issue?

It files an issue only for SEVERE or MODERATE findings that are actionable and not already covered by an open workflow-health issue.

How does it avoid noisy or expected failures?

It applies explicit ignore rules for known noisy workflows and conditions (e.g., nightly runner cleanup, CLA assistant, cancelled runs, non-main PR CI).