
check-workflows skill

/.claude/skills/check-workflows

This skill analyzes the past day's GitHub Actions workflow runs to detect actionable failures and open issues for remediation.

npx playbooks add skill dyad-sh/dyad --skill check-workflows

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (5.8 KB)
---
name: dyad:check-workflows
description: Check GitHub Actions workflow runs from the past day, identify severe or consistent failures, and file an issue if actionable problems are found.
---

# Check Workflows

Check GitHub Actions workflow runs from the past day for severe or consistent failures and file a GitHub issue if actionable problems are found.

## Arguments

- `$ARGUMENTS`: (Optional) Number of hours to look back (default: 24)

## Instructions

### 1. Gather recent workflow runs

Fetch all workflow runs from the past N hours (default 24):

```
gh run list --limit 100 --json workflowName,status,conclusion,event,headBranch,createdAt,databaseId,url,name
```

Keep only the runs created within the lookback window, then group them by workflow name.
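
A minimal sketch of that filtering with `jq` (assumes `jq` is available; `HOURS` stands in for the optional `$ARGUMENTS` value):

```
HOURS="${ARGUMENTS:-24}"
gh run list --limit 100 \
  --json workflowName,status,conclusion,event,headBranch,createdAt,databaseId,url,name \
  | jq --argjson hours "$HOURS" '
      (now - $hours * 3600) as $cutoff
      # keep only runs created inside the lookback window, then group by workflow
      | [ .[] | select((.createdAt | fromdateiso8601) >= $cutoff) ]
      | group_by(.workflowName)'
```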

### 2. Classify each failure

For each failed run, determine whether it is **expected** or **actionable** by checking these rules (a filtering sketch follows the lists below):

#### Expected failures (IGNORE these):

1. **Nightly Runner Cleanup**: This workflow intentionally reboots self-hosted macOS runners, which kills the runner process mid-job. It will almost always show as "failed" even when working correctly. **Always skip this workflow entirely.**

2. **Cascading failures from CI**: When the main CI workflow fails, these downstream workflows will also fail because they depend on CI artifacts (e.g. `html-report`, blob reports). This is noise, not an independent problem:
   - Playwright Report Comment (fails with "artifact not found")
   - Upload to Flakiness.io (fails when no flakiness reports exist)
   - Merge PR when ready (skipped/fails when CI hasn't passed)

3. **CLA Assistant**: Failures just mean a contributor hasn't signed the CLA yet. This resolves on its own.

4. **Cancelled runs**: Runs cancelled due to concurrency groups (newer push cancels older run) are normal.

5. **`action_required` / `neutral` conclusions**: Standard GitHub behavior for fork PRs or first-time contributors needing manual approval.

6. **CI failures on non-main branches**: Individual PR CI failures are expected — contributors may have formatting issues, lockfile mismatches, test failures, etc. These are the contributor's responsibility.

7. **Claude Deflake E2E**: This workflow is expected to sometimes have long runs or partial failures as it investigates flaky tests.

#### Actionable failures (FLAG these):

1. **Permission errors**: Workflow can't access secrets, missing `GITHUB_TOKEN`, 403/401 errors on API calls that should be authenticated, `Resource not accessible by integration` errors.

2. **Consistent CI failures on main branch**: If the CI workflow fails on 2+ consecutive pushes to main, something is likely broken. Check if different commits are failing for the same reason.

3. **Infrastructure failures**: Self-hosted runners not coming back online (check if Nightly Runner Cleanup's verify steps are failing), runners consistently unavailable, disk space issues.

4. **Repeated rate limiting**: If GitHub API rate limiting is causing the same workflow to fail across multiple runs (not just a one-off).

5. **Action version issues**: Deprecated or broken GitHub Action versions causing failures.

6. **Workflow configuration errors**: YAML syntax errors, invalid inputs, missing required secrets (distinct from permission issues).

7. **Scheduled workflow failures**: If a scheduled/cron workflow (other than Nightly Runner Cleanup) fails consistently, it likely indicates a systemic issue.
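
As a rough illustration of these rules, the filter below drops the expected categories before deeper triage. It is a sketch only: the workflow names are taken from the rules above and may not match the repository's actual workflow names, and `grouped-runs.json` is a hypothetical file holding the grouped output from step 1.

```
jq '
  [ .[][]                                                    # flatten the per-workflow groups
    | select(.conclusion == "failure")                       # drops cancelled / action_required / neutral
    | select(.workflowName != "Nightly Runner Cleanup")      # rule 1: always skip
    | select(.workflowName != "CLA Assistant")               # rule 3: resolves on its own
    | select(.headBranch == "main" or .event == "schedule")  # rule 6: PR-branch CI is the contributor's responsibility
  ]' grouped-runs.json
```

Whatever survives this pass still needs the per-rule judgment calls above (cascading failures, rate limiting, action versions) before it is treated as actionable.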

### 3. Investigate actionable failures

For each potentially actionable failure, get more details:

```
gh run view <run_id> --log-failed 2>/dev/null | head -100
```

Look for:

- The specific error message
- Whether the failure is in a setup step (infrastructure) vs. a test/build step (code)
- Whether the same failure appears across multiple runs
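
A sketch of that cross-run check, where `SUSPECT_RUN_IDS` is a hypothetical array of the `databaseId` values flagged in step 2:

```
for id in "${SUSPECT_RUN_IDS[@]}"; do
  echo "=== run $id ==="
  # surface likely error signatures from the first 100 failing-log lines
  gh run view "$id" --log-failed 2>/dev/null | head -100 \
    | grep -iE "error|denied|forbidden|rate limit" || true
done
```

If the same signature shows up across runs, treat it as one recurring problem rather than several one-offs.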

### 4. Determine severity

After investigation, categorize actionable failures:

- **SEVERE**: Permission errors, infrastructure down, main branch consistently broken, workflow configuration errors
- **MODERATE**: Repeated rate limiting, deprecated action warnings, intermittent infrastructure issues
- **LOW**: One-off transient failures that resolved on retry

Only proceed to file an issue if there are SEVERE or MODERATE findings.
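
One possible way to encode this mapping; the category names are illustrative labels assigned during investigation, not values returned by `gh`:

```
severity_for() {
  case "$1" in
    permission-error|infra-down|main-ci-broken|workflow-config-error) echo "SEVERE" ;;
    rate-limited|deprecated-action|intermittent-infra)                 echo "MODERATE" ;;
    *)                                                                 echo "LOW" ;;
  esac
}
```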

### 5. Check for existing issues

Before creating a new issue, check if there's already an open issue about workflow problems:

```
gh issue list --label "workflow-health" --state open --json number,title,body
```

If an existing issue covers the same problems, do not create a duplicate. Instead, add a comment to the existing issue with the latest findings.
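
A minimal sketch of that deduplication step, with `FINDINGS_SUMMARY` standing in for the report text assembled during investigation. It treats any open workflow-health issue as a match; confirming that the issue actually covers the same problems is still a judgment call based on its title and body.

```
EXISTING=$(gh issue list --label "workflow-health" --state open --json number --jq '.[0].number // empty')
if [ -n "$EXISTING" ]; then
  # an open workflow-health issue already exists: append the latest findings instead of filing a new one
  gh issue comment "$EXISTING" --body "$FINDINGS_SUMMARY"
else
  echo "No open workflow-health issue; proceed to step 6."
fi
```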

### 6. File a GitHub issue

If there are actionable findings (SEVERE or MODERATE), create a GitHub issue:

```
gh issue create --title "Workflow issues: <X>, <Y>, and <Z>" --label "workflow-health" --body "$(cat <<'EOF'
## Workflow Health Report

**Period:** <start_time> to <end_time>
**Total runs checked:** <N>
**Failures found:** <N actionable> actionable, <N expected> expected (ignored)

## Issues Found

### <Issue 1 Title>
- **Workflow:** <workflow name>
- **Severity:** SEVERE / MODERATE
- **Failed runs:**
  - [Run #<id>](<url>) — <date>
  - [Run #<id>](<url>) — <date>
- **Error:** <brief error description>
- **Suggested fix:** <how to resolve>

### <Issue 2 Title>
...

## Expected Failures (Ignored)
<Brief summary of expected failures that were skipped and why>

---
*This issue was automatically created by the daily workflow health check.*
EOF
)"
```

The issue title should list the specific problems found (e.g., "Workflow issues: CI permissions error, flakiness upload rate-limited"). Keep it concise but descriptive.

### 7. Report results

Summarize:

- How many workflow runs were checked
- How many were expected failures (and which categories)
- How many were actionable (and what was found)
- Whether an issue was filed (with link) or if everything looks healthy
- If no actionable issues were found, report "All workflows healthy" and do not create an issue

Overview

This skill checks GitHub Actions workflow runs from the past day (or a configurable lookback) to detect severe or recurring failures and files a GitHub issue when actionable problems are found. It focuses on separating expected noise from real incidents, triaging failures, and creating a concise, outcome-oriented issue to drive remediation.

How this skill works

The skill lists recent workflow runs and groups them by workflow name. It classifies each failure as expected or actionable using a set of explicit rules, then investigates actionable failures by fetching failed logs and identifying error patterns. If findings are SEVERE or MODERATE and not already reported, it files a structured GitHub issue with examples, severity, and suggested fixes.

When to use it

  • Daily automated health check of repository CI and scheduled workflows.
  • After noticing an uptick in failing workflows to determine systemic vs. expected noise.
  • When main-branch CI appears unstable across multiple pushes.
  • To detect permission, runner, or rate-limiting issues early.
  • Before opening an incident, to confirm a duplicate issue isn't already open.

Best practices

  • Ignore known non-actionable workflows (nightly runner cleanup, CLA bot, contributor PR CI on non-main branches).
  • Require at least two consecutive failures on main for CI-consistency alerts to avoid false positives.
  • Fetch at most 100 recent runs and filter them to the configured lookback window to keep the check fast.
  • Search existing issues with a workflow-health label before filing a new one to prevent duplicates.
  • Include log snippets, run links, severity, and a suggested fix in every filed issue.

Example use cases

  • Detecting a missing GITHUB_TOKEN causing 401/403 errors across scheduled workflows and filing a SEVERE issue.
  • Noticing the main CI fails on two successive pushes with the same test error and opening a consistency incident.
  • Identifying repeated GitHub API rate-limit errors across runs and creating a MODERATE issue recommending retries or backoff.
  • Skipping noise like Playwright report upload failures that cascade from CI and only reporting the root CI failure.
  • Spotting self-hosted runner disk-space or availability problems and reporting an infrastructure outage.

FAQ

How far back does the skill look for runs?

By default it checks the past 24 hours; you can pass a different lookback in hours as an optional argument.

When will it create a GitHub issue?

It files an issue only for SEVERE or MODERATE findings that are actionable and not already covered by an open workflow-health issue.

How does it avoid noisy or expected failures?

It applies explicit ignore rules for known noisy workflows and conditions (e.g., nightly runner cleanup, CLA assistant, cancelled runs, non-main PR CI).