home / skills / 2389-research / claude-plugins / ci

ci skill

/prbuddy/skills/ci

This skill monitors CI status, diagnoses failures, and applies fixes to prevent recurrence, ensuring PRs stay on track.

npx playbooks add skill 2389-research/claude-plugins --skill ci

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
4.6 KB
---
name: prbuddy:ci
description: CI/workflow monitoring and fixing. Triggers on "CI failing", "workflow failed", "checks red", "build broken", "tests failing", "rerun workflows".
---

<!-- ABOUTME: CI monitoring sub-skill for prbuddy -->
<!-- ABOUTME: Fetches logs, diagnoses failures, fixes with prevention -->

# prbuddy:ci

## Overview

Monitors CI/workflow status, diagnoses failures, fixes issues, and implements systematic prevention.

## Checkpoint Discipline

Check CI at these moments (not continuously):
- After push settles (not during burst commits)
- Before requesting review
- Before declaring work done
- Before merge

## Workflow

### Step 1: Get CI Status

```bash
gh pr checks --json name,state,bucket,link,workflow
```

**Status interpretation:**
- `bucket: "pass"` → Check passed
- `bucket: "fail"` → Check failed
- `bucket: "pending"` → Still running
- `bucket: "skipping"` → Skipped
- `bucket: "cancel"` → Cancelled

### Step 2: Handle Status

**If all passing:**
```
✓ All CI checks passing (N/N)

Ready for next step.
```

**If pending:**
```
CI checks in progress (N/M complete)

✓ lint (passed)
✓ typecheck (passed)
⏳ test (running)
⏳ build (queued)

Check again:
- Before requesting review
- Before declaring done
```

**If failed:** Continue to diagnosis

### Step 3: Identify Failed Runs

```bash
# Get branch name (with fallback for detached HEAD)
BRANCH=$(git branch --show-current)
if [ -z "$BRANCH" ]; then
  BRANCH=$(gh pr view --json headRefName -q .headRefName)
fi

gh run list --branch "$BRANCH" --json databaseId,name,status,conclusion --limit 10
```

Filter for `conclusion: "failure"`.

### Step 4: Fetch Failed Logs

For each failed run:

```bash
gh run view <run-id> --log-failed
```

This fetches only the logs for failed steps.

### Step 5: Diagnose Root Cause

Analyze the logs to identify:
1. **What failed** - Which test/build step
2. **Error message** - The actual error
3. **Root cause** - Why it happened
4. **Affected files** - Which code needs fixing

#### Flaky Test vs Real Failure

Before fixing, determine if this is a flaky test or real failure:

**Signs of flaky test (consider rerun):**
- Same test passed on previous commits
- Error involves timing, network, or race conditions
- "Timeout" or "connection refused" errors
- Test passes locally but fails in CI

**Signs of real failure (fix required):**
- New test that never passed
- Error directly relates to changed code
- Assertion failure on business logic
- Type/syntax errors

**If flaky:** Try `gh run rerun <id> --failed` first. If it fails again, investigate further.

### Step 6: Consult PAL (if non-trivial)

For complex failures, use PAL for expert opinion:

```
mcp__pal__chat: "CI failure diagnosis:

Error: [error message]
Context: [relevant code/config]

What's the root cause and recommended fix?"
```

### Step 7: Fix the Acute Issue

Make the code changes to fix the immediate failure.

### Step 8: Identify Systematic Prevention

Ask: "What would have caught this earlier?"

**Prevention hierarchy (prefer higher):**
1. **Linter rule** - Add ESLint/Prettier/etc. rule
2. **Pre-commit hook** - Add check to `.pre-commit-config.yaml`
3. **CI check** - Add earlier/faster check
4. **Type system** - Stricter TypeScript config
5. **Test** - Add test for this case
6. **Documentation** - Update CLAUDE.md if agent guidance

### Step 9: Commit Both Fixes

```bash
git add [changed files]
git commit -m "fix: [description]

- Fixed: [acute issue]
- Prevention: [systematic change]"
```

### Step 10: Push and Monitor

```bash
git push
```

Report:
```
Pushed fix. CI will re-run automatically.

Check status: /prbuddy:ci

Or watch in browser: gh pr checks --web
```

## Example: Test Failure

```
Checking CI status for PR #42...

❌ 1 of 4 checks failed

Failed: test (run 987654321)

Fetching logs...

Error in tests/auth.test.ts:
  TypeError: Cannot read property 'email' of undefined
  at validateUser (src/auth.ts:45)

Root cause: Missing null check when user object is undefined.

Fixing:
1. Added null check in src/auth.ts:45
2. Added test case for undefined user
3. Prevention: Enabled strictNullChecks in tsconfig.json

Committing...
Pushing...

✓ Fix pushed. Monitor with: /prbuddy:ci
```

## Commands Reference

| Task | Command |
|------|---------|
| Check status | `gh pr checks --json name,state,bucket` |
| List runs | `gh run list --branch <branch> --json databaseId,name,conclusion` |
| View run | `gh run view <id>` |
| Failed logs | `gh run view <id> --log-failed` |
| Full logs | `gh run view <id> --log` |
| Watch run | `gh run watch <id>` |
| Rerun failed | `gh run rerun <id> --failed` |
| Rerun all | `gh run rerun <id>` |

Overview

This skill monitors CI and workflow status, diagnoses failures by fetching only the failing logs, and applies targeted fixes plus systematic prevention. It focuses on pragmatic checkpoints around pushes, reviews, and merges to avoid noisy continuous checks. The skill automates reruns for flaky failures and prescribes changes that prevent recurrence.

How this skill works

It checks PR CI status using gh pr checks and lists recent runs for the current branch to identify failures. For each failed run it fetches failed-step logs (gh run view --log-failed), extracts the failing step, error message, root cause, and affected files, then recommends and applies an immediate fix. After the acute fix, it proposes and commits prevention measures (linter rule, pre-commit hook, CI check, stricter types, test, or docs) and pushes the change so CI re-runs.

When to use it

  • After push settles (avoid burst commits)
  • Before requesting review or declaring work done
  • Before merging a PR into a mainline branch
  • When CI shows failing checks or red status
  • When tests intermittently fail and need diagnosis

Best practices

  • Check CI at clear checkpoints rather than continuously to avoid noise
  • Differentiate flaky tests from real failures; rerun flaky ones first
  • Fetch only failed logs to speed diagnosis and reduce noise
  • Always pair the acute fix with a prevention change and document both in the commit message
  • Use PAL or an expert chat for non-trivial failures before broad refactors

Example use cases

  • A unit test fails in CI due to missing null check — add the check, add a unit test, enable strictNullChecks, commit and push
  • CI shows network timeout tests intermittently — rerun the failed run, if flaky add retry logic or stabilize test environment
  • Build fails with a TypeScript error after refactor — fix types, update tsconfig or add a type assertion, and add a linter rule if needed
  • A lint step is newly failing — add or adjust the ESLint rule and a pre-commit hook to catch it locally
  • Release blocked by a flaky integration test — rerun failed runs and add a targeted integration test and CI gating to prevent regressions

FAQ

How do I tell a flaky test from a real failure?

Look for timing/network errors, prior passes on earlier commits, or tests passing locally for flaky signs. Failures tied to changed code, assertions, or compile errors indicate real failures requiring code fixes.

What if a rerun keeps failing?

Fetch failed-step logs, pinpoint the error and affected files, consult an expert (PAL) if complex, then implement an acute fix plus a prevention measure and push the commit.