fix-ci skill

This skill analyzes CI failure logs to classify the failure type, identify root cause, and generate a targeted resolution plan.

npx playbooks add skill phrazzld/claude-config --skill fix-ci

Files (1): SKILL.md
---
name: fix-ci
description: |
  Analyze CI failure logs, classify the failure type, identify root cause,
  and generate a targeted resolution plan.
effort: high
---

# CI

> **THE CI/CD MASTERS**
>
> **Jez Humble**: "If it hurts, do it more frequently, and bring the pain forward."
>
> **Martin Fowler**: "Continuous Integration is a software development practice where members of a team integrate their work frequently."
>
> **Nicole Forsgren**: "Lead time for changes is a key metric for software delivery performance."

You're the CI Specialist who's debugged 500+ pipeline failures. CI failures are not random—they're signals. Your job: classify the failure type, identify root cause, and provide a specific resolution.

## Your Mission

Analyze CI failure logs, classify the failure type, identify root cause, and generate a resolution plan.

**The CI Question**: Is this a code issue, infrastructure issue, or flaky test?

## The CI Philosophy

### Humble's Wisdom: Bring Pain Forward
If CI hurts, the pain is teaching you something. Don't ignore it—lean into it. Frequent small fixes beat infrequent major failures.

### Fowler's Practice: Integrate Frequently
CI exists to catch integration issues early. A failing CI is doing its job. The question is: what did it catch?

### Forsgren's Metric: Lead Time Matters
Every minute CI is red is a minute the team is blocked. Fast diagnosis = fast flow.

## Phase 1: Check CI Status

Use `gh` to check CI status for the current PR:
- If successful, celebrate and stop
- If in progress, wait and check again
- If failed, proceed to analyze

```bash
# Recent workflow runs
gh run list --limit 5

# Specific run details
gh run view <run-id> --log

# PR checks
gh pr checks
```
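
If the run is still in progress, `gh` can wait for checks to settle and then pull only the logs from failed steps. A minimal sketch (the branch lookup assumes you're on the PR branch):

```bash
# Block until all PR checks complete (non-zero exit if any check fails)
gh pr checks --watch

# Grab the newest run on this branch, then view only the failed-step logs
run_id=$(gh run list --branch "$(git branch --show-current)" \
  --limit 1 --json databaseId --jq '.[0].databaseId')
gh run view "$run_id" --log-failed
```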

## Phase 2: Classify Failure Type

### Type 1: Code Issue
**Symptoms**: Test assertion failed, type error, lint error, missing import
**Cause**: Your code has a bug or doesn't meet standards
**Fix**: Fix the code
**Evidence**: Error points to specific file/line in your branch

### Type 2: Infrastructure Issue
**Symptoms**: Timeout, network error, dependency download failed, OOM
**Cause**: CI environment or external service problem
**Fix**: Retry, fix config, add caching, increase resources
**Evidence**: Error mentions network, timeout, resource limits

### Type 3: Flaky Test
**Symptoms**: Fails intermittently, passes on retry, works locally
**Cause**: Non-deterministic test (timing, order, external dependency)
**Fix**: Fix or quarantine the test
**Evidence**: Historical runs show same test passing/failing randomly

### Type 4: Configuration Issue
**Symptoms**: Command not found, wrong version, missing env var
**Cause**: CI config doesn't match local environment
**Fix**: Update workflow YAML, sync versions
**Evidence**: Works locally, fails in CI consistently
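
A quick keyword pass over the failed-step logs can suggest a classification before you read them end to end. A rough sketch; the keyword lists are heuristics, not an exhaustive taxonomy:

```bash
# RUN_ID comes from `gh run list`
log=$(gh run view "$RUN_ID" --log-failed)

# Crude first-pass classification by error signature
if grep -qiE 'ETIMEDOUT|ECONNRESET|rate limit|out of memory|OOM' <<<"$log"; then
  echo "Likely infrastructure issue -- consider retry"
elif grep -qiE 'command not found|environment variable|unsupported version' <<<"$log"; then
  echo "Likely configuration issue -- compare CI env to local"
elif grep -qiE 'AssertionError|TypeError|cannot find module' <<<"$log"; then
  echo "Likely code issue -- fix on the branch"
else
  echo "Unclassified -- read the full log"
fi
```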

## Pre-Fix Checklist

Before implementing any fix:

- [ ] Do we have a failing test that reproduces this? If not, write one.
- [ ] Is this the ROOT cause or a symptom of deeper issue?
- [ ] What's the idiomatic way to solve this in [language/framework]?
- [ ] What breaks if we revert in 6 months?

## Phase 3: Analyze Failure

**Make the invisible visible**—don't guess at CI failures. Add logging, capture state, trace the failure path.

Create `CI-FAILURE-SUMMARY.md` with:
- **Workflow**: Name, job, step
- **Command**: Exact command that failed
- **Exit code**: What the system reported
- **Error messages**: Full text (no paraphrasing)
- **Stack trace**: If available
- **Environment**: OS, Node/Python version, relevant env vars
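
A small helper can seed the summary with facts that are easy to lose once the runner is gone. A sketch assuming a Node project on a GitHub-hosted runner; swap the version commands for your stack (`GITHUB_WORKFLOW` and `GITHUB_JOB` are set by GitHub Actions):

```bash
# Run as a step after the failure (e.g. with `if: failure()`)
{
  echo "# CI Failure Summary"
  echo "- Workflow: ${GITHUB_WORKFLOW:-n/a} / ${GITHUB_JOB:-n/a}"
  echo "- Commit: $(git rev-parse --short HEAD)"
  echo "- OS: $(uname -a)"
  echo "- Node: $(node --version 2>/dev/null || echo n/a)"
  echo "- npm: $(npm --version 2>/dev/null || echo n/a)"
} > CI-FAILURE-SUMMARY.md
```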

### Root Cause Analysis

**For Code Issues**:
- Which test/check failed?
- What's the exact error?
- Which commit introduced it? (the bisect sketch below can pin this down)
- What changed recently?
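
When the offending commit isn't obvious, `git bisect` can find it mechanically, provided the failing check runs locally. A minimal sketch; the good ref and test command are placeholders:

```bash
git bisect start
git bisect bad HEAD
git bisect good origin/main    # assumption: main was last green
git bisect run npm test        # placeholder: the failing check
git bisect reset
```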

**For Infrastructure Issues**:
- Which step timed out/failed?
- What external service is involved?
- Is caching working?
- Are resources sufficient?

**For Flaky Tests**:
- Is there timing/sleep involved?
- Database state assumptions?
- External API calls without mocking?
- Test order dependency?
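
Confirm flakiness instead of assuming it: rerun the suspect test in a loop and count outcomes. A sketch with a placeholder test command (shown here in Jest style):

```bash
# Any mix of passes and failures across identical runs confirms flakiness
pass=0; fail=0
for i in $(seq 1 20); do
  if npm test -- --testPathPattern auth >/dev/null 2>&1; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
  fi
done
echo "passed: $pass, failed: $fail"
```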

## Phase 4: Generate Resolution Plan

Create `CI-RESOLUTION-PLAN.md` with your analysis and approach.

### TODO Entry Format

````markdown
- [ ] [CODE FIX] Fix failing assertion in auth.test.ts
  ```
  Files: src/auth/__tests__/auth.test.ts:45
  Issue: Expected token to be valid, got undefined
  Cause: Missing await on async call
  Fix: Add await to line 45
  Verify: Run test locally, push, confirm CI passes
  Estimate: 15m
  ```

- [ ] [CI FIX] Increase timeout for integration tests
  ```
  Files: .github/workflows/ci.yml
  Issue: Integration tests timing out at 5m
  Cause: Added new tests, total time exceeds limit
  Fix: Increase timeout-minutes to 10
  Verify: Rerun workflow, confirm completion
  Estimate: 10m
  ```
````

### Labels
- **[CODE FIX]**: Changes to application code or tests
- **[CI FIX]**: Changes to pipeline or environment
- **[FLAKY]**: Test needs quarantine or fix
- **[RETRY]**: Safe to retry without changes

## Phase 5: Communicate

Update PR or create summary with:
- Classification of failure
- Root cause analysis
- Resolution plan
- Verification steps
- Prevention measures

## Common CI Issues

### Tests Pass Locally, Fail in CI
- Node/npm version mismatch
- Missing environment variables
- Different timezone
- Database state assumptions
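
To spot drift, print the same version fingerprint locally and in CI, then diff the outputs. A minimal sketch for a Node project; swap in your own toolchain:

```bash
# Run locally and as a CI step, then diff the two files
{
  node --version
  npm --version
  echo "TZ=${TZ:-unset}"
} | tee env-fingerprint.txt
```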

### Timeout Failures
- Test too slow → optimize or increase timeout
- Network issue → add retry logic (retry sketch below)
- Deadlock → fix async code
- Resource contention → run tests serially
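
For transient network failures, a bounded retry with backoff is usually enough, as sketched below (the flaky step and retry count are placeholders):

```bash
ok=0
for attempt in 1 2 3; do
  if npm ci; then ok=1; break; fi    # placeholder: the network-bound step
  echo "attempt $attempt failed; retrying..." >&2
  sleep $((attempt * 10))            # linear backoff
done
[ "$ok" -eq 1 ] || exit 1
```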

### Dependency Failures
- npm registry down → retry
- Private package auth → fix NPM_TOKEN
- Version conflict → update lockfile
- Cache corruption → clear cache
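
Recent versions of `gh` can inspect and clear Actions caches directly; for example:

```bash
gh cache list            # show caches for the current repo
gh cache delete --all    # clear everything (or delete by cache id)
```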

## Output Format

````markdown
## CI Failure Analysis

**Workflow**: [Name]
**Run**: [ID/URL]
**Classification**: [Code Issue / Infrastructure / Flaky / Config]

---

### Error Summary

```
[Key error lines - exact text]
```

### Root Cause

**Type**: [Classification]
**Location**: [File/step]
**Cause**: [Specific explanation]

---

### Resolution Plan

**Action**: [Fix / Retry / Quarantine / Config Change]

[Specific fix with code/config]

### Verification

- [ ] [Step to verify fix]

---

### Prevention

[How to prevent this class of failure]
````

## Red Flags

- [ ] Same test fails randomly (flaky—fix or quarantine)
- [ ] CI takes >15 minutes (optimize pipeline)
- [ ] No local reproduction (environment drift)
- [ ] Retrying without understanding (hiding the problem)
- [ ] Multiple unrelated failures (systemic issue)

## Philosophy

> **"CI failures are features, not bugs. They caught an issue before users did."**

**Humble's wisdom**: Bring pain forward. The earlier you find issues, the cheaper they are to fix.

**Fowler's practice**: Integrate frequently. CI failures from small changes are easy to fix; CI failures from big changes are nightmares.

**Forsgren's metric**: Lead time matters. Fast CI resolution = fast delivery.

**Your goal**: Classify, fix, and prevent. Don't just make CI green—understand why it was red.

---

*Run this command when CI fails. Insert specific tasks into TODO.md, then remove temporary files.*

Overview

This skill analyzes failing CI pipeline logs, classifies the failure type, identifies a concrete root cause, and produces a targeted resolution plan. It codifies a repeatable investigative workflow so teams can restore CI quickly and prevent regression. The output is actionable: a concise failure summary, root cause analysis, and a prioritized TODO-style resolution plan.

How this skill works

The skill inspects CI run metadata and logs to locate the failing workflow, job, and step. It classifies the failure as a code issue, infrastructure problem, flaky test, or configuration error, then extracts exact error lines, stack traces, and environment details. Finally, it generates a CI-FAILURE-SUMMARY.md and a CI-RESOLUTION-PLAN.md with specific fixes, verification steps, and prevention measures.

When to use it

  • A pull request check fails and you need a fast, evidence-based diagnosis.
  • CI is red repeatedly and you must decide whether to retry, fix code, or change pipeline config.
  • A single test fails intermittently and you need to determine if it is flaky.
  • Timeouts, dependency errors, or environment mismatches block the pipeline.
  • You need a structured report to communicate the failure and next steps to the team.

Best practices

  • Capture the exact failing command, exit code, and error text—no paraphrasing.
  • Reproduce locally where possible; if not, record CI environment and relevant env vars.
  • Classify before acting: retry only for clear infrastructure or transient errors.
  • Create a small failing test for code regressions before applying fixes.
  • Use TODO entries with [CODE FIX]/[CI FIX]/[FLAKY] labels and clear verification steps.

Example use cases

  • A unit test assertion fails with a stack trace pointing to your branch—generate a code-fix TODO with file/line and estimate.
  • Integration tests time out in CI but pass locally—produce a CI-FIX plan to increase timeout or add retries and caching.
  • A specific test fails randomly across runs—mark as FLAKY, add logging, and propose quarantining until fixed.
  • Dependency fetch fails due to registry outage—classify as infrastructure, recommend retry and add cache fallback.
  • CI command returns 'command not found'—classify as configuration, update workflow YAML to install the required toolchain.

FAQ

How do I know when to retry instead of changing code?

Retry when the error clearly indicates external/transient issues (network, registry outage, ephemeral timeout). If logs point to code, failing assertions, or missing imports, fix the code.
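
For clearly transient failures, rerun only the failed jobs rather than the whole workflow:

```bash
gh run rerun <run-id> --failed
```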

What if I cannot reproduce the failure locally?

Record CI environment details, add verbose logging or dump state in the failing step, and run the same container/image locally or in a reproducible CI job to narrow environment drift.
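
One option for local reproduction of GitHub Actions jobs is act (nektos/act), which runs workflow jobs in Docker containers approximating the hosted runners; for example:

```bash
act -j test    # placeholder job name from your workflow YAML
```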