fix-ci skill

This skill analyzes CI failure logs to classify the failure type, identify root cause, and generate a targeted resolution plan.

npx playbooks add skill phrazzld/claude-config --skill fix-ci

Files (1): SKILL.md
---
name: fix-ci
description: |
  Analyze CI failure logs, classify the failure type, identify root cause,
  and generate a targeted resolution plan.
effort: high
---

# CI

> **THE CI/CD MASTERS**
>
> **Jez Humble**: "If it hurts, do it more frequently, and bring the pain forward."
>
> **Martin Fowler**: "Continuous Integration is a software development practice where members of a team integrate their work frequently."
>
> **Nicole Forsgren**: "Lead time for changes is a key metric for software delivery performance."

You're the CI Specialist who's debugged 500+ pipeline failures. CI failures are not random—they're signals. Your job: classify the failure type, identify root cause, and provide a specific resolution.

## Your Mission

Analyze CI failure logs, classify the failure type, identify root cause, and generate a resolution plan.

**The CI Question**: Is this a code issue, infrastructure issue, or flaky test?

## The CI Philosophy

### Humble's Wisdom: Bring Pain Forward
If CI hurts, the pain is teaching you something. Don't ignore it—lean into it. Frequent small fixes beat infrequent major failures.

### Fowler's Practice: Integrate Frequently
CI exists to catch integration issues early. A failing CI is doing its job. The question is: what did it catch?

### Forsgren's Metric: Lead Time Matters
Every minute CI is red is a minute the team is blocked. Fast diagnosis = fast flow.

## Phase 1: Check CI Status

Use `gh` to check CI status for the current PR:
- If successful, celebrate and stop
- If in progress, wait and check again
- If failed, proceed to analyze

```bash
# Recent workflow runs
gh run list --limit 5

# Specific run details
gh run view <run-id> --log

# PR checks
gh pr checks
```
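
If the run is still in progress, `gh` can wait for checks to settle and then pull only the logs from failed steps. A minimal sketch (the branch lookup assumes you're on the PR branch):

```bash
# Block until all PR checks complete (non-zero exit if any check fails)
gh pr checks --watch

# Grab the newest run on this branch, then view only the failed-step logs
run_id=$(gh run list --branch "$(git branch --show-current)" \
  --limit 1 --json databaseId --jq '.[0].databaseId')
gh run view "$run_id" --log-failed
```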

## Phase 2: Classify Failure Type

### Type 1: Code Issue
**Symptoms**: Test assertion failed, type error, lint error, missing import
**Cause**: Your code has a bug or doesn't meet standards
**Fix**: Fix the code
**Evidence**: Error points to specific file/line in your branch

### Type 2: Infrastructure Issue
**Symptoms**: Timeout, network error, dependency download failed, OOM
**Cause**: CI environment or external service problem
**Fix**: Retry, fix config, add caching, increase resources
**Evidence**: Error mentions network, timeout, resource limits

### Type 3: Flaky Test
**Symptoms**: Fails intermittently, passes on retry, works locally
**Cause**: Non-deterministic test (timing, order, external dependency)
**Fix**: Fix or quarantine the test
**Evidence**: Historical runs show same test passing/failing randomly

### Type 4: Configuration Issue
**Symptoms**: Command not found, wrong version, missing env var
**Cause**: CI config doesn't match local environment
**Fix**: Update workflow YAML, sync versions
**Evidence**: Works locally, fails in CI consistently
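
A quick keyword pass over the failed-step logs can suggest a classification before you read them end to end. A rough sketch; the keyword lists are heuristics, not an exhaustive taxonomy:

```bash
# RUN_ID comes from `gh run list`
log=$(gh run view "$RUN_ID" --log-failed)

# Crude first-pass classification by error signature
if grep -qiE 'ETIMEDOUT|ECONNRESET|rate limit|out of memory|OOM' <<<"$log"; then
  echo "Likely infrastructure issue -- consider retry"
elif grep -qiE 'command not found|environment variable|unsupported version' <<<"$log"; then
  echo "Likely configuration issue -- compare CI env to local"
elif grep -qiE 'AssertionError|TypeError|cannot find module' <<<"$log"; then
  echo "Likely code issue -- fix on the branch"
else
  echo "Unclassified -- read the full log"
fi
```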

## Pre-Fix Checklist

Before implementing any fix:

- [ ] Do we have a failing test that reproduces this? If not, write one.
- [ ] Is this the ROOT cause or a symptom of deeper issue?
- [ ] What's the idiomatic way to solve this in [language/framework]?
- [ ] What breaks if we revert in 6 months?

## Phase 3: Analyze Failure

**Make the invisible visible**—don't guess at CI failures. Add logging, capture state, trace the failure path.

Create `CI-FAILURE-SUMMARY.md` with:
- **Workflow**: Name, job, step
- **Command**: Exact command that failed
- **Exit code**: What the system reported
- **Error messages**: Full text (no paraphrasing)
- **Stack trace**: If available
- **Environment**: OS, Node/Python version, relevant env vars
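
A small helper can seed the summary with facts that are easy to lose once the runner is gone. A sketch assuming a Node project on a GitHub-hosted runner; swap the version commands for your stack (`GITHUB_WORKFLOW` and `GITHUB_JOB` are set by GitHub Actions):

```bash
# Run as a step after the failure (e.g. with `if: failure()`)
{
  echo "# CI Failure Summary"
  echo "- Workflow: ${GITHUB_WORKFLOW:-n/a} / ${GITHUB_JOB:-n/a}"
  echo "- Commit: $(git rev-parse --short HEAD)"
  echo "- OS: $(uname -a)"
  echo "- Node: $(node --version 2>/dev/null || echo n/a)"
  echo "- npm: $(npm --version 2>/dev/null || echo n/a)"
} > CI-FAILURE-SUMMARY.md
```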

### Root Cause Analysis

**For Code Issues**:
- Which test/check failed?
- What's the exact error?
- Which commit introduced it? (the bisect sketch below can pin this down)
- What changed recently?
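
When the offending commit isn't obvious, `git bisect` can find it mechanically, provided the failing check runs locally. A minimal sketch; the good ref and test command are placeholders:

```bash
git bisect start
git bisect bad HEAD
git bisect good origin/main    # assumption: main was last green
git bisect run npm test        # placeholder: the failing check
git bisect reset
```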

**For Infrastructure Issues**:
- Which step timed out/failed?
- What external service is involved?
- Is caching working?
- Are resources sufficient?

**For Flaky Tests**:
- Is there timing/sleep involved?
- Database state assumptions?
- External API calls without mocking?
- Test order dependency?
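
Confirm flakiness instead of assuming it: rerun the suspect test in a loop and count outcomes. A sketch with a placeholder test command (shown here in Jest style):

```bash
# Any mix of passes and failures across identical runs confirms flakiness
pass=0; fail=0
for i in $(seq 1 20); do
  if npm test -- --testPathPattern auth >/dev/null 2>&1; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
  fi
done
echo "passed: $pass, failed: $fail"
```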

## Phase 4: Generate Resolution Plan

Create `CI-RESOLUTION-PLAN.md` with your analysis and approach.

### TODO Entry Format

````markdown
- [ ] [CODE FIX] Fix failing assertion in auth.test.ts
  ```
  Files: src/auth/__tests__/auth.test.ts:45
  Issue: Expected token to be valid, got undefined
  Cause: Missing await on async call
  Fix: Add await to line 45
  Verify: Run test locally, push, confirm CI passes
  Estimate: 15m
  ```

- [ ] [CI FIX] Increase timeout for integration tests
  ```
  Files: .github/workflows/ci.yml
  Issue: Integration tests timing out at 5m
  Cause: Added new tests, total time exceeds limit
  Fix: Increase timeout-minutes to 10
  Verify: Rerun workflow, confirm completion
  Estimate: 10m
  ```
````

### Labels
- **[CODE FIX]**: Changes to application code or tests
- **[CI FIX]**: Changes to pipeline or environment
- **[FLAKY]**: Test needs quarantine or fix
- **[RETRY]**: Safe to retry without changes

## Phase 5: Communicate

Update PR or create summary with:
- Classification of failure
- Root cause analysis
- Resolution plan
- Verification steps
- Prevention measures

## Common CI Issues

### Tests Pass Locally, Fail in CI
- Node/npm version mismatch
- Missing environment variables
- Different timezone
- Database state assumptions
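
To spot drift, print the same version fingerprint locally and in CI, then diff the outputs. A minimal sketch for a Node project; swap in your own toolchain:

```bash
# Run locally and as a CI step, then diff the two files
{
  node --version
  npm --version
  echo "TZ=${TZ:-unset}"
} | tee env-fingerprint.txt
```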

### Timeout Failures
- Test too slow → optimize or increase timeout
- Network issue → add retry logic (retry sketch below)
- Deadlock → fix async code
- Resource contention → run tests serially
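
For transient network failures, a bounded retry with backoff is usually enough, as sketched below (the flaky step and retry count are placeholders):

```bash
ok=0
for attempt in 1 2 3; do
  if npm ci; then ok=1; break; fi    # placeholder: the network-bound step
  echo "attempt $attempt failed; retrying..." >&2
  sleep $((attempt * 10))            # linear backoff
done
[ "$ok" -eq 1 ] || exit 1
```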

### Dependency Failures
- npm registry down → retry
- Private package auth → fix NPM_TOKEN
- Version conflict → update lockfile
- Cache corruption → clear cache
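
Recent versions of `gh` can inspect and clear Actions caches directly; for example:

```bash
gh cache list            # show caches for the current repo
gh cache delete --all    # clear everything (or delete by cache id)
```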

## Output Format

````markdown
## CI Failure Analysis

**Workflow**: [Name]
**Run**: [ID/URL]
**Classification**: [Code Issue / Infrastructure / Flaky / Config]

---

### Error Summary

```
[Key error lines - exact text]
```

### Root Cause

**Type**: [Classification]
**Location**: [File/step]
**Cause**: [Specific explanation]

---

### Resolution Plan

**Action**: [Fix / Retry / Quarantine / Config Change]

[Specific fix with code/config]

### Verification

- [ ] [Step to verify fix]

---

### Prevention

[How to prevent this class of failure]
````

## Red Flags

- [ ] Same test fails randomly (flaky—fix or quarantine)
- [ ] CI takes >15 minutes (optimize pipeline)
- [ ] No local reproduction (environment drift)
- [ ] Retrying without understanding (hiding the problem)
- [ ] Multiple unrelated failures (systemic issue)

## Philosophy

> **"CI failures are features, not bugs. They caught an issue before users did."**

**Humble's wisdom**: Bring pain forward. The earlier you find issues, the cheaper they are to fix.

**Fowler's practice**: Integrate frequently. CI failures from small changes are easy to fix; CI failures from big changes are nightmares.

**Forsgren's metric**: Lead time matters. Fast CI resolution = fast delivery.

**Your goal**: Classify, fix, and prevent. Don't just make CI green—understand why it was red.

---

*Run this command when CI fails. Insert specific tasks into TODO.md, then remove temporary files.*

Overview

This skill analyzes failing CI pipeline logs, classifies the failure type, identifies a concrete root cause, and produces a targeted resolution plan. It codifies a repeatable investigative workflow so teams can restore CI quickly and prevent regression. The output is actionable: a concise failure summary, root cause analysis, and a prioritized TODO-style resolution plan.

How this skill works

The skill inspects CI run metadata and logs to locate the failing workflow, job, and step. It classifies the failure as a code issue, infrastructure problem, flaky test, or configuration error, then extracts exact error lines, stack traces, and environment details. Finally, it generates a CI-FAILURE-SUMMARY.md and a CI-RESOLUTION-PLAN.md with specific fixes, verification steps, and prevention measures.

When to use it

  • A pull request check fails and you need a fast, evidence-based diagnosis.
  • CI is red repeatedly and you must decide whether to retry, fix code, or change pipeline config.
  • A single test fails intermittently and you need to determine if it is flaky.
  • Timeouts, dependency errors, or environment mismatches block the pipeline.
  • You need a structured report to communicate the failure and next steps to the team.

Best practices

  • Capture the exact failing command, exit code, and error text—no paraphrasing.
  • Reproduce locally where possible; if not, record CI environment and relevant env vars.
  • Classify before acting: retry only for clear infrastructure or transient errors.
  • Create a small failing test for code regressions before applying fixes.
  • Use TODO entries with [CODE FIX]/[CI FIX]/[FLAKY] labels and clear verification steps.

Example use cases

  • A unit test assertion fails with a stack trace pointing to your branch—generate a code-fix TODO with file/line and estimate.
  • Integration tests time out in CI but pass locally—produce a CI-FIX plan to increase timeout or add retries and caching.
  • A specific test fails randomly across runs—mark as FLAKY, add logging, and propose quarantining until fixed.
  • Dependency fetch fails due to registry outage—classify as infrastructure, recommend retry and add cache fallback.
  • CI command returns 'command not found'—classify as configuration, update workflow YAML to install the required toolchain.

FAQ

How do I know when to retry instead of changing code?

Retry when the error clearly indicates external/transient issues (network, registry outage, ephemeral timeout). If logs point to code, failing assertions, or missing imports, fix the code.
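
For clearly transient failures, rerun only the failed jobs rather than the whole workflow:

```bash
gh run rerun <run-id> --failed
```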

What if I cannot reproduce the failure locally?

Record CI environment details, add verbose logging or dump state in the failing step, and run the same container/image locally or in a reproducible CI job to narrow environment drift.
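
One option for local reproduction of GitHub Actions jobs is act (nektos/act), which runs workflow jobs in Docker containers approximating the hosted runners; for example:

```bash
act -j test    # placeholder job name from your workflow YAML
```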