triage skill

This skill performs multi-source production triage: it audits Sentry, Vercel logs, health endpoints, and CI, then guides investigation and fixes.

npx playbooks add skill phrazzld/claude-config --skill triage

---
name: triage
description: |
  Multi-source observability triage. Checks Sentry, Vercel logs, health endpoints, GitHub CI/CD.
  Drives: investigate -> fix -> PR -> postmortem workflow.
  Invoke for: production issues, error spikes, CI failures, user reports, incident response.
argument-hint: "[action: status | investigate ISSUE-ID | investigate-ci RUN-ID | fix | postmortem ISSUE-ID]"
---

# /triage

Fix production issues in four stages: audit, investigate, fix, postmortem.

**This is a fixer.** It uses `/check-production` as its primitive. To log issues without fixing them, use `/log-production-issues` instead.

## Usage

```bash
/triage                        # Audit and fix highest priority (default)
/triage investigate VOL-456    # Deep dive on specific Sentry issue
/triage investigate-ci 12345   # Deep dive on specific CI run failure
/triage fix                    # Create PR for current fix
/triage postmortem VOL-456     # Generate postmortem after merge
```

## Stage 1: Production Audit

**Command:** `/triage` or `/triage status`

Invoke the `/check-production` primitive to run these checks in parallel:
1. **Sentry** - Unresolved issues via triage scripts
2. **Vercel logs** - Recent errors in stream
3. **Health endpoints** - `/api/health` response
4. **GitHub CI/CD** - Failed workflow runs
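
The parallel fan-out can be sketched in bash with background jobs and `wait`. This is a minimal illustration, not the skill's actual orchestrator: `run_parallel` is a hypothetical helper, and the commands it runs are whatever the caller passes in.

```bash
# Hypothetical helper: run each command in its own background shell,
# wait for all of them, and return the number of commands that failed.
run_parallel() {
  local pids=() failures=0 cmd pid
  for cmd in "$@"; do
    bash -c "$cmd" &
    pids+=("$!")
  done
  for pid in "${pids[@]}"; do
    wait "$pid" || failures=$((failures + 1))
  done
  return "$failures"
}

# Example call (not executed here), using the script paths listed under "Scripts":
#   run_parallel \
#     "$HOME/.claude/skills/triage/scripts/check_sentry.sh" \
#     "$HOME/.claude/skills/triage/scripts/check_vercel_logs.sh" \
#     "$HOME/.claude/skills/triage/scripts/check_health_endpoints.sh"
```

Returning the failure count keeps the sketch short; a fuller version would probably capture each check's output to a separate file so the report can be assembled afterwards.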

**Output format:**
```
TRIAGE STATUS - 2026-01-23 15:30
================================

SENTRY (volume-fitness)
  [P0] 3 unresolved issues
  Top: VOL-456 "PaymentIntent failed" (Score: 147, 23 users)

GITHUB CI/CD
  [P1] Main branch failing: "CI" workflow (run #1234)
       Failed: Type check - 2h ago
  [P2] 2 feature branches blocked

VERCEL LOGS
  [OK] No errors in last 10 minutes

HEALTH ENDPOINTS
  [OK] volume.fitness/api/health (200, 45ms)

RECOMMENDATION:
  1. Investigate VOL-456 immediately - 23 users affected
     Run: /triage investigate VOL-456
  2. Fix main branch CI - blocking all deploys
     Run: /triage investigate-ci 1234
```

If all checks are clean, report: "All systems nominal. No action required."
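
To turn a status report like the one above into an ordered work list, one option is to keep only the lines tagged `[P0]`, `[P1]`, and so on, and sort them by tag. The sketch below assumes that a lower number means higher urgency, which matches the sample recommendation but is not stated explicitly; `rank_findings` is a hypothetical name.

```bash
# Hypothetical helper: read status lines on stdin, keep only lines tagged
# [P0], [P1], ..., strip leading indentation, and print the most urgent first.
# The stable sort keeps the original order among findings of equal priority.
rank_findings() {
  grep -E '^[[:space:]]*\[P[0-9]\]' \
    | sed -E 's/^[[:space:]]+//' \
    | LC_ALL=C sort -s -k1,1
}

# Example (not executed here):
#   rank_findings < triage_status.txt
```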

## Stage 2: Investigate

### Delegation Pattern

For complex issues, delegate investigation to agentic tools (see `/delegate`):
- **Codex** — Code archaeology, stack trace analysis, debugging
- **Gemini** — Research current patterns, check for known issues
- **Thinktank** — Validate proposed fix before implementing

### Sentry Issues

**Command:** `/triage investigate ISSUE-ID`

Actions:
1. Fetch full issue context from Sentry
2. Create branch: `fix/ISSUE-ID-description`
3. Load affected files from stack trace
4. Check git history for related changes
5. Form root cause hypothesis (delegate to Codex for complex traces)

**Output:** Investigation summary with hypothesis and next steps.
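
As an illustration of steps 2 and 4, the branch name can be derived from the issue ID and title. The slug rules here (lowercase, runs of non-alphanumerics collapsed to `-`) are an assumption, and `fix_branch_name` is a hypothetical helper rather than part of the skill.

```bash
# Hypothetical helper: build a branch name of the form fix/ISSUE-ID-description.
fix_branch_name() {
  local issue_id="$1" title="$2" slug
  slug="$(printf '%s' "$title" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//')"
  printf 'fix/%s-%s\n' "$issue_id" "$slug"
}

# Example (not executed here):
#   git switch -c "$(fix_branch_name VOL-456 'PaymentIntent failed')"
#   git log --oneline --since='2 weeks ago' -- path/to/file/from/stack/trace
```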

### CI/CD Failures

**Command:** `/triage investigate-ci RUN-ID`

Actions:
1. Fetch failed workflow run details
   ```bash
   gh run view RUN-ID --log-failed
   ```
2. Identify failed step and error message
3. Create branch: `fix/ci-[workflow-name]-[date]`
4. Load affected files based on error
5. Check recent commits that may have caused regression

**Common CI failure patterns:**

| Failure Type | Typical Cause | Fix Approach |
|--------------|---------------|--------------|
| Type check | New code with type errors | Fix types locally, push |
| Lint | Style violations | Run `pnpm lint --fix` |
| Test | Broken/flaky tests | Run tests locally, fix or skip flaky |
| Build | Missing deps, config issues | Check package.json, build config |
| Deploy | Env vars, permissions | Check Vercel/platform settings |

**Output:** CI investigation summary with specific error and fix approach.
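
The table above can be expressed as a small lookup from the failed step's name to a fix approach. Matching on substrings of the step name is an assumption, since real workflows name their steps differently; `suggest_ci_fix` is a hypothetical helper.

```bash
# Hypothetical helper: map a failed CI step name to the fix approach
# from the failure-pattern table.
suggest_ci_fix() {
  local step
  step="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  case "$step" in
    *type*)   echo "Fix types locally, push" ;;
    *lint*)   echo "Run pnpm lint --fix" ;;
    *test*)   echo "Run tests locally, fix or skip flaky" ;;
    *build*)  echo "Check package.json, build config" ;;
    *deploy*) echo "Check Vercel/platform settings" ;;
    *)        echo "No known pattern; read the failed step logs" ;;
  esac
}

# Example (not executed here): list failed step names for a run, then look
# each one up.
#   gh run view RUN-ID --json jobs \
#     --jq '.jobs[].steps[] | select(.conclusion == "failure") | .name'
```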

## Stage 3: Fix

**Command:** `/triage fix`

Prerequisites: On `fix/` branch with changes.

Actions:
1. Run tests to verify fix
2. Create PR with standard format
3. Link Sentry issue in PR description

**PR format:**
```markdown
## Summary
[Fix description]

## Sentry Issue
- ID: ISSUE-ID
- Users affected: N
- First seen: DATE

## Test Plan
- [ ] Test case 1
- [ ] Test case 2
```
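
A minimal sketch of filling in that format from the shell and handing it to `gh pr create`. The helper name `pr_body` and its argument order are hypothetical; the values would come from the Sentry issue under investigation.

```bash
# Hypothetical helper: render the PR body in the format shown above.
# Arguments: summary, Sentry issue ID, users affected, first-seen date.
pr_body() {
  local summary="$1" issue_id="$2" users="$3" first_seen="$4"
  cat <<EOF
## Summary
${summary}

## Sentry Issue
- ID: ${issue_id}
- Users affected: ${users}
- First seen: ${first_seen}

## Test Plan
- [ ] Test case 1
- [ ] Test case 2
EOF
}

# Example (not executed here):
#   gh pr create \
#     --title "fix: handle failed PaymentIntent (VOL-456)" \
#     --body "$(pr_body 'Handle failed PaymentIntent' VOL-456 23 2026-01-23)"
```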

## Stage 4: Postmortem

**Command:** `/triage postmortem ISSUE-ID`

Prerequisites: Fix deployed (PR merged).

Actions:
1. Verify no new errors in Sentry
2. Generate the postmortem from the template and save it as `docs/postmortems/YYYY-MM-DD-ISSUE-ID.md`
3. Resolve the Sentry issue
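
The postmortem path follows a fixed pattern, so it can be computed rather than typed. This sketch assumes the date in the filename is the day the postmortem is written; `postmortem_path` is a hypothetical helper.

```bash
# Hypothetical helper: build docs/postmortems/YYYY-MM-DD-ISSUE-ID.md.
# The date defaults to today; pass one explicitly for reproducible output.
postmortem_path() {
  local issue_id="$1" day="${2:-$(date +%F)}"
  printf 'docs/postmortems/%s-%s.md\n' "$day" "$issue_id"
}

# Example (not executed here):
#   mkdir -p docs/postmortems
#   "$HOME/.claude/skills/triage/scripts/generate_postmortem.sh" VOL-456 \
#     > "$(postmortem_path VOL-456)"
```

Whether `generate_postmortem.sh` writes to stdout or to a file of its own is not documented here, so the redirect in the example is an assumption.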

## Scripts

### Via Sentry MCP (Preferred)

When Sentry MCP is configured, use direct queries:
- "Show me unresolved errors in production"
- "What's the triage score for issue VOL-456?"
- "Get full context for the top error"

### Via CLI Scripts

```bash
# Multi-source orchestrator
~/.claude/skills/triage/scripts/check_all_sources.sh

# Individual checks
~/.claude/skills/triage/scripts/check_sentry.sh
~/.claude/skills/triage/scripts/check_vercel_logs.sh
~/.claude/skills/triage/scripts/check_health_endpoints.sh

# Sentry CLI directly
sentry-cli issues list --project="$SENTRY_PROJECT" --status=unresolved
sentry-cli issues describe ISSUE-ID

# Postmortem generator
~/.claude/skills/triage/scripts/generate_postmortem.sh ISSUE-ID
```
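
The individual check scripts are not reproduced here. As a rough idea of what a health-endpoint check might involve, the sketch below probes a URL with `curl` and prints a line in the style of the status report. `health_label` and `check_health` are hypothetical names, not the contents of `check_health_endpoints.sh`, and treating every non-2xx status as a failure is an assumption.

```bash
# Hypothetical helper: label an HTTP status code as OK (2xx) or FAIL.
health_label() {
  case "$1" in
    2[0-9][0-9]) echo "OK" ;;
    *)           echo "FAIL" ;;
  esac
}

# Hypothetical helper: probe a URL and print "[LABEL] url (code, Nms)".
# curl reports 000 as the status code when the request itself fails.
check_health() {
  local url="$1" code secs ms
  read -r code secs < <(curl -s -o /dev/null --max-time 10 \
    -w '%{http_code} %{time_total}' "$url")
  ms="$(awk -v s="${secs:-0}" 'BEGIN { printf "%d", s * 1000 }')"
  printf '[%s] %s (%s, %sms)\n' \
    "$(health_label "${code:-000}")" "$url" "${code:-000}" "$ms"
}

# Example (not executed here):
#   check_health "https://volume.fitness/api/health"
```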

### Via GitHub CLI

```bash
# List failed runs on main branch
gh run list --branch main --status failure --limit 10

# List all recent failures
gh run list --status failure --limit 10

# View failed run details
gh run view RUN-ID

# View only failed step logs
gh run view RUN-ID --log-failed

# Re-run failed jobs (after fix pushed)
gh run rerun RUN-ID --failed

# Watch a run in progress
gh run watch RUN-ID
```

## Workflow

```
/triage
   |
   v
[Issues found?]
   |
   +-- Sentry issue --> /triage investigate ISSUE-ID
   |                       |
   |                       v
   |                    [Fix locally]
   |                       |
   |                       v
   |                    /triage fix (creates PR)
   |                       |
   |                       v
   |                    [PR merged & deployed]
   |                       |
   |                       v
   |                    /triage postmortem ISSUE-ID
   |
   +-- CI failure --> /triage investigate-ci RUN-ID
   |                     |
   |                     v
   |                  [Fix locally, push]
   |                     |
   |                     v
   |                  [CI re-runs automatically]
   |                     |
   |                     v
   |                  [Verify CI green]
   |
   +-- No issues --> "All systems nominal"
```

## Environment Variables

```bash
# Required for Sentry
SENTRY_AUTH_TOKEN   # or SENTRY_MASTER_TOKEN
SENTRY_ORG          # Organization slug

# Auto-detected per project
SENTRY_PROJECT      # From .sentryclirc or .env.local

# Optional for Vercel
VERCEL_TOKEN        # For `vercel logs` access
```
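
A pre-flight check can catch missing variables before any Sentry call is made. This is a sketch under the assumption that either token variable is sufficient, as the comment above suggests; `check_sentry_env` is a hypothetical helper.

```bash
# Hypothetical pre-flight check: succeed only if a Sentry token
# (SENTRY_AUTH_TOKEN or SENTRY_MASTER_TOKEN) and SENTRY_ORG are set.
check_sentry_env() {
  local missing=0
  if [ -z "${SENTRY_AUTH_TOKEN:-}" ] && [ -z "${SENTRY_MASTER_TOKEN:-}" ]; then
    echo "missing: SENTRY_AUTH_TOKEN (or SENTRY_MASTER_TOKEN)" >&2
    missing=1
  fi
  if [ -z "${SENTRY_ORG:-}" ]; then
    echo "missing: SENTRY_ORG" >&2
    missing=1
  fi
  return "$missing"
}

# Example (not executed here):
#   check_sentry_env || exit 1
```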

## MCP Configuration (Recommended)

For AI-assisted triage, configure Sentry MCP:

```json
{
  "mcpServers": {
    "sentry": {
      "url": "https://mcp.sentry.dev/mcp",
      "transport": "http"
    }
  }
}
```

Or run the server locally with a token:
```json
{
  "mcpServers": {
    "sentry": {
      "command": "npx",
      "args": ["-y", "@sentry/mcp-server"],
      "env": {
        "SENTRY_AUTH_TOKEN": "your-token",
        "SENTRY_ORG": "your-org"
      }
    }
  }
}
```

## Reuses

- `~/.claude/skills/sentry-observability/scripts/triage_score.sh`
- `~/.claude/skills/sentry-observability/scripts/issue_detail.sh`
- `~/.claude/skills/sentry-observability/scripts/resolve_issue.sh`

## Related

- `/check-production` - The primitive (audit only)
- `/log-production-issues` - Create GitHub issues from findings
- `/observability` - Full observability setup
- `/sentry-observability` - Sentry-specific operations
- `/verify-fix` - Verification checklist
- `/delegate` - Multi-AI orchestration pattern

Overview

This skill provides multi-source observability triage for production incidents. It audits Sentry, Vercel logs, health endpoints, and GitHub CI/CD, then drives an investigate -> fix -> PR -> postmortem workflow. Use it to rapidly surface, diagnose, and remediate high-priority production problems.

How this skill works

The skill runs a parallel production audit (Sentry unresolved issues, Vercel error streams, /api/health checks, and failed GitHub workflow runs) and ranks findings by priority. For a selected issue it creates a fix branch, loads stack traces and related files, checks git history, and produces an investigation summary and hypothesis. For CI failures it inspects run logs, identifies the failed step, and proposes targeted fixes. After changes, it helps run tests, open a PR with a standard template, and generate postmortems once the fix is deployed.

When to use it

  • Routine production audit or on-call shift start
  • Investigating a Sentry issue affecting users or a spike in errors
  • Responding to failing GitHub CI/CD runs that block deploys
  • Validating health endpoint failures or unexplained latency
  • After a user report that appears in logs or error tracking

Best practices

  • Run the /triage audit at the start of shifts and after deploys
  • Investigate top-priority Sentry issues first (highest score and user impact)
  • Create descriptive fix branches: fix/ISSUE-ID-short-description
  • Run tests locally before creating the PR; include test plan in PR body
  • Use delegation (Codex, Thinktank, Gemini) for complex stack traces or hypotheses

Example use cases

  • Run /triage to get a snapshot of production health and recommendations
  • /triage investigate VOL-456 to deep-dive into a Sentry error impacting users
  • /triage investigate-ci 12345 to inspect a failed GitHub Actions run and produce a fix branch
  • /triage fix to run tests and open a PR for verified local fixes
  • /triage postmortem VOL-456 after deployment to generate the postmortem and resolve the Sentry issue

FAQ

What does the initial /triage audit include?

It queries unresolved Sentry issues, recent Vercel logs, health endpoint responses, and recent failed GitHub workflow runs, then ranks findings with recommendations.

When should I create a postmortem?

Create a postmortem after the fix is merged and deployed, once you have verified that the error no longer occurs in Sentry. The skill generates the document and resolves the issue.