triage skill

This skill performs multi-source production triage, auditing Sentry, Vercel logs, health endpoints, and CI/CD to guide rapid investigation and fixes.

npx playbooks add skill phrazzld/claude-config --skill triage

---
name: triage
description: |
  Multi-source observability triage. Checks Sentry, Vercel logs, health endpoints, GitHub CI/CD.
  Drives: investigate -> fix -> PR -> postmortem workflow.
  Invoke for: production issues, error spikes, CI failures, user reports, incident response.
argument-hint: "[action: status | investigate ISSUE-ID | investigate-ci RUN-ID | fix | postmortem ISSUE-ID]"
effort: max
---

# /triage

Fix production issues. Run audit, investigate, fix, postmortem.

**This is a fixer.** It uses `/check-production` as its primitive. To file issues without fixing them, use `/log-production-issues` instead.

## Usage

```bash
/triage                        # Audit and fix highest priority (default)
/triage status                 # Production audit (Stage 1)
/triage investigate VOL-456    # Deep dive on specific Sentry issue
/triage investigate-ci 12345   # Deep dive on specific CI run failure
/triage fix                    # Create PR for current fix
/triage postmortem VOL-456     # Generate postmortem after merge
```

## Stage 1: Production Audit

**Command:** `/triage` or `/triage status`

Invoke `/check-production` primitive for parallel checks:
1. **Sentry** - Unresolved issues via triage scripts
2. **Vercel logs** - Recent errors in stream
3. **Health endpoints** - `/api/health` response
4. **GitHub CI/CD** - Failed workflow runs
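
The parallel fan-out above can be sketched in plain shell. This is a minimal illustration, not the `/check-production` implementation: `run_checks_in_parallel` is a hypothetical helper, and the commands passed to it are stand-ins for the real check scripts.

```bash
# Sketch: run each check in the background, wait for all of them,
# then print the results in the order the checks were given.
run_checks_in_parallel() {
  tmpdir=$(mktemp -d) || return 1
  i=0
  for cmd in "$@"; do
    i=$((i + 1))
    # Each check writes to its own file so outputs do not interleave.
    sh -c "$cmd" >"$tmpdir/$i.out" 2>&1 &
  done
  wait  # block until every background check has finished
  n=1
  while [ "$n" -le "$i" ]; do
    cat "$tmpdir/$n.out"
    n=$((n + 1))
  done
  rm -rf "$tmpdir"
}

# Example with stand-in commands (replace with the real check scripts):
# run_checks_in_parallel "echo sentry" "echo vercel" "echo health" "echo ci"
```

Buffering each check's output in a file keeps the report readable even when a slow source finishes last.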

**Output format:**
```
TRIAGE STATUS - 2026-01-23 15:30
================================

SENTRY (volume-fitness)
  [P0] 3 unresolved issues
  Top: VOL-456 "PaymentIntent failed" (Score: 147, 23 users)

GITHUB CI/CD
  [P1] Main branch failing: "CI" workflow (run #1234)
       Failed: Type check - 2h ago
  [P2] 2 feature branches blocked

VERCEL LOGS
  [OK] No errors in last 10 minutes

HEALTH ENDPOINTS
  [OK] volume.fitness/api/health (200, 45ms)

RECOMMENDATION:
  1. Investigate VOL-456 immediately - 23 users affected
     Run: /triage investigate VOL-456
  2. Fix main branch CI - blocking all deploys
     Run: /triage investigate-ci 1234
```

If all clean: "All systems nominal. No action required."
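
A health line like the one in the report above could be produced with `curl`. This is a sketch under assumptions: `format_health_line` and `check_health` are hypothetical names, any 2xx status is treated as healthy, and `[FAIL]` is a placeholder label that does not come from the skill.

```bash
# Format one health result in the style of the status report.
# Assumption: 2xx means healthy; "[FAIL]" is a placeholder label.
format_health_line() {
  url=$1 code=$2 seconds=$3
  # Convert curl's seconds (e.g. 0.045) to whole milliseconds.
  ms=$(awk -v s="$seconds" 'BEGIN { printf "%.0f", s * 1000 }')
  case $code in
    2??) label="[OK]" ;;
    *)   label="[FAIL]" ;;
  esac
  printf '  %s %s (%s, %sms)\n' "$label" "$url" "$code" "$ms"
}

# Probe an endpoint and format the result.
check_health() {
  url=$1
  # -w prints the status code and total time; the body is discarded.
  result=$(curl -s -o /dev/null --max-time 5 \
    -w '%{http_code} %{time_total}' "https://$url") || true
  set -- $result
  # curl reports code 000 when the request itself fails.
  format_health_line "$url" "${1:-000}" "${2:-0}"
}

# Usage: check_health volume.fitness/api/health
```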

## Stage 2: Investigate

### Delegation Pattern

For complex issues, delegate investigation to agentic tools (see `/delegate`):
- **Codex** — Code archaeology, stack trace analysis, debugging
- **Gemini** — Research current patterns, check for known issues
- **Thinktank** — Validate proposed fix before implementing

### Sentry Issues

**Command:** `/triage investigate ISSUE-ID`

Actions:
1. Fetch full issue context from Sentry
2. Create branch: `fix/ISSUE-ID-description`
3. Load affected files from stack trace
4. Check git history for related changes
5. Form root cause hypothesis (delegate to Codex for complex traces)

**Output:** Investigation summary with hypothesis and next steps.
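
Steps 2 and 4 can be approximated by hand. The helper below is a hypothetical sketch: the skill does not say how the description part of `fix/ISSUE-ID-description` is derived, so lowercasing and hyphenating the issue title is an assumption.

```bash
# Build a branch name of the form fix/ISSUE-ID-description.
# Assumption: the description is lowercased and every run of
# non-alphanumeric characters becomes a single hyphen.
fix_branch_name() {
  issue_id=$1
  slug=$(printf '%s' "$2" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')
  printf 'fix/%s-%s\n' "$issue_id" "$slug"
}

# Usage inside a repository (file path is a placeholder):
# git switch -c "$(fix_branch_name VOL-456 "PaymentIntent failed")"
# git log --oneline --since="2 weeks ago" -- path/from/stack/trace
```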

### CI/CD Failures

**Command:** `/triage investigate-ci RUN-ID`

Actions:
1. Fetch failed workflow run details
   ```bash
   gh run view RUN-ID --log-failed
   ```
2. Identify failed step and error message
3. Create branch: `fix/ci-[workflow-name]-[date]`
4. Load affected files based on error
5. Check recent commits that may have caused regression

**Common CI failure patterns:**

| Failure Type | Typical Cause | Fix Approach |
|--------------|---------------|--------------|
| Type check | New code with type errors | Fix types locally, push |
| Lint | Style violations | Run `pnpm lint --fix` |
| Test | Broken/flaky tests | Run tests locally; fix broken tests, skip flaky ones |
| Build | Missing deps, config issues | Check package.json, build config |
| Deploy | Env vars, permissions | Check Vercel/platform settings |

**Output:** CI investigation summary with specific error and fix approach.
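
The branch-naming step for CI fixes might look like the sketch below. `ci_branch_name` is a hypothetical helper, and both the slug rule and the `YYYY-MM-DD` date format are assumptions, since the skill only gives the pattern `fix/ci-[workflow-name]-[date]`.

```bash
# Build a branch name of the form fix/ci-[workflow-name]-[date].
# Assumptions: the workflow name is slugified; the date is YYYY-MM-DD
# and defaults to today when no second argument is given.
ci_branch_name() {
  workflow=$(printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9]\{1,\}/-/g' -e 's/^-//' -e 's/-$//')
  day=${2:-$(date +%Y-%m-%d)}
  printf 'fix/ci-%s-%s\n' "$workflow" "$day"
}

# Usage with the GitHub CLI (RUN-ID is a placeholder):
# name=$(gh run view RUN-ID --json workflowName --jq .workflowName)
# git switch -c "$(ci_branch_name "$name")"
# List only the jobs that failed in that run:
# gh run view RUN-ID --json jobs \
#   --jq '.jobs[] | select(.conclusion == "failure") | .name'
```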

## Stage 3: Fix

**Command:** `/triage fix`

Prerequisites: On `fix/` branch with changes.

Actions:
1. Run tests to verify fix
2. Create PR with standard format
3. Link Sentry issue in PR description

**PR format:**
```markdown
## Summary
[Fix description]

## Sentry Issue
- ID: ISSUE-ID
- Users affected: N
- First seen: DATE

## Test Plan
- [ ] Test case 1
- [ ] Test case 2
```
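
The PR format above can be filled in from the shell and handed to `gh pr create`. This is an illustrative sketch: `render_pr_body` is a hypothetical helper, and the argument values in the usage line are examples rather than real data.

```bash
# Render the PR description from four values:
#   $1 summary, $2 Sentry issue ID, $3 users affected, $4 first-seen date
render_pr_body() {
  cat <<EOF
## Summary
$1

## Sentry Issue
- ID: $2
- Users affected: $3
- First seen: $4

## Test Plan
- [ ] Test case 1
- [ ] Test case 2
EOF
}

# Usage: pipe the body into gh (--body-file - reads from stdin).
# render_pr_body "Handle failed PaymentIntent" VOL-456 23 2026-01-23 \
#   | gh pr create --title "fix: VOL-456 PaymentIntent failed" --body-file -
```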

## Stage 4: Postmortem

**Command:** `/triage postmortem ISSUE-ID`

Prerequisites: Fix deployed (PR merged).

Actions:
1. Verify no new errors in Sentry
2. Generate postmortem document from template
3. Resolve Sentry issue
4. Create `docs/postmortems/YYYY-MM-DD-ISSUE-ID.md`
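
The target path in step 4 can be computed as follows. `postmortem_path` is a hypothetical helper, and using today's date for the `YYYY-MM-DD` part is an assumption; the skill does not say whether the date is the incident date or the date of writing.

```bash
# Build docs/postmortems/YYYY-MM-DD-ISSUE-ID.md.
# Assumption: the date defaults to today unless passed explicitly.
postmortem_path() {
  issue_id=$1
  day=${2:-$(date +%Y-%m-%d)}
  printf 'docs/postmortems/%s-%s.md\n' "$day" "$issue_id"
}

# Usage: make sure the directory exists before writing the file.
# path=$(postmortem_path VOL-456)
# mkdir -p "$(dirname "$path")"
```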

## Scripts

### Via Sentry MCP (Preferred)

When Sentry MCP is configured, use direct queries:
- "Show me unresolved errors in production"
- "What's the triage score for issue VOL-456?"
- "Get full context for the top error"

### Via CLI Scripts

```bash
# Multi-source orchestrator
~/.claude/skills/triage/scripts/check_all_sources.sh

# Individual checks
~/.claude/skills/triage/scripts/check_sentry.sh
~/.claude/skills/triage/scripts/check_vercel_logs.sh
~/.claude/skills/triage/scripts/check_health_endpoints.sh

# Sentry CLI directly
sentry-cli issues list --project=$SENTRY_PROJECT --status=unresolved
sentry-cli issues describe ISSUE-ID

# Postmortem generator
~/.claude/skills/triage/scripts/generate_postmortem.sh ISSUE-ID
```

### Via GitHub CLI

```bash
# List failed runs on main branch
gh run list --branch main --status failure --limit 10

# List all recent failures
gh run list --status failure --limit 10

# View failed run details
gh run view RUN-ID

# View only failed step logs
gh run view RUN-ID --log-failed

# Re-run failed jobs (after fix pushed)
gh run rerun RUN-ID --failed

# Watch a run in progress
gh run watch RUN-ID
```

## Workflow

```
/triage
   |
   v
[Issues found?]
   |
   +-- Sentry issue --> /triage investigate ISSUE-ID
   |                       |
   |                       v
   |                    [Fix locally]
   |                       |
   |                       v
   |                    /triage fix (creates PR)
   |                       |
   |                       v
   |                    [PR merged & deployed]
   |                       |
   |                       v
   |                    /triage postmortem ISSUE-ID
   |
   +-- CI failure --> /triage investigate-ci RUN-ID
   |                     |
   |                     v
   |                  [Fix locally, push]
   |                     |
   |                     v
   |                  [CI re-runs automatically]
   |                     |
   |                     v
   |                  [Verify CI green]
   |
   +-- No issues --> "All systems nominal"
```

## Environment Variables

```bash
# Required for Sentry
SENTRY_AUTH_TOKEN   # or SENTRY_MASTER_TOKEN
SENTRY_ORG          # Organization slug

# Auto-detected per project
SENTRY_PROJECT      # From .sentryclirc or .env.local

# Optional for Vercel
VERCEL_TOKEN        # For `vercel logs` access
```
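
A quick preflight can catch missing variables before any check runs. The function below is a sketch, not part of the skill: `check_triage_env` is a hypothetical name, and it encodes only what the list above states (one of the two Sentry tokens plus `SENTRY_ORG` required, `VERCEL_TOKEN` optional).

```bash
# Report missing variables on stderr; return non-zero if a required one is unset.
check_triage_env() {
  missing=0
  # Either token name is accepted.
  if [ -z "${SENTRY_AUTH_TOKEN:-}" ] && [ -z "${SENTRY_MASTER_TOKEN:-}" ]; then
    echo "missing: SENTRY_AUTH_TOKEN (or SENTRY_MASTER_TOKEN)" >&2
    missing=1
  fi
  if [ -z "${SENTRY_ORG:-}" ]; then
    echo "missing: SENTRY_ORG" >&2
    missing=1
  fi
  # VERCEL_TOKEN is optional, so only note its absence.
  if [ -z "${VERCEL_TOKEN:-}" ]; then
    echo "note: VERCEL_TOKEN unset (optional)" >&2
  fi
  return "$missing"
}

# Usage: check_triage_env || exit 1
```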

## MCP Configuration (Recommended)

For AI-assisted triage, configure Sentry MCP:

```json
{
  "mcpServers": {
    "sentry": {
      "url": "https://mcp.sentry.dev/mcp",
      "transport": "http"
    }
  }
}
```

Or local with token:
```json
{
  "mcpServers": {
    "sentry": {
      "command": "npx",
      "args": ["-y", "@sentry/mcp-server"],
      "env": {
        "SENTRY_AUTH_TOKEN": "your-token",
        "SENTRY_ORG": "your-org"
      }
    }
  }
}
```

## Reuses

- `~/.claude/skills/sentry-observability/scripts/triage_score.sh`
- `~/.claude/skills/sentry-observability/scripts/issue_detail.sh`
- `~/.claude/skills/sentry-observability/scripts/resolve_issue.sh`

## Related

- `/check-production` - The primitive (audit only)
- `/log-production-issues` - Create GitHub issues from findings
- `/observability` - Full observability setup
- `/sentry-observability` - Sentry-specific operations
- `/verify-fix` - Verification checklist
- `/delegate` - Multi-AI orchestration pattern

Overview

This skill performs multi-source observability triage for production incidents, combining Sentry, Vercel logs, health endpoints, and GitHub CI/CD checks. It drives an investigate -> fix -> PR -> postmortem workflow so teams can move from detection to resolution and learning. Use it to prioritize issues, run targeted investigations, create fixes, and generate postmortems after deployment.

How this skill works

The skill runs a parallel production audit that queries Sentry for unresolved issues, tails Vercel logs for recent errors, checks health endpoints, and lists failed GitHub workflow runs. For a selected finding it creates a working branch, loads stack traces and related files, inspects recent commits, and produces an investigation summary with a root-cause hypothesis. When a fix is prepared it runs tests, opens a standardized PR, and after merge generates a postmortem and resolves the Sentry issue.

When to use it

- Routine production audit at shift start or on-call handoff
- Immediate response to error spikes or user reports
- When GitHub CI/CD failures block deploys
- Before creating a PR for a bug fix tied to an error
- After deployment to validate fix and run postmortem

Best practices

- Run the audit frequently and act on high-triage-score Sentry issues first
- Create descriptive branches like `fix/ISSUE-ID-short-desc` before making changes
- Delegate complex stack-trace analysis to specialized agents, but keep hypothesis and steps explicit
- Run tests and CI locally before opening the PR; include Sentry link in PR description
- Generate a concise postmortem after merge and store it under `docs/postmortems` with root cause and action items

Example use cases

- `/triage` to produce a prioritized snapshot showing top Sentry issue and failing CI
- `/triage investigate VOL-456` to load full Sentry context, form hypothesis, and create a fix branch
- `/triage investigate-ci 12345` to fetch failed GitHub run logs, identify the broken step, and prepare a targeted fix
- `/triage fix` to run tests and open a PR with the required Sentry metadata and test plan
- `/triage postmortem VOL-456` to verify no regressions, generate the postmortem file, and resolve the issue

FAQ

What permissions are required?

A Sentry token (SENTRY_AUTH_TOKEN) and organization slug (SENTRY_ORG) are required, along with GitHub CLI authentication that can view workflow runs and create PRs. A Vercel token (VERCEL_TOKEN) is optional and only needed for log access.

How does it decide priority?

Priority is driven by triage score from Sentry, user impact counts, and CI failure severity; the audit recommends actions based on those signals.