check-production skill

This skill audits production health by inspecting Sentry, Vercel logs, health endpoints, and CI/CD status to surface prioritized issues.

npx playbooks add skill phrazzld/claude-config --skill check-production

---
name: check-production
description: |
  Check production health: Sentry errors, Vercel logs, health endpoints, GitHub CI/CD.
  Outputs structured findings. Use log-production-issues to create issues.
  Invoke for: production diagnostics, error audit, health status, CI failures.
effort: high
---

# /check-production

Audit production health. Output findings as a structured report.

## What This Does

1. Query Sentry for unresolved issues
2. Check Vercel logs for recent errors
3. Test health endpoints
4. Check GitHub Actions for CI/CD failures
5. Output prioritized findings (P0-P3)

**This is a primitive.** It only investigates and reports. Use `/log-production-issues` to create GitHub issues or `/triage` to fix.

## Process

### 1. Sentry Check

```bash
# Run triage script if available
~/.claude/skills/triage/scripts/check_sentry.sh 2>/dev/null || echo "Sentry check unavailable"
```

Or spawn Sentry MCP query if configured.

### 2. Vercel Logs Check

```bash
# Check for recent errors
~/.claude/skills/triage/scripts/check_vercel_logs.sh 2>/dev/null || vercel logs --output json 2>/dev/null | head -50
```

### 3. Health Endpoints

```bash
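# Test health endpoint (fallback reads the app URL from .env.local; -f2- keeps any "=" inside the value)
~/.claude/skills/triage/scripts/check_health_endpoints.sh 2>/dev/null || curl -sf "$(grep -m1 '^NEXT_PUBLIC_APP_URL=' .env.local 2>/dev/null | cut -d= -f2- | tr -d '"')/api/health" | jq .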
```
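
The fallback above derives the health URL from `.env.local` with `grep` and `cut`. As a rough illustration of the same lookup, here is a minimal TypeScript sketch; `healthUrlFromEnv` is a hypothetical helper (not part of this skill), and the `/api/health` path is simply the one used in the command above.

```typescript
// Hypothetical helper: derive the health-check URL from .env-style text.
// Assumes simple KEY=value lines; does not handle `export` prefixes or multi-line values.
function healthUrlFromEnv(envText: string, key: string = "NEXT_PUBLIC_APP_URL"): string | null {
  const lines = envText.split(/\r?\n/);
  for (const rawLine of lines) {
    const line = rawLine.trim();
    // Skip blank lines and comments.
    if (line === "" || line.charAt(0) === "#") continue;
    const eq = line.indexOf("=");
    if (eq === -1) continue;
    if (line.slice(0, eq).trim() !== key) continue;
    // Keep everything after the first "=" so values containing "=" survive.
    let value = line.slice(eq + 1).trim();
    // Strip one pair of matching surrounding quotes, if present.
    const first = value.charAt(0);
    const last = value.charAt(value.length - 1);
    if (value.length >= 2 && (first === '"' || first === "'") && last === first) {
      value = value.slice(1, -1);
    }
    if (value === "") return null;
    // Drop trailing slashes to avoid "//api/health".
    return value.replace(/\/+$/, "") + "/api/health";
  }
  return null;
}
```

Returning `null` when the variable is missing makes it easy to report "health URL unknown" as a finding instead of probing a malformed URL.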

### 4. GitHub CI/CD Check

```bash
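# Check for failed workflow runs on the default branch
# (gh exits 0 with an empty list for an unknown branch, so a `main || master` fallback never fires)
default_branch="$(gh repo view --json defaultBranchRef --jq .defaultBranchRef.name 2>/dev/null)"
gh run list --branch "${default_branch:-main}" --status failure --limit 5 2>/dev/null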

# Get details on most recent failure
gh run list --status failure --limit 1 --json databaseId,name,conclusion,createdAt,headBranch 2>/dev/null

# Check for stale/stuck workflows
gh run list --status in_progress --json databaseId,name,createdAt 2>/dev/null
```

**What to look for:**
- Failed runs on main/master branch (broken CI)
- Failed runs on feature branches blocking PRs
- Stuck/in-progress runs that should have completed
- Patterns in failure types (tests, lint, build, deploy)
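
The checklist above can be applied mechanically to the JSON that `gh run list --json databaseId,name,status,conclusion,createdAt,headBranch` returns. The following is a minimal sketch under stated assumptions: `summarizeRuns` and the 60-minute "stuck" threshold are hypothetical choices for illustration, not something this skill defines.

```typescript
// Fields requested from `gh run list --json ...`.
interface WorkflowRun {
  databaseId: number;
  name: string;
  status: string;     // e.g. "completed", "in_progress"
  conclusion: string; // e.g. "failure", "success"; empty while still running
  createdAt: string;  // ISO 8601 timestamp
  headBranch: string;
}

interface CiSummary {
  defaultBranchFailures: WorkflowRun[];
  featureBranchFailures: WorkflowRun[];
  stuckRuns: WorkflowRun[];
}

// Hypothetical helper: bucket runs into the categories from the checklist.
// `stuckAfterMinutes` is an assumed threshold; tune it to your slowest workflow.
function summarizeRuns(
  runs: WorkflowRun[],
  defaultBranch: string,
  now: Date,
  stuckAfterMinutes: number = 60
): CiSummary {
  const summary: CiSummary = {
    defaultBranchFailures: [],
    featureBranchFailures: [],
    stuckRuns: [],
  };
  for (const run of runs) {
    if (run.conclusion === "failure") {
      if (run.headBranch === defaultBranch) {
        summary.defaultBranchFailures.push(run);
      } else {
        summary.featureBranchFailures.push(run);
      }
    } else if (run.status === "in_progress") {
      // Age in minutes since the run was created.
      const ageMinutes = (now.getTime() - Date.parse(run.createdAt)) / 60000;
      if (ageMinutes > stuckAfterMinutes) summary.stuckRuns.push(run);
    }
  }
  return summary;
}
```

Under the priority mapping used in this skill, default-branch failures would surface as P1 and feature-branch failures as P2.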

### 5. Quick Application Checks

```bash
# Check for error handling gaps: empty catch blocks such as `catch (e) {}` or `catch {}`
grep -rnE "catch\s*(\([^)]*\))?\s*\{\s*\}" --include="*.ts" --include="*.tsx" src/ app/ 2>/dev/null | head -5
# Empty catch blocks = silent failures
```
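
Because `grep` works line by line, it cannot see an empty catch block whose braces sit on separate lines. As an illustration only, the sketch below scans a whole source string instead; `findEmptyCatches` is a hypothetical helper, and a regex pass like this is a heuristic rather than a real parser (it ignores blocks that contain only a comment, for example).

```typescript
// Hypothetical helper: return 1-based line numbers where an empty catch block starts.
// Matches `catch (e) {}`, `catch {}`, and the same forms with whitespace or newlines inside the braces.
function findEmptyCatches(source: string): number[] {
  const pattern = /catch\s*(\([^)]*\))?\s*\{\s*\}/g;
  const lines: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(source)) !== null) {
    // Line number = count of newlines before the match start, plus one.
    const before = source.slice(0, match.index);
    lines.push(before.split("\n").length);
  }
  return lines;
}
```

Each returned line number can be reported as a P2 "silent failure" finding alongside the file path.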

## Output Format

```markdown
## Production Health Check

### P0: Critical (Active Production Issues)
- [SENTRY-123] PaymentIntent failed - 23 users affected (Score: 147)
  Location: api/checkout.ts:45
  First seen: 2h ago

### P1: High (Degraded Performance / Broken CI)
- Health endpoint slow: /api/health responding in 2.3s (should be <500ms)
- Vercel logs show 5xx errors in last hour (count: 12)
- [CI] Main branch failing: "Build" workflow (run #1234)
  Failed step: "Type check"
  Error: Type 'string' is not assignable to type 'number'

### P2: Medium (Warnings)
- 3 empty catch blocks found (silent failures)
- Health endpoint missing database connectivity check
- [CI] 3 feature branch workflows failing (blocking PRs)

### P3: Low (Improvements)
- Consider adding Sentry performance monitoring
- Health endpoint could include more service checks

## Summary
- P0: 1 | P1: 3 | P2: 3 | P3: 2
- Recommendation: Fix P0 immediately, then fix main branch CI
```
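
If findings are collected programmatically, the report layout above can be generated rather than hand-written. This is a minimal sketch, assuming a hypothetical `Finding` shape and `renderReport` helper; it reproduces the section headings and summary counts from the example and leaves the recommendation line to the caller.

```typescript
type Priority = "P0" | "P1" | "P2" | "P3";

// Hypothetical shape for a single finding; `details` become indented lines under the bullet.
interface Finding {
  priority: Priority;
  title: string;
  details?: string[];
}

// Section headings copied from the example report.
const HEADINGS: Record<Priority, string> = {
  P0: "P0: Critical (Active Production Issues)",
  P1: "P1: High (Degraded Performance / Broken CI)",
  P2: "P2: Medium (Warnings)",
  P3: "P3: Low (Improvements)",
};

const ORDER: Priority[] = ["P0", "P1", "P2", "P3"];

function renderReport(findings: Finding[]): string {
  const out: string[] = ["## Production Health Check", ""];
  const counts: string[] = [];
  for (const priority of ORDER) {
    const group = findings.filter((f) => f.priority === priority);
    counts.push(priority + ": " + group.length);
    // Omit sections that have no findings.
    if (group.length === 0) continue;
    out.push("### " + HEADINGS[priority]);
    for (const finding of group) {
      out.push("- " + finding.title);
      for (const detail of finding.details || []) {
        out.push("  " + detail);
      }
    }
    out.push("");
  }
  out.push("## Summary");
  out.push("- " + counts.join(" | "));
  return out.join("\n");
}
```

Keeping the counts in the summary line derived from the same list avoids the summary drifting out of sync with the sections.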

## Priority Mapping

| Signal | Priority |
|--------|----------|
| Active errors affecting users | P0 |
| 5xx errors, slow responses | P1 |
| Main branch CI/CD failing | P1 |
| Feature branch CI blocking PRs | P2 |
| Silent failures, missing checks | P2 |
| Missing monitoring, improvements | P3 |
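
The table can also be expressed as a lookup so that priorities are assigned consistently. The sketch below is illustrative only: the `Signal` names and the `highestPriority` helper are hypothetical labels for the rows above, not identifiers defined by this skill.

```typescript
type Priority = "P0" | "P1" | "P2" | "P3";

// Hypothetical signal names, one or two per row of the table.
type Signal =
  | "active_user_errors"
  | "server_errors_5xx"
  | "slow_responses"
  | "main_ci_failing"
  | "feature_ci_blocking_prs"
  | "silent_failures"
  | "missing_checks"
  | "missing_monitoring"
  | "improvement";

const PRIORITY_BY_SIGNAL: Record<Signal, Priority> = {
  active_user_errors: "P0",
  server_errors_5xx: "P1",
  slow_responses: "P1",
  main_ci_failing: "P1",
  feature_ci_blocking_prs: "P2",
  silent_failures: "P2",
  missing_checks: "P2",
  missing_monitoring: "P3",
  improvement: "P3",
};

// Return the most urgent priority among the observed signals, or null if there are none.
function highestPriority(signals: Signal[]): Priority | null {
  let best: Priority | null = null;
  for (const signal of signals) {
    const priority = PRIORITY_BY_SIGNAL[signal];
    // "P0" < "P1" < "P2" < "P3" under plain string comparison.
    if (best === null || priority < best) best = priority;
  }
  return best;
}
```

For example, a run that observes both silent failures and a failing main branch would be summarized as P1 overall.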

## Health Endpoint Anti-Pattern

**Health checks that lie are worse than no health check.** Example:

```typescript
// ❌ BAD: Reports "ok" without checking
return { status: "ok", services: { database: "ok" } };

// ✅ GOOD: Honest liveness probe (no fake service status)
return { status: "ok", timestamp: new Date().toISOString() };

// ✅ BETTER: Real readiness probe
const dbStatus = await checkDatabase() ? "ok" : "error";
return { status: dbStatus === "ok" ? "ok" : "degraded", services: { database: dbStatus } };
```

If you can't verify a service, don't report on it. False "ok" status masks outages.

## Analytics Note

This skill checks production health (errors, logs, endpoints), not product analytics.

For analytics auditing, see `/check-observability`. Note:
- **PostHog** is REQUIRED for product analytics (has MCP server)
- **Vercel Analytics** is NOT acceptable (no CLI/API/MCP - unusable for our workflow)

If you need to investigate user behavior or funnels during incident response, query PostHog via MCP.

## Related

- `/log-production-issues` - Create GitHub issues from findings
- `/triage` - Fix production issues
- `/observability` - Set up monitoring infrastructure

Overview

This skill audits production health across error tracking, deployment logs, health endpoints, and CI/CD. It runs targeted checks against Sentry, Vercel logs, health endpoints, and GitHub Actions and returns a prioritized, structured findings report (P0–P3). Use the report to decide immediate actions or to create issues with the companion issue-creation primitive.

How this skill works

The skill queries the configured Sentry project for unresolved or high-impact issues, inspects recent Vercel logs for 5xx responses and runtime errors, and calls application health endpoints to validate readiness and liveness. It also checks GitHub Actions for failed or stuck workflow runs on main/master and feature branches. Findings are mapped to priority levels (P0–P3) and returned as a concise, actionable report; use `/log-production-issues` to file issues or `/triage` to start fixing.

When to use it

- After alerts or PagerDuty triggers to audit production impact
- During on-call rotations to validate system health
- Before or after deploys to confirm no regressions in CI/CD
- When investigating increased error rates or customer complaints
- To run regular production health audits and report to stakeholders

Best practices

- Prioritize P0 findings immediately and escalate to on-call owners
- Verify health endpoints perform honest checks; don’t report services you can’t validate
- Look for patterns across logs, Sentry issues, and CI failures to find root causes faster
- Automate scheduled runs and gate follow-up by creating tracked issues
- Include reproducible failure details: timestamps, error messages, stack locations, and affected users

Example use cases

- Identify a P0 Sentry error affecting live users and collect stack trace and user count
- Detect repeated 5xxs in Vercel logs and surface recent deploys or routes involved
- Find failing main-branch GitHub Actions that block production deploys and list failing steps
- Verify health endpoint returns real readiness and note missing dependency checks
- Discover silent failures via empty catch blocks and recommend code remediation

FAQ

Does this skill fix issues automatically?

No. It only inspects and reports findings. Use `/log-production-issues` to file issues or `/triage` to attempt fixes.

What do the priorities (P0–P3) mean?

P0 is active, user-impacting errors; P1 is degraded performance (5xx errors, slow responses) or failing main-branch CI; P2 is warnings such as silent failures, missing checks, or feature-branch CI failures that block PRs; P3 is low-priority improvements such as missing monitoring.