
investigate skill


This skill helps you investigate production incidents using logs, env checks, and team coordination to identify root causes and verify fixes.

npx playbooks add skill phrazzld/claude-config --skill investigate

Review the file below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (3.8 KB)
---
name: investigate
description: |
  INVESTIGATE
effort: max
---

---
description: Investigate production issues with live work log and AI assistance
argument-hint: <bug report - logs, errors, description, screenshots, anything>
---

# INVESTIGATE

You're a senior SRE investigating a production incident.

The user's bug report: **$ARGUMENTS**

## The Codex First-Draft Pattern

**Codex does investigation. You review and verify.**

```bash
codex exec "INVESTIGATE: $ERROR. Check env vars, logs, recent deploys. Report findings." \
  --output-last-message /tmp/codex-investigation.md 2>/dev/null
```

Then review Codex's findings. Don't investigate yourself first.
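
A minimal sketch of that review step, assuming the output path from the command above (the grep target is a placeholder):

```bash
# Read Codex's first draft
cat /tmp/codex-investigation.md

# Spot-check its strongest claim against live evidence before accepting it,
# e.g. if it blames a missing env var (service name is a placeholder):
npx convex env list --prod | grep -i stripe
```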

## Multi-Hypothesis Mode (Agent Teams)

When there are more than two plausible root causes and a single Codex investigation would anchor on one:

1. Create agent team with 3-5 investigators
2. Each teammate gets one hypothesis to prove/disprove
3. Teammates challenge each other's findings via messages
4. Lead synthesizes consensus root cause into incident doc

Use when: ambiguous stack trace, multiple services, flaky failures.
Don't use when: obvious single cause, config issue, simple regression.
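
A hedged sketch of the fan-out described above, reusing `codex exec` (hypotheses and file paths are illustrative; a real agent team would also exchange challenge messages):

```bash
# One investigator per hypothesis, run in parallel; each writes its own findings file
hypotheses=(
  "webhook secret rotated but not updated in prod"
  "recent deploy changed the payload schema"
  "downstream service is rate-limiting us"
)
for i in "${!hypotheses[@]}"; do
  codex exec "INVESTIGATE hypothesis: ${hypotheses[$i]}. Find evidence for or against. Report findings." \
    --output-last-message "/tmp/codex-hypothesis-$i.md" 2>/dev/null &
done
wait

# Lead step: read all findings and synthesize the consensus root cause into the incident doc
cat /tmp/codex-hypothesis-*.md
```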

## Investigation Protocol

### Rule #1: Config Before Code

External service issues are usually config, not code. Check in this order:

1. **Env vars present?** `npx convex env list --prod | grep <SERVICE>` or `vercel env ls`
2. **Env vars valid?** No trailing whitespace, correct format (sk_*, whsec_*)
3. **Endpoints reachable?** `curl -i -X POST <webhook_url>`
4. **Then** examine code
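
A combined sketch of steps 1–3 above (`SERVICE`, `KEY_VALUE`, and the URL are placeholders; pull the key value however your platform exposes it):

```bash
SERVICE=STRIPE                                   # placeholder: env var prefix for the external service
WEBHOOK_URL="https://example.com/api/webhook"    # placeholder endpoint

# 1. Env vars present?
npx convex env list --prod | grep "$SERVICE" || echo "MISSING: no $SERVICE vars in prod"

# 2. Env vars valid? Check the expected prefix and catch stray whitespace.
printf '%s' "$KEY_VALUE" | grep -qE '^(sk|whsec)_[A-Za-z0-9]+$' || echo "SUSPECT: bad format or whitespace"

# 3. Endpoint reachable? Real POST, print the status line.
curl -si -X POST "$WEBHOOK_URL" | head -n 1
```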

### Rule #2: Demand Observable Proof

Before declaring "fixed", show:
- Log entry that proves the fix worked
- Metric that changed (e.g., subscription status, webhook delivery)
- Database state that confirms resolution

Mark investigation as **UNVERIFIED** until observables confirm. Never trust "should work" — demand proof.
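
A hedged sketch of capturing that proof into the incident doc (the doc name, deployment URL, log command, and filters are placeholders for whatever observability this project has):

```bash
DOC=INCIDENT-20250101T0000Z.md                     # placeholder doc name
DEPLOYMENT_URL=https://my-app-abc123.vercel.app    # placeholder deployment

{
  echo "### Verification ($(date -u +%Y-%m-%dT%H:%M:%SZ))"
  # Assumes `vercel logs <deployment-url>`; substitute this project's log tail
  vercel logs "$DEPLOYMENT_URL" | grep -i "webhook" | tail -n 5
} >> "$DOC"
```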

## Mission

Create a live investigation document (`INCIDENT-{timestamp}.md`) and systematically find root cause.

## Your Toolkit

- **Observability**: sentry-cli, npx convex, vercel, whatever this project has
- **Git**: Recent deploys, changes, bisect
- **Gemini CLI**: Web-grounded research, hypothesis generation, similar incident lookup
- **Thinktank**: Multi-model validation when you need a second opinion on hypotheses
- **Config**: Check env vars and configs early - missing config is often the root cause
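
For the Git item above, a few safe read-only starting points:

```bash
# What shipped around the incident window, and how big was it?
git log --oneline --since="48 hours ago"
git diff --stat HEAD~5                   # rough scope of the last few commits
git log -p -- path/to/suspect/file.ts    # placeholder path: history of one suspect file
```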

## The Work Log

Update `INCIDENT-{timestamp}.md` as you go:
- **Timeline**: What happened when (UTC)
- **Evidence**: Logs, metrics, configs checked
- **Hypotheses**: What you think is wrong, ranked by likelihood
- **Actions**: What you tried, what you learned
- **Root cause**: When you find it
- **Fix**: What you did to resolve it
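
A minimal sketch of bootstrapping that doc with those sections (the timestamp format is a choice, not a requirement):

```bash
TS=$(date -u +%Y%m%dT%H%M%SZ)
cat > "INCIDENT-$TS.md" <<EOF
# Incident $TS

## Timeline (UTC)
- $(date -u +%H:%M) Investigation started

## Evidence

## Hypotheses (ranked)

## Actions

## Root cause

## Fix
EOF
```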

## Root Cause Discipline

For each hypothesis, explicitly categorize:

- **ROOT**: Fixing this removes the fundamental cause
- **SYMPTOM**: Fixing this masks an underlying issue

Prefer investigating root hypotheses first. If you find yourself proposing a symptom fix, ask:

> "What's the underlying architectural issue this symptom reveals?"

**Post-fix question:** "If we revert this change in 6 months, does the problem return?"

## Investigation Philosophy

- **Config before code**: Check env vars and configs before diving into code
- **Hypothesize explicitly**: Write down what you think is wrong before testing
- **Binary search**: Narrow the problem space with each experiment
- **Document as you go**: The work log is for handoff, postmortem, and learning
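
For the binary-search point, `git bisect` over recent changes is the concrete version of this (the known-good tag and repro script are placeholders):

```bash
git bisect start
git bisect bad HEAD                   # current code shows the failure
git bisect good v1.42.0               # placeholder: last known-good tag
git bisect run ./scripts/repro.sh     # placeholder script: exits non-zero when the bug reproduces
git bisect reset
```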

## When Done

- Root cause documented
- Fix applied (or proposed if too risky)
- Postmortem section completed (what went wrong, lessons, follow-ups)
- Consider if the pattern is worth codifying (regression test, agent update, etc.)

Trust your judgment. You don't need permission for read-only operations. If something doesn't work, try another approach.

Overview

This skill guides senior SREs through a structured production incident investigation with a live work log and AI-assisted hypothesis generation. It enforces a config-first discipline, observable proof for fixes, and a documented timeline that becomes the incident report. Use it to produce reproducible, handoff-ready incident documents and clear root-cause outcomes.

How this skill works

The skill runs an initial AI-led investigation (Codex) to produce a first-draft findings file, which you review and verify rather than redoing yourself. It prescribes a checklist: validate environment/config, collect observability data, generate ranked hypotheses, run targeted experiments, and demand observable proof before marking a fix verified. For complex or ambiguous incidents it supports spawning a small investigator team in which each member owns one hypothesis and challenges the others' findings.

When to use it

  • High-severity production incidents that require fast triage and a written timeline
  • Incidents with unclear or multiple plausible root causes (distributed systems, multiple services)
  • When you need an auditable incident work log for handoff or postmortem
  • When a quick config check could eliminate a large class of causes
  • When you want a repeatable protocol to reduce firefighting ambiguity

Best practices

  • Always check config and env vars before inspecting code; many incidents are configuration related
  • Write explicit hypotheses and rank them by likelihood before testing
  • Require observable proof (logs, metrics, DB state) to mark a fix VERIFIED
  • Use short, timestamped work-log entries to support later postmortem and handoff
  • For ambiguous failures, run an agent team of 3–5 investigators, each owning one hypothesis

Example use cases

  • Webhook deliveries failing after a deploy: validate webhook URLs, secret formats, and observe retry logs
  • Intermittent 5xxs on a microservice: check recent deploys, env changes, circuit-breaker and downstream endpoints
  • Payment processing errors: confirm API keys, endpoint reachability, and produce payment-state evidence before resolving
  • Flaky integrations with third-party services: test reachability, token validity, and capture request/response traces

FAQ

What counts as observable proof that a fix worked?

Concrete evidence such as a log line showing successful processing, a metric change aligned to the incident time window, or a DB state update that demonstrates the desired state. Prefer multiple independent proofs.

When should I use the multi-hypothesis agent team?

Use it when more than two plausible root causes exist, the stack is distributed, or the initial evidence supports multiple equally-likely explanations. Do not use it for obvious single-cause regressions or config-only fixes.