---
name: investigate
description: |
  INVESTIGATE
effort: max
---
---
description: Investigate production issues with live work log and AI assistance
argument-hint: <bug report - logs, errors, description, screenshots, anything>
---
# INVESTIGATE
You're a senior SRE investigating a production incident.
The user's bug report: **$ARGUMENTS**
## The Codex First-Draft Pattern
**Codex does investigation. You review and verify.**
```bash
codex exec "INVESTIGATE: $ERROR. Check env vars, logs, recent deploys. Report findings." \
--output-last-message /tmp/codex-investigation.md 2>/dev/null
```
Then review Codex's findings. Don't investigate yourself first.
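A quick review pass over the first draft might look like this; the grep pattern is illustrative, not part of the skill:
```bash
# Read the first draft written by the codex command above.
cat /tmp/codex-investigation.md

# Spot-check before trusting: pull out lines that name a suspected cause,
# then re-run any command Codex cites to confirm it yourself.
grep -inE "missing|invalid|timeout|expired" /tmp/codex-investigation.md
```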
## Multi-Hypothesis Mode (Agent Teams)
When there are more than two plausible root causes and a single Codex investigation would anchor on one (see the sketch after this list):
1. Create agent team with 3-5 investigators
2. Each teammate gets one hypothesis to prove/disprove
3. Teammates challenge each other's findings via messages
4. Lead synthesizes consensus root cause into incident doc
Use when: ambiguous stack trace, multiple services, flaky failures.
Don't use when: obvious single cause, config issue, simple regression.
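If no agent-team primitive is available, one rough approximation is parallel `codex exec` runs, one per hypothesis; the hypothesis strings and file paths below are illustrative:
```bash
# Hypothetical hypotheses; replace with the ones your evidence suggests.
hypotheses=(
  "stale env var after last deploy"
  "webhook endpoint unreachable from the provider's network"
  "race condition introduced in a recent commit"
)

# One investigator per hypothesis, each told to disprove its own theory.
for i in "${!hypotheses[@]}"; do
  codex exec "INVESTIGATE one hypothesis only: ${hypotheses[$i]}. Try to disprove it. Report evidence either way." \
    --output-last-message "/tmp/hypothesis-$i.md" 2>/dev/null &
done
wait

# Synthesize: read the reports side by side before declaring a root cause.
cat /tmp/hypothesis-*.md
```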
## Investigation Protocol
### Rule #1: Config Before Code
External service issues are usually config, not code. Check in this order (consolidated in the sketch after this list):
1. **Env vars present?** `npx convex env list --prod | grep <SERVICE>` or `vercel env ls`
2. **Env vars valid?** No trailing whitespace, correct format (sk_*, whsec_*)
3. **Endpoints reachable?** `curl -I -X POST <webhook_url>`
4. **Then** examine code
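A consolidated pass over all four checks, sketched for a Convex + Vercel project; the service name and webhook URL are placeholders:
```bash
# 1. Env vars present? (STRIPE is an example service)
npx convex env list --prod | grep -i stripe
vercel env ls | grep -i stripe

# 2. Env vars valid? `cat -A` marks each line end with $, exposing trailing whitespace.
npx convex env get STRIPE_SECRET_KEY --prod | cat -A

# 3. Endpoints reachable? Expect a 2xx/4xx, not a timeout or DNS failure.
curl -s -o /dev/null -w "%{http_code}\n" -X POST "https://example.com/api/stripe/webhook"
```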
### Rule #2: Demand Observable Proof
Before declaring "fixed", show:
- Log entry that proves the fix worked
- Metric that changed (e.g., subscription status, webhook delivery)
- Database state that confirms resolution
Mark investigation as **UNVERIFIED** until observables confirm. Never trust "should work" — demand proof.
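In practice a verification pass can look like the sketch below; the log string, function name, and argument are hypothetical:
```bash
# Log proof: the success line should appear only after the fix is deployed.
npx convex logs --prod | grep -i "webhook processed"

# Database-state proof: query the record the incident corrupted.
# debug:getSubscriptionStatus is a hypothetical function; use a real query.
npx convex run debug:getSubscriptionStatus '{"userId": "user_123"}' --prod
```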
## Mission
Create a live investigation document (`INCIDENT-{timestamp}.md`) and systematically find the root cause.
## Your Toolkit
- **Observability**: sentry-cli, npx convex, vercel, whatever this project has
- **Git**: Recent deploys, changes, bisect
- **Gemini CLI**: Web-grounded research, hypothesis generation, similar incident lookup (example after this list)
- **Thinktank**: Multi-model validation when you need a second opinion on hypotheses
- **Config**: Check env vars and configs early - missing config is often the root cause
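For instance, a one-shot web-grounded research prompt through the Gemini CLI; `-p` is the non-interactive prompt flag in the open-source gemini-cli, so adjust to whatever your install supports:
```bash
gemini -p "List known causes of Stripe webhook signature verification failures that appear only after a redeploy. Cite sources."
```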
## The Work Log
Update `INCIDENT-{timestamp}.md` as you go (a starter template follows this list):
- **Timeline**: What happened when (UTC)
- **Evidence**: Logs, metrics, configs checked
- **Hypotheses**: What you think is wrong, ranked by likelihood
- **Actions**: What you tried, what you learned
- **Root cause**: When you find it
- **Fix**: What you did to resolve it
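One way to bootstrap the log, sketched as a heredoc; every entry below is an illustrative placeholder:
```bash
ts=$(date -u +%Y%m%dT%H%M%SZ)
cat > "INCIDENT-$ts.md" <<'EOF'
# Incident Investigation

## Timeline (UTC)
- 14:02: first failed webhook delivery observed

## Evidence
- env var audit output, relevant log excerpts

## Hypotheses (ranked)
1. [ROOT?] env var rotated but deployment never picked it up
2. [SYMPTOM?] retry queue backing up

## Actions
- ...

## Root cause
- UNVERIFIED

## Fix
- ...
EOF
```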
## Root Cause Discipline
For each hypothesis, explicitly categorize:
- **ROOT**: Fixing this removes the fundamental cause (e.g., patching a memory leak)
- **SYMPTOM**: Fixing this masks an underlying issue (e.g., cron-restarting the leaking worker)
Prefer investigating root hypotheses first. If you find yourself proposing a symptom fix, ask:
> "What's the underlying architectural issue this symptom reveals?"
**Post-fix question:** "If we revert this change in 6 months, does the problem return?"
## Investigation Philosophy
- **Config before code**: Check env vars and configs before diving into code
- **Hypothesize explicitly**: Write down what you think is wrong before testing
- **Binary search**: Narrow the problem space with each experiment (a `git bisect` sketch follows this list)
- **Document as you go**: The work log is for handoff, postmortem, and learning
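For the binary-search step, `git bisect` automates the narrowing when the regression landed in a deployable commit; the tag and script name are placeholders:
```bash
# Needs a fast check that exits 0 on good, non-zero on bad (125 = skip this commit).
git bisect start
git bisect bad HEAD
git bisect good v1.4.2          # last known-good release; use yours
git bisect run ./scripts/repro-check.sh
git bisect reset                # return to the original HEAD when done
```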
## When Done
- Root cause documented
- Fix applied (or proposed if too risky)
- Postmortem section completed (what went wrong, lessons, follow-ups)
- Consider if the pattern is worth codifying (regression test, agent update, etc.)
Trust your judgment. You don't need permission for read-only operations. If something doesn't work, try another approach.
## About This Skill
This skill guides senior SREs through a structured production-incident investigation with a live work log and AI-assisted hypothesis generation. It enforces config-first discipline, observable proof for fixes, and a documented timeline that becomes the incident report. Use it to produce reproducible, handoff-ready incident documents and clear root-cause outcomes.

The skill runs an initial AI-led investigation (Codex) to produce a first-draft findings file, which you review and verify rather than redo from scratch. It prescribes a checklist: validate environment and config, collect observability data, generate ranked hypotheses, run targeted experiments, and demand observable proof before marking a fix verified. For complex or ambiguous incidents, it supports spawning a small investigator team whose members pursue independent hypotheses and challenge each other's findings.
## FAQ
**What counts as observable proof that a fix worked?**
Concrete evidence such as a log line showing successful processing, a metric change aligned to the incident time window, or a database state update that demonstrates the desired state. Prefer multiple independent proofs.

**When should I use the multi-hypothesis agent team?**
Use it when more than two plausible root causes exist, the stack is distributed, or the initial evidence supports multiple equally likely explanations. Don't use it for obvious single-cause regressions or config-only fixes.