root-cause-tracing skill

This skill traces bugs backward through the call stack to locate the original trigger and fix at the source.

This is most likely a fork of the root-cause-tracing skill from mamba-mental

npx playbooks add skill yousufjoyian/claude-skills --skill root-cause-tracing

SKILL.md

---
name: root-cause-tracing
description: Systematically trace bugs backward through call stack to find original trigger
---

# Root Cause Tracing

## Overview

Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.

**Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.

## When to Use

```dot
digraph when_to_use {
    "Bug appears deep in stack?" [shape=diamond];
    "Can trace backwards?" [shape=diamond];
    "Fix at symptom point" [shape=box];
    "Trace to original trigger" [shape=box];
    "BETTER: Also add defense-in-depth" [shape=box];

    "Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
    "Can trace backwards?" -> "Trace to original trigger" [label="yes"];
    "Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
    "Trace to original trigger" -> "BETTER: Also add defense-in-depth";
}
```

**Use when:**
- Error happens deep in execution (not at entry point)
- Stack trace shows long call chain
- Unclear where invalid data originated
- Need to find which test/code triggers the problem

## The Tracing Process

### 1. Observe the Symptom
```
Error: git init failed in /Users/jesse/project/packages/core
```

### 2. Find Immediate Cause
**What code directly causes this?**
```typescript
await execFileAsync('git', ['init'], { cwd: projectDir });
```

### 3. Ask: What Called This?
```
WorktreeManager.createSessionWorktree(projectDir, sessionId)
  → called by Session.initializeWorkspace()
  → called by Session.create()
  → called by Project.create() in the test
```

### 4. Keep Tracing Up
**What value was passed?**
- `projectDir = ''` (empty string!)
- Empty string as `cwd` resolves to `process.cwd()`
- That's the source code directory!

### 5. Find Original Trigger
**Where did empty string come from?**
```typescript
const context = setupCoreTest(); // Returns { tempDir: '' }
Project.create('name', context.tempDir); // Accessed before beforeEach!
```

## Adding Stack Traces

When you can't trace manually, add instrumentation:

```typescript
// Before the problematic operation
async function gitInit(directory: string) {
  const stack = new Error().stack;
  console.error('DEBUG git init:', {
    directory,
    cwd: process.cwd(),
    nodeEnv: process.env.NODE_ENV,
    stack,
  });

  await execFileAsync('git', ['init'], { cwd: directory });
}
```

**Critical:** Use `console.error()` in tests, not the logger, whose output may be suppressed.

**Run and capture:**
```bash
# -A 20 also prints the lines after each match, where the multi-line stack appears
npm test 2>&1 | grep -A 20 'DEBUG git init'
```

**Analyze stack traces:**
- Look for test file names
- Find the line number triggering the call
- Identify the pattern (same test? same parameter?)
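
The first two checks above can be partly automated. The following is a minimal sketch, assuming V8-style stack lines (`at fn (file:line:column)`) and test files named `*.test.ts` or similar; `findTestFrames` and `StackFrame` are hypothetical names, not part of this skill.

```typescript
// Hypothetical helper: pull the frames that point at test files out of a stack string.
interface StackFrame {
  file: string;
  line: number;
  column: number;
}

// Matches the "file:line:column" tail of a V8-style frame, with or without parentheses.
const FRAME_LOCATION = /\(?([^()\s]+):(\d+):(\d+)\)?$/;

function findTestFrames(
  stack: string,
  // Assumption: test files end in .test.ts, .test.js, .test.tsx, and so on.
  isTestFile: (file: string) => boolean = (file) => /\.test\.[cm]?[jt]sx?$/.test(file),
): StackFrame[] {
  const frames: StackFrame[] = [];
  for (const rawLine of stack.split('\n')) {
    const match = FRAME_LOCATION.exec(rawLine.trim());
    if (!match) continue; // e.g. the leading "Error: ..." line
    const [, file, line, column] = match;
    if (isTestFile(file)) {
      frames.push({ file, line: Number(line), column: Number(column) });
    }
  }
  return frames;
}
```

Passing the logged `stack` string to `findTestFrames` returns only the frames located in test files, which makes it easier to see whether the same test and line show up in every captured trace.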

## Finding Which Test Causes Pollution

If an unexpected file or directory appears during a test run and you don't know which test creates it, use the polluter-search script: @find-polluter.sh

```bash
./find-polluter.sh '.git' 'src/**/*.test.ts'
```

The script runs the test files one at a time and stops at the first one that creates the artifact. See the script for usage.
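
The one-at-a-time search can be sketched in TypeScript as follows. This is not the contents of `find-polluter.sh`, which is not reproduced here; `findPolluter`, `runTestFile`, and `artifactExists` are hypothetical names, and both callbacks are injected so the sketch makes no assumption about the test runner.

```typescript
// Hypothetical sketch of a one-by-one polluter search; not find-polluter.sh itself.
type RunTestFile = (testFile: string) => void;
type ArtifactExists = (artifactPath: string) => boolean;

function findPolluter(
  testFiles: string[],
  artifactPath: string,
  runTestFile: RunTestFile,
  artifactExists: ArtifactExists,
): string | null {
  // Start from a clean state, otherwise the first test file would be blamed.
  if (artifactExists(artifactPath)) {
    throw new Error(`${artifactPath} already exists; remove it before searching`);
  }
  for (const testFile of testFiles) {
    try {
      runTestFile(testFile);
    } catch {
      // A failing test can still pollute, so fall through and check for the artifact.
    }
    if (artifactExists(artifactPath)) {
      return testFile; // first test file after which the artifact appears
    }
  }
  return null; // no single test file created the artifact
}
```

In practice `artifactExists` could be `existsSync` from `node:fs`, and `runTestFile` could shell out to the project's test runner for a single file. The scan is linear, so it finds the first polluter in file order rather than every polluter.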

## Real Example: Empty projectDir

**Symptom:** `.git` created in `packages/core/` (source code)

**Trace chain:**
1. `git init` runs in `process.cwd()` ← empty cwd parameter
2. WorktreeManager called with empty projectDir
3. Session.create() passed empty string
4. Test accessed `context.tempDir` before beforeEach
5. setupCoreTest() returns `{ tempDir: '' }` initially

**Root cause:** A top-level variable initializer read `context.tempDir` before `beforeEach` had set it, so it captured the empty string

**Fix:** Made tempDir a getter that throws if accessed before beforeEach
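
A minimal sketch of such a getter, assuming a context object whose `tempDir` is assigned later by the `beforeEach` hook. The real `setupCoreTest()` is not shown in this document, so `createTestContext` and `initialize` are hypothetical stand-ins.

```typescript
// Hypothetical shape of the test context; the real setupCoreTest() may differ.
interface TestContext {
  readonly tempDir: string;
  // Stand-in for the assignment that the beforeEach hook would perform.
  initialize(dir: string): void;
}

function createTestContext(): TestContext {
  let tempDir: string | undefined;
  return {
    get tempDir(): string {
      if (tempDir === undefined) {
        // Fail loudly at the point of misuse instead of handing out ''.
        throw new Error('tempDir accessed before beforeEach ran');
      }
      return tempDir;
    },
    initialize(dir: string): void {
      tempDir = dir;
    },
  };
}
```

With this shape, a top-level `Project.create('name', context.tempDir)` throws during module evaluation, so the mistake surfaces at the line that caused it rather than as a stray `.git` directory several calls later.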

**Also added defense-in-depth:**
- Layer 1: Project.create() validates directory
- Layer 2: WorkspaceManager validates not empty
- Layer 3: NODE_ENV guard refuses git init outside tmpdir
- Layer 4: Stack trace logging before git init
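
The empty-directory and `NODE_ENV` layers could look roughly like the guard below. It is a sketch under stated assumptions, not the project's code: `assertSafeGitInitDir` is a hypothetical name, the check compares resolved paths without following symlinks, and it treats the temp root itself as unsafe.

```typescript
import * as os from 'node:os';
import * as path from 'node:path';

// Hypothetical guard: reject empty directories and, in tests, anything outside the temp root.
function assertSafeGitInitDir(
  directory: string,
  nodeEnv: string | undefined = process.env.NODE_ENV,
  tmpRoot: string = os.tmpdir(),
): string {
  if (directory.trim() === '') {
    throw new Error('Refusing git init: directory is empty');
  }
  const resolved = path.resolve(directory);
  if (nodeEnv === 'test') {
    const relative = path.relative(path.resolve(tmpRoot), resolved);
    const outsideTmp =
      relative === '' || // the temp root itself (assumption: also unsafe)
      relative === '..' ||
      relative.startsWith('..' + path.sep) ||
      path.isAbsolute(relative); // e.g. a different drive on Windows
    if (outsideTmp) {
      throw new Error(`Refusing git init outside ${tmpRoot} during tests: ${resolved}`);
    }
  }
  return resolved;
}
```

Calling the guard immediately before `execFileAsync('git', ['init'], { cwd })` and passing its return value as `cwd` means an empty string can no longer reach the child process.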

## Key Principle

```dot
digraph principle {
    "Found immediate cause" [shape=ellipse];
    "Can trace one level up?" [shape=diamond];
    "Trace backwards" [shape=box];
    "Is this the source?" [shape=diamond];
    "Fix at source" [shape=box];
    "Add validation at each layer" [shape=box];
    "Bug impossible" [shape=doublecircle];
    "NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];

    "Found immediate cause" -> "Can trace one level up?";
    "Can trace one level up?" -> "Trace backwards" [label="yes"];
    "Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
    "Trace backwards" -> "Is this the source?";
    "Is this the source?" -> "Trace backwards" [label="no - keeps going"];
    "Is this the source?" -> "Fix at source" [label="yes"];
    "Fix at source" -> "Add validation at each layer";
    "Add validation at each layer" -> "Bug impossible";
}
```

**NEVER fix just where the error appears.** Trace back to find the original trigger.

## Stack Trace Tips

- **In tests:** Use `console.error()`, not the logger, because logger output may be suppressed
- **Before operation:** Log before the dangerous operation, not after it fails
- **Include context:** Directory, cwd, environment variables, timestamps
- **Capture stack:** `new Error().stack` shows the call chain; V8 keeps 10 frames by default, so raise `Error.stackTraceLimit` if the trace is cut off before the test file
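
When the call chain is deeper than the frames V8 records by default (10), the capture can be wrapped so the limit is raised only for that one call. A minimal sketch, assuming a V8-based runtime such as Node.js; `captureStack` is a hypothetical helper name.

```typescript
// Hypothetical helper: capture a deeper stack than V8's default of 10 frames.
function captureStack(limit = 50): string {
  const previousLimit = Error.stackTraceLimit;
  Error.stackTraceLimit = limit;
  try {
    return new Error('stack capture').stack ?? '';
  } finally {
    // Restore the previous limit so other code is unaffected.
    Error.stackTraceLimit = previousLimit;
  }
}
```

Using `captureStack()` in place of `new Error().stack` in the debug log keeps the extra depth local to the instrumentation.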

## Real-World Impact

From a debugging session (2025-10-03):
- Found root cause through 5-level trace
- Fixed at source (getter validation)
- Added 4 layers of defense
- 1847 tests passed, zero pollution

Overview

This skill systematically traces bugs backward through the call stack to find the original trigger and fix issues at their source. It combines call-chain analysis, lightweight instrumentation, and test bisection to identify where invalid values or actions originate. The goal is to stop fixing symptoms and instead eliminate the root cause and add layered defenses.

How this skill works

Start at the symptom and identify the immediate failing operation. Walk up the call chain—examining callers, passed values, and initialization timing—until you find the original trigger. When manual tracing is unclear, add simple runtime logging that captures parameters and a stack trace just before the dangerous operation. Use targeted test bisection to find which test or setup step pollutes global state or provides invalid inputs.

When to use it

- A runtime error appears deep in the stack rather than at an entry point
- Stack trace shows a long call chain and unclear origin of invalid data
- A test creates side effects or files in unexpected locations
- You need to identify which test or setup step causes environment pollution
- You want to harden code by fixing triggers rather than symptoms

Best practices

- Log context immediately before the risky operation: include parameters, process.cwd(), NODE_ENV, timestamps, and new Error().stack
- Use console.error in tests to ensure output appears in CI logs
- Trace one level up at a time: ask who called this and what value was passed
- Prefer fixes at the source and add validation in each layer (guard clauses and explicit checks)
- When tests misbehave, run a bisection that executes tests one-by-one to locate the polluter

Example use cases

- A git init ends up running in a source directory (packages/core/) because an empty cwd resolved to process.cwd()
- A test accesses a tempDir before beforeEach initialization, causing files to be created in source directories
- A database is opened with the wrong path due to a top-level variable initialized too early
- Introduce a getter that throws if accessed before setup to prevent premature initialization
- Add NODE_ENV checks and pre-operation validation to refuse dangerous actions outside a safe temp directory

FAQ

What if the stack trace doesn't show my test file names?

Add logging that prints new Error().stack immediately before the operation; run tests with redirected stderr (e.g. npm test 2>&1 | grep 'DEBUG') to capture the debug lines.

How do I avoid fixing only the symptom?

Always ask what called the failing code and inspect the passed values; if a value originated from setup or a top-level initializer, fix that initialization and add validation at each layer to prevent recurrence.