
root-cause-tracing skill


This skill traces bugs backward through the call stack to identify the original trigger and fix at the source.

npx playbooks add skill mamba-mental/agent-skill-manager --skill root-cause-tracing

---
name: Root Cause Tracing
description: Systematically trace bugs backward through call stack to find original trigger
when_to_use: when errors occur deep in execution and you need to trace back to find the original trigger
version: 1.1.0
languages: all
---

# Root Cause Tracing

## Overview

Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.

**Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.

## When to Use

```dot
digraph when_to_use {
    "Bug appears deep in stack?" [shape=diamond];
    "Can trace backwards?" [shape=diamond];
    "Fix at symptom point" [shape=box];
    "Trace to original trigger" [shape=box];
    "BETTER: Also add defense-in-depth" [shape=box];

    "Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
    "Can trace backwards?" -> "Trace to original trigger" [label="yes"];
    "Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
    "Trace to original trigger" -> "BETTER: Also add defense-in-depth";
}
```

**Use when:**
- Error happens deep in execution (not at entry point)
- Stack trace shows long call chain
- Unclear where invalid data originated
- Need to find which test/code triggers the problem

## The Tracing Process

### 1. Observe the Symptom
```
Error: git init failed in /Users/jesse/project/packages/core
```

### 2. Find Immediate Cause
**What code directly causes this?**
```typescript
await execFileAsync('git', ['init'], { cwd: projectDir });
```

### 3. Ask: What Called This?
```typescript
WorktreeManager.createSessionWorktree(projectDir, sessionId)
  → called by Session.initializeWorkspace()
  → called by Session.create()
  → called by test at Project.create()
```

### 4. Keep Tracing Up
**What value was passed?**
- `projectDir = ''` (empty string!)
- Empty string as `cwd` resolves to `process.cwd()`
- That's the source code directory!

### 5. Find Original Trigger
**Where did empty string come from?**
```typescript
const context = setupCoreTest(); // Returns { tempDir: '' } until beforeEach runs
Project.create('name', context.tempDir); // Runs at top level, before beforeEach, so it passes ''
```

## Adding Stack Traces

When you can't trace manually, add instrumentation:

```typescript
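import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const execFileAsync = promisify(execFile);

// Log the inputs and the caller's stack before the risky operation runs
async function gitInit(directory: string) {
  const stack = new Error().stack; // captured here, so it shows who called gitInit
  console.error('DEBUG git init:', {
    directory,
    cwd: process.cwd(),
    nodeEnv: process.env.NODE_ENV,
    stack,
  });

  await execFileAsync('git', ['init'], { cwd: directory });
}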
```

**Critical:** In tests, use `console.error()` rather than a logger; logger output may be suppressed and never appear.

**Run and capture:**
```bash
npm test 2>&1 | grep 'DEBUG git init'
```

**Analyze stack traces:**
- Look for test file names
- Find the line number triggering the call
- Identify the pattern (same test? same parameter?)
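
The first two checks above can be partly automated. The sketch below is a minimal, hypothetical helper (`findTestFrame` is not part of this skill) that scans a captured stack for the first frame pointing at a test file. It assumes V8-style frames of the form `at fn (path:line:col)` and a `*.test.*` or `*.spec.*` naming convention; adjust the pattern if your project differs.

```typescript
// Hypothetical helper: locate the first stack frame that points at a test file.
// Assumes V8-style frames ("    at fn (path:line:col)") and *.test.* / *.spec.* file names.
interface TestFrame {
  file: string;
  line: number;
}

function findTestFrame(stack: string): TestFrame | undefined {
  // Matches e.g. "/repo/src/project.test.ts:42:7"
  const pattern = /([^\s()]+\.(?:test|spec)\.[cm]?[jt]sx?):(\d+):\d+/;
  for (const frame of stack.split("\n")) {
    const match = pattern.exec(frame);
    const file = match?.[1];
    const line = match?.[2];
    if (file !== undefined && line !== undefined) {
      return { file, line: Number(line) };
    }
  }
  return undefined; // no test file appears in this stack
}

// Made-up stack for illustration only.
const sampleStack = [
  "Error",
  "    at gitInit (/repo/src/git.ts:12:17)",
  "    at Session.create (/repo/src/session.ts:40:11)",
  "    at Object.<anonymous> (/repo/src/project.test.ts:8:9)",
].join("\n");

console.error("Triggering test frame:", findTestFrame(sampleStack));
```

Running this over every captured `DEBUG git init` stack makes the third check easier: if the same file and line keep showing up, that test is the likely trigger.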

## Finding Which Test Causes Pollution

If something appears during tests but you don't know which test:

Use the bisection script: @find-polluter.sh

```bash
./find-polluter.sh '.git' 'src/**/*.test.ts'
```

The script runs the tests one by one and stops at the first polluter. See the script itself for usage details.
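
The one-by-one search can also be expressed in TypeScript. The following is a sketch of the idea only, not the contents of `find-polluter.sh`: `findPolluter`, `runTest`, and `pollutionExists` are hypothetical names, and the two callbacks are injected so the search logic does not depend on any particular test runner. In practice `runTest` might shell out to the test command and `pollutionExists` might check whether `.git` exists.

```typescript
// Hypothetical sketch of a one-by-one polluter search (not the actual find-polluter.sh).
type RunTest = (testFile: string) => void;
type PollutionCheck = () => boolean;

function findPolluter(
  testFiles: readonly string[],
  runTest: RunTest,
  pollutionExists: PollutionCheck,
): string | undefined {
  // Starting from a dirty state would blame the first test unfairly.
  if (pollutionExists()) {
    throw new Error("Pollution already present; clean up before searching");
  }
  for (const testFile of testFiles) {
    try {
      runTest(testFile);
    } catch {
      // A failing test can still be the polluter, so fall through to the check.
    }
    if (pollutionExists()) {
      return testFile; // first test after which the artifact appears
    }
  }
  return undefined; // no single test produced the artifact
}
```

Checking for pollution before the loop matters: if the artifact is left over from an earlier run, every search would otherwise point at the first test file.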

## Real Example: Empty projectDir

**Symptom:** `.git` created in `packages/core/` (source code)

**Trace chain:**
1. `git init` runs in `process.cwd()` ← empty cwd parameter
2. WorktreeManager called with empty projectDir
3. Session.create() passed empty string
4. Test accessed `context.tempDir` before beforeEach
5. setupCoreTest() returns `{ tempDir: '' }` initially

**Root cause:** A top-level variable initializer read `context.tempDir` before `beforeEach` had set it, so it captured the empty string.

**Fix:** Made tempDir a getter that throws if accessed before beforeEach
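
A minimal sketch of what such a getter could look like follows. The real `setupCoreTest()` is not shown in this document, so `createCoreTestContext` and `setTempDir` are hypothetical stand-ins; the point is only that reading `tempDir` too early fails loudly instead of yielding `''`.

```typescript
// Hypothetical shape; the real setupCoreTest() helper is not shown in this document.
interface CoreTestContext {
  readonly tempDir: string;
  setTempDir(dir: string): void; // assumed to be called from beforeEach
}

function createCoreTestContext(): CoreTestContext {
  let tempDir: string | undefined;
  return {
    get tempDir(): string {
      if (tempDir === undefined) {
        // Fail at the source instead of handing an empty string down the call chain.
        throw new Error("tempDir accessed before beforeEach initialized it");
      }
      return tempDir;
    },
    setTempDir(dir: string): void {
      tempDir = dir;
    },
  };
}
```

With this shape, the top-level `Project.create('name', context.tempDir)` call throws immediately at the offending line, so the stack trace points at the original trigger rather than at `git init`.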

**Also added defense-in-depth:**
- Layer 1: Project.create() validates directory
- Layer 2: WorktreeManager validates not empty
- Layer 3: NODE_ENV guard refuses git init outside tmpdir
- Layer 4: Stack trace logging before git init
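
The "not empty" and "outside tmpdir" checks from these layers could be combined in one guard, sketched below. `assertSafeGitInitDir` is a hypothetical name, and `tmpRoot` and `nodeEnv` are passed as parameters purely to keep the function easy to test; real code would likely supply `os.tmpdir()` and `process.env.NODE_ENV`. The original guards may well be structured differently.

```typescript
import * as path from "node:path";

// Hypothetical guard; tmpRoot and nodeEnv are parameters for testability
// (real code would likely pass os.tmpdir() and process.env.NODE_ENV).
function assertSafeGitInitDir(
  directory: string,
  tmpRoot: string,
  nodeEnv: string | undefined,
): string {
  // Reject empty input before it can silently fall back to the current directory.
  if (directory.trim() === "") {
    throw new Error("git init refused: directory is empty");
  }
  const resolved = path.resolve(directory);
  if (nodeEnv === "test") {
    const relative = path.relative(path.resolve(tmpRoot), resolved);
    // Require a dedicated subdirectory of tmpRoot: not tmpRoot itself, not a parent or sibling.
    const inside =
      relative !== "" &&
      relative !== ".." &&
      !relative.startsWith(".." + path.sep) &&
      !path.isAbsolute(relative);
    if (!inside) {
      throw new Error(`git init refused during tests: ${resolved} is outside ${tmpRoot}`);
    }
  }
  return resolved;
}
```

Comparing with `path.relative` rather than a plain string prefix avoids accepting a sibling such as `/tmp/work-other` when the allowed root is `/tmp/work`.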

## Key Principle

```dot
digraph principle {
    "Found immediate cause" [shape=ellipse];
    "Can trace one level up?" [shape=diamond];
    "Trace backwards" [shape=box];
    "Is this the source?" [shape=diamond];
    "Fix at source" [shape=box];
    "Add validation at each layer" [shape=box];
    "Bug impossible" [shape=doublecircle];
    "NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];

    "Found immediate cause" -> "Can trace one level up?";
    "Can trace one level up?" -> "Trace backwards" [label="yes"];
    "Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
    "Trace backwards" -> "Is this the source?";
    "Is this the source?" -> "Trace backwards" [label="no - keeps going"];
    "Is this the source?" -> "Fix at source" [label="yes"];
    "Fix at source" -> "Add validation at each layer";
    "Add validation at each layer" -> "Bug impossible";
}
```

**NEVER fix just where the error appears.** Trace back to find the original trigger.

## Stack Trace Tips

- **In tests:** Use `console.error()`, not a logger; the logger may be suppressed
- **Before operation:** Log before the dangerous operation, not after it fails
- **Include context:** Directory, cwd, environment variables, timestamps
- **Capture stack:** `new Error().stack` shows the complete call chain

## Real-World Impact

From debugging session (2025-10-03):
- Found root cause through 5-level trace
- Fixed at source (getter validation)
- Added 4 layers of defense
- 1847 tests passed, zero pollution

Overview

This skill systematically traces bugs backward through the call stack to find the original trigger rather than treating symptoms. It guides you to identify the immediate failing operation, walk up callers, and discover the root cause so you can fix the source. The skill also recommends defensive layers and lightweight instrumentation to prevent regressions.

How this skill works

It inspects failure points (errors, unexpected filesystem or DB state) and asks "what called this?" to step up the call chain until the originating input or initialization is found. When manual tracing stalls, it adds targeted instrumentation (stack capture, pre-operation logs) and test bisection to pinpoint the polluter. Finally, it prescribes fixes at the origin plus validation layers to prevent recurrence.

When to use it

  • An error appears deep inside a long call stack
  • Stack trace shows many frames or points to unexpected cwd/file paths
  • You see side effects in the repository or filesystem during tests
  • You don’t know which test or initialization produced invalid input
  • You want to harden code against future regressions

Best practices

  • Start from the symptom and identify the immediate failing call before guessing fixes
  • Ask who/what passed the suspicious value and trace one level up repeatedly
  • Add pre-operation debug logs that include directory, cwd, env, and new Error().stack
  • Use console.error in tests so output is visible; capture logs with redirected stderr
  • Add validation at multiple layers instead of only fixing the symptom

Example use cases

  • A test creates .git in the source tree—trace backward to find an empty cwd parameter and the test initialization bug
  • Database opened with the wrong path—log call stacks to find where the path was built or passed
  • Intermittent CI pollution—run a bisection script to find the first test that introduces the side effect
  • Preventative hardening—add guards that refuse file operations outside temp directories in non-production envs
  • Add stack-capture wrapper around risky operations (git init, file writes) to speed future debugging

FAQ

What if the call chain hits native or external code and stops?

Instrument the last controllable boundary by logging inputs and the full Error().stack before the call, then trace callers from that captured stack. If needed, add assertions validating inputs at that boundary.
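
One way to instrument such a boundary is a small wrapper that logs the arguments and the caller's stack before delegating. This is a sketch under assumptions: `withCallTrace` and its `log` parameter are hypothetical names, and the wrapped function here is a stand-in for whatever native or external call your code makes.

```typescript
// Hypothetical wrapper: logs arguments and the caller's stack, then invokes fn.
type Log = (message: string, details: Record<string, unknown>) => void;

function withCallTrace<Args extends unknown[], Result>(
  label: string,
  fn: (...args: Args) => Result,
  log: Log = (message, details) => { console.error(message, details); },
): (...args: Args) => Result {
  return (...args: Args): Result => {
    // Captured at call time, so the stack shows who invoked the wrapped function.
    const stack = new Error().stack;
    log(`DEBUG ${label}:`, { args, stack });
    return fn(...args);
  };
}

// Usage with a stand-in for an external call.
const tracedOpen = withCallTrace("open database", (dbPath: string) => `opened ${dbPath}`);
tracedOpen("/tmp/example.db");
```

Because the log happens before `fn` runs, the inputs are recorded even if the external call crashes or never returns.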

How do I find which test pollutes the repo?

Run tests one-by-one or use a bisection script that stops at the first polluter. Combine that with pre-operation debug logs to see the exact call and parameter causing the pollution.