root-cause-tracing skill

This skill helps you trace root causes backward through the call stack so you fix issues at their source rather than patching symptoms.

This is most likely a fork of the root-cause-tracing skill from mamba-mental

npx playbooks add skill bobmatnyc/claude-mpm-skills --skill root-cause-tracing

SKILL.md

---
name: Root Cause Tracing
description: Systematically trace bugs backward through call stack to find original trigger
when_to_use: when errors occur deep in execution and you need to trace back to find the original trigger
version: 2.0.0
languages: all
progressive_disclosure:
  entry_point:
    summary: "Trace bugs backward through call chains to find original triggers instead of fixing symptoms"
    when_to_use: "When errors manifest deep in execution, unclear data origins, or long call chains. Use AFTER systematic-debugging Phase 1."
    quick_start: "1. Observe symptom 2. Find immediate cause 3. Ask what called this 4. Keep tracing up 5. Fix at source + add defense"
  references:
    - tracing-techniques.md
    - examples.md
    - advanced-techniques.md
    - integration.md
context_limit: 800
tags:
  - debugging
  - root-cause
  - tracing
  - call-stack
---

# Root Cause Tracing

## Overview

Bugs often manifest deep in the call stack: git init runs in the wrong directory, a file is created in the wrong location, a database is opened with the wrong path. Your instinct is to fix where the error appears, but that only treats a symptom.

**Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.

This skill is a specialized technique within the systematic-debugging workflow, typically applied during Phase 1 (Root Cause Investigation) when dealing with deep call stacks.

## When to Use This Skill

**Use root-cause-tracing when:**
- Error happens deep in execution (not at entry point)
- Stack trace shows long call chain
- Unclear where invalid data originated
- Need to find which test/code triggers the problem
- Symptom appears far from actual cause

**Relationship with systematic-debugging:**
- systematic-debugging: The overall framework (Phases 1-4)
- root-cause-tracing: A specific technique for Phase 1 investigation
- Use root-cause-tracing WITHIN systematic-debugging Phase 1

## The Iron Law

```
NEVER FIX JUST WHERE THE ERROR APPEARS
ALWAYS TRACE BACK TO FIND THE ORIGINAL TRIGGER
```

Fixing symptoms creates band-aid solutions that mask root problems.

## Core Principles

1. **Trace Backward**: Follow call chain from symptom to source
2. **Find Original Trigger**: Identify where bad data/state originated
3. **Fix at Source**: Address root cause, not symptom
4. **Defense-in-Depth**: Add validation at each layer after fixing source

## Quick Start

### The 5-Step Trace Process

1. **Observe the Symptom**: What error message? What failed operation?
2. **Find Immediate Cause**: What code directly causes this error?
3. **Ask What Called This**: Trace one level up the call stack
4. **Keep Tracing Up**: Continue until you find the original trigger
5. **Fix at Source + Defense**: Fix root cause and add layer validation

### Decision Tree

```
Error appears deep in stack?
  → Yes: Start tracing backward
    → Can identify caller? → Trace one level up → Repeat
    → Cannot identify caller? → Add instrumentation (see advanced-techniques.md)
  → No: May not need tracing (error at entry point)
```
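
The loop in the decision tree can be sketched as a walk over a call chain you have written down while tracing. This is only an illustration under assumptions: the `Frame` type, the `receivedValueFromCaller` flag, and `findOrigin` are hypothetical names, not part of the skill, and the frame names are borrowed from the git init example later in this document.

```typescript
// Hypothetical model of one level in a hand-recorded call chain.
interface Frame {
  name: string;
  // true if this frame only passed the bad value along from its caller
  receivedValueFromCaller: boolean;
}

// Walk from the symptom upward. The first frame that did NOT receive the
// value from its caller is where the value originated.
function findOrigin(chainFromSymptom: Frame[]): Frame | undefined {
  for (const frame of chainFromSymptom) {
    if (!frame.receivedValueFromCaller) {
      return frame;
    }
  }
  // Caller unknown: add instrumentation and trace again.
  return undefined;
}

// Chain recorded from symptom (first) to source (last).
const chain: Frame[] = [
  { name: "execFileAsync('git', ['init'])", receivedValueFromCaller: true },
  { name: "WorktreeManager.createSessionWorktree", receivedValueFromCaller: true },
  { name: "Session.create", receivedValueFromCaller: true },
  { name: "Project.create", receivedValueFromCaller: true },
  { name: "test code", receivedValueFromCaller: true },
  { name: "setupCoreTest", receivedValueFromCaller: false },
];

const origin = findOrigin(chain);
```

An `undefined` result corresponds to the "Cannot identify caller?" branch: the recorded chain ends before the origin, so more instrumentation is needed.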

## The Tracing Process

**Example: Git init in wrong directory**
```
Error symptom → execFileAsync('git', ['init'], { cwd: '' })
  ← WorktreeManager.createSessionWorktree(projectDir='')
  ← Session.create() ← Project.create() ← Test code
  ← ROOT CAUSE: setupCoreTest() returns { tempDir: '' } when read before beforeEach runs
```

**At each level ask:** Where did this value come from? Is this the origin?

**For detailed tracing methodology, see [Tracing Techniques](references/tracing-techniques.md)**
**For complete real-world examples, see [Examples](references/examples.md)**

## After Finding Root Cause

**Fix at source** (throw if accessed before initialization) + **Add defense-in-depth** (validate at Project.create, WorkspaceManager, environment guards, instrumentation).

This prevents similar bugs and catches issues earlier.
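
A minimal sketch of what these two steps might look like, assuming a test helper shaped like the `setupCoreTest()` from the trace above. The helper body, the `initialize` method, and `assertUsableDirectory` are hypothetical and shown only to illustrate the idea; the real code may differ.

```typescript
// Hypothetical version of the test helper from the trace above.
function setupCoreTest() {
  // Stays undefined until the setup hook (e.g. beforeEach) has run.
  let currentTempDir: string | undefined;

  return {
    // Assumed to be called from the test framework's setup hook.
    initialize(dir: string): void {
      currentTempDir = dir;
    },
    // Fix at source: fail loudly instead of handing out '' or undefined.
    get tempDir(): string {
      if (!currentTempDir) {
        throw new Error("tempDir accessed before initialization (before beforeEach ran)");
      }
      return currentTempDir;
    },
  };
}

// Defense-in-depth: a validation layer placed closer to the dangerous operation.
function assertUsableDirectory(dir: string, layer: string): string {
  if (dir.trim() === "") {
    throw new Error(`${layer}: working directory must not be empty`);
  }
  return dir;
}
```

With the getter in place, reading `tempDir` too early fails at the source with a clear message, while a check such as `assertUsableDirectory(dir, "Project.create")` catches any empty value that still slips through another path.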

## Navigation

For detailed information:
- **[Tracing Techniques](references/tracing-techniques.md)**: Complete tracing methodology, patterns, and decision trees
- **[Examples](references/examples.md)**: Real-world debugging scenarios with full trace chains
- **[Advanced Techniques](references/advanced-techniques.md)**: Stack traces, instrumentation, test pollution detection
- **[Integration](references/integration.md)**: How to use with systematic-debugging and other skills

## Key Reminders

- NEVER fix just where the error appears
- ALWAYS trace back to find the original trigger
- Use `console.error()` for debugging in tests (logger may be suppressed)
- Log BEFORE the dangerous operation, not after it fails
- Include context: directory, cwd, environment, timestamps
- Add defense-in-depth after fixing source
- Document your trace as you go (write down the call chain)
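
The logging reminders above could be combined into one small helper. This is a sketch under assumptions, not the skill's own instrumentation: `TraceContext` and `describeBeforeOperation` are hypothetical names, and the context values are passed in so the example stays self-contained (in real code they would typically come from `process.cwd()` and `process.env`).

```typescript
// Hypothetical shape of the context worth recording before a risky call.
interface TraceContext {
  directory: string;
  cwd: string;
  nodeEnv: string | undefined;
}

// Builds the diagnostic message to emit BEFORE the dangerous operation runs.
function describeBeforeOperation(operation: string, context: TraceContext): string {
  // new Error().stack captures the current call chain without throwing.
  const stack = new Error().stack ?? "(stack unavailable)";
  return [
    `DEBUG about to run: ${operation}`,
    `  timestamp: ${new Date().toISOString()}`,
    // JSON.stringify makes an empty string visible as "".
    `  directory: ${JSON.stringify(context.directory)}`,
    `  cwd: ${JSON.stringify(context.cwd)}`,
    `  NODE_ENV: ${context.nodeEnv ?? "(unset)"}`,
    stack,
  ].join("\n");
}

// Usage sketch: log first, then perform the operation.
const message = describeBeforeOperation("git init", {
  directory: "",
  cwd: "/example/cwd", // placeholder; in real code: process.cwd()
  nodeEnv: "test", // placeholder; in real code: process.env.NODE_ENV
});
console.error(message); // console.error, since the logger may be suppressed in tests
```

Because the message is built before the operation runs, the empty `directory` and the captured stack are recorded even if the operation itself crashes or silently succeeds in the wrong place.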

## Red Flags - STOP

STOP when thinking:
- "I'll just add validation here" (without finding source)
- "This will prevent the error" (symptom fix)
- "Too hard to trace back" (add instrumentation instead)
- "Quick fix for now" (creates technical debt)

**ALL of these mean: Continue tracing to find root cause.**

## Integration with Other Skills

- **systematic-debugging**: Use root-cause-tracing during Phase 1
- **defense-in-depth**: Add after finding root cause
- **verification-before-completion**: Verify fix worked at source
- **test-driven-development**: Write test for root cause, not symptom

See [Integration](references/integration.md) for complete workflow examples.

## Real-World Impact

From debugging session (2025-10-03):
- Found root cause through 5-level trace
- Fixed at source (getter validation)
- Added 4 layers of defense
- 1847 tests passed, zero pollution
- Time saved: 3+ hours vs symptom-fix approach

**Bottom line:** Tracing takes 15-30 minutes. Symptom fixes take hours of whack-a-mole.

Overview

This skill teaches a disciplined method to trace bugs backward through the call stack until you locate the original trigger. It focuses on fixing the source of invalid data or state rather than treating symptoms. The result is fewer regressions, clearer fixes, and more reliable systems.

How this skill works

Start at the symptom and walk the call chain upward, inspecting who produced the offending value or state at each level. When the caller cannot explain the value, add lightweight instrumentation or logs and continue tracing until the origin is found. Fix the root issue and then add validation at surrounding layers for defense-in-depth.

When to use it

- An error manifests deep in the stack rather than at the entry point
- Stack traces show long call chains and unclear origin of bad data
- Tests or CI failures point to a downstream symptom with unknown trigger
- You want to avoid band-aid fixes that mask recurring problems
- Suspected test pollution or initialization order bugs
- Investigating where an incorrect path, config, or object first appears
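
For the test pollution case, one simple way to find the triggering test is to run candidates one at a time and check for the unwanted side effect after each. The sketch below assumes this approach; `findPolluter`, `runTest`, and `artifactExists` are hypothetical names, and the runner is faked so the example is self-contained.

```typescript
// Runs tests in order and returns the first one after which the unwanted
// artifact (for example, a stray .git directory) appears.
function findPolluter(
  testFiles: string[],
  runTest: (file: string) => void,
  artifactExists: () => boolean,
): string | undefined {
  if (artifactExists()) {
    throw new Error("artifact already present; clean up before searching");
  }
  for (const file of testFiles) {
    runTest(file);
    if (artifactExists()) {
      return file;
    }
  }
  return undefined; // no test in the list produced the artifact
}

// Fake runner for illustration: only b.test.ts creates the artifact.
let strayGitDirExists = false;
const polluter = findPolluter(
  ["a.test.ts", "b.test.ts", "c.test.ts"],
  (file) => {
    if (file === "b.test.ts") {
      strayGitDirExists = true;
    }
  },
  () => strayGitDirExists,
);
```

In a real project `runTest` would invoke the test runner on a single file and `artifactExists` would check the filesystem; once the polluting test is known, the backward trace starts from that test's setup.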

Best practices

- Always trace one level up from the immediate cause until you find the origin
- Log context before the operation (cwd, paths, timestamps) to capture origin data
- Fix the root cause first, then add validation checks at adjacent layers
- Instrument callers when the next-level caller is unclear or external
- Document the call chain as you trace so fixes are auditable and reproducible

Example use cases

- A test runs git init in the wrong folder: trace from the exec call up to the test setup that provided the empty cwd
- A database opens with an incorrect path: find which factory or environment variable produced that path
- Intermittent CI failure: follow the stack to locate a shared mutable fixture initialized in the wrong order
- API returns malformed data: trace from serializer error back to the data producer
- New feature causes downstream crashes: identify the new code path that injected invalid state

FAQ

How long should tracing take?

Most traces take 15–30 minutes; add brief instrumentation if a caller can’t be identified immediately.

When is adding validation enough?

Add validation only after fixing the source. Validation is a safety net, not a substitute for correcting the origin.