
systematic-debugging skill


This skill applies a structured Fagan Inspection approach to diagnose stubborn bugs, guiding teams through phase-by-phase analysis and root-cause discovery.

npx playbooks add skill sammcj/agentic-coding --skill systematic-debugging

SKILL.md
---
name: performing-systematic-debugging-for-stubborn-problems
description: Applies a modified Fagan Inspection methodology to systematically resolve persistent bugs and complex issues. Use when multiple previous fix attempts have failed, when dealing with intricate system interactions, or when methodical root cause analysis is needed. Do not use for simple troubleshooting. Triggers after repeated failed debugging attempts on the same complex issue.
model: claude-opus-4-5-20251101
---

# Systematic Debugging with Fagan Inspection

This skill applies a modified Fagan Inspection methodology for systematic problem resolution when facing complex problems or stubborn bugs that have resisted multiple fix attempts.

## Process Overview

Follow these four phases sequentially. Do not skip phases or attempt fixes before completing the inspection.

### Phase 1: Initial Overview

Establish a clear understanding of the problem before analysis:

- **Explain the problem** in plain language without technical jargon
- **State expected behaviour** - what should happen
- **State actual behaviour** - what is happening instead
- **Document symptoms** - error messages, logs, observable failures
- **Context** - when does it occur, how often, under what conditions

**Output:** A clear problem statement that anyone could understand.
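
For example, a problem statement at this level might read (a purely hypothetical scenario):

```
Checkout intermittently fails for roughly 1 in 50 orders under load. The
customer should see a confirmation page; instead they see a generic error
and the order is saved twice. Logs show a "duplicate key violation" from
the orders table each time it happens.
```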

### Phase 2: Systematic Inspection

Perform a line-by-line walkthrough in the "Reader" role from Fagan Inspection. **Identify defects without attempting to fix them yet** - this is pure inspection.

Check against these defect categories (see the sketch after the list):

1. **Logic Errors**
   - Incorrect conditional logic (wrong operators, inverted conditions)
   - Loop conditions (infinite loops, premature termination)
   - Control flow issues (unreachable code, wrong execution paths)

2. **Boundary Conditions**
   - Off-by-one errors
   - Edge cases (empty inputs, null values, maximum values)
   - Array/collection bounds

3. **Error Handling**
   - Unhandled exceptions
   - Missing validations
   - Silent failures (errors caught but not logged)
   - Incorrect error recovery

4. **Data Flow Issues**
   - Variable scope problems
   - Data transformation errors
   - Type mismatches or coercion issues
   - State management (stale data, race conditions)

5. **Integration Points**
   - API calls (incorrect endpoints, malformed requests, missing headers)
   - Database interactions (query errors, transaction handling)
   - External dependencies (version mismatches, configuration issues)
   - Timing issues (async/await problems, race conditions)
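
For illustration, several of these categories can hide in just a few lines. A minimal hypothetical Python sketch (all names invented), annotated the way defects should be flagged during inspection:

```python
def latest_orders(orders: list, count: int) -> list:
    # Boundary condition: off-by-one - range(count + 1) yields count + 1
    # items, and raises IndexError when orders has count or fewer entries
    recent = [orders[i] for i in range(count + 1)]

    try:
        return [order.total for order in recent]
    except AttributeError:
        # Error handling: silent failure - the exception is swallowed
        # without logging, so callers see an empty list instead of a defect
        return []
```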

**Think aloud** during this phase. For each section of code:
- State what the code is intended to do
- Identify any discrepancies between intent and implementation
- Flag assumptions or unclear aspects
- Use ultrathink to reason more deeply about complex sections

**Output:** A categorised list of identified defects with line numbers and specific descriptions.

### Phase 3: Root Cause Analysis

After identifying issues, trace back to find the fundamental cause - not just symptoms.

**Five Whys Technique:**
- Ask "why" repeatedly (at least 3-5 times) to get to the underlying issue
- State each "why" explicitly in your analysis
- Example (sketched in code after this list):
  - Why did the API call fail? → Because the request was malformed
  - Why was it malformed? → Because the data wasn't serialised correctly
  - Why wasn't it serialised? → Because the serialiser expected a different type
  - Why did it expect a different type? → Because the schema was updated but the code wasn't
  - Root cause: Schema versioning mismatch between services
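
A hypothetical sketch of that example root cause (names invented for illustration) - the serialiser still assumes the old schema shape:

```python
def serialise_user_v1(user: dict) -> dict:
    # Written against schema v1, where "address" was a plain string
    return {"name": user["name"], "address": user["address"].strip()}

# Schema v2 made "address" a structured object, but this code path was
# never updated - the implicit assumption only surfaces at the call site
user_v2 = {"name": "Ada", "address": {"street": "1 Main St", "city": "Perth"}}

try:
    serialise_user_v1(user_v2)
except AttributeError as exc:
    # The failure appears here, far from its root cause in the schema change
    print(f"Malformed request: {exc}")
```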

**Consider:**
- Environmental factors (configuration, dependencies, runtime environment)
- Timing and concurrency (race conditions, async issues)
- Hidden assumptions in the code or system design
- Historical context (recent changes, migrations, updates)

**State assumptions explicitly:**
- "I'm assuming X because..."
- "This presumes that Y is always..."
- Flag any assumptions that need verification

**Output:** A clear statement of the root cause, the chain of reasoning that led to it, and any assumptions that need validation.

### Phase 4: Solution & Verification

Now propose specific fixes for each identified issue.

**For each proposed solution** (a worked example follows the list):
1. **Describe the fix** - what code/configuration changes are needed
2. **Explain why it resolves the root cause** - connect it back to Phase 3 analysis
3. **Consider side effects** - what else might this change affect
4. **Define verification steps** - how to confirm the fix works
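
Continuing the hypothetical schema-mismatch example from Phase 3, a sketch of a fix aimed at the root cause rather than the symptom:

```python
def serialise_user(user: dict) -> dict:
    # Handle both schema versions explicitly instead of patching the
    # call site that happened to crash
    address = user["address"]
    if isinstance(address, dict):  # schema v2: structured address object
        address = f"{address['street']}, {address['city']}"
    return {"name": user["name"], "address": address.strip()}
```

Side effect to consider here: every caller now receives a flattened address string, which may matter to downstream consumers.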

**Verification Planning** (an example test follows the list):
- Specific test cases that would have caught this bug
- Manual verification steps
- Monitoring or logging to add
- Edge cases to validate
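
For the same hypothetical example, pytest-style regression tests that would have caught the bug (assuming the `serialise_user` fix sketched above):

```python
def test_serialise_user_accepts_v2_addresses():
    # Would have caught the original defect: structured v2 addresses
    # must serialise without raising
    user = {"name": "Ada", "address": {"street": "1 Main St", "city": "Perth"}}
    assert serialise_user(user)["address"] == "1 Main St, Perth"

def test_serialise_user_still_accepts_v1_addresses():
    # Edge case: legacy v1 payloads with plain-string addresses
    user = {"name": "Ada", "address": " 1 Main St "}
    assert serialise_user(user)["address"] == "1 Main St"
```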

**Output:** A structured list of fixes with verification steps.

## Important Guidelines

- **Complete each phase thoroughly** before moving to the next
- **Think aloud** - verbalise your reasoning throughout
- **State assumptions explicitly** rather than making implicit ones
- **Flag unclear aspects** rather than guessing - if something is uncertain, say so
- **Use available tools** - read files, search code, run tests, check logs
- **Focus on systematic analysis** over quick fixes
- **Validate flagged aspects** - after completing all phases, revisit any unclear points and use the think tool with "ultra" depth if needed to clarify them

## Final Output

After completing all four phases, provide:

1. **Summary of findings** - key defects and root cause
2. **Proposed solutions** - prioritised list with rationale
3. **Verification plan** - how to confirm fixes work
4. **Next steps** - unless the user indicates otherwise, proceed to implement the proposed solutions

## When This Skill Should NOT Be Used

- For simple, obvious bugs with clear fixes
- When the first debugging attempt is still underway
- For new features (this is for debugging existing code)
- When the problem is clearly environmental (config, infrastructure) and doesn't require code inspection

Overview

This skill applies a modified Fagan Inspection methodology to systematically resolve persistent, hard-to-find bugs and complex system issues. It is designed to run after multiple failed fix attempts and enforces a disciplined, four-phase inspection-to-fix workflow to find root causes and produce verifiable solutions.

How this skill works

The process runs through four sequential phases: (1) create a plain-language problem statement and capture symptoms, (2) perform a line-by-line inspection to catalogue defects without fixing them, (3) trace defects to a root cause using the Five Whys and environmental checks, and (4) propose targeted fixes with verification steps. The skill emphasises explicit assumptions, think-aloud reasoning, and verification planning before any code changes.

When to use it

  • After several failed attempts to fix the same complex bug
  • When issues involve intricate interactions between modules or services
  • When previous troubleshooting has been ad hoc or inconclusive
  • When a methodical root-cause analysis is required before changes
  • Not for simple or obvious bugs or initial troubleshooting

Best practices

  • Complete each inspection phase fully before moving on
  • Do not attempt fixes during inspection; document defects first
  • State assumptions explicitly and flag unclear areas for verification
  • Use concrete evidence (logs, traces, tests) to verify each claim
  • Plan verification tests that would have caught the bug earlier

Example use cases

  • A production crash that recurs despite multiple patches and lacks an obvious trigger
  • Intermittent failures across services where timing or race conditions are suspected
  • A persistent data corruption issue whose symptoms differ across environments
  • Complex API integration failures after a schema or version change
  • Hard-to-reproduce concurrency bugs in async workflows

FAQ

How long does a full inspection take?

Duration varies by codebase size and complexity; expect a few hours for a focused module and multiple days for large systems. The key is thoroughness, not speed.

Can fixes be applied during inspection?

No. Fixes come only after Phase 2 defects are catalogued and Phase 3 root causes are established. This prevents masking deeper issues.