home / skills / adaptationio / skrillz / multi-ai-debugging

multi-ai-debugging skill

/skills/multi-ai-debugging

npx playbooks add skill adaptationio/skrillz --skill multi-ai-debugging

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
14.4 KB
---
name: multi-ai-debugging
description: Systematic debugging using Claude, Gemini, and Codex as specialized agents. Multi-agent root cause analysis, log analysis, error classification, and auto-fix generation. Use when debugging production issues, analyzing error logs, performing root cause analysis, troubleshooting complex systems, or implementing self-healing patterns.
allowed-tools: Task, Read, Write, Edit, Glob, Grep, Bash
---

# Multi-AI Debugging

## Overview

multi-ai-debugging provides systematic debugging workflows using multiple AI models as specialized agents. Based on 2024-2025 best practices for AI-assisted debugging with multi-agent architectures.

**Purpose**: Systematic root cause analysis and fix generation using AI ensemble

**Pattern**: Task-based (6 independent debugging operations)

**Key Principles** (validated by tri-AI research):
1. **Multi-Agent Council** - Specialized agents debate root causes before consensus
2. **Evaluator/Critic Loops** - Fix agent + critic agent verify solutions
3. **Trace-Aware Analysis** - Full execution context, not just error messages
4. **Semantic Log Analysis** - LLM understanding beyond regex matching
5. **Cross-Stack Correlation** - Connect frontend, backend, infra issues
6. **Auto-Remediation** - Self-healing patterns where safe

**Quality Targets**:
- Root Cause Identification: >80% accuracy
- Time to Diagnosis: <30 minutes for common issues
- Fix Generation Success: >60% for known patterns
- False Positive Rate: <20% on error classification

---

## When to Use

Use multi-ai-debugging when:

- Debugging production incidents
- Analyzing error logs and stack traces
- Performing root cause analysis (RCA)
- Troubleshooting complex multi-service systems
- Investigating performance degradation
- Understanding cascading failures
- Writing post-mortem reports

**When NOT to Use**:
- Simple syntax errors (IDE handles these)
- Clear compile-time errors
- Well-documented known issues

---

## Prerequisites

### Required
- Error information (logs, stack trace, error message)
- Access to relevant code

### Recommended
- Execution traces (if available)
- System metrics/observability data
- Recent change history (git log)
- Gemini CLI for web research
- Codex CLI for deep code analysis

### Integration
- OpenTelemetry traces (ideal)
- Log aggregation (CloudWatch, Datadog, etc.)
- APM tools (optional)

---

## Operations

### Operation 1: Quick Error Diagnosis

**Time**: 2-5 minutes
**Automation**: 70%
**Purpose**: Fast initial diagnosis for common errors

**Process**:

1. **Analyze Error**:
```
Diagnose this error:

Error: [PASTE ERROR MESSAGE]
Stack trace: [PASTE STACK TRACE]

Provide:
1. What type of error is this?
2. Most likely root cause (1-2 sentences)
3. Immediate fix suggestion
4. Prevention recommendation
```

2. **Verify with Gemini** (for web research):
```bash
gemini -p "Search for solutions to this error:
[ERROR MESSAGE]

Find:
- Common causes
- Stack Overflow solutions
- GitHub issues with fixes"
```

3. **Output**: Quick diagnosis with fix suggestion

---

### Operation 2: Root Cause Analysis (RCA)

**Time**: 15-45 minutes
**Automation**: 50%
**Purpose**: Deep root cause analysis for complex issues

**Process**:

**Step 1: Gather Context (Context Agent)**
```bash
# Recent changes
git log --oneline -20
git diff HEAD~5..HEAD --stat

# Related logs
grep -r "ERROR\|Exception\|WARN" logs/ | tail -100

# System state
# Check for relevant metrics, traces, etc.
```

**Step 2: Hypothesis Generation (Analysis Agent)**
```
Perform root cause analysis:

**Error/Symptom**: [DESCRIPTION]

**Context**:
- Recent changes: [GIT LOG]
- Logs: [RELEVANT LOG ENTRIES]
- System state: [METRICS/OBSERVATIONS]
- When started: [TIMESTAMP]
- Affected scope: [USERS/SERVICES]

**Tasks**:
1. List 3-5 probable root causes ranked by likelihood
2. For each hypothesis:
   - Evidence supporting it
   - Evidence against it
   - Confidence level (High/Medium/Low)
3. Recommend investigation steps for top hypothesis
```

**Step 3: Cross-Validate (Verification Agent)**
```
Verify this root cause hypothesis:

Hypothesis: [TOP HYPOTHESIS]
Evidence: [SUPPORTING EVIDENCE]

Tasks:
1. What would we expect to see if this is correct?
2. What would disprove this hypothesis?
3. Design a reproduction test
4. Confidence assessment (0-100)
```

**Step 4: Generate RCA Report**
```markdown
## Root Cause Analysis Report

**Incident**: [DESCRIPTION]
**Date**: [DATE]
**Duration**: [DURATION]
**Impact**: [USERS/SERVICES AFFECTED]

### Timeline
- HH:MM - First error observed
- HH:MM - Investigation began
- HH:MM - Root cause identified
- HH:MM - Fix deployed

### Root Cause
[DETAILED EXPLANATION]

### Contributing Factors
1. [FACTOR 1]
2. [FACTOR 2]

### Resolution
[FIX APPLIED]

### Prevention
1. [ACTION ITEM 1]
2. [ACTION ITEM 2]
```

---

### Operation 3: Log Analysis & Classification

**Time**: 5-15 minutes
**Automation**: 80%
**Purpose**: Analyze and classify error logs

**Process**:

1. **Cluster Log Patterns**:
```
Analyze these log entries:

[PASTE 50-100 LOG LINES]

Tasks:
1. Identify unique error patterns (cluster similar errors)
2. Classify each pattern:
   - Type: (Bug/Config/Network/Resource/Security/User Error)
   - Severity: (Critical/High/Medium/Low)
   - Impact: (Data Loss/Service Down/Degraded/Minor)
3. Count occurrences per pattern
4. Identify the root pattern (original error vs cascading)
5. Recommend priority order for investigation
```

2. **Semantic Analysis**:
```
Perform semantic analysis on these logs:

[LOG ENTRIES]

Looking for:
- Anomalies in timing/sequence
- Correlation between events
- Hidden dependencies
- Patterns human might miss
```

3. **Output**: Classified and prioritized error report

---

### Operation 4: Multi-Agent Council Debugging

**Time**: 20-60 minutes
**Automation**: 40%
**Purpose**: Complex issues requiring multiple perspectives

**Process**:

**Launch Parallel Agents**:

```
Launch 4 debugging agents for this issue:

Issue: [DESCRIPTION]
Code: [RELEVANT CODE]
Logs: [RELEVANT LOGS]

Agent 1 (Code Reviewer):
"Analyze the code for bugs. Focus on:
- Logic errors
- Edge cases
- Race conditions
- Resource leaks"

Agent 2 (Log Analyzer):
"Analyze the logs for clues. Focus on:
- Error sequences
- Timing patterns
- State changes
- External dependencies"

Agent 3 (System Analyst):
"Analyze system context. Focus on:
- Resource constraints
- Configuration issues
- Dependency problems
- Infrastructure state"

Agent 4 (Historical Analyst):
"Analyze history. Focus on:
- Recent changes that could cause this
- Similar past incidents
- Regression indicators
- Pattern matching to known issues"
```

**Council Deliberation**:
```
Synthesize findings from all debugging agents:

Agent 1 (Code): [FINDINGS]
Agent 2 (Logs): [FINDINGS]
Agent 3 (System): [FINDINGS]
Agent 4 (History): [FINDINGS]

Tasks:
1. Find consensus root cause (where 2+ agents agree)
2. Resolve conflicting hypotheses
3. Combine evidence for strongest theory
4. Rate overall confidence (0-100)
5. Propose fix with verification steps
```

---

### Operation 5: Auto-Fix Generation

**Time**: 10-30 minutes
**Automation**: 60%
**Purpose**: Generate and verify fixes

**Process**:

**Step 1: Generate Fix (Fixer Agent)**
```
Generate a fix for this issue:

Issue: [ROOT CAUSE]
Code: [AFFECTED CODE]

Requirements:
1. Minimal change (fix only the issue)
2. Include error handling
3. Add comments explaining the fix
4. Suggest test cases to verify

Output format:
- File: [path]
- Before: [original code]
- After: [fixed code]
- Explanation: [why this fixes it]
```

**Step 2: Critique Fix (Critic Agent)**
```
Critique this proposed fix:

Issue: [ORIGINAL ISSUE]
Proposed Fix: [FIX CODE]

Evaluate:
1. Does it actually fix the root cause?
2. Could it introduce new bugs?
3. Edge cases not handled?
4. Performance implications?
5. Security implications?

Verdict: APPROVE / NEEDS_REVISION / REJECT
```

**Step 3: Generate Regression Test**
```
Generate a regression test for this fix:

Original Bug: [DESCRIPTION]
Fix Applied: [FIX CODE]

Create test that:
1. Would have caught the original bug
2. Verifies the fix works
3. Tests edge cases
4. Can run in CI/CD
```

---

### Operation 6: Self-Healing Patterns

**Time**: Variable
**Automation**: 90%
**Purpose**: Automated detection and remediation

**Process**:

**Define Remediation Playbooks**:
```python
# Example: Auto-remediation patterns
PLAYBOOKS = {
    "disk_space_low": {
        "detection": "disk_usage > 90%",
        "actions": [
            "compress_old_logs",
            "clear_temp_files",
            "alert_if_still_high"
        ]
    },
    "memory_leak_detected": {
        "detection": "memory_growth > 10%/hour",
        "actions": [
            "capture_heap_dump",
            "graceful_restart",
            "alert_team"
        ]
    },
    "error_rate_spike": {
        "detection": "error_rate > 5%",
        "actions": [
            "check_recent_deploys",
            "consider_rollback",
            "alert_on_call"
        ]
    }
}
```

**Configure Circuit Breakers**:
```python
# Intelligent circuit breaking
class AICircuitBreaker:
    def should_open(self, metrics):
        """AI predicts cascading failure risk."""
        prompt = f"""
        Given these metrics:
        - Error rate: {metrics['error_rate']}
        - Latency p99: {metrics['latency_p99']}
        - Dependencies health: {metrics['deps']}

        Should we open the circuit breaker?
        Risk of cascade: (Low/Medium/High)
        Recommendation: (OPEN/CLOSED/HALF_OPEN)
        """
        return analyze(prompt)
```

---

## Multi-AI Coordination

### Agent Assignment Strategy

| Task | Primary | Verification | Strength |
|------|---------|--------------|----------|
| Log analysis | Gemini | Claude | Fast, large context |
| Code analysis | Claude | Codex | Deep understanding |
| Root cause | Claude | Gemini | Reasoning + search |
| Fix generation | Claude | Codex | Code + review |
| Research | Gemini | Claude | Web search |

### Coordination Commands

**Gemini for Log Search**:
```bash
gemini -p "Analyze these logs and identify anomalies:
[LOGS]"
```

**Claude for Root Cause**:
```
Given this debugging context, what's the root cause?
[CONTEXT]
```

**Codex for Fix Validation**:
```bash
codex "Review this fix for correctness and edge cases:
[FIX]"
```

---

## Decision Trees

### Error Type Classification

```
Error Type Decision Tree:

1. Is there a stack trace?
   ├── Yes → Go to Code Error Analysis
   └── No → Go to System Error Analysis

2. Code Error Analysis:
   ├── NullPointer/TypeError → Missing null check
   ├── IndexOutOfBounds → Boundary condition
   ├── Timeout → Resource/network issue
   ├── Permission denied → Auth/authz issue
   └── Unknown → Multi-agent analysis

3. System Error Analysis:
   ├── Connection refused → Service down
   ├── Disk full → Resource exhaustion
   ├── Out of memory → Memory leak/sizing
   ├── CPU spike → Performance issue
   └── Unknown → Multi-agent analysis
```

### Severity Assessment

```
Severity Decision:

CRITICAL (P1):
- Data loss occurring
- Security breach active
- Service completely down
- Revenue impact immediate

HIGH (P2):
- Service degraded significantly
- Errors affecting >10% users
- Potential data integrity issues

MEDIUM (P3):
- Errors affecting <10% users
- Workaround available
- Non-critical feature broken

LOW (P4):
- Cosmetic issues
- Edge case errors
- No user impact
```

---

## Integration with Observability

### OpenTelemetry Pattern

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span("debug_session")
def debug_with_tracing(error):
    """Debug with full trace context."""
    span = trace.get_current_span()

    # Add debugging context to trace
    span.set_attribute("error.type", type(error).__name__)
    span.set_attribute("error.message", str(error))

    # Gather trace context for AI
    context = {
        "trace_id": span.get_span_context().trace_id,
        "parent_spans": get_parent_spans(),
        "related_logs": get_correlated_logs()
    }

    return multi_agent_debug(context)
```

### Log Correlation

```python
def correlate_logs_for_debugging(trace_id, time_window="5m"):
    """Gather all logs related to an incident."""
    # Query logs with trace correlation
    logs = query_logs(f"""
        trace_id:{trace_id} OR
        (level:ERROR AND timestamp:[now-{time_window} TO now])
    """)

    # Feed to AI for analysis
    return analyze_logs_with_ai(logs)
```

---

## Example Debugging Session

```
User: My API is returning 500 errors intermittently

Claude: Let me run multi-agent debugging to diagnose this.

[Launches Context, Analysis, and Verification agents]

## Debugging Report: Intermittent 500 Errors

### Agent Findings

**Log Analyzer**:
- Pattern: Errors spike every 5 minutes
- Correlation: Occurs during background job runs
- Log cluster: "Connection pool exhausted"

**Code Reviewer**:
- Found: Database connections not returned to pool in /api/users.py:145
- Missing: try/finally block for connection cleanup

**System Analyst**:
- Connection pool max: 20
- Concurrent requests during spike: 25+
- Background job opens 10 connections

**Historical**:
- Started after commit abc123 (2 days ago)
- Commit added new background job without connection limit

### Consensus Root Cause (Confidence: 92%)
Connection pool exhaustion due to:
1. New background job consuming connections
2. Missing connection cleanup in API handler

### Proposed Fix
```python
# /api/users.py:145
# Before
def get_user(id):
    conn = pool.get_connection()
    result = conn.execute("SELECT * FROM users WHERE id = ?", id)
    return result  # Connection never returned!

# After
def get_user(id):
    conn = pool.get_connection()
    try:
        result = conn.execute("SELECT * FROM users WHERE id = ?", id)
        return result
    finally:
        pool.return_connection(conn)  # Always return
```

### Verification Steps
1. Add connection tracking metric
2. Run load test
3. Monitor pool usage during background job

### Prevention
1. Add connection timeout
2. Implement connection leak detection
3. Review all pool usage
```

---

## Related Skills

- **multi-ai-code-review**: Review fixes before deployment
- **multi-ai-verification**: Verify fix effectiveness
- **multi-ai-testing**: Generate regression tests
- **ecs-troubleshooting**: Container-specific debugging
- **railway-troubleshooting**: Railway platform debugging

---

## References

- `references/log-analysis-patterns.md` - Log analysis techniques
- `references/self-healing-playbooks.md` - Auto-remediation patterns