agent-debugger skill

This skill helps diagnose and resolve AI agent issues fast by collecting logs, running automated analyses, and proposing concrete fixes.

npx playbooks add skill avivk5498/my-claude-code-skills --skill agent-debugger

SKILL.md

---
name: agent-debugger
description: Systematic debugging toolkit for AI agentic workflows in customer support. Use when diagnosing issues with AI agents including wrong responses, tool/function calling problems, conversation loops, stuck states, or performance/latency issues. Works with any framework (LangChain, custom agents, Claude API) and accepts conversation logs, API logs, tool execution logs, and agent configurations.
---

# Agent Debugger

## Overview

Debug AI agent issues systematically using analysis scripts and proven debugging patterns. This skill helps identify root causes of common agent failures: incorrect responses, tool calling errors, conversation loops, performance problems, and more.

## When to Use This Skill

Trigger this skill when:
- Agent gives wrong or irrelevant responses
- Tools are not being called or are called incorrectly
- Conversation gets stuck in loops or repeated patterns
- Agent performance is slow or inconsistent
- Tool executions are failing or returning errors
- You need to analyze conversation logs or API traces

## Debugging Workflow

### Step 1: Gather Diagnostic Data

Collect these artifacts from the user:
- **Conversation logs** - Full transcript or chat history
- **API request/response logs** - Raw LLM API calls if available
- **Tool execution logs** - Records of tool calls and outputs
- **Agent configuration** - System prompts, tool schemas, settings
- **Description of the issue** - What's wrong and when it occurs

### Step 2: Run Automated Analysis

Use the appropriate analysis scripts based on symptoms:

**For general conversation issues:**
```bash
python scripts/analyze_conversation.py <log_file>
```
Analyzes role distribution and message patterns, detects potential issues, and provides summary metrics.
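
As a rough illustration of this kind of summary, the sketch below counts messages per role and lists content that appears more than once. The function name `summarize_conversation` and its output keys are hypothetical; this is not the code in `analyze_conversation.py`, only a minimal sketch that assumes the JSON message format described under Log Format Requirements.

```python
from collections import Counter


def summarize_conversation(messages):
    """Return role counts and any message text that occurs more than once.

    Hypothetical helper for illustration; not taken from analyze_conversation.py.
    """
    role_counts = Counter(m.get("role", "unknown") for m in messages)

    # Normalize text so trivial whitespace or case differences still match.
    # Messages whose content is null (for example tool-call turns) are skipped.
    content_counts = Counter(
        m["content"].strip().lower()
        for m in messages
        if isinstance(m.get("content"), str) and m["content"].strip()
    )
    repeated = {text: n for text, n in content_counts.items() if n > 1}

    return {
        "total_messages": len(messages),
        "role_counts": dict(role_counts),
        "repeated_content": repeated,
    }
```

A high count in `repeated_content` is only a hint; it is worth reading the surrounding turns before concluding that the agent is looping.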

**For suspected loops or stuck states:**
```bash
python scripts/detect_loops.py <log_file> [--threshold 2] [--window 5]
```
Detects exact loops, fuzzy patterns, stuck states, and ping-pong exchanges.
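
The exact meaning of `--threshold` and `--window` is not documented here. One plausible reading, shown in the sketch below, is "flag text that one role repeats at least `threshold` times within `window` consecutive turns of that role". The function `find_exact_loops` is a hypothetical illustration of exact-match loop detection under that assumption, not the implementation in `detect_loops.py`.

```python
from collections import Counter


def find_exact_loops(messages, threshold=2, window=5, role="assistant"):
    """Flag texts repeated at least `threshold` times within `window` turns.

    Assumed semantics for illustration only; detect_loops.py may differ.
    """
    texts = [
        m["content"].strip().lower()
        for m in messages
        if m.get("role") == role and isinstance(m.get("content"), str)
    ]

    flagged = set()
    # Slide a fixed-size window over this role's turns. If there are fewer
    # turns than `window`, a single window covers all of them.
    for start in range(max(1, len(texts) - window + 1)):
        counts = Counter(texts[start:start + window])
        flagged.update(text for text, n in counts.items() if n >= threshold)
    return sorted(flagged)
```

Exact matching misses paraphrased repeats, which is presumably why the script also reports fuzzy patterns.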

**For tool/function calling problems:**
```bash
python scripts/analyze_tool_calls.py <log_file> [--schema tool_schema.json]
```
Analyzes tool usage patterns, validates calls against the schema, and detects errors and retry loops.
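
To make the validation step concrete, here is a minimal sketch of checking tool calls against a schema. The layout of `tool_schema.json` is not specified in this document, so the sketch assumes a hypothetical mapping from tool name to a JSON-Schema-like object with `required` and `properties` keys. The function `validate_tool_calls` is illustrative and is not the script's actual code.

```python
import json


def validate_tool_calls(messages, schema):
    """Return human-readable problems found in assistant tool calls.

    `schema` is assumed (hypothetically) to look like:
    {"search_kb": {"required": ["query"], "properties": {"query": {...}}}}
    """
    problems = []
    for msg in messages:
        for call in msg.get("tool_calls") or []:
            call_id = call.get("id", "<no id>")
            fn = call.get("function") or {}
            name = fn.get("name")
            if name not in schema:
                problems.append(f"{call_id}: unknown tool {name!r}")
                continue
            # Arguments arrive as a JSON-encoded string in this log format.
            try:
                args = json.loads(fn.get("arguments") or "{}")
            except (json.JSONDecodeError, TypeError):
                problems.append(f"{call_id}: arguments are not valid JSON")
                continue
            if not isinstance(args, dict):
                problems.append(f"{call_id}: arguments must be a JSON object")
                continue
            spec = schema[name]
            for key in spec.get("required", []):
                if key not in args:
                    problems.append(f"{call_id}: missing required argument {key!r}")
            for key in args:
                if key not in spec.get("properties", {}):
                    problems.append(f"{call_id}: unexpected argument {key!r}")
    return problems
```

This checks only argument names. Type and range checks would need a full JSON Schema validator.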

**For performance/latency issues:**
```bash
python scripts/analyze_performance.py <log_file>
```
Calculates latency statistics, identifies slow responses, and analyzes performance by role.
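
A latency analysis of this kind can be approximated from the `timestamp` field described under Log Format Requirements. The sketch below treats the gap between an assistant message and the message just before it as that response's latency. This definition, the function names, and the `slow_after` cutoff are assumptions for illustration, not the behavior of `analyze_performance.py`.

```python
from datetime import datetime
from statistics import mean


def parse_ts(value):
    """Parse an ISO 8601 timestamp such as 2024-01-15T10:30:00Z."""
    # datetime.fromisoformat() rejects a trailing "Z" before Python 3.11,
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def assistant_latencies(messages):
    """Seconds between each assistant message and the message before it."""
    latencies = []
    for prev, cur in zip(messages, messages[1:]):
        if cur.get("role") == "assistant" and "timestamp" in prev and "timestamp" in cur:
            delta = parse_ts(cur["timestamp"]) - parse_ts(prev["timestamp"])
            latencies.append(delta.total_seconds())
    return latencies


def latency_summary(messages, slow_after=5.0):
    """Summarize latencies; `slow_after` (seconds) is an arbitrary cutoff."""
    values = assistant_latencies(messages)
    if not values:
        return None
    return {
        "count": len(values),
        "mean": mean(values),
        "max": max(values),
        "slow": [v for v in values if v > slow_after],
    }
```

Message timestamps include tool execution time, so a slow gap does not by itself show that the model call was slow.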

**Note:** Scripts accept JSON-formatted logs. For text logs, `analyze_conversation.py` can auto-detect and parse common formats.

### Step 3: Interpret Results

Review script outputs and identify patterns:
- Check for **warnings and issues** flagged by scripts
- Look at **metrics** (latency, token usage, tool call counts)
- Examine **repeated patterns** or anomalies
- Cross-reference with common failure modes

### Step 4: Match to Known Patterns

Consult the debugging patterns reference:

```
Read references/debugging-patterns.md
```

This comprehensive guide covers:
1. **Conversation Loops** - Symptoms, causes, solutions
2. **Tool Calling Failures** - Detection and fixes
3. **Context Window Exhaustion** - Management strategies
4. **Incorrect Responses** - Prompt engineering fixes
5. **Performance Issues** - Optimization techniques
6. **Tool Execution Errors** - Error handling approaches
7. **State Management Issues** - Tracking strategies

Each pattern includes:
- Observable symptoms
- Root causes
- Concrete solutions
- Detection methods

### Step 5: Recommend Solutions

Based on analysis and pattern matching:

1. **Identify root cause** - What's actually broken?
2. **Propose specific fixes** - Concrete changes to prompts, tools, or config
3. **Explain reasoning** - Why this will solve the problem
4. **Suggest testing** - How to verify the fix works
5. **Recommend preventive measures** - How to avoid similar issues

### Step 6: Provide Best Practices

For broader improvements, reference:

```
Read references/agent-best-practices.md
```

Covers:
- System prompt design principles
- Tool design and implementation
- Conversation management strategies
- Error handling approaches
- Quality assurance and monitoring
- Optimization techniques

## Log Format Requirements

Scripts work best with structured JSON logs:

**Minimal format:**
```json
[
  {"role": "user", "content": "Hello"},
  {"role": "assistant", "content": "Hi there!"}
]
```

**With tool calls (OpenAI-style `tool_calls` format):**
```json
[
  {
    "role": "assistant",
    "content": null,
    "tool_calls": [
      {
        "id": "call_123",
        "type": "function",
        "function": {
          "name": "search_kb",
          "arguments": "{\"query\": \"password reset\"}"
        }
      }
    ]
  },
  {
    "role": "tool",
    "tool_call_id": "call_123",
    "content": "Article: How to reset your password..."
  }
]
```
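
In this format, each `tool` message points back to its originating call through `tool_call_id`. One quick sanity check, sketched below with a hypothetical helper named `unanswered_tool_calls`, is to look for calls that never received a result, which can indicate a dropped or failed tool execution.

```python
def unanswered_tool_calls(messages):
    """Return ids of tool calls that have no matching `tool` message.

    Hypothetical helper for illustration; assumes the log format shown above.
    """
    requested = [
        call["id"]
        for m in messages
        for call in (m.get("tool_calls") or [])
        if "id" in call
    ]
    answered = {
        m.get("tool_call_id") for m in messages if m.get("role") == "tool"
    }
    # Preserve the order in which the calls were issued.
    return [call_id for call_id in requested if call_id not in answered]
```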

**With timestamps and metadata:**
```json
[
  {
    "role": "user",
    "content": "Hello",
    "timestamp": "2024-01-15T10:30:00Z",
    "message_id": "msg_1"
  },
  {
    "role": "assistant",
    "content": "Hi there!",
    "timestamp": "2024-01-15T10:30:02Z",
    "usage": {
      "prompt_tokens": 50,
      "completion_tokens": 10,
      "total_tokens": 60
    }
  }
]
```

Scripts auto-detect format and extract available information.
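
As an example of extracting optional metadata, the sketch below sums the `usage` fields across a log and tolerates messages that lack them. The helper name `token_usage_totals` is hypothetical, and the sketch assumes the field names shown in the example above.

```python
def token_usage_totals(messages):
    """Sum token usage across messages, skipping those without `usage`.

    Hypothetical helper for illustration; field names follow the example log.
    """
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    for m in messages:
        usage = m.get("usage") or {}
        for key in totals:
            totals[key] += int(usage.get(key, 0))
    return totals
```

A steadily rising `prompt_tokens` value from turn to turn is one sign that the context is growing toward the model's limit.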

## Quick Diagnostic Checklist

**Agent not responding:**
- [ ] Check API connectivity and auth
- [ ] Review error logs
- [ ] Verify configuration is valid
- [ ] Check rate limits

**Wrong/irrelevant responses:**
- [ ] Review system prompt clarity
- [ ] Check if appropriate tools are called
- [ ] Verify necessary context is present
- [ ] Test with clearer user input

**Conversation stuck/looping:**
- [ ] Run `detect_loops.py`
- [ ] Check for repeated tool errors
- [ ] Review last few agent responses
- [ ] Add explicit loop break conditions
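
An explicit loop break condition can be as simple as a counter in the agent's control loop. The sketch below is one possible shape for such a guard; the class name `RepeatGuard` and the default of three repeats are assumptions for illustration, not part of this skill's scripts.

```python
class RepeatGuard:
    """Signal a break when the same action occurs too many times in a row.

    Hypothetical guard for illustration; `action` can be any comparable
    value, such as a tool name or a (tool name, arguments) tuple.
    """

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self._last = None
        self._streak = 0

    def should_break(self, action):
        # Extend the streak on a repeat; otherwise start a new one.
        if action == self._last:
            self._streak += 1
        else:
            self._last = action
            self._streak = 1
        return self._streak >= self.max_repeats
```

When the guard fires, the agent might escalate to a human or switch strategy rather than retry the same call again.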

**Tool calling issues:**
- [ ] Run `analyze_tool_calls.py` with schema
- [ ] Validate tool descriptions are clear
- [ ] Check tool implementation for bugs
- [ ] Test tools independently

**Performance problems:**
- [ ] Run `analyze_performance.py`
- [ ] Check token usage and context length
- [ ] Review tool execution times
- [ ] Consider model/infrastructure

## Example Debugging Session

**User reports:** "Agent keeps asking for the same information repeatedly"

**Analysis approach:**
1. Collect conversation log
2. Run `detect_loops.py` → Confirms a ping-pong pattern
3. Run `analyze_conversation.py` → Shows high repeated content
4. Review conversation → Agent not retaining context from earlier messages
5. Consult `debugging-patterns.md` → Matches "State Management Issues"
6. **Solution:** Add explicit state tracking to system prompt, include conversation summary
7. **Test:** Verify agent now references earlier information
8. **Document:** Record fix and add to monitoring

## Resources

### scripts/
Analysis utilities that can be run directly on log files:
- `analyze_conversation.py` - General conversation analysis
- `detect_loops.py` - Loop and pattern detection
- `analyze_tool_calls.py` - Tool usage analysis and validation
- `analyze_performance.py` - Performance and latency analysis

### references/
In-depth debugging knowledge:
- `debugging-patterns.md` - Common failure modes and solutions (read when interpreting analysis results)
- `agent-best-practices.md` - Design and implementation best practices (read when providing recommendations)

Overview

This skill is a systematic debugging toolkit for AI agentic workflows in customer support. It helps diagnose wrong responses, tool-calling failures, conversation loops, stuck states, and performance issues. The toolkit works with any agent framework and accepts conversation, API, and tool execution logs plus agent configuration.

How this skill works

The skill ingests structured or common-format conversation and API logs, then runs targeted analysis scripts covering role distribution, repeated patterns, loop detection, tool call validation, and latency metrics. Outputs include flagged warnings, quantitative metrics, and matched failure patterns that map to concrete root causes. Based on these findings, it recommends fixes, testing steps, and preventive measures.

When to use it

- Agent returns wrong, irrelevant, or hallucinated answers
- Tools or functions are not being called or return incorrect outputs
- Conversation is stuck in loops, repeated prompts, or ping-pong behavior
- Agent responses are slow or latency is inconsistent
- Tool executions fail, return errors, or trigger retries

Best practices

- Collect full diagnostic artifacts before analysis: conversation logs, API traces, tool logs, and agent config
- Use structured JSON logs when possible; scripts auto-detect many common formats
- Run targeted scripts based on symptom: conversation, loops, tool calls, or performance
- Match script findings against documented debugging patterns to pinpoint root causes
- Propose focused fixes (prompt, tool schema, state tracking) and a clear test plan to verify changes

Example use cases

- Diagnose an agent that repeatedly asks for the same customer info and implement state tracking to stop loops
- Validate and fix function/tool calling mismatches by comparing calls to the declared tool schema
- Analyze latency spikes across messages to identify slow tools or model-invocation bottlenecks
- Detect and remedy context-window exhaustion by summarizing history or pruning irrelevant context
- Investigate intermittent API errors by correlating tool execution logs with retry patterns

FAQ

What log formats are supported?

Scripts accept structured JSON, including message arrays with roles, `tool_calls`, timestamps, and usage metadata, and can auto-detect many common text formats.

Can this work with any agent framework?

Yes. The analysis focuses on conversation and execution traces, so it works with LangChain, custom agents, Claude/OpenAI APIs, and similar setups.