
intermittent-issue-debugging skill


This skill helps you diagnose intermittent issues through systematic logging, tracing, and monitoring to identify root causes and prevent flaky behavior.

npx playbooks add skill aj-geddes/useful-ai-prompts --skill intermittent-issue-debugging

SKILL.md
---
name: intermittent-issue-debugging
description: Debug issues that occur sporadically and are hard to reproduce. Use monitoring and systematic investigation to identify root causes of flaky behavior.
---

# Intermittent Issue Debugging

## Overview

Intermittent issues are the most difficult to debug because they don't occur consistently. A systematic approach and comprehensive monitoring are essential.

## When to Use

- Sporadic errors in logs
- Users report occasional issues
- Flaky tests
- Race conditions suspected
- Timing-dependent bugs
- Resource exhaustion issues

## Instructions

### 1. **Capturing Intermittent Issues**

```javascript
// Strategy 1: Comprehensive Logging
// Add detailed logging around suspected code

function processPayment(orderId) {
  const startTime = Date.now();
  console.log(`[${startTime}] Payment start: order=${orderId}`);

  try {
    const result = chargeCard(orderId);
    console.log(`[${Date.now()}] Payment success: ${orderId}`);
    return result;
  } catch (error) {
    const duration = Date.now() - startTime;
    console.error(`[${Date.now()}] Payment FAILED:`, {
      order: orderId,
      error: error.message,
      duration_ms: duration,
      error_type: error.constructor.name,
      stack: error.stack
    });
    throw error;
  }
}

// Strategy 2: Correlation IDs
// Track a single request across systems
// (generateId, logger, and chargeCard are application-specific helpers)

const correlationId = generateId();
const orderId = 123;

logger.info({
  correlationId,
  action: 'payment_start',
  orderId
});

// Pass the ID downstream so other services include it in their logs
chargeCard(orderId, {headers: {correlationId}});

logger.info({
  correlationId,
  action: 'payment_end',
  status: 'success'
});

// Later, grep or filter logs by correlationId to see the full trace

// Strategy 3: Error Sampling
// Capture full error context when an error occurs; sample to limit volume

const ERROR_SAMPLE_RATE = 0.1;  // report roughly 10% of errors

window.addEventListener('error', (event) => {
  if (Math.random() > ERROR_SAMPLE_RATE) return;  // drop unsampled errors

  const errorData = {
    message: event.message,
    url: event.filename,
    line: event.lineno,
    col: event.colno,
    stack: event.error?.stack,
    userAgent: navigator.userAgent,
    memory: performance.memory?.usedJSHeapSize,  // Chrome-only, may be undefined
    timestamp: new Date().toISOString()
  };

  sendToMonitoring(errorData);  // Send to error tracking service
});
```

### 2. **Common Intermittent Issues**

```yaml
Issue: Race Condition

Symptom: Inconsistent behavior depending on timing

Example:
  Thread 1: Read count (5)
  Thread 2: Read count (5), increment to 6, write
  Thread 1: Increment to 6, write (overrides Thread 2)
  Result: Should be 7, but is 6

Debug:
  1. Add detailed timestamps
  2. Log all operations
  3. Look for overlapping operations
  4. Check if order matters

Solution:
  - Use locks/mutexes
  - Use atomic operations
  - Use message queues
  - Ensure single writer

---

Issue: Timing-Dependent Bug

Symptom: Test passes sometimes, fails others

Example:
  test_user_creation:
    1. Create user (sometimes slow)
    2. Check user exists
    3. Fails if create took too long

Debug:
  - Add timeout logging
  - Increase wait time
  - Add explicit waits
  - Mock slow operations

Solution:
  - Explicit wait for condition
  - Remove time-dependent assertions
  - Use proper test fixtures

---

Issue: Resource Exhaustion

Symptom: Works fine at first, then fails after running for a while

Example:
  - Memory grows over time
  - Connection pool exhausted
  - Disk space fills up
  - Max open files reached

Debug:
  - Monitor resources continuously
  - Check for leaks (memory growth)
  - Monitor connection count
  - Check long-running processes

Solution:
  - Fix memory leak
  - Increase resource limits
  - Implement cleanup
  - Add monitoring/alerts

---

Issue: Intermittent Network Failure

Symptom: API calls occasionally fail

Debug:
  - Check network logs
  - Identify timeout patterns
  - Check if time-of-day dependent
  - Check if load dependent

Solution:
  - Implement exponential backoff retry
  - Add circuit breaker
  - Increase timeout
  - Add redundancy
```
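
As a minimal, illustrative sketch of the "use locks / ensure single writer" fix for the lost-update race above: the in-process mutex below serializes the read-modify-write so increments are never lost. `Mutex` and `store` are hypothetical stand-ins for whatever synchronization primitive and data store your application uses; across multiple processes you would instead need an atomic database update or a distributed lock.

```javascript
class Mutex {
  constructor() {
    this._queue = Promise.resolve();
  }

  // Run fn only after every previously queued fn has finished
  run(fn) {
    const result = this._queue.then(fn);
    this._queue = result.catch(() => {});  // keep the chain alive after errors
    return result;
  }
}

const counterLock = new Mutex();

async function incrementCounter(store) {
  return counterLock.run(async () => {
    const count = await store.get('count');  // read
    await store.set('count', count + 1);     // modify + write without interleaving
    return count + 1;
  });
}
```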
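
The exponential-backoff retry suggested for intermittent network failures can be sketched as follows. `callApi` is a placeholder for any flaky async call, and the retry count and base delay are illustrative defaults to tune for your service.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(callApi, { retries = 5, baseDelayMs = 200 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await callApi();
    } catch (error) {
      if (attempt === retries) throw error;  // out of attempts, give up
      // Exponential backoff with jitter: ~200ms, 400ms, 800ms, ...
      const delayMs = baseDelayMs * 2 ** attempt + Math.random() * 100;
      console.warn(`Attempt ${attempt + 1} failed (${error.message}); retrying in ${Math.round(delayMs)}ms`);
      await sleep(delayMs);
    }
  }
}

// Usage (illustrative): withRetry(() => fetch('/api/charge', { method: 'POST' }))
```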

### 3. **Systematic Investigation Process**

```yaml
Step 1: Understand the Pattern
  Questions:
    - How often does it occur? (1/100, 1/1000?)
    - When does it occur? (time of day, load, specific user?)
    - What are the conditions? (network, memory, load?)
    - Is it reproducible? (deterministic or random?)
    - Any recent changes?

  Analysis:
    - Review error logs
    - Check error rate trends
    - Identify patterns
    - Correlate with changes

Step 2: Reproduce Reliably
  Methods:
    - Increase test frequency (run 1000 times)
    - Stress test (heavy load)
    - Simulate poor conditions (network, memory)
    - Run on different machines
    - Run in production-like environment

  Goal: Make the issue occur consistently so it can be analyzed

Step 3: Add Instrumentation
  - Add detailed logging
  - Add monitoring metrics
  - Add trace IDs
  - Capture errors fully
  - Log system state

Step 4: Capture the Issue
  - Recreate scenario
  - Capture full context
  - Note system state
  - Document conditions
  - Get reproduction case

Step 5: Analyze Data
  - Review logs
  - Look for patterns
  - Compare normal vs error cases
  - Check timing correlations
  - Identify root cause

Step 6: Implement Fix
  - Based on root cause
  - Verify with reproduction case
  - Test extensively
  - Add regression test
```
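
A rough sketch of Step 2's "increase test frequency" tactic: hammer the suspect operation many times, then report the failure rate, latency percentiles, and a few captured failure contexts. `flakyOperation` is a placeholder for the code path under suspicion.

```javascript
async function hammer(flakyOperation, iterations = 1000) {
  const failures = [];
  const durations = [];

  for (let i = 0; i < iterations; i++) {
    const start = Date.now();
    try {
      await flakyOperation();
    } catch (error) {
      failures.push({ iteration: i, error: error.message, durationMs: Date.now() - start });
    }
    durations.push(Date.now() - start);
  }

  durations.sort((a, b) => a - b);
  console.log({
    failureRate: failures.length / iterations,
    p50Ms: durations[Math.floor(iterations * 0.50)],
    p99Ms: durations[Math.floor(iterations * 0.99)],
    sampleFailures: failures.slice(0, 5),  // first few failures for closer analysis
  });
}
```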

### 4. **Monitoring & Prevention**

```yaml
Monitoring Strategy:

Real User Monitoring (RUM):
  - Error rates by feature
  - Latency percentiles
  - User impact
  - Trend analysis

Application Performance Monitoring (APM):
  - Request traces
  - Database query performance
  - External service calls
  - Resource usage

Synthetic Monitoring:
  - Regular test execution
  - Simulate user flows
  - Alert on failures
  - Trend tracking

---

Alerting:

Setup alerts for:
  - Error rate spike
  - Response time >threshold
  - Memory growth trend
  - Failed transactions

---

Prevention Checklist:

[ ] Comprehensive logging in place
[ ] Error tracking configured
[ ] Performance monitoring active
[ ] Resource monitoring enabled
[ ] Correlation IDs used
[ ] Failed requests captured
[ ] Timeout values appropriate
[ ] Retry logic implemented
[ ] Circuit breakers in place
[ ] Load testing performed
[ ] Stress testing performed
[ ] Race conditions reviewed
[ ] Timing dependencies checked

---

Tools:

Monitoring:
  - New Relic / Datadog
  - Prometheus / Grafana
  - Sentry / Rollbar
  - Custom logging

Testing:
  - Load testing (k6, JMeter)
  - Chaos engineering (Gremlin)
  - Property-based testing (Hypothesis)
  - Fuzz testing

Debugging:
  - Distributed tracing (Jaeger)
  - Correlation IDs
  - Detailed logging
  - Debuggers
```
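
As one concrete example of the "circuit breakers in place" checklist item, here is a minimal in-process sketch, not a production implementation: after a threshold of consecutive failures it fails fast, then allows a single trial call once a cooldown has elapsed. The class and parameter names are illustrative, not from any particular library.

```javascript
class CircuitBreaker {
  constructor(fn, { threshold = 5, cooldownMs = 30000 } = {}) {
    this.fn = fn;
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.openedAt = null;
  }

  async call(...args) {
    if (this.openedAt !== null) {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error('Circuit open: failing fast');
      }
      this.openedAt = null;  // half-open: allow one trial call
    }
    try {
      const result = await this.fn(...args);
      this.failures = 0;     // success closes the circuit
      return result;
    } catch (error) {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = Date.now();
      throw error;
    }
  }
}
```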

## Key Points

- Comprehensive logging is essential
- Add correlation IDs for tracing
- Monitor for patterns and trends
- Stress test to reproduce
- Use detailed error context
- Implement exponential backoff for retries
- Monitor resource exhaustion
- Add circuit breakers for external services
- Log system state with errors
- Implement proper monitoring/alerting

Overview

This skill helps debug intermittent, hard-to-reproduce issues by combining monitoring, instrumentation, and a systematic investigation workflow. It focuses on capturing full error context, correlating signals across systems, and turning sporadic failures into reproducible cases. The goal is to identify root causes and implement reliable fixes and preventative controls.

How this skill works

The skill inspects logs, traces, resource metrics, and test behavior to find patterns tied to timing, load, or environment. It prescribes adding detailed logging and correlation IDs, sampling full error payloads, and running stress or synthetic tests to force recurrence. Once the issue is captured, the process guides analysis, root-cause validation, and deployment of fixes such as retries, backoff, or synchronization.

When to use it

  • When errors appear sporadically in production logs
  • When users report occasional failures that can’t be reproduced locally
  • For flaky tests that pass or fail nondeterministically
  • When race conditions or timing-dependent bugs are suspected
  • If resources (memory, connections, disk) degrade over time

Best practices

  • Instrument key code paths with timestamps, system state, and correlation IDs
  • Capture full error context selectively (error sampling) to avoid noise
  • Continuously monitor resource metrics and set trend alerts before failures
  • Reproduce reliably with stress tests, increased iteration counts, and environment simulation
  • Add regression tests and explicit waits for timing-dependent behavior

Example use cases

  • Track a payment flow across services using correlation IDs to find where requests drop
  • Diagnose a flaky integration test by running it under simulated slow network and increased concurrency
  • Detect and fix a memory leak by enabling heap sampling and trend alerts
  • Mitigate intermittent API failures with exponential backoff and a circuit breaker
  • Confirm a race condition by logging operation timestamps and enforcing single-writer constraints

FAQ

How do I capture enough context without overwhelming storage?

Sample full error payloads and capture detailed context only on sampled failures. Store structured logs and use retention policies and alerts for high-impact signals.

What if I can’t reproduce the issue in staging?

Increase test frequency, run stress tests, simulate poor network or low-resource conditions, and run in a production-like environment. Collect traces and correlation IDs in production to reconstruct state.