
log-analysis skill

/skills/log-analysis

This skill analyzes application and system logs to identify errors, patterns, and root causes, enabling faster debugging and reliable monitoring.

npx playbooks add skill aj-geddes/useful-ai-prompts --skill log-analysis

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: log-analysis
description: Analyze application and system logs to identify errors, patterns, and root causes. Use log aggregation tools and structured logging for effective debugging.
---

# Log Analysis

## Overview

Logs are critical for debugging and monitoring. Effective log analysis quickly identifies issues and enables root cause analysis.

## When to Use

- Troubleshooting errors
- Performance investigation
- Security incident analysis
- Auditing user actions
- Monitoring application health

## Instructions

### 1. **Structured Logging**

```javascript
// Good: Structured logs (machine-readable).
// The logger adds level and timestamp automatically; pass only event fields.
logger.info({
  service: 'auth-service',
  user_id: '12345',
  action: 'user_login',
  status: 'success',
  duration_ms: 150,
  ip_address: '192.168.1.1'
});

// Bad: Unstructured logs (hard to parse)
console.log('User 12345 logged in successfully in 150ms from 192.168.1.1');

// Example of the emitted log line as JSON (Elasticsearch friendly)
{
  "@timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "service": "api-gateway",
  "trace_id": "abc123",
  "message": "Database connection failed",
  "error": {
    "type": "ConnectionError",
    "code": "ECONNREFUSED"
  },
  "context": {
    "database": "users",
    "operation": "SELECT"
  }
}
```
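
A minimal sketch of the setup behind the example above, assuming the pino library (any structured logger that emits one JSON object per line works the same way); the service name and fields are illustrative:

```javascript
// Hypothetical structured-logger setup using pino (npm install pino).
// pino adds level and timestamp to every entry automatically.
const pino = require('pino');

const logger = pino({
  base: { service: 'auth-service' },          // attached to every log line
  timestamp: pino.stdTimeFunctions.isoTime,   // ISO-8601 timestamps
});

// Merge object first, human-readable message second
logger.info(
  { user_id: '12345', action: 'user_login', status: 'success', duration_ms: 150 },
  'user logged in'
);
```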

### 2. **Log Levels & Patterns**

```yaml
Log Levels:

DEBUG: Detailed diagnostic info
  - Variable values
  - Function entry/exit
  - Intermediate calculations
  - Use: Development only

INFO: General informational messages
  - Startup/shutdown
  - User actions
  - Configuration changes
  - Use: Production (normal operations)

WARN: Warning messages (potential issues)
  - Deprecated API usage
  - Performance degradation
  - Resource limits approaching
  - Use: Production (investigate soon)

ERROR: Error conditions
  - Failed operations
  - Exceptions
  - Failed requests
  - Use: Production (action required)

FATAL/CRITICAL: System unusable
  - Critical failures
  - Out of memory
  - Data corruption
  - Use: Production (immediate action)

---

Log Patterns:

Request Logging:
  - Request ID (trace_id)
  - Method + Path
  - Status code
  - Duration
  - Request size / response size

Error Logging:
  - Error type/code
  - Error message
  - Stack trace
  - Context (user_id, session_id)
  - Timestamp

Business Events:
  - Event type
  - User involved
  - Impact/importance
  - Timestamp
  - Relevant context
```
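
To make the request-logging pattern concrete, here is a minimal sketch of an Express-style middleware that assigns a trace ID, times the request, and picks a log level from the status code. Express, Node 16+ (for crypto.randomUUID), and the structured `logger` from step 1 are assumptions, not part of this skill:

```javascript
// Hypothetical request-logging middleware for Express.
const { randomUUID } = require('crypto');

function requestLogger(logger) {
  return (req, res, next) => {
    // Reuse an upstream trace ID if present, otherwise mint one
    req.trace_id = req.headers['x-trace-id'] || randomUUID();
    const start = process.hrtime.bigint();

    res.on('finish', () => {
      const duration_ms = Number(process.hrtime.bigint() - start) / 1e6;
      // ERROR for 5xx, WARN for 4xx, INFO otherwise
      const level = res.statusCode >= 500 ? 'error'
                  : res.statusCode >= 400 ? 'warn' : 'info';
      logger[level]({
        trace_id: req.trace_id,
        method: req.method,
        path: req.originalUrl,
        status: res.statusCode,
        duration_ms,
      }, 'request_completed');
    });

    next();
  };
}

// Usage: app.use(requestLogger(logger));
```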

### 3. **Log Analysis Tools**

```yaml
Log Aggregation:

ELK Stack (Elasticsearch, Logstash, Kibana):
  - Logstash: Parse and process logs
  - Elasticsearch: Search and analyze
  - Kibana: Visualization and dashboards
  - Use: Large scale, complex queries

Splunk:
  - Comprehensive log management
  - Real-time search and analysis
  - Dashboards and alerts
  - Use: Enterprise (expensive)

CloudWatch (AWS):
  - Integrated with AWS services
  - Log Insights for querying
  - Dashboards
  - Use: AWS-based systems

Datadog:
  - Application performance monitoring
  - Log management
  - Real-time alerts
  - Use: SaaS monitoring

---

Log Analysis Techniques:

Grep/Awk:
  grep "ERROR" app.log
  awk '{print $1, $4}' app.log

Filtering:
  Filter by timestamp
  Filter by service
  Filter by error type
  Filter by user

Searching:
  Search for error patterns
  Search for user actions
  Search trace IDs
  Search IP addresses

Aggregation:
  Count occurrences
  Group by error type
  Calculate duration percentiles
  Rate of errors over time
```
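
When a full aggregation stack is overkill, the same grouping can be scripted locally. A sketch assuming one JSON log object per line (NDJSON) in a hypothetical app.log, equivalent in spirit to `level: ERROR | stats count by error.type`:

```javascript
// Count ERROR entries by error.type from an NDJSON log file.
const fs = require('fs');
const readline = require('readline');

async function countErrorsByType(path) {
  const counts = {};
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  for await (const line of rl) {
    let entry;
    try { entry = JSON.parse(line); } catch { continue; } // skip non-JSON lines
    if (entry.level !== 'ERROR') continue;
    const type = entry.error?.type ?? 'unknown';
    counts[type] = (counts[type] || 0) + 1;
  }
  return counts;
}

countErrorsByType('app.log').then(console.log);
// e.g. { ConnectionError: 42, TimeoutError: 7, unknown: 3 }
```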

### 4. **Common Log Analysis Queries**

```yaml
Find errors in past hour:
  timestamp: last_1h AND level: ERROR

Track user activity:
  user_id: 12345 AND action: *

Find slow requests:
  duration_ms: >1000 AND level: INFO

Analyze error rate by service:
  level: ERROR | stats count by service

Find failed database operations:
  error.type: "DatabaseError" | stats count

Trace request flow:
  trace_id: "abc123" | sort by timestamp

---

Checklist:

[ ] Structured logging implemented
[ ] All errors logged with context
[ ] Request IDs/trace IDs used
[ ] Sensitive data not logged (passwords, tokens)
[ ] Log levels used appropriately
[ ] Log retention policy set
[ ] Log sampling for high-volume events
[ ] Alerts configured for errors
[ ] Dashboards created
[ ] Regular log review scheduled
[ ] Log analysis tools accessible
[ ] Team trained on querying logs
```
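
The slow-request query pairs naturally with percentile math. A sketch that computes p50/p95/p99 over duration_ms from the same hypothetical NDJSON file, using the nearest-rank method:

```javascript
// Compute latency percentiles from duration_ms fields in NDJSON logs.
const fs = require('fs');

function percentile(sorted, p) {
  // Nearest-rank percentile on a pre-sorted array
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

const durations = fs.readFileSync('app.log', 'utf8')
  .split('\n')
  .filter(Boolean)
  .map((line) => { try { return JSON.parse(line).duration_ms; } catch { return undefined; } })
  .filter((d) => typeof d === 'number')
  .sort((a, b) => a - b);

console.log({
  p50: percentile(durations, 50),
  p95: percentile(durations, 95),
  p99: percentile(durations, 99),
  slow: durations.filter((d) => d > 1000).length, // matches duration_ms: >1000
});
```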

## Key Points

- Use structured JSON logging
- Include trace IDs for request tracking
- Log at appropriate levels (DEBUG/INFO/WARN/ERROR)
- Never log sensitive data (passwords, tokens); see the redaction sketch below
- Aggregate logs centrally
- Create dashboards for key metrics
- Alert on error rates and critical issues
- Retain logs appropriately
- Search logs by trace ID for troubleshooting
- Review logs regularly for patterns
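
For the sensitive-data point, a minimal redaction sketch assuming pino, whose built-in `redact` option censors named paths before the line is written (the field paths here are illustrative):

```javascript
// Redact sensitive fields at the logger level with pino's redact option.
const pino = require('pino');

const logger = pino({
  redact: {
    paths: ['password', 'token', '*.password', 'req.headers.authorization'],
    censor: '[REDACTED]',
  },
});

logger.info({ user_id: '12345', password: 'hunter2' }, 'login attempt');
// => {"level":30,...,"user_id":"12345","password":"[REDACTED]","msg":"login attempt"}
```

Redacting at the logger level is more reliable than remembering to strip fields at every call site.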

Overview

This skill analyzes application and system logs to identify errors, patterns, and root causes quickly. It emphasizes structured JSON logging, centralized aggregation, and actionable queries to speed debugging and monitoring. The goal is to turn raw logs into searchable, visual insights and prioritized alerts.

How this skill works

The skill inspects logs for error patterns, request traces, and performance anomalies using tools like ELK, CloudWatch, Splunk, or simple grep/awk when appropriate. It enforces structured logs with trace IDs, contextual fields, and consistent levels, then applies filters, aggregations, and dashboards to surface trends and root causes. Outputs include error summaries, slow-request lists, and trace-based incident timelines.
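
As a concrete illustration of the trace-based timeline output, a sketch that filters a hypothetical NDJSON log file by trace_id and sorts by timestamp (field names follow the structured-logging examples above):

```javascript
// Rebuild an incident timeline for one trace_id from NDJSON logs.
const fs = require('fs');

function traceTimeline(path, traceId) {
  return fs.readFileSync(path, 'utf8')
    .split('\n')
    .filter(Boolean)
    .map((line) => { try { return JSON.parse(line); } catch { return null; } })
    .filter((e) => e && e.trace_id === traceId)
    .sort((a, b) => new Date(a['@timestamp']) - new Date(b['@timestamp']))
    .map((e) => `${e['@timestamp']} [${e.level}] ${e.service}: ${e.message}`);
}

console.log(traceTimeline('app.log', 'abc123').join('\n'));
```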

When to use it

  • Troubleshooting production errors or exceptions
  • Investigating performance regressions or slow requests
  • Responding to security incidents and suspicious activity
  • Auditing user actions or compliance-related events
  • Setting up monitoring and alerting for application health

Best practices

  • Emit structured JSON logs with consistent fields (timestamp, service, trace_id, level, context).
  • Include trace/request IDs to link distributed traces and follow request flow.
  • Avoid logging sensitive data (passwords, tokens, PII) and use masking where needed.
  • Use appropriate log levels (DEBUG for dev, INFO for normal ops, WARN/ERROR for issues).
  • Centralize logs in an aggregation tool and create dashboards and alerts for key metrics.
  • Apply sampling for high-volume events and retain logs according to policy.

Example use cases

  • Find all ERROR-level events in the last hour and group by service to prioritize fixes.
  • Trace a failed user transaction using trace_id to identify the failing component.
  • Identify slow endpoints by querying duration_ms > 1000 and compute percentiles.
  • Detect unusual login patterns for a user_id across multiple services during a security review.
  • Create a dashboard that shows error rate, latency percentiles, and top error types for on-call triage.

FAQ

What fields should structured logs contain?

Include timestamp, level, service, trace_id or request_id, message, error type/code when present, user_id or session_id if relevant, and operation-specific context.

Which tool should I choose for log aggregation?

Choose based on scale and ecosystem: ELK for open-source flexibility, Splunk for enterprise features, CloudWatch for AWS-native, and Datadog for integrated APM and logs.