This skill helps you diagnose Kubernetes cluster issues and streamline debugging of pods, services, deployments, and network problems with structured workflows.
Add this skill to your agents with: `npx playbooks add skill akin-ozer/cc-devops-skills --skill k8s-debug`
---
name: k8s-debug
description: Comprehensive Kubernetes debugging and troubleshooting toolkit. Use this skill when diagnosing Kubernetes cluster issues, debugging failing pods, investigating network connectivity problems, analyzing resource usage, troubleshooting deployments, or performing cluster health checks.
---
# Kubernetes Debugging Skill
## Overview
Systematic toolkit for debugging and troubleshooting Kubernetes clusters, pods, services, and deployments. Provides scripts, workflows, and reference guides for identifying and resolving common Kubernetes issues efficiently.
## When to Use This Skill
Invoke this skill when encountering:
- Pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, OOMKilled)
- Service connectivity or DNS resolution issues
- Network policy or ingress problems
- Volume and storage mount failures
- Deployment rollout issues
- Cluster health or performance degradation
- Resource exhaustion (CPU/memory)
- Configuration problems (ConfigMaps, Secrets, RBAC)
## Debugging Workflow
Follow this systematic approach for any Kubernetes issue:
### 1. Identify the Problem Layer
Categorize the issue:
- **Application Layer**: Application crashes, errors, bugs
- **Pod Layer**: Pod not starting, restarting, or pending
- **Service Layer**: Network connectivity, DNS issues
- **Node Layer**: Node not ready, resource exhaustion
- **Cluster Layer**: Control plane issues, API problems
- **Storage Layer**: Volume mount failures, PVC issues
- **Configuration Layer**: ConfigMap, Secret, RBAC issues
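A quick first pass with standard `kubectl` commands often narrows down the failing layer before running the deeper scripts; for example (placeholders in angle brackets, as elsewhere in this skill):
```bash
# First-pass triage: which layer is misbehaving?
kubectl get nodes                                    # Node layer: NotReady or pressure conditions?
kubectl get pods -n <namespace> -o wide              # Pod layer: Pending, CrashLoopBackOff, OOMKilled?
kubectl get svc,endpoints -n <namespace>             # Service layer: missing or empty endpoints?
kubectl get pvc -n <namespace>                       # Storage layer: claims stuck in Pending?
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20   # Recent warnings/errors
```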
### 2. Gather Diagnostic Information
Use the appropriate diagnostic script based on scope:
#### Pod-Level Diagnostics
Use `scripts/pod_diagnostics.py` for comprehensive pod analysis:
```bash
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>
```
This script gathers:
- Pod status and description
- Pod events
- Container logs (current and previous)
- Resource usage
- Node information
- YAML configuration
Output can be saved for analysis: `python3 scripts/pod_diagnostics.py <pod-name> -n <namespace> -o diagnostics.txt`
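If the script cannot be run, the same evidence can be collected manually; a minimal sketch of roughly equivalent `kubectl` calls (the script's exact queries and output format may differ):
```bash
# Manual approximation of pod-level diagnostics (sketch only)
POD=<pod-name>; NS=<namespace>
kubectl get pod "$POD" -n "$NS" -o wide
kubectl describe pod "$POD" -n "$NS"
kubectl get events -n "$NS" --field-selector involvedObject.name="$POD"
kubectl logs "$POD" -n "$NS" --all-containers --tail=200
kubectl logs "$POD" -n "$NS" --all-containers --previous --tail=200 || true   # Previous containers, if any
kubectl top pod "$POD" -n "$NS" --containers || true                          # Requires metrics-server
kubectl get pod "$POD" -n "$NS" -o yaml
```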
#### Cluster-Level Health Check
Use `scripts/cluster_health.sh` for overall cluster diagnostics:
```bash
./scripts/cluster_health.sh
```
This script checks:
- Cluster info and version
- Node status and resources
- Pods across all namespaces
- Failed/pending pods
- Recent events
- Deployments, services, statefulsets, daemonsets
- PVCs and PVs
- Component health
- Common error states (CrashLoopBackOff, ImagePullBackOff)
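When the script is not at hand, a manual spot-check covers much of the same ground (a sketch; the script's checks are more exhaustive):
```bash
# Manual cluster health spot-check (sketch only)
kubectl cluster-info
kubectl get nodes -o wide
kubectl top nodes || true                             # Requires metrics-server
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
kubectl get deployments,statefulsets,daemonsets -A
kubectl get pvc -A
kubectl get pv
kubectl get events -A --sort-by='.lastTimestamp' | tail -30
```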
#### Network Diagnostics
Use `scripts/network_debug.sh` for connectivity issues:
```bash
./scripts/network_debug.sh <namespace> <pod-name>
```
This script analyzes:
- Pod network configuration
- DNS setup and resolution
- Service endpoints
- Network policies
- Connectivity tests
- CoreDNS logs
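The underlying checks can also be run by hand; a rough sketch (assumes the pod image ships `nslookup`, and that CoreDNS carries the standard `k8s-app=kube-dns` label):
```bash
# Manual network/DNS checks (sketch only)
NS=<namespace>; POD=<pod-name>
kubectl get pod "$POD" -n "$NS" -o jsonpath='{.status.podIP}{"\n"}'
kubectl exec "$POD" -n "$NS" -- cat /etc/resolv.conf
kubectl exec "$POD" -n "$NS" -- nslookup kubernetes.default      # Needs nslookup in the image
kubectl get endpoints -n "$NS"
kubectl get networkpolicies -n "$NS"
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50        # CoreDNS logs
```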
### 3. Follow Issue-Specific Workflow
Based on the identified issue, consult `references/troubleshooting_workflow.md` for detailed workflows:
- **Pod Pending**: Resource/scheduling workflow
- **CrashLoopBackOff**: Application crash workflow
- **ImagePullBackOff**: Image pull workflow
- **Service issues**: Network connectivity workflow
- **DNS failures**: DNS troubleshooting workflow
- **Resource exhaustion**: Performance investigation workflow
- **Storage issues**: PVC binding workflow
- **Deployment stuck**: Rollout workflow
### 4. Apply Targeted Fixes
Refer to `references/common_issues.md` for specific solutions to common problems.
## Common Debugging Patterns
### Pattern 1: Pod Not Starting
```bash
# Quick assessment
kubectl get pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
# Detailed diagnostics
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace>
# Check common causes:
# - ImagePullBackOff: Verify image exists and credentials
# - CrashLoopBackOff: Check logs with --previous flag
# - Pending: Check node resources and scheduling
```
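If `describe` points at a registry authentication failure, adding a pull secret is the usual fix; a sketch with placeholder names (the secret name `regcred` is illustrative, not part of this skill):
```bash
# Illustrative fix for ImagePullBackOff caused by missing registry credentials
kubectl create secret docker-registry regcred \
  --docker-server=<registry> --docker-username=<user> --docker-password=<password> -n <namespace>
# Reference the secret from the pod spec, or attach it to the service account:
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```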
### Pattern 2: Service Connectivity Issues
```bash
# Verify service and endpoints
kubectl get svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
# Network diagnostics
./scripts/network_debug.sh <namespace> <pod-name>
# Test connectivity from debug pod
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot -- /bin/bash
# Inside: curl <service-name>.<namespace>.svc.cluster.local:<port>
# Check network policies
kubectl get networkpolicies -n <namespace>
```
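Empty endpoints almost always mean the service selector does not match the pod labels; a quick way to compare them:
```bash
# Compare the service selector against actual pod labels
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}{"\n"}'
kubectl get pods -n <namespace> --show-labels
# If the selector is, say, app=<name>, confirm pods actually match it:
kubectl get pods -n <namespace> -l app=<name>
```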
### Pattern 3: Application Performance Issues
```bash
# Check resource usage
kubectl top nodes
kubectl top pods -n <namespace> --containers
# Get pod metrics
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 resources
# Check for OOMKilled
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 lastState
# Review application logs
kubectl logs <pod-name> -n <namespace> --tail=100
```
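If `lastState` shows `OOMKilled`, raising the memory limit (sized from observed usage) is a common remedy; an illustrative adjustment with placeholder values:
```bash
# Illustrative memory limit bump after an OOMKill (size values from real usage data)
kubectl set resources deployment/<name> -n <namespace> \
  -c <container> --requests=memory=256Mi --limits=memory=512Mi
kubectl rollout status deployment/<name> -n <namespace>
```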
### Pattern 4: Cluster Health Assessment
```bash
# Run comprehensive health check
./scripts/cluster_health.sh > cluster-health-$(date +%Y%m%d-%H%M%S).txt
# Review output for:
# - Node conditions and resource pressure
# - Failed or pending pods
# - Recent error events
# - Component health status
# - Resource quota usage
```
## Essential Manual Commands
While scripts automate diagnostics, understand these core commands:
### Pod Debugging
```bash
# View pod status
kubectl get pods -n <namespace> -o wide
# Detailed pod information
kubectl describe pod <pod-name> -n <namespace>
# View logs
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Previous container
kubectl logs <pod-name> -n <namespace> -c <container> # Specific container
# Execute commands in pod
kubectl exec <pod-name> -n <namespace> -it -- /bin/sh
# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml
```
### Service and Network Debugging
```bash
# Check services
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
# Check endpoints
kubectl get endpoints -n <namespace>
# Test DNS
kubectl exec <pod-name> -n <namespace> -- nslookup kubernetes.default
# View events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
### Resource Monitoring
```bash
# Node resources
kubectl top nodes
kubectl describe nodes
# Pod resources
kubectl top pods -n <namespace>
kubectl top pod <pod-name> -n <namespace> --containers
```
### Emergency Operations
```bash
# Restart deployment
kubectl rollout restart deployment/<name> -n <namespace>
# Rollback deployment
kubectl rollout undo deployment/<name> -n <namespace>
# Force delete stuck pod
kubectl delete pod <pod-name> -n <namespace> --force --grace-period=0
# Drain node (maintenance)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Cordon node (prevent scheduling)
kubectl cordon <node-name>
```
## Reference Documentation
### Detailed Troubleshooting Guides
Consult `references/troubleshooting_workflow.md` for:
- Step-by-step workflows for each issue type
- Decision trees for diagnosis
- Command sequences for systematic debugging
- Quick reference command cheat sheet
### Common Issues Database
Consult `references/common_issues.md` for:
- Detailed explanations of each common issue
- Symptoms and causes
- Specific debugging steps
- Solutions and fixes
- Prevention strategies
## Best Practices
### Systematic Approach
1. **Observe**: Gather facts before making changes
2. **Analyze**: Use diagnostic scripts to collect comprehensive data
3. **Hypothesize**: Form theory about root cause
4. **Test**: Verify hypothesis with targeted commands
5. **Fix**: Apply appropriate solution
6. **Verify**: Confirm issue is resolved
7. **Document**: Record findings for future reference
### Data Collection
- Save diagnostic output to files for analysis
- Capture logs before restarting failing pods
- Record events timeline for incident reports
- Export resource metrics for trend analysis
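One way to put this into practice is to bundle everything into a timestamped directory before touching anything (the layout below is just an example):
```bash
# Capture evidence into a timestamped directory before restarting or deleting anything
TS=$(date +%Y%m%d-%H%M%S); DIR="debug-$TS"; mkdir -p "$DIR"
python3 scripts/pod_diagnostics.py <pod-name> -n <namespace> -o "$DIR/pod-diagnostics.txt"
./scripts/cluster_health.sh > "$DIR/cluster-health.txt"
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > "$DIR/events.txt"
kubectl top pods -n <namespace> --containers > "$DIR/pod-metrics.txt"
```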
### Prevention
- Set appropriate resource requests and limits
- Implement health checks (liveness/readiness probes)
- Use proper logging and monitoring
- Apply network policies incrementally
- Test changes in non-production environments
- Maintain documentation of cluster architecture
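As a reference point, a minimal deployment fragment showing resource requests/limits and health probes might look like the sketch below (all names, paths, and values are placeholders to adapt):
```bash
# Illustrative spec with resource requests/limits and health probes (placeholder values)
cat <<'EOF' > example-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 2
  selector:
    matchLabels: {app: example-app}
  template:
    metadata:
      labels: {app: example-app}
    spec:
      containers:
      - name: app
        image: example/app:1.0
        resources:
          requests: {cpu: 100m, memory: 128Mi}
          limits: {cpu: 500m, memory: 256Mi}
        readinessProbe:
          httpGet: {path: /healthz, port: 8080}
          initialDelaySeconds: 5
        livenessProbe:
          httpGet: {path: /healthz, port: 8080}
          periodSeconds: 10
EOF
kubectl apply -f example-deployment.yaml --dry-run=client   # Validate before applying for real
```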
## Advanced Debugging Techniques
### Debug Containers (Kubernetes 1.23+)
```bash
# Attach ephemeral debug container
kubectl debug <pod-name> -n <namespace> -it --image=nicolaka/netshoot
# Create debug copy of pod
kubectl debug <pod-name> -n <namespace> -it --copy-to=<debug-pod-name> --container=<container>
```
### Port Forwarding for Testing
```bash
# Forward pod port to local machine
kubectl port-forward pod/<pod-name> -n <namespace> <local-port>:<pod-port>
# Forward service port
kubectl port-forward svc/<service-name> -n <namespace> <local-port>:<service-port>
```
### Proxy for API Access
```bash
# Start kubectl proxy
kubectl proxy --port=8080
# Access API
curl http://localhost:8080/api/v1/namespaces/<namespace>/pods/<pod-name>
```
### Custom Column Output
```bash
# Custom pod info
kubectl get pods -o custom-columns=NAME:.metadata.name,STATUS:.status.phase,IP:.status.podIP
# Node taints
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
```
## Troubleshooting Checklist
Before escalating issues, verify:
- [ ] Reviewed pod events: `kubectl describe pod`
- [ ] Checked pod logs (current and previous)
- [ ] Verified resource availability on nodes
- [ ] Confirmed image exists and is accessible
- [ ] Validated service selectors match pod labels
- [ ] Tested DNS resolution from pods
- [ ] Checked network policies
- [ ] Reviewed recent cluster events
- [ ] Confirmed ConfigMaps/Secrets exist
- [ ] Validated RBAC permissions
- [ ] Checked for resource quotas/limits
- [ ] Reviewed cluster component health
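A single sweep like the sketch below covers most of these checks in one pass (names are placeholders; adjust the service account used for the RBAC check):
```bash
# Pre-escalation evidence sweep (sketch only)
NS=<namespace>; POD=<pod-name>
kubectl describe pod "$POD" -n "$NS"
kubectl logs "$POD" -n "$NS" --tail=100
kubectl logs "$POD" -n "$NS" --previous --tail=100 || true
kubectl describe nodes | grep -A 5 'Allocated resources'
kubectl get svc,endpoints,networkpolicies,resourcequota -n "$NS"
kubectl get configmaps,secrets -n "$NS"
kubectl auth can-i --list -n "$NS" --as=system:serviceaccount:"$NS":default
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | tail -30
```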
## Related Tools
Useful additional tools for Kubernetes debugging:
- **kubectl-debug**: Advanced debugging plugin
- **stern**: Multi-pod log tailing
- **kubectx/kubens**: Context and namespace switching
- **k9s**: Terminal UI for Kubernetes
- **lens**: Desktop IDE for Kubernetes
- **Prometheus/Grafana**: Monitoring and alerting
- **Jaeger/Zipkin**: Distributed tracing
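For example, stern condenses multi-pod log tailing into a single command (flag names as in recent stern releases):
```bash
# Tail logs from all pods matching a name pattern, across their containers
stern <pod-name-pattern> -n <namespace> --since 10m
```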
## Summary
This skill is a comprehensive Kubernetes debugging and troubleshooting toolkit for diagnosing cluster, node, pod, network, storage, and deployment issues. It bundles diagnostic scripts, workflows, and command patterns to quickly gather evidence, narrow root causes, and apply targeted fixes. Use it to standardize incident response and capture reproducible diagnostics for analysis.
The toolkit runs layered diagnostics from pod level to cluster level: pod diagnostics collect status, events, logs, and resource usage; network checks validate DNS, endpoints, and network policies; and cluster health scripts enumerate node conditions, failed pods, PVC/PV states, and control-plane component health. It also provides step-by-step workflows and a common-issues database to guide remediation and verification.
## FAQ
**What core scripts are included and what do they collect?**
Key scripts collect pod diagnostics (status, events, logs, YAML, resource usage), cluster health (nodes, pods, deployments, PVCs, component health), and network diagnostics (DNS, endpoints, policies, connectivity tests).
**How should I preserve evidence during debugging?**
Save script outputs and logs to timestamped files, capture `kubectl describe` output and events before restarting pods, and export metrics snapshots for trend analysis to support root-cause investigation and post-incident reports.