
kubernetes-troubleshooting skill


This skill helps you debug Kubernetes issues quickly by guiding you through pod failures, resource constraints, and networking problems with actionable steps.

npx playbooks add skill nik-kale/sre-skills --skill kubernetes-troubleshooting

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
---
name: kubernetes-troubleshooting
description: Systematic debugging workflows for Kubernetes issues including pod failures, resource problems, and networking. Use when debugging CrashLoopBackOff, OOMKilled, ImagePullBackOff, pod not starting, k8s issues, or any Kubernetes troubleshooting.
---

# Kubernetes Troubleshooting

Systematic approach to debugging Kubernetes issues.

## When to Use This Skill

- Pod stuck in CrashLoopBackOff
- OOMKilled errors
- ImagePullBackOff failures
- Pod not starting or scheduling
- Service connectivity issues
- Resource constraint problems

## Quick Diagnostic Commands

Start with these commands to understand the current state:

```bash
# Cluster overview
kubectl get nodes
kubectl get pods -A | grep -v Running

# Specific namespace
kubectl get pods -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>
```

## Pod Debugging Workflow

### Step 1: Check Pod Status

```bash
kubectl get pod <pod-name> -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
```

Look for:

- **Status**: What state is the pod in?
- **Conditions**: Ready, ContainersReady, PodScheduled
- **Events**: Recent events at the bottom of describe output

### Step 2: Identify the Problem Category

| Symptom               | Likely Cause         | Go To Section                           |
| --------------------- | -------------------- | --------------------------------------- |
| Pending               | Scheduling issue     | [Scheduling Issues](#scheduling-issues) |
| CrashLoopBackOff      | Application crash    | [CrashLoopBackOff](#crashloopbackoff)   |
| ImagePullBackOff      | Image/registry issue | [Image Pull Issues](#image-pull-issues) |
| OOMKilled             | Memory exhaustion    | [OOMKilled](#oomkilled)                 |
| Running but not Ready | Health check failing | [Readiness Issues](#readiness-issues)   |
| Error                 | Container error      | [Container Errors](#container-errors)   |

## Common Issues

### Scheduling Issues

Pod stuck in **Pending** state.

**Diagnostic**:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
```

**Common Causes**:

| Event Message                   | Cause                | Fix                              |
| ------------------------------- | -------------------- | -------------------------------- |
| Insufficient cpu/memory         | Not enough resources | Add nodes or reduce requests     |
| node(s) had taints              | Node taints          | Add tolerations or remove taints |
| no nodes available              | No matching nodes    | Check node selector/affinity     |
| persistentvolumeclaim not found | PVC missing          | Create the PVC                   |

**Fix Resource Issues**:

```bash
# Check resource requests vs available
kubectl describe nodes | grep -A 5 "Allocated resources"

# Check pending pod requests
kubectl get pod <pod> -o yaml | grep -A 10 resources
```
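
If nodes genuinely cannot fit the pod, the alternatives are adding capacity or lowering the pod's requests (only if the app really needs less). A rough sketch of trimmed requests; the numbers are placeholders:

```yaml
resources:
  requests:
    cpu: "100m"      # lower the request if the app does not need more
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "256Mi"
```

**Fix Taint/Affinity Issues**:

If the events point at taints or a node selector mismatch, a toleration or selector change on the pod spec is usually the fix. A minimal sketch; the `dedicated` taint key and `disktype` label are placeholders for whatever your nodes actually use:

```yaml
spec:
  # Tolerate a hypothetical "dedicated=infra:NoSchedule" taint
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "infra"
    effect: "NoSchedule"
  # Only schedule onto nodes carrying this (placeholder) label
  nodeSelector:
    disktype: ssd
```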

---

### CrashLoopBackOff

Container keeps crashing and restarting.

**Diagnostic**:

```bash
# Check container logs (current)
kubectl logs <pod-name> -n <namespace>

# Check previous container logs
kubectl logs <pod-name> -n <namespace> --previous

# Check exit code
kubectl describe pod <pod-name> -n <namespace> | grep -A 3 "Last State"
```

**Common Exit Codes**:

| Exit Code | Meaning           | Common Cause                                        |
| --------- | ----------------- | --------------------------------------------------- |
| 0         | Success           | Process completed (might be wrong for long-running) |
| 1         | Application error | Check application logs                              |
| 137       | SIGKILL (OOM)     | Memory limit exceeded                               |
| 139       | SIGSEGV           | Segmentation fault                                  |
| 143       | SIGTERM           | Graceful termination                                |

**Common Fixes**:

- Check application logs for startup errors
- Verify environment variables and secrets (see the commands sketched below)
- Check if dependencies are available
- Verify resource limits aren't too restrictive
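
To verify the environment, secrets, and config the container expects, a few commands can help (names in angle brackets are placeholders; `exec` only works if the container stays up long enough):

```bash
# What the container actually sees (only works while it is running)
kubectl exec <pod-name> -n <namespace> -- env | sort

# Do the referenced Secret and ConfigMap exist?
kubectl get secret <secret-name> -n <namespace>
kubectl get configmap <configmap-name> -n <namespace>

# Which secrets/configmaps does the pod spec reference?
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -E "secretKeyRef|configMapKeyRef|envFrom" -A 2
```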

---

### Image Pull Issues

**ImagePullBackOff** or **ErrImagePull**.

**Diagnostic**:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Events
```

**Common Causes**:

| Error                     | Cause                | Fix                    |
| ------------------------- | -------------------- | ---------------------- |
| repository does not exist | Wrong image name     | Fix image name/tag     |
| unauthorized              | Auth failure         | Check imagePullSecrets |
| manifest unknown          | Tag doesn't exist    | Verify tag exists      |
| connection refused        | Registry unreachable | Check network/firewall |

**Fix Registry Auth**:

```bash
# Create image pull secret
kubectl create secret docker-registry regcred \
  --docker-server=<registry> \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>
```

Reference it in the pod spec:

```yaml
spec:
  imagePullSecrets:
  - name: regcred
```
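
To confirm the secret exists and the pod actually references it (assuming the `regcred` name from above):

```bash
kubectl get secret regcred -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.imagePullSecrets[*].name}'
```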

---

### OOMKilled

Container killed due to memory exhaustion.

**Diagnostic**:

```bash
kubectl describe pod <pod-name> -n <namespace> | grep -i oom
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 lastState
```

**Fix Options**:

1. **Increase memory limit** (if available):

```yaml
resources:
  limits:
    memory: '512Mi' # Increase this
  requests:
    memory: '256Mi'
```

2. **Profile memory usage**:

```bash
kubectl top pod <pod-name> -n <namespace> --containers
```

3. **Check for memory leaks** in application code
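
For option 3, it helps to distinguish a steady leak from a one-off spike by sampling usage over time. A rough sketch; the 30-second interval is arbitrary:

```bash
# Sample container memory periodically and watch for steady growth
while true; do
  kubectl top pod <pod-name> -n <namespace> --containers --no-headers
  sleep 30
done
```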

---

### Readiness Issues

Pod is Running but not Ready.

**Diagnostic**:

```bash
# Check readiness probe
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Readiness

# Check probe endpoint manually
kubectl exec <pod-name> -n <namespace> -- wget -qO- localhost:<port>/health
```

**Common Causes**:

- Application not listening on expected port
- Readiness endpoint returning non-200
- Probe timeout too short
- Dependencies not available

**Fix Readiness Probe**:

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 10 # Give app time to start
  periodSeconds: 5
  timeoutSeconds: 3 # Increase if needed
  failureThreshold: 3
```

---

### Container Errors

**Diagnostic**:

```bash
# Get detailed container status
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*]}'

# Check init containers
kubectl logs <pod-name> -n <namespace> -c <init-container-name>
```
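
If you are not sure what to pass to `-c`, the container and init container names can be listed from the pod spec (standard jsonpath, no extra tooling assumed):

```bash
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.initContainers[*].name}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].name}'
```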

---

## Networking Troubleshooting

### Service Not Reachable

```bash
# Check service endpoints
kubectl get endpoints <service-name> -n <namespace>

# Check service selector matches pod labels
kubectl get svc <service-name> -n <namespace> -o yaml | grep selector -A 5
kubectl get pods -n <namespace> --show-labels

# Test connectivity from another pod
kubectl run debug --rm -it --restart=Never --image=busybox -- wget -qO- <service>:<port>
```
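
If the endpoints list is empty, the service selector almost certainly does not match the pod labels. A minimal sketch of what has to line up; `app: my-api` is a placeholder label:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api        # must match the pod's metadata.labels
  ports:
  - port: 80
    targetPort: 8080   # must match the port the container listens on
---
apiVersion: v1
kind: Pod
metadata:
  name: my-api-pod
  labels:
    app: my-api        # the selector above matches this label
spec:
  containers:
  - name: app
    image: <image>
    ports:
    - containerPort: 8080
```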

### DNS Issues

```bash
# Check DNS resolution from pod
kubectl exec <pod> -n <namespace> -- nslookup <service-name>
kubectl exec <pod> -n <namespace> -- nslookup <service-name>.<namespace>.svc.cluster.local

# Check CoreDNS is running
kubectl get pods -n kube-system -l k8s-app=kube-dns
```
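
If CoreDNS pods are running but lookups still fail, their logs are the next place to look:

```bash
# Look for SERVFAIL, upstream timeouts, or config errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```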

---

## Resource Analysis

### Node Pressure

```bash
# Check node conditions
kubectl describe nodes | grep -A 5 Conditions

# Check node resource usage
kubectl top nodes

# Find resource-heavy pods
kubectl top pods -A --sort-by=memory | head -20
```

### PVC Issues

```bash
# Check PVC status
kubectl get pvc -n <namespace>

# Check PV status
kubectl get pv

# Describe for events
kubectl describe pvc <pvc-name> -n <namespace>
```
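
A PVC stuck in Pending often means no StorageClass can satisfy it. Quick checks, assuming dynamic provisioning:

```bash
# List available storage classes (and which one is the default)
kubectl get storageclass

# Which class did the PVC ask for?
kubectl get pvc <pvc-name> -n <namespace> -o jsonpath='{.spec.storageClassName}'
```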

---

## Quick Reference Commands

```bash
# Pod debugging
kubectl logs <pod> -n <ns>                    # Current logs
kubectl logs <pod> -n <ns> --previous         # Previous container logs
kubectl logs <pod> -n <ns> -c <container>     # Specific container
kubectl logs <pod> -n <ns> --tail=100 -f      # Follow logs

# Interactive debugging
kubectl exec -it <pod> -n <ns> -- /bin/sh     # Shell into container
kubectl exec <pod> -n <ns> -- env             # Check environment
kubectl exec <pod> -n <ns> -- cat /etc/resolv.conf  # Check DNS config

# Resource inspection
kubectl get pod <pod> -n <ns> -o yaml         # Full pod spec
kubectl describe pod <pod> -n <ns>            # Events and status
kubectl get events -n <ns> --sort-by='.lastTimestamp'

# Cluster-wide
kubectl get pods -A | grep -v Running         # Non-running pods
kubectl top pods -A --sort-by=cpu             # CPU usage
kubectl top pods -A --sort-by=memory          # Memory usage
```

## Additional Resources

- [Error Message Decoder](references/error-decoder.md)
- [kubectl Cheat Sheet](references/kubectl-cheatsheet.md)

Overview

This skill provides systematic debugging workflows for common Kubernetes problems like pod failures, resource exhaustion, image pull errors, and networking issues. It gives a concise set of diagnostic commands, targeted troubleshooting steps, and practical fixes to accelerate incident resolution. Use it to triage CrashLoopBackOff, OOMKilled, ImagePullBackOff, readiness failures, scheduling problems, and service connectivity issues.

How this skill works

The skill walks through a layered diagnostic approach: cluster-level checks, pod inspection, logs and exit codes, probe validation, and resource analysis. It maps common symptoms to likely causes and prescribes concrete kubectl commands and configuration fixes (resources, probes, imagePullSecrets, node affinity/taints). It also includes quick commands for interactive debugging and network/DNS tests from within the cluster.

When to use it

  • Pod stuck in CrashLoopBackOff or continuously restarting
  • Pods terminated with OOMKilled or unexpected exit codes
  • ImagePullBackOff or ErrImagePull errors
  • Pod stuck Pending or unschedulable due to resource/taint/affinity issues
  • Service unreachable, DNS failures, or endpoints not matching selectors
  • Cluster resource pressure or PVC/PV provisioning problems

Best practices

  • Start with kubectl get pods -A and kubectl get events -n <ns> to capture recent failures
  • Examine current and previous container logs (kubectl logs --previous) and describe output for events and exit codes
  • Validate readiness/liveness probes and increase initialDelay/timeout before changing app code
  • Use resource requests/limits conservatively and profile memory/cpu with kubectl top before raising limits
  • Create and reference imagePullSecrets for private registries; verify image tags and registry reachability
  • Debug networking by exec-ing into a debug pod and testing DNS and service endpoints

Example use cases

  • Triage a CrashLoopBackOff: check kubectl logs, previous logs, and container Last State to find startup errors or OOMs
  • Resolve ImagePullBackOff: verify image name/tag, check imagePullSecrets, and confirm registry connectivity
  • Fix Pending pods: inspect node capacity, taints, and pod affinity/requests to decide whether to add nodes or adjust requests
  • Diagnose service reachability: confirm service selector matches pod labels, check endpoints, and test from a debug pod
  • Investigate OOMKilled: run kubectl top, increase memory limits if appropriate, and profile application memory usage

FAQ

What kubectl commands should I run first when a pod fails?

Start with kubectl get pods -n <ns>, kubectl describe pod <pod> -n <ns>, and kubectl logs <pod> -n <ns> (add --previous if restarting). Then check events with kubectl get events -n <ns> --sort-by='.lastTimestamp'.

How do I know if an OOMKilled was caused by the app or limits?

Check the pod describe output for OOM messages and the container exit code 137. Use kubectl top pod <pod> -n <ns> --containers to observe peak memory. If usage exceeds the limit, increase limits or profile the app for leaks.