
k8s-troubleshoot skill


/jcommon/mcp/mcp-smartsre/.claude/skills/k8s-troubleshoot

This skill helps you diagnose Kubernetes applications: it finds pods by label selector and runs diagnostic commands inside their containers.

npx playbooks add skill xiaomi/mone --skill k8s-troubleshoot



---
name: k8s-troubleshoot
description: Kubernetes troubleshooting toolkit - search pods by labels and execute diagnostic commands inside containers. Use when user reports service errors, exceptions, crashes, timeouts, or needs to check logs, processes, network, or resource usage in K8s pods.
---

# Kubernetes Troubleshooting Skill

A complete toolkit for diagnosing Kubernetes applications. Find pods by labels, then execute commands inside containers for deep diagnostics.

## When to Use

- The user reports service errors, exceptions, failures, or timeouts
- You need to check application logs or process status
- You need to diagnose network, memory, or disk issues
- Keywords: error, exception, failed, timeout, crash, not working, logs, troubleshoot, diagnose, pod, container

## Workflow

1. **Search pods** - Find target pods by label selector
2. **Execute diagnostics** - Run commands inside containers

## Scripts

### 1. Search Pods

Find pods by label selector:

```bash
uv run python .claude/skills/k8s-troubleshoot/scripts/search_pods.py -l "app=nginx" -n default
```

| Parameter | Required | Description |
|-----------|----------|-------------|
| `-l, --label-selector` | Yes | Label selector, e.g., `app=nginx` or `project-id=123,pipeline-id=456` |
| `-n, --namespace` | No | Namespace (default: `default`). Use `all` for all namespaces |

**Output**: JSON with `success`, `podCount`, `pods` (name, namespace, phase, containers)
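When the search is driven from a script, the JSON result can be filtered before any command is run. The sketch below builds the argument list from the flags in the table above and picks out pods whose phase is `Running`. The helper names and the sample output are hypothetical: the sample assumes `pods` is a list of objects with `name`, `namespace`, `phase`, and `containers` keys, which may not match the script's exact layout.

```python
import json

SEARCH_SCRIPT = ".claude/skills/k8s-troubleshoot/scripts/search_pods.py"


def build_search_argv(label_selector, namespace="default"):
    """Return the argv list for search_pods.py, using the documented flags."""
    return ["uv", "run", "python", SEARCH_SCRIPT, "-l", label_selector, "-n", namespace]


def running_pod_names(search_output):
    """Parse the script's JSON output and return the names of Running pods."""
    result = json.loads(search_output)
    if not result.get("success"):
        return []
    return [pod["name"] for pod in result.get("pods", []) if pod.get("phase") == "Running"]


# Hypothetical output, hand-written for illustration; the real layout may differ.
SAMPLE_OUTPUT = json.dumps({
    "success": True,
    "podCount": 2,
    "pods": [
        {"name": "nginx-a", "namespace": "default", "phase": "Running", "containers": ["nginx"]},
        {"name": "nginx-b", "namespace": "default", "phase": "Pending", "containers": ["nginx"]},
    ],
})
```

Passing `"all"` as the namespace produces `-n all`, which the table above documents as a search across every namespace.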

### 2. Execute Command in Pod

Run diagnostic commands inside a container:

```bash
uv run python .claude/skills/k8s-troubleshoot/scripts/exec_pod.py -p "pod-name" -n default -cmd "tail -n 100 /root/logs/app.log"
```

| Parameter | Required | Description |
|-----------|----------|-------------|
| `-p, --pod` | Yes | Pod name |
| `-n, --namespace` | No | Namespace (default: `default`) |
| `-c, --container` | No | Container name (for multi-container pods) |
| `-cmd, --command` | Yes | Command to execute |

**Output**: JSON with `success`, `pod`, `namespace`, `command`, `output`
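Commands that contain pipes or `&&` are easier to pass safely as a single argument than through another layer of shell quoting. The following sketch assembles the argument list for `exec_pod.py` from the documented flags and reads the `output` field from the result. The helper names are hypothetical, and the sketch assumes the script prints one JSON object with the fields listed above.

```python
import json

EXEC_SCRIPT = ".claude/skills/k8s-troubleshoot/scripts/exec_pod.py"


def build_exec_argv(pod, command, namespace="default", container=None):
    """Return the argv list for exec_pod.py; `-c` is added only when a container is named."""
    argv = ["uv", "run", "python", EXEC_SCRIPT, "-p", pod, "-n", namespace]
    if container:
        argv += ["-c", container]
    # The whole command travels as one argv element, so pipes need no extra quoting.
    argv += ["-cmd", command]
    return argv


def command_output(exec_output):
    """Return the `output` field, or raise if the script reported failure."""
    result = json.loads(exec_output)
    if not result.get("success"):
        raise RuntimeError(f"exec failed in pod {result.get('pod')!r}")
    return result.get("output", "")
```

A list built this way can be handed directly to `subprocess.run(argv, capture_output=True, text=True)` without `shell=True`.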

## Common Diagnostic Patterns

### View application logs
```bash
uv run python .claude/skills/k8s-troubleshoot/scripts/exec_pod.py -p my-pod -n default -cmd "tail -n 100 /root/logs/app.log"
```

### Check Nacos config (dubbo3 issues)
```bash
uv run python .claude/skills/k8s-troubleshoot/scripts/exec_pod.py -p my-pod -n default -cmd "grep nacos /root/logs/nacos/config.log"
```

### Check processes
```bash
uv run python .claude/skills/k8s-troubleshoot/scripts/exec_pod.py -p my-pod -n default -cmd "ps aux | head -20"
```

### Check network
```bash
uv run python .claude/skills/k8s-troubleshoot/scripts/exec_pod.py -p my-pod -n default -cmd "netstat -tlnp"
```

### Check disk and memory
```bash
uv run python .claude/skills/k8s-troubleshoot/scripts/exec_pod.py -p my-pod -n default -cmd "df -h && free -m"
```

## Troubleshooting Tips

| Issue | Diagnostic Command |
|-------|-------------------|
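| dubbo3 no provider | Check `/root/logs/nacos/config.log` for the Nacos address |
| Service not responding | Check process status with `ps aux`, then review the application logs |
| Connection issues | Check listening ports with `netstat -tlnp` |
| OOM errors | Check memory with `free -m`; in most containers this reports node-level memory, so compare it against the pod's memory limit |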
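The table can also be expressed as a small lookup that suggests a first command from a free-text issue description. This is only a sketch: the commands come from this document, but the keyword lists and the matching rule are illustrative assumptions, not part of the skill.

```python
import re

# Symptom keywords -> first diagnostic command, transcribed from the table above.
# The keyword choices are hypothetical and deliberately narrow.
CHECKS = [
    (("dubbo3", "no provider"), "grep nacos /root/logs/nacos/config.log"),
    (("not responding", "unresponsive"), "ps aux | head -20"),
    (("connection",), "netstat -tlnp"),
    (("oom", "oomkilled", "out of memory"), "free -m"),
]


def suggest_commands(issue):
    """Return the commands whose keywords appear as whole words in the issue text."""
    text = issue.lower()

    def mentioned(keyword):
        # Word boundaries keep "oom" from matching words such as "room".
        return re.search(r"\b" + re.escape(keyword) + r"\b", text) is not None

    return [command for keywords, command in CHECKS if any(mentioned(k) for k in keywords)]
```

Each suggested string can be passed unchanged as the `-cmd` value of `exec_pod.py`.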

Overview

This skill is a Kubernetes troubleshooting toolkit that helps you find pods by label and run diagnostic commands inside their containers. It is designed for rapid diagnosis of service errors, crashes, timeouts, and resource or network problems in Kubernetes clusters. Use it to inspect logs, processes, network sockets, and resource usage without leaving your CLI.

How this skill works

First, it searches the cluster for pods matching a label selector and returns structured JSON with pod names, namespaces, phases, and containers. Then it executes a user-provided shell command inside a chosen container and returns the command output as JSON. This supports both scripted checks (logs, ps, netstat, df/free) and ad-hoc diagnostics.
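The two steps can be chained in a short script. The sketch below searches by label, then runs one command in every pod found and collects the outputs by pod name. It assumes each script prints a single JSON object to stdout and that every entry in `pods` carries `name` and `namespace` keys; `run_script` and `diagnose` are hypothetical helper names.

```python
import json
import subprocess

SCRIPTS = ".claude/skills/k8s-troubleshoot/scripts"


def run_script(argv):
    """Run one skill script and return its parsed JSON output (assumed to be on stdout)."""
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(done.stdout)


def diagnose(label_selector, command, namespace="default", runner=run_script):
    """Search pods by label, run `command` in each, and map pod name -> output."""
    base = ["uv", "run", "python"]
    found = runner(base + [f"{SCRIPTS}/search_pods.py", "-l", label_selector, "-n", namespace])
    outputs = {}
    for pod in found.get("pods", []):
        # Use the namespace reported for each pod so a search with `-n all` still works.
        result = runner(base + [f"{SCRIPTS}/exec_pod.py", "-p", pod["name"],
                                "-n", pod["namespace"], "-cmd", command])
        outputs[pod["name"]] = result.get("output") if result.get("success") else None
    return outputs
```

The `runner` parameter exists so the flow can be exercised with a stub in place of a live cluster; a call such as `diagnose("app=nginx", "df -h && free -m")` would use the real scripts.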

When to use it

- The user reports service errors, exceptions, crashes, or timeouts
- You need to tail or inspect application logs inside a pod
- You need to verify process status or application health within a container
- You need to diagnose network connectivity or listening ports in a pod
- You need to check disk space, memory usage, or OOM-related issues

Best practices

- Use precise label selectors to limit returned pods and avoid noisy results
- Run read-only diagnostic commands by default; avoid destructive commands in production
- Specify the container when pods are multi-container to target the right runtime
- Combine search and exec in scripts for repeatable checks and alerts
- Capture and store JSON outputs for post-mortem analysis and sharing with teams
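One way to enforce the read-only default in a wrapper script is to check each command against an allowlist before it is sent to a pod. The sketch below is a conservative heuristic, not a security boundary. The allowlist is an assumption that contains only the tools used in this document's examples, and anything with redirection, command substitution, or variables is refused.

```python
import shlex

# Assumed allowlist: the read-only tools that appear in this document's examples.
READ_ONLY = {"tail", "cat", "grep", "head", "ps", "netstat", "df", "free"}
SEPARATORS = {"|", "&&", "||", ";"}
PUNCTUATION = set("();<>|&")


def is_read_only(command):
    """True if every program in a simple pipeline or chain is on the allowlist."""
    if any(ch in command for ch in "`$\n"):
        return False  # command substitution, variables, multi-line input: refuse
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    expect_program, programs = True, 0
    try:
        for token in lexer:
            if token in SEPARATORS:
                if expect_program:
                    return False  # separator with no program before it
                expect_program = True
            elif set(token) <= PUNCTUATION:
                return False  # redirection, backgrounding, subshells: refuse
            elif expect_program:
                if token not in READ_ONLY:
                    return False
                expect_program = False
                programs += 1
    except ValueError:
        return False  # unbalanced quotes
    return programs > 0 and not expect_program
```

The check errs on the side of refusing: a command that legitimately needs a tool outside the list has to be added to `READ_ONLY` deliberately.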

Example use cases

- Find all nginx pods in a namespace: search pods by label and inspect phases and containers
- Tail recent application logs to identify errors: exec into the pod and run `tail -n 100` on the log file
- Verify the process list when a service is unresponsive: run `ps aux` inside the container
- Check network listeners for connection problems: run `netstat -tlnp` to confirm ports are bound
- Quick resource checks for suspected OOMs: run `df -h && free -m` inside the pod to inspect disk and memory

FAQ

Can I search across all namespaces?

Yes. Set the namespace parameter to `all` (`-n all`) to search every namespace.

What if a pod has multiple containers?

Pass the container name with the container parameter (`-c` / `--container`) to run commands in the correct container; if you omit it, the default container is used.
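A wrapper script can avoid relying on the default container by requiring an explicit choice whenever a pod has more than one. The sketch below takes a pod entry from the search result and returns the `-c` arguments for `exec_pod.py`. It assumes `containers` is a list of container name strings, which may differ from the script's actual output; `container_args` is a hypothetical helper.

```python
def container_args(pod, container=None):
    """Return the `-c` arguments for exec_pod.py, or raise if the choice is ambiguous."""
    names = pod.get("containers", [])
    if container is None:
        if len(names) > 1:
            raise ValueError(f"pod {pod['name']} has containers {names}; name one explicitly")
        return []  # single-container pod: the default is unambiguous
    if names and container not in names:
        raise ValueError(f"pod {pod['name']} has no container named {container!r}")
    return ["-c", container]
```

The returned list can be appended to the rest of the `exec_pod.py` argument list before the `-cmd` flag.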