home / skills / serendipityoneinc / srp-claude-code-marketplace / k8s-management

k8s-management skill

/plugins/srp-devops/skills/k8s-management

npx playbooks add skill serendipityoneinc/srp-claude-code-marketplace --skill k8s-management

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
8.5 KB
---
name: k8s-management
description: Kubernetes cluster management and monitoring (Kubernetes集群管理与监控)
---

# Kubernetes Management (Kubernetes 管理)

为运维人员提供 Kubernetes 集群管理和监控功能,包括查看 Pod 状态、日志、资源使用等。

Provides Kubernetes cluster management and monitoring for DevOps engineers, including pod status, logs, resource usage, and more.

## Quick Start

### View Pods (查看 Pod)

```
显示 production namespace 中的所有 pods
List all pods in the production namespace
```

### Check Pod Logs (查看 Pod 日志)

```
查看 pod api-server-abc123 的日志
Get logs from pod api-server-abc123
```

### Check Resource Usage (查看资源使用)

```
显示集群资源使用情况
Show cluster resource usage
```

## Key Features

### 1. Pod Management (Pod 管理)

View and monitor pods across namespaces:

**Common Operations:**
```bash
# List pods
kubectl get pods -n <namespace>

# Get pod details
kubectl describe pod <pod-name> -n <namespace>

# View pod logs
kubectl logs <pod-name> -n <namespace>

# Follow logs
kubectl logs -f <pod-name> -n <namespace>
```

### 2. Service and Deployment (服务与部署)

Monitor services and deployments:

**Common Operations:**
```bash
# List services
kubectl get services -n <namespace>

# List deployments
kubectl get deployments -n <namespace>

# Check deployment status
kubectl rollout status deployment/<name> -n <namespace>
```

### 3. Resource Monitoring (资源监控)

Check cluster and pod resource usage:

**Common Operations:**
```bash
# Node resource usage
kubectl top nodes

# Pod resource usage
kubectl top pods -n <namespace>

# Check cluster info
kubectl cluster-info
```

## Common Workflows

### Workflow 1: Troubleshoot Pod Issues

```markdown
1. User: "production namespace 中有哪些 pod 出问题了?"
2. List pods: kubectl get pods -n production
3. Identify pods with issues (Not Running, CrashLoopBackOff)
4. Get pod details: kubectl describe pod <problem-pod>
5. Check logs: kubectl logs <problem-pod>
6. Analyze error messages
7. Suggest solutions
```

### Workflow 2: Check Deployment Health

```markdown
1. User: "检查 api-server deployment 的健康状态"
2. Get deployment: kubectl get deployment api-server -n production
3. Check replicas: desired vs available
4. Check rollout status
5. View recent events
6. Report status and any issues
```

### Workflow 3: Monitor Resource Usage

```markdown
1. User: "集群资源使用情况如何?"
2. Get node metrics: kubectl top nodes
3. Get pod metrics: kubectl top pods --all-namespaces
4. Calculate utilization percentages
5. Identify resource-intensive pods
6. Provide recommendations
```

## Examples (示例)

### Example 1: List Pods (列出 Pods)

**User (Chinese):**
```
显示 production namespace 中的所有 pods
```

**Assistant Response:**
```markdown
我会列出 production namespace 中的所有 pods。

<executes: kubectl get pods -n production>

**Production Namespace - Pods:**

| Name | Status | Restarts | Age |
|------|--------|----------|-----|
| api-server-7d4f8b9c-abc123 | Running | 0 | 5d |
| api-server-7d4f8b9c-def456 | Running | 1 | 5d |
| web-app-5c8a7b-ghi789 | Running | 0 | 3d |
| worker-queue-9f3e2d-jkl012 | CrashLoopBackOff | 10 | 2h |
| database-6a4b5c-mno345 | Running | 0 | 15d |

**Summary:**
- Total: 5 pods
- Running: 4 pods
- Issues: 1 pod (worker-queue)

⚠️ **Alert**: worker-queue-9f3e2d-jkl012 is in CrashLoopBackOff状态。需要检查日志吗?
```

### Example 2: View Pod Logs (查看 Pod 日志)

**User (English):**
```
Get logs from pod worker-queue-9f3e2d-jkl012 in production namespace
```

**Assistant Response:**
```markdown
I'll retrieve the logs from the pod.

<executes: kubectl logs worker-queue-9f3e2d-jkl012 -n production --tail=50>

**Pod Logs: worker-queue-9f3e2d-jkl012**

```
[2026-01-16 10:15:23] INFO: Worker starting...
[2026-01-16 10:15:24] INFO: Connecting to Redis...
[2026-01-16 10:15:25] ERROR: Connection failed: redis://redis-service:6379
[2026-01-16 10:15:25] ERROR: Error: getaddrinfo ENOTFOUND redis-service
[2026-01-16 10:15:26] INFO: Retrying connection (1/10)...
[2026-01-16 10:15:27] ERROR: Connection failed: redis://redis-service:6379
[2026-01-16 10:15:28] FATAL: Max retries exceeded, exiting
```

**Analysis:**
- **Issue**: Cannot connect to Redis service
- **Root Cause**: DNS resolution failure for `redis-service`
- **Possible Reasons**:
  1. Redis service not running in the namespace
  2. Service name mismatch
  3. Network policy blocking connection

**Recommendations:**
1. Check if Redis service exists: `kubectl get svc redis-service -n production`
2. Verify service endpoints: `kubectl get endpoints redis-service -n production`
3. Check network policies
4. Review deployment configuration
```

### Example 3: Resource Usage (资源使用)

**User (Chinese):**
```
显示集群各节点的资源使用情况
```

**Assistant Response:**
```markdown
我会显示集群节点的资源使用情况。

<executes: kubectl top nodes>

**Cluster Resource Usage:**

| Node | CPU (cores) | CPU % | Memory (GB) | Memory % |
|------|-------------|-------|-------------|----------|
| node-1 | 3.2 / 8 | 40% | 12.5 / 32 | 39% |
| node-2 | 5.6 / 8 | 70% | 18.2 / 32 | 57% |
| node-3 | 2.1 / 8 | 26% | 8.7 / 32 | 27% |
| node-4 | 6.8 / 8 | 85% | 24.3 / 32 | 76% |

**Analysis:**
- **node-4** is running at high utilization (85% CPU, 76% Memory)
- **node-3** has the most available capacity
- Average cluster utilization: 55% CPU, 50% Memory

**Recommendations:**
1. Consider redistributing pods from node-4
2. Monitor node-4 for potential issues
3. Check if any pods need resource limit adjustments
4. Review horizontal pod autoscaling settings
```

## Important Notes

### Access Control (访问控制)
- Requires kubectl configured with cluster access
- User permissions defined by RBAC roles
- Some operations may be restricted based on role
- Always verify namespace access before operations

### Safety Guidelines (安全指南)
- **Read operations only** through this skill
- No pod deletion or modification
- No deployment changes
- No configuration updates
- For changes, use standard deployment processes

### Best Practices (最佳实践)
- Always specify namespace
- Use labels for filtering
- Check resource limits before scaling
- Monitor metrics before making decisions
- Document all investigations

## Prerequisites

### kubectl Configuration

Ensure kubectl is configured to access your clusters:

```bash
# Verify kubectl is installed
kubectl version --client

# List available contexts
kubectl config get-contexts

# Switch context if needed
kubectl config use-context <context-name>

# Verify access
kubectl get nodes
```

### GKE Authentication (GCP)

For GKE clusters:
```bash
# Authenticate with GCP
gcloud auth login

# Get cluster credentials
gcloud container clusters get-credentials <cluster-name> --region=<region>

# Verify access
kubectl cluster-info
```

### Required Permissions

Minimum RBAC permissions needed:
- `pods/list` - List pods
- `pods/get` - Get pod details
- `pods/log` - Read pod logs
- `services/list` - List services
- `deployments/list` - List deployments
- `nodes/list` - List nodes

## Limitations

### Current Limitations
- Read-only operations via kubectl commands
- No direct MCP server integration (uses kubectl CLI)
- No real-time monitoring dashboards
- No automated remediation
- Limited to configured clusters

### Future Enhancements
- Direct Kubernetes API integration
- Real-time pod monitoring
- Automated issue detection
- Integration with monitoring systems (Prometheus, Grafana)
- Pod restart and scaling capabilities
- Log aggregation and search

## Troubleshooting

### Issue 1: "kubectl: command not found"

**Solutions:**
1. Install kubectl: https://kubernetes.io/docs/tasks/tools/
2. Add to PATH
3. Restart Claude Code

### Issue 2: "Unable to connect to cluster"

**Solutions:**
1. Verify kubectl context: `kubectl config current-context`
2. Check cluster credentials
3. Verify network access
4. Re-authenticate if needed

### Issue 3: "Forbidden: User cannot list pods"

**Solutions:**
1. Check RBAC permissions
2. Contact cluster admin
3. Verify service account
4. Switch to correct namespace

## Related Skills

- `cloud-resources`: Manage cloud infrastructure
- Future: `monitoring-alerts`, `incident-response`

## Safety & Compliance

This skill provides **read-only** access to Kubernetes clusters:
- ✅ View pod status and logs
- ✅ Monitor resource usage
- ✅ Check service health
- ❌ Cannot delete pods
- ❌ Cannot modify deployments
- ❌ Cannot change configurations

For any modifications, use established CI/CD pipelines and change management processes.