
kubernetes skill


This skill generates production-grade Kubernetes manifests with security-first defaults and educational comments to streamline deployment and debugging.

npx playbooks add skill mjunaidca/mjs-agent-skills --skill kubernetes


---
name: kubernetes
description: |-
  Production-grade Kubernetes manifests and debugging for containerized applications.
  This skill should be used when users ask to deploy to Kubernetes, create K8s manifests,
  containerize for K8s, set up Deployments/Services/Jobs/StatefulSets/CronJobs, create
  namespaces with resource quotas, set up multi-team isolation, configure ResourceQuota/
  LimitRange, secure with RBAC (ServiceAccount, Role, RoleBinding), configure init
  containers (model download, db wait, migrations), set up sidecars (logging, metrics),
  or debug pods (CrashLoopBackOff, logs, exec, describe, events). Auto-detects from
  Dockerfile/code, generates hardened manifests with educational comments. CKAD-aligned.
hooks:
  PreToolUse:
    - matcher: "Bash"
      hooks:
        - type: command
          command: "bash \"$CLAUDE_PROJECT_DIR\"/.claude/hooks/verify-kubectl-context.sh"
---

# Kubernetes

Production-grade K8s manifests with security-first defaults and educational comments.

---

## Resource Detection & Adaptation

**Before generating manifests, detect the target environment:**

```bash
# Detect node resources
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}: {.status.capacity.memory}, {.status.capacity.cpu}{"\n"}{end}'

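# Detect if Docker Desktop (local) or real cluster.
# A missing label prints an empty string with exit code 0, so default explicitly.
instance_type=$(kubectl get nodes -o jsonpath='{.items[0].metadata.labels.node\.kubernetes\.io/instance-type}' 2>/dev/null)
echo "${instance_type:-local}"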

# Detect available resources
kubectl describe nodes | grep -A 5 "Allocated resources"
```

**Adapt configurations based on detection:**

| Detected Environment | Profile | Default Memory (request-limit) | Agent Action |
|---------------------|---------|----------------|--------------|
| Docker Desktop < 6GB | Minimal | 128Mi-256Mi | Warn, reduce replicas |
| Docker Desktop 6-16GB | Standard | 256Mi-512Mi | Normal deployment |
| Cloud/Real cluster | Production | Based on node size | Full features |

### Agent Behavior

1. **Detect** cluster type and resources before generating manifests
2. **Adapt** resource requests/limits to cluster capacity
3. **Warn** if requested workload exceeds available resources
4. **Calculate** safe limits: `(node_memory * 0.7) / expected_pod_count`
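
The safe-limit formula in step 4 can be sketched as a small shell helper. This is a minimal sketch, not the skill's actual implementation: the function names `to_mi` and `safe_limit_mi` are hypothetical, and it assumes memory is reported with a `Ki`, `Mi`, or `Gi` suffix.

```bash
# Convert a Kubernetes memory quantity (Ki, Mi, Gi) to whole MiB.
# Hypothetical helper; other suffixes (M, G, plain bytes) are not handled.
to_mi() {
  case "$1" in
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# safe_limit_mi NODE_MEMORY_MI EXPECTED_POD_COUNT
# Implements (node_memory * 0.7) / expected_pod_count with integer math.
# Assumes EXPECTED_POD_COUNT is greater than zero.
safe_limit_mi() {
  echo $(( $1 * 7 / 10 / $2 ))
}

# Example: an 8Gi node shared by 10 pods
safe_limit_mi "$(to_mi 8Gi)" 10   # prints 573 (MiB per pod)
```

Integer division rounds down, which errs on the side of a smaller limit.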

### Adaptive Resource Templates

**Local/Constrained (< 6GB allocatable):**
```yaml
resources:
  requests:
    memory: 128Mi
    cpu: 100m
  limits:
    memory: 256Mi
    cpu: 500m
```

**Standard (6-16GB allocatable):**
```yaml
resources:
  requests:
    memory: 256Mi
    cpu: 100m
  limits:
    memory: 512Mi
    cpu: 1000m
```

**Production (> 16GB or cloud):**
```yaml
resources:
  requests:
    memory: 512Mi
    cpu: 250m
  limits:
    memory: 1Gi
    cpu: 2000m
```
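
Choosing between the three templates can be expressed as a threshold check on allocatable memory. The sketch below is illustrative only: `select_profile` is a hypothetical name, memory is assumed to be given in MiB, and the optional second argument marking a cloud cluster is an assumption about how detection results would be passed in.

```bash
# select_profile ALLOCATABLE_MI [IS_CLOUD]
# Maps allocatable memory (MiB) to one of the three profiles above.
# 6144 MiB = 6GB and 16384 MiB = 16GB, matching the template headings.
select_profile() {
  local mem_mi="$1" is_cloud="${2:-false}"
  if [ "$is_cloud" = "true" ] || [ "$mem_mi" -gt 16384 ]; then
    echo production
  elif [ "$mem_mi" -ge 6144 ]; then
    echo standard
  else
    echo minimal
  fi
}

select_profile 4096        # prints: minimal
select_profile 8192        # prints: standard
select_profile 4096 true   # prints: production
```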

### Pre-Deployment Validation

Before applying manifests, the agent should verify that the workload fits on the node:
```bash
# Read the first node's allocatable memory, then compare it with the total requested memory (replicas x per-pod request)
kubectl get nodes -o jsonpath='{.items[0].status.allocatable.memory}'
```

If insufficient: warn the user and suggest scaling down or increasing Docker Desktop resources.
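
The comparison itself is simple arithmetic once the allocatable value is known. A minimal sketch, assuming all values are already in MiB; `fits_on_node` is a hypothetical helper, and it ignores memory already requested by other pods on the node, so treat a pass as a lower bound rather than a guarantee.

```bash
# fits_on_node ALLOCATABLE_MI REQUEST_MI REPLICAS
# Succeeds (exit 0) when the combined memory requests fit, fails otherwise.
fits_on_node() {
  local allocatable_mi="$1" request_mi="$2" replicas="$3"
  local needed_mi=$(( request_mi * replicas ))
  if [ "$needed_mi" -le "$allocatable_mi" ]; then
    echo "ok: need ${needed_mi}Mi of ${allocatable_mi}Mi allocatable"
  else
    echo "insufficient: need ${needed_mi}Mi but only ${allocatable_mi}Mi allocatable" >&2
    return 1
  fi
}

# Example: two replicas requesting 256Mi each on a node with 4096Mi allocatable
fits_on_node 4096 256 2   # prints: ok: need 512Mi of 4096Mi allocatable
```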

---

## What This Skill Does

**Analysis & Detection:**
- Auto-detects from Dockerfile: ports, health endpoints, resources
- Identifies workload type from project structure
- Reads existing manifests to understand patterns
- Detects GPU requirements from dependencies

**Generation:**
- Creates production-hardened manifests (non-root, read-only, resource limits)
- Generates all supporting resources (Service, ConfigMap, HPA, PDB)
- Creates namespace governance (ResourceQuota, LimitRange, NetworkPolicy)
- Supports multi-team isolation with environment progression (dev → staging → prod)
- Adds educational comments explaining WHY each config choice was made
- Outputs ArgoCD-compatible directory structure

**Validation:**
- Verifies kubectl context exists
- Creates namespace if needed
- Deploys to local cluster (kind/minikube)
- Confirms pods are running before delivering

**Security:**
- Non-root user by default (runAsNonRoot: true)
- Read-only root filesystem
- No privilege escalation
- Dropped capabilities
- Resource limits always set
- **Unprivileged ports only** (>=1024) - binding privileged ports (<1024) requires root or the `NET_BIND_SERVICE` capability, which is removed when all capabilities are dropped

## What This Skill Does NOT Do

- Generate Helm charts (noted in references as a future option)
- Create Kustomize overlays (noted in references as a future option)
- Handle Dapr sidecar injection (separate skill)
- Deploy Kafka/Strimzi operators (separate skill)
- Generate ArgoCD `Application` custom resources (separate skill)

---

## Before Implementation

Gather context to ensure successful implementation:

| Source | Gather |
|--------|--------|
| **Codebase** | Dockerfile, existing manifests, port/health patterns |
| **Conversation** | Target environment, namespace, special requirements |
| **Skill References** | Security contexts, health probes, resource limits |
| **User Guidelines** | Cluster conventions, naming standards |

---

## Required Clarifications

After auto-detection, confirm with the user anything that remains ambiguous:

| Topic | Question to Ask |
|----------|-------------|
| Target environment | "Deploying to local (kind/minikube) or remote cluster?" |
| Namespace | "Use existing namespace or create new?" |
| Image availability | "Is image in registry or needs to be built/loaded?" |
| Service exposure | "Internal only (ClusterIP) or external access needed?" |
| Namespace governance | "Need ResourceQuota/LimitRange for resource isolation?" |
| Multi-team setup | "Single team or multi-team with namespace isolation?" |
| Environment progression | "Creating dev/staging/prod namespaces with quota progression?" |

---

## Pre-flight Checks (CRITICAL)

Before generating manifests, verify:

```bash
# 1. Cluster access
kubectl cluster-info

# 2. Current context
kubectl config current-context

# 3. Target namespace (create if needed)
kubectl get namespace "$NAMESPACE" || kubectl create namespace "$NAMESPACE"

# 4. Image exists (or build it); inspect matches name:tag exactly, unlike grep on the table output
docker image inspect "$IMAGE_NAME" >/dev/null 2>&1 || docker build -t "$IMAGE_NAME" .

# 5. For local clusters: load image
kind load docker-image "$IMAGE_NAME"  # or: minikube image load "$IMAGE_NAME"
```

**If any check fails → stop and report. Don't generate manifests for broken state.**

---

## Auto-Detection Matrix

### From Dockerfile

| Detect | How | Example |
|--------|-----|---------|
| **Port** | EXPOSE instruction | `EXPOSE 8000` → containerPort: 8000 |
| **Health** | HEALTHCHECK instruction, or framework inferred from CMD | `uvicorn` → /health or /healthz |
| **User** | USER instruction | `USER 1000` → runAsUser: 1000 |
| **Workdir** | WORKDIR instruction | Context for volume mounts |
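
As a rough illustration of the Dockerfile detection described in the table, the sketch below pulls the port and user out with `awk`. The function names are hypothetical, and taking the last matching instruction is an assumed heuristic so that the final stage of a multi-stage build wins; the skill's real detection logic is not shown in this document.

```bash
# detect_port DOCKERFILE: print the first port of the last EXPOSE instruction,
# with any "/tcp" or "/udp" suffix removed.
detect_port() {
  awk 'toupper($1) == "EXPOSE" { split($2, parts, "/"); port = parts[1] }
       END { if (port != "") print port }' "$1"
}

# detect_user DOCKERFILE: print the value of the last USER instruction.
detect_user() {
  awk 'toupper($1) == "USER" { user = $2 }
       END { if (user != "") print user }' "$1"
}
```

For a Dockerfile containing `EXPOSE 8000` and `USER 1000`, `detect_port Dockerfile` prints `8000` (for `containerPort`) and `detect_user Dockerfile` prints `1000` (for `runAsUser`). Both print nothing when the instruction is absent.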

### Port Selection (CRITICAL for Security)

**Privileged ports (<1024) conflict with `runAsNonRoot: true`.**

| Detected Port | Action |
|---------------|--------|
| 80, 443 | ⚠️ Use unprivileged variant (nginx-unprivileged:8080) or remap |
| 8080, 8000, 3000+ | ✅ Compatible with non-root |

**Common remappings:**
| Standard Image | Security-Compatible Alternative |
|----------------|--------------------------------|
| `nginx` (port 80) | `nginxinc/nginx-unprivileged` (port 8080) |
| `httpd` (port 80) | Configure `Listen 8080` or use unprivileged image |
| `redis` (port 6379) | ✅ Already unprivileged |
| `postgres` (port 5432) | ✅ Already unprivileged |

**Service abstracts this:** Service `port: 80` → `targetPort: 8080` keeps external API stable.
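
The remapping rule can be sketched as a helper that returns the Service port and the container port to use. `map_port` is a hypothetical name, and the "add 8000" offset (80 → 8080, 443 → 8443) is an assumed convention for illustration; the actual port depends on what the unprivileged image listens on.

```bash
# map_port DETECTED_PORT
# Prints "<service_port> <target_port>". Ports below 1024 keep their number on
# the Service and get an unprivileged targetPort (assumed offset: +8000).
map_port() {
  local port="$1"
  if [ "$port" -lt 1024 ]; then
    echo "$port $(( port + 8000 ))"
  else
    echo "$port $port"
  fi
}

map_port 80     # prints: 80 8080
map_port 443    # prints: 443 8443
map_port 8000   # prints: 8000 8000
```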

### From Code

| Detect | How | Example |
|--------|-----|---------|
| **Framework health** | Route definitions | FastAPI `/health`, Express `/healthz` |
| **Readiness** | DB connection check | `/health/ready` with DB ping |
| **Startup time** | Heavy imports | ML models → startupProbe needed |

### Workload Type Decision

```
Is this a one-time task that completes?
  → Job (or CronJob if scheduled)

Does it need stable network identity or ordered deployment?
  → StatefulSet

Must run on every node?
  → DaemonSet

Otherwise → Deployment (default)
```
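
The same decision order can be written as a function, which makes the precedence explicit: completion is checked first, then identity, then per-node placement. This is only a sketch; `choose_workload` and its yes/no arguments are hypothetical.

```bash
# choose_workload COMPLETES SCHEDULED STABLE_IDENTITY EVERY_NODE
# Each argument is "yes" or "no"; the first matching question wins.
choose_workload() {
  if [ "$1" = "yes" ]; then
    if [ "$2" = "yes" ]; then echo CronJob; else echo Job; fi
  elif [ "$3" = "yes" ]; then
    echo StatefulSet
  elif [ "$4" = "yes" ]; then
    echo DaemonSet
  else
    echo Deployment
  fi
}

choose_workload yes yes no no   # prints: CronJob
choose_workload no no yes no    # prints: StatefulSet
choose_workload no no no no     # prints: Deployment
```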

---

## Workflow

```
1. PRE-FLIGHT
   - Verify kubectl context
   - Check namespace exists
   - Verify image exists or build it
         ↓
2. ANALYZE PROJECT
   - Read Dockerfile for EXPOSE, HEALTHCHECK, USER
   - Scan code for health endpoints
   - Check existing k8s/ directory
   - Detect GPU requirements (torch, tensorflow)
         ↓
3. DETERMINE WORKLOAD TYPE
   - Deployment (default)
   - Job/CronJob (batch processing)
   - StatefulSet (databases, ordered)
   - DaemonSet (node-level agents)
         ↓
4. GENERATE MANIFESTS
   - Deployment/Job/StatefulSet with hardened security
   - Service (ClusterIP, NodePort, or LoadBalancer)
   - ConfigMap for non-secret config
   - HPA if autoscaling needed
   - PDB for availability
   - All with educational comments
         ↓
5. VALIDATE
   - kubectl apply --dry-run=server -f k8s/base/
   - kubectl apply -n $NAMESPACE -f k8s/base/
   - kubectl wait --for=condition=Ready pod -l app.kubernetes.io/name=$APP_NAME
   - kubectl logs to verify startup
         ↓
6. DELIVER
   - Files in k8s/base/ directory
   - Summary of what was created
   - Next steps for production
```
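
Step 5 (VALIDATE) can be sketched as one function that stops at the first failing command. The function name, argument order, 120-second timeout, and 20-line log tail are assumptions made for illustration; only the `kubectl` subcommands themselves come from the workflow above.

```bash
# validate_and_deploy NAMESPACE APP_NAME [MANIFEST_DIR]
# Hypothetical wrapper around the VALIDATE step; returns non-zero on the first failure.
validate_and_deploy() {
  local namespace="$1" app_name="$2" manifest_dir="${3:-k8s/base}"

  # Server-side dry run: the API server validates the manifests without persisting them
  kubectl apply --dry-run=server -n "$namespace" -f "$manifest_dir" || return 1

  # Real apply
  kubectl apply -n "$namespace" -f "$manifest_dir" || return 1

  # Wait until pods carrying the standard name label report Ready
  kubectl wait --for=condition=Ready pod \
    -l "app.kubernetes.io/name=$app_name" -n "$namespace" --timeout=120s || return 1

  # Show recent startup output from one of the Deployment's pods
  kubectl logs -n "$namespace" "deployment/$app_name" --tail=20
}
```

`kubectl wait` can report that no resources match if it runs before the pods have been created; `kubectl rollout status deployment/<name>` is an alternative that follows the rollout itself.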

---

## Generated Directory Structure

```
k8s/
├── base/                         # Raw manifests (ArgoCD-compatible)
│   ├── namespace.yaml            # Optional, if new namespace
│   ├── resourcequota.yaml        # Namespace-wide resource caps
│   ├── limitrange.yaml           # Per-container defaults and bounds
│   ├── networkpolicy.yaml        # Namespace isolation rules
│   ├── deployment.yaml           # Or job.yaml, statefulset.yaml
│   ├── service.yaml              # ClusterIP by default
│   ├── configmap.yaml            # Non-secret configuration
│   ├── hpa.yaml                  # If autoscaling enabled
│   ├── pdb.yaml                  # Pod Disruption Budget
│   └── kustomization.yaml        # For future Kustomize use
└── README.md                     # Deployment instructions
```

---

## Manifest Patterns

### Deployment (Default)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${APP_NAME}
  labels:
    # Standard K8s labels (see references/labels-annotations.md)
    app.kubernetes.io/name: ${APP_NAME}
    app.kubernetes.io/instance: ${APP_NAME}-${ENV}
    app.kubernetes.io/version: "${VERSION}"
    app.kubernetes.io/component: api  # or worker, frontend
    app.kubernetes.io/part-of: ${PROJECT}
    app.kubernetes.io/managed-by: kubectl
spec:
  replicas: 2  # WHY: Minimum for availability during rolling updates
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ${APP_NAME}
    spec:
      # WHY: Security hardening - never run as root
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: ${APP_NAME}
        image: ${IMAGE}:${TAG}
        # WHY: Never use :latest - breaks reproducibility
        imagePullPolicy: IfNotPresent
        ports:
        # WHY: Port must be >=1024 for runAsNonRoot (privileged ports need root)
        # Use Service port:80 → targetPort:8080 to expose standard ports externally
        - containerPort: ${PORT}  # Must be >=1024 (e.g., 8080, 8000, 3000)
          protocol: TCP
        # WHY: Container-level security context
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
        # WHY: Prevent resource starvation, enable HPA
        resources:
          requests:
            cpu: "100m"      # 0.1 CPU cores
            memory: "128Mi"
          limits:
            cpu: "500m"      # 0.5 CPU cores
            memory: "512Mi"
        # WHY: K8s restarts if app deadlocks
        livenessProbe:
          httpGet:
            path: /health/live
            port: ${PORT}
          initialDelaySeconds: 10
          periodSeconds: 15
          failureThreshold: 3
        # WHY: Only route traffic when ready
        readinessProbe:
          httpGet:
            path: /health/ready
            port: ${PORT}
          initialDelaySeconds: 5
          periodSeconds: 10
        # WHY: Slow-starting apps (ML models) need longer startup
        startupProbe:
          httpGet:
            path: /health/live
            port: ${PORT}
          initialDelaySeconds: 0
          periodSeconds: 10
          failureThreshold: 30  # 5 minutes to start
        # WHY: Brief pause before SIGTERM lets endpoint removal propagate so in-flight requests finish (needs /bin/sh in the image)
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"]
      # WHY: Allow time for graceful shutdown
      terminationGracePeriodSeconds: 30
```

### Service

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ${APP_NAME}
  labels:
    app.kubernetes.io/name: ${APP_NAME}
spec:
  # WHY: ClusterIP is safest default - internal only
  # Use NodePort for dev/testing, LoadBalancer for prod external access
  type: ClusterIP
  ports:
  # WHY: Service abstracts internal port - clients connect to :80, Pod runs on :8080
  # This allows standard external ports while container runs unprivileged
  - port: 80              # WHY: Service port (what clients connect to)
    targetPort: ${PORT}   # WHY: Pod port (>=1024, e.g., 8080)
    protocol: TCP
    name: http
  selector:
    # CRITICAL: Must EXACTLY match Pod template labels from Deployment
    # Mismatch = zero endpoints = Service routes to nothing
    app.kubernetes.io/name: ${APP_NAME}
```

**Verify Service→Pod connection**: `kubectl get endpoints ${APP_NAME}`
- Shows Pod IPs if selector matches
- Shows `<none>` if selector MISMATCHES Pod labels
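
The endpoint check is easy to script so that it fails loudly on a selector mismatch. A minimal sketch, assuming `kubectl` is on the PATH and pointed at the right context; `service_has_endpoints` is a hypothetical name and the namespace defaults to `default`.

```bash
# service_has_endpoints SERVICE [NAMESPACE]
# Succeeds when the Service's Endpoints object lists at least one Pod IP.
service_has_endpoints() {
  local svc="$1" ns="${2:-default}" ips
  ips=$(kubectl get endpoints "$svc" -n "$ns" \
        -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  if [ -n "$ips" ]; then
    echo "endpoints for $svc: $ips"
  else
    echo "no endpoints for $svc: compare the Service selector with the Pod labels" >&2
    return 1
  fi
}
```

Pods that exist but are not yet Ready are also absent from `addresses`, so an empty result can mean a failing readiness probe as well as a label mismatch.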

---

## Security Context (Always Applied)

See `references/security-contexts.md` for full patterns.

```yaml
# Pod level
securityContext:
  runAsNonRoot: true           # WHY: Never run as root
  runAsUser: 1000              # WHY: Consistent non-root UID
  runAsGroup: 1000             # WHY: Consistent GID
  fsGroup: 1000                # WHY: Volume permissions
  seccompProfile:
    type: RuntimeDefault       # WHY: Block dangerous syscalls

# Container level
securityContext:
  allowPrivilegeEscalation: false  # WHY: Prevent root escalation
  readOnlyRootFilesystem: true     # WHY: Immutable container
  capabilities:
    drop: ["ALL"]                  # WHY: Minimal capabilities
```
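
A quick text check can catch manifests that are missing these settings before they are applied. The sketch below uses plain `grep`, not a YAML parser, so it only confirms that each string appears somewhere in the file; `check_hardening` is a hypothetical helper, and the `capabilities.drop` entry is left out because it can be written in more than one YAML style.

```bash
# check_hardening MANIFEST
# Prints one "missing: ..." line per absent setting; returns 1 if any are missing.
check_hardening() {
  local manifest="$1" missing=0 setting
  for setting in \
    "runAsNonRoot: true" \
    "allowPrivilegeEscalation: false" \
    "readOnlyRootFilesystem: true" \
    "type: RuntimeDefault"; do
    if ! grep -qF -- "$setting" "$manifest"; then
      echo "missing: $setting"
      missing=1
    fi
  done
  return "$missing"
}
```

For example, `check_hardening k8s/base/deployment.yaml || echo "manifest needs hardening"` would flag a Deployment that omits any of the four settings.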

---

## Output Checklist

Before delivering, verify:

### Pre-flight
- [ ] kubectl context is valid
- [ ] Namespace exists or was created
- [ ] Image exists locally or in registry
- [ ] For kind/minikube: image loaded into cluster

### Manifests
- [ ] All manifests have `app.kubernetes.io/*` labels
- [ ] Security context applied (runAsNonRoot, readOnlyRootFilesystem)
- [ ] **containerPort >= 1024** (privileged ports incompatible with runAsNonRoot)
- [ ] Resource requests AND limits defined
- [ ] Liveness and readiness probes configured
- [ ] No hardcoded secrets (reference Secrets via `secretKeyRef` or mounted volumes instead of literal values)

### Namespace Governance (if applicable)
- [ ] ResourceQuota sets namespace-wide CPU/memory/pod limits
- [ ] LimitRange provides default requests/limits for containers
- [ ] LimitRange max prevents single container from consuming quota
- [ ] NetworkPolicy isolates namespace (default-deny + explicit allows)
- [ ] Monitoring namespace allowed to scrape metrics

### Validation
- [ ] `kubectl apply --dry-run=server` passes
- [ ] Deployed to cluster successfully
- [ ] Pods reach Running state and report Ready
- [ ] Health endpoints respond
- [ ] Service has endpoints (`kubectl get endpoints` shows Pod IPs, not `<none>`)

### Documentation
- [ ] Comments explain WHY for each config choice
- [ ] README.md with deployment instructions

---

## Reference Files

### Always Read First

| File | Purpose |
|------|---------|
| `references/security-contexts.md` | **CRITICAL**: Hardened security patterns |
| `references/health-probes.md` | **CRITICAL**: Liveness/readiness/startup |
| `references/resource-limits.md` | **CRITICAL**: CPU/memory guidance |
| `references/namespace-governance.md` | **CRITICAL**: ResourceQuota, LimitRange, NetworkPolicy, multi-team isolation |

### Debugging & Operations

| File | When to Read |
|------|--------------|
| `references/debugging-workflow.md` | **CRITICAL**: CrashLoopBackOff, command safety, logs, exec, debug containers |
| `references/deployment-gotchas.md` | **CRITICAL**: Architecture mismatch, ImagePull failures, pre-deploy validation, Helm gotchas |
| `references/networking-patterns.md` | **DEBUGGING**: Service has no endpoints, selector mismatch, DNS issues |
| `references/control-plane.md` | **DEBUGGING**: When deployments fail, pods stuck, rollback needed |

### Workload-Specific

| File | When to Read |
|------|--------------|
| `references/workload-types.md` | Choosing Deployment vs Job vs StatefulSet |
| `references/init-sidecar-patterns.md` | Init containers (model download, db wait), sidecars (logging, metrics) |
| `references/autoscaling-patterns.md` | HPA, custom metrics, KEDA |
| `references/gpu-workloads.md` | AI/ML workloads with GPU |
| `references/keda-patterns.md` | Event-driven scale-to-zero |

### Infrastructure

| File | When to Read |
|------|--------------|
| `references/networking-patterns.md` | Service types, Ingress, mesh |
| `references/storage-patterns.md` | PVC, ephemeral, shared storage |
| `references/configmap-patterns.md` | ConfigMap creation, env vars, volumes, hot-reload |
| `references/secrets-patterns.md` | ESO, Sealed Secrets, K8s Secrets |
| `references/rbac-patterns.md` | **SECURITY**: ServiceAccount, Role, RoleBinding, least privilege |
| `references/labels-annotations.md` | Standard labels, ArgoCD compat |

## Overview

This skill provides production-grade Kubernetes manifests and pragmatic debugging guidance for containerized applications. It auto-detects project and Dockerfile signals, generates hardened, CKAD-aligned manifests with educational comments, and validates deployments before delivering files ready for ArgoCD-style directories.

## How this skill works

It inspects the codebase and Dockerfile to detect ports, health endpoints, user instructions, and GPU needs. The agent queries kubectl to determine cluster type and allocatable resources, adapts resource requests/limits and replica counts accordingly, and emits Deployment/Service/Job/StatefulSet/CronJob manifests plus namespace governance (ResourceQuota, LimitRange, NetworkPolicy). It runs pre-flight checks and can apply and verify manifests on local clusters.

## When to use it

- Deploy an app to Kubernetes from source or a Docker image
- Generate hardened manifests with non-root, readOnlyRootFilesystem, and dropped capabilities
- Create namespaces with ResourceQuota and LimitRange for multi-team isolation
- Set up init containers and sidecars (model download, migrations, logging, metrics)
- Debug pod failures: CrashLoopBackOff, describe, logs, events, exec

## Best practices

- Detect cluster resources first and adapt requests/limits to avoid OOMs or scheduling failures
- Use non-privileged container ports (>=1024) and map Service ports to maintain external compatibility
- Always set requests and limits, liveness/readiness probes, and a reasonable terminationGracePeriod
- Apply Pod securityContext and container securityContext (runAsNonRoot, drop capabilities, readOnlyRootFilesystem)
- Validate with kubectl --dry-run=server and wait for Ready pods before considering deployment successful

## Example use cases

- Auto-generate manifests for a FastAPI app by reading EXPOSE and CMD to set containerPort and probes
- Create dev/staging/prod namespaces with progressive ResourceQuota and LimitRange for team isolation
- Produce a StatefulSet for a database with PDBs and persistent volumes
- Add an initContainer to download models before the main container starts and a sidecar for logs
- Diagnose CrashLoopBackOff: gather logs, kubectl describe, events, and suggest probe/resource adjustments

## FAQ

### Will the skill deploy directly to my cluster?

It can apply manifests after running pre-flight checks (kubectl context, namespace, image availability). If any check fails, it stops and reports.

### Does it produce Helm charts or Kustomize overlays?

No. It generates raw manifests and an ArgoCD-compatible k8s/base layout; Helm and Kustomize are noted as future options.