
kubernetes-specialist-skill

/kubernetes-specialist-skill

This skill delivers expert Kubernetes orchestration guidance covering architecture, Helm, operators, and multi-cluster strategies across cloud and on-premises environments.

npx playbooks add skill 404kidwiz/claude-supercode-skills --skill kubernetes-specialist-skill

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
5.6 KB
---
name: kubernetes-specialist
description: "Expert Kubernetes Specialist with deep expertise in container orchestration, cluster management, and cloud-native applications. Proficient in Kubernetes architecture, Helm charts, operators, and multi-cluster management across EKS, AKS, GKE, and on-premises deployments."
---

# Kubernetes Specialist

## Purpose

Provides expert guidance on Kubernetes orchestration and cloud-native applications, with deep knowledge of cluster management and production-grade deployments. Specializes in Kubernetes architecture, Helm charts, operators, multi-cluster management, and GitOps workflows across EKS, AKS, GKE, and on-premises environments.

## When to Use

- Designing Kubernetes cluster architecture for production workloads
- Implementing Helm charts, operators, or GitOps workflows (ArgoCD, Flux)
- Troubleshooting cluster issues (networking, storage, performance)
- Planning Kubernetes upgrades or multi-cluster strategies
- Optimizing resource utilization and cost in Kubernetes environments
- Setting up service mesh (Istio, Linkerd) and observability
- Implementing Kubernetes security and RBAC policies

## Quick Start

**Invoke this skill when:**
- Designing Kubernetes cluster architecture for production workloads
- Implementing Helm charts, operators, or GitOps workflows
- Troubleshooting cluster issues (networking, storage, performance)
- Planning Kubernetes upgrades or multi-cluster strategies
- Optimizing resource utilization and cost in Kubernetes environments

**Do NOT invoke when:**
- Simple Docker container needs (use docker commands directly)
- Cloud infrastructure provisioning (use cloud-architect instead)
- Application code debugging (use backend-developer/frontend-developer)
- Database-specific issues (use database-administrator instead)

## Decision Framework

### Deployment Strategy Selection

```
├─ Zero downtime required?
│   ├─ Instant rollback needed → Blue-Green Deployment
│   │   Pros: Instant switch, easy rollback
│   │   Cons: 2x resources during deployment
│   │
│   ├─ Gradual rollout → Canary Deployment
│   │   Pros: Test with subset of traffic
│   │   Cons: Complex routing setup
│   │
│   └─ Simple updates → Rolling Update (default)
│       Pros: Built-in, no extra resources
│       Cons: Rollback takes time
│
├─ Stateful application?
│   ├─ Database → StatefulSet + PVC
│   │   Pros: Stable network IDs, ordered deployment
│   │   Cons: Complex scaling
│   │
│   └─ Stateless → Deployment
│       Pros: Easy scaling, self-healing
│
└─ Batch processing?
    ├─ One-time → Job
    ├─ Scheduled → CronJob
    └─ Parallel processing → Job with parallelism
```
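
As a concrete reference for the default path above, a rolling update can be tuned so capacity never drops during a rollout. The sketch below is illustrative; the workload name, image, and surge/unavailable values are placeholders to adapt to your availability needs.

```yaml
# Minimal rolling-update sketch; name, image, and rollout parameters are
# placeholders, not a prescribed configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api               # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1             # one extra pod allowed during the rollout
      maxUnavailable: 0       # never drop below the desired replica count
  template:
    metadata:
      labels:
        app: web-api
    spec:
      containers:
        - name: api
          image: registry.example.com/web-api:1.0.0   # placeholder image
          ports:
            - containerPort: 8080
```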

### Resource Configuration Matrix

| Workload Type | CPU Request | CPU Limit | Memory Request | Memory Limit |
|---------------|-------------|-----------|----------------|--------------|
| **Web API** | 100m-500m | 1000m | 256Mi-512Mi | 1Gi |
| **Worker** | 500m-1000m | 2000m | 512Mi-1Gi | 2Gi |
| **Database** | 1000m-2000m | 4000m | 2Gi-4Gi | 8Gi |
| **Cache** | 100m-250m | 500m | 1Gi-4Gi | 8Gi |
| **Batch Job** | 500m-2000m | 4000m | 1Gi-4Gi | 8Gi |
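
For example, the **Web API** row translates into a container `resources` block along these lines; the numbers are starting points to tune against observed usage (`kubectl top pod`, Prometheus), not fixed recommendations.

```yaml
# Goes under spec.containers[].resources in the pod template.
resources:
  requests:
    cpu: 250m          # within the 100m-500m request band above
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
```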

### Node Pool Strategy

| Use Case | Instance Type | Scaling | Cost |
|----------|--------------|---------|------|
| **System pods** | t3.large (3 nodes) | Fixed | Low |
| **Applications** | m5.xlarge | Auto 3-20 | Medium |
| **Batch/Spot** | m5.large-2xlarge | Auto 0-50 | Very Low |
| **GPU workloads** | p3.2xlarge | Manual | High |
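
To keep the system pool dedicated to cluster addons, one common pattern is to taint those nodes and have only addon pods tolerate the taint. The label and taint keys below are arbitrary examples, not required names.

```yaml
# Nodes in the system pool carry a taint such as:
#   kubectl taint nodes <node-name> dedicated=system:NoSchedule
# Addon pods then opt in via a matching toleration and node selector
# (this fragment goes in the pod spec).
nodeSelector:
  node-role.example.com/system: "true"   # hypothetical node label
tolerations:
  - key: dedicated
    value: system
    effect: NoSchedule
```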

### Red Flags → Escalate

**STOP and escalate if:**
- Cluster upgrade with breaking API changes (deprecated versions)
- Multi-region active-active requirements
- Compliance requirements (PCI-DSS, HIPAA) need validation
- Custom scheduler or controller development needed
- etcd corruption or cluster state issues

## Quality Checklist

### Cluster Configuration
- [ ] Multi-AZ deployment (nodes spread across availability zones)
- [ ] Node autoscaling configured (Cluster Autoscaler or Karpenter)
- [ ] System node pool with taints (separate critical addons from apps)
- [ ] Encryption enabled (secrets at rest with KMS)
- [ ] Audit logging enabled (API server logs)
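
On EKS, several of these items can be expressed declaratively in an eksctl config. The sketch below is an assumption-laden example: the cluster name, region, zones, and KMS key ARN are placeholders.

```yaml
# Illustrative eksctl ClusterConfig covering multi-AZ placement, secrets
# encryption with KMS, and API server audit logging; all identifiers are fake.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod-cluster           # hypothetical cluster name
  region: us-east-1
availabilityZones: ["us-east-1a", "us-east-1b", "us-east-1c"]
secretsEncryption:
  keyARN: arn:aws:kms:us-east-1:123456789012:key/EXAMPLE   # placeholder key
cloudWatch:
  clusterLogging:
    enableTypes: ["api", "audit", "authenticator"]
```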

### Security
- [ ] Pod Security Standards enforced (restricted or baseline)
- [ ] Network policies configured (default deny + explicit allow)
- [ ] RBAC configured (least privilege for all service accounts)
- [ ] Image scanning enabled (scan for vulnerabilities)
- [ ] Private container registry configured
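
A minimal default-deny NetworkPolicy, assuming a CNI that enforces policies (Calico, Cilium, etc.); the namespace name is a placeholder. Pair it with explicit allow policies per workload.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production        # placeholder namespace
spec:
  podSelector: {}              # selects every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```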

### Resource Management
- [ ] All pods have resource requests and limits
- [ ] HorizontalPodAutoscalers configured for scalable workloads
- [ ] PodDisruptionBudgets defined (prevent too many pods going down at once)
- [ ] ResourceQuotas set per namespace
- [ ] LimitRanges defined (default limits for pods)
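
A hedged HorizontalPodAutoscaler sketch for the hypothetical web-api Deployment; the utilization target and replica bounds are assumptions to tune per workload.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api              # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # illustrative target
```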

### High Availability
- [ ] Deployments have ≥2 replicas
- [ ] Anti-affinity rules prevent pod co-location
- [ ] Readiness and liveness probes configured
- [ ] PodDisruptionBudgets allow for rolling updates
- [ ] Multi-region cluster (if global scale required)
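
A PodDisruptionBudget that keeps at least two web-api pods running through voluntary disruptions such as node drains; the selector and threshold are illustrative.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-api
spec:
  minAvailable: 2              # never voluntarily drop below two pods
  selector:
    matchLabels:
      app: web-api             # matches the Deployment's pod labels
```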

### Observability
- [ ] Metrics server installed (kubectl top works)
- [ ] Prometheus monitoring application metrics
- [ ] Centralized logging (CloudWatch, Elasticsearch, Loki)
- [ ] Distributed tracing (Jaeger, Tempo)
- [ ] Dashboards for cluster and application health
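
If the Prometheus Operator is in use, application scraping can be declared with a ServiceMonitor; this sketch assumes a Service labeled `app: web-api` exposing a named `metrics` port.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web-api
spec:
  selector:
    matchLabels:
      app: web-api             # labels on the target Service
  endpoints:
    - port: metrics            # named port on the Service
      interval: 30s
```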

### Disaster Recovery
- [ ] Velero installed for cluster backups
- [ ] Backup schedule configured (daily minimum)
- [ ] Restore tested (annual drill)
- [ ] etcd backups automated for self-managed clusters (managed control planes on EKS, AKS, and GKE handle etcd for you)
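
With Velero installed and an object-store backend configured, the daily backup can be declared as a Schedule; the namespaces, timing, and retention below are examples.

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"        # 02:00 UTC daily
  template:
    includedNamespaces:
      - production             # placeholder namespace
    ttl: 720h                  # retain backups for 30 days
```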

## Additional Resources

- **Detailed Technical Reference**: See [REFERENCE.md](REFERENCE.md)
- **Code Examples & Patterns**: See [EXAMPLES.md](EXAMPLES.md)

Overview

This skill provides expert Kubernetes orchestration and cloud-native operations guidance for designing, managing, and optimizing production clusters across EKS, AKS, GKE, and on-premises environments. It combines architecture patterns, deployment strategies, Helm and operator best practices, multi-cluster management, and GitOps workflows to deliver reliable, secure, and cost-effective platforms. Use it to make pragmatic decisions, reduce downtime, and harden cluster operations for production workloads.

How this skill works

The skill inspects the intended workload profile, availability and compliance requirements, and current cluster constraints to recommend deployment strategies, resource sizing, and node pool designs. It applies a decision framework that selects between blue-green, canary, or rolling updates, maps workload types to resource request/limit suggestions, and proposes node pool and autoscaling patterns. It also runs a quality checklist covering security, observability, HA, resource management, and disaster recovery to surface gaps and next steps.

When to use it

  • Designing production-ready Kubernetes architecture and node pools
  • Implementing or reviewing Helm charts, operators, or GitOps pipelines (ArgoCD/Flux)
  • Troubleshooting cluster issues: networking, storage, scheduling, or performance
  • Planning cluster upgrades, multi-cluster strategies, or migration to managed services
  • Optimizing resource utilization, cost, and autoscaling policies
  • Hardening cluster security: RBAC, network policies, image scanning, and secrets encryption

Best practices

  • Choose deployment strategy based on rollback and traffic-safety needs (blue-green, canary, rolling)
  • Set CPU/memory requests and limits for every pod and configure HorizontalPodAutoscalers
  • Separate system and application node pools and use taints/tolerations for critical workloads
  • Enforce Pod Security Standards, default-deny network policies, and least-privilege RBAC
  • Enable observability: metrics, centralized logging, and distributed tracing before incidents
  • Automate backups (Velero), test restore procedures, and keep etcd backups current

Example use cases

  • Designing an EKS cluster for a high-traffic web API with autoscaling and multi-AZ availability
  • Converting legacy deployments into Helm charts and adding GitOps delivery with ArgoCD
  • Creating a node pool strategy that mixes spot (or preemptible) instances for batch jobs with on-demand instances for stateful services
  • Performing a pre-upgrade checklist to identify deprecated APIs and required migration steps
  • Implementing network policies and image scanning in a regulated environment (PCI/HIPAA)

FAQ

When should I use an operator instead of Helm?

Use an operator when you need custom lifecycle automation, complex stateful workflows, or self-healing beyond templated installs; use Helm for templated deployments and simpler upgrades.

How do I choose node instance types and scaling limits?

Map workload characteristics to instance types: system pods on small fixed nodes, latency-sensitive apps on stable medium instances, and batch work on cheaper autoscaled pools; set autoscaler min/max bounds based on expected traffic spikes and cost constraints.
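
As a concrete sketch of that mapping in eksctl syntax (pool names, instance types, and sizes are placeholders, not recommendations):

```yaml
# Illustrative node group layout: a fixed system pool, an autoscaled
# on-demand application pool, and a spot pool for batch work.
managedNodeGroups:
  - name: system
    instanceType: t3.large
    minSize: 3
    maxSize: 3
    labels: { role: system }
  - name: apps
    instanceType: m5.xlarge
    minSize: 3
    maxSize: 20
  - name: batch-spot
    instanceTypes: ["m5.large", "m5.xlarge", "m5.2xlarge"]
    spot: true
    minSize: 0
    maxSize: 50
```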