
kubernetes-platform skill


This skill helps you design and operate secure, GitOps-driven Kubernetes platforms with scalable patterns and guardrails for reliable production.

npx playbooks add skill amnadtaowsoam/cerebraskills --skill kubernetes-platform


---
name: Kubernetes Platform Engineering
description: Production Kubernetes platform patterns covering cluster architecture, security, GitOps, observability, autoscaling, and operational guardrails
---

# Kubernetes Platform Engineering

## Overview

Platform engineering on Kubernetes is about making the “golden path” easy: secure-by-default workloads, consistent delivery via GitOps, and predictable operations (capacity, upgrades, incident response).

## Why This Matters

- **Reliability**: self-healing + controlled rollouts reduce incidents
- **Security**: least privilege and isolation by default
- **Developer velocity**: standard templates + paved roads
- **Cost control**: right-sizing and autoscaling without surprises

---

## Core Concepts

### 1. Cluster Architecture

- Separate node pools by workload class (stateless, stateful, GPU, batch, system).
- Define upgrade strategy (surge capacity, maintenance windows, version skew policy).
- Decide multi-cluster approach (per env/region/tenant) and shared services (ingress, monitoring, secrets).
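Node-pool separation is typically enforced with taints on the pool and matching tolerations plus selectors on the workload. A minimal sketch for a dedicated GPU pool — the `node-pool` label and `dedicated=gpu` taint are illustrative names, not Kubernetes conventions:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  nodeSelector:
    node-pool: gpu            # assumes GPU nodes are labeled node-pool=gpu
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule      # matches a dedicated=gpu:NoSchedule taint on the pool
  containers:
    - name: trainer
      image: your-registry/trainer:1.0.0
      resources:
        limits:
          nvidia.com/gpu: 1   # GPU resource exposed by the NVIDIA device plugin
```

The taint keeps general workloads off expensive nodes; the toleration plus selector pins GPU workloads on.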

### 2. Workload Patterns

- **Deployment** for stateless services; use `PodDisruptionBudget`, `readinessProbe`, and `topologySpreadConstraints`.
- **StatefulSet** for stable identity/storage (databases, queues) with explicit backup/restore.
- **Job/CronJob** for batch; enforce concurrency policies and deadlines.
- **DaemonSet** for node-level agents (logging, CNI, security).
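For batch workloads, the concurrency and deadline guardrails above map directly onto `CronJob` fields. A sketch (schedule and image are placeholders):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-report
spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid          # skip a run if the previous one is still going
  startingDeadlineSeconds: 300       # miss the window entirely rather than pile up late starts
  jobTemplate:
    spec:
      activeDeadlineSeconds: 3600    # kill runs that exceed one hour
      backoffLimit: 2                # at most two retries before the Job is marked failed
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: report
              image: your-registry/report:1.0.0
```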

### 3. Networking

- Use `Ingress`/Gateway with TLS termination and standardized auth/rate limiting at the edge.
- Apply **NetworkPolicies** (default-deny + explicit allow) for zero-trust inside the cluster.
- Consider service mesh only when you need it (mTLS, traffic shifting, retries, L7 telemetry); keep complexity intentional.
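Default-deny plus explicit allow is usually a pair of policies per namespace. A sketch, assuming an nginx ingress controller running in an `ingress-nginx` namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: app
spec:
  podSelector: {}                    # empty selector = every pod in the namespace
  policyTypes: [Ingress, Egress]     # deny all traffic not explicitly allowed
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: app
spec:
  podSelector:
    matchLabels: { app: app }
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx   # auto-added namespace label
      ports:
        - port: 3000
```

Egress to explicit dependencies (DNS, databases) needs its own allow rules once default-deny covers `Egress`.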

### 4. Storage

- Standardize `StorageClass` per tier (fast SSD, standard, archival) and backup policies.
- For stateful apps: set anti-affinity/zone spread; verify recovery time objectives (RTO/RPO).
- Avoid “pet” PVs by having a documented restore path and regular restore drills.
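A per-tier `StorageClass` is a small manifest; the provisioner and parameters below are for the AWS EBS CSI driver and would change on other clouds:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com             # AWS EBS CSI driver; swap for your cloud's CSI
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer  # provision in the zone where the pod lands
allowVolumeExpansion: true
reclaimPolicy: Delete
```

`WaitForFirstConsumer` matters for zone-spread stateful apps: it avoids binding a volume in a zone the scheduler then can't place the pod into.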

### 5. Security Baseline

- Enforce the Pod Security Standards `restricted` profile: non-root, read-only root filesystem, dropped capabilities, no privilege escalation.
- RBAC least privilege: namespace-scoped roles, separate human vs automation identities.
- Secrets: prefer external secret managers (e.g., External Secrets + AWS/GCP/Azure secret store); avoid long-lived static secrets in Git.
- Admission policies: Kyverno / Gatekeeper for guardrails (image registries, resource limits, required labels).
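A guardrail like "every container declares requests and limits" is a few lines of Kyverno. A sketch (the `"?*"` anchors mean "any non-empty value"):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resources
spec:
  validationFailureAction: Enforce       # reject non-compliant pods at admission
  rules:
    - name: require-requests-limits
      match:
        any:
          - resources:
              kinds: [Pod]
      validate:
        message: "CPU/memory requests and a memory limit are required."
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: "?*"
                    memory: "?*"
                  limits:
                    memory: "?*"
```

Start with `Audit` instead of `Enforce` to measure existing violations before turning the policy into a hard gate.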

### 6. Observability & Ops

- Metrics: Prometheus + dashboards per service (latency, errors, saturation, restarts).
- Logs: structured JSON with trace IDs; centralized retention policies.
- Traces: OpenTelemetry for request graphs across services.
- SLOs: define error budgets and link alerts to user impact.
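With the Prometheus Operator, an SLO-linked alert is a `PrometheusRule`. A sketch — the `http_requests_total` metric and `job="app"` label are assumptions about your instrumentation:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-slo
spec:
  groups:
    - name: app-availability
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{job="app",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="app"}[5m])) > 0.01
          for: 10m                       # sustained, not a blip
          labels: { severity: page }
          annotations:
            summary: "app 5xx rate above 1% (burning error budget)"
```

The alert fires on user-visible error rate rather than on a proxy like CPU, which keeps paging tied to SLO impact.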

### 7. GitOps & Delivery

- Git is source of truth: Argo CD / Flux reconciles desired state.
- Use Helm/Kustomize overlays per environment; keep secrets out-of-band.
- Progressive delivery: canary/blue-green via Argo Rollouts / Flagger (or mesh-based traffic shifting).
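With Argo CD, "Git is source of truth" is expressed as an `Application` pointing at an environment overlay. A sketch (the repo URL and path are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/platform-config   # hypothetical config repo
    targetRevision: main
    path: apps/app/overlays/prod                           # Kustomize overlay per environment
  destination:
    server: https://kubernetes.default.svc
    namespace: app
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert manual kubectl drift back to desired state
    syncOptions:
      - CreateNamespace=true
```

`selfHeal` is what turns GitOps into an anti-drift guarantee, not just a deploy pipeline.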

### 8. Cost & Capacity Management

- Require resource requests/limits; use VPA recommendations carefully (watch for restarts).
- Use HPA/KEDA for demand-based scaling; use cluster autoscaler/Karpenter for nodes.
- Separate spot/preemptible pools for tolerant workloads; enforce pod placement policies.
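Demand-based pod scaling with a capped ceiling looks like this with an `autoscaling/v2` HPA; the target and bounds are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 10                      # cap to protect downstream dependencies
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # scale out when average CPU exceeds 70% of requests
```

Utilization is computed against the pod's CPU *requests*, which is one more reason requests must be set accurately.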

## Quick Start (Minimal “Production-ish” Deployment)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }
  selector: { matchLabels: { app: app } }
  template:
    metadata: { labels: { app: app } }
    spec:
      securityContext: { runAsNonRoot: true }
      containers:
        - name: app
          image: your-registry/app:1.0.0
          ports: [{ containerPort: 3000 }]
          resources:
            requests: { cpu: "100m", memory: "256Mi" }
            limits: { cpu: "500m", memory: "512Mi" }
          readinessProbe:
            httpGet: { path: /readyz, port: 3000 }
            initialDelaySeconds: 5
          livenessProbe:
            httpGet: { path: /healthz, port: 3000 }
            initialDelaySeconds: 10
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app
spec:
  minAvailable: 1
  selector: { matchLabels: { app: app } }
```

## Production Checklist

- [ ] Requests/limits set; namespaces have `LimitRange`/`ResourceQuota`
- [ ] `readinessProbe` + `livenessProbe` + graceful shutdown configured
- [ ] PDB + topology spread/anti-affinity to survive node/zone events
- [ ] NetworkPolicies enforce default-deny and least access
- [ ] RBAC least privilege; service accounts scoped per app
- [ ] Metrics/logs/traces exported; alerts tied to SLOs
- [ ] Backup/restore drills for stateful components
- [ ] Upgrade playbook exists (cluster + addons + workload compatibility)
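The `LimitRange`/`ResourceQuota` item in the checklist is two small per-namespace objects; the values below are illustrative defaults:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: defaults
  namespace: app
spec:
  limits:
    - type: Container
      defaultRequest: { cpu: 100m, memory: 128Mi }   # applied when a pod omits requests
      default: { cpu: 500m, memory: 512Mi }          # applied when a pod omits limits
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota
  namespace: app
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    pods: "50"
```

The `LimitRange` keeps individual pods schedulable; the `ResourceQuota` bounds a whole team's blast radius.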

## Tools & Libraries

| Tool | Purpose |
|------|---------|
| Argo CD / Flux | GitOps continuous delivery |
| Helm / Kustomize | Packaging and environment overlays |
| Prometheus + Grafana | Metrics + dashboards |
| Loki/ELK | Central logging |
| OpenTelemetry | Distributed tracing |
| External Secrets | Secret manager integration |
| Kyverno / Gatekeeper | Policy enforcement |

## Anti-patterns

1. **No requests/limits**: unpredictable scheduling and noisy-neighbor incidents
2. **Hand-applied changes**: `kubectl apply` drift instead of GitOps reconciliation
3. **Flat network**: no NetworkPolicies; lateral movement is trivial
4. **Single replica**: planned/unplanned disruption becomes downtime

## Real-World Examples

### Example: Progressive Delivery (Canary)

- Deploy new version at 5% traffic; watch SLOs; ramp to 25/50/100%; auto-rollback on regressions.
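The ramp above maps onto Argo Rollouts canary steps. A sketch (pause durations and weights are illustrative; automated rollback requires attaching an `AnalysisTemplate`, omitted here):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: app
spec:
  replicas: 4
  selector: { matchLabels: { app: app } }
  template:
    metadata: { labels: { app: app } }
    spec:
      containers:
        - name: app
          image: your-registry/app:1.1.0
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 10m }     # watch SLOs before ramping
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }
        # final step implicitly shifts to 100%; an AnalysisTemplate per step
        # can fail the rollout and trigger automatic rollback
```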

### Example: Autoscaling

- HPA on CPU/requests-per-second + cluster autoscaler for nodes; cap max replicas to protect dependencies.

### Example: Zero-Trust Networking

- Default-deny in each namespace; only allow traffic from ingress controller to app and app to explicit dependencies.

## Common Mistakes

1. Missing readiness probes (traffic hits pods before they’re ready)
2. Running containers as root (wider blast radius on compromise)
3. Unbounded egress (data exfiltration paths and surprise costs)
4. Treating the cluster as the product (no SLOs, no runbooks, no ownership)

## Integration Points

- CI (image build, SBOM, scanning, signing)
- CD (GitOps reconciler + progressive delivery)
- Cloud provider (EKS/GKE/AKS, IAM, load balancers, DNS)
- Observability + incident management (paging, runbooks, postmortems)

## Further Reading

- [Kubernetes Documentation](https://kubernetes.io/docs/)
- [Kubernetes Production Best Practices](https://learnk8s.io/production-best-practices)
- [Kubernetes Patterns](https://k8spatterns.io/)

## Overview

This skill captures production Kubernetes platform engineering patterns that make the golden path easy: secure-by-default workloads, predictable delivery via GitOps, and operational guardrails for reliability, security, and cost control. It focuses on cluster architecture, workload patterns, networking, storage, observability, autoscaling, and delivery practices to run Kubernetes at scale.

## How this skill works

The skill inspects and codifies platform decisions and templates for clusters, node pools, workload manifests, policies, GitOps workflows, and observability configuration. It provides concrete patterns: Deployment/StatefulSet/Job semantics, NetworkPolicy defaults, StorageClass tiers, RBAC and pod security baselines, plus GitOps reconciliation and progressive delivery recipes. It also includes operational checklists for upgrades, backups, capacity, and SLO-driven alerting.

## When to use it

- Designing or revising a production Kubernetes platform architecture
- Onboarding a platform team to a secure-by-default, GitOps delivery model
- Standardizing workload manifests, security policies, and observability across clusters
- Defining autoscaling, cost controls, and node pool strategies for mixed workloads
- Preparing runbooks, upgrade playbooks, and backup/restore drills

## Best practices

- Separate node pools by workload class (stateless, stateful, GPU, batch, system) and enforce pod placement
- Enforce pod security (non-root, read-only FS, drop capabilities) and RBAC least privilege with namespace-scoped roles
- Make Git the source of truth (Argo CD/Flux); keep secrets out-of-band with external secret managers
- Require resource requests/limits, use HPA/KEDA and cluster autoscaler, and limit max replicas to protect dependencies
- Apply default-deny NetworkPolicies and use service mesh intentionally when you need mTLS or advanced traffic control
- Define SLOs, export metrics/logs/traces, and link alerts to user impact with runbooks and error budgets

## Example use cases

- Progressive delivery: canary rollouts with Argo Rollouts and auto-rollback on SLO regressions
- Production workload template: Deployment with PDB, readiness/liveness probes, resource limits, and topology spread
- Stateful services: StatefulSet with StorageClass tiering, backup/restore drills, and anti-affinity across zones
- Autoscaling: HPA on CPU/RPS combined with cluster autoscaler or Karpenter and separate spot pools for tolerant jobs
- Zero-trust networking: default-deny per namespace and explicit allow rules for ingress and service-to-service calls

## FAQ

**How do I choose single-cluster vs multi-cluster?**

Use single-cluster for simplicity at small scale; prefer multi-cluster for isolation by environment/region/tenant, blast-radius control, and independent upgrades.

**When should I adopt a service mesh?**

Introduce a mesh when you need mTLS, advanced traffic shifting, L7 retries, or detailed per-call telemetry; avoid adding mesh complexity unless those needs are clear.