
kubernetes skill


This skill guides Kubernetes deployments toward safe rollouts, correct health probes, right-sized resources, and fast debugging with kubectl.

npx playbooks add skill bobmatnyc/claude-mpm-skills --skill kubernetes


Files (6)

SKILL.md


---
name: kubernetes
description: "Kubernetes operations playbook for deploying services: core objects, probes, resource sizing, safe rollouts, and fast kubectl debugging"
version: 1.0.0
category: universal
author: Claude MPM Team
license: MIT
progressive_disclosure:
  entry_point:
    summary: "Operate Kubernetes workloads with safe rollouts, health probes, resource sizing, and fast kubectl debugging"
    when_to_use: "When deploying services to Kubernetes, diagnosing cluster/runtime issues, or hardening workloads for production readiness"
    quick_start: "1. Inspect: kubectl get/describe 2. Check events/logs 3. Add probes + requests/limits 4. Roll out safely 5. Validate endpoints"
  token_estimate:
    entry: 150
    full: 9000
context_limit: 900
tags:
  - kubernetes
  - k8s
  - infrastructure
  - deployment
  - operations
  - reliability
requires_tools:
  - kubectl
---

# Kubernetes

## Quick Start (kubectl)

```bash
kubectl describe pod/<pod> -n <ns>
kubectl get events -n <ns> --sort-by=.lastTimestamp | tail -n 30
kubectl logs pod/<pod> -n <ns> --previous --tail=200
```

## Production Minimums

- Health: `readinessProbe` and `startupProbe` for safe rollouts
- Resources: set `requests`/`limits` to prevent noisy-neighbor failures
- Security: run as non-root and grant least privilege
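The three minimums can be combined in one Deployment. The function below is a minimal sketch that assumes an HTTP app serving a health endpoint at `/healthz` (a hypothetical path) on the given port. The function name `render_deployment`, the replica count, the UID, and all resource numbers are illustrative placeholders, not recommendations; size requests from observed usage. The function only prints YAML, so you can review the output before applying it.

```bash
# Print a Deployment with rollout limits, readiness/startup probes,
# requests/limits, and a non-root security context. Values are placeholders.
render_deployment() {
  local name="$1" image="$2" port="$3"
  cat <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${name}
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one extra pod during a rollout
      maxUnavailable: 0  # never drop below the desired replica count
  selector:
    matchLabels:
      app: ${name}
  template:
    metadata:
      labels:
        app: ${name}
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001   # any non-zero UID the image supports
      containers:
        - name: ${name}
          image: ${image}
          ports:
            - containerPort: ${port}
          readinessProbe:    # pod receives Service traffic only while this passes
            httpGet:
              path: /healthz
              port: ${port}
            periodSeconds: 5
          startupProbe:      # other probes wait until this succeeds (up to 30 x 5s)
            httpGet:
              path: /healthz
              port: ${port}
            periodSeconds: 5
            failureThreshold: 30
          resources:
            requests:        # used by the scheduler for placement
              cpu: 100m
              memory: 128Mi
            limits:          # hard caps: CPU is throttled, memory overuse is OOM-killed
              cpu: 500m
              memory: 256Mi
          securityContext:
            allowPrivilegeEscalation: false
EOF
}
```

Example usage, with a hypothetical image name: `render_deployment web registry.example.com/web:1.2.3 8080 | kubectl apply -n <ns> -f -`. Setting `maxUnavailable: 0` means a rollout advances only as new pods pass their readiness probe.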

## Load Next (References)

- `references/core-objects.md` — choose the right workload/controller and service type
- `references/rollouts-and-probes.md` — probes, rollouts, graceful shutdown, rollback
- `references/debugging-runbook.md` — common failure modes and a fast triage flow
- `references/security-hardening.md` — pod security, RBAC, network policy, supply chain

Overview

This skill is a Kubernetes operations playbook focused on deploying and running services reliably in production. It provides pragmatic guidance for core objects, health probes, resource sizing, safe rollouts, and fast kubectl debugging. The goal is to shorten mean time to resolution and reduce deployment risk with repeatable defaults.

How this skill works

The skill inspects deployment patterns and recommends concrete manifests and operational checks: proper workload/controller choices, service types, and security posture. It prescribes probes and resource settings to enable safe rollouts, and it provides a compact kubectl troubleshooting flow to quickly triage pods and recent events. Reference topics cover rollouts, debugging, and hardening for deeper investigation.

When to use it

- Deploying new services or changing rollout strategy in production
- Hardening pod security, RBAC, or network policies
- Diagnosing failed pods, crashes, or noisy-neighbor resource contention
- Creating or reviewing Kubernetes manifests for stability and observability

Best practices

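- Define a readinessProbe to keep traffic away from unready pods and a startupProbe to give slow-starting apps time before other probes run; together they make rollouts safe
- Set CPU/memory requests and limits: requests drive scheduler placement, and limits cap what a container can consume, which contains noisy-neighbor failures
- Run containers as non-root and apply least-privilege RBAC and network policies
- Use controlled rollouts (RollingUpdate with maxUnavailable/maxSurge) and roll back on failure; Deployments do not roll back automatically, so wire kubectl rollout undo or a rollout tool into your pipeline
- Collect and inspect recent events and previous pod logs during triage to identify root causes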
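The rollback practice can be sketched as a small wrapper, assuming kubectl is already configured for the target cluster. `kubectl rollout status` exits non-zero when the rollout does not finish within `--timeout` or exceeds the Deployment's `progressDeadlineSeconds`. `kubectl rollout undo` then reverts to the previous revision. The function name `rollout_or_undo` and the 300s default are assumptions.

```bash
# Wait for a Deployment rollout; if it fails or times out, revert to the
# previous revision and still report failure to the caller (e.g. CI).
rollout_or_undo() {
  local deploy="$1" ns="$2" timeout="${3:-300s}"
  if kubectl rollout status "deployment/${deploy}" -n "${ns}" --timeout="${timeout}"; then
    echo "rollout of ${deploy} succeeded"
    return 0
  fi
  echo "rollout of ${deploy} failed; rolling back" >&2
  kubectl rollout undo "deployment/${deploy}" -n "${ns}"
  return 1
}
```

Returning non-zero after the undo keeps a pipeline from treating a reverted deploy as a success.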

Example use cases

- Rapid triage of a crashing pod: describe the pod, inspect latest events, and fetch previous logs to find startup failures
- Preparing a service for production: choose the correct controller (Deployment/StatefulSet/CronJob), add probes, and set resources
- Mitigating noisy-neighbor issues by auditing and tuning requests/limits across a namespace
- Hardening a cluster: enforce pod security constraints, tighten RBAC, and verify supply-chain provenance for container images
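A requests audit can begin with a namespace-wide listing. The function below is a minimal first-pass sketch. It uses kubectl's `custom-columns` output, which prints `<none>` for unset fields, and lists pods with no CPU or memory request. The function name `audit_requests` is hypothetical. For multi-container pods the values are comma-joined per pod, so a single container missing a request can slip through.

```bash
# Print pods in a namespace whose containers report no CPU or memory request.
audit_requests() {
  local ns="$1"
  kubectl get pods -n "${ns}" --no-headers \
    -o custom-columns='POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory' \
    | awk '$2 == "<none>" || $3 == "<none>" { print $1 }'
}
```

To compare requests with actual consumption, run `kubectl top pods -n <ns>`. This command requires metrics-server in the cluster.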

FAQ

What minimal probes should I add?

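Use a readinessProbe to control whether a pod receives traffic, including during rollouts. For apps that start slowly, add a startupProbe: it holds off the liveness and readiness checks until startup succeeds. Liveness probes are optional. Add one only for failure modes the app cannot recover from on its own, such as a deadlock, because a failing liveness probe restarts the container, whereas a failing readiness probe only removes the pod from Service endpoints.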
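A startupProbe allows about `failureThreshold × periodSeconds` seconds for startup before the kubelet restarts the container. The helper below is a sketch that derives `failureThreshold` from a desired startup budget using ceiling division. The helper name `startup_probe` and the `/healthz` path are assumptions.

```bash
# Print a startupProbe whose failureThreshold covers the startup budget:
# failureThreshold = ceil(budget_seconds / period_seconds).
startup_probe() {
  local port="$1" budget="$2" period="${3:-10}"
  local threshold=$(( (budget + period - 1) / period ))
  cat <<EOF
startupProbe:
  httpGet:
    path: /healthz
    port: ${port}
  periodSeconds: ${period}
  failureThreshold: ${threshold}
EOF
}
```

For example, `startup_probe 8080 120 10` yields `failureThreshold: 12`, and a 125-second budget rounds up to 13.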

How do I quickly triage a pod that restarted?

Describe the pod to see its events, list recent namespace events sorted by timestamp, and fetch the previous container's logs with --previous (and --tail to limit the output) to inspect the crash output.
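The triage flow can be packaged as one function, a sketch assuming kubectl points at the right cluster. The function name `triage_pod` and the 200-line default are assumptions. The fallback to current logs covers pods that have not restarted yet; in that case `--previous` finds no earlier container and fails.

```bash
# Describe a pod, show recent namespace events, then print logs from the
# previous container instance (falling back to current logs if none exists).
triage_pod() {
  local pod="$1" ns="$2" lines="${3:-200}"
  echo "== describe pod/${pod} =="
  kubectl describe "pod/${pod}" -n "${ns}"
  echo "== recent events in ${ns} =="
  kubectl get events -n "${ns}" --sort-by=.lastTimestamp | tail -n 30
  echo "== previous logs (last ${lines} lines) =="
  kubectl logs "pod/${pod}" -n "${ns}" --previous --tail="${lines}" \
    || kubectl logs "pod/${pod}" -n "${ns}" --tail="${lines}"
}
```

For multi-container pods, add `-c <container>` to the logs commands to target a specific container.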