---
name: containerization
description: Docker, Kubernetes, container orchestration, and cloud-native deployment for data applications
sasmp_version: "1.3.0"
bonded_agent: 03-devops-engineer
bond_type: PRIMARY_BOND
skill_version: "2.0.0"
last_updated: "2025-01"
complexity: intermediate
estimated_mastery_hours: 120
prerequisites: [python-programming, cloud-platforms]
unlocks: [mlops, big-data]
---
# Containerization & Kubernetes
Production-grade container orchestration for data engineering workloads with Docker and Kubernetes.
## Quick Start
```dockerfile
# Dockerfile for PySpark data application
FROM python:3.12-slim-bookworm
# Install Java for Spark (openjdk-17 is packaged in Debian bookworm)
RUN apt-get update && \
    apt-get install -y --no-install-recommends openjdk-17-jdk-headless && \
    apt-get clean && rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Install dependencies first (cache optimization)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY src/ ./src/
COPY config/ ./config/
# Non-root user for security
RUN useradd -m appuser && chown -R appuser:appuser /app
USER appuser
ENV PYTHONPATH=/app
# Architecture-specific path: on arm64 images the directory ends in -arm64
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
ENTRYPOINT ["python", "-m", "src.main"]
```
## Core Concepts
### 1. Multi-Stage Builds
```dockerfile
# Build stage
FROM python:3.12 AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip wheel --no-cache-dir --wheel-dir /wheels -r requirements.txt
# Runtime stage
FROM python:3.12-slim AS runtime
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY src/ /app/src/
WORKDIR /app
USER 1000
CMD ["python", "-m", "src.main"]
```
### 2. Kubernetes Deployment
```yaml
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etl-worker
  labels:
    app: etl-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: etl-worker
  template:
    metadata:
      labels:
        app: etl-worker
    spec:
      containers:
        - name: etl-worker
          image: company/etl-worker:v1.2.0
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2000m"
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: url
            - name: LOG_LEVEL
              value: "INFO"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: etl-worker
                topologyKey: kubernetes.io/hostname
```
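The probes above expect the container to answer `GET /health` and `GET /ready` on port 8080. The following is a minimal sketch of such endpoints using only the Python standard library; the names `ProbeHandler`, `start_probe_server`, and `ready` are illustrative assumptions, not part of the original skill, and a real worker may already expose these paths through its web framework.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag: set it once startup work (connections, caches) is done.
ready = threading.Event()


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers the liveness (/health) and readiness (/ready) probes."""

    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up and able to serve requests.
            self._reply(200, b"ok")
        elif self.path == "/ready":
            # Readiness: return 503 until the worker has finished starting.
            if ready.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"starting")
        else:
            self._reply(404, b"not found")

    def _reply(self, status, body):
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the application log


def start_probe_server(port=8080):
    """Serve the probe endpoints on a daemon thread and return the server."""
    server = HTTPServer(("0.0.0.0", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A worker would typically call `start_probe_server()` early in startup and `ready.set()` once it can accept work, so the pod only receives traffic after the readiness probe passes.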
### 3. Kubernetes CronJob for ETL
```yaml
# cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-etl
spec:
  schedule: "0 2 * * *"  # 2 AM daily
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      backoffLimit: 2
      activeDeadlineSeconds: 7200  # 2 hour timeout
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: etl-job
              image: company/etl-pipeline:v1.0.0
              resources:
                requests:
                  memory: "4Gi"
                  cpu: "2000m"
                limits:
                  memory: "8Gi"
                  cpu: "4000m"
              env:
                # Placeholder: Kubernetes passes env values literally and does not
                # expand templates. Render this with a templating tool before
                # applying, or compute the date inside the job.
                - name: EXECUTION_DATE
                  value: "{{ .Date }}"
              volumeMounts:
                - name: config
                  mountPath: /app/config
                  readOnly: true
          volumes:
            - name: config
              configMap:
                name: etl-config
```
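The job receives its run date through the `EXECUTION_DATE` environment variable. Because Kubernetes passes env values as literal strings, the entrypoint may need to handle a missing or unrendered value. The sketch below assumes an ISO `YYYY-MM-DD` format and falls back to the current UTC date; the function name `resolve_execution_date` and the fallback policy are assumptions for illustration.

```python
import os
from datetime import date, datetime, timezone


def resolve_execution_date(env=None):
    """Return the run date from EXECUTION_DATE, or today's UTC date as a fallback.

    An unset value or an unrendered template placeholder (starting with "{{")
    triggers the fallback. A malformed date raises ValueError so the job fails fast.
    """
    env = os.environ if env is None else env
    raw = env.get("EXECUTION_DATE", "").strip()
    if raw and not raw.startswith("{{"):
        return date.fromisoformat(raw)
    return datetime.now(timezone.utc).date()
```

Failing on a malformed date, rather than silently defaulting, keeps a misconfigured backfill from processing the wrong partition.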
### 4. Helm Chart Structure
```yaml
# Chart.yaml
apiVersion: v2
name: data-pipeline
version: 1.0.0
appVersion: "2.0.0"
description: Data pipeline Helm chart

# values.yaml
replicaCount: 3
image:
  repository: company/data-pipeline
  tag: "2.0.0"  # pin a version; avoid "latest" in production
  pullPolicy: IfNotPresent
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "4Gi"
    cpu: "2000m"
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
env:
  LOG_LEVEL: INFO
  BATCH_SIZE: "1000"
secrets:
  - name: DATABASE_URL
    secretName: db-credentials
    key: url
```
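The `env` and `secrets` values above are presumably mapped to container environment variables by the chart's templates, which are not shown here. Under that assumption, the application side could load and validate them as in this sketch; the `Settings` class and `load_settings` function are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    log_level: str
    batch_size: int
    database_url: str


def load_settings(env=None):
    """Build Settings from environment variables, failing fast on bad input."""
    env = os.environ if env is None else env
    database_url = env.get("DATABASE_URL")
    if not database_url:
        # The secret is required; a missing value usually means a bad secretKeyRef.
        raise RuntimeError("DATABASE_URL is not set")
    batch_size = int(env.get("BATCH_SIZE", "1000"))  # Helm passes numbers as strings
    if batch_size <= 0:
        raise ValueError("BATCH_SIZE must be a positive integer")
    return Settings(
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        batch_size=batch_size,
        database_url=database_url,
    )
```

Validating at startup surfaces a misconfigured release as an immediate, readable crash instead of a failure partway through a batch.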
### 5. Docker Compose for Local Dev
```yaml
# docker-compose.yml
version: '3.8'
services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: datawarehouse
      POSTGRES_USER: admin
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U admin"]
      interval: 5s
      timeout: 5s
      retries: 5
  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
  airflow-webserver:
    image: apache/airflow:2.8.0-python3.11
    # The image has no default command; start the webserver explicitly
    command: webserver
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
    environment:
      AIRFLOW__CORE__EXECUTOR: CeleryExecutor
      AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://admin:${DB_PASSWORD}@postgres/datawarehouse
      AIRFLOW__CELERY__BROKER_URL: redis://redis:6379/0
    ports:
      - "8080:8080"
    volumes:
      - ./dags:/opt/airflow/dags
      - ./plugins:/opt/airflow/plugins
volumes:
  postgres_data:
```
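The connection string above interpolates `${DB_PASSWORD}` directly into a URL, so a password containing characters such as `@`, `/`, or `:` would be misparsed. One way to avoid this is to percent-encode the credentials before building the URL. The helper below is a sketch with a hypothetical name (`build_postgres_url`), using only `urllib.parse` from the standard library.

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, database, port=None):
    """Build a SQLAlchemy-style PostgreSQL URL with percent-encoded credentials."""
    netloc = host if port is None else f"{host}:{port}"
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{netloc}/{database}"
    )
```

For example, `build_postgres_url("admin", "p@ss/w:rd", "postgres", "datawarehouse")` encodes the password as `p%40ss%2Fw%3Ard`, and the result can be exported as `AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`.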
## Tools & Technologies
| Tool | Purpose | Version (2025) |
|------|---------|----------------|
| **Docker** | Containerization | 25+ |
| **Kubernetes** | Orchestration | 1.29+ |
| **Helm** | K8s package manager | 3.14+ |
| **ArgoCD** | GitOps deployment | 2.10+ |
| **Kustomize** | K8s config management | Built-in |
| **containerd** | Container runtime | 1.7+ |
| **Podman** | Docker alternative | 4.8+ |
## Troubleshooting Guide
| Issue | Symptoms | Root Cause | Fix |
|-------|----------|------------|-----|
| **OOMKilled** | Pod restarts, exit code 137 | Memory limit exceeded | Increase limits, optimize code |
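| **CrashLoopBackOff** | Pod keeps restarting | App crash, bad config | Check logs of the crashed container: `kubectl logs <pod-name> --previous` |
| **ImagePullBackOff** | Pod stuck in Pending, image pull errors in events | Image not found or registry auth failure | Check image name and tag, `imagePullSecrets` |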
| **Pending Pod** | Pod won't schedule | No resources, node selector | Check resources, affinity rules |
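
Exit code 137 in the table follows the shell convention that a process killed by a signal reports 128 plus the signal number, so 137 = 128 + 9 (SIGKILL), which is what the kernel's OOM killer sends. The helper below is an illustrative sketch (the function name is hypothetical) for decoding container exit codes when triaging restarts; it assumes a POSIX system where these signals are defined.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short human-readable description."""
    if code == 0:
        return "success"
    if code > 128:
        # Codes above 128 conventionally mean "terminated by signal (code - 128)".
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"application error (exit {code})"
```

Under this convention, 137 maps to SIGKILL (often an OOM kill, confirmed by `Reason: OOMKilled` in `kubectl describe pod`) and 143 maps to SIGTERM (a normal shutdown request).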
### Debug Commands
```bash
# Check pod status and events
kubectl describe pod <pod-name>
# View container logs
kubectl logs <pod-name> -c <container-name> --previous
# Execute shell in container
kubectl exec -it <pod-name> -- /bin/sh
# Check resource usage
kubectl top pods
# Debug networking
kubectl run debug --image=busybox -it --rm -- sh
```
## Best Practices
```dockerfile
# ✅ DO: Use specific image tags
FROM python:3.12.1-slim
# ✅ DO: Use non-root user
USER 1000
# ✅ DO: Use multi-stage builds
# ✅ DO: Set resource limits
# ✅ DO: Use health checks
# ❌ DON'T: Run as root
# ❌ DON'T: Use latest tag
# ❌ DON'T: Store secrets in images
```
## Resources
- [Docker Best Practices](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/)
- [Kubernetes Docs](https://kubernetes.io/docs/)
- [Helm Charts](https://helm.sh/docs/)
---
**Skill Certification Checklist:**
- [ ] Can write production Dockerfiles
- [ ] Can deploy applications to Kubernetes
- [ ] Can create Helm charts
- [ ] Can debug container issues
- [ ] Can implement health checks and probes