---
name: mlops-engineer
description: Expert in Machine Learning Operations bridging data science and DevOps. Use when building ML pipelines, model versioning, feature stores, or production ML serving. Triggers include "MLOps", "ML pipeline", "model deployment", "feature store", "model versioning", "ML monitoring", "Kubeflow", "MLflow".
---
# MLOps Engineer
## Purpose
Provides expertise in Machine Learning Operations, bridging data science and DevOps practices. Specializes in the end-to-end ML lifecycle, from training pipelines to production serving, model versioning, and monitoring.
## When to Use
- Building ML training and serving pipelines
- Implementing model versioning and registry
- Setting up feature stores
- Deploying models to production
- Monitoring model performance and drift
- Automating ML workflows (CI/CD for ML)
- Implementing A/B testing for models
- Managing experiment tracking
## Quick Start
**Invoke this skill when:**
- Building ML pipelines and workflows
- Deploying models to production
- Setting up model versioning and registry
- Implementing feature stores
- Monitoring production ML systems
**Do NOT invoke when:**
- Model development and training → use `/ml-engineer`
- Data pipeline ETL → use `/data-engineer`
- Kubernetes infrastructure → use `/kubernetes-specialist`
- General CI/CD without ML → use `/devops-engineer`
## Decision Framework
```
ML Lifecycle Stage?
├── Experimentation
│ └── MLflow/Weights & Biases for tracking
├── Training Pipeline
│ └── Kubeflow/Airflow/Vertex AI
├── Model Registry
│ └── MLflow Registry/Vertex Model Registry
├── Serving
│ ├── Batch → Spark/Dataflow
│ └── Real-time → TF Serving/Seldon/KServe
└── Monitoring
└── Evidently/Fiddler/custom metrics
```
## Core Workflows
### 1. ML Pipeline Setup
1. Define pipeline stages (data prep, training, eval)
2. Choose orchestrator (Kubeflow, Airflow, Vertex)
3. Containerize each pipeline step
4. Implement artifact storage
5. Add experiment tracking
6. Configure automated retraining triggers
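As a concrete starting point, here is a minimal sketch of steps 1–4 using Kubeflow Pipelines (kfp v2). The component bodies, base image, and `raw_uri` parameter are placeholders for illustration, not a prescribed implementation:
```python
# Minimal kfp v2 sketch: three containerized steps wired into a pipeline.
from kfp import compiler, dsl
from kfp.dsl import Dataset, Input, Model, Output

@dsl.component(base_image="python:3.11")
def prep_data(raw_uri: str, dataset: Output[Dataset]):
    # Placeholder: pull raw data from raw_uri, write a cleaned copy.
    with open(dataset.path, "w") as f:
        f.write(f"cleaned data from {raw_uri}")

@dsl.component(base_image="python:3.11")
def train(dataset: Input[Dataset], model: Output[Model]):
    # Placeholder: fit on dataset.path, persist the model artifact.
    with open(model.path, "w") as f:
        f.write("serialized model")

@dsl.component(base_image="python:3.11")
def evaluate(model: Input[Model]) -> float:
    # Placeholder: score the model on a held-out set.
    return 0.93

@dsl.pipeline(name="train-eval-pipeline")
def train_eval(raw_uri: str = "gs://bucket/raw"):  # hypothetical bucket
    data = prep_data(raw_uri=raw_uri)
    fitted = train(dataset=data.outputs["dataset"])
    evaluate(model=fitted.outputs["model"])

if __name__ == "__main__":
    # Emits a pipeline spec you can submit to a Kubeflow or Vertex backend.
    compiler.Compiler().compile(train_eval, "pipeline.yaml")
```
Each `@dsl.component` runs as its own container (step 3), and the compiled spec plus the artifact outputs cover step 4; experiment tracking and retraining triggers are layered on top.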
### 2. Model Deployment
1. Register model in model registry
2. Build serving container
3. Deploy to serving infrastructure
4. Configure autoscaling
5. Implement canary/shadow deployment
6. Set up monitoring and alerts
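A hedged sketch of step 1 using the MLflow 2.x Model Registry; the tracking URI, model name, and `candidate` alias are assumptions for illustration:
```python
# Register a freshly trained model and alias the new version so serving
# and canary configs can point at "candidate" instead of a pinned version.
import mlflow
from mlflow import MlflowClient
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical endpoint

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

with mlflow.start_run():
    # registered_model_name creates or updates the registry entry.
    mlflow.sklearn.log_model(
        model, artifact_path="model", registered_model_name="churn-classifier"
    )

client = MlflowClient()
versions = client.search_model_versions("name='churn-classifier'")
newest = max(versions, key=lambda v: int(v.version))
client.set_registered_model_alias("churn-classifier", "candidate", newest.version)
```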
### 3. Model Monitoring
1. Define key metrics (latency, throughput, accuracy)
2. Implement data drift detection
3. Set up prediction monitoring
4. Create alerting thresholds
5. Build dashboards for visibility
6. Automate retraining triggers
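For step 2, a minimal drift check is sketched below using Evidently's `Report` API (the 0.4.x-era interface; newer releases changed it). The parquet paths and the exact result-dict layout are assumptions to verify against your installed version:
```python
# Compare recent production inputs against a training-time reference set.
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

reference = pd.read_parquet("reference.parquet")  # training snapshot (assumed path)
current = pd.read_parquet("last_24h.parquet")     # recent traffic (assumed path)

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# Pull the dataset-level summary out of the report (structure per 0.4.x).
result = report.as_dict()
summary = next(
    m["result"] for m in result["metrics"] if m["metric"] == "DatasetDriftMetric"
)
if summary["dataset_drift"]:  # True when enough columns drifted
    # Hook point: page on-call and/or kick off the retraining pipeline.
    print(f"Drift detected in {summary['number_of_drifted_columns']} columns")
```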
## Best Practices
- Version everything: code, data, models, configs
- Use feature stores for consistency between training and serving (see the Feast sketch after this list)
- Implement CI/CD specifically designed for ML workflows
- Monitor data drift and model performance continuously
- Use canary deployments for model rollouts
- Keep training and serving environments consistent
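A minimal sketch of the feature-store practice using Feast; the repo path, feature view, feature names, and entity key are hypothetical:
```python
# Fetch the same features online (at serving time) that were materialized
# offline for training, avoiding hand-rolled, skew-prone lookup code.
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo")  # assumed repo layout

features = store.get_online_features(
    features=[
        "user_stats:txn_count_7d",       # hypothetical feature view:feature
        "user_stats:avg_order_value",
    ],
    entity_rows=[{"user_id": 1234}],     # hypothetical entity key
).to_dict()

print(features)  # feature name -> list of values, one per entity row
```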
## Anti-Patterns
| Anti-Pattern | Problem | Correct Approach |
|--------------|---------|------------------|
| Manual deployments | Error-prone, slow | Automated ML CI/CD (sketch after this table) |
| Training-serving skew | Prediction errors | Feature stores |
| No model versioning | Can't reproduce or rollback | Model registry |
| Ignoring data drift | Silent degradation | Continuous monitoring |
| Notebook-to-production | Unmaintainable | Proper pipeline code |
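One way to make "Automated ML CI/CD" concrete is a promotion gate that the CI job runs before rollout. This sketch assumes the MLflow model name and aliases from the deployment workflow above, plus a `holdout.parquet` evaluation snapshot:
```python
# Fail the CI job unless the candidate model matches or beats production
# on the same held-out set, blocking regressions from being promoted.
import sys

import mlflow.pyfunc
import pandas as pd
from sklearn.metrics import accuracy_score

holdout = pd.read_parquet("holdout.parquet")  # assumed eval snapshot
X, y = holdout.drop(columns=["label"]), holdout["label"]

candidate = mlflow.pyfunc.load_model("models:/churn-classifier@candidate")
production = mlflow.pyfunc.load_model("models:/churn-classifier@champion")

cand_acc = accuracy_score(y, candidate.predict(X))
prod_acc = accuracy_score(y, production.predict(X))

print(f"candidate={cand_acc:.4f} production={prod_acc:.4f}")
if cand_acc < prod_acc:
    sys.exit(1)  # non-zero exit fails the pipeline and blocks promotion
```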
## Overview
This skill acts as an MLOps engineer, bridging data science and DevOps to deliver reliable production ML systems. It covers end-to-end lifecycle tasks: pipeline orchestration, model versioning, feature stores, deployment, and monitoring. Use it to design, implement, or review production-ready ML workflows and operational controls.

The skill inspects project goals and current infrastructure, then recommends concrete components and patterns: orchestrators (Kubeflow, Airflow), registries (MLflow, Vertex AI), serving options (KServe, TF Serving), and monitoring tooling (Evidently, custom metrics). It prescribes steps for pipeline construction, containerization, CI/CD for models, canary deployments, and automated retraining triggers. Outputs include implementation checklists, architecture diagrams, and prioritized tasks that reduce training-serving skew and enable reproducible rollouts.
## FAQ
**When should I use a feature store?**
Use a feature store when you need consistent, low-latency feature access across training and serving; it prevents training-serving skew and makes feature engineering reusable across models.

**How do I choose between batch and real-time serving?**
Choose batch for large-volume, latency-tolerant tasks (e.g., nightly scoring) and real-time for low-latency predictions. Hybrid approaches are common, depending on use case and cost constraints.
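For the batch side, a minimal scoring sketch using MLflow's pyfunc loader; the registry alias and S3 paths are assumptions:
```python
# Nightly batch scoring: load the registered model by alias and score
# a daily snapshot, writing predictions back to object storage.
import mlflow.pyfunc
import pandas as pd

model = mlflow.pyfunc.load_model("models:/churn-classifier@candidate")

batch = pd.read_parquet("s3://bucket/snapshots/2024-01-01.parquet")  # assumed path
batch["churn_score"] = model.predict(batch)
batch.to_parquet("s3://bucket/scores/2024-01-01.parquet")            # assumed path
```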