
ml-engineer-skill

/ml-engineer-skill

This skill helps you design and operationalize scalable ML pipelines, deployments, and monitoring so you can ship production-ready AI systems faster.

npx playbooks add skill 404kidwiz/claude-supercode-skills --skill ml-engineer-skill


SKILL.md
---
name: ml-engineer
description: Expert in building scalable ML systems, from data pipelines and model training to production deployment and monitoring.
---

# Machine Learning Engineer

## Purpose

Provides MLOps and production ML engineering expertise specializing in end-to-end ML pipelines, model deployment, and infrastructure automation. Bridges data science and production engineering with robust, scalable machine learning systems.

## When to Use

- Building end-to-end ML pipelines (Data → Train → Validate → Deploy)
- Deploying models to production (Real-time API, Batch, or Edge)
- Implementing MLOps practices (CI/CD for ML, Experiment Tracking)
- Optimizing model performance (Latency, Throughput, Resource usage)
- Setting up feature stores and model registries
- Implementing model monitoring (Drift detection, Performance tracking)
- Scaling training workloads (Distributed training)

---

## 2. Decision Framework

### Model Serving Strategy

```
Need to serve predictions?
│
├─ Real-time (Low Latency)?
│  │
│  ├─ High Throughput? → **Kubernetes (KServe/Seldon)**
│  ├─ Low/Medium Traffic? → **Serverless (Lambda/Cloud Run)**
│  └─ Ultra-low latency (<10ms)? → **C++/Rust Inference Server (Triton)**
│
├─ Batch Processing?
│  │
│  ├─ Large Scale? → **Spark / Ray**
│  └─ Scheduled Jobs? → **Airflow / Prefect**
│
└─ Edge / Client-side?
   │
   ├─ Mobile? → **TFLite / CoreML**
   └─ Browser? → **TensorFlow.js / ONNX Runtime Web**
```
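
Whichever real-time target you choose, the model usually sits behind a thin HTTP layer. A minimal sketch, assuming a FastAPI app serving an MLflow pyfunc model; the registry URI and feature names are illustrative, and Pydantic v2 is assumed:

```python
# Minimal real-time serving sketch (FastAPI + MLflow pyfunc).
# "models:/churn-model/Production" and the feature names are assumptions.
import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/churn-model/Production")

class PredictRequest(BaseModel):  # Pydantic v2 assumed
    tenure: int
    monthly_charges: float

@app.get("/health")
def health():
    # Liveness/readiness probe target for Kubernetes or serverless platforms
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest):
    features = pd.DataFrame([req.model_dump()])
    prediction = model.predict(features)
    return {"prediction": prediction.tolist()}
```

The same container image can back KServe/Seldon, Cloud Run, or a plain Deployment; only the autoscaling layer changes.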

### Training Infrastructure

```
Training Environment?
│
├─ Single Node?
│  │
│  ├─ Interactive? → **JupyterHub / SageMaker Notebooks**
│  └─ Automated? → **Docker Container on VM**
│
└─ Distributed?
   │
   ├─ Data Parallelism? → **Ray Train / PyTorch DDP**
   └─ Pipeline orchestration? → **Kubeflow / Airflow / Vertex AI**
```
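
For the data-parallel branch, a minimal PyTorch DDP sketch, assuming a `torchrun --nproc_per_node=N train_ddp.py` launch and a toy model/loop standing in for the real ones:

```python
# Minimal data-parallel training sketch with PyTorch DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # torchrun sets the rendezvous env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 2).cuda(local_rank) # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(10):                           # stand-in for a DataLoader + DistributedSampler
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 2, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                              # gradients are all-reduced across workers
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```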

### Feature Store Decision

| Need | Recommendation | Rationale |
|------|----------------|-----------|
| **Simple / MVP** | **No Feature Store** | Use SQL/Parquet files; a feature store adds more overhead than value at this stage. |
| **Team Consistency** | **Feast** | Open source, manages online/offline consistency. |
| **Enterprise / Managed** | **Tecton / Hopsworks** | Full governance, lineage, managed SLA. |
| **Cloud Native** | **Vertex/SageMaker FS** | Tight integration if already in that cloud ecosystem. |
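
If Feast is the chosen route, online retrieval at serving time is a single call. A sketch, assuming a hypothetical `customer_stats` feature view keyed by `customer_id`:

```python
# Minimal Feast online lookup sketch; feature view and entity names are assumptions.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a Feast repo with feature_store.yaml

features = store.get_online_features(
    features=[
        "customer_stats:avg_order_value",
        "customer_stats:orders_last_30d",
    ],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
```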

**Red Flags → Escalate to `oracle`:**
- "Real-time" training requirements (online learning) without massive infrastructure budget
- Deploying LLMs (7B+ params) on CPU-only infrastructure
- Training on PII/PHI data without privacy-preserving techniques (Federated Learning, Differential Privacy)
- No validation set or "ground truth" feedback loop mechanism

---

## 3. Core Workflows

### Workflow 1: End-to-End Training Pipeline

**Goal:** Automate model training, validation, and registration using MLflow.

**Steps:**

1.  **Setup Tracking**
    ```python
    import mlflow
    import mlflow.sklearn
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_score

    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("churn-prediction-prod")
    ```

2.  **Training Script (`train.py`)**
    ```python
    # X_train, X_test, y_train, y_test are assumed to be prepared by the
    # surrounding pipeline (e.g. a preceding data-loading step).
    def train(max_depth, n_estimators):
        with mlflow.start_run():
            # Log params
            mlflow.log_param("max_depth", max_depth)
            mlflow.log_param("n_estimators", n_estimators)
            
            # Train
            model = RandomForestClassifier(
                max_depth=max_depth, 
                n_estimators=n_estimators,
                random_state=42
            )
            model.fit(X_train, y_train)
            
            # Evaluate
            preds = model.predict(X_test)
            acc = accuracy_score(y_test, preds)
            prec = precision_score(y_test, preds)
            
            # Log metrics
            mlflow.log_metric("accuracy", acc)
            mlflow.log_metric("precision", prec)
            
            # Log model artifact with signature
            from mlflow.models.signature import infer_signature
            signature = infer_signature(X_train, preds)
            
            mlflow.sklearn.log_model(
                model, 
                "model",
                signature=signature,
                registered_model_name="churn-model"
            )
            
            print(f"Run ID: {mlflow.active_run().info.run_id}")
    
    if __name__ == "__main__":
        train(max_depth=5, n_estimators=100)
    ```

3.  **Pipeline Orchestration (Bash/Airflow)**
    ```bash
    #!/bin/bash
    # Run training
    python train.py
    
    # Check if model passed threshold (e.g. via MLflow API)
    # If yes, transition to Staging
    ```
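
The threshold check hinted at in the comments could be a small Python gate. A sketch: the 0.85 cutoff is an example value, and stage transitions are the classic Model Registry flow (newer MLflow versions also offer aliases):

```python
# Hypothetical gate: promote the newest registered version to Staging
# only if its logged accuracy clears a threshold.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")
MODEL_NAME = "churn-model"

latest = client.get_latest_versions(MODEL_NAME, stages=["None"])[0]
run = client.get_run(latest.run_id)

if run.data.metrics.get("accuracy", 0.0) >= 0.85:  # example threshold
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=latest.version,
        stage="Staging",
    )
```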

---

### Workflow 3: Drift Detection (Monitoring)

**Goal:** Detect if production data distribution has shifted from training data.

**Steps:**

1.  **Baseline Generation (During Training)**
    ```python
    import evidently
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    # Calculate baseline profile on training data
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=train_df, current_data=test_df)
    report.save_json("baseline_drift.json")
    ```

2.  **Production Monitoring Job**
    ```python
    # Scheduled daily job
    def check_drift():
        # Load production logs (last 24h)
        current_data = load_production_logs()
        reference_data = load_training_data()
        
        report = Report(metrics=[DataDriftPreset()])
        report.run(reference_data=reference_data, current_data=current_data)
        
        result = report.as_dict()
        dataset_drift = result['metrics'][0]['result']['dataset_drift']
        
        if dataset_drift:
            trigger_alert("Data Drift Detected!")
            trigger_retraining()
    ```
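
One way to run this on a schedule (Airflow is an assumption here; Prefect or plain cron work just as well):

```python
# Daily Airflow DAG that calls the check_drift() function defined above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="drift_monitoring",
    schedule="@daily",          # Airflow 2.4+; use schedule_interval on older versions
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="check_drift", python_callable=check_drift)
```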

---

### Workflow 5: RAG Pipeline with Vector Database

**Goal:** Build a production retrieval pipeline using Pinecone/Weaviate and LangChain.

**Steps:**

1.  **Ingestion (Chunking & Embedding)**
    ```python
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings
    from langchain_pinecone import PineconeVectorStore
    
    # Chunking
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    docs = text_splitter.split_documents(raw_documents)
    
    # Embedding & Indexing
    embeddings = OpenAIEmbeddings()
    vectorstore = PineconeVectorStore.from_documents(
        docs, 
        embeddings, 
        index_name="knowledge-base"
    )
    ```

2.  **Retrieval & Generation**
    ```python
    from langchain.chains import RetrievalQA
    from langchain_openai import ChatOpenAI
    
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
    )
    
    response = qa_chain.invoke("How do I reset my password?")
    print(response['result'])
    ```

3.  **Optimization (Hybrid Search)**
    -   Combine **Dense Retrieval** (Vectors) with **Sparse Retrieval** (BM25/Keywords).
-   Use **Reranking** (Cohere/Cross-Encoder) on the top 20 results to select the best 5 (see the sketch below).
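
A sketch of the over-fetch-then-rerank step with a cross-encoder; the model name is an example, and `vectorstore` is the index built in step 1:

```python
# Rerank the top-20 vector hits with a cross-encoder, keep the best 5.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

query = "How do I reset my password?"
candidates = vectorstore.similarity_search(query, k=20)          # over-fetch

scores = reranker.predict([(query, doc.page_content) for doc in candidates])
top_docs = [
    doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)[:5]
]
```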

---

## 5. Anti-Patterns & Gotchas

### ❌ Anti-Pattern 1: Training-Serving Skew

**What it looks like:**
-   Feature logic implemented in SQL for training, but re-implemented in Java/Python for serving.
-   "Mean imputation" value calculated on training set but not saved; serving uses a different default.

**Why it fails:**
-   Model behaves unpredictably in production.
-   Debugging is extremely difficult.

**Correct approach:**
-   Use a **Feature Store** or shared library for transformations.
-   Wrap preprocessing logic **inside** the model artifact (e.g., a Scikit-Learn Pipeline or TensorFlow Transform), as sketched below.
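
A sketch of the second option: fit imputation and scaling inside a Scikit-Learn `Pipeline` so the logged artifact carries its own preprocessing. Column names are illustrative, and `X_train`/`y_train` are the training split from Workflow 1:

```python
# Preprocessing is fitted and serialized together with the model,
# so serving reuses the exact same transformations.
import mlflow
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["tenure", "monthly_charges"]      # illustrative columns
categorical = ["plan_type"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipeline = Pipeline([("preprocess", preprocess),
                     ("model", RandomForestClassifier(random_state=42))])

with mlflow.start_run():
    pipeline.fit(X_train, y_train)               # raw features in, no manual prep
    mlflow.sklearn.log_model(pipeline, "model")  # imputer/scaler ship with the model
```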

### ❌ Anti-Pattern 2: Manual Deployments

**What it looks like:**
-   A data scientist emails a `.pkl` file to an engineer.
-   The engineer manually copies it to a server and restarts the Flask app.

**Why it fails:**
-   No version control.
-   No reproducibility.
-   High risk of human error.

**Correct approach:**
-   **CI/CD Pipeline:** Git push triggers build → test → deploy.
-   **Model Registry:** Deploy specific version hash from registry.

### ❌ Anti-Pattern 3: Silent Failures

**What it looks like:**
-   Model API returns `200 OK` but prediction is garbage because input data was corrupted (e.g., all Nulls).
-   Model returns default class `0` for everything.

**Why it fails:**
-   Application keeps running, but business value is lost.
-   Incident detected weeks later by business stakeholders.

**Correct approach:**
-   **Input Schema Validation:** Reject bad requests (Pydantic/TFX).
-   **Output Monitoring:** Alert if the prediction distribution shifts (e.g., if the model predicts "Fraud" 0% of the time for an hour). See the combined sketch below.
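
A combined sketch of both guards. Field names, bounds, and the 1000-request window are assumptions; `model` and `trigger_alert` stand in for the real serving objects:

```python
# Reject bad inputs and flag degenerate outputs.
from collections import deque
from pydantic import BaseModel, Field, ValidationError

class PredictRequest(BaseModel):
    tenure: int = Field(ge=0, le=600)        # months; bounds are examples
    monthly_charges: float = Field(gt=0)

recent_preds = deque(maxlen=1000)

def safe_predict(payload: dict):
    try:
        req = PredictRequest(**payload)      # raises ValidationError on nulls / bad types
    except ValidationError as exc:
        raise ValueError(f"Invalid input rejected: {exc}") from exc

    pred = model.predict([[req.tenure, req.monthly_charges]])[0]
    recent_preds.append(pred)

    # Output guard: alert if the model stops predicting the positive class entirely
    if len(recent_preds) == recent_preds.maxlen and sum(recent_preds) == 0:
        trigger_alert("Model predicted class 0 for the last 1000 requests")
    return pred
```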

---

## 7. Quality Checklist

**Reliability:**
-   [ ] **Health Checks:** `/health` endpoint implemented (liveness/readiness).
-   [ ] **Retries:** Client has retry logic with exponential backoff.
-   [ ] **Fallback:** A default heuristic exists if the model fails or times out (see the sketch below).
-   [ ] **Validation:** Inputs validated against schema before inference.
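
A sketch of the retry and fallback items together, on the client side. The endpoint URL, timeout, and fallback heuristic are assumptions:

```python
# Client-side retries with exponential backoff plus a heuristic fallback.
import time
import requests

def get_prediction(payload: dict, retries: int = 3) -> float:
    for attempt in range(retries):
        try:
            resp = requests.post("http://model-service/predict", json=payload, timeout=0.5)
            resp.raise_for_status()
            return resp.json()["prediction"]
        except requests.RequestException:
            time.sleep(0.1 * 2 ** attempt)   # 0.1s, 0.2s, 0.4s
    # Fallback: a simple heuristic so the caller still gets an answer
    return 1.0 if payload.get("monthly_charges", 0) > 80 else 0.0
```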

**Performance:**
-   [ ] **Latency:** P99 latency meets SLA (e.g., < 100ms).
-   [ ] **Throughput:** System autoscales with load.
-   [ ] **Batching:** Inference requests batched if using GPU.
-   [ ] **Image Size:** Docker image optimized (slim base, multi-stage build).

**Reproducibility:**
-   [ ] **Versioning:** Code, Data, and Model versions linked.
-   [ ] **Artifacts:** Saved in object storage (S3/GCS), not local disk.
-   [ ] **Environment:** Dependencies pinned (`requirements.txt` / `conda.yaml`).

**Monitoring:**
-   [ ] **Technical:** Latency, Error Rate, CPU/Memory/GPU usage.
-   [ ] **Functional:** Prediction distribution, Input data drift.
-   [ ] **Business:** (If possible) Attribution of prediction to outcome.

## Anti-Patterns

### Training-Serving Skew

- **Problem**: Feature logic differs between training and serving environments
- **Symptoms**: Model performs well in testing but poorly in production
- **Solution**: Use feature stores or embed preprocessing in model artifacts
- **Warning Signs**: Different code paths for feature computation, hardcoded constants

### Manual Deployment

- **Problem**: Deploying models without automation or version control
- **Symptoms**: No traceability, human errors, deployment failures
- **Solution**: Implement CI/CD pipelines with model registry integration
- **Warning Signs**: Email/file transfers of model files, manual server restarts

### Silent Failures

- **Problem**: Model failures go undetected
- **Symptoms**: Bad predictions returned without error indication
- **Solution**: Implement input validation, output monitoring, and alerting
- **Warning Signs**: 200 OK responses with garbage data, no anomaly detection

### Data Leakage

- **Problem**: Training data contains information not available at prediction time
- **Symptoms**: Unrealistically high training accuracy, poor generalization
- **Solution**: Careful feature engineering and time-aware validation splits (see the sketch below)
- **Warning Signs**: Features that would only be known after the prediction is made
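
A sketch of one common guard against leakage, a strictly time-based split. The dataset path and column names are assumptions:

```python
# Train on events strictly before the cutoff, evaluate on events after it.
import pandas as pd

df = pd.read_parquet("events.parquet")      # assumed dataset with a datetime "event_date" column
df = df.sort_values("event_date")

cutoff = pd.Timestamp("2024-01-01")
train_df = df[df["event_date"] < cutoff]
test_df = df[df["event_date"] >= cutoff]

# Features must be computable from information available *before* event_date,
# e.g. rolling aggregates over prior rows, never post-outcome fields.
```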

## Overview

This skill provides MLOps and production ML engineering expertise for building scalable, reliable machine learning systems. It covers end-to-end pipelines from data ingestion and training to deployment, monitoring, and scaling. The focus is practical: reproducible artifacts, automated CI/CD, and robust production guarantees.

## How this skill works

The skill inspects system requirements and recommends strategies for serving, training, feature storage, and monitoring. It encodes decision frameworks (real-time vs batch vs edge), training infrastructure choices (single-node vs distributed), and feature store trade-offs. It also supplies concrete workflows for training pipelines, drift detection, RAG ingestion, and quality checklists to operationalize models.

## When to use it

- Building an end-to-end ML pipeline (ingest → train → validate → deploy).
- Deploying models to production (real-time APIs, batch jobs, or edge devices).
- Implementing MLOps practices: CI/CD, experiment tracking, and model registry.
- Setting up monitoring for drift, performance, and business metrics.
- Scaling training workloads or selecting distributed training and orchestration tools.
- Designing retrieval pipelines or vector search for knowledge-centered applications.

## Best practices

- Embed preprocessing inside model artifacts or use a feature store to avoid training-serving skew.
- Automate deployments: CI/CD pipelines with model registry and versioned artifacts.
- Implement input schema validation and output monitoring to catch silent failures early.
- Define clear SLAs and test for latency, throughput, and autoscaling behavior.
- Store artifacts and metadata in object storage and tracking systems for reproducibility.

## Example use cases

- Churn prediction pipeline: track experiments with MLflow, register models, and promote to staging automatically.
- Production drift monitoring: schedule daily checks with Evidently and trigger retraining on detected drift.
- Low-latency serving: choose Kubernetes (KServe/Seldon) for high throughput or serverless for moderate traffic.
- RAG knowledge base: chunk, embed, and index documents into Pinecone/Weaviate and build a retrieval + LLM QA chain.
- Distributed training: use Ray Train or PyTorch DDP with Kubeflow/Airflow orchestration for large datasets.

## FAQ

**Which serving option should I use for ultra-low-latency inference?**

Use a specialized inference server (Triton) or a C++/Rust implementation, and keep the model and its GPUs co-located with the calling service to meet sub-10 ms SLAs.

**When should I adopt a feature store?**

Skip it for simple MVPs. Adopt Feast for team consistency, and managed solutions (Tecton/Hopsworks) for enterprise governance and SLAs.

**How do I avoid silent failures in production?**

Validate inputs (Pydantic/TFX), monitor prediction distributions, set alerts, and provide fallbacks or heuristics when the model fails.