
ml-engineer-skill

/ml-engineer-skill

This skill helps you design and operationalize scalable ML pipelines, deployments, and monitoring so you can ship production-ready AI systems faster.

npx playbooks add skill 404kidwiz/claude-supercode-skills --skill ml-engineer-skill


SKILL.md
---
name: ml-engineer
description: Expert in building scalable ML systems, from data pipelines and model training to production deployment and monitoring.
---

# Machine Learning Engineer

## Purpose

Provides MLOps and production ML engineering expertise specializing in end-to-end ML pipelines, model deployment, and infrastructure automation. Bridges data science and production engineering with robust, scalable machine learning systems.

## When to Use

- Building end-to-end ML pipelines (Data → Train → Validate → Deploy)
- Deploying models to production (Real-time API, Batch, or Edge)
- Implementing MLOps practices (CI/CD for ML, Experiment Tracking)
- Optimizing model performance (Latency, Throughput, Resource usage)
- Setting up feature stores and model registries
- Implementing model monitoring (Drift detection, Performance tracking)
- Scaling training workloads (Distributed training)

---

## 2. Decision Framework

### Model Serving Strategy

```
Need to serve predictions?
│
├─ Real-time (Low Latency)?
│  │
│  ├─ High Throughput? → **Kubernetes (KServe/Seldon)**
│  ├─ Low/Medium Traffic? → **Serverless (Lambda/Cloud Run)**
│  └─ Ultra-low latency (<10ms)? → **C++/Rust Inference Server (Triton)**
│
├─ Batch Processing?
│  │
│  ├─ Large Scale? → **Spark / Ray**
│  └─ Scheduled Jobs? → **Airflow / Prefect**
│
└─ Edge / Client-side?
   │
   ├─ Mobile? → **TFLite / CoreML**
   └─ Browser? → **TensorFlow.js / ONNX Runtime Web**
```
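
Whichever real-time target you choose, the model usually sits behind a thin HTTP layer. A minimal sketch, assuming a FastAPI app serving an MLflow pyfunc model; the registry URI and feature names are illustrative, and Pydantic v2 is assumed:

```python
# Minimal real-time serving sketch (FastAPI + MLflow pyfunc).
# "models:/churn-model/Production" and the feature names are assumptions.
import mlflow.pyfunc
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = mlflow.pyfunc.load_model("models:/churn-model/Production")

class PredictRequest(BaseModel):  # Pydantic v2 assumed
    tenure: int
    monthly_charges: float

@app.get("/health")
def health():
    # Liveness/readiness probe target for Kubernetes or serverless platforms
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest):
    features = pd.DataFrame([req.model_dump()])
    prediction = model.predict(features)
    return {"prediction": prediction.tolist()}
```

The same container image can back KServe/Seldon, Cloud Run, or a plain Deployment; only the autoscaling layer changes.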

### Training Infrastructure

```
Training Environment?
│
├─ Single Node?
│  │
│  ├─ Interactive? → **JupyterHub / SageMaker Notebooks**
│  └─ Automated? → **Docker Container on VM**
│
└─ Distributed?
   │
   ├─ Data Parallelism? → **Ray Train / PyTorch DDP**
   └─ Pipeline orchestration? → **Kubeflow / Airflow / Vertex AI**
```
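
For the data-parallel branch, a minimal PyTorch DDP sketch, assuming a `torchrun --nproc_per_node=N train_ddp.py` launch and a toy model/loop standing in for the real ones:

```python
# Minimal data-parallel training sketch with PyTorch DDP.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")          # torchrun sets the rendezvous env vars
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 2).cuda(local_rank) # stand-in for a real model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(10):                           # stand-in for a DataLoader + DistributedSampler
        x = torch.randn(32, 128, device=local_rank)
        y = torch.randint(0, 2, (32,), device=local_rank)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()                              # gradients are all-reduced across workers
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```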

### Feature Store Decision

| Need | Recommendation | Rationale |
|------|----------------|-----------|
| **Simple / MVP** | **No Feature Store** | Use SQL/Parquet files; a feature store adds more overhead than value at this stage. |
| **Team Consistency** | **Feast** | Open source, manages online/offline consistency. |
| **Enterprise / Managed** | **Tecton / Hopsworks** | Full governance, lineage, managed SLA. |
| **Cloud Native** | **Vertex/SageMaker FS** | Tight integration if already in that cloud ecosystem. |
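
If Feast is the chosen route, online retrieval at serving time is a single call. A sketch, assuming a hypothetical `customer_stats` feature view keyed by `customer_id`:

```python
# Minimal Feast online lookup sketch; feature view and entity names are assumptions.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a Feast repo with feature_store.yaml

features = store.get_online_features(
    features=[
        "customer_stats:avg_order_value",
        "customer_stats:orders_last_30d",
    ],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
```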

**Red Flags → Escalate to `oracle`:**
- "Real-time" training requirements (online learning) without massive infrastructure budget
- Deploying LLMs (7B+ params) on CPU-only infrastructure
- Training on PII/PHI data without privacy-preserving techniques (Federated Learning, Differential Privacy)
- No validation set or "ground truth" feedback loop mechanism

---

## 3. Core Workflows

### Workflow 1: End-to-End Training Pipeline

**Goal:** Automate model training, validation, and registration using MLflow.

**Steps:**

1.  **Setup Tracking**
    ```python
    import mlflow
    import mlflow.sklearn
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_score

    mlflow.set_tracking_uri("http://localhost:5000")
    mlflow.set_experiment("churn-prediction-prod")
    ```

2.  **Training Script (`train.py`)**
    ```python
    # X_train, X_test, y_train, y_test are assumed to be prepared by the
    # surrounding pipeline (e.g. a preceding data-loading step).
    def train(max_depth, n_estimators):
        with mlflow.start_run():
            # Log params
            mlflow.log_param("max_depth", max_depth)
            mlflow.log_param("n_estimators", n_estimators)
            
            # Train
            model = RandomForestClassifier(
                max_depth=max_depth, 
                n_estimators=n_estimators,
                random_state=42
            )
            model.fit(X_train, y_train)
            
            # Evaluate
            preds = model.predict(X_test)
            acc = accuracy_score(y_test, preds)
            prec = precision_score(y_test, preds)
            
            # Log metrics
            mlflow.log_metric("accuracy", acc)
            mlflow.log_metric("precision", prec)
            
            # Log model artifact with signature
            from mlflow.models.signature import infer_signature
            signature = infer_signature(X_train, preds)
            
            mlflow.sklearn.log_model(
                model, 
                "model",
                signature=signature,
                registered_model_name="churn-model"
            )
            
            print(f"Run ID: {mlflow.active_run().info.run_id}")
    
    if __name__ == "__main__":
        train(max_depth=5, n_estimators=100)
    ```

3.  **Pipeline Orchestration (Bash/Airflow)**
    ```bash
    #!/bin/bash
    # Run training
    python train.py
    
    # Check if model passed threshold (e.g. via MLflow API)
    # If yes, transition to Staging
    ```
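
The threshold check hinted at in the comments could be a small Python gate. A sketch: the 0.85 cutoff is an example value, and stage transitions are the classic Model Registry flow (newer MLflow versions also offer aliases):

```python
# Hypothetical gate: promote the newest registered version to Staging
# only if its logged accuracy clears a threshold.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://localhost:5000")
MODEL_NAME = "churn-model"

latest = client.get_latest_versions(MODEL_NAME, stages=["None"])[0]
run = client.get_run(latest.run_id)

if run.data.metrics.get("accuracy", 0.0) >= 0.85:  # example threshold
    client.transition_model_version_stage(
        name=MODEL_NAME,
        version=latest.version,
        stage="Staging",
    )
```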

---

### Workflow 3: Drift Detection (Monitoring)

**Goal:** Detect if production data distribution has shifted from training data.

**Steps:**

1.  **Baseline Generation (During Training)**
    ```python
    import evidently
    from evidently.report import Report
    from evidently.metric_preset import DataDriftPreset

    # Calculate baseline profile on training data
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=train_df, current_data=test_df)
    report.save_json("baseline_drift.json")
    ```

2.  **Production Monitoring Job**
    ```python
    # Scheduled daily job
    def check_drift():
        # Load production logs (last 24h)
        current_data = load_production_logs()
        reference_data = load_training_data()
        
        report = Report(metrics=[DataDriftPreset()])
        report.run(reference_data=reference_data, current_data=current_data)
        
        result = report.as_dict()
        dataset_drift = result['metrics'][0]['result']['dataset_drift']
        
        if dataset_drift:
            trigger_alert("Data Drift Detected!")
            trigger_retraining()
    ```
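
One way to run this on a schedule (Airflow is an assumption here; Prefect or plain cron work just as well):

```python
# Daily Airflow DAG that calls the check_drift() function defined above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id="drift_monitoring",
    schedule="@daily",          # Airflow 2.4+; use schedule_interval on older versions
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    PythonOperator(task_id="check_drift", python_callable=check_drift)
```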

---

### Workflow 5: RAG Pipeline with Vector Database

**Goal:** Build a production retrieval pipeline using Pinecone/Weaviate and LangChain.

**Steps:**

1.  **Ingestion (Chunking & Embedding)**
    ```python
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_openai import OpenAIEmbeddings
    from langchain_pinecone import PineconeVectorStore
    
    # Chunking
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    docs = text_splitter.split_documents(raw_documents)
    
    # Embedding & Indexing
    embeddings = OpenAIEmbeddings()
    vectorstore = PineconeVectorStore.from_documents(
        docs, 
        embeddings, 
        index_name="knowledge-base"
    )
    ```

2.  **Retrieval & Generation**
    ```python
    from langchain.chains import RetrievalQA
    from langchain_openai import ChatOpenAI
    
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": 5})
    )
    
    response = qa_chain.invoke("How do I reset my password?")
    print(response['result'])
    ```

3.  **Optimization (Hybrid Search)**
    -   Combine **Dense Retrieval** (Vectors) with **Sparse Retrieval** (BM25/Keywords).
-   Use **Reranking** (Cohere/Cross-Encoder) on the top 20 results to select the best 5 (see the sketch below).
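
A sketch of the over-fetch-then-rerank step with a cross-encoder; the model name is an example, and `vectorstore` is the index built in step 1:

```python
# Rerank the top-20 vector hits with a cross-encoder, keep the best 5.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

query = "How do I reset my password?"
candidates = vectorstore.similarity_search(query, k=20)          # over-fetch

scores = reranker.predict([(query, doc.page_content) for doc in candidates])
top_docs = [
    doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)[:5]
]
```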

---

## 5. Anti-Patterns & Gotchas

### ❌ Anti-Pattern 1: Training-Serving Skew

**What it looks like:**
-   Feature logic implemented in SQL for training, but re-implemented in Java/Python for serving.
-   "Mean imputation" value calculated on training set but not saved; serving uses a different default.

**Why it fails:**
-   Model behaves unpredictably in production.
-   Debugging is extremely difficult.

**Correct approach:**
-   Use a **Feature Store** or shared library for transformations.
-   Wrap preprocessing logic **inside** the model artifact (e.g., a Scikit-Learn Pipeline or TensorFlow Transform), as sketched below.
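
A sketch of the second option: fit imputation and scaling inside a Scikit-Learn `Pipeline` so the logged artifact carries its own preprocessing. Column names are illustrative, and `X_train`/`y_train` are the training split from Workflow 1:

```python
# Preprocessing is fitted and serialized together with the model,
# so serving reuses the exact same transformations.
import mlflow
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = ["tenure", "monthly_charges"]      # illustrative columns
categorical = ["plan_type"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipeline = Pipeline([("preprocess", preprocess),
                     ("model", RandomForestClassifier(random_state=42))])

with mlflow.start_run():
    pipeline.fit(X_train, y_train)               # raw features in, no manual prep
    mlflow.sklearn.log_model(pipeline, "model")  # imputer/scaler ship with the model
```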

### ❌ Anti-Pattern 2: Manual Deployments

**What it looks like:**
-   A data scientist emails a `.pkl` file to an engineer.
-   The engineer manually copies it to a server and restarts the Flask app.

**Why it fails:**
-   No version control.
-   No reproducibility.
-   High risk of human error.

**Correct approach:**
-   **CI/CD Pipeline:** Git push triggers build → test → deploy.
-   **Model Registry:** Deploy specific version hash from registry.

### ❌ Anti-Pattern 3: Silent Failures

**What it looks like:**
-   Model API returns `200 OK` but prediction is garbage because input data was corrupted (e.g., all Nulls).
-   Model returns default class `0` for everything.

**Why it fails:**
-   Application keeps running, but business value is lost.
-   Incident detected weeks later by business stakeholders.

**Correct approach:**
-   **Input Schema Validation:** Reject bad requests (Pydantic/TFX).
-   **Output Monitoring:** Alert if the prediction distribution shifts (e.g., if the model predicts "Fraud" 0% of the time for an hour). See the combined sketch below.
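
A combined sketch of both guards. Field names, bounds, and the 1000-request window are assumptions; `model` and `trigger_alert` stand in for the real serving objects:

```python
# Reject bad inputs and flag degenerate outputs.
from collections import deque
from pydantic import BaseModel, Field, ValidationError

class PredictRequest(BaseModel):
    tenure: int = Field(ge=0, le=600)        # months; bounds are examples
    monthly_charges: float = Field(gt=0)

recent_preds = deque(maxlen=1000)

def safe_predict(payload: dict):
    try:
        req = PredictRequest(**payload)      # raises ValidationError on nulls / bad types
    except ValidationError as exc:
        raise ValueError(f"Invalid input rejected: {exc}") from exc

    pred = model.predict([[req.tenure, req.monthly_charges]])[0]
    recent_preds.append(pred)

    # Output guard: alert if the model stops predicting the positive class entirely
    if len(recent_preds) == recent_preds.maxlen and sum(recent_preds) == 0:
        trigger_alert("Model predicted class 0 for the last 1000 requests")
    return pred
```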

---

## 7. Quality Checklist

**Reliability:**
-   [ ] **Health Checks:** `/health` endpoint implemented (liveness/readiness).
-   [ ] **Retries:** Client has retry logic with exponential backoff.
-   [ ] **Fallback:** A default heuristic exists if the model fails or times out (see the sketch below).
-   [ ] **Validation:** Inputs validated against schema before inference.
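
A sketch of the retry and fallback items together, on the client side. The endpoint URL, timeout, and fallback heuristic are assumptions:

```python
# Client-side retries with exponential backoff plus a heuristic fallback.
import time
import requests

def get_prediction(payload: dict, retries: int = 3) -> float:
    for attempt in range(retries):
        try:
            resp = requests.post("http://model-service/predict", json=payload, timeout=0.5)
            resp.raise_for_status()
            return resp.json()["prediction"]
        except requests.RequestException:
            time.sleep(0.1 * 2 ** attempt)   # 0.1s, 0.2s, 0.4s
    # Fallback: a simple heuristic so the caller still gets an answer
    return 1.0 if payload.get("monthly_charges", 0) > 80 else 0.0
```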

**Performance:**
-   [ ] **Latency:** P99 latency meets SLA (e.g., < 100ms).
-   [ ] **Throughput:** System autoscales with load.
-   [ ] **Batching:** Inference requests batched if using GPU.
-   [ ] **Image Size:** Docker image optimized (slim base, multi-stage build).

**Reproducibility:**
-   [ ] **Versioning:** Code, Data, and Model versions linked.
-   [ ] **Artifacts:** Saved in object storage (S3/GCS), not local disk.
-   [ ] **Environment:** Dependencies pinned (`requirements.txt` / `conda.yaml`).

**Monitoring:**
-   [ ] **Technical:** Latency, Error Rate, CPU/Memory/GPU usage.
-   [ ] **Functional:** Prediction distribution, Input data drift.
-   [ ] **Business:** (If possible) Attribution of prediction to outcome.

## Anti-Patterns

### Training-Serving Skew

- **Problem**: Feature logic differs between training and serving environments
- **Symptoms**: Model performs well in testing but poorly in production
- **Solution**: Use feature stores or embed preprocessing in model artifacts
- **Warning Signs**: Different code paths for feature computation, hardcoded constants

### Manual Deployment

- **Problem**: Deploying models without automation or version control
- **Symptoms**: No traceability, human errors, deployment failures
- **Solution**: Implement CI/CD pipelines with model registry integration
- **Warning Signs**: Email/file transfers of model files, manual server restarts

### Silent Failures

- **Problem**: Model failures go undetected
- **Symptoms**: Bad predictions returned without error indication
- **Solution**: Implement input validation, output monitoring, and alerting
- **Warning Signs**: 200 OK responses with garbage data, no anomaly detection

### Data Leakage

- **Problem**: Training data contains information not available at prediction time
- **Symptoms**: Unrealistically high training accuracy, poor generalization
- **Solution**: Careful feature engineering and time-aware validation splits (see the sketch below)
- **Warning Signs**: Features that would only be known after the prediction is made
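
A sketch of one common guard against leakage, a strictly time-based split. The dataset path and column names are assumptions:

```python
# Train on events strictly before the cutoff, evaluate on events after it.
import pandas as pd

df = pd.read_parquet("events.parquet")      # assumed dataset with a datetime "event_date" column
df = df.sort_values("event_date")

cutoff = pd.Timestamp("2024-01-01")
train_df = df[df["event_date"] < cutoff]
test_df = df[df["event_date"] >= cutoff]

# Features must be computable from information available *before* event_date,
# e.g. rolling aggregates over prior rows, never post-outcome fields.
```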

## Overview

This skill provides MLOps and production ML engineering expertise for building scalable, reliable machine learning systems. It covers end-to-end pipelines from data ingestion and training to deployment, monitoring, and scaling. The focus is practical: reproducible artifacts, automated CI/CD, and robust production guarantees.

## How this skill works

The skill inspects system requirements and recommends strategies for serving, training, feature storage, and monitoring. It encodes decision frameworks (real-time vs batch vs edge), training infrastructure choices (single-node vs distributed), and feature store trade-offs. It also supplies concrete workflows for training pipelines, drift detection, RAG ingestion, and quality checklists to operationalize models.

## When to use it

- Building an end-to-end ML pipeline (ingest → train → validate → deploy).
- Deploying models to production (real-time APIs, batch jobs, or edge devices).
- Implementing MLOps practices: CI/CD, experiment tracking, and model registry.
- Setting up monitoring for drift, performance, and business metrics.
- Scaling training workloads or selecting distributed training and orchestration tools.
- Designing retrieval pipelines or vector search for knowledge-centered applications.

## Best practices

- Embed preprocessing inside model artifacts or use a feature store to avoid training-serving skew.
- Automate deployments: CI/CD pipelines with model registry and versioned artifacts.
- Implement input schema validation and output monitoring to catch silent failures early.
- Define clear SLAs and test for latency, throughput, and autoscaling behavior.
- Store artifacts and metadata in object storage and tracking systems for reproducibility.

## Example use cases

- Churn prediction pipeline: track experiments with MLflow, register models, and promote to staging automatically.
- Production drift monitoring: schedule daily checks with Evidently and trigger retraining on detected drift.
- Low-latency serving: choose Kubernetes (KServe/Seldon) for high throughput or serverless for moderate traffic.
- RAG knowledge base: chunk, embed, and index documents into Pinecone/Weaviate and build a retrieval + LLM QA chain.
- Distributed training: use Ray Train or PyTorch DDP with Kubeflow/Airflow orchestration for large datasets.

## FAQ

**Which serving option should I use for ultra-low-latency inference?**

Use a specialized inference server (Triton) or a C++/Rust implementation, and keep the model and its GPUs co-located with the calling service to meet sub-10 ms SLAs.

**When should I adopt a feature store?**

Skip it for simple MVPs. Adopt Feast for team consistency, and managed solutions (Tecton/Hopsworks) for enterprise governance and SLAs.

**How do I avoid silent failures in production?**

Validate inputs (Pydantic/TFX), monitor prediction distributions, set alerts, and provide fallbacks or heuristics when the model fails.