machine-learning-engineer skill

/crates/claude-agent-sdk/examples/.claude/skills/machine-learning-engineer

This skill helps you develop, train, deploy, and monitor ML models and MLOps pipelines from data ingestion to serving.

npx playbooks add skill louloulin/claude-agent-sdk --skill machine-learning-engineer

---
name: machine-learning-engineer
description: "Machine learning model development, training, deployment and MLOps expert"
version: "1.3.0"
author: "ML Team <[email protected]>"
tags:
  - machine-learning
  - mlops
  - ai
  - data-science
dependencies:
  - performance-optimizer
  - docker-helper
---

# Machine Learning Engineer Skill

You are an ML/MLOps engineer. Help with machine learning model development, training, and deployment.

## MLOps Architecture

### Components
```
Data Ingestion → Feature Store → Model Training → Model Registry → Model Serving → Monitoring
```

### Technology Stack
```yaml
Data Processing:
  - Pandas, Polars (Python)
  - Apache Spark
  - Dask, Ray

Model Training:
  - TensorFlow, PyTorch
  - Scikit-learn
  - XGBoost, LightGBM

Model Serving:
  - TensorFlow Serving
  - TorchServe
  - KServe
  - Sagemaker

Experiment Tracking:
  - MLflow
  - Weights & Biases
  - Neptune.ai

Feature Store:
  - Feast
  - Tecton
  - Hopsworks
```

## Data Processing

### Data Validation
```python
import pandas as pd
import great_expectations as ge

# Load data
df = pd.read_csv("data.csv")

# Wrap the DataFrame so expectations can be declared on it
# (legacy `from_pandas` API from older Great Expectations releases)
ge_df = ge.from_pandas(df)

# Define validation rules
ge_df.expect_column_values_to_be_between(
    column="age",
    min_value=0,
    max_value=120
)

ge_df.expect_column_values_to_not_be_null("email")

# Validate
validation_result = ge_df.validate()

if not validation_result.success:
    print("Data validation failed!")
    for result in validation_result.results:
        if not result.success:
            print(f"  {result}")
```

### Feature Engineering
```python
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Numeric features
numeric_features = ["age", "income"]
numeric_transformer = StandardScaler()

# Categorical features
categorical_features = ["city", "gender"]
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

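# Apply transformations (fit on the training split only; see "Data Leakage" below)
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)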
```

### Data Splitting
```python
from sklearn.model_selection import train_test_split, StratifiedKFold

# Train/validation/test split (70/15/15)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

# Cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
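for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    # Positional indexing for NumPy arrays; use X.iloc[train_idx] for DataFrames
    X_train_fold, y_train_fold = X[train_idx], y[train_idx]
    X_val_fold, y_val_fold = X[val_idx], y[val_idx]
    # Train model...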
```

## Model Development

### Experiment Tracking with MLflow
```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

# Start MLflow run
with mlflow.start_run():
    # Set tags and description
    mlflow.set_tag("model_type", "random_forest")
    mlflow.set_tag("team", "data_science")

    # Log parameters
    n_estimators = 100
    max_depth = 10
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)

    # Train model
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        random_state=42
    )
    model.fit(X_train, y_train)

    # Log metrics
    train_score = model.score(X_train, y_train)
    val_score = model.score(X_val, y_val)
    mlflow.log_metric("train_accuracy", train_score)
    mlflow.log_metric("val_accuracy", val_score)

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts
    mlflow.log_artifact("preprocessor.pkl")
    mlflow.log_artifact("feature_importance.png")
```

### Hyperparameter Tuning
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint

# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid search
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.4f}")

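# Or randomized search (fewer fits: n_iter sampled candidates instead of the full grid)
# Note: scipy's randint excludes the upper bound, so randint(50, 200) samples 50..199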
param_dist = {
    'n_estimators': randint(50, 200),
    'max_depth': randint(5, 20),
    'min_samples_split': randint(2, 10),
    'min_samples_leaf': randint(1, 4)
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,
    cv=5,
    random_state=42
)

random_search.fit(X_train, y_train)
```
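
To make the "more efficient" remark concrete, the sketch below counts how many model fits each strategy performs for the settings used in this section. It uses only the standard library; the names `cv_folds`, `grid_fits`, and `random_fits` are illustrative, and the counts ignore the final refit on the full training set.

```python
from itertools import product

# Same grid as in the GridSearchCV example
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, 15, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5
n_iter = 50  # candidates sampled by the RandomizedSearchCV example

# Grid search evaluates every combination on every fold
grid_candidates = len(list(product(*param_grid.values())))
grid_fits = grid_candidates * cv_folds

# Randomized search evaluates a fixed number of sampled candidates
random_fits = n_iter * cv_folds

print(f"Grid search:       {grid_candidates} candidates, {grid_fits} fits")
print(f"Randomized search: {n_iter} candidates, {random_fits} fits")
```

The gap widens quickly as parameters are added, because the grid grows multiplicatively while `n_iter` stays fixed.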

### Model Evaluation
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report
)

# Predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
print(f"ROC AUC:   {roc_auc:.4f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)

# Classification report
print(classification_report(y_test, y_pred))
```
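
As a worked example of how these metrics relate to the confusion matrix, the sketch below derives them by hand from the four cell counts. The counts are hypothetical and `binary_metrics` is an illustrative helper, not part of scikit-learn; the layout `[[tn, fp], [fn, tp]]` follows the binary output of `confusion_matrix`.

```python
def binary_metrics(tn, fp, fn, tp):
    """Derive accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: rows are true classes, columns are predicted classes
cm = [[50, 10], [5, 35]]
(tn, fp), (fn, tp) = cm

metrics = binary_metrics(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name:<10}{value:.4f}")
```

Here precision is 35 / 45 and recall is 35 / 40, which shows why the two can diverge even when accuracy looks healthy.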

## Model Deployment

### Model Serving with TensorFlow Serving
```python
# Export model
import tensorflow as tf

model = ...  # Your trained model

# Save as SavedModel format
tf.saved_model.save(model, "models/my_model/1")

# Start TensorFlow Serving (MODEL_NAME must match the directory under /models)
# docker run -t --rm -p 8501:8501 \
#   -v "$(pwd)/models:/models" \
#   -e MODEL_NAME=my_model \
#   tensorflow/serving &

# Make predictions
import requests

data = {"instances": [[1.0, 2.0, 3.0, 4.0]]}
response = requests.post(
    "http://localhost:8501/v1/models/my_model:predict",
    json=data,
    timeout=10
)
response.raise_for_status()  # Fail loudly on HTTP errors
predictions = response.json()["predictions"]
```

### TorchServe Deployment
```python
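import torch
from torch import nn

# TorchServe is driven from the command line, so nothing is imported from it here

# Wrap the trained network (load_model() stands in for your own loading code)
class ModelHandler(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = load_model()

    def forward(self, x):
        return self.model(x)

# Save model weights
model = ModelHandler()
torch.save(model.state_dict(), "model.pth")

# Package the weights into a model archive (create model_store/ first;
# model.py defines the model class, handler.py the request handler)
# torch-model-archiver --model-name mnist --version 1.0 \
#   --model-file model.py --serialized-file model.pth \
#   --handler handler.py --export-path model_store

# Start TorchServe
# torchserve --start --ncs --model-store model_store --models mnist=mnist.mar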
```

### Kubernetes Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: model-server
        image: your-registry/ml-model:latest
        ports:
        - containerPort: 8501
        env:
        - name: MODEL_NAME
          value: "my_model"
        - name: MODEL_VERSION
          value: "1.0"
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        livenessProbe:
          httpGet:
            path: /health
            port: 8501
        readinessProbe:
          httpGet:
            path: /ready
            port: 8501
---
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - protocol: TCP
    port: 8501
    targetPort: 8501
  type: LoadBalancer
```

## Monitoring

### Model Performance Monitoring
```python
# Custom metrics for monitoring
import prometheus_client as prom

# Define metrics
prediction_latency = prom.Histogram(
    'model_prediction_latency_seconds',
    'Model prediction latency'
)

model_accuracy = prom.Gauge(
    'model_accuracy',
    'Current model accuracy'
)

prediction_count = prom.Counter(
    'predictions_total',
    'Total number of predictions'
)

# Record metrics
import time

def predict_with_monitoring(features):
    start_time = time.time()

    # Make prediction
    prediction = model.predict(features)

    # Record metrics
    latency = time.time() - start_time
    prediction_latency.observe(latency)
    prediction_count.inc()

    return prediction

# Expose metrics
from prometheus_client import start_http_server
start_http_server(8000)
```
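
The `model_accuracy` gauge needs a value to report, and in production that value usually arrives late, once ground-truth labels are known. A minimal sketch of one way to compute it, assuming a sliding window over the most recent labelled predictions; `RollingAccuracy` and its window size are hypothetical and not part of `prometheus_client`.

```python
from collections import deque

class RollingAccuracy:
    """Accuracy over the most recent `window` labelled predictions."""

    def __init__(self, window=1000):
        # 1 for a correct prediction, 0 for an incorrect one;
        # the deque drops the oldest entry once the window is full
        self._hits = deque(maxlen=window)

    def update(self, prediction, label):
        self._hits.append(1 if prediction == label else 0)

    def value(self):
        # None until at least one labelled prediction has arrived
        if not self._hits:
            return None
        return sum(self._hits) / len(self._hits)

tracker = RollingAccuracy(window=3)
for pred, label in [(1, 1), (0, 1), (1, 1), (0, 0)]:
    tracker.update(pred, label)

current = tracker.value()  # computed from the last 3 pairs only
print(f"Rolling accuracy: {current:.4f}")
```

In the serving process the result would then be published with `model_accuracy.set(current)` whenever a new label is recorded.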

### Data Drift Detection
```python
from alibi_detect.cd import TabularDrift
import numpy as np

# Reference data (training data)
X_ref = X_train

# New data (current data)
X_new = get_current_data()

# Detect drift: Chi-squared test for the categorical columns listed in
# categories_per_feature, Kolmogorov-Smirnov test for the remaining numeric ones
drift_detector = TabularDrift(
    X_ref,
    p_val=0.05,
    categories_per_feature={0: None, 1: None}
)

drift_result = drift_detector.predict(X_new)

if drift_result['data']['is_drift']:
    print("⚠️  Data drift detected!")
    print(f"Drift distance: {drift_result['data']['distance']}")
    # Trigger retraining pipeline
else:
    print("✅ No data drift detected")
```
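
For a single numeric feature, a two-sample Kolmogorov-Smirnov test is a common choice for this kind of check. The sketch below is a standard-library illustration of the idea, not the detector's implementation: it computes the K-S statistic and compares it with the large-sample critical value at a 0.05 significance level (coefficient 1.358). The function names and sample data are hypothetical.

```python
import math
from bisect import bisect_right

def ks_statistic(reference, current):
    """Largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    d = 0.0
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(cdf_ref - cdf_cur))
    return d

def ks_drift(reference, current, coefficient=1.358):
    """Flag drift when the statistic exceeds the large-sample critical value.

    coefficient=1.358 corresponds to a significance level of 0.05.
    """
    n, m = len(reference), len(current)
    critical = coefficient * math.sqrt((n + m) / (n * m))
    d = ks_statistic(reference, current)
    return d > critical, d, critical

reference = list(range(100))             # training-time feature values
shifted = [x + 50 for x in reference]    # same shape, shifted upwards

drifted, d, critical = ks_drift(reference, shifted)
print(f"drift={drifted} statistic={d:.3f} critical={critical:.3f}")
```

The critical-value approximation is only reasonable for moderately large samples; for small batches an exact p-value from a statistics library is the safer option.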

## Best Practices

### Experiment Management
```
✅ DO:
  - Track all experiments
  - Version control data
  - Log parameters and metrics
  - Save model artifacts
  - Document experiments
  - Use reproducible seeds

❌ DON'T:
  - Run untracked experiments
  - Forget data versions
  - Skip documentation
  - Lose trained models
  - Ignore failed experiments
```

### Model Versioning
```
Version format: v{major}.{minor}.{patch}
  - Major: Architecture changes
  - Minor: Feature additions
  - Patch: Bug fixes

Example: v2.1.3
  - v2: New architecture
  - v2.1: Added feature
  - v2.1.3: Bug fix
```
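
The version scheme can be automated with a small helper. This is a sketch under stated assumptions: the change-type names (`architecture`, `feature`, `fix`) are invented here to mirror the three levels, and resetting the lower components on a bump follows the usual semantic-versioning convention rather than anything the scheme itself spells out.

```python
import re

VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")

def parse_version(tag):
    """Split 'v2.1.3' into the tuple (2, 1, 3)."""
    match = VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"invalid version tag: {tag!r}")
    return tuple(int(part) for part in match.groups())

def bump(tag, change):
    """Return the next tag for an 'architecture', 'feature' or 'fix' change."""
    major, minor, patch = parse_version(tag)
    if change == "architecture":
        return f"v{major + 1}.0.0"
    if change == "feature":
        return f"v{major}.{minor + 1}.0"
    if change == "fix":
        return f"v{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change!r}")

print(bump("v2.1.3", "fix"))           # v2.1.4
print(bump("v2.1.3", "feature"))       # v2.2.0
print(bump("v2.1.3", "architecture"))  # v3.0.0
```

Parsing into integer tuples also gives correct ordering, so `parse_version("v2.10.0") > parse_version("v2.9.9")` holds where a plain string comparison would not.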

### A/B Testing Models
```python
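import numpy as np
from scipy import stats

# Score both models on the same held-out data
model_a_predictions = model_a.predict(X_test)
model_b_predictions = model_b.predict(X_test)

# Compare per-sample correctness, not the raw predictions
correct_a = (model_a_predictions == np.asarray(y_test)).astype(float)
correct_b = (model_b_predictions == np.asarray(y_test)).astype(float)

# Paired test: both models were evaluated on the same samples
t_stat, p_value = stats.ttest_rel(correct_a, correct_b)

if p_value < 0.05:
    print("Significant difference found!")
    # Choose the model with the higher mean accuracy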
```
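
When both classifiers are scored on the same test set, McNemar's test is a common way to compare them, because it looks only at the samples where the two models disagree. The following is a minimal standard-library sketch of the exact (binomial) form; `mcnemar_exact` and the toy labels are hypothetical.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (a_only, b_only, p_value): a_only counts samples that only
    model A classified correctly, b_only those only model B got right.
    """
    a_only = b_only = 0
    for truth, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == truth), (b == truth)
        if a_ok and not b_ok:
            a_only += 1
        elif b_ok and not a_ok:
            b_only += 1
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the models never disagree
    k = min(a_only, b_only)
    # Under the null hypothesis each disagreement favours A or B with p = 0.5
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)

y_true = [1] * 10
pred_a = [1] * 10            # model A is right on every sample
pred_b = [0] * 8 + [1] * 2   # model B misses the first eight

a_only, b_only, p_value = mcnemar_exact(y_true, pred_a, pred_b)
print(f"A only: {a_only}, B only: {b_only}, p-value: {p_value:.4f}")
```

For live A/B traffic, where each model sees different requests, an unpaired comparison of the business metric is the appropriate tool instead.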

## Common Pitfalls

### Data Leakage
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ❌ WRONG: Preprocessing before split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Leaks test data info!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# ✅ CORRECT: Split first, then process
X_train, X_test, y_train, y_test = train_test_split(X, y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on train only
X_test_scaled = scaler.transform(X_test)  # Reuse the train-fitted scaler
```

### Overfitting
```python
# Signs of overfitting:
# - High train accuracy, low test accuracy
# - Large gap between train and validation performance

# Solutions:
# 1. Get more training data
# 2. Use regularization
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    max_depth=10,          # Limit depth
    min_samples_leaf=5,    # Require more samples per leaf
    max_features='sqrt',   # Consider fewer features per split
)
# 3. Cross-validation
# 4. Early stopping (for neural networks)
# 5. Dropout (for neural networks)
```
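
Early stopping is listed without an example, so here is a framework-independent sketch of the usual patience rule: stop once the validation loss has not improved for a set number of consecutive epochs and keep the weights from the best epoch. The function name, the `patience` and `min_delta` parameters, and the loss values are all illustrative.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through all recorded epochs
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses, one per epoch (0-indexed)
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(f"Stop at epoch {stop_epoch}, restore weights from epoch {best_epoch}")
```

Deep learning frameworks ship their own callbacks for this; the sketch only shows the decision rule they apply.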

### Class Imbalance
```python
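from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Handle imbalanced data (float sampling strategies apply to binary targets)
over = SMOTE(sampling_strategy=0.5)                # Minority grows to 50% of majority
under = RandomUnderSampler(sampling_strategy=0.8)  # Majority shrinks until the ratio is 0.8

pipeline = Pipeline([
    ('oversample', over),
    ('undersample', under),
    ('model', RandomForestClassifier())
])

# Resampling runs only during fit; predict() sees the data unchanged
pipeline.fit(X_train, y_train)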
```

## Tools & Resources

### Frameworks
- **TensorFlow**: Deep learning framework
- **PyTorch**: Research-focused deep learning
- **Scikit-learn**: Traditional ML algorithms
- **XGBoost**: Gradient boosting

### MLOps Platforms
- **MLflow**: Experiment tracking
- **Kubeflow**: Kubernetes ML pipelines
- **Airflow**: Workflow orchestration
- **Prefect**: Modern workflow orchestration

### Documentation
- [MLOps Guide](https://mlflow.org/docs/latest/mlops.html)
- [Scikit-learn Guide](https://scikit-learn.org/stable/user_guide.html)
- [TensorFlow Tutorials](https://www.tensorflow.org/tutorials)

Overview

This skill provides hands-on machine learning engineering and MLOps guidance for model development, training, deployment, and monitoring. I focus on reproducible pipelines, experiment tracking, model versioning, and production-grade serving. The guidance covers data validation, feature engineering, tuning, deployment patterns, and monitoring best practices.

How this skill works

I inspect your end-to-end workflow and recommend concrete steps for each stage: data ingestion and validation, feature engineering and splitting, model training with experiment tracking, hyperparameter tuning, and model serving. I map common tools and deployment patterns to your constraints and produce actionable code-level suggestions, CI/CD steps, and monitoring checks. I prioritize reproducibility, automated testing, and rollback-safe releases.

When to use it

- Building a reliable training pipeline from data ingestion to model registry
- Preparing models for production serving with scalable deployment patterns
- Designing experiment tracking, hyperparameter tuning, and reproducibility
- Implementing monitoring, drift detection, and alerting for live models
- Addressing data leakage, class imbalance, and overfitting in workflows

Best practices

- Split data before any preprocessing to avoid leakage; fit transforms on train only
- Track experiments, parameters, metrics, and artifacts with a consistent tool (MLflow, W&B)
- Version models semantically (major.minor.patch) and register them in a model registry
- Automate training and deployment with CI/CD and run tests for model behavior
- Collect and expose metrics (latency, accuracy, data drift) and set alerts

Example use cases

- Create a reproducible training pipeline with data validation, preprocessing, CV, and model artifacts logged
- Deploy a model to Kubernetes with health/readiness checks, autoscaling, and resource limits
- Set up online monitoring: Prometheus metrics for latency, counters for requests, and accuracy gauges
- Implement automated retraining when drift is detected using a scheduled pipeline
- Run A/B tests for two model versions and use statistical tests to decide promotion

FAQ

How do I prevent data leakage?

Always split data into train/val/test before fitting preprocessors; fit scalers and encoders on training data only and apply transforms to validation and test sets.

Which experiment tracking tool should I pick?

Choose one that integrates with your stack; MLflow is flexible and self-hostable, W&B is feature-rich for collaboration, and both support logging params, metrics, and artifacts.