
machine-learning skill

/steve/skills/machine-learning

This skill guides you through building end-to-end machine learning pipelines, from feature engineering to production deployment, to improve model quality and reliability.

npx playbooks add skill 89jobrien/steve --skill machine-learning

---
name: machine-learning
description: Machine learning development patterns, model training, evaluation, and
  deployment. Use when building ML pipelines, training models, feature engineering,
  model evaluation, or deploying ML systems to production.
author: Joseph OBrien
status: unpublished
updated: '2025-12-23'
version: 1.0.1
tag: skill
type: skill
---

# Machine Learning

Comprehensive machine learning skill covering the full ML lifecycle from experimentation to production deployment.

## When to Use This Skill

- Building machine learning pipelines
- Feature engineering and data preprocessing
- Model training, evaluation, and selection
- Hyperparameter tuning and optimization
- Model deployment and serving
- ML experiment tracking and versioning
- Production ML monitoring and maintenance

## ML Development Lifecycle

### 1. Problem Definition

**Classification Types:**

- Binary classification (spam/not spam)
- Multi-class classification (image categories)
- Multi-label classification (document tags)
- Regression (price prediction)
- Clustering (customer segmentation)
- Ranking (search results)
- Anomaly detection (fraud detection)

**Success Metrics by Problem Type:**

| Problem Type | Primary Metrics | Secondary Metrics |
|--------------|-----------------|-------------------|
| Binary Classification | AUC-ROC, F1 | Precision, Recall, PR-AUC |
| Multi-class | Macro F1, Accuracy | Per-class metrics |
| Regression | RMSE, MAE | R², MAPE |
| Ranking | NDCG, MAP | MRR |
| Clustering | Silhouette, Calinski-Harabasz | Davies-Bouldin |
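A minimal sketch (stdlib only) of how the binary-classification metrics above relate: precision, recall, and F1 all derive from the same confusion-matrix counts.

```python
# Minimal sketch: precision, recall, and F1 from raw binary predictions.
def binary_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
# One false positive and one false negative: precision = recall = f1 = 2/3
```

In practice you would use a metrics library, but keeping the arithmetic visible makes the precision/recall trade-off concrete.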

### 2. Data Preparation

**Data Quality Checks:**

- Missing value analysis and imputation strategies
- Outlier detection and handling
- Data type validation
- Distribution analysis
- Target leakage detection
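As a sketch of the first check above, a per-column missing-value count for records stored as dicts (stdlib only; the column names are illustrative):

```python
from collections import Counter

# Sketch: count missing (None) values per column across a list of records.
def missing_counts(rows):
    counts = Counter()
    for row in rows:
        for col, val in row.items():
            if val is None:
                counts[col] += 1
    return dict(counts)

rows = [{"age": 34, "city": None},
        {"age": None, "city": "Oslo"},
        {"age": 51, "city": None}]
# missing_counts(rows) -> {"city": 2, "age": 1}
```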

**Feature Engineering Patterns:**

- Numerical: scaling, binning, log transforms, polynomial features
- Categorical: one-hot, target encoding, frequency encoding, embeddings
- Temporal: lag features, rolling statistics, cyclical encoding
- Text: TF-IDF, word embeddings, transformer embeddings
- Geospatial: distance features, clustering, grid encoding
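Cyclical encoding from the temporal bullet above can be sketched with stdlib `math`: mapping hour-of-day onto a circle makes 23:00 and 00:00 neighbors, which plain integer encoding does not.

```python
import math

# Sketch: cyclical (sin/cos) encoding of hour-of-day.
def encode_hour(hour, period=24):
    angle = 2 * math.pi * hour / period
    return math.sin(angle), math.cos(angle)

s0, c0 = encode_hour(0)     # midnight -> (0.0, 1.0)
s23, c23 = encode_hour(23)  # 23:00 lands close to midnight on the circle
```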

**Train/Test Split Strategies:**

- Random split (standard)
- Stratified split (imbalanced classes)
- Time-based split (temporal data)
- Group split (prevent data leakage)
- K-fold cross-validation
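The time-based and group splits above can be sketched in a few lines (stdlib only); libraries like scikit-learn provide production-grade versions, but the leakage-prevention logic is simple:

```python
# Sketch: leakage-aware splits without external libraries.
def time_split(rows, frac=0.8):
    """Rows must be sorted by time; train on the past, test on the future."""
    cut = int(len(rows) * frac)
    return rows[:cut], rows[cut:]

def group_split(rows, groups, test_groups):
    """Keep every record of a group on one side of the split."""
    train = [r for r, g in zip(rows, groups) if g not in test_groups]
    test = [r for r, g in zip(rows, groups) if g in test_groups]
    return train, test

tr, te = time_split(list(range(10)))                      # -> first 8 / last 2
gtr, gte = group_split(["a1", "a2", "b1", "c1"],
                       ["A", "A", "B", "C"], {"B"})       # group B held out
```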

### 3. Model Selection

**Algorithm Selection Guide:**

| Data Size / Modality | Problem | Recommended Models |
|-----------|---------|-------------------|
| Small (<10K) | Classification | Logistic Regression, SVM, Random Forest |
| Small (<10K) | Regression | Linear Regression, Ridge, SVR |
| Medium (10K-1M) | Classification | XGBoost, LightGBM, Neural Networks |
| Medium (10K-1M) | Regression | XGBoost, LightGBM, Neural Networks |
| Large (>1M) | Any | Deep Learning, Distributed training |
| Tabular | Any | Gradient Boosting (XGBoost, LightGBM, CatBoost) |
| Images | Classification | CNN, ResNet, EfficientNet, Vision Transformers |
| Text | NLP | Transformers (BERT, RoBERTa, GPT) |
| Sequential | Time Series | LSTM, Transformer, Prophet |

### 4. Model Training

**Hyperparameter Tuning:**

- Grid Search: exhaustive, good for small spaces
- Random Search: efficient, good for large spaces
- Bayesian Optimization: smart exploration (Optuna, Hyperopt)
- Early stopping: prevent overfitting
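The grid-search bullet above can be sketched as an exhaustive sweep (stdlib only). The `score` callback is a stand-in for a cross-validated metric; a real search would train and evaluate a model inside it.

```python
import itertools

# Sketch: exhaustive grid search over a small hyperparameter space.
def grid_search(space, score):
    names = list(space)
    best_params, best_score = None, float("-inf")
    for combo in itertools.product(*(space[n] for n in names)):
        params = dict(zip(names, combo))
        s = score(params)
        if s > best_score:
            best_params, best_score = params, s
    return best_params, best_score

space = {"learning_rate": [0.01, 0.05, 0.1], "max_depth": [3, 5, 7]}
# Toy objective that peaks at learning_rate=0.05, max_depth=5.
best, _ = grid_search(space, lambda p: -abs(p["learning_rate"] - 0.05)
                                       - abs(p["max_depth"] - 5))
# best -> {"learning_rate": 0.05, "max_depth": 5}
```

Random search and Bayesian optimization follow the same loop shape but sample the space instead of enumerating it.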

**Common Hyperparameters:**

| Model | Key Parameters |
|-------|---------------|
| XGBoost | learning_rate, max_depth, n_estimators, subsample |
| LightGBM | num_leaves, learning_rate, n_estimators, feature_fraction |
| Random Forest | n_estimators, max_depth, min_samples_split |
| Neural Networks | learning_rate, batch_size, layers, dropout |

### 5. Model Evaluation

**Evaluation Best Practices:**

- Always use held-out test set for final evaluation
- Use cross-validation during development
- Check for overfitting (train vs validation gap)
- Evaluate on multiple metrics
- Analyze errors qualitatively

**Handling Imbalanced Data:**

- Resampling: SMOTE, undersampling
- Class weights: weighted loss functions
- Threshold tuning: optimize decision threshold
- Evaluation: use PR-AUC over ROC-AUC
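Threshold tuning from the list above can be sketched as a sweep over candidate cutoffs, picking the one that maximizes F1 on validation scores instead of defaulting to 0.5 (stdlib only):

```python
# Sketch: choose the decision threshold that maximizes F1 on validation data.
def best_threshold(y_true, scores, candidates=None):
    candidates = candidates or [i / 100 for i in range(1, 100)]

    def f1_at(t):
        y_pred = [1 if s >= t else 0 for s in scores]
        tp = sum(1 for a, b in zip(y_true, y_pred) if a == b == 1)
        fp = sum(1 for a, b in zip(y_true, y_pred) if a == 0 and b == 1)
        fn = sum(1 for a, b in zip(y_true, y_pred) if a == 1 and b == 0)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

    return max(candidates, key=f1_at)

t = best_threshold([0, 0, 0, 1, 1], [0.1, 0.2, 0.35, 0.4, 0.9])
# First threshold that separates the classes perfectly: 0.36
```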

### 6. Production Deployment

**Model Serving Patterns:**

- REST API (Flask, FastAPI, TF Serving)
- Batch inference (scheduled jobs)
- Streaming (real-time predictions)
- Edge deployment (mobile, IoT)

**Production Considerations:**

- Latency requirements (p50, p95, p99)
- Throughput (requests per second)
- Model size and memory footprint
- Fallback strategies
- A/B testing framework
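The latency requirement above is usually expressed as percentiles over a sample of request timings; a stdlib sketch:

```python
import statistics

# Sketch: p50/p95/p99 latency from a sample of request timings (milliseconds).
def latency_percentiles(samples_ms):
    qs = statistics.quantiles(samples_ms, n=100)  # 99 cut points
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}

lat = latency_percentiles(list(range(1, 101)))  # uniform 1..100 ms sample
```

Tracking p95/p99 rather than the mean surfaces tail latency, which is what users and SLOs actually feel.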

### 7. Monitoring & Maintenance

**What to Monitor:**

- Prediction latency
- Input feature distributions (data drift)
- Prediction distributions (concept drift)
- Model performance metrics
- Error rates and types

**Retraining Triggers:**

- Performance degradation below threshold
- Significant data drift detected
- Scheduled retraining (daily, weekly)
- New training data available
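Drift-triggered retraining can be sketched with the Population Stability Index (PSI) over a baseline and a live feature distribution (stdlib only). The 0.2 cutoff is a common rule of thumb, not a universal constant; tune it per feature.

```python
import math

# Sketch: PSI between baseline and live feature distributions, shared bins.
def psi(expected, actual, bins=10, lo=0.0, hi=1.0, eps=1e-6):
    def histogram(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
            counts[idx] += 1
        total = len(values)
        return [max(c / total, eps) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]              # uniform on [0, 1)
drifted = [min(v + 0.3, 0.999) for v in baseline]     # shifted distribution
# psi(baseline, baseline) ~ 0; psi(baseline, drifted) >> 0.2 -> retrain
```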

## MLOps Best Practices

### Experiment Tracking

Track for every experiment:

- Code version (git commit)
- Data version (hash or version ID)
- Hyperparameters
- Metrics (train, validation, test)
- Model artifacts
- Environment (packages, versions)
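A minimal file-based version of the record above, stdlib only; `git_commit` and the output path are placeholders for whatever your project and tracking tool actually use:

```python
import hashlib
import json
import platform
import sys

# Sketch: write one experiment record capturing the tracked fields.
def log_experiment(path, git_commit, data_bytes, params, metrics):
    record = {
        "code_version": git_commit,
        "data_version": hashlib.sha256(data_bytes).hexdigest()[:12],
        "hyperparameters": params,
        "metrics": metrics,
        "environment": {"python": sys.version.split()[0],
                        "platform": platform.platform()},
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

rec = log_experiment("run_001.json", "abc1234", b"training-data",
                     {"learning_rate": 0.05}, {"val_f1": 0.91})
```

Dedicated trackers (MLflow, W&B, etc.) store the same fields; the point is that every run is reproducible from its record.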

### Model Versioning

```
models/
├── model_v1.0.0/
│   ├── model.pkl
│   ├── metadata.json
│   ├── requirements.txt
│   └── metrics.json
├── model_v1.1.0/
└── model_v2.0.0/
```

### CI/CD for ML

1. **Continuous Integration:**
   - Data validation tests
   - Model training tests
   - Performance regression tests

2. **Continuous Deployment:**
   - Staging environment validation
   - Shadow mode testing
   - Gradual rollout (canary)
   - Automatic rollback
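The canary step above can be sketched as deterministic traffic routing (stdlib only): hashing the user ID keeps each user pinned to one model version while the canary share is ramped up.

```python
import hashlib

# Sketch: deterministic canary routing by stable hash bucket.
def route(user_id, canary_percent):
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"

# Roughly canary_percent% of users land on the canary model.
share = sum(route(f"user-{i}", 10) == "canary" for i in range(1000)) / 1000
```

Using a stable hash (not `random`) matters: a user who sees the canary keeps seeing it, so metrics per cohort stay clean.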

## Reference Files

For detailed patterns and code examples, load reference files as needed:

- **`references/preprocessing.md`** - Data preprocessing patterns and feature engineering techniques
- **`references/model_patterns.md`** - Model architecture patterns and implementation examples
- **`references/evaluation.md`** - Comprehensive evaluation strategies and metrics

## Integration with Other Skills

- **performance** - For optimizing inference latency
- **testing** - For ML-specific testing patterns
- **database-optimization** - For feature store queries
- **debugging** - For model debugging and error analysis

Overview

This skill covers practical machine learning development patterns across the full lifecycle: problem definition, data preparation, model training, evaluation, deployment, and monitoring. It consolidates tactics for feature engineering, algorithm selection, hyperparameter tuning, and production-ready serving. Use it to accelerate reliable ML pipelines and reduce time-to-production for models.

How this skill works

The skill inspects your problem type and recommends metrics, data-splitting strategies, and model families appropriate for data size and modality. It guides preprocessing and feature engineering patterns, outlines hyperparameter tuning approaches, and provides deployment and monitoring patterns for production systems. It also embeds MLOps best practices like experiment tracking, model versioning, and CI/CD steps.

When to use it

  • Designing an ML pipeline from data ingestion to serving
  • Choosing algorithms and evaluation metrics for a new problem
  • Engineering features, handling missing data, and preventing leakage
  • Tuning hyperparameters and validating model generalization
  • Deploying models with latency, throughput, and rollback considerations
  • Setting up monitoring, drift detection, and retraining triggers

Best practices

  • Define problem type and success metrics before modeling
  • Use stratified, group, or time-based splits to avoid leakage
  • Track experiments with code, data, hyperparameters, and artifacts
  • Prefer cross-validation and held-out test sets for final evaluation
  • Automate validation, shadow testing, and gradual rollouts in CI/CD
  • Monitor input distributions, model outputs, and latency in production

Example use cases

  • Build a tabular classification pipeline using LightGBM with target encoding and cross-validation
  • Train an image classifier with transfer learning (ResNet/EfficientNet) and track experiments in an ML registry
  • Deploy a sentiment analysis model as a REST API and implement canary rollout
  • Detect data drift in streaming features and trigger automated retraining
  • Optimize inference latency by profiling and model quantization for edge deployment

FAQ

Which model should I try first for tabular data?

Start with gradient boosting (XGBoost/LightGBM/CatBoost) and a simple baseline like logistic or linear regression to set expectations.

How do I handle imbalanced classes?

Use resampling (SMOTE or undersampling), class weights, and evaluate with PR-AUC or F1; tune decision thresholds rather than relying solely on accuracy.