
This skill helps you design experiments, establish baselines, and iteratively improve ML models, with experiment tracking and evaluation guidance across the full development lifecycle.

npx playbooks add skill doanchienthangdev/omgkit --skill ml-workflow

---
name: ml-workflow
description: ML development workflow covering experiment design, baseline establishment, iterative improvement, and experiment tracking best practices.
---

# ML Workflow

Systematic approach to ML model development.

## Development Lifecycle

```
┌─────────────────────────────────────────────────────────────┐
│                  ML DEVELOPMENT WORKFLOW                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. PROBLEM      2. BASELINE     3. EXPERIMENT              │
│     SETUP           MODEL           ITERATE                  │
│     ↓               ↓               ↓                       │
│  Define metrics  Simple model   Hypothesis                  │
│  Success criteria Benchmark     Test ideas                  │
│  Constraints     Comparison     Track results               │
│                                                              │
│  4. EVALUATE     5. VALIDATE    6. DEPLOY                   │
│     ↓               ↓               ↓                       │
│  Full metrics    Production    Ship to prod                │
│  Error analysis  validation    Monitor                     │
│  Fairness        A/B test      Iterate                     │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```

## Experiment Design

```python
import mlflow
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    hypothesis: str
    metrics: list[str]
    success_criteria: dict[str, float]

experiment = Experiment(
    name="feature_engineering_v2",
    hypothesis="Adding temporal features improves prediction",
    metrics=["accuracy", "f1", "latency_ms"],
    success_criteria={"f1": 0.85, "latency_ms": 50}
)

# Track experiment
mlflow.set_experiment(experiment.name)
with mlflow.start_run():
    mlflow.log_param("hypothesis", experiment.hypothesis)
    # ... training code producing a `results` dict keyed by metric name ...
    mlflow.log_metrics(results)
```
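Once a run produces its metrics, the `success_criteria` dict can be checked programmatically before promoting the run. A minimal sketch; the convention that `_ms` metrics are lower-is-better is an assumption for illustration, not part of the skill:

```python
def meets_criteria(metrics: dict, criteria: dict) -> bool:
    """Return True when every success criterion is satisfied.

    Metric names ending in "_ms" are treated as latencies (lower is
    better); everything else is treated as higher-is-better.
    """
    for name, threshold in criteria.items():
        value = metrics.get(name)
        if value is None:
            return False  # criterion was never measured
        if name.endswith("_ms"):
            if value > threshold:
                return False
        elif value < threshold:
            return False
    return True

results = {"accuracy": 0.91, "f1": 0.87, "latency_ms": 42}
print(meets_criteria(results, {"f1": 0.85, "latency_ms": 50}))  # True
```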

## Baseline Models

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

baselines = {
    "majority": DummyClassifier(strategy="most_frequent"),
    "logistic": LogisticRegression(),
    "random_forest": RandomForestClassifier(n_estimators=100)
}

# Assumes X_train, X_test, y_train, y_test are already prepared
results = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred, average="macro")
    }

# Best baseline
best = max(results.items(), key=lambda x: x[1]["f1"])
print(f"Best baseline: {best[0]} with F1={best[1]['f1']:.3f}")
```
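A candidate's edge over the best baseline is more trustworthy with an uncertainty estimate than with a single point score. A stdlib-only sketch of a paired bootstrap over per-example correctness (the function name and toy labels are illustrative):

```python
import random

def bootstrap_accuracy_diff(y_true, pred_a, pred_b, n_boot=1000, seed=0):
    """Paired bootstrap 95% CI for accuracy(A) - accuracy(B)."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample examples
        acc_a = sum(pred_a[i] == y_true[i] for i in idx) / n
        acc_b = sum(pred_b[i] == y_true[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

y = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]  # candidate predictions
b = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]  # baseline predictions
lo, hi = bootstrap_accuracy_diff(y, a, b)
print(f"95% CI for accuracy gain: [{lo:.2f}, {hi:.2f}]")
```

If the interval includes zero, the improvement over the baseline may be noise and the change should not be promoted on this evidence alone.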

## Experiment Tracking

```python
import mlflow
import mlflow.sklearn

# Start experiment
mlflow.set_tracking_uri("http://mlflow.example.com")
mlflow.set_experiment("churn_prediction")

params = {
    "model_type": "xgboost",
    "max_depth": 6,
    "learning_rate": 0.1,
}

with mlflow.start_run(run_name="xgboost_v3"):
    # Log parameters
    mlflow.log_params(params)

    # Train model (train_model is project-specific)
    model = train_model(X_train, y_train, params)

    # Log metrics
    mlflow.log_metrics({
        "train_accuracy": train_acc,
        "val_accuracy": val_acc,
        "f1_score": f1
    })

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log artifacts
    mlflow.log_artifact("feature_importance.png")
```

## Iterative Improvement

```python
class ExperimentIterator:
    """Runs named experiments and records each run's delta vs. a baseline."""

    def __init__(self, baseline_metrics):
        self.baseline = baseline_metrics
        self.experiments = []
    def run_experiment(self, name, model_fn, hypothesis):
        with mlflow.start_run(run_name=name):
            mlflow.log_param("hypothesis", hypothesis)
            model, metrics = model_fn()
            mlflow.log_metrics(metrics)

            # Only compare metrics the baseline also reports
            improvement = {k: metrics[k] - self.baseline[k]
                           for k in metrics if k in self.baseline}
            mlflow.log_metrics({f"{k}_improvement": v
                              for k, v in improvement.items()})

            self.experiments.append({
                "name": name,
                "hypothesis": hypothesis,
                "metrics": metrics,
                "improvement": improvement
            })

            return model, metrics
```
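The records the iterator accumulates can be summarized without touching MLflow. A small sketch that ranks runs by their improvement on one metric, assuming the same record shape the iterator stores (run names and numbers below are made up):

```python
def summarize(experiments, metric="f1"):
    """Rank recorded experiment dicts by improvement on one metric."""
    ranked = sorted(
        (e for e in experiments if metric in e["improvement"]),
        key=lambda e: e["improvement"][metric],
        reverse=True,
    )
    return [(e["name"], round(e["improvement"][metric], 3)) for e in ranked]

runs = [
    {"name": "temporal_features", "hypothesis": "...",
     "metrics": {"f1": 0.86}, "improvement": {"f1": 0.04}},
    {"name": "tuned_depth", "hypothesis": "...",
     "metrics": {"f1": 0.83}, "improvement": {"f1": 0.01}},
]
print(summarize(runs))  # [('temporal_features', 0.04), ('tuned_depth', 0.01)]
```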

## Commands
- `/omgml:init` - Initialize project
- `/omgtrain:baseline` - Train baselines

## Best Practices

1. Always start with a baseline
2. Change one thing at a time
3. Track all experiments
4. Document hypotheses
5. Validate before deploying
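Practice 2 can be enforced mechanically: generate run configs that each differ from a base config in exactly one parameter, so any metric change has a single cause. A hedged sketch; the helper name and parameter grid are illustrative:

```python
def one_factor_runs(base: dict, variants: dict):
    """Yield (name, config) pairs, each changing one parameter vs. base."""
    yield "base", dict(base)
    for param, values in variants.items():
        for value in values:
            if value == base.get(param):
                continue  # skip the base setting itself
            cfg = dict(base)
            cfg[param] = value
            yield f"{param}={value}", cfg

base = {"max_depth": 6, "learning_rate": 0.1}
grid = {"max_depth": [4, 6, 8], "learning_rate": [0.05, 0.1]}
runs = list(one_factor_runs(base, grid))
for name, cfg in runs:
    print(name, cfg)
```

Each yielded config can then be logged as its own tracked run, keeping attribution clean.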

Overview

This skill provides a structured ML development workflow that covers problem setup, baseline establishment, experiment design, iterative improvement, and production validation. It prescribes concrete steps and tracking practices so teams can run reproducible experiments and move models to production with confidence.

How this skill works

The skill guides you to define success criteria and metrics, implement simple baseline models, and design experiments with explicit hypotheses. It integrates experiment tracking (example uses MLflow) to log parameters, metrics, artifacts, and model versions, and provides an iterator pattern to compare improvements against baseline metrics.

When to use it

  • Starting a new ML project and needing a reproducible process
  • Establishing performance baselines before complex modeling
  • Running controlled experiments to validate single changes
  • Tracking experiments and artifacts for auditability and reproducibility
  • Preparing models for production with validation and monitoring steps

Best practices

  • Always start with a clear problem statement, metrics, and constraints
  • Implement a simple baseline (dummy or linear) as a comparison point
  • Change one variable at a time so attribution of effects is clear
  • Log hypotheses, parameters, metrics, artifacts, and model binaries consistently
  • Validate in production-like conditions (A/B tests, fairness checks, error analysis)

Example use cases

  • Churn prediction: establish logistic regression baseline, then iterate with gradient boosting while tracking metrics with MLflow
  • Feature engineering experiments: add temporal features and measure F1 and latency against baseline
  • Model selection: compare multiple baseline algorithms (majority, logistic, random forest) to pick a candidate
  • Continuous improvement: automate experiment runs and record per-run improvements versus baseline metrics
  • Deployment readiness: run production validation, A/B test, and monitor post-deploy metrics

FAQ

How do I pick baseline models?

Choose simple, fast models that reflect trivial solutions (majority class, linear models) and one reasonably strong model like a small random forest to set a realistic benchmark.

What should I log for each experiment?

Log hypothesis text, hyperparameters, training and validation metrics, model artifacts, important plots (feature importance, confusion matrix), and any upstream data version identifiers.
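A per-run record covering those fields can be as simple as a JSON document kept alongside the tracking server's data. The field names and values below are illustrative, not a required schema:

```python
import json

run_record = {
    "run_name": "xgboost_v3",
    "hypothesis": "Adding temporal features improves prediction",
    "params": {"model_type": "xgboost", "max_depth": 6, "learning_rate": 0.1},
    "metrics": {"train_accuracy": 0.93, "val_accuracy": 0.88, "f1_score": 0.86},
    "artifacts": ["feature_importance.png", "confusion_matrix.png"],
    "data_version": "train_2024_06_v1",  # illustrative upstream data identifier
}
print(json.dumps(run_record, indent=2))
```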