
experiment-tracking skill

/plugins/ltk-data/skills/experiment-tracking

This skill helps you track ML experiments, metrics, and models across platforms like MLflow and W&B for better reproducibility and collaboration.

npx playbooks add skill eyadsibai/ltk --skill experiment-tracking

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
---
name: experiment-tracking
description: Use when "experiment tracking", "MLflow", "Weights & Biases", "wandb", "model registry", "hyperparameter logging", "ML experiments", "training metrics"
version: 1.0.0
---

# Experiment Tracking

Track ML experiments, metrics, and models.

## Comparison

| Platform | Best For | Self-hosted | Visualization |
|----------|----------|-------------|---------------|
| **MLflow** | Open-source, model registry | Yes | Basic |
| **W&B** | Collaboration, sweeps | Limited | Excellent |
| **Neptune** | Team collaboration | No | Good |
| **ClearML** | Full MLOps | Yes | Good |

---

## MLflow

Open-source platform originally created by Databricks.

**Core components:**

- **Tracking**: Log parameters, metrics, artifacts
- **Projects**: Reproducible runs (MLproject file)
- **Models**: Package and deploy models
- **Registry**: Model versioning and staging

**Strengths**: Self-hosted, open-source, model registry, framework integrations
**Limitations**: Basic visualization, fewer collaboration features

**Key concept**: Autologging for major frameworks - automatic metric capture with one line.
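
A minimal sketch of a tracked run, assuming scikit-learn is installed and using an illustrative experiment name; autologging captures parameters, metrics, and the fitted model without explicit calls:

```python
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

mlflow.set_experiment("churn-baseline")  # illustrative experiment name
mlflow.autolog()  # one line: auto-captures params, metrics, and the model for supported frameworks

X, y = make_classification(n_samples=1_000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestClassifier(n_estimators=200, max_depth=8)
    model.fit(X_train, y_train)
    # Anything autologging misses can still be logged explicitly.
    mlflow.log_metric("test_accuracy", model.score(X_test, y_test))
```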

---

## Weights & Biases (W&B)

Cloud-first experiment tracking with excellent visualization.

**Core features:**

- **Experiment tracking**: Metrics, hyperparameters, system stats
- **Sweeps**: Hyperparameter search (grid, random, Bayesian)
- **Artifacts**: Dataset and model versioning
- **Reports**: Shareable documentation

**Strengths**: Beautiful visualizations, team collaboration, hyperparameter sweeps
**Limitations**: Cloud-dependent, limited self-hosting

**Key concept**: `wandb.init()` + `wandb.log()` - simple API, powerful features.
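
A minimal sketch, assuming you are logged in to W&B; the project name, hyperparameters, and metric values are placeholders:

```python
import wandb

# Start a run; config holds the hyperparameters you want tracked and compared.
run = wandb.init(
    project="churn-baseline",  # illustrative project name
    config={"learning_rate": 3e-4, "batch_size": 64, "epochs": 5},
)

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1)   # placeholder for a real training loop
    val_accuracy = 0.7 + 0.05 * epoch
    # Each call logs a step; metrics appear live in the W&B dashboard.
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_accuracy": val_accuracy})

run.finish()
```

Sweeps reuse the same training function: define a search space with `wandb.sweep()`, then launch trials with `wandb.agent()`, which populates `wandb.config` for each run.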

---

## What to Track

| Category | Examples |
|----------|----------|
| **Hyperparameters** | Learning rate, batch size, architecture |
| **Metrics** | Loss, accuracy, F1, per-epoch values |
| **Artifacts** | Model checkpoints, configs, datasets |
| **System** | GPU usage, memory, runtime |
| **Code** | Git commit, diff, requirements |
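
A hedged sketch of how the categories above map to MLflow calls; metric values and file paths are placeholders:

```python
import mlflow

with mlflow.start_run():
    # Hyperparameters
    mlflow.log_params({"learning_rate": 3e-4, "batch_size": 64, "architecture": "resnet18"})

    # Per-epoch metrics (the step argument gives the x-axis in the UI)
    for epoch, (loss, acc) in enumerate([(0.9, 0.61), (0.6, 0.74), (0.4, 0.81)]):
        mlflow.log_metric("train_loss", loss, step=epoch)
        mlflow.log_metric("val_accuracy", acc, step=epoch)

    # Artifacts: checkpoints, configs, dataset snapshots (paths are placeholders)
    mlflow.log_artifact("checkpoints/model_epoch_3.pt")
    mlflow.log_artifact("configs/train.yaml")
```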

---

## Model Registry Concepts

| Stage | Purpose |
|-------|---------|
| **None** | Just logged, not registered |
| **Staging** | Testing, validation |
| **Production** | Serving live traffic |
| **Archived** | Deprecated, kept for reference |
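
A sketch of stage transitions with the MLflow client; the model name and run URI are placeholders, and newer MLflow releases recommend version aliases over fixed stages, so treat this as illustrative:

```python
import mlflow
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register a model logged in an earlier run (URI components are placeholders).
result = mlflow.register_model(
    model_uri="runs:/<run_id>/model",
    name="churn-classifier",
)

# Promote the new version to Staging for validation before it serves traffic.
client.transition_model_version_stage(
    name="churn-classifier",
    version=result.version,
    stage="Staging",
)
```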

---

## Decision Guide

| Scenario | Recommendation |
|----------|----------------|
| Self-hosted requirement | MLflow |
| Team collaboration | W&B |
| Model registry focus | MLflow |
| Hyperparameter sweeps | W&B |
| Beautiful dashboards | W&B |
| Full MLOps pipeline | MLflow + deployment tools |

## Resources

- MLflow: <https://mlflow.org/docs/latest/>
- W&B: <https://docs.wandb.ai/>

Overview

This skill helps teams track machine learning experiments, log hyperparameters and metrics, manage artifacts, and maintain model versions across platforms like MLflow and Weights & Biases. It is focused on practical experiment tracking choices, trade-offs between self-hosted and cloud options, and how to implement reliable logging for reproducible ML workflows. Use it to pick tools and patterns that fit your deployment, collaboration, and registry needs.

How this skill works

The skill compares core experiment-tracking platforms and explains their primary components: tracking APIs for parameters and metrics, artifact storage for checkpoints and datasets, hyperparameter sweeps, and model registries for versioning. It outlines key integrations (autologging, wandb.init/wandb.log) and shows what types of data to capture (hyperparameters, metrics, system stats, code). It also maps stages used in registries (staging, production, archived) to common workflows.

When to use it

  • When you need reproducible runs and consistent logging of hyperparameters and metrics.
  • When choosing between self-hosted (MLflow) and cloud-first (W&B) tracking solutions.
  • When implementing model versioning and promotion workflows (staging → production).
  • When setting up automated hyperparameter sweeps and experiment comparison dashboards.
  • When you need to capture artifacts (checkpoints, datasets) alongside code and system stats.

Best practices

  • Log hyperparameters, per-epoch metrics, and final evaluation metrics for every run.
  • Store model checkpoints and dataset snapshots as artifacts tied to the run ID or commit.
  • Record code provenance: Git commit, diff summary, and dependency requirements (see the sketch after this list).
  • Use a consistent naming and tagging scheme (project, experiment, run) for easier queries.
  • Prefer autologging for supported frameworks to reduce boilerplate and missed metrics.
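
A hedged sketch of the provenance and naming bullets above using MLflow tags; the experiment, run, and tag names are made up (W&B offers the equivalent via the `tags=` and `config` arguments of `wandb.init()`):

```python
import subprocess
import mlflow

# Capture code provenance once per run with standard git/pip commands.
commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
dirty = bool(subprocess.check_output(["git", "status", "--porcelain"], text=True).strip())

mlflow.set_experiment("churn/feature-v2")            # illustrative project/experiment naming scheme
with mlflow.start_run(run_name="rf-depth-sweep-03"): # illustrative run name
    mlflow.set_tags({
        "git_commit": commit,
        "git_dirty": str(dirty),
        "team": "ml-platform",                       # illustrative tag
    })
    # Freeze the environment alongside the run for reproducible retraining.
    with open("requirements.lock", "w") as f:
        subprocess.run(["pip", "freeze"], stdout=f, check=True)
    mlflow.log_artifact("requirements.lock")
```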

Example use cases

  • A researcher runs grid and Bayesian sweeps in W&B to optimize hyperparameters and visualize results.
  • An engineering team self-hosts MLflow to maintain a model registry and promote models to production.
  • A data scientist logs GPU usage and per-epoch metrics to diagnose training instability.
  • A team archives older model versions in the registry while promoting validated models to production.
  • A project captures model artifacts and dataset versions to enable reproducible retraining and audits.

FAQ

Which platform is best if I must self-host?

MLflow is the preferred choice for self-hosting and includes a built-in model registry for versioning.

When should I use Weights & Biases instead of MLflow?

Use W&B if you prioritize collaborative dashboards, rich visualizations, and managed hyperparameter sweeps.