
sw-ml-engineer skill

/skills/anton-abyzov/sw-ml-engineer

This skill helps you build robust ML systems by enforcing best practices like baseline comparison, cross-validation, experiment tracking, and explainability.

npx playbooks add skill openclaw/skills --skill sw-ml-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
431 B
---
name: ml-engineer
description: ML system builder enforcing best practices - baseline comparison, cross-validation, experiment tracking, explainability (SHAP/LIME). Use for ML pipelines, model training, production ML.
model: opus
context: fork
---

# ML Engineer Agent

## ⚠️ Chunking Rule

Large ML pipelines = 1000+ lines. Generate ONE stage per response: Data/EDA → Features → Training → Evaluation → Deployment.

Overview

This skill is an ML system builder that enforces production-grade best practices across the model lifecycle. It helps teams set baselines, run rigorous cross-validation, track experiments, and add model explainability (SHAP/LIME) for transparent decisions. The agent is optimized for constructing repeatable ML pipelines from data to deployment with clear stage separation.

How this skill works

The agent inspects your dataset, pipeline configuration, and training code to generate one actionable pipeline stage per response (Data/EDA → Features → Training → Evaluation → Deployment). It enforces baseline comparisons, automated cross-validation, and hooks for experiment tracking systems (e.g., MLflow). Explainability modules are integrated to compute SHAP or LIME explanations and surface interpretable model behavior.
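The baseline-plus-cross-validation discipline described above can be sketched in a few lines. This is a minimal illustration using scikit-learn and synthetic data (the actual stage code the agent generates will depend on your dataset and configuration); the key idea is that the baseline and the candidate model are scored on identical CV splits so the comparison is fair.

```python
# Minimal sketch: compare a trivial baseline against a candidate model
# on the SAME cross-validation splits, so any gain is attributable to the model.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # synthetic stand-in data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # deterministic splits

baseline = DummyClassifier(strategy="most_frequent")
candidate = GradientBoostingClassifier(random_state=0)

base_scores = cross_val_score(baseline, X, y, cv=cv, scoring="accuracy")
cand_scores = cross_val_score(candidate, X, y, cv=cv, scoring="accuracy")

print(f"baseline:  {base_scores.mean():.3f} ± {base_scores.std():.3f}")
print(f"candidate: {cand_scores.mean():.3f} ± {cand_scores.std():.3f}")
```

Fixing `random_state` on both the splitter and the model keeps the comparison reproducible across runs, which is what the skill's baseline-comparison rule is meant to guarantee.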

When to use it

  • Building or refactoring an ML pipeline to follow production best practices
  • Setting up baseline models and systematic cross-validation for fair comparison
  • Instrumenting experiments with reproducible tracking and metadata
  • Adding explainability to models before deployment for auditing or compliance
  • Preparing models and artifacts for staging or production deployment

Best practices

  • Follow the one-stage-per-response rule for very large pipelines: generating a single stage at a time keeps each output focused and reviewable
  • Always start with a simple baseline model and deterministic validation splits before increasing complexity
  • Use nested or repeated cross-validation for robust performance estimates and report variance, not just mean metrics
  • Log datasets, hyperparameters, random seeds, and metrics to an experiment tracker for reproducibility
  • Integrate SHAP/LIME post-training and include global and per-sample explanations in evaluation reports
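The repeated cross-validation practice above can be shown concretely. This sketch uses scikit-learn's `RepeatedStratifiedKFold` on a bundled dataset; note that it reports the standard deviation alongside the mean, as the best practice recommends, rather than a single point estimate.

```python
# Repeated stratified CV: 5 folds x 3 repeats = 15 scores,
# reported as mean AND spread rather than a single number.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Scaling lives inside the pipeline so it is refit per fold (no leakage).
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"ROC AUC: {scores.mean():.3f} (std {scores.std():.3f}, n={len(scores)})")
```

Putting the scaler inside the pipeline is deliberate: fitting preprocessing on the full dataset before splitting would leak test-fold statistics into training, inflating the estimate.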

Example use cases

  • Create a reproducible Data/EDA stage that outputs cleaned datasets, summary statistics, and issue reports
  • Implement a feature engineering stage whose pipelines double as production transformation code
  • Run a training stage that performs hyperparameter sweeps and cross-validation, then registers models in a model store
  • Produce an evaluation stage with test-set metrics, calibration checks, fairness assessments, and SHAP explanations
  • Generate deployment stage artifacts: saved model, inference signature, CI checks, and rollout plan
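As a small illustration of the Data/EDA stage's outputs, the sketch below builds an issue report (row count, per-column missing values, summary statistics) from a toy inline CSV using pandas; the column names and values are hypothetical placeholders for whatever your dataset contains.

```python
# Toy Data/EDA stage output: cleaned frame plus a machine-readable issue report.
import io
import pandas as pd

# Hypothetical raw data; note the missing income on row 2.
csv = io.StringIO("age,income,churn\n34,52000,0\n41,,1\n29,38000,0\n55,91000,1\n")
df = pd.read_csv(csv)

report = {
    "n_rows": len(df),
    "missing": df.isna().sum().to_dict(),   # per-column missing counts
    "summary": df.describe(),               # standard summary statistics
}
print(report["missing"])
```

Emitting the report as plain dictionaries/frames makes it easy for the next stage (features) to consume programmatically rather than re-parsing prose.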

FAQ

How does the one-stage-per-response rule help?

It keeps each response focused and reviewable for very large pipelines, reduces cognitive load, and simplifies testing and iteration on individual pipeline components.

Which explainability methods are supported?

The skill integrates SHAP and LIME for both global and local explanations and recommends SHAP for tree-based models and LIME for model-agnostic local checks.

Is experiment tracking mandatory?

While not mandatory, the skill strongly recommends using an experiment tracker to ensure reproducibility, enable comparisons, and simplify model promotion to production.
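Whatever tracker you choose, the metadata it should capture is the same. The sketch below writes a minimal run record as JSON; every value here (run id, dataset name, metric numbers) is a hypothetical placeholder, and trackers like MLflow record the same fields through their own APIs (`log_param`, `log_metric`, and so on).

```python
# Minimal sketch of what one tracked experiment run should record:
# params, data reference, seed, metrics (with variance), and environment.
import json
import platform
import time

run = {
    "run_id": "demo-001",  # hypothetical identifier
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    "params": {"model": "logreg", "C": 1.0, "seed": 42},
    "data": {"dataset": "train_v3.parquet", "n_rows": 10000},  # hypothetical
    "metrics": {"cv_auc_mean": 0.913, "cv_auc_std": 0.008},    # illustrative values
    "env": {"python": platform.python_version()},
}
record = json.dumps(run, indent=2)
print(record)
```

Recording the seed and dataset reference alongside the metrics is what makes a later re-run or model promotion decision auditable.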