
ml-engineer skill

/plugins/specweave-ml/skills/ml-engineer

This skill helps build and optimize ML pipelines by enforcing best practices, experiment tracking, cross-validation, and explainability throughout all pipeline stages.

This is most likely a fork of the sw-ml-engineer skill from openclaw.
npx playbooks add skill anton-abyzov/specweave --skill ml-engineer

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (431 B)
---
name: ml-engineer
description: ML system builder enforcing best practices - baseline comparison, cross-validation, experiment tracking, explainability (SHAP/LIME). Use for ML pipelines, model training, production ML.
model: opus
context: fork
---

# ML Engineer Agent

## ⚠️ Chunking Rule

Large ML pipelines = 1000+ lines. Generate ONE stage per response: Data/EDA → Features → Training → Evaluation → Deployment.

Overview

This skill builds ML systems that enforce engineering best practices for robust, production-ready models. It guides pipeline construction, baseline comparison, cross-validation, experiment tracking, and model explainability using SHAP/LIME. The implementation targets TypeScript environments and integrates with CI/CD and developer tooling.

How this skill works

The agent inspects project artifacts and generates one pipeline stage per response: Data/EDA → Features → Training → Evaluation → Deployment, keeping large pipelines manageable. For each stage it emits concrete specs, tests, and TypeScript scaffold code, plus configuration for experiment tracking and model explainability hooks. It validates choices against baseline models and cross-validation protocols, and adds CI/CD and documentation snippets suited to production ML.
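A minimal sketch of what a generated stage contract might look like in TypeScript. The interface and names here (StageResult, PipelineStage, the metric keys) are illustrative assumptions, not the skill's actual output:

```typescript
// Illustrative pipeline stage contract: each stage returns artifacts and
// metrics that can be handed to experiment tracking.
interface StageResult {
  artifacts: Record<string, unknown>; // e.g. cleaned data, trained model
  metrics: Record<string, number>;    // logged per run
}

interface PipelineStage {
  name: string;
  run(input: Record<string, unknown>): StageResult;
}

// Example: a tiny "Evaluation" stage that compares a model score to a baseline.
const evaluationStage: PipelineStage = {
  name: "evaluation",
  run(input) {
    const modelScore = input["modelScore"] as number;
    const baselineScore = input["baselineScore"] as number;
    return {
      artifacts: { beatsBaseline: modelScore > baselineScore },
      metrics: { modelScore, baselineScore, lift: modelScore - baselineScore },
    };
  },
};

const result = evaluationStage.run({ modelScore: 0.87, baselineScore: 0.75 });
console.log(result.metrics.lift.toFixed(2)); // prints "0.12"
```

Because each stage is a self-contained unit with an explicit input/output shape, the one-stage-per-response rule maps naturally onto one such object per response.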

When to use it

  • Starting a new ML project that must meet production standards and traceability
  • Converting a research notebook into a reliable TypeScript-based pipeline
  • Implementing experiment tracking, reproducible training, or automatic baseline checks
  • Adding explainability (SHAP/LIME) and validation to an existing model workflow
  • Preparing ML components for deployment with CI/CD and automated testing

Best practices

  • Follow the one-stage-per-response chunking rule for pipelines >1000 lines to keep reviews focused
  • Always include a simple baseline model and automated baseline comparison tests
  • Use k-fold cross-validation for performance estimates and log folds in experiment tracking
  • Integrate SHAP or LIME outputs into evaluation reports and store artifacts with experiments
  • Make every model step reproducible: fixed seeds, environment capture, and dataset versioning
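To illustrate the reproducibility and cross-validation practices above, here is a hedged sketch of deterministic k-fold splits with a fixed seed, so folds can be logged and replayed across runs. The PRNG (mulberry32) and helper names are our assumptions, not the skill's prescribed implementation:

```typescript
// Small deterministic PRNG so the same seed always yields the same folds.
function mulberry32(seed: number): () => number {
  return () => {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Split n sample indices into k folds after a seeded Fisher-Yates shuffle.
function kFoldIndices(n: number, k: number, seed: number): number[][] {
  const rand = mulberry32(seed);
  const idx = Array.from({ length: n }, (_, i) => i);
  for (let i = n - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  const folds: number[][] = Array.from({ length: k }, () => []);
  idx.forEach((v, i) => folds[i % k].push(v));
  return folds;
}

const folds = kFoldIndices(10, 3, 42);
console.log(folds.map(f => f.length)); // fold sizes: [4, 3, 3]
```

Logging the seed and the resulting fold assignments alongside each experiment run makes every reported cross-validation score reproducible.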

Example use cases

  • Generate a TypeScript training scaffold that performs feature engineering, trains models, and logs runs to an experiment server
  • Add an evaluation stage that runs cross-validation, compares to a baseline, and produces SHAP explanations for the top features
  • Produce deployment specs and CI pipelines to validate model contracts and serve a prediction API
  • Convert an end-to-end Jupyter notebook into modular pipeline stages with tests and docs
  • Create monitoring hooks that capture drift metrics and replay data for retraining
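As a sketch of the drift-monitoring use case, one common drift metric is the Population Stability Index (PSI) between a reference distribution and live data. The function name, binning scheme, and smoothing constant below are illustrative assumptions:

```typescript
// Population Stability Index between a reference sample and live data,
// using fixed-width bins derived from the reference distribution.
function psi(expected: number[], actual: number[], bins = 10): number {
  const min = Math.min(...expected);
  const max = Math.max(...expected);
  const width = (max - min) / bins || 1;
  const hist = (xs: number[]) => {
    const counts = new Array(bins).fill(0);
    for (const x of xs) {
      const b = Math.min(bins - 1, Math.max(0, Math.floor((x - min) / width)));
      counts[b]++;
    }
    // Smooth zero proportions to avoid log(0).
    return counts.map(c => Math.max(c / xs.length, 1e-6));
  };
  const e = hist(expected);
  const a = hist(actual);
  return e.reduce((s, ei, i) => s + (a[i] - ei) * Math.log(a[i] / ei), 0);
}

// Identical distributions yield PSI of 0; a shifted distribution scores higher.
const ref = Array.from({ length: 1000 }, (_, i) => (i % 100) / 100);
console.log(psi(ref, ref) < 1e-9); // prints "true"
```

A monitoring hook could compute PSI per feature on each batch of production inputs and trigger retraining when it exceeds a chosen threshold (0.2 is a commonly cited rule of thumb).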

FAQ

How does the one-stage-per-response rule affect workflow?

It forces small, reviewable increments: request a specific stage and receive focused specs, code, and tests for that stage only.

Which explainability tools are supported?

The workflow includes SHAP and LIME integrations for feature-level explanations and exports explainability artifacts alongside experiment logs.