
ml-systems-fundamentals skill

/plugin/skills/ml-systems/ml-systems-fundamentals

This skill helps you grasp production ML fundamentals, architecture, lifecycle, and best practices for reliable, scalable, and maintainable ML systems.

npx playbooks add skill doanchienthangdev/omgkit --skill ml-systems-fundamentals


SKILL.md
---
name: ml-systems-fundamentals
description: Core ML systems concepts including ML lifecycle, system architecture, requirements, and design principles for production ML.
---

# ML Systems Fundamentals

Foundation concepts for building production ML systems.

## ML System Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    ML SYSTEM ARCHITECTURE                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  DATA LAYER                                                  │
│  ├── Data Collection    ├── Data Storage                    │
│  ├── Data Processing    └── Feature Store                   │
│                                                              │
│  MODEL LAYER                                                 │
│  ├── Training Pipeline  ├── Experiment Tracking              │
│  ├── Model Registry     └── Evaluation                      │
│                                                              │
│  SERVING LAYER                                               │
│  ├── Model Serving      ├── Feature Serving                 │
│  ├── Prediction Cache   └── Load Balancing                  │
│                                                              │
│  MONITORING LAYER                                            │
│  ├── Data Monitoring    ├── Model Monitoring                │
│  ├── System Metrics     └── Alerting                        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```
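
As one illustration of the serving layer, the Prediction Cache component could be sketched with the standard library's `functools.lru_cache`. This is a minimal sketch, not part of the skill itself: `slow_model_predict` is a hypothetical stand-in for an expensive model call, and the cache size of 1024 is an arbitrary assumption.

```python
from functools import lru_cache
from typing import Tuple

# Counts how often the stand-in model is actually invoked.
CALLS = {"model": 0}


def slow_model_predict(features: Tuple[float, ...]) -> float:
    """Hypothetical stand-in for an expensive model call."""
    CALLS["model"] += 1
    return sum(features) / len(features)


@lru_cache(maxsize=1024)  # assumed size; tune to memory and traffic
def cached_predict(features: Tuple[float, ...]) -> float:
    """Serve repeated requests for identical features from memory."""
    return slow_model_predict(features)


first = cached_predict((1.0, 2.0, 3.0))   # computed by the model
second = cached_predict((1.0, 2.0, 3.0))  # answered from the cache
```

Features are passed as a tuple because `lru_cache` requires hashable arguments. When a new model version is deployed, calling `cached_predict.cache_clear()` avoids serving stale predictions.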

## ML Lifecycle

1. **Problem Definition** - Business goal → ML task
2. **Data Collection** - Gather relevant data
3. **Data Processing** - Clean, transform, validate
4. **Feature Engineering** - Create informative features
5. **Model Development** - Train, tune, evaluate
6. **Deployment** - Serve predictions
7. **Monitoring** - Track performance
8. **Iteration** - Improve based on feedback

## System Requirements

### Reliability
- Handle failures gracefully
- Maintain prediction quality
- Provide consistent latency
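
One way to read "handle failures gracefully" is to return a safe default when the model call fails instead of surfacing an error to the caller. The sketch below assumes a single numeric prediction and a caller-supplied default; `predict_with_fallback` and its parameters are illustrative names, not part of the skill.

```python
import logging
from typing import Callable, Sequence, Tuple

logger = logging.getLogger(__name__)


def predict_with_fallback(
    predict: Callable[[Sequence[float]], float],
    features: Sequence[float],
    fallback: float,
) -> Tuple[float, bool]:
    """Return (prediction, used_fallback).

    If the model call raises, log the error and return the fallback value
    so the request still gets an answer.
    """
    try:
        return predict(features), False
    except Exception:
        logger.exception("model call failed; returning fallback value")
        return fallback, True
```

Returning the `used_fallback` flag lets the monitoring layer count how often the fallback path is taken, which is itself a useful reliability signal.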

### Scalability
- Handle growing data
- Support more requests
- Enable parallel training

### Maintainability
- Easy to update models
- Clear documentation
- Reproducible experiments

### Adaptability
- Respond to data changes
- Support new features
- Enable quick iterations

## Design Principles

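```python
import mlflow
import pandas as pd
from scipy.stats import ks_2samp
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data so the example runs end to end; substitute your own dataset.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Start Simple
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.3f}")

# 2. Data Quality > Model Complexity
def validate_data(df: pd.DataFrame) -> bool:
    """Raise ValueError if the frame has missing values or duplicate rows."""
    if df.isnull().values.any():
        raise ValueError("data contains missing values")
    if df.duplicated().any():
        raise ValueError("data contains duplicate rows")
    return True

# 3. Version Everything
with mlflow.start_run():
    mlflow.log_param("model_version", "1.0.0")
    # log_artifacts uploads a whole directory; assumes data/processed/ exists
    mlflow.log_artifacts("data/processed/")

# 4. Monitor Continuously
def check_drift(reference, current, alpha=0.05):
    """Return True when a two-sample KS test rejects equal distributions at level alpha."""
    return ks_2samp(reference, current).pvalue < alpha
```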

## Commands
- `/omgml:init` - Initialize ML project
- `/omgml:status` - Project status

## Best Practices

1. Define clear success metrics
2. Establish baselines early
3. Invest in data quality
4. Automate everything possible
5. Monitor production models

Overview

This skill captures core ML systems concepts for designing, building, and operating production machine learning systems. It summarizes architecture layers, the ML lifecycle, system requirements, and pragmatic design principles for shipping reliable models. Use it as a checklist and quick reference for production ML decisions.

How this skill works

The skill breaks an ML system into four layers (data, model, serving, and monitoring) and lists the components of each. It outlines the end-to-end ML lifecycle from problem definition through iteration and names four system requirements: reliability, scalability, maintainability, and adaptability. It also gives four design principles (start simple, put data quality ahead of model complexity, version everything, monitor continuously) and two commands, /omgml:init and /omgml:status, to initialize a project or check its status.

When to use it

  • Planning or architecting a production ML system
  • Onboarding engineers to ML operational best practices
  • Creating requirements and nonfunctional specs for ML projects
  • Evaluating gaps in an existing ML deployment (data, serving, monitoring)
  • Designing CI/CD, model registry, and experiment tracking workflows

Best practices

  • Define clear business success metrics and measurable model goals
  • Establish simple baselines before adding model complexity
  • Invest in data quality: validation, deduplication, and lineage
  • Version models, data, and experiments for reproducibility
  • Automate pipelines and continuous monitoring for drift and alerts
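
The drift monitoring mentioned in the last item could be automated with a check like the one below. Note the swap: the SKILL.md example uses a Kolmogorov-Smirnov test from SciPy, whereas this sketch computes the Population Stability Index (PSI) because it needs only the standard library. The function name, the bin count, and the 0.2 alert threshold are all illustrative assumptions; teams choose their own cutoffs.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples, using equal-width bins over the reference range."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range go to the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def drift_alert(reference, current, threshold: float = 0.2) -> bool:
    """True when PSI exceeds the (assumed) alert threshold."""
    return population_stability_index(reference, current) > threshold
```

A scheduled job could call `drift_alert` per feature against a stored training sample and route any `True` result to the alerting system.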

Example use cases

  • Designing an architecture diagram that separates data, model, serving, and monitoring responsibilities
  • Creating an ML lifecycle checklist to align stakeholders and engineers
  • Setting up feature stores, model registries, and experiment tracking for reproducible workflows
  • Implementing monitoring and alerting to detect data drift and model degradation
  • Scoping nonfunctional requirements: latency targets, failure handling, and scaling plans
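
A latency target from the last item can be made testable by measuring a percentile and comparing it with the target. This is a rough sketch under stated assumptions: it uses the nearest-rank percentile definition, times calls in-process with `time.perf_counter`, and all function names are hypothetical. Real serving latency would also include network and queueing time.

```python
import math
import time
from typing import Callable, List, Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latencies_ms(predict: Callable[[], object], n_requests: int = 100) -> List[float]:
    """Call predict() n_requests times and record each duration in milliseconds."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predict()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def meets_latency_target(latencies_ms: Sequence[float], target_p95_ms: float) -> bool:
    """True when the 95th-percentile latency is at or below the target."""
    return percentile(latencies_ms, 95) <= target_p95_ms
```

Checking a high percentile rather than the mean keeps a few slow requests from hiding behind many fast ones, which matches the "consistent latency" requirement.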

FAQ

What is the most important layer to get right first?

Start with the data layer: collection, processing, and quality checks. Data quality often has a larger impact on results than model complexity.

How do I ensure reproducibility across experiments?

Version code, data snapshots, and model artifacts, and log key parameters. Use experiment tracking and a model registry to tie runs to artifacts.
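
One lightweight way to tie a run to its inputs is to hash the data snapshot, the parameters, and the code version into a single identifier and log it with the run. The sketch below assumes the data fits in memory as bytes and that the code version is a string such as a git commit hash; `run_fingerprint` is a hypothetical helper, not part of any tracking library.

```python
import hashlib
import json
from typing import Any, Dict


def run_fingerprint(data_bytes: bytes, params: Dict[str, Any], code_version: str) -> str:
    """Hash the data snapshot, parameters, and code version into one short ID."""
    digest = hashlib.sha256()
    digest.update(hashlib.sha256(data_bytes).digest())
    # sort_keys makes the JSON, and therefore the hash, independent of dict order.
    digest.update(json.dumps(params, sort_keys=True).encode("utf-8"))
    digest.update(code_version.encode("utf-8"))
    return digest.hexdigest()[:16]


fingerprint = run_fingerprint(
    b"feature_a,feature_b,label\n1,2,0\n",  # toy data snapshot
    {"learning_rate": 0.1, "max_depth": 3},
    "abc1234",  # placeholder for a commit hash
)
```

Two runs with the same fingerprint used the same data, parameters, and code, so a difference in their metrics points to nondeterminism such as an unseeded random generator.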