This skill helps you grasp production ML fundamentals, architecture, lifecycle, and best practices for reliable, scalable, and maintainable ML systems.
Add this skill to your agents with `npx playbooks add skill doanchienthangdev/omgkit --skill ml-systems-fundamentals`.
---
name: ml-systems-fundamentals
description: Core ML systems concepts including ML lifecycle, system architecture, requirements, and design principles for production ML.
---
# ML Systems Fundamentals
Foundation concepts for building production ML systems.
## ML System Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ ML SYSTEM ARCHITECTURE │
├─────────────────────────────────────────────────────────────┤
│ │
│ DATA LAYER │
│ ├── Data Collection ├── Data Storage │
│ ├── Data Processing └── Feature Store │
│ │
│ MODEL LAYER │
│ ├── Training Pipeline ├── Experiment Tracking │
│ ├── Model Registry └── Evaluation │
│ │
│ SERVING LAYER │
│ ├── Model Serving ├── Feature Serving │
│ ├── Prediction Cache └── Load Balancing │
│ │
│ MONITORING LAYER │
│ ├── Data Monitoring ├── Model Monitoring │
│ ├── System Metrics └── Alerting │
│ │
└─────────────────────────────────────────────────────────────┘
```
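As one illustration of the serving layer, the Prediction Cache can be sketched as memoization over hashable feature tuples. This is a minimal sketch, not a prescribed implementation: `score` is a hypothetical stand-in for an expensive model call, and `cached_predict` and `CALLS` are names invented for the example.

```python
from functools import lru_cache

CALLS = {"model": 0}  # counts how often the stand-in model is actually invoked


def score(features: tuple) -> float:
    """Hypothetical stand-in for an expensive model call."""
    CALLS["model"] += 1
    return sum(features) / len(features)


@lru_cache(maxsize=1024)
def cached_predict(features: tuple) -> float:
    """Serve repeated requests from memory; features must be hashable."""
    return score(features)


first = cached_predict((1.0, 2.0, 3.0))
second = cached_predict((1.0, 2.0, 3.0))  # same input: served from the cache
```

A cache like this has to be cleared when a new model version is deployed, for example with `cached_predict.cache_clear()`, otherwise stale predictions keep being served.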
## ML Lifecycle
1. **Problem Definition** - Business goal → ML task
2. **Data Collection** - Gather relevant data
3. **Data Processing** - Clean, transform, validate
4. **Feature Engineering** - Create informative features
5. **Model Development** - Train, tune, evaluate
6. **Deployment** - Serve predictions
7. **Monitoring** - Track performance
8. **Iteration** - Improve based on feedback
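The lifecycle steps above can be sketched as a chain of small functions, each owning one stage. The example below is a toy under stated assumptions: the data is hard-coded, the "model" is a single threshold, feature engineering is skipped because there is only one input, and all function names are hypothetical.

```python
from statistics import mean


def collect():
    """Step 2: gather raw records (hard-coded toy data, one row invalid)."""
    return [(1.0, 0), (2.0, 0), (None, 1), (8.0, 1), (9.0, 1), (2.0, 0)]


def process(rows):
    """Step 3: drop rows with missing values, then exact duplicates."""
    clean = [row for row in rows if row[0] is not None]
    return list(dict.fromkeys(clean))  # keeps order, removes duplicates


def train(rows):
    """Step 5: fit a threshold at the midpoint between the class means."""
    mean0 = mean(x for x, y in rows if y == 0)
    mean1 = mean(x for x, y in rows if y == 1)
    return (mean0 + mean1) / 2


def predict(threshold, x):
    """Step 6: serve a prediction."""
    return int(x > threshold)


def evaluate(threshold, rows):
    """Step 7: track accuracy on labelled rows."""
    correct = sum(predict(threshold, x) == y for x, y in rows)
    return correct / len(rows)


rows = process(collect())
threshold = train(rows)
# Evaluated on the training rows for brevity; a real pipeline would hold out data.
accuracy = evaluate(threshold, rows)
```

Keeping each stage behind its own function makes step 8, iteration, cheaper: one stage can be replaced without touching the others.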
## System Requirements
### Reliability
- Handle failures gracefully
- Maintain prediction quality
- Provide consistent latency
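One way to handle failures gracefully is to wrap the model call so that any error degrades to a safe default instead of reaching the caller. The sketch below assumes a hypothetical `FALLBACK_SCORE` and toy model functions; it does not cover timeouts or retries.

```python
import logging

logger = logging.getLogger("serving")

FALLBACK_SCORE = 0.5  # hypothetical safe default, e.g. a prior or cached average


def predict_with_fallback(model_fn, features, fallback=FALLBACK_SCORE):
    """Return (prediction, used_fallback); never raise to the caller."""
    try:
        return model_fn(features), False
    except Exception:  # broad on purpose: any model failure degrades to the fallback
        logger.exception("model call failed; serving fallback")
        return fallback, True


def healthy_model(features):
    """Toy model that works."""
    return sum(features)


def broken_model(features):
    """Toy model that simulates an outage."""
    raise RuntimeError("model server unavailable")


ok = predict_with_fallback(healthy_model, [0.1, 0.2])
degraded = predict_with_fallback(broken_model, [0.1, 0.2])
```

Returning the `used_fallback` flag alongside the prediction lets monitoring count how often the fallback is served, which ties reliability back to prediction quality.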
### Scalability
- Handle growing data
- Support more requests
- Enable parallel training
### Maintainability
- Easy to update models
- Clear documentation
- Reproducible experiments
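Reproducible experiments usually start with explicit control of randomness. A minimal sketch, assuming a toy `run_experiment` function (hypothetical) in which every random draw comes from one seeded generator:

```python
import random


def run_experiment(seed: int, n: int = 5) -> list:
    """Toy 'experiment': all randomness flows from one explicit seed."""
    rng = random.Random(seed)  # local generator, no hidden global state
    return [rng.random() for _ in range(n)]


run_a = run_experiment(seed=42)
run_b = run_experiment(seed=42)  # same seed: identical results
run_c = run_experiment(seed=43)  # different seed: different results
```

Numerical libraries and ML frameworks keep their own random generators, so each one in use needs the same treatment.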
### Adaptability
- Respond to data changes
- Support new features
- Enable quick iterations
## Design Principles
```python
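import mlflow
from scipy.stats import ks_2samp
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_test, y_test are assumed to be defined already.

# 1. Start Simple
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
print(f"Baseline accuracy: {baseline.score(X_test, y_test):.3f}")

# 2. Data Quality > Model Complexity
def validate_data(df):
    """Fail fast if the DataFrame has missing values or duplicate rows."""
    assert df.isnull().sum().sum() == 0, "missing values found"
    assert df.duplicated().sum() == 0, "duplicate rows found"
    return True

# 3. Version Everything
mlflow.log_param("model_version", "1.0.0")
mlflow.log_artifact("data/processed/")

# 4. Monitor Continuously
def check_drift(reference, current):
    """Return True if a two-sample KS test rejects equal distributions at the 5% level."""
    return ks_2samp(reference, current).pvalue < 0.05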
```
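The KS test in the block above is one way to flag drift. As a complementary illustration, the sketch below computes the Population Stability Index (PSI), a different technique, chosen here because it needs only the standard library. The `psi` function, its bin count, and the synthetic samples are assumptions made for the example.

```python
import math
import random


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are quantiles of the reference sample, so each bin holds
    about 1/bins of the reference data.
    """
    ref = sorted(reference)
    edges = [ref[len(ref) * i // bins] for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for value in sample:
            # Number of edges below the value is the index of its bin.
            counts[sum(value > edge for edge in edges)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(count / len(sample), eps) for count in counts]

    expected, actual = fractions(reference), fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


rng = random.Random(0)
reference = [rng.gauss(0, 1) for _ in range(5000)]
same = [rng.gauss(0, 1) for _ in range(5000)]     # same distribution
shifted = [rng.gauss(1, 1) for _ in range(5000)]  # mean moved by one std dev
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift. These cutoffs are conventions rather than guarantees, so they are worth calibrating per feature.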
## Commands
- `/omgml:init` - Initialize ML project
- `/omgml:status` - Project status
## Best Practices
1. Define clear success metrics
2. Establish baselines early
3. Invest in data quality
4. Automate everything possible
5. Monitor production models
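Practices 1 and 2 can be combined into an explicit gate: a model counts as successful only if it beats a trivial baseline by an agreed margin. A minimal sketch, assuming accuracy as the success metric and a hypothetical `min_lift` of 0.05; both are placeholders for whatever the project actually agrees on.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def meets_success_metric(predictions, labels, min_lift=0.05):
    """True if the model beats the majority baseline by at least min_lift."""
    baseline = majority_baseline_accuracy(labels)
    return accuracy(predictions, labels) >= baseline + min_lift


labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]      # six of ten are class 0
good_model = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]  # one mistake
lazy_model = [0] * 10                        # identical to the baseline
```

With these inputs, `meets_success_metric(good_model, labels)` passes and `meets_success_metric(lazy_model, labels)` does not, because the lazy model adds nothing over the baseline.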
This skill captures core ML systems concepts for designing, building, and operating production machine learning. It summarizes architecture layers, the ML lifecycle, system requirements, and pragmatic design principles to ship reliable models. Use it as a checklist and quick reference for production ML decisions.
The skill breaks an ML system into four layers—data, model, serving, and monitoring—and explains components and responsibilities for each. It outlines the end-to-end ML lifecycle from problem definition through iteration and highlights requirements (reliability, scalability, maintainability, adaptability). It also prescribes simple, practical design principles and a short command set to bootstrap or check project status.
**What is the most important layer to get right first?**
Start with the data layer: collection, processing, and quality checks. Data quality often has a larger impact on results than model complexity.
**How do I ensure reproducibility across experiments?**
Version code, data snapshots, and model artifacts, and log key parameters for every run. Use experiment tracking and a model registry to tie runs to artifacts.
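Tying a run to its artifacts can be sketched with content hashes: the same code version, data, and parameters always produce the same run ID. The example is a standalone illustration, not the API of any tracking tool. `build_manifest`, the `"abc1234"` commit string, and the inline data bytes are hypothetical.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Content hash: identical bytes always give the identical ID."""
    return hashlib.sha256(payload).hexdigest()


def build_manifest(code_version: str, data: bytes, params: dict) -> dict:
    """Tie one run to its exact code version, data snapshot, and parameters."""
    # sort_keys makes the serialized parameters independent of dict order
    canonical_params = json.dumps(params, sort_keys=True).encode("utf-8")
    manifest = {
        "code_version": code_version,
        "data_sha256": fingerprint(data),
        "params": params,
    }
    manifest["run_id"] = fingerprint(
        code_version.encode("utf-8") + data + canonical_params
    )[:12]
    return manifest


data = b"x,y\n1,0\n2,1\n"  # stand-in for the bytes of a data snapshot file
run_1 = build_manifest("abc1234", data, {"lr": 0.1, "epochs": 5})
run_2 = build_manifest("abc1234", data, {"epochs": 5, "lr": 0.1})  # same params, other order
run_3 = build_manifest("abc1234", data, {"lr": 0.2, "epochs": 5})  # changed learning rate
```

Here `run_1` and `run_2` share a run ID while `run_3` gets a new one, so any change to code, data, or parameters is visible in the ID itself.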