---
name: machine-learning
description: Supervised/unsupervised learning, model selection, evaluation, and scikit-learn. Use for building classification, regression, or clustering models.
sasmp_version: "1.3.0"
bonded_agent: 04-machine-learning-ai
bond_type: PRIMARY_BOND
---
# Machine Learning with Scikit-Learn
Build, train, and evaluate ML models for classification, regression, and clustering.
## Quick Start
### Classification
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Predict
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
# Evaluate
print(classification_report(y_test, predictions))
```
### Regression
```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score
model = GradientBoostingRegressor(n_estimators=100)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, predictions):.2f}")
print(f"R²: {r2_score(y_test, predictions):.3f}")
```
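Before trusting these numbers, it helps to compare against a trivial baseline; here is a minimal sketch using DummyRegressor (an addition to the recipe above, not part of it):
```python
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
# Baseline that always predicts the training-set mean
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
print(f"Baseline MAE: {baseline_mae:.2f}")  # your model should beat this comfortably
```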
### Clustering
```python
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
# Find optimal k (elbow method)
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
# Train with the k chosen from the elbow plot (5 here is illustrative)
kmeans = KMeans(n_clusters=5, random_state=42)
clusters = kmeans.fit_predict(X)
```
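The elbow is sometimes ambiguous; silhouette scores give a second opinion on k. A minimal sketch using silhouette_score (which requires at least 2 clusters):
```python
from sklearn.metrics import silhouette_score
# Higher silhouette (max 1.0) means tighter, better-separated clusters
for k in range(2, 11):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels):.3f}")
```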
## Model Selection Guide
**Classification:**
- **Logistic Regression**: Linear, interpretable; a solid baseline
- **Random Forest**: Non-linear, provides feature importances, robust to overfitting
- **XGBoost**: Often the strongest on tabular data; handles missing values natively
- **SVM**: Works well on small datasets; kernel trick for non-linear boundaries
**Regression:**
- **Linear Regression**: Linear relationships, interpretable coefficients
- **Ridge/Lasso**: Regularized linear models; Lasso also performs feature selection
- **Random Forest**: Non-linear, robust to outliers
- **XGBoost**: Often the strongest on tabular data; a frequent competition winner
**Clustering:**
- **K-Means**: Fast; assumes roughly spherical, similar-sized clusters
- **DBSCAN**: Finds arbitrary shapes and labels noise points; no k required
- **Hierarchical**: Builds a dendrogram, so k can be chosen after fitting
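A quick way to choose among these candidates is to cross-validate a shortlist on the same data. A sketch; the three models below are illustrative, swap in whatever fits your problem:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
candidates = {
    'logreg': LogisticRegression(max_iter=1000),
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'svm': SVC()
}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring='f1_weighted')
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```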
## Evaluation Metrics
**Classification:**
```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, confusion_matrix
)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')
roc_auc = roc_auc_score(y_true, y_pred_proba, multi_class='ovr')
```
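confusion_matrix is imported above but not shown in use; per-class errors are often more informative than a single aggregate score. A short sketch (assumes scikit-learn 0.22+ for ConfusionMatrixDisplay):
```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
cm = confusion_matrix(y_true, y_pred)  # rows: true classes, columns: predicted
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.show()
```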
**Regression:**
```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score
)
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
```
## Cross-Validation
```python
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
print(f"CV F1: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```
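To collect several metrics from a single CV run, cross_validate (a separate function from cross_val_score) returns them all at once; setting return_train_score=True also helps spot overfitting:
```python
from sklearn.model_selection import cross_validate
results = cross_validate(
    model, X, y, cv=5,
    scoring=['accuracy', 'f1_weighted'],
    return_train_score=True
)
print(f"Test F1:  {results['test_f1_weighted'].mean():.3f}")
print(f"Train F1: {results['train_f1_weighted'].mean():.3f}")  # a large gap suggests overfitting
```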
## Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15],
    'min_samples_split': [2, 5, 10]
}
grid_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
# Use best model
best_model = grid_search.best_estimator_
```
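Grid search grows combinatorially with the number of parameters; when the grid gets large, RandomizedSearchCV samples a fixed budget of configurations instead. A sketch with an illustrative budget of 20 draws:
```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distributions = {
    'n_estimators': randint(100, 500),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 11)
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=20,  # number of sampled configurations
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best params: {random_search.best_params_}")
```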
## Feature Engineering
```python
from sklearn.preprocessing import (
    StandardScaler, LabelEncoder, PolynomialFeatures
)
# Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Encoding target labels (LabelEncoder is intended for y, not feature columns)
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
# Polynomial features
poly = PolynomialFeatures(degree=2)
```
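For categorical feature columns, a common pattern is OneHotEncoder inside a ColumnTransformer. A sketch assuming a DataFrame with hypothetical column names:
```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# 'age', 'income', 'city' are placeholder column names for illustration
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city'])
])
X_transformed = preprocess.fit_transform(X)
```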
## Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
```
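Pipelines combine cleanly with the grid search shown earlier: prefix each parameter with the step name and a double underscore, and the scaler is then re-fit inside every CV fold, which is exactly what prevents leakage. A minimal sketch reusing the pipeline above:
```python
from sklearn.model_selection import GridSearchCV
param_grid = {
    'classifier__n_estimators': [100, 200],  # '<step>__<param>' addresses pipeline steps
    'classifier__max_depth': [5, 10, None]
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_weighted', n_jobs=-1)
search.fit(X_train, y_train)
print(f"Best params: {search.best_params_}")
```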
## Best Practices
1. Split data before any preprocessing to avoid leakage
2. Use cross-validation for reliable performance estimates
3. Scale features for distance-based models (SVM, k-NN, K-Means)
4. Handle class imbalance (SMOTE, class weights)
5. Check for overfitting by comparing train vs. test performance
6. Save trained models with joblib or pickle (see the sketch below)
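A minimal persistence sketch with joblib (the filename is illustrative):
```python
import joblib
# Persist the fitted pipeline so preprocessing and model travel together
joblib.dump(pipeline, 'model.joblib')
# Later, in another process or script
loaded = joblib.load('model.joblib')
predictions = loaded.predict(X_test)
```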
## Summary
This skill provides ready patterns for supervised and unsupervised learning with scikit-learn: building, training, evaluating, and selecting models for classification, regression, and clustering. It includes recipes for data splitting, scaling, encoding, pipelines, cross-validation, and grid-search tuning, with best-practice checks that guard against common pitfalls such as data leakage and overfitting.
## FAQ
**Which model should I try first for classification?**
Start with a simple, interpretable baseline such as logistic regression, then try tree-based models (Random Forest, XGBoost) for non-linear patterns.
**How do I prevent data leakage?**
Split into train/test sets before any preprocessing, and put transforms inside a Pipeline so fit/transform happens only on training data during cross-validation.