
This skill helps you apply scikit-learn for classification, regression, clustering, and pipelines with best practices for preprocessing and evaluation.

npx playbooks add skill eyadsibai/ltk --skill scikit-learn

Review the files below or copy the command above to add this skill to your agents.

SKILL.md
---
name: scikit-learn
description: Use when working with "scikit-learn", "sklearn", "machine learning", "classification", "regression", or "clustering", or when asked about "train test split", "cross validation", "hyperparameter tuning", "ML pipeline", "random forest", "SVM", or "preprocessing"
version: 1.0.0
---

# Scikit-learn Machine Learning

Industry-standard Python library for classical machine learning.

## When to Use

- Classification or regression tasks
- Clustering or dimensionality reduction
- Preprocessing and feature engineering
- Model evaluation and cross-validation
- Hyperparameter tuning
- Building ML pipelines

---

## Algorithm Selection

### Classification

| Algorithm | Best For | Strengths |
|-----------|----------|-----------|
| **Logistic Regression** | Baseline, interpretable | Fast, probabilistic |
| **Random Forest** | General purpose | Handles non-linear, feature importance |
| **Gradient Boosting** | Best accuracy | State of the art for tabular data |
| **SVM** | High-dimensional data | Works well with few samples |
| **KNN** | Simple problems | No training, instance-based |

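A minimal sketch of comparing a linear baseline against a random forest; the synthetic dataset and hyperparameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for your tabular dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Start with the interpretable baseline, then try the non-linear model.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

scores = {
    "logistic": baseline.score(X_test, y_test),
    "forest": forest.score(X_test, y_test),
}
print(scores)
```

If the baseline is already close to the forest, prefer it for its interpretability.
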
### Regression

| Algorithm | Best For | Notes |
|-----------|----------|-------|
| **Linear Regression** | Baseline | Interpretable coefficients |
| **Ridge/Lasso** | Regularization needed | L2 vs L1 penalty |
| **Random Forest** | Non-linear relationships | Robust to outliers |
| **Gradient Boosting** | Best accuracy | XGBoost, LightGBM wrappers |

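The L2-vs-L1 distinction can be sketched quickly; `alpha` values and the synthetic data are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Ridge shrinks all coefficients (L2); Lasso can zero some out (L1).
ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
lasso = Lasso(alpha=1.0).fit(X_tr, y_tr)
print(ridge.score(X_te, y_te), lasso.score(X_te, y_te))  # R² on held-out data
```
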
### Clustering

| Algorithm | Best For | Key Parameter |
|-----------|----------|---------------|
| **KMeans** | Spherical clusters | n_clusters (must specify) |
| **DBSCAN** | Arbitrary shapes | eps (density) |
| **Agglomerative** | Hierarchical | n_clusters or distance threshold |
| **Gaussian Mixture** | Soft clustering | n_components |

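A quick sketch of the key-parameter difference: KMeans needs `n_clusters` up front, while DBSCAN infers cluster count from density (the `eps` value here is tuned to this toy data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# KMeans: number of clusters is fixed ahead of time.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: clusters emerge from density; label -1 marks noise points.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

print(len(set(km_labels)))          # 3, by construction
print(len(set(db_labels) - {-1}))   # clusters discovered from density
```
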
### Dimensionality Reduction

| Method | Preserves | Use Case |
|--------|-----------|----------|
| **PCA** | Global variance | Feature reduction |
| **t-SNE** | Local structure | 2D/3D visualization |
| **UMAP** | Both local/global | Visualization + downstream (separate `umap-learn` package) |

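A minimal PCA sketch on a built-in dataset; scaling first matters because PCA directions are driven by variance, so raw units would dominate:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # put features on a common scale

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                              # (150, 2)
print(pca.explained_variance_ratio_.sum())     # variance kept by 2 components
```
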
---

## Pipeline Concepts

**Key concept**: Pipelines prevent data leakage by ensuring transformations are fit only on training data.

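The leakage point can be demonstrated directly: fitting a scaler on the full dataset before cross-validation lets test-fold statistics leak into training, while a pipeline refits the scaler inside each fold. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Leaky: the scaler sees all rows before the folds are split.
X_leaky = StandardScaler().fit_transform(X)
leaky = cross_val_score(SVC(), X_leaky, y, cv=5)

# Safe: the pipeline fits the scaler on each training fold only.
safe = cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y, cv=5)
print(leaky.mean(), safe.mean())
```
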
| Component | Purpose |
|-----------|---------|
| **Pipeline** | Sequential steps (transform → model) |
| **ColumnTransformer** | Apply different transforms to different columns |
| **FeatureUnion** | Combine multiple feature extraction methods |

**Common preprocessing flow**:

1. Impute missing values (SimpleImputer)
2. Scale numeric features (StandardScaler, MinMaxScaler)
3. Encode categoricals (OneHotEncoder, OrdinalEncoder)
4. Optional: feature selection or polynomial features

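The flow above can be sketched as a `ColumnTransformer` inside a `Pipeline`; the DataFrame and column names here are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with missing values; column names are illustrative.
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29, 55],
    "income": [40_000, 52_000, 61_000, None, 48_000, 90_000],
    "city": ["NY", "SF", "NY", "LA", None, "SF"],
})
y = [0, 1, 1, 0, 0, 1]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),   # step 1
    ("scale", StandardScaler()),                    # step 2
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),  # step 3
])

pre = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["city"]),
])

model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(df, y)          # every transform is fit on training data only
preds = model.predict(df)
print(preds.shape)
```

Because preprocessing lives inside the pipeline, `cross_val_score` and grid search refit it per fold automatically.
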
---

## Model Evaluation

### Cross-Validation Strategies

| Strategy | Use Case |
|----------|----------|
| **KFold** | General purpose |
| **StratifiedKFold** | Imbalanced classification |
| **TimeSeriesSplit** | Temporal data |
| **LeaveOneOut** | Very small datasets |

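A sketch of stratified cross-validation on an imbalanced synthetic dataset (class weights and fold count are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 80/20 class imbalance; stratification keeps that ratio in every fold.
X, y = make_classification(n_samples=400, weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="f1"
)
print(scores.mean(), scores.std())  # report mean ± spread, not one split
```
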
### Metrics

| Task | Metric | When to Use |
|------|--------|-------------|
| **Classification** | Accuracy | Balanced classes |
| | F1-score | Imbalanced classes |
| | ROC-AUC | Ranking, threshold tuning |
| | Precision/Recall | Domain-specific costs |
| **Regression** | RMSE | Penalize large errors |
| | MAE | Robust to outliers |
| | R² | Explained variance |

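For an imbalanced problem, the table suggests F1 and ROC-AUC rather than accuracy; a minimal sketch (data and split are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te))
# ROC-AUC ranks by score, so it uses probabilities, not hard labels.
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f1, auc)
```
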
---

## Hyperparameter Tuning

| Method | Pros | Cons |
|--------|------|------|
| **GridSearchCV** | Exhaustive | Slow for many params |
| **RandomizedSearchCV** | Faster | May miss optimal |
| **HalvingGridSearchCV** | Efficient (successive halving) | Experimental; sklearn 0.24+ and importing `enable_halving_search_cv` |

**Key concept**: Always tune on a validation set (or via cross-validation), and evaluate the final model once on a held-out test set.

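The key concept above can be sketched with `RandomizedSearchCV`: tuning uses cross-validation on the training split only, and the test split is touched once at the end (the parameter ranges here are illustrative):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(2, 10),
    },
    n_iter=5,          # samples 5 random combinations
    cv=3,              # tuning happens via CV on the training data
    random_state=0,
)
search.fit(X_tr, y_tr)
print(search.best_params_)
print(search.score(X_te, y_te))  # single final check on held-out data
```
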
---

## Best Practices

| Practice | Why |
|----------|-----|
| Split data first | Prevent leakage |
| Use pipelines | Reproducible, no leakage |
| Scale for distance-based | KNN, SVM, PCA need scaled features |
| Stratify imbalanced | Preserve class distribution |
| Cross-validate | Reliable performance estimates |
| Check learning curves | Diagnose over/underfitting |

---

## Common Pitfalls

| Pitfall | Solution |
|---------|----------|
| Fitting scaler on all data | Use pipeline or fit only on train |
| Using accuracy for imbalanced | Use F1, ROC-AUC, or balanced accuracy |
| Too many hyperparameters | Start simple, add complexity |
| Ignoring feature importance | Use `feature_importances_` or permutation importance |

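For the last pitfall, permutation importance is the model-agnostic option; a sketch on synthetic data (feature counts are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the score drop.
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean.shape)  # one mean importance per feature
```

Unlike `feature_importances_`, this is computed on held-out data and is not biased toward high-cardinality features.
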
## Resources

- Docs: <https://scikit-learn.org/>
- User Guide: <https://scikit-learn.org/stable/user_guide.html>
- Algorithm Cheat Sheet: <https://scikit-learn.org/stable/tutorial/machine_learning_map/>

Overview

This skill provides practical guidance for using scikit-learn, the industry-standard Python library for classical machine learning. It covers algorithm selection for classification, regression, clustering, and dimensionality reduction, plus preprocessing, pipelines, evaluation, and hyperparameter tuning. The goal is to help you choose methods and build reproducible workflows for tabular and structured data.

How this skill works

I summarize which algorithms are best suited for common tasks and highlight strengths and trade-offs (e.g., random forests for non-linear relationships, SVMs for high-dimensional data). I explain pipeline components such as Pipeline and ColumnTransformer that prevent data leakage, and I show evaluation strategies including cross-validation types and metrics. I also cover hyperparameter search options and common pitfalls to avoid when training models.

When to use it

  • Building classification or regression models for tabular data
  • Clustering or dimensionality reduction for exploration and visualization
  • Creating reproducible ML pipelines that include preprocessing and modeling
  • Evaluating models with cross-validation and appropriate metrics
  • Tuning hyperparameters (GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV)

Best practices

  • Split data into train/validation/test before any preprocessing to prevent leakage
  • Use Pipeline and ColumnTransformer so transforms are fit only on training data
  • Scale features for distance-based methods (KNN, SVM, PCA)
  • Use StratifiedKFold for imbalanced classification and TimeSeriesSplit for temporal data
  • Start with simple models and baseline metrics, then increase complexity and tune

Example use cases

  • Train a random forest with a preprocessing pipeline that imputes, encodes, and scales features
  • Use StratifiedKFold and F1-score to evaluate an imbalanced classification problem
  • Apply PCA or UMAP for visualization before clustering with KMeans or DBSCAN
  • Run RandomizedSearchCV to find good hyperparameter ranges, then refine with GridSearchCV
  • Build a production-ready pipeline that serializes preprocessing and model steps together

FAQ

When should I use GridSearchCV vs RandomizedSearchCV?

Use GridSearchCV for small, well-bounded hyperparameter grids when exhaustive search is feasible. Use RandomizedSearchCV to explore large or unknown parameter spaces faster; follow up with a focused grid if needed.

How do I avoid data leakage during preprocessing?

Always put preprocessing steps inside a Pipeline or ColumnTransformer so fit operations run only on training folds. Never fit scalers or imputers on the full dataset before splitting.

Which cross-validation should I pick?

Use KFold for general use, StratifiedKFold for imbalanced classes, and TimeSeriesSplit for temporal sequences where order matters.