ml-reviewer skill

This skill reviews ML/DL code for PyTorch/TensorFlow patterns, training optimization, data pipelines, and MLOps checks to improve model quality and reliability.

npx playbooks add skill physics91/claude-vibe --skill ml-reviewer

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
10.5 KB
---
name: ml-reviewer
description: |
  WHEN: Machine Learning/Deep Learning code review, PyTorch/TensorFlow patterns, Model training optimization, MLOps checks
  WHAT: Model architecture review + Training patterns + Data pipeline checks + GPU optimization + Experiment tracking
  WHEN NOT: Data analysis only → python-data-reviewer, General Python → python-reviewer
---

# ML Reviewer Skill

## Purpose
Reviews Machine Learning and Deep Learning code for PyTorch, TensorFlow, scikit-learn, and MLOps best practices.

## When to Use
- ML/DL project code review
- "PyTorch", "TensorFlow", "Keras", "scikit-learn", "model training" mentions
- Model performance, training optimization inspection
- Projects with ML framework dependencies

## Project Detection
- `torch`, `tensorflow`, `keras`, `sklearn` in requirements.txt/pyproject.toml
- `.pt`, `.pth`, `.h5`, `.pkl` model files
- `train.py`, `model.py`, `dataset.py` files
- Jupyter notebooks with ML imports
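
A minimal sketch of how this detection could be implemented; the helper name and scanning logic are illustrative assumptions, not the skill's actual code:

```python
# Hypothetical detection helper: flags a directory as an ML project if its
# manifests mention ML frameworks or it contains model files/training scripts.
from pathlib import Path

ML_PACKAGES = ("torch", "tensorflow", "keras", "scikit-learn", "sklearn")
MODEL_SUFFIXES = {".pt", ".pth", ".h5", ".pkl"}
TRAINING_FILES = {"train.py", "model.py", "dataset.py"}

def detect_ml_project(root: str) -> bool:
    root_path = Path(root)
    # Check dependency manifests for ML framework entries
    for manifest in ("requirements.txt", "pyproject.toml"):
        path = root_path / manifest
        if path.exists():
            text = path.read_text(encoding="utf-8", errors="ignore")
            if any(pkg in text for pkg in ML_PACKAGES):
                return True
    # Fall back to serialized model files and typical training scripts
    return any(
        p.suffix in MODEL_SUFFIXES or p.name in TRAINING_FILES
        for p in root_path.rglob("*")
    )
```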

## Workflow

### Step 1: Analyze Project
```
**Framework**: PyTorch / TensorFlow / scikit-learn
**Python**: 3.10+
**CUDA**: 11.x / 12.x
**Task**: Classification / Regression / NLP / CV
**Stage**: Research / Production
```

### Step 2: Select Review Areas
**AskUserQuestion:**
```
"Which areas to review?"
Options:
- Full ML pattern check (recommended)
- Model architecture review
- Training loop optimization
- Data pipeline efficiency
- MLOps/deployment patterns
multiSelect: true
```

## Detection Rules

### PyTorch Patterns
| Check | Risk | Severity |
|-------|------|----------|
| Missing model.eval() | Inconsistent Dropout/BatchNorm at inference | HIGH |
| Missing torch.no_grad() | Wasted memory from the autograd graph during inference | HIGH |
| In-place operations in autograd | Gradient computation errors | CRITICAL |
| DataLoader num_workers=0 | CPU data-loading bottleneck | MEDIUM |
| Missing gradient clipping | Exploding gradients | MEDIUM |

```python
import torch
from torch.utils.data import DataLoader

# BAD: Missing eval() and no_grad()
def predict(model, x):
    return model(x)  # Dropout/BatchNorm still behave as in training!

# GOOD: Proper inference mode
def predict(model, x):
    model.eval()
    with torch.no_grad():
        return model(x)

# BAD: In-place operation breaking autograd
x = torch.randn(10, requires_grad=True)
x += 1  # In-place op on a leaf tensor that requires grad: RuntimeError

# GOOD: Out-of-place operation
x = torch.randn(10, requires_grad=True)
x = x + 1

# BAD: DataLoader bottleneck
loader = DataLoader(dataset, batch_size=32)  # num_workers=0

# GOOD: Parallel data loading
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True,  # For GPU
    persistent_workers=True,
)

# BAD: No gradient clipping
loss.backward()
optimizer.step()

# GOOD: Clip gradients between backward() and step()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```
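
Taken together, a minimal training/validation loop sketch applying these patterns (assumes `model`, `criterion`, `optimizer`, `dataloader`, `val_loader`, and `device` are already defined):

```python
# Sketch only: combines train mode, gradient clipping, and proper inference.
model.train()
for x, y in dataloader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Clip after backward(), before step()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

# Validation: eval mode + no_grad for consistent, memory-efficient inference
model.eval()
with torch.no_grad():
    for x, y in val_loader:
        preds = model(x.to(device))
```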

### TensorFlow/Keras Patterns
| Check | Risk | Severity |
|-------|----------------|----------|
| Missing @tf.function | Performance loss | MEDIUM |
| Eager mode in production | Slow inference | HIGH |
| Large model in memory | OOM risk | HIGH |
| Missing mixed precision | Training inefficiency | MEDIUM |

```python
import tensorflow as tf

# BAD: No @tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x)
        loss = loss_fn(y, pred)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

# GOOD: Use @tf.function
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

# BAD: Missing mixed precision
model.fit(x_train, y_train, epochs=10)

# GOOD: Enable mixed precision (set the policy before the model is built,
# so its layers are created under mixed_float16)
tf.keras.mixed_precision.set_global_policy('mixed_float16')
model.fit(x_train, y_train, epochs=10)
```
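
One detail the snippet above glosses over: with `mixed_float16` in a custom `GradientTape` loop, the loss should be scaled to avoid float16 gradient underflow. A minimal sketch using Keras's `LossScaleOptimizer` (`model` and `loss_fn` assumed to exist):

```python
# Loss scaling for mixed precision in a custom training loop.
tf.keras.mixed_precision.set_global_policy('mixed_float16')
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.Adam()
)

@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        pred = model(x, training=True)
        loss = loss_fn(y, pred)
        scaled_loss = optimizer.get_scaled_loss(loss)  # avoid fp16 underflow
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```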

### scikit-learn Patterns
| Check | Risk | Severity |
|-------|------|----------|
| fit_transform on test data | Data leakage | CRITICAL |
| Missing cross-validation | Overfitting risk | HIGH |
| No feature scaling | Degraded model performance | MEDIUM |
| Missing random_state | Non-reproducible results | LOW |

```python
from sklearn.preprocessing import StandardScaler

# BAD: Data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)  # LEAK! Re-fits on test statistics

# GOOD: transform only on test
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # No re-fit

# BAD: No cross-validation
model.fit(X_train, y_train)
score = model.score(X_test, y_test)

# GOOD: Use cross-validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")

# BAD: No feature scaling
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

# GOOD: Use Pipeline with scaling
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
pipeline.fit(X_train, y_train)
```
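
The two fixes compose well: cross-validating the full Pipeline refits the scaler inside each fold, so preprocessing never leaks across splits. A minimal sketch (X and y assumed to be loaded):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is re-fit on each training fold only, preventing leakage
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")
```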

### Data Pipeline
| Check | Problem | Solution |
|-------|---------|----------|
| Loading full dataset to memory | OOM | Use generators/tf.data |
| No data augmentation | Overfitting | Add augmentation |
| Unbalanced classes | Biased model | Oversample/undersample/weights |
| No validation split | No early stopping | Use validation set |

```python
import numpy as np
import tensorflow as tf

# BAD: Full dataset in memory
images = []
for path in all_image_paths:
    images.append(load_image(path))  # OOM for large datasets!

# GOOD: Use generator
def data_generator(paths, batch_size):
    for i in range(0, len(paths), batch_size):
        batch_paths = paths[i:i+batch_size]
        yield np.array([load_image(p) for p in batch_paths])

# GOOD: Use tf.data
dataset = tf.data.Dataset.from_tensor_slices(paths)
dataset = dataset.map(load_and_preprocess)
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)

# BAD: No class weights for imbalanced data
model.fit(X_train, y_train)

# GOOD: Add class weights
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
class_weights = dict(enumerate(weights))
model.fit(X_train, y_train, class_weight=class_weights)
```
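
For PyTorch pipelines, a common counterpart to Keras class weights is oversampling with `WeightedRandomSampler`; a minimal sketch, assuming `dataset` and an array of integer labels `y`:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# Rarer classes get proportionally higher sampling weights
class_counts = np.bincount(y)
sample_weights = 1.0 / class_counts[y]
sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(y),
    replacement=True,
)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
```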

### GPU/Performance
| Check | Recommendation | Severity |
|-------|----------------|----------|
| CPU tensor operations | Use GPU tensors | HIGH |
| Frequent GPU-CPU transfer | Batch transfers | HIGH |
| No gradient accumulation | OOM for large batch | MEDIUM |
| Missing torch.cuda.empty_cache() | Memory fragmentation | LOW |

```python
# BAD: CPU operations
x = torch.randn(1000, 1000)
y = torch.randn(1000, 1000)
z = x @ y  # CPU computation

# GOOD: GPU operations
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)
z = x @ y  # GPU computation

# BAD: Frequent CPU-GPU transfer and sync
for x, y in dataloader:
    x = x.cuda()
    y = y.cuda()
    loss = model(x, y)
    print(loss.item())  # .item() forces a GPU sync every iteration!

# GOOD: Batch logging
losses = []
for step, (x, y) in enumerate(dataloader):
    x, y = x.to(device), y.to(device)
    loss = model(x, y)
    losses.append(loss.detach())  # detach so graphs aren't kept alive
    if (step + 1) % log_interval == 0:
        print(torch.stack(losses).mean().item())  # single sync per interval
        losses.clear()

# Gradient accumulation for large effective batch
accumulation_steps = 4
for i, (x, y) in enumerate(dataloader):
    loss = model(x, y) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```
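
Mixed precision appears in the best practices but is only shown for TensorFlow above; a minimal PyTorch AMP sketch (`model`, `criterion`, `optimizer`, `dataloader`, and `device` assumed to exist):

```python
# autocast runs the forward pass in float16 where safe; GradScaler
# scales the loss so small gradients don't underflow.
scaler = torch.cuda.amp.GradScaler()
for x, y in dataloader:
    x, y = x.to(device), y.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = criterion(model(x), y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)  # unscales gradients, then calls optimizer.step()
    scaler.update()
```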

### MLOps/Experiment Tracking
| Check | Risk | Severity |
|-------|------|----------|
| No experiment tracking | Untracked, hard-to-reproduce runs | HIGH |
| Hardcoded hyperparameters | Unmanageable configuration | MEDIUM |
| No model versioning | Deployment issues | MEDIUM |
| Missing seed setting | Non-reproducible results | HIGH |

```python
# BAD: No seed setting
model = train_model(X, y)

# GOOD: Set all seeds
import random
import numpy as np
import torch

def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True

set_seed(42)

# BAD: Hardcoded hyperparameters
lr = 0.001
batch_size = 32
epochs = 100

# GOOD: Use config file or hydra
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="configs", config_name="train")
def train(cfg: DictConfig):
    model = build_model(cfg.model)
    optimizer = torch.optim.Adam(model.parameters(), lr=cfg.lr)

# GOOD: Use experiment tracking
import wandb
wandb.init(project="my-project", config=cfg)
for epoch in range(epochs):
    loss = train_epoch(model, dataloader)
    wandb.log({"loss": loss, "epoch": epoch})
wandb.finish()
```
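
For the model-versioning check, a minimal sketch of a checkpoint that carries enough state to resume and reproduce a run (the metadata fields are illustrative assumptions):

```python
# Save weights plus the context needed to reproduce the run
checkpoint = {
    "model_state": model.state_dict(),
    "optimizer_state": optimizer.state_dict(),
    "epoch": epoch,
    "config": dict(cfg),  # or OmegaConf.to_container(cfg) for a plain dict
    "seed": 42,
}
torch.save(checkpoint, f"checkpoints/model_epoch{epoch:03d}.pt")

# Restore
ckpt = torch.load(f"checkpoints/model_epoch{epoch:03d}.pt")
model.load_state_dict(ckpt["model_state"])
optimizer.load_state_dict(ckpt["optimizer_state"])
```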

## Response Template
```
## ML Code Review Results

**Project**: [name]
**Framework**: PyTorch/TensorFlow/scikit-learn
**Task**: Classification/Regression/NLP/CV
**Files Analyzed**: X

### Model Architecture
| Status | File | Issue |
|--------|------|-------|
| MEDIUM | models/resnet.py | Missing dropout for regularization |
| LOW | models/transformer.py | Consider gradient checkpointing |

### Training Loop
| Status | File | Issue |
|--------|------|-------|
| HIGH | train.py | Missing model.eval() in validation (line 45) |
| HIGH | train.py | No gradient clipping (line 67) |

### Data Pipeline
| Status | File | Issue |
|--------|------|-------|
| CRITICAL | data/dataset.py | fit_transform on test data (line 23) |
| HIGH | data/loader.py | DataLoader num_workers=0 |

### MLOps
| Status | File | Issue |
|--------|------|-------|
| HIGH | train.py | No seed setting for reproducibility |
| MEDIUM | train.py | Hardcoded hyperparameters |

### Recommended Actions
1. [ ] Add model.eval() and torch.no_grad() for inference
2. [ ] Fix data leakage in preprocessing
3. [ ] Set random seeds for reproducibility
4. [ ] Add experiment tracking (wandb/mlflow)
```

## Best Practices
1. **Training**: eval mode, no_grad, gradient clipping, mixed precision
2. **Data**: No leakage, proper splits, augmentation, balanced classes
3. **Performance**: GPU operations, batch transfers, gradient accumulation
4. **MLOps**: Seed setting, experiment tracking, config management
5. **Testing**: Unit tests for the data pipeline and model output shape tests (see the sketch below)
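
A minimal pytest-style shape test, as referenced above (`MyModel` and its dimensions are hypothetical):

```python
import torch

def test_model_output_shape():
    model = MyModel(num_classes=10)    # hypothetical model class
    x = torch.randn(4, 3, 224, 224)    # batch of 4 RGB 224x224 images
    out = model(x)
    assert out.shape == (4, 10)        # (batch, num_classes)
    assert not torch.isnan(out).any()  # guards against bad init or ops
```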

## Integration
- `python-reviewer` skill: General Python code quality
- `python-data-reviewer` skill: Data preprocessing patterns
- `test-generator` skill: ML test generation
- `docker-reviewer` skill: ML containerization

## Notes
- Based on PyTorch 2.x, TensorFlow 2.x, scikit-learn 1.x
- Supports distributed training patterns (DDP, FSDP)
- Includes MLOps patterns (wandb, mlflow, hydra)

Overview

This skill reviews machine learning and deep learning code and configurations for PyTorch, TensorFlow/Keras, and scikit-learn projects. It identifies model architecture issues, training loop mistakes, data pipeline problems, GPU/performance pitfalls, and MLOps gaps. The goal is actionable recommendations to improve correctness, performance, and reproducibility.

How this skill works

The reviewer scans project files and common manifests for ML frameworks, model files, and training scripts. It inspects model definitions, training/validation loops, data preprocessing, DataLoader/dataset usage, GPU transfers, and experiment/config management. For each finding it reports severity, file/line hints, and concrete fixes (code suggestions or tooling).

When to use it

  • Code reviews for ML/DL projects using PyTorch, TensorFlow/Keras, or scikit-learn
  • When validating model training, inference correctness, or GPU performance
  • Before deploying models to production or preparing reproducible experiments
  • When adding experiment tracking, config management, or model versioning
  • To audit data pipelines for leakage, augmentation, and memory usage

Best practices

  • Run inference with model.eval() and torch.no_grad() in PyTorch, or traced @tf.function graphs in TensorFlow
  • Prevent data leakage: fit scalers on train only, use validation split and CV
  • Optimize GPU usage: move tensors to device, batch GPU transfers, use mixed precision
  • Manage experiments: set seeds, track runs (wandb/mlflow), and externalize hyperparameters
  • Set DataLoader num_workers, use tf.data or generators for large datasets, and add augmentation

Example use cases

  • Review a PyTorch repo for missing model.eval(), in-place ops, and gradient clipping
  • Audit a Keras training loop for missing @tf.function, eager-mode overhead, or OOM risks
  • Detect scikit-learn data leakage from fit_transform on test data and missing cross-validation
  • Improve GPU throughput by flagging CPU tensor ops, frequent CPU-GPU transfers, or lack of accumulation
  • Assess MLOps maturity: missing experiment tracking, hardcoded hyperparameters, or no model versioning

FAQ

Which files does this skill analyze?

It inspects training scripts (train.py), model modules (model.py, models/), dataset/loaders, notebooks with ML imports, and project manifests like requirements.txt or pyproject.toml.

Can it suggest code fixes?

Yes: findings include concrete recommendations and small example snippets (e.g., adding no_grad(), enabling mixed precision, or using transform instead of fit_transform on test data).