
This skill helps you monitor, analyze, and compare Weights & Biases training runs to detect failures and optimize performance.

npx playbooks add skill openclaw/skills --skill wandb-monitor

Copy the command above to add this skill to your agents.

SKILL.md
---
name: wandb
description: Monitor and analyze Weights & Biases training runs. Use when checking training status, detecting failures, analyzing loss curves, comparing runs, or monitoring experiments. Triggers on "wandb", "training runs", "how's training", "did my run finish", "any failures", "check experiments", "loss curve", "gradient norm", "compare runs".
---

# Weights & Biases

Monitor, analyze, and compare W&B training runs.

## Setup

```bash
wandb login
# Or set WANDB_API_KEY in environment
```

## Scripts

### Characterize a Run (Full Health Analysis)

```bash
~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/characterize_run.py ENTITY/PROJECT/RUN_ID
```

Analyzes:
- Loss curve trend (start → current, % change, direction)
- Gradient norm health (exploding/vanishing detection)  
- Eval metrics (if present)
- Stall detection (heartbeat age)
- Progress & ETA estimate
- Config highlights
- Overall health verdict

Options: `--json` for machine-readable output.
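The loss-trend portion of this analysis can be sketched in a few lines. This is a simplified illustration, not the script's actual implementation; `loss_trend` and the ±1% flat band are assumptions for the example:

```python
def loss_trend(losses):
    """Summarize a loss series: start, current, % change, direction.

    Hypothetical sketch of the trend check characterize_run.py performs;
    the real script's smoothing and thresholds may differ.
    """
    if len(losses) < 2:
        return {"direction": "unknown"}
    start, current = losses[0], losses[-1]
    pct = (current - start) / abs(start) * 100 if start else float("inf")
    if pct < -1:
        direction = "decreasing"
    elif pct > 1:
        direction = "increasing"
    else:
        direction = "flat"
    return {"start": start, "current": current,
            "pct_change": pct, "direction": direction}
```

A run whose loss went from 2.0 to 1.0 would report a -50% change with direction "decreasing".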

### Watch All Running Jobs

```bash
~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/watch_runs.py ENTITY [--projects p1,p2]
```

Quick health summary of all running jobs plus recent failures/completions. Ideal for morning briefings.

Options:
- `--projects p1,p2` — Specific projects to check
- `--all-projects` — Check all projects
- `--hours N` — Hours to look back for finished runs (default: 24)
- `--json` — Machine-readable output
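The `--json` report can be consumed like any other JSON document. A minimal sketch, assuming (hypothetically) an object with a `runs` list whose entries carry `name` and `state` fields; inspect the script's real output schema before relying on these names:

```python
import json

def summarize(report_json):
    """Count runs by state from a watch_runs.py --json report.

    The {"runs": [{"name": ..., "state": ...}]} shape is an assumption
    for illustration, not the script's documented schema.
    """
    runs = json.loads(report_json).get("runs", [])
    counts = {}
    for r in runs:
        counts[r["state"]] = counts.get(r["state"], 0) + 1
    return counts
```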

### Compare Two Runs

```bash
~/clawd/venv/bin/python3 ~/clawd/skills/wandb/scripts/compare_runs.py ENTITY/PROJECT/RUN_A ENTITY/PROJECT/RUN_B
```

Side-by-side comparison:
- Config differences (highlights important params)
- Loss curves at same steps
- Gradient norm comparison
- Eval metrics
- Performance (tokens/sec, steps/hour)
- Winner verdict
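The config-difference step above amounts to a dictionary diff. A minimal sketch (`config_diff` is a hypothetical helper; the actual script may additionally rank keys by importance):

```python
def config_diff(cfg_a, cfg_b):
    """Return {key: (a_value, b_value)} for hyperparameters that differ.

    Keys missing from one run show up with None on that side.
    """
    keys = set(cfg_a) | set(cfg_b)
    return {
        k: (cfg_a.get(k), cfg_b.get(k))
        for k in sorted(keys)
        if cfg_a.get(k) != cfg_b.get(k)
    }
```

For two runs that differ only in learning rate, this yields a single-entry diff such as `{"lr": (0.001, 0.0001)}`.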

## Python API Quick Reference

```python
import wandb
api = wandb.Api()

# List runs matching a MongoDB-style filter
runs = api.runs("entity/project", {"state": "running"})

# Run properties
run.state      # running | finished | failed | crashed | canceled
run.name       # display name
run.id         # unique identifier
run.summary    # final/current metrics
run.config     # hyperparameters
run.heartbeat_at # stall detection

# Get history
history = list(run.scan_history(keys=["train/loss", "train/grad_norm"]))
```
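Stall detection from `heartbeat_at` reduces to computing the heartbeat's age. A sketch, assuming the timestamp arrives as an ISO-8601 UTC string (verify the format your API version returns):

```python
from datetime import datetime, timezone

def heartbeat_age_minutes(heartbeat_at, now=None):
    """Minutes since the run's last heartbeat.

    Assumes heartbeat_at is an ISO-8601 UTC timestamp string,
    e.g. "2024-05-01T12:00:00".
    """
    last = datetime.fromisoformat(heartbeat_at).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - last).total_seconds() / 60
```

An age above 30 minutes would map to the "stalled (critical)" threshold described below.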

## Metric Key Variations

Scripts handle these automatically:
- Loss: `train/loss`, `loss`, `train_loss`, `training_loss`
- Gradients: `train/grad_norm`, `grad_norm`, `gradient_norm`
- Steps: `train/global_step`, `global_step`, `step`, `_step`
- Eval: `eval/loss`, `eval_loss`, `eval/accuracy`, `eval_acc`
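The key-variation handling above can be sketched as a first-match lookup over a history row (`resolve_key` is an illustrative helper, not the scripts' actual function):

```python
LOSS_KEYS = ("train/loss", "loss", "train_loss", "training_loss")

def resolve_key(row, candidates):
    """Return the first candidate metric key present in a history row.

    Returns None when no variant is logged, so callers can skip
    the metric rather than crash on a KeyError.
    """
    for key in candidates:
        if key in row and row[key] is not None:
            return key
    return None
```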

## Health Thresholds

- **Gradient norm > 10**: Exploding (critical)
- **Gradient norm > 5**: Spiky (warning)
- **Gradient norm < 0.0001**: Vanishing (warning)
- **Heartbeat age > 30 min**: Stalled (critical)
- **Heartbeat age > 10 min**: Slow (warning)
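The gradient thresholds above translate directly into a small classifier (a sketch; `grad_health` is a hypothetical name, and the boundary handling mirrors the table as written):

```python
def grad_health(grad_norm):
    """Map a gradient norm to (label, severity) per the thresholds above."""
    if grad_norm > 10:
        return ("exploding", "critical")
    if grad_norm > 5:
        return ("spiky", "warning")
    if grad_norm < 0.0001:
        return ("vanishing", "warning")
    return ("healthy", "ok")
```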

## Integration Notes

For morning briefings, use `watch_runs.py --json` and parse the output.

For detailed analysis of a specific run, use `characterize_run.py`.

For A/B testing or hyperparameter comparisons, use `compare_runs.py`.

Overview

This skill monitors and analyzes Weights & Biases (W&B) training runs to give fast, actionable health checks and comparisons. It surfaces loss trends, gradient-norm issues, stalling, ETA, and key config differences so you can quickly spot failures or regressions. Use it for morning briefings, run triage, and A/B experiment comparisons.

How this skill works

The skill queries the W&B API to list runs, read run metadata, and stream history for key metrics (loss, gradient norm, step, eval metrics). It runs health checks against configurable thresholds, detects stalls via heartbeat timestamps, and generates human- and machine-readable summaries. Comparison tools align runs by step and highlight config and metric deltas.

When to use it

  • Check current training status or progress ("how's training?")
  • Detect failures, stalls, or exploding/vanishing gradients
  • Compare two runs or perform A/B hyperparameter analysis
  • Generate a morning briefing of all running jobs and recent completions
  • Estimate ETA and progress for long-running experiments

Best practices

  • Log standard metric keys (loss, grad_norm, step), or rely on the skill's built-in key-variation matching, so metrics are detected reliably
  • Use --json output for programmatic pipelines or morning briefing dashboards
  • Include heartbeat updates in your training loop to avoid false stall alerts
  • Run characterize_run.py for deep analysis of a single run and watch_runs.py for team-wide overviews
  • Compare runs at the same step granularity to avoid misleading curve mismatches
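Same-step comparison, as recommended above, can be sketched as an intersection over `_step` values (an illustrative helper; the actual comparison script may interpolate or bucket steps instead):

```python
def align_by_step(hist_a, hist_b, key="train/loss"):
    """Pair metric values from two runs at their common steps.

    hist_a and hist_b are lists of W&B history rows, each containing
    "_step" and the metric key; only shared steps are compared, which
    avoids misleading curve mismatches.
    """
    a = {r["_step"]: r[key] for r in hist_a if key in r}
    b = {r["_step"]: r[key] for r in hist_b if key in r}
    return [(s, a[s], b[s]) for s in sorted(set(a) & set(b))]
```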

Example use cases

  • Run characterize_run.py ENTITY/PROJECT/RUN_ID to get loss trend, gradient health, progress, and a final verdict
  • Use watch_runs.py ENTITY --projects p1,p2 to create a morning summary of active experiments and recent failures
  • Invoke compare_runs.py to highlight config diffs, aligned loss curves, gradient comparisons, and pick a winner for A/B testing
  • Pipe --json output into an internal dashboard or Slack bot for automated alerts
  • Detect stalls by checking run.heartbeat_at and alert if heartbeat > 30min

FAQ

What metric names does the skill recognize?

It maps common variants automatically (e.g., `train/loss`, `loss`, `train_loss`; `train/grad_norm`, `grad_norm`). Specify keys explicitly in code only if you use nonstandard names.

How are stalls detected?

Stall detection compares `run.heartbeat_at` against thresholds: more than 30 minutes since the last heartbeat is critical (stalled), more than 10 minutes is a warning. These values are configurable in your monitoring wrapper.