home / skills / serendipityoneinc / srp-claude-code-marketplace / slurm
/plugins/srp-developer/skills/slurm
npx playbooks add skill serendipityoneinc/srp-claude-code-marketplace --skill slurmReview the files below or copy the command above to add this skill to your agents.
---
name: slurm
description: Submit, manage, and monitor GPU workloads on SRP's Slurm clusters with Apptainer containers
---
# Slurm Cluster Management
Help developers submit, manage, and troubleshoot GPU-accelerated workloads on SRP's Slurm clusters. Supports training, inference, and data processing jobs using Apptainer containers.
## When to Use This Skill
Use this skill when:
- Submitting GPU training or inference jobs to Slurm clusters
- Managing running or queued jobs
- Monitoring cluster resources and job status
- Debugging job failures or performance issues
- Writing Slurm job scripts with Apptainer containers
- Checking GPU availability and utilization
## SRP Slurm Clusters
### Oracle OKE Cluster (H100 GPUs)
**SSH Access:**
```bash
ssh -p 2222 <your-ldap-username>@129.80.180.16
# Example:
ssh -p 2222 [email protected]
```
**GPU Type:** H100
**Partition:** `h100` (must specify in job scripts)
**Use Cases:** Large model training, high-performance inference
### DO DOKS Cluster (H200 GPUs)
**SSH Access:**
```bash
ssh -p 2222 <your-ldap-username>@129.212.240.50
# Example:
ssh -p 2222 [email protected]
```
**GPU Type:** H200
**Partition:** Specify in job scripts
**Use Cases:** Latest GPU workloads, large-scale training
### Data Access
Both clusters use **JuiceFS** for unified data access:
- Path: `/data0/` or `/data/srp/`
- Same permissions and directory structure as development machines
- Shared across all cluster nodes and with A10 dev machines
### Monitoring
**Oracle OKE Cluster Dashboards:**
- Cluster Overview: https://grafana.g.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload Monitoring: https://grafana.g.yesy.site/d/f2c83374-71e2-42c6-92a1-10505b584cf2/workload
- Job-Level Stats: https://grafana.g.yesy.site/d/HRLkiLS7k/slurmjobstats
**DO DOKS Cluster Dashboards:**
- Cluster Overview: https://grafana.g2.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload Monitoring: https://grafana.g2.yesy.site/d/workload/workload
- Job-Level Stats: https://grafana.g2.yesy.site/d/slurm/slurm
**Metrics Available:**
- Cluster resource utilization
- GPU availability and usage
- Job queue status
- Per-job resource consumption
- Historical workload patterns
## Essential Slurm Commands
### Job Submission
```bash
# Submit batch job script
sbatch job_script.sh
# Submit with ssubmit wrapper (recommended)
ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 -cmd "python train.py"
# Interactive job allocation
salloc --partition=h100 --gres=gpu:1 --time=01:00:00
# Run command directly
srun --partition=h100 --gres=gpu:1 python test.py
```
### Job Management
```bash
# View your jobs
squeue -u $USER
# View all jobs
squeue
# View specific job details
scontrol show job <job_id>
# Cancel job
scancel <job_id>
# Cancel all your jobs
scancel -u $USER
# Cancel jobs by name
scancel --name=job_name
```
### Cluster Information
```bash
# View partitions and nodes
sinfo
# View detailed node info
sinfo -N -l
# Check GPU availability
sinfo -o "%20N %10c %10m %25f %10G"
# View specific partition
sinfo -p h100
```
### Job History
```bash
# View completed jobs
sacct
# View specific job details
sacct -j <job_id> --format=JobID,JobName,Partition,AllocCPUS,State,ExitCode
# View jobs from last week
sacct --starttime=now-7days --format=JobID,JobName,Elapsed,State,ExitCode
```
## Job Script Structure
### Modern Slurm Script (Simplified)
The new Slinky Slurm clusters use **prolog/epilog** for notifications, so scripts are much simpler:
```bash
#!/bin/bash
#SBATCH --output=logs/%x_%j.out
#SBATCH --error=logs/%x_%j.err
#SBATCH --job-name=my-training-job
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --nodes=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=32GB
#SBATCH --time=02:00:00
#SBATCH --mail-type=ALL
#SBATCH [email protected]
set -x
#==============================
# Environment Setup
#==============================
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=$(shuf -i 1000-65535 -n 1)
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
# Set your tokens (replace with actual values)
export HF_TOKEN=your_huggingface_token_here
export WANDB_API_KEY=your_wandb_api_key_here
export WANDB_PROJECT=${SLURM_JOB_NAME}
export WANDB_NAME=${SLURM_JOB_NAME}-$(date +%Y%m%d%H%M%S)
#==============================
# Pre-task initialization
#==============================
echo "Running pre-task initialization..."
# Your setup commands here
#==============================
# Main Job Execution
#==============================
echo "Starting main task..."
srun -v -l --jobid $SLURM_JOBID --job-name=${SLURM_JOB_NAME} \
--output $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.out \
--error $SLURM_SUBMIT_DIR/logs/%x_%j_%s_%t_%N.err \
apptainer run --fakeroot --writable-tmpfs --nv \
/data0/apptainer/pytorch_24.01-py3.sif bash -ex << 'EOF'
# ==== YOUR JOB COMMANDS START ====
echo "Training started at $(date)"
python train.py \
--model gpt2 \
--batch-size 32 \
--epochs 10 \
--output-dir /data0/models/
nvidia-smi
echo "Training completed at $(date)"
# ==== YOUR JOB COMMANDS END ====
EOF
```
### Key SBATCH Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `--job-name` | Job name (shows in squeue) | `my-training` |
| `--partition` | Cluster partition | `h100` |
| `--gres` | GPU resources | `gpu:H100:1` (1 GPU)<br>`gpu:H100:2` (2 GPUs) |
| `--nodes` | Number of nodes | `1` (single node)<br>`2` (distributed) |
| `--cpus-per-task` | CPUs per task | `10` |
| `--mem` | Memory per node | `32GB` |
| `--time` | Max runtime | `02:00:00` (2 hours) |
| `--output` | stdout log file | `logs/%x_%j.out` |
| `--error` | stderr log file | `logs/%x_%j.err` |
| `--mail-type` | Email notification | `ALL`, `FAIL`, `END` |
**Log File Placeholders:**
- `%x` - Job name
- `%j` - Job ID
- `%s` - Step ID
- `%t` - Task ID
- `%N` - Node name
### Multi-Node Distributed Training
```bash
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --partition=h100
#SBATCH --nodes=2
#SBATCH --gres=gpu:H100:2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH --mem=64GB
#SBATCH --time=04:00:00
set -x
# Distributed training setup
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=12345
export WORLD_SIZE=$((SLURM_NNODES * SLURM_NTASKS_PER_NODE))
srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
python -m torch.distributed.launch \
--nproc_per_node=$SLURM_NTASKS_PER_NODE \
--nnodes=$SLURM_NNODES \
--node_rank=$SLURM_NODEID \
--master_addr=$MASTER_ADDR \
--master_port=$MASTER_PORT \
train_distributed.py
```
## Using Apptainer Containers
### Available Container Images
**Location:** `/data0/apptainer/`
**Common Images:**
- `pytorch_24.01-py3.sif` - PyTorch 24.01 with Python 3
- `ray_2.52.0-py310-gpu.sif` - Ray 2.52.0 with Python 3.10
- Custom images built for specific projects
### Apptainer Command Patterns
```bash
# Run container with GPU support
apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif python script.py
# Shell into container
apptainer shell --nv /data0/apptainer/pytorch_24.01-py3.sif
# Execute single command
apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif nvidia-smi
# With additional flags
apptainer run --fakeroot --writable-tmpfs --nv <image.sif> <command>
```
**Common Flags:**
- `--nv` - Enable NVIDIA GPU support
- `--fakeroot` - Fake root user privileges (for installing packages)
- `--writable-tmpfs` - Create writable temporary filesystem
- `--bind <src>:<dst>` - Mount additional directories
### Interactive Container Session
```bash
# Start interactive job with Apptainer
sapptainer -c 20 -m 200G -g 1 -p h100 -i /data0/apptainer/pytorch_24.01-py3.sif
# Parameters:
# -c: CPUs
# -m: Memory
# -g: GPUs
# -p: Partition
# -i: Container image
```
## Using ssubmit Wrapper
SRP provides `ssubmit` wrapper for simplified job submission:
```bash
# Basic usage
ssubmit -j job_name -p h100 -g 1 -c 10 -m 32G -t 2:00:00 \
-cmd "python train.py"
# With custom script
ssubmit -j my-job -p h100 -g 2 -s job_script.sh
# Interactive mode
ssubmit -j interactive -p h100 -g 1 -i
```
**Parameters:**
- `-j` - Job name
- `-p` - Partition (h100, compute)
- `-g` - Number of GPUs
- `-c` - Number of CPUs
- `-m` - Memory (e.g., 32G)
- `-t` - Time limit (HH:MM:SS)
- `-cmd` - Command to run
- `-s` - Script file to execute
- `-i` - Interactive mode
**Reference:** https://github.com/SerendipityOneInc/llm-jobs/blob/main/slurm/ssubmit-examples/README.md
## Feishu Notifications
Slurm clusters automatically send **Feishu notifications** for job events via prolog/epilog:
**Notification Types:**
- ✅ Job started
- ✅ Job completed successfully
- ❌ Job failed with error code
- ⏱️ Job timeout
- 🛑 Job cancelled
**Notification Channel:** `[email protected]`
**What's Included:**
- Job ID, name, partition
- Node allocation
- Start and end time
- Exit status
- Resource usage summary
- Log file locations
**No Action Needed:** Notifications are automatic - no need to add notification code to your scripts.
## Best Practices
### Resource Allocation
1. **Request What You Need:**
- Don't over-request CPUs/memory - it delays scheduling
- Start with minimal resources, scale up if needed
2. **GPU Utilization:**
- Use `nvidia-smi` to verify GPU is being used
- Monitor GPU memory with `nvidia-smi dmon`
3. **Time Limits:**
- Set realistic time limits (slightly above expected)
- Jobs exceeding time limit are killed
4. **Partitions:**
- Always specify partition explicitly
- Use `h100` for Oracle, appropriate partition for DO
### Job Organization
```bash
# Organize logs by date
#SBATCH --output=logs/%Y%m%d/%x_%j.out
#SBATCH --error=logs/%Y%m%d/%x_%j.err
# Or by job name
#SBATCH --output=logs/%x/%j.out
#SBATCH --error=logs/%x/%j.err
```
### Checkpoint and Resume
```python
# Save checkpoints periodically
import torch
import os
checkpoint_dir = "/data0/checkpoints"
checkpoint_path = os.path.join(checkpoint_dir, f"model_epoch_{epoch}.pt")
torch.save({
'epoch': epoch,
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'loss': loss,
}, checkpoint_path)
# Resume from checkpoint
if os.path.exists(checkpoint_path):
checkpoint = torch.load(checkpoint_path)
model.load_state_dict(checkpoint['model_state_dict'])
start_epoch = checkpoint['epoch'] + 1
```
### Error Handling
```bash
# Set bash options for safety
set -e # Exit on error
set -u # Error on undefined variable
set -x # Print commands (useful for debugging)
set -o pipefail # Exit on pipe failure
# Add error traps
trap 'echo "Error on line $LINENO"; exit 1' ERR
```
## Monitoring and Debugging
### Check Job Status
```bash
# Detailed job info
scontrol show job <job_id>
# Watch job queue
watch -n 5 squeue -u $USER
# Check why job is pending
squeue -j <job_id> --start
```
### View Logs
```bash
# Tail logs while job runs
tail -f logs/job_name_12345.out
# View last 100 lines
tail -n 100 logs/job_name_12345.out
# Search for errors
grep -i error logs/job_name_12345.err
```
### GPU Monitoring
```bash
# Inside running job container
nvidia-smi
# Continuous monitoring
nvidia-smi dmon
# Detailed GPU utilization
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.free --format=csv -l 5
```
### Resource Usage
```bash
# Check job efficiency
seff <job_id>
# Detailed accounting
sacct -j <job_id> --format=JobID,JobName,Elapsed,CPUTime,MaxRSS,State
```
## Common Issues and Solutions
| Issue | Cause | Solution |
|-------|-------|----------|
| **Job pending forever** | No available resources | Check `sinfo` for available GPUs; adjust resource requests |
| **"Out of memory" error** | Insufficient memory request | Increase `--mem` in job script |
| **GPU not detected** | Missing `--gres` or `--nv` | Add `--gres=gpu:X` to sbatch, `--nv` to apptainer |
| **Container not found** | Wrong image path | Verify path in `/data0/apptainer/` |
| **Permission denied** | File permissions issue | Check file ownership and permissions |
| **Module not found** | Missing Python packages | Install in container or use different image |
| **NCCL timeout** | Network issues in distributed training | Check NCCL env vars, verify nodes can communicate |
| **Killed job (OOM)** | Memory exceeded | Reduce batch size or increase `--mem` |
## Quick Reference
### Essential Commands
```bash
# Submit job
sbatch job.sh
# Check queue
squeue -u $USER
# Job details
scontrol show job <job_id>
# Cancel job
scancel <job_id>
# View logs
tail -f logs/job_*.out
# Cluster info
sinfo -p h100
# Job history
sacct --starttime=today
```
### Example Workflows
#### 1. Quick GPU Test
```bash
# Submit test job
sbatch << 'EOF'
#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --partition=h100
#SBATCH --gres=gpu:1
#SBATCH --time=00:10:00
#SBATCH --output=test_%j.out
srun apptainer exec --nv /data0/apptainer/pytorch_24.01-py3.sif \
nvidia-smi
EOF
```
#### 2. Training with Checkpoints
```bash
#!/bin/bash
#SBATCH --job-name=training-with-checkpoint
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --time=04:00:00
#SBATCH --signal=B:USR1@60
checkpoint_handler() {
echo "Received signal, saving checkpoint..."
# Signal Python process to save checkpoint
pkill -USR1 -f train.py
}
trap checkpoint_handler USR1
srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
python train.py \
--checkpoint-dir /data0/checkpoints \
--resume-if-exists
```
#### 3. Batch Processing
```bash
#!/bin/bash
#SBATCH --job-name=batch-inference
#SBATCH --partition=h100
#SBATCH --gres=gpu:H100:1
#SBATCH --array=0-9
#SBATCH --time=01:00:00
# Process 10 shards in parallel
SHARD_ID=$SLURM_ARRAY_TASK_ID
srun apptainer run --nv /data0/apptainer/pytorch_24.01-py3.sif \
python inference.py \
--input /data0/input/shard_${SHARD_ID}.json \
--output /data0/output/shard_${SHARD_ID}.json
```
## Resources
### Official Documentation
- Slurm Commands: https://slurm.schedmd.com/man_index.html
- Slurm Quick Start: https://slurm.schedmd.com/quickstart.html
- Apptainer User Guide: https://apptainer.org/docs/user/latest/
### SRP Resources
- Deployment Guide: https://starquest.feishu.cn/wiki/TZASwm86nivXLTkMV6kcoJF4n2I
- **Oracle OKE Grafana:**
- Cluster: https://grafana.g.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload: https://grafana.g.yesy.site/d/f2c83374-71e2-42c6-92a1-10505b584cf2/workload
- Job Stats: https://grafana.g.yesy.site/d/HRLkiLS7k/slurmjobstats
- **DO DOKS Grafana:**
- Cluster: https://grafana.g2.yesy.site/d/edrg5th9t1edcb/slinky-slurm
- Workload: https://grafana.g2.yesy.site/d/workload/workload
- Job Stats: https://grafana.g2.yesy.site/d/slurm/slurm
- ssubmit Examples: https://github.com/SerendipityOneInc/llm-jobs/blob/main/slurm/ssubmit-examples/README.md
## Implementation Steps
When helping users with Slurm jobs:
1. **Understand Requirements:**
- What workload type? (training, inference, data processing)
- GPU requirements (quantity, memory)
- Expected runtime
- Data input/output locations
2. **Choose Cluster:**
- Oracle OKE (H100) for most workloads
- DO DOKS (H200) for cutting-edge GPU needs
3. **Write Job Script:**
- Use modern simplified template (no notification code)
- Specify appropriate resources
- Use Apptainer container with `--nv` flag
- Set up proper logging
4. **Submit and Monitor:**
- Submit with `sbatch` or `ssubmit`
- Monitor with `squeue` and Grafana
- Check logs for errors
- Verify GPU utilization
5. **Debug Issues:**
- Check Feishu notifications for failure reasons
- Review log files
- Use `scontrol` for detailed job info
- Consult troubleshooting table
6. **Optimize:**
- Adjust batch sizes based on GPU memory
- Use job arrays for parallel processing
- Implement checkpointing for long runs
- Monitor resource usage with `sacct` and `seff`