home / skills / fl-sean03 / agentic-science-worker / hpc-cluster
hpc-cluster skill

not checked
npx playbooks add skill fl-sean03/agentic-science-worker --skill hpc-cluster
Review the files below or copy the command above to add this skill to your agents.
Files (2)
SKILL.md
21.1 KB
---
name: hpc-cluster
description: Run jobs on CU Boulder CURC HPC cluster (Alpine). Use when simulations need more compute than the local workstation, for large-scale parallel jobs, or when GPU resources are needed beyond local availability. You have full SSH access - work like a researcher.
allowed-tools:
  - Read
  - Write
  - Edit
  - Bash
  - Glob
  - Grep
  - WebSearch
  - WebFetch
---

# CURC HPC Cluster Access (CU Boulder Alpine)

You have full SSH access to CU Boulder's Alpine HPC cluster. You can do everything a human researcher can do: submit jobs, debug failures, load modules, transfer files, and work autonomously.

## Quick Reference

| Item | Value |
|------|-------|
| **Login** | `ssh [email protected]` |
| **Filesystem** | `/scratch/alpine/$CURC_USER/` (10TB, fast I/O) |
| **Agent Workspace** | `/scratch/alpine/$CURC_USER/Agent_Runs/` |
| **Job Scheduler** | SLURM |
| **Default Partition** | `amilan` (CPU), `aa100` (GPU) |
| **Authentication** | SSH key (pre-configured) |
| **HPC Client** | `.claude/skills/hpc-cluster/hpc_client.py` |

---

## Two Ways to Work

You have two approaches available:

### 1. Python HPC Client (Recommended for common operations)

A lightweight client that handles connection management and common patterns:

```python
import sys
import os
# Add the skill directory to path (relative to project root)
skill_dir = os.path.join(os.environ.get('PROJECT_ROOT', '.'), '.claude/skills/hpc-cluster')
sys.path.insert(0, skill_dir)
from hpc_client import HPCClient

hpc = HPCClient()
hpc.connect()

# Create workspace, upload files, submit job, wait for completion
run_dir = hpc.create_run("argon-diffusion")
hpc.upload("input.lmp", f"{run_dir}/input.lmp")
hpc.upload("job.slurm", f"{run_dir}/job.slurm")
job_id = hpc.submit(f"{run_dir}/job.slurm")
status = hpc.wait_for_job(job_id, timeout=3600)

if status.is_success:
    hpc.download(f"{run_dir}/output.dat", "./results/")
else:
    # Debug: read error output
    print(hpc.read_file(f"{run_dir}/my_job_{job_id}.err"))

hpc.disconnect()
```

### 2. Direct SSH (For full control)

When you need to do something the client doesn't support, use raw SSH:

```bash
# Run any command
ssh [email protected] "your command here"

# Interactive debugging
ssh [email protected]
```

**Use the client for**: workspace setup, file transfer, job submission, job monitoring
**Use raw SSH for**: debugging, exploring, unusual operations, anything not covered

---

## Connection

### SSH Access

SSH is pre-configured with key-based authentication and connection multiplexing via `~/.ssh/config`. Use the `cu_alpine` alias for simplicity:

```bash
# Connect to CURC login node (uses ~/.ssh/config)
ssh cu_alpine

# Run a single command
ssh cu_alpine "squeue -u $CURC_USER"

# Or use full address
ssh [email protected] "squeue -u $CURC_USER"

# Transfer files TO HPC
scp local_file.txt [email protected]:/scratch/alpine/$CURC_USER/

# Transfer files FROM HPC
scp [email protected]:/scratch/alpine/$CURC_USER/results.dat ./
```

**Connection multiplexing**: The SSH config uses ControlMaster to reuse connections - the first connection is slower, but subsequent ones are instant.

**Important**: The login node is for submitting jobs and light tasks. Never run compute-intensive work directly on login nodes.

---

## Workspace Structure

All agent work on HPC goes in the existing `Agent_Runs` directory:

```
/scratch/alpine/$CURC_USER/Agent_Runs/
├── argon-diffusion-20260118/
│   ├── inputs/
│   ├── outputs/
│   ├── job.slurm
│   └── README.md
├── water-tip4p-20260119/
├── shared/
│   ├── potentials/          # Downloaded force fields
│   ├── pseudopotentials/    # Downloaded pseudopotentials
│   └── scripts/             # Reusable analysis scripts
└── ...
```

### Creating a New Run

```bash
# Create run directory with timestamp
RUN_NAME="project-name-$(date +%Y%m%d-%H%M%S)"
RUN_DIR="/scratch/alpine/$CURC_USER/Agent_Runs/$RUN_NAME"
ssh cu_alpine "mkdir -p $RUN_DIR/{inputs,outputs}"
```

---

## SLURM Job Submission

### Job Script Template

```bash
#!/bin/bash
#SBATCH --job-name=my_simulation
#SBATCH --partition=amilan          # CPU partition (or aa100 for GPU)
#SBATCH --nodes=1
#SBATCH --ntasks=32                 # Number of MPI tasks
#SBATCH --time=04:00:00             # Max runtime (HH:MM:SS)
#SBATCH --output=%x_%j.out          # stdout file
#SBATCH --error=%x_%j.err           # stderr file
#SBATCH --mail-type=END,FAIL        # Email notifications
#SBATCH [email protected]

# Load required modules
module purge
module load gcc/13.1.0
module load openmpi/4.1.6

# Change to run directory
cd $SLURM_SUBMIT_DIR

# Run your simulation
mpirun -np $SLURM_NTASKS ./your_program input.in
```

### Key SLURM Commands

| Command | Purpose |
|---------|---------|
| `sbatch job.slurm` | Submit batch job |
| `squeue -u $USER` | Check your job status |
| `squeue -j <jobid>` | Check specific job |
| `scancel <jobid>` | Cancel a job |
| `sinfo -p amilan` | Check partition status |
| `sacct -j <jobid>` | Job accounting info |
| `scontrol show job <jobid>` | Detailed job info |

### Job Status Codes

| Code | Meaning |
|------|---------|
| `PD` | Pending (waiting for resources) |
| `R` | Running |
| `CG` | Completing |
| `CD` | Completed |
| `F` | Failed |
| `TO` | Timeout |
| `CA` | Cancelled |

---

## Available Partitions

### Partition Selection Strategy

**CRITICAL: Always validate on testing partition first before production runs!**

```
Workflow:
1. atesting / atesting_a100  →  Validate job script works (1 hour max)
2. amilan / aa100            →  Production runs (24 hour max)
3. amilan + qos=long         →  Extended runs (7 day max, lower priority)
```

### Testing Partitions (Use First!)

| Partition | Limits | Max Time | Purpose |
|-----------|--------|----------|---------|
| `atesting` | 2 nodes, 16 cores max | 1h | **Validate CPU jobs work before production** |
| `atesting_a100` | 1 GPU, 10 cores max | 1h | **Validate GPU jobs work before production** |
| `atesting_mi100` | 1 GPU, 10 cores max | 1h | Validate AMD GPU jobs |

**Always run a short test on atesting first** to catch:
- Module loading issues
- Path errors
- Input file problems
- Memory requirements

### Production CPU Partitions

| Partition | Nodes | Cores/Node | RAM/Node | Max Time | Use For |
|-----------|-------|------------|----------|----------|---------|
| `amilan` | 387 | 32-64 | 256 GB (3.75 GB/core) | 24h | **Default for production CPU jobs** |
| `amilan128c` | 16 | 128 | 256 GB (2 GB/core) | 24h | **High core count on single node** (see below) |
| `amem` | 24 | 48-128 | up to 2 TB | 24h | Memory-intensive (requires `--qos=mem`, must request 256GB+) |

#### When to Use amilan128c vs amilan

**Use `amilan128c` when:**
- Your job benefits from **128 cores on ONE node** (vs spreading across multiple nodes)
- Running **OpenMP/shared-memory** parallel codes
- High **inter-process communication** (MPI with frequent small messages)
- **Tightly-coupled simulations** where network latency hurts performance
- Large LAMMPS/QE jobs that scale well but suffer from inter-node communication

**Use regular `amilan` when:**
- Your job needs **fewer than 64 cores**
- You need **multiple nodes** (amilan has 387 nodes vs only 16 for 128c)
- Memory per core matters more (3.75 GB/core vs 2 GB/core on 128c)
- Queue wait time is a concern (more nodes = shorter queue)

**Example: 128-core single-node LAMMPS job**
```bash
#SBATCH --partition=amilan128c
#SBATCH --nodes=1
#SBATCH --ntasks=128           # Use all 128 cores
#SBATCH --time=12:00:00
```

### Production GPU Partitions

| Partition | Nodes | GPUs/Node | GPU Type | Max Time | Use For |
|-----------|-------|-----------|----------|----------|---------|
| `aa100` | 11 | 3 | NVIDIA A100 (40GB) | 24h | **Best for CUDA, ML/DL, GPU-accelerated MD** |
| `ami100` | 7 | 3 | AMD MI100 | 24h | ROCm/HIP workloads |
| `al40` | 3 | 3 | NVIDIA L40 | 24h | Newer architecture, visualization |

### Special Partitions

| Partition | Max Time | Purpose |
|-----------|----------|---------|
| `acompile` | 12h | Compiling software only (use via `acompile` command) |
| `csu` | 24h | Colorado State contributed nodes |
| `amc` | 24h | CU Anschutz contributed nodes |

### QoS (Quality of Service)

| QoS | Max Time | Priority | When to Use |
|-----|----------|----------|-------------|
| `normal` | 24h | Normal | **Default - use for most jobs** |
| `long` | 7 days | Lower | Extended simulations (will wait longer in queue) |
| `mem` | 24h | Normal | Required for `amem` partition (high-memory jobs) |

### Partition Selection Examples

```bash
# 1. TESTING: Always start here to validate your job works
#SBATCH --partition=atesting
#SBATCH --time=00:30:00
#SBATCH --ntasks=4

# 2. PRODUCTION CPU: After testing passes
#SBATCH --partition=amilan
#SBATCH --time=04:00:00
#SBATCH --ntasks=32

# 3. PRODUCTION GPU: For GPU-accelerated codes
#SBATCH --partition=aa100
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

# 4. LONG RUNS: When 24h isn't enough (lower priority)
#SBATCH --partition=amilan
#SBATCH --qos=long
#SBATCH --time=168:00:00   # 7 days

# 5. HIGH MEMORY: For memory-intensive jobs (256GB+ required)
#SBATCH --partition=amem
#SBATCH --qos=mem
#SBATCH --mem=512G
#SBATCH --time=12:00:00
```

---

## Module System

Software is managed through environment modules. **Always work from a compute node or compile node, not login.**

### Essential Commands

```bash
# List available modules
module avail

# Search for specific software
module spider lammps
module spider python

# Load modules
module load gcc/13.1.0
module load openmpi/4.1.6
module load lammps/20230802

# See what's loaded
module list

# Unload all modules
module purge

# Save/restore module sets
module save my_env
module restore my_env
```

### Finding and Loading Software

Software on CURC is installed in `/curc/sw/install/`. To find what's available:

```bash
# List all installed software
ls /curc/sw/install/

# Check specific software versions
ls /curc/sw/install/lammps/    # LAMMPS versions (22July25, 2Sept25, etc.)
ls /curc/sw/install/QE/        # Quantum ESPRESSO (7.0, 7.2)
ls /curc/sw/install/gromacs/   # GROMACS versions
```

**LAMMPS example** (check exact paths for current versions):
```bash
# Find the binary
ls /curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin/

# In job script
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS lmp -in input.lmp
```

**Quantum ESPRESSO example**:
```bash
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"
mpirun -np $SLURM_NTASKS pw.x < input.in > output.out
```

**Note**: Module dependencies matter. Load compiler first, then MPI. Check exact version paths as they may change.

---

## Storage Filesystem

### Paths and Quotas

| Path | Quota | Purge | Use For |
|------|-------|-------|---------|
| `/home/$USER` | 2 GB | Never | Scripts, small configs |
| `/projects/$USER` | 250 GB | Never | Code, small datasets |
| `/scratch/alpine/$USER` | 10 TB | 90 days | **Job I/O, large files** |
| `$SLURM_SCRATCH` | ~300 GB | Job end | Node-local temp storage |

### Performance Rules

**DO:**
- Run all job I/O on `/scratch/alpine/`
- Use `$SLURM_SCRATCH` for intensive temporary files
- Copy results back after job completes

**DON'T:**
- Run I/O-intensive jobs on `/home` or `/projects` (will be killed)
- Store important data only on `/scratch` (it's purged!)
- Leave large files on login nodes

---

## Example Workflows

### Recommended Workflow: Test First, Then Production

**Step 1: Create a testing job script (job_test.slurm)**
```bash
#!/bin/bash
#SBATCH --job-name=argon_test
#SBATCH --partition=atesting        # <-- TEST PARTITION FIRST
#SBATCH --nodes=1
#SBATCH --ntasks=4                  # Small scale for testing
#SBATCH --time=00:30:00             # 30 min is plenty for testing
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

echo "=== Testing job script ==="
echo "Started at: $(date)"
echo "Running on: $(hostname)"

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
echo "Working directory: $(pwd)"
echo "Input files: $(ls -la)"

# Run short test (reduce timesteps in input for testing)
mpirun -np $SLURM_NTASKS lmp -in input.lmp

echo "Finished at: $(date)"
```

**Step 2: If test passes, create production job (job_prod.slurm)**
```bash
#!/bin/bash
#SBATCH --job-name=argon_prod
#SBATCH --partition=amilan          # <-- PRODUCTION PARTITION
#SBATCH --nodes=1
#SBATCH --ntasks=32                 # Full scale
#SBATCH --time=04:00:00             # Appropriate for full run
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp
```

### LAMMPS MD Simulation (Full Example)

```bash
#!/bin/bash
#SBATCH --job-name=argon_md
#SBATCH --partition=amilan
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=02:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/lammps/22July25/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS lmp -in input.lmp
```

### Quantum ESPRESSO DFT

```bash
#!/bin/bash
#SBATCH --job-name=si_scf
#SBATCH --partition=amilan
#SBATCH --nodes=2
#SBATCH --ntasks=64
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

module purge
module load gcc/12.2.0 openmpi/4.1.5
export PATH="/curc/sw/install/QE/7.2/gcc/12.2.0/openmpi/4.1.5/bin:$PATH"

cd $SLURM_SUBMIT_DIR
mpirun -np $SLURM_NTASKS pw.x < si_scf.in > si_scf.out
```

### GPU Job (Testing First)

**Test on atesting_a100:**
```bash
#!/bin/bash
#SBATCH --job-name=md_gpu_test
#SBATCH --partition=atesting_a100   # <-- GPU TESTING
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:30:00
#SBATCH --output=%x_%j.out

module purge
module load gcc/12.2.0 cuda/12.1.1
# Add LAMMPS GPU path here

cd $SLURM_SUBMIT_DIR
lmp -k on g 1 -sf kk -pk kokkos gpu/aware off -in input.lmp
```

**Then production on aa100:**
```bash
#!/bin/bash
#SBATCH --job-name=md_gpu_prod
#SBATCH --partition=aa100           # <-- GPU PRODUCTION
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:3                # Can use up to 3 GPUs per node
#SBATCH --time=04:00:00
#SBATCH --output=%x_%j.out

module purge
module load gcc/12.2.0 cuda/12.1.1

cd $SLURM_SUBMIT_DIR
lmp -k on g 3 -sf kk -pk kokkos gpu/aware off -in input.lmp
```

---

## Debugging Failed Jobs

When a job fails, investigate systematically:

### 1. Check Job Status

```bash
# See why it failed
sacct -j <jobid> --format=JobID,State,ExitCode,Reason

# Get detailed info
scontrol show job <jobid>
```

### 2. Read Output Files

```bash
# Check stdout
cat my_job_12345.out

# Check stderr (often has the real error)
cat my_job_12345.err

# Check application logs
cat log.lammps
```

### 3. Common Failure Reasons

| Issue | Symptom | Solution |
|-------|---------|----------|
| Timeout | State=TIMEOUT | Increase `--time` or optimize |
| Memory | State=OUT_OF_MEMORY | Increase nodes or use `amem` |
| Module not found | "command not found" | Check `module load` order |
| Bad path | "file not found" | Use absolute paths |
| Wrong partition | Job pending forever | Check partition resources |

### 4. Interactive Debugging

```bash
# Get interactive session for debugging
sinteractive --partition=atesting --time=01:00:00 --ntasks=4

# Then run commands interactively to debug
module load lammps
lmp -in input.lmp  # See errors in real-time
```

---

## File Transfer

### Between Local and HPC

```bash
# Upload input files
scp -r ./inputs/ [email protected]:/scratch/alpine/$CURC_USER/agent-workspace/runs/my-run/

# Download results
scp [email protected]:/scratch/alpine/$CURC_USER/agent-workspace/runs/my-run/output.dat ./

# Sync directories (rsync is more efficient for updates)
rsync -avz ./project/ [email protected]:/scratch/alpine/$CURC_USER/project/
```

### Large File Transfers

For very large files, use Globus (web-based) or DTN nodes:

```bash
# Use data transfer node for large transfers
scp large_file.tar [email protected]:/scratch/alpine/$CURC_USER/
```

---

## Queue Times and Async Job Management

### Understanding Queue Wait Times

**CRITICAL**: HPC jobs don't start immediately. Queue times vary dramatically:

| Partition | Typical Wait | Why |
|-----------|-------------|-----|
| `atesting` | Minutes | Testing partition, low demand |
| `amilan` | Minutes to hours | Many nodes (387), high throughput |
| `amilan128c` | **Hours to DAYS** | Only 16 nodes, high demand |
| `aa100` | Hours to days | Only 11 nodes, GPU scarcity |

**Before submitting, check the queue:**
```bash
# See pending jobs and estimated start times
ssh cu_alpine "squeue -p amilan128c --start"

# Quick queue depth check
ssh cu_alpine "squeue -p amilan128c --state=PENDING | wc -l"
```

### Async Workflow (For Long Queue Times)

**DON'T** block waiting for jobs with multi-day queues. Instead:

```python
from hpc_client import HPCClient

hpc = HPCClient()
hpc.connect()

# 1. Check queue before choosing partition
status = hpc.get_queue_status('amilan128c')
print(f"Estimated wait: {status['estimated_wait']}")
print(f"Pending jobs: {status['pending_jobs']}")

# 2. Compare partitions to choose wisely
for part in hpc.compare_partitions(['amilan', 'amilan128c', 'aa100']):
    print(f"{part['partition']}: {part['estimated_wait']}, {part['pending_jobs']} pending")

# 3. Submit async (returns immediately, saves tracking file)
tracking = hpc.submit_async(f"{run_dir}/job.slurm")
print(f"Job {tracking['job_id']} submitted")
print(f"Estimated start: {tracking['estimated_start']}")
# Returns immediately - don't wait!

# 4. Later: Check on all submitted jobs
jobs = hpc.check_async_jobs()
for job in jobs:
    print(f"Job {job['job_id']}: {job['current_status']}")
    if job['is_finished']:
        print(f"  Completed! Success: {job['is_success']}")
```

### Workflow Strategy for Long-Running Studies

For multi-day queue scenarios:

```
Day 1: Submit jobs
├── Check queue status
├── Submit with submit_async()
├── Note estimated start times
└── Move on to other work

Day 2+: Check periodically
├── hpc.check_async_jobs()
├── If still PENDING: wait
├── If RUNNING: monitor progress
└── If COMPLETED: download results and analyze
```

### SLURM Email Notifications (Recommended)

Add to your job scripts for automatic notifications:

```bash
#SBATCH --mail-type=BEGIN,END,FAIL    # When to email
#SBATCH [email protected]    # Your email

# Options: NONE, BEGIN, END, FAIL, REQUEUE, ALL
# BEGIN = job started (left queue)
# END = job finished
# FAIL = job failed
```

### Smart Partition Selection

**Decision tree:**

```
Need GPU?
├── YES → Check aa100 queue
│         └── Long wait? Consider if job can run on CPU instead
└── NO → How many cores?
         ├── ≤64 cores → amilan (shorter queue, more nodes)
         └── >64 cores or tightly-coupled →
             └── Check amilan128c queue
                 └── Wait >24h? Consider splitting across amilan nodes
```

### Check Job Progress

```bash
# One-time status check with start time estimates
ssh cu_alpine "squeue -u $CURC_USER --start"

# See job details
ssh cu_alpine "scontrol show job <jobid>"

# Check why job is pending
ssh cu_alpine "squeue -j <jobid> --format='%r'"  # Shows REASON
```

### Wait for Job Completion (Short Jobs Only)

Only use blocking wait for jobs expected to complete within minutes:

```bash
# Poll until job completes (ONLY for short jobs!)
JOB_ID=12345
while ssh cu_alpine "squeue -j $JOB_ID 2>/dev/null | grep -q $JOB_ID"; do
    echo "Job $JOB_ID still running..."
    sleep 60
done
echo "Job $JOB_ID completed"

# Check final status
ssh cu_alpine "sacct -j $JOB_ID --format=JobID,State,ExitCode"
```

---

## Key Principles

### You Are a Researcher

You have the same access a human researcher has. You can:
- Create any job script you need
- Load any available module
- Debug failures by reading logs
- Adapt to different software versions
- Figure out problems through investigation

### Don't Just Execute - Verify

After running on HPC:
1. Check job completed successfully (not just submitted)
2. Verify output files exist and have content
3. Check for error messages in stderr
4. Validate results are physically reasonable

### Document Your Work

Leave breadcrumbs for yourself:
```bash
# In job script
echo "Job started at $(date)"
echo "Running on $(hostname)"
echo "Loaded modules: $(module list 2>&1)"
```

---

## Reference Links

- [CURC Documentation](https://curc.readthedocs.io/en/latest/)
- [Alpine Hardware](https://curc.readthedocs.io/en/latest/clusters/alpine/alpine-hardware.html)
- [SLURM Guide](https://curc.readthedocs.io/en/latest/running-jobs/running-apps-with-jobs.html)
- [Module System](https://curc.readthedocs.io/en/latest/compute/modules.html)
- [Filesystems](https://curc.readthedocs.io/en/latest/compute/filesystems.html)