
funsloth-runpod skill


This skill helps you manage RunPod GPU training pods end to end: selecting a GPU, launching and monitoring training, and retrieving checkpoints while keeping costs under control.

npx playbooks add skill chrisvoncsefalvay/funsloth --skill funsloth-runpod

---
name: funsloth-runpod
description: Training manager for RunPod GPU instances - configure pods, launch training, monitor progress, retrieve checkpoints
---

# RunPod Training Manager

Run Unsloth training on RunPod GPU instances.

## Prerequisites

1. **RunPod API Key**: export it as `RUNPOD_API_KEY` and verify with `echo $RUNPOD_API_KEY` (create a key at runpod.io/console/user/settings)
2. **RunPod SDK**: `pip install runpod`
3. **Training notebook/script**: From `funsloth-train`

## Workflow

### 1. Select GPU

| GPU | VRAM | Cost | Best For |
|-----|------|------|----------|
| RTX 3090 | 24GB | ~$0.35/hr | Budget 7-14B |
| RTX 4090 | 24GB | ~$0.55/hr | Fast 7-14B |
| A100 40GB | 40GB | ~$1.50/hr | 14-34B |
| A100 80GB | 80GB | ~$2.00/hr | 70B |
| H100 | 80GB | ~$3.50/hr | Fastest |

RunPod's hourly GPU prices are typically lower than HF Jobs. The costs above are approximate.
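The table above can be turned into a small lookup for rough sizing and budgeting. This is a minimal sketch, not part of the RunPod SDK: `pick_gpu` and `estimate_cost` are hypothetical helper names, the prices are the approximate figures from the table, and the H100 size limit is an assumption (the table only lists it as "Fastest").

```python
from typing import NamedTuple, Optional


class Gpu(NamedTuple):
    name: str
    vram_gb: int
    usd_per_hour: float  # approximate price from the table
    max_params_b: int    # largest model size (billions of parameters) the table suggests


# Values copied from the table; the H100 limit is assumed equal to the A100 80GB.
GPUS = [
    Gpu("RTX 3090", 24, 0.35, 14),
    Gpu("RTX 4090", 24, 0.55, 14),
    Gpu("A100 40GB", 40, 1.50, 34),
    Gpu("A100 80GB", 80, 2.00, 70),
    Gpu("H100", 80, 3.50, 70),
]


def pick_gpu(params_b: float) -> Optional[Gpu]:
    """Return the cheapest GPU whose suggested range covers the model size."""
    candidates = [g for g in GPUS if params_b <= g.max_params_b]
    return min(candidates, key=lambda g: g.usd_per_hour) if candidates else None


def estimate_cost(gpu: Gpu, hours: float) -> float:
    """Rough cost in USD: hourly rate times wall-clock hours."""
    return round(gpu.usd_per_hour * hours, 2)


gpu = pick_gpu(14)
if gpu is not None:
    print(gpu.name, estimate_cost(gpu, hours=6))
```

Picking purely on price favours the RTX 3090 for small models; swap the `key` function if training speed matters more than hourly cost.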

### 2. Choose Deployment

- **Pod** (Recommended): Persistent, SSH access, network storage
- **Serverless**: Pay per second, complex setup (better for inference)

### 3. Configure Network Volume (Recommended)

```python
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]  # from Prerequisites
volume = runpod.create_network_volume(name="funsloth-training", size_gb=50, region="US")
```

Allows: resume training, download checkpoints, share between pods.
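Resuming from the volume usually means locating the newest checkpoint first. The sketch below assumes the training script writes Hugging Face Trainer-style `checkpoint-<step>` directories; `latest_checkpoint` is a hypothetical helper, and the output path you pass in depends on where your script saves.

```python
import re
from pathlib import Path
from typing import Optional

_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        # Compare steps numerically so checkpoint-1000 beats checkpoint-200.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

If the script uses a Trainer-based class, the returned path can be passed as `trainer.train(resume_from_checkpoint=str(path))`; when the function returns `None`, start a fresh run instead.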

### 4. Launch Pod

Use the [official Unsloth Docker image](https://docs.unsloth.ai/new/how-to-fine-tune-llms-with-unsloth-and-docker) for a pre-configured environment:

```python
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]  # from Prerequisites

pod = runpod.create_pod(
    name="funsloth-training",
    image_name="unsloth/unsloth",  # Official image, supports all GPUs incl. Blackwell
    gpu_type_id="{gpu_type}",
    volume_in_gb=50,
    network_volume_id="{volume_id}",
    env={
        "HF_TOKEN": "{token}",
        "WANDB_API_KEY": "{key}",
        "JUPYTER_PASSWORD": "unsloth",  # Replace with your own password
    },
    ports="8888/http,22/tcp",
)
print(pod["id"])  # Pod ID used for monitoring and shutdown
```

The Unsloth image includes Jupyter Lab (port 8888) and example notebooks in `/workspace/unsloth-notebooks/`.

### 5. Upload and Run

```bash
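# Upload script (run from your local machine)
scp train.py root@{pod_ip}:/workspace/

# SSH into pod
ssh root@{pod_ip}

# Run training inside tmux so it survives SSH disconnects
tmux new -s training
cd /workspace && python train.py 2>&1 | tee training.log
# Ctrl+B, D to detach; reattach with: tmux attach -t training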
```
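The `{pod_ip}` placeholder has to be looked up once the pod is running, and the public SSH port is often not 22. A minimal sketch of extracting both from a pod description follows. It assumes the dict returned by `runpod.get_pod(pod_id)` carries a `runtime.ports` list with `ip`, `isIpPublic`, `privatePort`, and `publicPort` fields; check this against your SDK version. `ssh_endpoint` is a hypothetical helper and the sample data is hand-written.

```python
from typing import Optional, Tuple


def ssh_endpoint(pod: dict, private_port: int = 22) -> Optional[Tuple[str, int]]:
    """Return (ip, public_port) for the pod's SSH port, or None if not ready."""
    runtime = pod.get("runtime") or {}  # may be missing or None before the container starts
    for port in runtime.get("ports") or []:
        if port.get("privatePort") == private_port and port.get("isIpPublic"):
            return port["ip"], port["publicPort"]
    return None


# Hand-written example; in practice the dict would come from runpod.get_pod(pod_id).
sample = {
    "runtime": {
        "ports": [
            {"ip": "203.0.113.7", "isIpPublic": True, "privatePort": 22,
             "publicPort": 10322, "type": "tcp"},
        ]
    }
}

endpoint = ssh_endpoint(sample)
if endpoint:
    print(f"ssh root@{endpoint[0]} -p {endpoint[1]}")
```

When the function returns `None`, the pod is most likely still starting; retry after a short wait.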

### 6. Monitor

```bash
# SSH monitoring
tail -f /workspace/training.log
nvidia-smi -l 1

# Dashboard (open in a browser)
# https://runpod.io/console/pods/{pod_id}
```

### 7. Retrieve Checkpoints

```bash
# Save to network volume
cp -r /workspace/outputs /runpod-volume/

# Download via SCP
scp -r root@{pod_ip}:/workspace/outputs ./

# Or push to HF Hub from pod (uses the HF_TOKEN set at launch)
huggingface-cli upload {repo_id} /workspace/outputs
```
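For long runs, copying to the volume can be repeated periodically instead of once at the end. The sketch below copies only checkpoint directories that are not yet on the volume. It assumes `checkpoint-*` directory names, and `mirror_checkpoints` is a hypothetical helper, not an SDK function.

```python
import shutil
from pathlib import Path
from typing import List


def mirror_checkpoints(src: str, dst: str) -> List[str]:
    """Copy checkpoint-* directories from src that do not yet exist under dst."""
    src_root, dst_root = Path(src), Path(dst)
    dst_root.mkdir(parents=True, exist_ok=True)
    copied = []
    for child in sorted(src_root.glob("checkpoint-*")):
        target = dst_root / child.name
        if child.is_dir() and not target.exists():
            shutil.copytree(child, target)
            copied.append(child.name)
    return copied
```

A call such as `mirror_checkpoints("/workspace/outputs", "/runpod-volume/outputs")` matches the paths used above. Note that a checkpoint copied while it is still being written is not re-copied later, so run this between saves or add a completeness check of your own.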

### 8. Stop Pod

```python
import runpod

pod_id = "{pod_id}"  # ID returned by create_pod or shown in the dashboard
runpod.stop_pod(pod_id)       # Can resume later
runpod.terminate_pod(pod_id)  # Deletes pod, keeps network volume
```

### 9. Handoff

Offer `funsloth-upload` for Hub upload with model card.

## Best Practices

1. **Always use network volumes** - pod storage is ephemeral
2. **Use spot instances** for lower costs (risk of preemption)
3. **Set up SSH keys** before creating pods
4. **Stop pods when not training** - charges per minute
5. **Save checkpoints frequently** with `save_steps`

## Error Handling

| Error | Resolution |
|-------|------------|
| Pod creation failed | Try different GPU type or region |
| SSH refused | Wait 1-2 min, check IP |
| Out of disk | Increase volume or clean up |
| Volume not mounting | Check same region as pod |
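The "SSH refused" case can be handled by polling the port instead of retrying by hand. This is a minimal standard-library sketch; `wait_for_port` is a hypothetical helper, and the default timeout is an assumption based on the 1-2 minute wait suggested above.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 180.0,
                  interval_s: float = 5.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

A `True` result only shows that the port accepts connections; authentication can still fail if the SSH key was not registered before the pod was created.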

## Bundled Resources

- [scripts/train_sft.py](scripts/train_sft.py) - Training script template
- [scripts/estimate_cost.py](scripts/estimate_cost.py) - Cost estimation
- [references/PLATFORM_COMPARISON.md](references/PLATFORM_COMPARISON.md) - RunPod vs alternatives
- [references/TROUBLESHOOTING.md](references/TROUBLESHOOTING.md) - Common issues

Overview

This skill is a training manager for RunPod GPU instances that streamlines launching, monitoring, and managing Unsloth fine-tuning runs. It helps configure GPU type, persistent network volumes, and the Unsloth Docker environment, then starts pods, tracks progress, and retrieves checkpoints. Use it to run reproducible training with SSH access, Jupyter, and straightforward checkpoint handling.

How this skill works

The skill automates RunPod SDK calls to create network volumes, provision pods with the official Unsloth image, and set required environment variables for HF and Weights & Biases. It exposes commands to SSH into pods, upload training scripts, run training sessions (via tmux), tail logs and GPU usage, and copy or push checkpoints from the volume. It also supports stopping, resuming, and terminating pods while preserving volumes.

When to use it

  • You need low-friction GPU access for fine-tuning LLMs with Unsloth.
  • You want persistent storage for checkpoints and to resume interrupted runs.
  • You prefer SSH and Jupyter Lab access to interactively run notebooks or scripts.
  • You need cost/instance flexibility across RTX, A100, H100 classes.
  • You plan to push trained models to the Hub or download checkpoints for local evaluation.

Best practices

  • Always create and attach a RunPod network volume to keep checkpoints and outputs persistent.
  • Choose the GPU tier that matches model size and budget (e.g., an A100 80GB for models above 34B).
  • Use tmux or a similar session manager to run long trainings robustly over SSH.
  • Save checkpoints frequently and mirror them to the network volume or push to the Hub.
  • Stop pods when idle to avoid unnecessary charges and prefer spot instances for cheaper compute (with preemption risk).

Example use cases

  • Spin up an RTX 4090 pod to run a 7B–14B finetune with Jupyter-driven experiments.
  • Launch an A100 80GB pod for training a 34B model and store outputs on a 200GB network volume.
  • Run distributed or long-running training inside tmux, monitor with nvidia-smi and training logs, then SCP checkpoints to local storage.
  • Create a persistent pod for interactive development: Jupyter Lab, SSH access, and an attached volume to iterate quickly.
  • Terminate compute but keep the network volume for later resume or to transfer checkpoints to the Hub.

FAQ

How do I persist training outputs after stopping a pod?

Attach a RunPod network volume and copy outputs to /runpod-volume/ before stopping; volumes persist beyond pod termination.

What GPU should I pick for a 14B model?

For 14B models, an A100 40GB or an RTX 4090 is a common choice. The A100 offers faster throughput and more memory headroom, while the RTX 4090 is cost-effective for many 7–14B workflows.