This skill helps you manage RunPod GPU training pods end-to-end, from selection to checkpoints, boosting efficiency and cost control.
Install with: `npx playbooks add skill chrisvoncsefalvay/funsloth --skill funsloth-runpod`
---
name: funsloth-runpod
description: Training manager for RunPod GPU instances - configure pods, launch training, monitor progress, retrieve checkpoints
---
# RunPod Training Manager
Run Unsloth training on RunPod GPU instances.
## Prerequisites
1. **RunPod API Key**: create one at runpod.io/console/user/settings, export it as `RUNPOD_API_KEY`, and verify with `echo $RUNPOD_API_KEY`
2. **RunPod SDK**: `pip install runpod`
3. **Training notebook/script**: From `funsloth-train`
## Workflow
### 1. Select GPU
| GPU | VRAM | Approx. cost | Best for (model size) |
|-----|------|------|----------|
| RTX 3090 | 24GB | ~$0.35/hr | Budget 7-14B |
| RTX 4090 | 24GB | ~$0.55/hr | Fast 7-14B |
| A100 40GB | 40GB | ~$1.50/hr | 14-34B |
| A100 80GB | 80GB | ~$2.00/hr | 70B |
| H100 | 80GB | ~$3.50/hr | Fastest |
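RunPod's hourly rates are typically lower than Hugging Face Jobs for comparable GPUs. The prices above are approximate and change often, so check current rates before launching.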
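The selection logic implied by the table can be sketched in a few lines. This is a minimal illustration, not the bundled `scripts/estimate_cost.py`: the `GpuOption` class and both helper functions are hypothetical names, the rates are copied from the table above, and the VRAM requirement is something you supply yourself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuOption:
    name: str
    vram_gb: int
    usd_per_hour: float  # approximate rate copied from the table above


# Assumption: these rates mirror the table and will drift over time.
GPU_OPTIONS = [
    GpuOption("RTX 3090", 24, 0.35),
    GpuOption("RTX 4090", 24, 0.55),
    GpuOption("A100 40GB", 40, 1.50),
    GpuOption("A100 80GB", 80, 2.00),
    GpuOption("H100", 80, 3.50),
]


def cheapest_gpu(min_vram_gb: int) -> GpuOption:
    """Return the lowest-cost listed GPU with at least `min_vram_gb` of VRAM."""
    candidates = [gpu for gpu in GPU_OPTIONS if gpu.vram_gb >= min_vram_gb]
    if not candidates:
        raise ValueError(f"No listed GPU has {min_vram_gb} GB of VRAM")
    return min(candidates, key=lambda gpu: gpu.usd_per_hour)


def estimate_cost(gpu: GpuOption, hours: float) -> float:
    """Estimate the run cost in USD, rounded to cents."""
    return round(gpu.usd_per_hour * hours, 2)
```

For example, `estimate_cost(cheapest_gpu(40), hours=6)` prices a six-hour run on the cheapest listed GPU with at least 40 GB of VRAM.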
### 2. Choose Deployment
- **Pod** (Recommended): Persistent, SSH access, network storage
- **Serverless**: Pay per second, complex setup (better for inference)
### 3. Configure Network Volume (Recommended)
```python
import os
import runpod
runpod.api_key = os.environ["RUNPOD_API_KEY"]  # authenticate the SDK
volume = runpod.create_network_volume(name="funsloth-training", size_gb=50, region="US")
```
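A network volume lets you resume training, download checkpoints, and share data between pods. If your installed `runpod` SDK version does not expose `create_network_volume`, create the volume in the RunPod console instead and note its ID.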
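Resuming from the volume usually means finding the most recent checkpoint directory first. The sketch below assumes checkpoints are written as `checkpoint-<step>` directories (the Hugging Face Trainer convention). The `latest_checkpoint` helper is a hypothetical name.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoint directories are named "checkpoint-<step>".
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path
```

If your training script uses the Hugging Face `Trainer`, the returned path can be passed as `trainer.train(resume_from_checkpoint=str(path))`. Steps are compared numerically, so `checkpoint-1000` ranks above `checkpoint-500`, which a plain string sort would get wrong.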
### 4. Launch Pod
Use the [official Unsloth Docker image](https://docs.unsloth.ai/new/how-to-fine-tune-llms-with-unsloth-and-docker) for a pre-configured environment:
```python
import os
import runpod

runpod.api_key = os.environ["RUNPOD_API_KEY"]  # authenticate the SDK

pod = runpod.create_pod(
    name="funsloth-training",
    image_name="unsloth/unsloth",  # Official image, supports all GPUs incl. Blackwell
    gpu_type_id="{gpu_type}",
    volume_in_gb=50,
    network_volume_id="{volume_id}",
    env={
        "HF_TOKEN": "{token}",
        "WANDB_API_KEY": "{key}",
        "JUPYTER_PASSWORD": "unsloth",  # placeholder; set your own password
    },
    ports="8888/http,22/tcp",
)
```
The Unsloth image includes Jupyter Lab (port 8888) and example notebooks in `/workspace/unsloth-notebooks/`.
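To keep secrets out of the script, the pod settings can be assembled from environment variables before calling `runpod.create_pod`. This is a sketch under assumptions: `build_pod_config` is a hypothetical helper, and treating `HF_TOKEN` and `JUPYTER_PASSWORD` as required and `WANDB_API_KEY` as optional is a choice made for this example only.

```python
import os
from typing import Mapping

# Assumption: which variables are required vs. optional is illustrative.
REQUIRED_VARS = ("HF_TOKEN", "JUPYTER_PASSWORD")
OPTIONAL_VARS = ("WANDB_API_KEY",)


def build_pod_config(
    gpu_type_id: str,
    network_volume_id: str,
    environ: Mapping[str, str] = os.environ,
) -> dict:
    """Build keyword arguments for runpod.create_pod without hard-coding secrets."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    env = {name: environ[name] for name in REQUIRED_VARS}
    env.update({name: environ[name] for name in OPTIONAL_VARS if environ.get(name)})
    return {
        "name": "funsloth-training",
        "image_name": "unsloth/unsloth",
        "gpu_type_id": gpu_type_id,
        "volume_in_gb": 50,
        "network_volume_id": network_volume_id,
        "env": env,
        "ports": "8888/http,22/tcp",
    }
```

The result can then be unpacked into the SDK call, for example `runpod.create_pod(**build_pod_config(gpu_type_id, volume_id))`. Failing early on a missing variable avoids paying for a pod that cannot authenticate to the Hub.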
### 5. Upload and Run
```bash
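# From your local machine: upload the script
# ({ssh_port} is a placeholder; add -P {ssh_port} if RunPod maps SSH to a non-default port)
scp train.py root@{pod_ip}:/workspace/
# SSH into the pod (add -p {ssh_port} likewise)
ssh root@{pod_ip}
# On the pod: run training inside tmux so it survives disconnects
tmux new -s training
cd /workspace && python train.py 2>&1 | tee training.log
# Ctrl+B, then D to detach; reattach with: tmux attach -t training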
```
### 6. Monitor
```bash
# On the pod: follow the training log and GPU utilization
tail -f /workspace/training.log
nvidia-smi -l 1
# Dashboard (open in a browser): https://runpod.io/console/pods/{pod_id}
```
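For scripted monitoring, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The following sketch parses that output. The function names and dictionary keys are hypothetical choices for this example.

```python
import subprocess

# Query one CSV row per GPU: utilization (%), memory used (MiB), memory total (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Parse nvidia-smi CSV rows into a list of per-GPU dictionaries."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (float(field.strip()) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi on this machine (the pod) and return parsed stats."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` on the pod returns one entry per GPU. Sustained low utilization during training often points to a data-loading bottleneck rather than a GPU that is too small.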
### 7. Retrieve Checkpoints
```bash
# On the pod: copy outputs to the network volume
cp -r /workspace/outputs /runpod-volume/
# From your local machine: download via SCP
scp -r root@{pod_ip}:/workspace/outputs ./
# Or push to the HF Hub from the pod
```
### 8. Stop Pod
```python
import runpod
runpod.stop_pod(pod_id)  # Stops the pod; it can be resumed later
runpod.terminate_pod(pod_id)  # Deletes the pod; the network volume is kept
```
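Because a forgotten pod keeps accruing charges, it can help to guarantee the stop call even when the surrounding script fails. A minimal sketch using a context manager follows. `stop_on_exit` is a hypothetical helper, and the stop function is passed in so the pattern works with `runpod.stop_pod` or `runpod.terminate_pod`.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def stop_on_exit(pod_id: str, stop_fn: Callable[[str], object]) -> Iterator[str]:
    """Yield the pod ID and always call `stop_fn(pod_id)` on exit, even on error."""
    try:
        yield pod_id
    finally:
        stop_fn(pod_id)
```

Usage would look like `with stop_on_exit(pod_id, runpod.stop_pod): ...`, with the monitoring or download logic inside the block. This only covers failures in the local controlling script; it does not detect a crash inside the pod itself.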
### 9. Handoff
Offer `funsloth-upload` for Hub upload with model card.
## Best Practices
1. **Always use network volumes** - pod storage is ephemeral
2. **Use spot instances** for lower costs (risk of preemption)
3. **Set up SSH keys** before creating pods
4. **Stop pods when not training** - charges per minute
5. **Save checkpoints frequently** with `save_steps`
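One way to choose `save_steps`, especially on preemptible instances, is to bound how much work a lost pod can cost. The sketch below converts a target interval between checkpoints into a step count. `save_steps_for_interval` is a hypothetical helper, and the seconds-per-step figure is something you measure from the first few logged steps.

```python
import math


def save_steps_for_interval(seconds_per_step: float, minutes_between_saves: float) -> int:
    """Return a save_steps value that checkpoints at least every N minutes."""
    if seconds_per_step <= 0 or minutes_between_saves <= 0:
        raise ValueError("Both arguments must be positive")
    # Round down so the real interval never exceeds the target; save at least every step.
    return max(1, math.floor(minutes_between_saves * 60 / seconds_per_step))
```

At 2.5 seconds per step and a 15-minute target, this yields `save_steps=360`. Shorter intervals reduce lost work after a preemption but use more disk on the volume.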
## Error Handling
| Error | Resolution |
|-------|------------|
| Pod creation failed | Try different GPU type or region |
| SSH refused | Wait 1-2 min, check IP |
| Out of disk | Increase volume or clean up |
| Volume not mounting | Check same region as pod |
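The "try a different GPU type" resolution can be automated with a simple fallback loop. This is a sketch, not part of the RunPod SDK: `create_with_fallback` is a hypothetical helper, and the creation function is injected so that `runpod.create_pod` can be passed in directly.

```python
from typing import Any, Callable, Sequence, Tuple


def create_with_fallback(
    create_fn: Callable[..., Any],
    gpu_type_ids: Sequence[str],
    **pod_kwargs: Any,
) -> Tuple[str, Any]:
    """Try each GPU type in order; return (gpu_type_id, pod) for the first success."""
    errors = {}
    for gpu_type_id in gpu_type_ids:
        try:
            return gpu_type_id, create_fn(gpu_type_id=gpu_type_id, **pod_kwargs)
        except Exception as exc:  # broad on purpose: any failure triggers the next option
            errors[gpu_type_id] = str(exc)
    raise RuntimeError(f"Pod creation failed for all GPU types: {errors}")
```

For example, `create_with_fallback(runpod.create_pod, preferred_gpu_ids, name="funsloth-training", image_name="unsloth/unsloth")`, where `preferred_gpu_ids` is your own list ordered from most to least preferred. Catching every exception is a deliberate simplification; a stricter version would retry only on capacity errors.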
## Bundled Resources
- [scripts/train_sft.py](scripts/train_sft.py) - Training script template
- [scripts/estimate_cost.py](scripts/estimate_cost.py) - Cost estimation
- [references/PLATFORM_COMPARISON.md](references/PLATFORM_COMPARISON.md) - RunPod vs alternatives
- [references/TROUBLESHOOTING.md](references/TROUBLESHOOTING.md) - Common issues
This skill is a training manager for RunPod GPU instances that streamlines launching, monitoring, and managing Unsloth fine-tuning runs. It helps configure GPU type, persistent network volumes, and the Unsloth Docker environment, then starts pods, tracks progress, and retrieves checkpoints. Use it to run reproducible training with SSH access, Jupyter, and straightforward checkpoint handling.
The skill automates RunPod SDK calls to create network volumes, provision pods with the official Unsloth image, and set required environment variables for HF and Weights & Biases. It exposes commands to SSH into pods, upload training scripts, run training sessions (via tmux), tail logs and GPU usage, and copy or push checkpoints from the volume. It also supports stopping, resuming, and terminating pods while preserving volumes.
## FAQ

**How do I persist training outputs after stopping a pod?**
Attach a RunPod network volume and copy outputs to `/runpod-volume/` before stopping; network volumes persist beyond pod termination.

**What GPU should I pick for a 14B model?**
An A100 40GB or an RTX 4090 is a common choice for 14B models. The A100 offers faster throughput and more memory headroom; the RTX 4090 is cost-effective for many 7–14B workflows.