funsloth-hfjobs skill

This skill manages Unsloth fine-tuning on Hugging Face Jobs cloud GPUs, with optional WandB monitoring and cost estimation.

npx playbooks add skill chrisvoncsefalvay/funsloth --skill funsloth-hfjobs

---
name: funsloth-hfjobs
description: Training manager for Hugging Face Jobs - launch fine-tuning on HF cloud GPUs with optional WandB monitoring
---

# Hugging Face Jobs Training Manager

Run Unsloth fine-tuning on Hugging Face Jobs cloud GPUs.

## Prerequisites

1. **HF Authentication**: Run `huggingface-cli whoami`; if you are not logged in, run `huggingface-cli login`
2. **HF Jobs Access**: Requires PRO subscription or org compute access
3. **Training notebook/script**: From `funsloth-train`

## Workflow

### 1. Select Hardware

| GPU | VRAM | Approx. cost | Best for |
|-----|------|------|----------|
| A10G | 24GB | ~$1.50/hr | 7-14B LoRA |
| A100 40GB | 40GB | ~$4/hr | 14-34B |
| A100 80GB | 80GB | ~$6/hr | 70B |
| H100 | 80GB | ~$8/hr | Fastest option for the largest models |

See [references/HARDWARE_GUIDE.md](references/HARDWARE_GUIDE.md) for model-to-GPU mapping.
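The table can be turned into a small lookup when choosing hardware programmatically. The sketch below is illustrative only: `GPU_TIERS` and `pick_gpu` are hypothetical names, the thresholds are copied from the table, and at the 14B overlap it assumes the cheaper tier is preferred. The H100 is left out because the table lists it by speed rather than by model size.

```python
# Upper bounds (billions of parameters) taken from the hardware table above.
# They are rough guidance, not hard limits.
GPU_TIERS = [
    (14, "A10G"),       # 7-14B LoRA
    (34, "A100 40GB"),  # 14-34B
    (70, "A100 80GB"),  # 70B
]


def pick_gpu(params_billions: float) -> str:
    """Return the cheapest tier in the table that covers the model size."""
    if params_billions <= 0:
        raise ValueError("params_billions must be positive")
    for max_params, gpu in GPU_TIERS:
        if params_billions <= max_params:
            return gpu
    raise ValueError(f"The table has no tier for {params_billions}B parameters")


if __name__ == "__main__":
    for size in (7, 20, 70):
        print(f"{size}B -> {pick_gpu(size)}")
```

For anything the table does not cover, the model-to-GPU mapping in the hardware guide remains the reference.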

### 2. Convert Notebook to Script

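HF Jobs runs a standalone script, so convert the notebook to a `.py` file and declare its dependencies in a PEP 723 inline metadata block at the top: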

```python
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git",
#     "torch>=2.0",
#     "transformers>=4.45",
#     "trl>=0.12",
#     "peft>=0.13",
#     "datasets>=2.18",
# ]
# ///
```

Use [scripts/train_sft.py](scripts/train_sft.py) as a template.

### 3. Optional: WandB Integration

Add to the script:
```python
import wandb

# Reads WANDB_API_KEY from the environment
wandb.init(project="funsloth-training")
# Pass report_to="wandb" in TrainingArguments so the trainer logs to this run
```

Set: `export WANDB_API_KEY="your-key"`
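Because WandB is optional, one possible approach is to choose the `report_to` value from the environment, so the same script runs with or without a key. This is a sketch under that assumption; `resolve_report_to` is a hypothetical helper, not part of the bundled template.

```python
import os
from collections.abc import Mapping


def resolve_report_to(env: Mapping[str, str] = os.environ) -> str:
    """Return "wandb" when WANDB_API_KEY is set and non-empty, else "none"."""
    return "wandb" if env.get("WANDB_API_KEY", "").strip() else "none"


if __name__ == "__main__":
    # Pass the result as report_to=... in TrainingArguments.
    print(resolve_report_to())
```

With no key set, the function returns `"none"`, which tells the trainer to skip logging integrations.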

### 4. Estimate Costs

Use the cost estimator:
```bash
python scripts/estimate_cost.py --tokens {total_tokens} --platform hfjobs
```
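For a quick sanity check of the estimator's output, the underlying arithmetic can be sketched as tokens divided by throughput, times the hourly rate. This is not the logic of `scripts/estimate_cost.py`, whose internals are not shown here; the throughput figure is an assumption you must supply from your own measurements, and `estimate_cost_usd` is a hypothetical name.

```python
# Approximate hourly rates (USD) copied from the hardware table above.
HOURLY_RATE_USD = {
    "A10G": 1.50,
    "A100 40GB": 4.00,
    "A100 80GB": 6.00,
    "H100": 8.00,
}


def estimate_cost_usd(
    total_tokens: int, tokens_per_second: float, gpu: str, epochs: int = 1
) -> float:
    """Rough cost: training hours multiplied by the GPU's hourly rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    hours = total_tokens * epochs / tokens_per_second / 3600
    return hours * HOURLY_RATE_USD[gpu]


if __name__ == "__main__":
    # 7.2M tokens at an assumed 2,000 tokens/s is exactly one hour on an A10G.
    cost = estimate_cost_usd(7_200_000, 2_000, "A10G")
    print(f"${cost:.2f}")  # prints $1.50
```

The estimate ignores startup time, evaluation passes, and checkpoint uploads, so treat it as a lower bound.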

### 5. Launch Job

```bash
# Create job config
cat > job_config.yaml << 'EOF'
compute:
  gpu: {gpu_type}
  gpu_count: 1
script: train_hfjobs.py
outputs:
  - /outputs/*
EOF

# Submit
huggingface-cli jobs create --config job_config.yaml
```
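The `{gpu_type}` placeholder in the config has to be filled in before submission. A minimal sketch of doing that from Python follows; it reproduces only the keys shown above, `render_job_config` is a hypothetical helper, and the `"a10g"` value is a stand-in because the accepted GPU identifiers are not listed here.

```python
from string import Template

# Same keys as the job_config.yaml above, with ${...} fields for substitution.
JOB_CONFIG_TEMPLATE = Template(
    """\
compute:
  gpu: ${gpu_type}
  gpu_count: ${gpu_count}
script: ${script}
outputs:
  - /outputs/*
"""
)


def render_job_config(
    gpu_type: str, script: str = "train_hfjobs.py", gpu_count: int = 1
) -> str:
    """Fill in the template; raises KeyError if a field is left unset."""
    return JOB_CONFIG_TEMPLATE.substitute(
        gpu_type=gpu_type, gpu_count=gpu_count, script=script
    )


if __name__ == "__main__":
    # "a10g" is a placeholder identifier, not a confirmed value.
    print(render_job_config("a10g"), end="")
```

Redirect the output to `job_config.yaml` to get the same file the heredoc produces.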

### 6. Monitor Progress

```bash
huggingface-cli jobs status {job_id}
huggingface-cli jobs logs {job_id} --follow
```

If WandB is enabled, the dashboard is at `https://wandb.ai/{username}/funsloth-training`

### 7. Download Artifacts

```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="{username}/funsloth-job", local_dir="./outputs")
```

### 8. Handoff

Offer `funsloth-upload` for Hub upload with model card.

## Error Handling

| Error | Resolution |
|-------|------------|
| No HF Jobs access | Get a PRO subscription or org compute access |
| OOM | Reduce batch size or upgrade GPU |
| Job timeout | Enable checkpointing so the run can resume |
| Script error | Check PEP 723 dependencies |

## Bundled Resources

- [scripts/train_sft.py](scripts/train_sft.py) - PEP 723 script template
- [scripts/estimate_cost.py](scripts/estimate_cost.py) - Cost estimation
- [references/PLATFORM_COMPARISON.md](references/PLATFORM_COMPARISON.md) - HF Jobs vs alternatives
- [references/HARDWARE_GUIDE.md](references/HARDWARE_GUIDE.md) - VRAM requirements
- [references/TROUBLESHOOTING.md](references/TROUBLESHOOTING.md) - Common issues

Overview

This skill is a training manager that launches Unsloth fine-tuning jobs on Hugging Face Jobs with optional Weights & Biases monitoring. It packages a PEP 723 script, estimates cost, selects an appropriate GPU type, and streamlines job submission, monitoring, and artifact retrieval. It is designed to make cloud GPU fine-tuning fast and repeatable for models from 7B to 70B parameters.

How this skill works

You convert your training notebook into a PEP 723-compatible Python script that declares its dependencies and, optionally, initializes WandB. The skill helps pick the right GPU class, runs a cost estimate, generates an HF Jobs config, and submits the job via the Hugging Face CLI. It then provides commands for job status, live logs, WandB tracking, and downloading artifacts from the Hub when the run finishes.

When to use it

  • You need to fine-tune models on Hugging Face cloud GPUs rather than local hardware.
  • You want automated cost estimation before launching experiments.
  • You need integrated WandB monitoring for metrics and visualizations.
  • You require a repeatable PEP 723 script and job config for production training.
  • You want simple commands to monitor logs and retrieve model artifacts.

Best practices

  • Choose GPU based on model size: A10G for 7–14B, A100 40GB for 14–34B, A100 80GB or H100 for larger models.
  • Convert notebooks to PEP 723 script format and pin critical dependencies.
  • Enable WandB only if you have an API key set in the environment and need live dashboards.
  • Estimate costs with the provided estimator before committing long runs.
  • Enable checkpointing so a timed-out job can resume, and reduce the batch size if you hit OOM, to limit wasted spend.
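
When a smaller batch size is needed to avoid OOM, gradient accumulation can keep the effective batch size unchanged. The sketch below shows the arithmetic only; `accumulation_steps` is a hypothetical helper, and its inputs correspond to `per_device_train_batch_size` and `gradient_accumulation_steps` in `TrainingArguments`.

```python
import math


def accumulation_steps(
    target_effective_batch: int, per_device_batch: int, num_gpus: int = 1
) -> int:
    """Accumulation steps needed to reach the target effective batch size.

    effective batch = per_device_batch * num_gpus * accumulation steps
    """
    if min(target_effective_batch, per_device_batch, num_gpus) <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_effective_batch / (per_device_batch * num_gpus))


if __name__ == "__main__":
    # Dropping the per-device batch from 8 to 2 needs 4x the accumulation steps.
    print(accumulation_steps(32, 8))  # prints 4
    print(accumulation_steps(32, 2))  # prints 16
```

Accumulation trades memory for wall-clock time, so rerun the cost estimate after changing it.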

Example use cases

  • Fine-tune a 13B model using an A100 40GB instance with WandB experiment tracking.
  • Run multiple short hyperparameter sweeps on A10G to find optimal learning rates cost-effectively.
  • Migrate a local Colab notebook into a reproducible HF Jobs script for team training.
  • Recover artifacts and model checkpoints from a completed HF Jobs run to publish on the Hub.
  • Diagnose training failures by streaming job logs and checking common error resolutions.

FAQ

Do I need a Hugging Face PRO subscription to use this?

Yes, HF Jobs access typically requires PRO or organization compute permissions.

How do I enable WandB tracking?

Import wandb in your script, call wandb.init(project="funsloth-training"), set report_to="wandb" in TrainingArguments, and export WANDB_API_KEY in your environment.