
This skill helps you optimize and finetune large language models using Accelerate, DeepSpeed, TRL, Unsloth, and related techniques for scalable training.

npx playbooks add skill eyadsibai/ltk --skill llm-training

Copy the command above to add this skill to your agents, or review the SKILL.md contents below.

---
name: llm-training
description: Use when "training LLM", "finetuning", "RLHF", "distributed training", "DeepSpeed", "Accelerate", "PyTorch Lightning", "Ray Train", "TRL", "Unsloth", "LoRA training", "flash attention", "gradient checkpointing"
version: 1.0.0
---

# LLM Training

Frameworks and techniques for training and finetuning large language models.

## Framework Comparison

| Framework | Best For | Multi-GPU | Memory Efficient |
|-----------|----------|-----------|------------------|
| **Accelerate** | Simple distributed | Yes | Basic |
| **DeepSpeed** | Large models, ZeRO | Yes | Excellent |
| **PyTorch Lightning** | Clean training loops | Yes | Good |
| **Ray Train** | Scalable, multi-node | Yes | Good |
| **TRL** | RLHF, reward modeling | Yes | Good |
| **Unsloth** | Fast LoRA finetuning | Limited | Excellent |

---

## Accelerate (HuggingFace)

Minimal wrapper for distributed training. Run `accelerate config` for interactive setup.

**Key concept**: Wrap the model, optimizer, and dataloader with `accelerator.prepare()`, and call `accelerator.backward(loss)` instead of `loss.backward()`.
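
A minimal sketch of that pattern, using a toy model and dataset as stand-ins for a real LLM and tokenized corpus (the `Accelerator` picks up whatever settings `accelerate config` produced):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # reads the settings created by `accelerate config`

# Toy data and model stand in for a real tokenized dataset and LLM.
dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# prepare() moves everything to the right device(s) and wraps them for distributed training.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```

Launched with `accelerate launch train.py`, the same script runs unchanged on one GPU or many.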

---

## DeepSpeed (Large Models)

Microsoft's optimization library for training massive models.

**ZeRO Stages:**

- **Stage 1**: Optimizer states partitioned across GPUs
- **Stage 2**: + Gradients partitioned
- **Stage 3**: + Parameters partitioned (for largest models, 100B+)

**Key concept**: Configure via a JSON file; higher ZeRO stages save more memory per GPU but add communication overhead.
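
For illustration, a ZeRO-3 configuration sketch written as the Python-dict equivalent of the usual `ds_config.json` (field names follow DeepSpeed's documented schema; batch sizes and offload settings are placeholders):

```python
# Sketch of a ZeRO-3 config; normally stored as ds_config.json.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # 1/2/3 as described above
        "offload_optimizer": {"device": "cpu"},   # optional CPU offload for extra savings
        "overlap_comm": True,
    },
}

# With the HuggingFace Trainer this dict (or a path to the JSON file) is passed via
# TrainingArguments(deepspeed=ds_config); with raw DeepSpeed it goes to
# deepspeed.initialize(model=model, config=ds_config, ...).
```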

---

## TRL (RLHF/DPO)

HuggingFace library for alignment training: supervised finetuning, reward modeling, and RLHF-style preference optimization.

**Training types:**

- **SFT (Supervised Finetuning)**: Standard instruction tuning
- **DPO (Direct Preference Optimization)**: Simpler alternative to PPO-based RLHF; trains directly on preference pairs
- **PPO**: Classic RLHF with reward model

**Key concept**: DPO is often preferred over PPO: it is simpler and needs no reward model, only pairs of chosen and rejected responses.
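
A hedged sketch of DPO training with TRL. The model id and dataset are placeholders taken from TRL's examples, and argument names (e.g. `processing_class` vs. the older `tokenizer`) vary between TRL releases, so check the docs for your installed version:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# DPO expects preference data with "prompt", "chosen", and "rejected" columns.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

args = DPOConfig(output_dir="dpo-out", per_device_train_batch_size=2)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL versions use tokenizer=
)
trainer.train()
```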

---

## Unsloth (Fast LoRA)

Optimized LoRA finetuning: roughly 2x faster with about 60% less memory than standard implementations.

**Key concept**: Drop-in replacement for standard LoRA with automatic optimizations. Best for 7B-13B models.
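
A rough sketch of the Unsloth entry point (the checkpoint name and LoRA hyperparameters are illustrative; the patched model is then trained with a normal TRL or transformers trainer):

```python
from unsloth import FastLanguageModel

# Load a 4-bit quantized base model with Unsloth's patched kernels.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # illustrative checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; Unsloth applies its optimizations automatically.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```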

---

## Memory Optimization Techniques

| Technique | Memory Savings | Trade-off |
|-----------|---------------|-----------|
| **Gradient checkpointing** | ~30-50% | Slower training |
| **Mixed precision (fp16/bf16)** | ~50% | Minor precision loss |
| **4-bit quantization (QLoRA)** | ~75% | Some quality loss |
| **Flash Attention** | ~20-40% | Requires compatible GPU |
| **Gradient accumulation** | Larger effective batch at no extra memory | More steps per weight update |
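
Several of these can be stacked through standard `transformers` arguments. A hedged sketch (the model id and batch sizes are placeholders, and Flash Attention requires the `flash-attn` package plus a supported GPU):

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments

# 4-bit loading (QLoRA-style) plus Flash Attention, chosen at model load time.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # placeholder model id
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    attn_implementation="flash_attention_2",
)

# Checkpointing, mixed precision, and accumulation via TrainingArguments.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch of 16 without extra memory
    gradient_checkpointing=True,     # recompute activations: slower but leaner
    bf16=True,                       # use fp16=True on GPUs without bf16 support
)
```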

---

## Decision Guide

| Scenario | Recommendation |
|----------|----------------|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |
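
For the rows that recommend PEFT, parameter-efficient finetuning is typically set up with the `peft` library. A minimal sketch (the model id, rank, and target modules are illustrative and depend on the architecture):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections; adjust per model
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```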

## Resources

- Accelerate: <https://huggingface.co/docs/accelerate>
- DeepSpeed: <https://www.deepspeed.ai/>
- TRL: <https://huggingface.co/docs/trl>
- Unsloth: <https://github.com/unslothai/unsloth>

Overview

This skill helps engineers choose and apply frameworks and techniques for training and fine-tuning large language models (LLMs). It summarizes when to use Accelerate, DeepSpeed, PyTorch Lightning, Ray Train, TRL, and Unsloth, and highlights memory and performance optimizations. The guidance targets practical decisions for multi-GPU, multi-node, and RLHF workflows.

How this skill works

I compare frameworks by best use cases, multi-GPU support, and memory efficiency, and I outline the core concepts you must apply (e.g., ZeRO stages, accelerator.prepare, LoRA optimizations). I also list memory-saving techniques—gradient checkpointing, mixed precision, quantization, flash attention—and map scenarios to recommended stacks. The result is a concise decision guide and actionable tips for implementing training pipelines.

When to use it

  • Quick distributed finetuning on a single machine or small cluster — use Accelerate.
  • Training very large models (70B+) with extreme memory needs — use DeepSpeed ZeRO-3.
  • Clean, maintainable training loops and callbacks — use PyTorch Lightning.
  • Scale across many nodes or heterogeneous clusters — use Ray Train.
  • RLHF, preference tuning, or reward-based alignment — use TRL (DPO or PPO).
  • Fast, memory-efficient LoRA finetuning for 7B–13B models — use Unsloth.

Best practices

  • Start with a decision matrix: model size, available GPUs, and training objective determine the framework choice.
  • Combine techniques: use mixed precision + gradient checkpointing + flash attention for best memory/perf trade-offs.
  • For distributed runs, always validate the config on a small subset before a full-scale launch; run accelerate config or check the DeepSpeed JSON locally.
  • Prefer DPO for preference-based alignment when you can use chosen/rejected pairs; use PPO only when reward modeling is required.
  • Use PEFT/LoRA for parameter-efficient finetuning on large models; switch to ZeRO when memory or optimizer state becomes the bottleneck.

Example use cases

  • Fine-tune an instruction model with Accelerate and PEFT for chat assistants on a single multi-GPU server.
  • Train a 70B model across 8+ GPUs with DeepSpeed ZeRO-3 to fit parameters that would otherwise exceed single-GPU memory.
  • Run RLHF alignment with TRL using DPO on preference data to improve responses without a reward model.
  • Perform fast LoRA tuning on a 13B model with Unsloth to iterate quickly on instruction datasets.
  • Run multi-node training or hyperparameter sweeps with Ray Train when large datasets or preprocessing need to be distributed across a cluster.

FAQ

When should I choose DPO over PPO for RLHF?

Choose DPO when you have preference pairs (chosen vs rejected) and want a simpler, more stable optimization. Use PPO if you need to train with a learned reward model or more complex policy constraints.

How do ZeRO stages trade memory and communication?

Higher ZeRO stages partition more states (optimizer, gradients, parameters) to reduce per-GPU memory. This lowers memory but increases inter-GPU communication and configuration complexity; stage choice depends on model size and network topology.