This skill helps you optimize and finetune large language models using Accelerate, DeepSpeed, TRL, Unsloth, and related techniques for scalable training.
`npx playbooks add skill eyadsibai/ltk --skill llm-training`
---
name: llm-training
description: Use when "training LLM", "finetuning", "RLHF", "distributed training", "DeepSpeed", "Accelerate", "PyTorch Lightning", "Ray Train", "TRL", "Unsloth", "LoRA training", "flash attention", "gradient checkpointing"
version: 1.0.0
---
# LLM Training
Frameworks and techniques for training and finetuning large language models.
## Framework Comparison
| Framework | Best For | Multi-GPU | Memory Efficient |
|-----------|----------|-----------|------------------|
| **Accelerate** | Simple distributed | Yes | Basic |
| **DeepSpeed** | Large models, ZeRO | Yes | Excellent |
| **PyTorch Lightning** | Clean training loops | Yes | Good |
| **Ray Train** | Scalable, multi-node | Yes | Good |
| **TRL** | RLHF, reward modeling | Yes | Good |
| **Unsloth** | Fast LoRA finetuning | Limited | Excellent |
---
## Accelerate (HuggingFace)
Minimal wrapper for distributed training. Run `accelerate config` for interactive setup.
**Key concept**: Wrap the model, optimizer, and dataloader with `accelerator.prepare()`, and call `accelerator.backward(loss)` instead of `loss.backward()`.
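A minimal, runnable sketch of the pattern; a toy linear model stands in for an LLM here, but the `prepare()`/`backward()` calls are the same for transformer models:

```python
# Minimal Accelerate training loop. A toy regression model keeps the sketch
# self-contained; swap in a HuggingFace model and its .loss for real training.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # picks up settings from `accelerate config`

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
dataloader = DataLoader(dataset, batch_size=32)

# One prepare() call moves everything to the right device(s) and wraps them
# for whatever distributed setup was configured.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward(); handles scaling / mixed precision
    optimizer.step()
```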
---
## DeepSpeed (Large Models)
Microsoft's optimization library for training massive models.
**ZeRO Stages:**
- **Stage 1**: Optimizer states partitioned across GPUs
- **Stage 2**: + Gradients partitioned
- **Stage 3**: + Parameters partitioned (for largest models, 100B+)
**Key concept**: Configure via a JSON file; higher stages give greater memory savings at the cost of more communication overhead.
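A sketch of a ZeRO-3 config written as a Python dict (normally stored as `ds_config.json`); the specific values are illustrative, not tuned recommendations:

```python
# Illustrative ZeRO-3 configuration. "auto" values are resolved by the
# HuggingFace integration; set them explicitly for a standalone DeepSpeed run.
import json

ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                      # partition optimizer states, gradients, and parameters
        "overlap_comm": True,            # overlap communication with computation
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# The file can then be referenced from HuggingFace
# TrainingArguments(deepspeed="ds_config.json") or from Accelerate's DeepSpeed integration.
```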
---
## TRL (RLHF/DPO)
HuggingFace library for reinforcement learning from human feedback.
**Training types:**
- **SFT (Supervised Finetuning)**: Standard instruction tuning
- **DPO (Direct Preference Optimization)**: Simpler than RLHF, uses preference pairs
- **PPO**: Classic RLHF with reward model
**Key concept**: DPO is often preferred over PPO: it is simpler, needs no separate reward model, and trains directly on chosen/rejected response pairs.
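A hedged sketch of DPO training with TRL; the model id and the toy preference pair are placeholders, and argument names (e.g. `processing_class` vs the older `tokenizer`) differ across TRL versions:

```python
# DPO sketch: each training example is a prompt plus a chosen and a rejected completion.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; any causal LM works
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_dataset = Dataset.from_list([
    {"prompt": "Explain ZeRO-3 in one sentence.",
     "chosen": "ZeRO-3 shards optimizer states, gradients, and parameters across GPUs.",
     "rejected": "ZeRO-3 makes your GPU bigger."},
])

args = DPOConfig(output_dir="dpo-out", beta=0.1, per_device_train_batch_size=1)
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # older TRL releases use tokenizer=tokenizer instead
)
trainer.train()
```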
---
## Unsloth (Fast LoRA)
Optimized LoRA finetuning: Unsloth advertises roughly 2x faster training with about 60% less memory than standard implementations.
**Key concept**: Drop-in replacement for standard LoRA with automatic optimizations. Best for 7B-13B models.
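A sketch of the typical Unsloth setup; the checkpoint name and LoRA hyperparameters are illustrative, and exact defaults vary by release:

```python
# Load a pre-quantized 4-bit base model and attach LoRA adapters via Unsloth's
# patched, faster implementation.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # placeholder 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient checkpointing
)
# The resulting model can be trained with TRL's SFTTrainer or a plain PyTorch loop.
```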
---
## Memory Optimization Techniques
| Technique | Memory Savings | Trade-off |
|-----------|---------------|-----------|
| **Gradient checkpointing** | ~30-50% | Slower training |
| **Mixed precision (fp16/bf16)** | ~50% | Minor precision loss |
| **4-bit quantization (QLoRA)** | ~75% | Some quality loss |
| **Flash Attention** | ~20-40% | Requires compatible GPU |
| **Gradient accumulation** | None (larger effective batch at fixed memory) | More steps per optimizer update |
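A sketch showing how several of these techniques combine in a QLoRA-style setup with `transformers` + `peft` + `bitsandbytes`; the model id and hyperparameters are placeholders, and Flash Attention requires a supported GPU plus the `flash-attn` package:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(              # 4-bit quantization (QLoRA)
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",                # placeholder model id
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # Flash Attention, if available
)
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()         # trade extra compute for activation memory

model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))

args = TrainingArguments(
    output_dir="out",
    bf16=True,                                # mixed precision
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,           # larger effective batch without extra memory
)
# `model` and `args` would then be passed to a Trainer / SFTTrainer with a dataset.
```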
---
## Decision Guide
| Scenario | Recommendation |
|----------|----------------|
| Simple finetuning | Accelerate + PEFT |
| 7B-13B models | Unsloth (fastest) |
| 70B+ models | DeepSpeed ZeRO-3 |
| RLHF/DPO alignment | TRL |
| Multi-node cluster | Ray Train |
| Clean code structure | PyTorch Lightning |
## Resources
- Accelerate: <https://huggingface.co/docs/accelerate>
- DeepSpeed: <https://www.deepspeed.ai/>
- TRL: <https://huggingface.co/docs/trl>
- Unsloth: <https://github.com/unslothai/unsloth>
## About This Skill
This skill helps engineers choose and apply frameworks and techniques for training and finetuning large language models. It summarizes when to use Accelerate, DeepSpeed, PyTorch Lightning, Ray Train, TRL, and Unsloth, and highlights memory and performance optimizations, targeting practical decisions for multi-GPU, multi-node, and RLHF workflows.
It compares the frameworks by best use case, multi-GPU support, and memory efficiency, and outlines the core concepts to apply (ZeRO stages, `accelerator.prepare()`, LoRA optimizations). It also lists memory-saving techniques such as gradient checkpointing, mixed precision, quantization, and flash attention, and maps common scenarios to recommended stacks, ending in a concise decision guide with actionable tips for building training pipelines.
## FAQ
**When should I choose DPO over PPO for RLHF?**
Choose DPO when you have preference pairs (chosen vs. rejected responses) and want a simpler, more stable optimization. Use PPO if you need to train with a learned reward model or enforce more complex policy constraints.
**How do ZeRO stages trade memory for communication?**
Higher ZeRO stages partition more state (optimizer states, gradients, parameters) across GPUs to reduce per-GPU memory. This lowers memory use but increases inter-GPU communication and configuration complexity; the right stage depends on model size and network topology.