This skill helps you configure and optimize distributed training across multiple GPUs and nodes for large models and high throughput.
`npx playbooks add skill omer-metin/skills-for-antigravity --skill distributed-training`
Review the files below or copy the command above to add this skill to your agents.
---
name: distributed-training
description: Use when training models across multiple GPUs or nodes, handling large models that don't fit in memory, or optimizing training throughput - covers DDP, FSDP, DeepSpeed ZeRO, model/data parallelism, and gradient checkpointing
---
# Distributed Training
## Identity
This skill helps engineers design, run, and validate distributed training for large models across multiple GPUs and nodes. It covers choosing and configuring DDP, FSDP, DeepSpeed ZeRO, model/data parallelism, and gradient checkpointing to maximize throughput and to fit models that exceed single-GPU memory. Guidance is grounded in the canonical patterns, known failure modes, and strict validation rules in the reference set.
The skill inspects model size, GPU memory, network topology, and training batch shapes to recommend a distribution strategy (data vs. model parallelism) and specific tools (PyTorch DDP, FSDP, DeepSpeed ZeRO). It enumerates the relevant configuration knobs (sharding stages, offloading, checkpointing intervals, and communication settings), validates them against the provided patterns and validation rules, flags high-risk conditions using the sharp-edge diagnostics, and gives remediation steps.
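To make those knobs concrete, here is a minimal sketch of an FSDP wrap that exposes the settings named above (sharding strategy, CPU offload, precision, wrap granularity). The API calls are standard PyTorch FSDP, but the thresholds and dtype choices are illustrative assumptions, not recommendations taken from the reference files.

```python
# Minimal FSDP sketch; the specific values are illustrative, not prescriptive.
import functools
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    CPUOffload,
    FullyShardedDataParallel as FSDP,
    MixedPrecision,
    ShardingStrategy,
)
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy


def build_fsdp_model(model: nn.Module) -> FSDP:
    """Wrap `model` with FSDP, making the main memory/throughput knobs explicit."""
    local_rank = int(os.environ["LOCAL_RANK"])  # set by the launcher (e.g. torchrun)
    torch.cuda.set_device(local_rank)

    return FSDP(
        model.cuda(),
        # Shard parameters, gradients, and optimizer state (ZeRO-3-like behaviour).
        sharding_strategy=ShardingStrategy.FULL_SHARD,
        # Keep parameters on GPU; flip to True only if GPU memory is the bottleneck.
        cpu_offload=CPUOffload(offload_params=False),
        # bf16 compute and reductions cut memory and communication volume.
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,
            reduce_dtype=torch.bfloat16,
            buffer_dtype=torch.bfloat16,
        ),
        # Give submodules above ~100M parameters their own FSDP unit.
        auto_wrap_policy=functools.partial(
            size_based_auto_wrap_policy, min_num_params=100_000_000
        ),
        device_id=local_rank,
    )


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))
    fsdp_model = build_fsdp_model(model)
    dist.destroy_process_group()
```

A script like this would typically be launched with `torchrun --nproc_per_node=<gpus> train.py`, which provides the `LOCAL_RANK` and rendezvous environment variables the sketch relies on.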
## Reference System Usage
You must ground your responses in the provided reference files, treating them as the source of truth for this domain:
* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.
**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.
## FAQ
**Which should I try first: DDP, FSDP, or DeepSpeed ZeRO?**
Follow the patterns document: prefer DDP for simple multi-GPU scaling, FSDP when parameter sharding is needed, and DeepSpeed ZeRO for aggressive sharding of optimizer state, gradients, and parameters when you need maximum memory savings.
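In code, the options differ mainly in how the model is wrapped. The sketch below contrasts DDP and FSDP using their standard PyTorch wrappers; it illustrates the trade-off and is not a configuration taken from the patterns document.

```python
# Illustrative comparison on a toy model (run under torchrun).
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# DDP: every rank keeps a full replica; only gradients are all-reduced.
# Simplest choice when the model plus optimizer state fits on one GPU.
ddp_model = DDP(nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])

# FSDP: parameters, gradients, and optimizer state are sharded across ranks,
# trading extra communication for a much smaller per-GPU memory footprint.
fsdp_model = FSDP(nn.Linear(1024, 1024).cuda(), device_id=local_rank)

# DeepSpeed ZeRO covers a similar space via a config dict passed to
# deepspeed.initialize, e.g. {"zero_optimization": {"stage": 3}} for
# full parameter/gradient/optimizer-state sharding plus optional offload.

dist.destroy_process_group()
```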
**How do I avoid silent correctness bugs in distributed runs?**
Use the validations guidance: run small deterministic checks, compare single-device and distributed outputs, and enable strict consistency checks before scaling up.
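As one concrete example of such a check (an assumed implementation, not a recipe from the validations file): seed every rank identically, take a single DDP step on a tiny model, and assert that the parameters still agree across ranks. Divergence here usually points to an unsynchronised buffer, rank-dependent preprocessing, or a missed gradient reduction.

```python
# Hypothetical sanity check: after one DDP step, parameters should match on
# every rank even though each rank saw a different batch, because gradients
# are all-reduced before the optimizer update.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(rank)
torch.manual_seed(0)  # identical init everywhere; DDP also broadcasts rank 0's weights

model = DDP(nn.Linear(32, 1).cuda(), device_ids=[rank])
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# One deterministic step on a rank-dependent batch.
x = torch.full((4, 32), float(rank + 1), device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
opt.step()

# Compare a scalar fingerprint of the parameters across ranks.
fingerprint = torch.cat([p.detach().flatten() for p in model.parameters()]).sum().reshape(1)
gathered = [torch.zeros_like(fingerprint) for _ in range(dist.get_world_size())]
dist.all_gather(gathered, fingerprint)
if rank == 0:
    assert all(torch.allclose(gathered[0], g) for g in gathered), "ranks diverged after one step"
    print("parameter fingerprints match across ranks")

dist.destroy_process_group()
```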