
distributed-training skill


This skill helps you configure and optimize distributed training across multiple GPUs and nodes for large models and high throughput.

npx playbooks add skill omer-metin/skills-for-antigravity --skill distributed-training

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
---
name: distributed-training
description: Use when training models across multiple GPUs or nodes, handling large models that don't fit in memory, or optimizing training throughput - covers DDP, FSDP, DeepSpeed ZeRO, model/data parallelism, and gradient checkpointing
---

# Distributed Training

## Identity



## Reference System Usage

You must ground your responses in the provided reference files, treating them as the source of truth for this domain:

* **For Creation:** Always consult **`references/patterns.md`**. This file dictates *how* things should be built. Ignore generic approaches if a specific pattern exists here.
* **For Diagnosis:** Always consult **`references/sharp_edges.md`**. This file lists the critical failures and "why" they happen. Use it to explain risks to the user.
* **For Review:** Always consult **`references/validations.md`**. This contains the strict rules and constraints. Use it to validate user inputs objectively.

**Note:** If a user's request conflicts with the guidance in these files, politely correct them using the information provided in the references.

Overview

This skill helps engineers design, run, and validate distributed training for large models across multiple GPUs and nodes. It covers choosing and configuring DDP, FSDP, DeepSpeed ZeRO, model/data parallelism, and gradient checkpointing to maximize throughput and fit models that exceed single-GPU memory. Guidance is grounded in the canonical patterns, known failure modes, and strict validation rules from the reference set.

How this skill works

The skill inspects model size, GPU memory, network topology, and training batch shapes to recommend a distribution strategy (data vs model parallelism) and specific tools (PyTorch DDP, FSDP, DeepSpeed ZeRO). It enumerates configuration knobs—sharding stages, offloading, checkpointing intervals, and communication settings—and validates them against the provided patterns and validation rules. It flags high-risk conditions using sharp-edge diagnostics and gives remediation steps.
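
As a rough illustration of that decision process, the sketch below maps model size and per-GPU memory to a strategy. The thresholds and memory estimate are illustrative assumptions, not values taken from the reference files; `references/patterns.md` remains the source of truth.

```python
# Illustrative heuristic only -- thresholds are assumptions, not canonical values.

def recommend_strategy(param_count: int, gpu_mem_gb: float, num_nodes: int) -> str:
    """Rough mapping from model size and hardware to a parallelism strategy."""
    # ~16 bytes/param approximates fp16 weights + gradients + Adam optimizer state.
    est_train_mem_gb = param_count * 16 / 1e9

    if est_train_mem_gb <= gpu_mem_gb * 0.7:
        # Replicated model fits comfortably on one device: plain data parallelism.
        return "DDP"
    if num_nodes == 1:
        # Shard parameters, gradients, and optimizer state across local GPUs.
        return "FSDP (full sharding) + activation checkpointing"
    # Multi-node and memory-bound: aggressive sharding, optionally with offload.
    return "DeepSpeed ZeRO Stage 3 (consider CPU offload)"

print(recommend_strategy(7_000_000_000, 80.0, 1))   # e.g. 7B params on one 80 GB node
print(recommend_strategy(70_000_000_000, 80.0, 4))  # e.g. 70B params across 4 nodes
```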

When to use it

  • Training models that do not fit on a single GPU or that require more total memory than available on one node.
  • Scaling training across multiple GPUs or compute nodes to increase throughput or reduce time-to-solution.
  • Implementing memory-saving techniques like ZeRO, FSDP, or activation/gradient checkpointing to enable larger batch sizes or models (see the checkpointing sketch after this list).
  • Troubleshooting unstable training, out-of-memory errors, or poor scaling efficiency in multi-GPU setups.
  • Preparing deployment-ready checkpoints and sharded state for inference or continued training.
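
For the checkpointing bullet above, here is a minimal activation-checkpointing sketch using `torch.utils.checkpoint`. The module sizes and class name are illustrative; consult `references/patterns.md` for the canonical pattern.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlockStack(nn.Module):
    """Illustrative stack of blocks whose activations are recomputed in backward."""

    def __init__(self, dim: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Activations inside `block` are not stored; they are recomputed
            # during backward, trading extra compute for lower peak memory.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedBlockStack()
out = model(torch.randn(4, 1024, requires_grad=True))
out.sum().backward()
```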

Best practices

  • Start with the reference patterns to select the minimal complexity option that satisfies memory and performance requirements.
  • Run small-scale validation using the validation rules to catch configuration and correctness issues before full cluster runs.
  • Measure and tune communication overhead: prefer NCCL, align batch sizes to reduce synchronization frequency, and use gradient accumulation when needed (a gradient-accumulation sketch follows this list).
  • Use checkpoint sharding and offload only after validating consistency with the strict validation rules.
  • Automate failure detection described in the sharp edges document and include deterministic seeds for reproducibility.
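
The gradient-accumulation sketch referenced above uses DDP's `no_sync()` so that the all-reduce fires only once per optimizer step. Distributed initialization, the model, and the data are elided; the function name and loss are illustrative assumptions.

```python
import contextlib
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def accumulation_step(ddp_model: DDP, optimizer, micro_batches) -> None:
    """Accumulate gradients over micro-batches, all-reducing only once."""
    optimizer.zero_grad(set_to_none=True)
    n = len(micro_batches)
    for i, (x, y) in enumerate(micro_batches):
        is_last = (i + 1) == n
        # no_sync() suppresses DDP's gradient all-reduce for intermediate
        # micro-batches; only the final backward triggers communication.
        ctx = contextlib.nullcontext() if is_last else ddp_model.no_sync()
        with ctx:
            loss = F.mse_loss(ddp_model(x), y) / n
            loss.backward()
    optimizer.step()
```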

Example use cases

  • Training a 100B-parameter transformer by combining FSDP sharding with activation checkpointing and CPU offload (see the FSDP sketch after this list).
  • Switching from single-node DDP to multi-node DeepSpeed ZeRO Stage 3 to reduce peak GPU memory and enable larger batches.
  • Diagnosing and fixing training stalls or non-deterministic validation metrics caused by incorrect gradient synchronization.
  • Designing a reproducible CI test that validates distributed training configs against the validations specification.
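
A minimal sketch for the first use case wraps a transformer-style model in FSDP with parameter CPU offload. The block class, wrap policy granularity, and dimensions are illustrative assumptions; it presumes the process group is already initialized (e.g. under torchrun), and activation checkpointing would be layered on top per `references/patterns.md`.

```python
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, CPUOffload
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

class MyTransformerBlock(nn.Module):
    """Placeholder transformer block used as the FSDP sharding unit."""

    def __init__(self, dim: int = 2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=16, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        return x + self.mlp(x + attn_out)

def shard_model(model: nn.Module) -> FSDP:
    # Wrap each transformer block as its own FSDP unit so parameters are
    # gathered only while that block is executing.
    wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={MyTransformerBlock},
    )
    return FSDP(
        model,
        auto_wrap_policy=wrap_policy,
        cpu_offload=CPUOffload(offload_params=True),  # park sharded params in host RAM
    )
```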

FAQ

Which should I try first: DDP, FSDP, or DeepSpeed ZeRO?

Follow the patterns document: prefer DDP for simple multi-GPU scaling, FSDP when parameter sharding is needed, and DeepSpeed ZeRO for aggressive optimizer/state sharding and maximum memory savings.

How do I avoid silent correctness bugs in distributed runs?

Use the validations guidance: run small deterministic checks, compare single-device and distributed outputs, and enable strict consistency checks before scaling up.
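
One concrete form such a check can take, sketched under the assumption that `torch.distributed` is already initialized and with an illustrative tolerance and function name, is asserting that a deterministic probe batch produces the same loss on every rank:

```python
import torch
import torch.distributed as dist

def assert_loss_consistent(model, probe_batch, tol: float = 1e-6) -> None:
    """Fail fast if ranks disagree on the loss for an identical probe batch."""
    x, y = probe_batch
    with torch.no_grad():
        loss = torch.nn.functional.mse_loss(model(x), y).item()

    losses = [None] * dist.get_world_size()
    dist.all_gather_object(losses, loss)  # collect every rank's scalar loss

    if dist.get_rank() == 0:
        spread = max(losses) - min(losses)
        assert spread < tol, f"rank disagreement on probe loss: {losses}"
```

The same probe batch can also be run through a single-process copy of the model and compared against the rank-0 value to catch divergence from the non-distributed baseline.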