
This skill helps you train GPT-2 scale models efficiently on a single GPU using the Muon optimizer, mixed precision, and tokenized data loading.

npx playbooks add skill benchflow-ai/skillsbench --skill nanogpt-training

---
name: nanogpt-training
description: Train GPT-2 scale models (~124M parameters) efficiently on a single GPU. Covers GPT-124M architecture, tokenized dataset loading (e.g., HuggingFace Hub shards), modern optimizers (Muon, AdamW), mixed precision training, and training loop implementation.
---

# NanoGPT Training

## Overview

This skill covers training GPT-2 scale models (~124M parameters) efficiently on a single GPU. It provides:

- **GPT-124M Architecture**: Standard transformer with RoPE and modern optimizations
- **Tokenized Datasets**: Loading pre-tokenized shards from HuggingFace Hub or local files
- **Modern Optimizers**: Muon optimizer with Newton-Schulz orthogonalization (sketched below)
- **Mixed Precision**: bfloat16 training on A100 for a ~2x speedup
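
A minimal sketch of the quintic Newton-Schulz iteration Muon uses to approximately orthogonalize a 2D gradient update; the coefficients below follow the modded-nanogpt implementation, and the function name is our own (see [Optimizers](references/optimizers.md) for the full optimizer):

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D gradient matrix via a quintic
    Newton-Schulz iteration (coefficients as in modded-nanogpt's Muon)."""
    assert G.ndim == 2
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G.bfloat16()
    X = X / (X.norm() + 1e-7)            # normalize so the iteration converges
    transposed = G.size(0) > G.size(1)   # work on the "wide" orientation
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transposed:
        X = X.T
    return X
```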

Training options:
- **Baseline GPT**: Standard residual connections
- **Experimental residual variants**: Optional alternative residual schemes for stability/efficiency

## Quick Reference

| Topic | Reference |
|-------|-----------|
| Model Architecture | [GPT Architecture](references/gpt-architecture.md) |
| Data Loading | [Tokenized Data](references/tokenized-data.md) |
| Optimizers | [Optimizers](references/optimizers.md) |
| Training Loop | [Training Loop](references/training-loop.md) |
| Hyperparameters | [Hyperparameters](references/hyperparameters.md) |

## Installation

```bash
pip install torch einops numpy huggingface_hub
# the Minimal Example below also needs the Modal client:
pip install modal
```

## Minimal Example

```python
import modal

app = modal.App("gpt-training")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch", "einops", "numpy", "huggingface_hub"
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch
    from dataclasses import dataclass

    @dataclass
    class GPTConfig:
        block_size: int = 1024
        vocab_size: int = 50257
        n_layer: int = 12
        n_head: int = 12
        n_embd: int = 768
        dropout: float = 0.0
        bias: bool = False

    # Download data, build model, and run the training loop
    # ... (see references for the full implementation)
    final_loss = None  # placeholder: set to the last measured loss in your training loop

    return {"final_loss": final_loss}

@app.local_entrypoint()
def main():
    results = train.remote()
    print(results)
```
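
If the example is saved locally as, say, `train.py`, it can be launched with `modal run train.py` once the Modal CLI is installed and configured; the dictionary returned by `train()` is printed by the local entrypoint.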

## Common Imports

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler  # note: torch.cuda.amp is deprecated in recent PyTorch in favor of torch.amp
from dataclasses import dataclass
from einops import rearrange, repeat, reduce
import numpy as np
import math
```
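
Assuming the imports above, a single bfloat16 training step typically looks like the sketch below; `model`, `optimizer`, and the `(x, y)` batch are placeholders, the model is assumed to return logits, and GradScaler is only needed for fp16:

```python
def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               x: torch.Tensor, y: torch.Tensor) -> float:
    # Forward pass under bf16 autocast; parameters stay in fp32.
    with autocast(dtype=torch.bfloat16):
        logits = model(x)  # placeholder model returning (B, T, vocab_size) logits
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()                                           # no GradScaler needed for bf16
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # clip for stability
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()
```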

## When to Use What

| Scenario | Approach |
|----------|----------|
| Standard GPT training | Use baseline model with standard residuals |
| Stability experiments | Try alternative residual variants or extra streams |
| Small experiments | Use T4/A10G GPU |
| Full training | Use A100 with bfloat16 |
| Custom data | Modify the dataset loader class |
| Different model size | Adjust GPTConfig parameters (example below) |
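
For example, a scaled-down configuration for quick T4/A10G experiments, reusing the GPTConfig dataclass from the Minimal Example (the values below are illustrative, not tuned):

```python
# Illustrative, untuned config for a smaller GPU; assumes the GPTConfig dataclass above.
small_cfg = GPTConfig(block_size=512, n_layer=6, n_head=6, n_embd=384)
```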

## Metrics to Monitor

| Metric | Typical Signal | Notes |
|--------|----------------|-------|
| Validation loss | Steady decrease | Absolute value depends on dataset/tokenizer |
| Grad norm | Moderate, stable range | Large spikes indicate instability |
| Training stability | Smooth curves | Frequent spikes suggest LR/batch issues |
| Throughput | Consistent tokens/sec | Use for comparing configs |
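
A minimal sketch of collecting these metrics once per step; it assumes it is called after `backward()` and before `zero_grad()`, and that `tokens_per_step = batch_size * block_size`:

```python
import torch

def step_metrics(model: torch.nn.Module, loss: torch.Tensor,
                 tokens_per_step: int, step_time_s: float) -> dict:
    # Total L2 norm of all parameter gradients (measurement only, no clipping).
    grads = [p.grad.detach().norm() for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack(grads)).item() if grads else 0.0
    return {
        "loss": loss.item(),
        "grad_norm": grad_norm,
        "tokens_per_sec": tokens_per_step / step_time_s,
    }
```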

## External Resources

- nanoGPT: https://github.com/karpathy/nanoGPT
- build-nanogpt: https://github.com/karpathy/build-nanogpt
- modded-nanogpt: https://github.com/KellerJordan/modded-nanogpt
- FineWeb-Edu token shards: https://huggingface.co/datasets/karpathy/fineweb-edu-100B-gpt2-token-shards

Overview

This skill teaches efficient training of GPT-2 scale models (~124M parameters) on a single GPU. It covers the GPT-124M transformer implementation, tokenized dataset loading (including HuggingFace Hub shards), modern optimizers, and mixed-precision training for large speedups. Practical guidance and a minimal runnable example are included to get a training run started quickly.

How this skill works

The skill implements a compact GPT architecture with rotary positional embeddings and modern stability tweaks. It loads pre-tokenized shards or local token files into memory-efficient batches, runs forward/backward passes with mixed precision (bfloat16 or fp16 via amp), and updates parameters using modern optimizers such as Muon or AdamW. The training loop includes gradient scaling (needed only for fp16), checkpointing, and metrics collection (loss, grad norm, throughput).
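
As a rough sketch of the batching step, assuming each shard is a flat binary file of uint16 GPT-2 token ids (adjust the dtype and any header handling to the shard format you actually download):

```python
import numpy as np
import torch

def get_batch(shard_path: str, batch_size: int, block_size: int, device: str = "cuda"):
    # Memory-map the shard so only the sampled windows are read from disk.
    tokens = np.memmap(shard_path, dtype=np.uint16, mode="r")
    ix = torch.randint(len(tokens) - block_size - 1, (batch_size,)).tolist()
    x = torch.stack([torch.from_numpy(tokens[i:i + block_size].astype(np.int64)) for i in ix])
    y = torch.stack([torch.from_numpy(tokens[i + 1:i + 1 + block_size].astype(np.int64)) for i in ix])
    return x.to(device, non_blocking=True), y.to(device, non_blocking=True)
```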

When to use it

  • Training a GPT-2 sized model (~124M params) on a single GPU for prototyping or fine-tuning
  • Running experiments that require fast iterations with mixed precision on A100 or similar hardware
  • Testing optimizer variants (Muon, AdamW) or residual connection experiments for stability
  • Training from pre-tokenized datasets downloaded from HuggingFace Hub or local shard files
  • Small-scale production runs where throughput and memory efficiency matter

Best practices

  • Use bfloat16 on A100 for ~2x speedup and stable convergence; on GPUs without bfloat16 support, fall back to fp16 autocast with gradient scaling
  • Pre-tokenize and shard datasets to avoid runtime tokenization overhead and simplify I/O
  • Monitor validation loss, grad norm, and throughput; address instability by reducing LR or switching residual variants
  • Enable periodic checkpointing and save optimizer state to resume interrupted runs reliably (see the sketch after this list)
  • Tune batch size and sequence length to maximize tokens/sec while keeping GPU memory headroom
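
A minimal checkpointing sketch; the dictionary keys and argument names are illustrative:

```python
import torch

def save_checkpoint(path: str, model, optimizer, step: int) -> None:
    # Persist model and optimizer state so an interrupted run can resume.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(path: str, model, optimizer) -> int:
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```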

Example use cases

  • Train a baseline GPT-124M model on a domain-specific corpus for downstream fine-tuning
  • Compare Muon vs AdamW optimizer performance on stability and final loss in controlled experiments
  • Run quick iterations of architectural residual variants to measure training stability
  • Fine-tune on pre-tokenized HuggingFace shards for faster dataset loading and predictable throughput
  • Prototype training on a T4/A10G and scale to A100 for full runs with bfloat16

FAQ

Can I train this model on a single consumer GPU (e.g., RTX 3090)?

Yes, with reduced batch sizes and/or shorter sequence lengths, though expect lower throughput than on an A100. Ampere-class consumer cards such as the RTX 3090 support bfloat16; on older cards, fall back to fp16 with gradient scaling.

Do I need pre-tokenized data?

Pre-tokenized shards are recommended to reduce I/O and CPU tokenization overhead, but runtime tokenization can be integrated if necessary.
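
If you do use Hub-hosted shards, a download step with huggingface_hub might look like the sketch below; the shard filename is a placeholder, so check the dataset page for the real names:

```python
from huggingface_hub import hf_hub_download

# Fetch a single pre-tokenized shard from the Hub (filename is hypothetical;
# browse the dataset repo on huggingface.co to find the actual shard names).
shard_path = hf_hub_download(
    repo_id="karpathy/fineweb-edu-100B-gpt2-token-shards",
    filename="shard_00000.bin",  # placeholder name
    repo_type="dataset",
)
print(shard_path)  # local cache path to the downloaded shard
```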