---
name: torchforge-rl-training
description: Provides guidance for PyTorch-native agentic RL using torchforge, Meta's library separating infra from algorithms. Use when you want clean RL abstractions, easy algorithm experimentation, or scalable training with Monarch and TorchTitan.
version: 1.0.0
author: Orchestra Research
license: MIT
tags: [Reinforcement Learning, PyTorch, GRPO, SFT, Monarch, TorchTitan, Meta]
dependencies: [torch>=2.9.0, torchtitan>=0.2.0, vllm, monarch]
---
# torchforge: PyTorch-Native Agentic RL Library
torchforge is Meta's PyTorch-native RL library that separates infrastructure concerns from algorithm concerns. It enables rapid RL research: you focus on the algorithm while the library handles distributed training, inference, and weight synchronization.
## When to Use torchforge
**Choose torchforge when you need:**
- Clean separation between RL algorithms and infrastructure
- PyTorch-native abstractions (no Ray dependency)
- Easy algorithm experimentation (GRPO, DAPO, SAPO in ~100 lines)
- Scalable training with Monarch actor system
- Integration with TorchTitan for model parallelism
**Consider alternatives when:**
- You need production-ready stability → use **miles** or **verl**
- You want Megatron-native training → use **slime**

Note that torchforge is experimental and its APIs may change.
## Key Features
- **Algorithm isolation**: Implement RL algorithms without touching infrastructure
- **Scalability**: From single GPU to thousands via Monarch
- **Modern stack**: TorchTitan (training), vLLM (inference), TorchStore (sync)
- **Loss functions**: GRPO, DAPO, CISPO, GSPO, SAPO built-in
## Architecture Overview
```
┌─────────────────────────────────────────────────────────┐
│              Application Layer (Your Code)              │
│  - Define reward models, loss functions, sampling       │
└────────────────────────────┬────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────┐
│                     Forge API Layer                     │
│  - Episode, Group dataclasses                           │
│  - Service interfaces (async/await)                     │
└────────────────────────────┬────────────────────────────┘
                             │
┌────────────────────────────▼────────────────────────────┐
│             Distributed Services (Monarch)              │
│  ├── Trainer (TorchTitan FSDP)                          │
│  ├── Generator (vLLM inference)                         │
│  ├── Reference Model (frozen KL baseline)               │
│  └── Reward Actors (compute rewards)                    │
└─────────────────────────────────────────────────────────┘
```
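To make the separation concrete, an application typically drives a rollout/training loop like the sketch below. This is illustrative only: the service handles and the `generate`/`compute_logprobs` methods are assumptions, while `trainer.train_step(inputs, targets)` and the `Completion` fields (`text`, `token_ids`) match the API shown later in this skill.

```python
# Schematic GRPO-style loop; service handles and the generate/compute_logprobs
# methods are assumptions, not forge's exact API.
import torch

async def train_loop(dataset, generator, ref_model, reward_fn, trainer, n_samples=8):
    for prompt, target in dataset:
        # 1. Sample a group of responses from the vLLM-backed generator service
        completions = await generator.generate(prompt, n=n_samples)

        # 2. Score responses and normalize rewards within the group (GRPO)
        rewards = torch.tensor([reward_fn(prompt, c.text, target) for c in completions])
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

        # 3. Reference log-probs for the KL baseline come from the frozen ref model
        ref_logprobs = await ref_model.compute_logprobs(completions)

        # 4. The trainer service handles sharding, optimization, and weight sync
        inputs = [{"tokens": torch.tensor(c.token_ids)} for c in completions]
        targets = [...]  # response tokens, ref_logprobs, advantages, padding_mask
        loss = trainer.train_step(inputs, targets)
```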
## Installation
```bash
# Create environment
conda create -n forge python=3.12
conda activate forge

# Clone the repository and install (handles PyTorch nightly + dependencies)
git clone https://github.com/meta-pytorch/torchforge
cd torchforge
./scripts/install.sh

# Verify
python -c "import torch, forge, vllm; print('OK')"
```
### ROCm Installation
```bash
./scripts/install_rocm.sh
```
## Quick Start
### SFT Training (2+ GPUs)
```bash
python -m apps.sft.main --config apps/sft/llama3_8b.yaml
```
### GRPO Training (3+ GPUs)
```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```
---
## Workflow 1: GRPO Training for Math Reasoning
Use this workflow for training reasoning models with group-relative advantages.
### Prerequisites Checklist
- [ ] 3+ GPUs (GPU0: trainer, GPU1: ref_model, GPU2: generator)
- [ ] Model from HuggingFace Hub
- [ ] Training dataset (GSM8K, MATH, etc.)
### Step 1: Create Configuration
```yaml
# config/grpo_math.yaml
model: "Qwen/Qwen2.5-7B-Instruct"
dataset:
  path: "openai/gsm8k"
  split: "train"
  streaming: true

training:
  batch_size: 4
  learning_rate: 1e-6
  seq_len: 4096
  dtype: bfloat16
  gradient_accumulation_steps: 4

grpo:
  n_samples: 8      # Responses per prompt
  clip_low: 0.2     # Asymmetric clipping (see the sketch after this config)
  clip_high: 0.28
  beta: 0.1         # KL penalty coefficient
  temperature: 0.7

services:
  generator:
    procs: 1
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 1
    num_replicas: 1
    with_gpus: true
  ref_model:
    procs: 1
    num_replicas: 1
    with_gpus: true
```
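The `clip_low`/`clip_high` pair configures asymmetric ("clip-higher") ratio clipping, as used in DAPO-style objectives. Inside a loss it typically looks like the sketch below (illustrative, not necessarily forge's exact implementation):

```python
import torch

def clipped_pg_objective(ratio, advantages, clip_low=0.2, clip_high=0.28):
    # Allow the importance ratio to move further upward (1 + clip_high)
    # than downward (1 - clip_low) before clipping
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return -torch.min(ratio * advantages, clipped * advantages)
```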
### Step 2: Define Reward Function
```python
# rewards.py
# Built-in reward functions live in forge.data.rewards
import re

from forge.data.rewards import MathReward, ThinkingReward


# Or define your own reward function
class CustomMathReward:
    def __call__(self, prompt: str, response: str, target: str) -> float:
        # Extract the \boxed{...} answer from the response
        match = re.search(r"\\boxed\{([^}]+)\}", response)
        if not match:
            return 0.0
        answer = match.group(1).strip()
        return 1.0 if answer == target else 0.0
```
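A quick sanity check of the custom reward (standalone, no forge dependencies):

```python
reward = CustomMathReward()

print(reward("What is 6*7?", r"The answer is \boxed{42}.", "42"))  # 1.0
print(reward("What is 6*7?", r"The answer is \boxed{41}.", "42"))  # 0.0
print(reward("What is 6*7?", "The answer is 42.", "42"))           # 0.0 (no \boxed{} found)
```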
### Step 3: Launch Training
```bash
python -m apps.grpo.main --config config/grpo_math.yaml
```
### Step 4: Monitor Progress
- [ ] Check W&B dashboard for loss curves
- [ ] Verify entropy is decreasing (the policy is becoming more deterministic)
- [ ] Monitor KL divergence (it should stay bounded; see the monitoring sketch below)
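If your run does not already log these, token-level entropy and KL estimates can be computed directly from the policy and reference log-probs. A minimal sketch in plain PyTorch (not a forge API):

```python
import torch

def masked_mean(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    return (x * mask).sum() / mask.sum()

def monitoring_stats(logprobs, ref_logprobs, padding_mask):
    # Entropy proxy: negative mean log-prob of the sampled tokens
    # (exact entropy would require full-vocabulary logits)
    entropy_proxy = masked_mean(-logprobs, padding_mask)
    # Simple (k1) KL estimate between policy and reference on sampled tokens
    kl_estimate = masked_mean(logprobs - ref_logprobs, padding_mask)
    return {"entropy_proxy": entropy_proxy.item(), "kl": kl_estimate.item()}
```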
---
## Workflow 2: Custom Loss Function
Use this workflow to implement new RL algorithms.
### Step 1: Create Loss Class
```python
# src/forge/losses/custom_loss.py
import torch
import torch.nn as nn


class CustomLoss(nn.Module):
    def __init__(self, clip_range: float = 0.2, beta: float = 0.1):
        super().__init__()
        self.clip_range = clip_range
        self.beta = beta

    def forward(
        self,
        logprobs: torch.Tensor,
        ref_logprobs: torch.Tensor,
        advantages: torch.Tensor,
        padding_mask: torch.Tensor,
    ) -> torch.Tensor:
        # Importance ratio between current policy and reference policy
        ratio = torch.exp(logprobs - ref_logprobs)

        # Clipped policy-gradient term (PPO-style)
        clipped_ratio = torch.clamp(ratio, 1 - self.clip_range, 1 + self.clip_range)
        pg_loss = -torch.min(ratio * advantages, clipped_ratio * advantages)

        # Simple KL penalty toward the reference policy
        kl = ref_logprobs - logprobs

        # Mask out padding tokens and average over valid tokens
        masked_loss = (pg_loss + self.beta * kl) * padding_mask
        loss = masked_loss.sum() / padding_mask.sum()
        return loss
```
### Step 2: Integrate into Application
```python
# apps/custom/main.py
from forge.losses.custom_loss import CustomLoss
loss_fn = CustomLoss(clip_range=0.2, beta=0.1)
# In training loop
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```
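Before wiring the loss into a full run, a smoke test with random tensors (standalone; assumes one scalar advantage per sequence broadcast over tokens) confirms that shapes and masking behave as expected:

```python
import torch

from forge.losses.custom_loss import CustomLoss

batch, seq = 2, 16
logprobs = torch.randn(batch, seq)
ref_logprobs = torch.randn(batch, seq)
advantages = torch.randn(batch, 1).expand(batch, seq)  # one advantage per sequence
padding_mask = torch.ones(batch, seq)
padding_mask[:, 12:] = 0  # treat the last 4 positions as padding

loss = CustomLoss(clip_range=0.2, beta=0.1)(logprobs, ref_logprobs, advantages, padding_mask)
print(loss)  # finite scalar; padded positions contribute nothing
```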
---
## Workflow 3: Multi-GPU Distributed Training
Use this workflow for scaling to multiple GPUs or nodes.
### Configuration for Distributed
```yaml
# config/distributed.yaml
model: "meta-llama/Meta-Llama-3.1-8B-Instruct"
parallelism:
  tensor_parallel_degree: 2        # Split model across GPUs
  pipeline_parallel_degree: 1
  data_parallel_shard_degree: 2

services:
  generator:
    procs: 2          # 2 processes for TP=2
    num_replicas: 1
    with_gpus: true
  trainer:
    procs: 2
    num_replicas: 1
    with_gpus: true
```
### Launch with SLURM
```bash
# Submit job
sbatch --nodes=2 --gpus-per-node=8 run_grpo.sh
```
### Launch Locally (Multi-GPU)
```bash
# 8 GPU setup
python -m apps.grpo.main \
  --config config/distributed.yaml \
  --trainer.procs 4 \
  --generator.procs 4
```
---
## Core API Reference
### Training Batch Format
torchforge uses dictionary-based batches for training:
```python
# inputs: list of dicts with torch.Tensor values
inputs = [{"tokens": torch.Tensor}]
# targets: list of dicts with training signals
targets = [{
"response": torch.Tensor,
"ref_logprobs": torch.Tensor,
"advantages": torch.Tensor,
"padding_mask": torch.Tensor
}]
# train_step returns loss as float
loss = trainer.train_step(inputs, targets)
```
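Responses within a group usually differ in length, so they are padded to a common length and masked. A sketch of building padded response tensors and `padding_mask` with plain PyTorch (not a forge helper):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_responses(token_id_lists: list[list[int]]):
    seqs = [torch.tensor(ids) for ids in token_id_lists]
    responses = pad_sequence(seqs, batch_first=True, padding_value=0)
    lengths = torch.tensor([len(ids) for ids in token_id_lists])
    # 1.0 for real tokens, 0.0 for padding
    padding_mask = (torch.arange(responses.shape[1])[None, :] < lengths[:, None]).float()
    return responses, padding_mask
```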
### Completion
Generated output from vLLM:
```python
from dataclasses import dataclass

@dataclass
class Completion:
    text: str              # Generated text
    token_ids: list[int]   # Token IDs
    logprobs: list[float]  # Log probabilities
    metadata: dict         # Custom metadata
```
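For example, a `Completion` feeds directly into a reward function and converts to tensors for the batch format above. A sketch with dummy values, assuming `Completion` is importable from forge and reusing `CustomMathReward` from Workflow 1:

```python
import torch

completion = Completion(
    text=r"The answer is \boxed{42}.",
    token_ids=[101, 102, 103, 104],      # dummy token IDs
    logprobs=[-0.5, -0.1, -0.2, -0.05],  # dummy log-probabilities
    metadata={},
)

reward = CustomMathReward()("What is 6*7?", completion.text, "42")  # 1.0
response = torch.tensor(completion.token_ids)
logprobs = torch.tensor(completion.logprobs)
```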
---
## Built-in Loss Functions
Loss functions live in the `forge.losses` module.
### SimpleGRPOLoss
```python
from forge.losses import SimpleGRPOLoss, ReinforceLoss
# SimpleGRPOLoss for GRPO training
loss_fn = SimpleGRPOLoss(beta=0.1)
# Forward pass
loss = loss_fn(
    logprobs=logprobs,
    ref_logprobs=ref_logprobs,
    advantages=advantages,
    padding_mask=padding_mask,
)
```
### ReinforceLoss
```python
from forge.losses.reinforce_loss import ReinforceLoss
# With optional importance ratio clipping
loss_fn = ReinforceLoss(clip_ratio=0.2)
```
---
## Common Issues and Solutions
### Issue: Not Enough GPUs
**Symptoms**: "Insufficient GPU resources" error
**Solutions**:
```yaml
# Reduce service requirements
services:
  generator:
    procs: 1
    with_gpus: true
  trainer:
    procs: 1
    with_gpus: true

# Remove ref_model (uses generator weights)
```
Or use CPU for reference model:
```yaml
ref_model:
  with_gpus: false
```
### Issue: OOM During Generation
**Symptoms**: CUDA OOM in vLLM
**Solutions**:
```yaml
# Reduce the number of sampled responses per prompt
grpo:
  n_samples: 4      # Reduce from 8

# Or reduce sequence length
training:
  seq_len: 2048
```
### Issue: Slow Weight Sync
**Symptoms**: Long pauses between training and generation
**Solutions**:
```bash
# Enable RDMA (if available)
export TORCHSTORE_USE_RDMA=1
```
Or reduce the sync frequency in the training config:
```yaml
training:
  sync_interval: 10   # Sync weights every 10 steps
```
### Issue: Policy Collapse
**Symptoms**: Entropy drops to zero, reward stops improving
**Solutions**:
```yaml
# Increase the KL penalty
grpo:
  beta: 0.2           # Increase from 0.1

# Or add an entropy bonus
training:
  entropy_coef: 0.01
```
---
## Resources
- **Documentation**: https://meta-pytorch.org/torchforge
- **GitHub**: https://github.com/meta-pytorch/torchforge
- **Discord**: https://discord.gg/YsTYBh6PD9
- **TorchTitan**: https://github.com/pytorch/torchtitan
- **Monarch**: https://github.com/meta-pytorch/monarch
## FAQ
**Is torchforge production-ready?**
torchforge is designed for research and experimentation; it is experimental and its APIs may change. For production-focused systems, consider alternatives optimized for stability.

**What are the minimum hardware requirements for GRPO?**
A recommended starting point is 3 GPUs (trainer, generator, ref_model). You can reduce resource needs by running the reference model on CPU or by lowering generator procs and sample counts.