
This skill enables running Python ML training on cloud GPUs via Modal, handling GPU selection, data download, and remote result management.

```bash
npx playbooks add skill benchflow-ai/skillsbench --skill modal-gpu
```

---
name: modal-gpu
description: Run Python code on cloud GPUs using Modal serverless platform. Use when you need A100/T4/A10G GPU access for training ML models. Covers Modal app setup, GPU selection, data downloading inside functions, and result handling.
---

# Modal GPU Training

## Overview

Modal is a serverless platform for running Python code on cloud GPUs. It provides:

- **Serverless GPUs**: On-demand access to T4, A10G, A100 GPUs
- **Container Images**: Define dependencies declaratively with pip
- **Remote Execution**: Run functions on cloud infrastructure
- **Result Handling**: Return Python objects from remote functions

Two patterns:
- **Single Function**: Simple script with `@app.function` decorator
- **Multi-Function**: Complex workflows with multiple remote calls, sketched below
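
A minimal sketch of the multi-function pattern, here as an illustrative learning-rate sweep (the function bodies and the metric are placeholders, not part of Modal's API):

```python
import modal

app = modal.App("multi-step-training")

image = modal.Image.debian_slim(python_version="3.11").pip_install("numpy")

@app.function(image=image)
def preprocess():
    # CPU-only step: omit the gpu argument to avoid paying for a GPU
    return list(range(100))  # stand-in for a prepared dataset

@app.function(gpu="T4", image=image, timeout=1800)
def train_one(lr: float, data: list):
    # Each call runs in its own container on its own GPU
    return {"lr": lr, "loss": 1.0 / (1.0 + lr * len(data))}  # placeholder metric

@app.local_entrypoint()
def main():
    data = preprocess.remote()
    # starmap fans the calls out across containers in parallel
    results = list(train_one.starmap([(lr, data) for lr in (1e-4, 3e-4, 1e-3)]))
    print(min(results, key=lambda r: r["loss"]))
```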

## Quick Reference

| Topic | Reference |
|-------|-----------|
| Basic Structure | [Getting Started](references/getting-started.md) |
| GPU Options | [GPU Selection](references/gpu-selection.md) |
| Data Handling | [Data Download](references/data-download.md) |
| Results & Outputs | [Results](references/results.md) |
| Troubleshooting | [Common Issues](references/common-issues.md) |

## Installation

```bash
pip install modal
modal token set --token-id <id> --token-secret <secret>
```

## Minimal Example

```python
import modal

app = modal.App("my-training-app")

image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch",
    "einops",
    "numpy",
)

@app.function(gpu="A100", image=image, timeout=3600)
def train():
    import torch
    device = torch.device("cuda")
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")

    # Training code here
    return {"loss": 0.5}

@app.local_entrypoint()
def main():
    results = train.remote()
    print(results)
```

## Common Imports

```python
import modal
from modal import Image, App

# Inside remote function
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
```

## When to Use What

| Scenario | Approach |
|----------|----------|
| Quick GPU experiments | `gpu="T4"` (16GB, cheapest) |
| Medium training jobs | `gpu="A10G"` (24GB) |
| Large-scale training | `gpu="A100"` (40/80GB, fastest) |
| Long-running jobs | Set `timeout=3600` or higher |
| Data from HuggingFace | Download inside function with `hf_hub_download` (sketch below) |
| Return metrics | Return dict from function |
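
For the HuggingFace row above, a hedged sketch of downloading a checkpoint inside the remote function (the `repo_id` and `filename` are placeholders; `huggingface_hub` must be installed in the image):

```python
import modal

app = modal.App("hf-download-demo")

image = modal.Image.debian_slim(python_version="3.11").pip_install("huggingface_hub")

@app.function(gpu="T4", image=image, timeout=600)
def fetch_checkpoint():
    import os
    from huggingface_hub import hf_hub_download

    # Downloading inside the function keeps the image small; the file
    # lands on the container's local disk.
    path = hf_hub_download(
        repo_id="some-org/some-model",  # placeholder repo
        filename="pytorch_model.bin",   # placeholder file
    )
    return {"path": path, "size_bytes": os.path.getsize(path)}

@app.local_entrypoint()
def main():
    print(fetch_checkpoint.remote())
```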

## Running

```bash
# Run script
modal run train_modal.py

# Run in background
modal run --detach train_modal.py
```
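
To check on a detached run afterwards, the Modal CLI provides app inspection commands; treat the exact flags as a sketch and confirm against `modal --help` for your installed version:

```bash
# List apps, including detached runs
modal app list

# Stream logs for a specific app
modal app logs <app-id>
```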

## External Resources

- Modal Documentation: https://modal.com/docs
- Modal Examples: https://github.com/modal-labs/modal-examples

## Overview

This skill shows how to run Python code on cloud GPUs using the Modal serverless platform. It guides you through setting up a Modal app, choosing A100/T4/A10G GPUs, packaging dependencies, and returning results from remote functions. The content covers both simple single-function workflows and multi-function orchestration for more complex training jobs.

## How this skill works

You declare a Modal `App` and an execution image that includes your Python dependencies. Annotate functions with `@app.function`, specifying `gpu`, `image`, and `timeout`; those functions execute remotely on the chosen GPU and can download data inside the function (for example via `hf_hub_download`). Remote functions run in Modal-managed containers and return Python objects or dictionaries you can inspect locally after completion.

## When to use it

- Quick GPU experiments where you need temporary access to a T4 for fast iteration.
- Medium-sized training runs that require an A10G for increased memory and performance.
- Large-scale or high-throughput training that benefits from A100 GPUs.
- Workflows that must download datasets or model artifacts at runtime inside the remote function.
- Jobs where you want serverless execution and result handling without managing VM lifecycles.

## Best practices

- Install and pin required Python packages in the Modal Image so remote functions are reproducible (see the sketch after this list).
- Select the smallest GPU that meets memory needs to control cost (T4 → A10G → A100).
- Download data and model checkpoints inside the remote function to avoid packaging large files in the image.
- Return compact metrics (e.g., loss, accuracy, checkpoint paths) instead of huge objects to minimize data transfer.
- Set an appropriate timeout for long-running training jobs, and consider running detached for background runs.
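
For the pinning advice above, a sketch of a reproducible image; the version numbers are illustrative, so pin to whatever versions your code was tested against:

```python
import modal

# Pinned versions make remote runs reproducible across rebuilds.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    "torch==2.3.1",    # example versions only,
    "einops==0.8.0",   # not recommendations
    "numpy==1.26.4",
)
```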

## Example use cases

- Run a quick fine-tuning loop on a small model using a T4 for rapid iteration.
- Train a medium-sized model on an A10G when 16GB is insufficient but an A100 is overkill.
- Perform large-batch training or distributed experiments on A100s for maximum throughput.
- Download a checkpoint from Hugging Face inside the function and resume training remotely.
- Execute long-running experiments detached and retrieve final metrics and artifacts after completion.

## FAQ

**How do I authenticate Modal and set credentials?**

Install the `modal` package and run `modal token set --token-id <id> --token-secret <secret>`; the credentials are stored locally, so subsequent `modal run` invocations authenticate automatically.

**Where should I download datasets or model files?**

Download datasets and checkpoints inside the remote function (for example with `hf_hub_download`) to keep images small and avoid shipping large files with the container.

**How do I retrieve results from a remote run?**

Have the remote function return a Python dict or object; call `.remote()` from the local entrypoint and inspect the returned value or logs. For artifacts too large to return directly, persist them and return only a path, as in the sketch below.
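
When artifacts are too large to return directly, one option is persisting them to a `modal.Volume` and returning just the path; a hedged sketch (the volume name and mount path are placeholders):

```python
import modal

app = modal.App("checkpoint-demo")
image = modal.Image.debian_slim(python_version="3.11").pip_install("torch")

# A named Volume persists files across runs.
checkpoints = modal.Volume.from_name("training-checkpoints", create_if_missing=True)

@app.function(gpu="T4", image=image, timeout=3600, volumes={"/ckpt": checkpoints})
def train():
    import torch
    model = torch.nn.Linear(10, 1)
    # ... training loop would go here ...
    path = "/ckpt/model.pt"
    torch.save(model.state_dict(), path)
    checkpoints.commit()  # flush writes so they persist in the Volume
    return {"loss": 0.42, "checkpoint": path}  # compact metrics plus a path

@app.local_entrypoint()
def main():
    print(train.remote())
```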