home / skills / benchflow-ai / skillsbench / modal-gpu
This skill enables running Python ML training on cloud GPUs via Modal, handling GPU selection, data download, and remote result management.
npx playbooks add skill benchflow-ai/skillsbench --skill modal-gpuReview the files below or copy the command above to add this skill to your agents.
---
name: modal-gpu
description: Run Python code on cloud GPUs using Modal serverless platform. Use when you need A100/T4/A10G GPU access for training ML models. Covers Modal app setup, GPU selection, data downloading inside functions, and result handling.
---
# Modal GPU Training
## Overview
Modal is a serverless platform for running Python code on cloud GPUs. It provides:
- **Serverless GPUs**: On-demand access to T4, A10G, A100 GPUs
- **Container Images**: Define dependencies declaratively with pip
- **Remote Execution**: Run functions on cloud infrastructure
- **Result Handling**: Return Python objects from remote functions
Two patterns:
- **Single Function**: Simple script with `@app.function` decorator
- **Multi-Function**: Complex workflows with multiple remote calls
## Quick Reference
| Topic | Reference |
|-------|-----------|
| Basic Structure | [Getting Started](references/getting-started.md) |
| GPU Options | [GPU Selection](references/gpu-selection.md) |
| Data Handling | [Data Download](references/data-download.md) |
| Results & Outputs | [Results](references/results.md) |
| Troubleshooting | [Common Issues](references/common-issues.md) |
## Installation
```bash
pip install modal
modal token set --token-id <id> --token-secret <secret>
```
## Minimal Example
```python
import modal
app = modal.App("my-training-app")
image = modal.Image.debian_slim(python_version="3.11").pip_install(
"torch",
"einops",
"numpy",
)
@app.function(gpu="A100", image=image, timeout=3600)
def train():
import torch
device = torch.device("cuda")
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
# Training code here
return {"loss": 0.5}
@app.local_entrypoint()
def main():
results = train.remote()
print(results)
```
## Common Imports
```python
import modal
from modal import Image, App
# Inside remote function
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
```
## When to Use What
| Scenario | Approach |
|----------|----------|
| Quick GPU experiments | `gpu="T4"` (16GB, cheapest) |
| Medium training jobs | `gpu="A10G"` (24GB) |
| Large-scale training | `gpu="A100"` (40/80GB, fastest) |
| Long-running jobs | Set `timeout=3600` or higher |
| Data from HuggingFace | Download inside function with `hf_hub_download` |
| Return metrics | Return dict from function |
## Running
```bash
# Run script
modal run train_modal.py
# Run in background
modal run --detach train_modal.py
```
## External Resources
- Modal Documentation: https://modal.com/docs
- Modal Examples: https://github.com/modal-labs/modal-examples
This skill shows how to run Python code on cloud GPUs using the Modal serverless platform. It guides you through setting up a Modal app, choosing A100/T4/A10G GPUs, packaging dependencies, and returning results from remote functions. The content covers both simple single-function workflows and multi-function orchestration for more complex training jobs.
You declare a Modal App and an execution image that includes your Python dependencies. Annotate functions with @app.function specifying gpu, image, and timeout; those functions execute remotely on the chosen GPU and can download data inside the function (for example via hf_hub_download). Remote functions run on Modal-managed containers and return Python objects or dictionaries you can inspect locally after completion.
How do I authenticate Modal and set credentials?
Install the modal package and run modal token set with your token id and secret; include these in your local environment before running.
Where should I download datasets or model files?
Download datasets and checkpoints inside the remote function (for example using hf_hub_download) to keep images small and avoid shipping large files with the container.
How do I retrieve results from a remote run?
Have the remote function return a Python dict or object; call .remote() from the local entrypoint and inspect the returned value or logs.