
This skill helps you deploy and run serverless GPU-powered ML workloads with autoscaling, scheduling, and pay-per-use compute.

npx playbooks add skill eyadsibai/ltk --skill modal

Review the files below or copy the command above to add this skill to your agents.

Files (1): SKILL.md (3.8 KB)
---
name: modal
description: Use when "Modal", "serverless GPU", "cloud GPU", "deploy ML model", or asking about "serverless containers", "GPU compute", "batch processing", "scheduled jobs", "autoscaling ML"
version: 1.0.0
---

<!-- Adapted from: claude-scientific-skills/scientific-skills/modal -->

# Modal Serverless Cloud Platform

Serverless Python execution with GPUs, autoscaling, and pay-per-use compute.

## When to Use

- Deploy and serve ML models (LLMs, image generation)
- Run GPU-accelerated computation
- Batch process large datasets in parallel
- Schedule compute-intensive jobs
- Build serverless APIs with autoscaling

## Quick Start

```bash
# Install
pip install modal

# Authenticate
modal token new
```

```python
import modal

app = modal.App("my-app")

@app.function()
def hello():
    return "Hello from Modal!"

# Run with: modal run script.py
```

## Container Images

```python
# Build image with dependencies
image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("torch", "transformers", "numpy")
)

app = modal.App("ml-app", image=image)
```

## GPU Functions

```python
@app.function(gpu="H100")
def train_model():
    import torch
    assert torch.cuda.is_available()
    # GPU code here

# Available GPUs: T4, L4, A10, A100, L40S, H100, H200, B200
# Multi-GPU: gpu="H100:8"
```

## Web Endpoints

```python
@app.function()
@modal.web_endpoint(method="POST")  # web endpoints need FastAPI available in the container image
def predict(data: dict):
    result = model.predict(data["input"])  # `model` is assumed to be loaded elsewhere (e.g. in @modal.enter)
    return {"prediction": result}

# Deploy: modal deploy script.py
```

## Scheduled Jobs

```python
@app.function(schedule=modal.Cron("0 2 * * *"))  # Daily at 02:00 UTC
def daily_backup():
    pass

@app.function(schedule=modal.Period(hours=4))  # Every 4 hours
def refresh_cache():
    pass
```

## Autoscaling

```python
@app.function()
def process_item(item_id: int):
    return analyze(item_id)  # `analyze` stands in for your own per-item processing

@app.local_entrypoint()
def main():
    items = range(1000)
    # Automatically parallelized across containers
    results = list(process_item.map(items))
```

## Persistent Storage

```python
volume = modal.Volume.from_name("my-data", create_if_missing=True)

@app.function(volumes={"/data": volume})
def save_results(data):
    with open("/data/results.txt", "w") as f:
        f.write(data)
    volume.commit()  # Persist changes
```

## Secrets Management

```python
@app.function(secrets=[modal.Secret.from_name("huggingface")])
def download_model():
    import os
    token = os.environ["HF_TOKEN"]
```

## ML Model Serving

```python
@app.cls(gpu="L40S")
class Model:
    @modal.enter()
    def load_model(self):
        from transformers import pipeline
        self.pipe = pipeline("text-classification", device="cuda")

    @modal.method()
    def predict(self, text: str):
        return self.pipe(text)

@app.local_entrypoint()
def main():
    model = Model()
    result = model.predict.remote("Modal is great!")
```

## Resource Configuration

```python
@app.function(
    cpu=8.0,              # 8 CPU cores
    memory=32768,         # 32 GiB RAM
    ephemeral_disk=10240, # 10 GiB disk
    timeout=3600          # 1 hour timeout
)
def memory_intensive_task():
    pass
```

## Best Practices

1. **Pin dependencies** for reproducible builds
2. **Use appropriate GPU types** - L40S for inference, H100 for training
3. **Leverage caching** via Volumes for model weights (see the sketch after this list)
4. **Use `.map()` for parallel processing**
5. **Import packages inside functions** if not available locally
6. **Store secrets securely** - never hardcode API keys
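
A minimal sketch combining practices 1 and 3: pinned versions in the image plus a Volume used as a cache for model weights. The package versions, volume name, and `/weights` mount path are illustrative, not requirements.

```python
import modal

# Pinned versions for reproducible builds (the versions shown are illustrative)
image = modal.Image.debian_slim(python_version="3.12").pip_install(
    "torch==2.4.0",
    "transformers==4.44.0",
    "huggingface_hub==0.24.0",
)

# A Volume acts as a shared cache so weights are downloaded once, not per container
weights = modal.Volume.from_name("model-weights", create_if_missing=True)

app = modal.App("cached-ml-app", image=image)

@app.function(volumes={"/weights": weights})
def warm_cache():
    from huggingface_hub import snapshot_download  # imported inside the function

    # Download into the mounted Volume; later containers reuse the cached files
    snapshot_download("distilbert-base-uncased", local_dir="/weights/distilbert")
    weights.commit()  # persist the downloaded weights
```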

## vs Alternatives

| Platform | Best For |
|----------|----------|
| **Modal** | Serverless GPUs, autoscaling, Python-native |
| RunPod | GPU rental, long-running jobs |
| AWS Lambda | CPU workloads, AWS ecosystem |
| Replicate | Model hosting, simple deployments |

## Resources

- Docs: <https://modal.com/docs>
- Examples: <https://github.com/modal-labs/modal-examples>

Overview

This skill covers deploying and running Python workloads on Modal’s serverless cloud with GPU support, autoscaling, and pay-per-use billing. It focuses on making GPU-accelerated training, inference, batch processing, and scheduled jobs simple to deploy and manage, and shows how to build container images, configure resources, and expose web endpoints or scheduled functions for production use.

How this skill works

You define an app and functions in Python using Modal primitives (App, function, Image, Volume, Secret). Functions can request GPUs, CPUs, memory, ephemeral disk, or schedules; Modal then provisions serverless containers with the requested resources and autoscaling. Use .map() for parallel work, web_endpoint for HTTP APIs, and Volumes/Secrets for persistent storage and credentials.
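
Putting those primitives together, a single app might look roughly like the sketch below; the app name, secret name, volume name, and the API_TOKEN key are placeholders, not Modal requirements.

```python
import modal

volume = modal.Volume.from_name("etl-output", create_if_missing=True)  # placeholder volume name
app = modal.App("nightly-etl")  # placeholder app name

@app.function(
    gpu="L4",                                       # request a GPU only where it is needed
    volumes={"/out": volume},                       # persistent storage
    secrets=[modal.Secret.from_name("api-creds")],  # placeholder secret name
    schedule=modal.Cron("0 3 * * *"),               # run nightly at 03:00 UTC
    timeout=1800,
)
def nightly_job():
    import os
    token = os.environ["API_TOKEN"]  # the key name depends on how the secret was created
    with open("/out/last_run.txt", "w") as f:
        f.write(f"ran with token of length {len(token)}")
    volume.commit()

@app.function()
def score(x: int) -> int:
    return x * x

@app.local_entrypoint()
def main():
    # .map() fans the calls out across autoscaled containers
    print(sum(score.map(range(100))))
```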

When to use it

  • Deploy and serve ML models (LLMs, vision models) requiring GPU inference or training
  • Run one-off or recurring GPU-accelerated workloads without managing servers
  • Batch process large datasets in parallel using automatic container scaling
  • Build autoscaling serverless APIs for inference with predictable cost control
  • Schedule maintenance, backups, or periodic data refresh jobs

Best practices

  • Pin Python and pip package versions in your Image for reproducible builds
  • Choose GPU types to match workload: L40S for inference, H100/H200 for training
  • Cache large model weights in Volumes to avoid repeated downloads
  • Import heavy packages inside function bodies to reduce local dependency issues (see the sketch after this list)
  • Use .map() or concurrent patterns to parallelize many short tasks efficiently

Example use cases

  • Host a transformer-based text-classification API behind a POST web endpoint with autoscaling (sketched after this list)
  • Train a large model on H100 GPUs for a few hours and pay only for used GPU time
  • Run nightly batch image processing jobs with a Cron schedule and persistent output to a Volume
  • Parallelize data preprocessing across many containers using process_item.map(items)
  • Store API tokens in Secrets and download models or datasets securely at runtime
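
For the first use case in this list, one plausible shape pairs the @app.cls pattern with a POST web endpoint; the app name, class name, and GPU choice are placeholders.

```python
import modal

image = modal.Image.debian_slim(python_version="3.12").pip_install(
    "torch", "transformers", "fastapi[standard]"  # FastAPI is needed for web endpoints
)
app = modal.App("text-classifier-api", image=image)  # placeholder name

@app.cls(gpu="L4")
class Classifier:
    @modal.enter()
    def load(self):
        from transformers import pipeline
        self.pipe = pipeline("text-classification", device=0)

    @modal.method()
    def predict(self, text: str):
        return self.pipe(text)

@app.function()
@modal.web_endpoint(method="POST")
def classify(payload: dict):
    # The endpoint stays lightweight and forwards work to the GPU-backed class
    return {"prediction": Classifier().predict.remote(payload["text"])}
```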

FAQ

How do I choose the right GPU?

Choose inference-optimized GPUs like L40S or L4 for lower latency and cost; pick H100/H200 or A100 for large-scale training and multi-GPU parallelism.
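
In code, the choice is just the gpu argument on the function; the types and counts below are examples, not recommendations for any specific model.

```python
import modal

app = modal.App("gpu-choice-examples")  # illustrative app name

@app.function(gpu="L4")      # small, cost-efficient inference
def infer_small():
    pass

@app.function(gpu="L40S")    # larger inference or light fine-tuning
def infer_large():
    pass

@app.function(gpu="H100:8")  # multi-GPU training
def train_large():
    pass
```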

Can I persist files between runs?

Yes. Use Modal Volume objects to mount persistent storage into containers and call volume.commit() to save changes.
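
A minimal read/write sketch; the volume name and file path are placeholders, and volume.reload() is shown as the counterpart to commit() for picking up changes written by other containers.

```python
import modal

app = modal.App("volume-roundtrip")  # placeholder app name
volume = modal.Volume.from_name("my-data", create_if_missing=True)

@app.function(volumes={"/data": volume})
def write_report(text: str):
    with open("/data/report.txt", "w") as f:
        f.write(text)
    volume.commit()  # make the write durable and visible to other containers

@app.function(volumes={"/data": volume})
def read_report() -> str:
    volume.reload()  # pick up commits made by other containers since this one started
    with open("/data/report.txt") as f:
        return f.read()
```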