
runpod-deployment-skill


This skill helps you deploy and manage GPU workloads on RunPod with serverless handlers, vLLM endpoints, scalable pods, and cost optimization.

npx playbooks add skill scientiacapital/skills --skill runpod-deployment-skill

Review the files below or copy the command above to add this skill to your agents.

Files (13)
SKILL.md
---
name: "runpod-deployment"
description: "Deploy GPU workloads to RunPod serverless and pods - vLLM endpoints, A100/H100 setup, scale-to-zero, cost optimization. Use when: deploy to RunPod, GPU serverless, vLLM endpoint, scale to zero, A100 deployment, H100 setup, serverless handler, GPU cost optimization."
---

<objective>
Deploy and manage GPU workloads on RunPod infrastructure:

1. **Serverless Workers** - Scale-to-zero handlers with pay-per-second billing
2. **vLLM Endpoints** - OpenAI-compatible LLM serving with 2-3x throughput
3. **Pod Management** - Dedicated GPU instances for development/training
4. **Cost Optimization** - GPU selection, spot instances, budget controls

Key deliverables:
- Production-ready serverless handlers with streaming
- vLLM deployment with OpenAI API compatibility
- Cost-optimized GPU selection for any model size
- Health monitoring and auto-scaling configuration
</objective>

<quick_start>
**Minimal Serverless Handler (runpod SDK v1.8.1):**
```python
import runpod

def handler(job):
    """Basic handler - receives job, returns result."""
    job_input = job["input"]
    prompt = job_input.get("prompt", "")

    # Your inference logic here
    result = process(prompt)

    return {"output": result}

runpod.serverless.start({"handler": handler})
```

**Streaming Handler:**
```python
import runpod

def streaming_handler(job):
    """Generator for streaming responses."""
    for chunk in generate_chunks(job["input"]):
        yield {"token": chunk, "finished": False}
    yield {"token": "", "finished": True}

runpod.serverless.start({
    "handler": streaming_handler,
    "return_aggregate_stream": True
})
```

**vLLM OpenAI-Compatible Client:**
```python
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
```
</quick_start>

<success_criteria>
A RunPod deployment is successful when all of the following hold (a smoke-test sketch follows the list):
- Handler processes requests without errors
- Endpoint scales appropriately (0 → N workers)
- Cold start time is acceptable for use case
- Cost stays within budget projections
- Health checks pass consistently
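
A quick way to exercise the first and last of these criteria is a one-request smoke test against the synchronous /runsync endpoint. This is a minimal sketch: ENDPOINT_ID and RUNPOD_API_KEY are placeholders, and the prompt field matches the quick-start handler above.

```python
import requests

ENDPOINT_ID = "ENDPOINT_ID"   # placeholder: your endpoint ID
API_KEY = "RUNPOD_API_KEY"    # placeholder: your RunPod API key

def smoke_test(prompt: str = "ping") -> bool:
    """Send one synchronous request and check that the handler returns cleanly."""
    resp = requests.post(
        f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"input": {"prompt": prompt}},
        timeout=120,  # leave room for a cold start
    )
    resp.raise_for_status()
    body = resp.json()
    # The quick-start handler returns a dict; adjust this check if yours returns something else.
    return body.get("status") == "COMPLETED" and "error" not in body.get("output", {})

if __name__ == "__main__":
    assert smoke_test(), "endpoint did not return a clean COMPLETED response"
```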
</success_criteria>

<m1_mac_critical>
## M1/M2 Mac: Cannot Build Docker Locally

Docker images built natively on Apple Silicon target ARM64 and will not run on RunPod's x86_64 GPU hosts.

**Solution: GitHub Actions builds for you:**
```bash
# Push code - Actions builds x86 image
git add . && git commit -m "Deploy" && git push
```

> See `reference/cicd.md` for complete GitHub Actions workflow.

**Never run `docker build` locally for RunPod on Apple Silicon.**
</m1_mac_critical>

<gpu_selection>
## GPU Selection Matrix (January 2025)

| GPU | VRAM | Secure $/hr | Spot $/hr | Best For |
|-----|------|-------------|-----------|----------|
| RTX A4000 | 16GB | $0.36 | $0.18 | Embeddings, small models |
| RTX 4090 | 24GB | $0.44 | $0.22 | 7B-8B inference |
| A40 | 48GB | $0.65 | $0.39 | 13B-30B, fine-tuning |
| A100 80GB | 80GB | $1.89 | $0.89 | 70B models, production |
| H100 80GB | 80GB | $4.69 | $1.88 | 70B+ training |

**Quick Selection:**
```python
def select_gpu(model_params_b: float, quantized: bool = False) -> str:
    effective = model_params_b * (0.5 if quantized else 1.0)
    if effective <= 3: return "RTX_A4000"      # $0.36/hr
    if effective <= 8: return "RTX_4090"       # $0.44/hr
    if effective <= 30: return "A40"           # $0.65/hr
    if effective <= 70: return "A100_80GB"     # $1.89/hr
    return "H100_80GB"                         # $4.69/hr
```
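
A couple of illustrative calls; the results follow directly from the thresholds above (4-bit AWQ/GPTQ quantization is approximated as halving the effective parameter count):

```python
# 8B unquantized fits the 24 GB tier
assert select_gpu(8) == "RTX_4090"

# 13B unquantized lands on the 48 GB A40 tier
assert select_gpu(13) == "A40"

# 70B quantized has an effective size of 35B, still above the A40 cutoff
assert select_gpu(70, quantized=True) == "A100_80GB"
```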

> See `reference/cost-optimization.md` for detailed pricing and budget controls.
</gpu_selection>

<handler_patterns>
## Handler Patterns

### Progress Updates (Long-Running Tasks)
```python
import runpod

def long_task_handler(job):
    total_steps = job["input"].get("steps", 10)

    for step in range(total_steps):
        process_step(step)
        # progress_update takes the job dict plus a serializable progress payload
        runpod.serverless.progress_update(
            job, f"{step + 1}/{total_steps} steps complete"
        )

    return {"status": "complete", "steps": total_steps}

runpod.serverless.start({"handler": long_task_handler})
```
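
On the client side, those progress updates can be observed by polling the job's /status endpoint. This is a hedged sketch: ENDPOINT_ID and RUNPOD_API_KEY are placeholders, and the exact field carrying the progress payload depends on the SDK/API version.

```python
import time
import requests

ENDPOINT_ID = "ENDPOINT_ID"   # placeholder
API_KEY = "RUNPOD_API_KEY"    # placeholder
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def poll_job(job_id: str, interval_s: float = 2.0) -> dict:
    """Poll /status until the job reaches a terminal state, printing each snapshot."""
    url = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/status/{job_id}"
    while True:
        snapshot = requests.get(url, headers=HEADERS, timeout=10).json()
        print(snapshot)  # progress updates surface here while the job is IN_PROGRESS
        if snapshot.get("status") in ("COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"):
            return snapshot
        time.sleep(interval_s)
```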

### Error Handling
```python
import runpod
import torch
import traceback

def safe_handler(job):
    try:
        # Validate input
        if "prompt" not in job["input"]:
            return {"error": "Missing required field: prompt"}

        result = process(job["input"])
        return {"output": result}

    except torch.cuda.OutOfMemoryError:
        return {"error": "GPU OOM - reduce input size", "retry": False}
    except Exception as e:
        return {"error": str(e), "traceback": traceback.format_exc()}

runpod.serverless.start({"handler": safe_handler})
```
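
On the calling side, the error/retry contract above can be honored with a small wrapper. This sketch takes any synchronous call helper (for example, a /runsync POST) as call_fn; it is an illustration, not part of the RunPod SDK.

```python
import time

def call_with_retries(call_fn, payload: dict, max_attempts: int = 3) -> dict:
    """Retry transient errors; respect the handler's explicit retry=False hint."""
    result: dict = {}
    for attempt in range(1, max_attempts + 1):
        result = call_fn(payload)
        if "error" not in result:
            return result               # success
        if result.get("retry") is False:
            return result               # handler says do not retry (e.g. GPU OOM)
        if attempt < max_attempts:
            time.sleep(2 ** attempt)    # simple exponential backoff
    return result
```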

> See `reference/serverless-workers.md` for async patterns, batching, and advanced handlers.
</handler_patterns>

<vllm_deployment>
## vLLM Deployment

> **Note:** vLLM uses OpenAI-compatible API FORMAT but connects to YOUR RunPod endpoint,
> NOT OpenAI servers. Models run on your GPU (Llama, Qwen, Mistral, etc.)

### Environment Configuration
```python
vllm_env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-70B-Instruct",
    "HF_TOKEN": "${HF_TOKEN}",
    "TENSOR_PARALLEL_SIZE": "2",        # Multi-GPU
    "MAX_MODEL_LEN": "16384",
    "GPU_MEMORY_UTILIZATION": "0.95",
    "QUANTIZATION": "awq",              # Optional: awq, gptq
}
```
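
Before choosing TENSOR_PARALLEL_SIZE, a back-of-envelope VRAM check helps. The sketch below counts fp16/bf16 weights only; KV cache, activations, and CUDA overhead are ignored, so treat the result as a lower bound.

```python
import math

def min_gpus_for_weights(params_b: float, gpu_vram_gb: float = 80,
                         utilization: float = 0.95,
                         bytes_per_param: float = 2.0) -> int:
    """Lower bound on GPUs needed just to hold the model weights."""
    weights_gb = params_b * bytes_per_param   # 70B * 2 bytes is roughly 140 GB
    usable_gb = gpu_vram_gb * utilization     # mirrors GPU_MEMORY_UTILIZATION above
    return math.ceil(weights_gb / usable_gb)

# 70B in fp16 needs at least 2x 80 GB GPUs, matching TENSOR_PARALLEL_SIZE=2 above
assert min_gpus_for_weights(70) == 2
```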

### OpenAI-Compatible Streaming
```python
from openai import OpenAI

client = OpenAI(
    api_key="RUNPOD_API_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)

stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a poem"}],
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

### Direct RunPod Streaming
```python
import requests

url = "https://api.runpod.ai/v2/ENDPOINT_ID/run"
headers = {"Authorization": "Bearer RUNPOD_API_KEY"}

response = requests.post(url, headers=headers, json={
    "input": {"prompt": "Hello", "stream": True}
})
job_id = response.json()["id"]

# Stream results
stream_url = f"https://api.runpod.ai/v2/ENDPOINT_ID/stream/{job_id}"
with requests.get(stream_url, headers=headers, stream=True) as r:
    for line in r.iter_lines():
        if line: print(line.decode())
```

> See `reference/model-deployment.md` for HuggingFace, TGI, and custom model patterns.
</vllm_deployment>

<auto_scaling>
## Auto-Scaling Configuration

### Scaler Types
| Type | Best For | Config |
|------|----------|--------|
| QUEUE_DELAY | Variable traffic | `scaler_value=2` (2s target) |
| REQUEST_COUNT | Predictable load | `scaler_value=5` (5 req/worker) |

### Configuration Patterns
```python
configs = {
    "interactive_api": {
        "workers_min": 1,      # Always warm
        "workers_max": 5,
        "idle_timeout": 120,
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 1,     # 1s latency target
    },
    "batch_processing": {
        "workers_min": 0,
        "workers_max": 20,
        "idle_timeout": 30,
        "scaler_type": "REQUEST_COUNT",
        "scaler_value": 5,
    },
    "cost_optimized": {
        "workers_min": 0,
        "workers_max": 3,
        "idle_timeout": 15,    # Aggressive scale-down
        "scaler_type": "QUEUE_DELAY",
        "scaler_value": 5,
    },
}
```
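
One trade-off worth making explicit: workers_min=1 removes cold starts but adds a fixed baseline cost. A rough sketch using the secure-cloud rates from the GPU matrix above:

```python
def warm_baseline_cost(workers_min: int, hourly_rate: float,
                       hours_per_month: int = 720) -> float:
    """Monthly cost of keeping workers_min workers warm around the clock."""
    return workers_min * hourly_rate * hours_per_month

# interactive_api on an RTX 4090: 1 * $0.44/hr * 720 h = $316.80/month before any traffic
print(f"${warm_baseline_cost(1, 0.44):.2f}")
```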

> See `reference/pod-management.md` for pod lifecycle and scaling details.
</auto_scaling>

<health_monitoring>
## Health & Monitoring

### Quick Health Check
```python
import runpod

runpod.api_key = "RUNPOD_API_KEY"  # or set the RUNPOD_API_KEY environment variable

def check_health(endpoint_id: str) -> dict:
    """Endpoint.health() is synchronous and returns worker/job counts as a dict."""
    endpoint = runpod.Endpoint(endpoint_id)
    health = endpoint.health()

    return {
        "workers_ready": health["workers"].get("ready", 0),
        "queue_depth": health["jobs"].get("inQueue", 0),
        "in_progress": health["jobs"].get("inProgress", 0),
        "failed_jobs": health["jobs"].get("failed", 0),
        # latency/throughput metrics come from the GraphQL query below, not /health
    }
```

### GraphQL Metrics Query
```graphql
query GetEndpoint($id: String!) {
    endpoint(id: $id) {
        status
        workers { ready running pending throttled }
        queue { inQueue inProgress completed failed }
        metrics {
            requestsPerMinute
            avgExecutionTimeMs
            p95ExecutionTimeMs
            successRate
        }
    }
}
```
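
To run the query from Python, the sketch below posts it to RunPod's GraphQL API. The https://api.runpod.io/graphql URL and the api_key query parameter are assumptions based on RunPod's documented GraphQL access; trim or extend the requested fields to match the schema you actually have.

```python
import requests

API_KEY = "RUNPOD_API_KEY"    # placeholder
ENDPOINT_ID = "ENDPOINT_ID"   # placeholder

QUERY = """
query GetEndpoint($id: String!) {
    endpoint(id: $id) { status }
}
"""

resp = requests.post(
    "https://api.runpod.io/graphql",
    params={"api_key": API_KEY},   # assumption: API key passed as a query parameter
    json={"query": QUERY, "variables": {"id": ENDPOINT_ID}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```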

> See `reference/monitoring.md` for structured logging, alerts, and dashboards.
</health_monitoring>

<dockerfile_pattern>
## Dockerfile Template

```dockerfile
FROM runpod/pytorch:2.1.0-py3.10-cuda12.1.1-devel-ubuntu22.04

WORKDIR /app

# Install dependencies (cached layer)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# RunPod entrypoint
CMD ["python", "-u", "handler.py"]
```

> See `reference/templates.md` for runpod.toml, requirements.txt patterns.
</dockerfile_pattern>

<file_locations>
## Reference Files

**Core Patterns:**
- `reference/serverless-workers.md` - Handler patterns, streaming, async
- `reference/model-deployment.md` - vLLM, TGI, HuggingFace deployment
- `reference/pod-management.md` - GPU types, scaling, lifecycle

**Operations:**
- `reference/cost-optimization.md` - Budget controls, right-sizing
- `reference/monitoring.md` - Health checks, logging, GraphQL
- `reference/troubleshooting.md` - Common issues and solutions

**DevOps:**
- `reference/cicd.md` - GitHub Actions for M1 Mac builds
- `reference/templates.md` - Dockerfile, runpod.toml configs
- `templates/runpod-worker.py` - Production handler template
</file_locations>

<routing>
## Request Routing

**User wants serverless deployment:**
→ Provide handler pattern, Dockerfile, deployment steps
→ Reference: `reference/serverless-workers.md`

**User wants vLLM endpoint:**
→ Provide vLLM env config, OpenAI client setup
→ Reference: `reference/model-deployment.md`

**User wants cost optimization:**
→ Provide GPU selection matrix, spot pricing, budget controls
→ Reference: `reference/cost-optimization.md`

**User on M1/M2 Mac:**
→ CRITICAL: Must use GitHub Actions for builds
→ Reference: `reference/cicd.md`

**User has deployment issues:**
→ Check health endpoint, review logs
→ Reference: `reference/troubleshooting.md`
</routing>

<cost_quick_ref>
## Cost Quick Reference

**Storage Costs:**
- Container disk: $0.10/GB/month (billed in 5-min intervals)
- Network volumes: $0.07/GB/month (first 1TB), $0.05/GB after

**Cost Estimation:**
```python
def estimate_monthly_cost(gpu_type, daily_requests, avg_time_s):
    rates = {"RTX_4090": 0.44, "A40": 0.65, "A100_80GB": 1.89}
    daily_hours = (daily_requests * avg_time_s) / 3600
    return daily_hours * 30 * rates.get(gpu_type, 1.0)

# Example: 1000 requests/day, 5s each, RTX 4090
# = (1000 * 5) / 3600 * 30 * 0.44 = $18.33/month
```
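
To fold storage into the estimate, a hedged extension using the rates in the bullets above (first-TB network-volume rate):

```python
def estimate_storage_cost(container_gb: float, volume_gb: float) -> float:
    """Monthly storage cost: container disk at $0.10/GB, network volume at $0.07/GB."""
    return container_gb * 0.10 + volume_gb * 0.07

# Example: 20 GB container image + 100 GB network volume = $9.00/month
print(estimate_storage_cost(20, 100))
```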
</cost_quick_ref>

<cli_commands>
## Quick CLI Commands

```bash
# Install
pip install runpod

# Deploy endpoint
runpodctl project deploy --name my-endpoint --gpu-type "NVIDIA RTX 4090"

# Health check
runpod endpoint health ENDPOINT_ID

# View logs
runpod endpoint logs ENDPOINT_ID

# Scale workers
runpod endpoint scale ENDPOINT_ID --min 1 --max 10

# Local testing
python handler.py --rp_serve_api
```
</cli_commands>

<example_session>
## Example Session

**User:** "Deploy a Llama 3.1 8B model to RunPod serverless"

**Claude:**
1. First, select GPU - 8B model fits on RTX 4090 ($0.44/hr):
```python
gpu_type = "NVIDIA GeForce RTX 4090"
```

2. Use vLLM for 2-3x throughput. Create endpoint with:
```python
env = {
    "MODEL_NAME": "meta-llama/Llama-3.1-8B-Instruct",
    "MAX_MODEL_LEN": "8192",
    "GPU_MEMORY_UTILIZATION": "0.95",
}
```

3. Access via OpenAI-compatible API:
```python
from openai import OpenAI
client = OpenAI(
    api_key="YOUR_KEY",
    base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1",
)
```

4. Cost estimate: ~$0.44/hr compute, scale-to-zero when idle.
</example_session>

Overview

This skill deploys and manages GPU workloads on RunPod, covering serverless handlers, vLLM endpoints, and dedicated pods. It delivers production-ready streaming handlers, OpenAI-compatible vLLM deployment patterns, and cost-optimized GPU selection for everything from small embedding models to multi-GPU training. The reference library also covers scaling, health monitoring, and CI guidance for Apple Silicon users.

How this skill works

The skill provides handler templates for RunPod serverless (scale-to-zero) including streaming and progress updates, plus vLLM configurations that expose an OpenAI-compatible API backed by your GPU. It includes pod management and auto-scaling patterns, GPU selection logic (A4000 → H100), health check queries, and CI/CD notes to build x86 images from Apple Silicon. Example Dockerfile, runpodctl commands, and monitoring queries are included for end-to-end deployment and operations.

When to use it

  • You need a scale-to-zero serverless worker with per-second billing and streaming responses.
  • You want an OpenAI-compatible vLLM endpoint to serve Llama, Qwen, Mistral, or similar models.
  • You must select the right GPU (A100/H100/RTX) for cost and model size tradeoffs.
  • You require pod-based development or multi-GPU training with autoscaling.
  • You are on M1/M2 Mac and need CI builds for x86 Docker images.

Best practices

  • Choose GPU by effective model size and quantization; use the provided selection matrix to balance cost vs performance.
  • Use streaming handlers or generator patterns for low-latency token delivery and reduced client timeouts.
  • Set sensible auto-scaling: keep interactive APIs warm (min=1) and batch jobs scale-to-zero for cost savings.
  • Implement progress_update and robust error handling in handlers (explicit OOM errors with retry hints, structured error payloads).
  • Run GitHub Actions to build x86 containers if developing on Apple Silicon; avoid local docker builds on M1/M2.

Example use cases

  • Deploy a Llama-3.1-8B serverless endpoint on RTX 4090 for interactive chat with streaming tokens.
  • Host a 70B model on A100 or H100 with vLLM and OpenAI-compatible client for high-throughput inference.
  • Create long-running batch jobs with progress updates and aggressive scale-to-zero settings to minimize cost.
  • Run fine-tuning or large-model evaluation on dedicated pods with autoscaling and health monitoring hooks.
  • Estimate monthly GPU costs programmatically and pick spot instances or idle-time budgets to optimize spend.

FAQ

Can I build Docker images on an M1/M2 Mac?

No. Apple Silicon builds ARM64 images by default, which will not run on RunPod's x86_64 GPU hosts. Use the provided GitHub Actions workflow to build and push x86_64 containers instead.

How do I get OpenAI-style streaming from vLLM?

Deploy vLLM with the OpenAI-compatible base_url and call chat.completions.create(stream=True) or use the RunPod run/stream endpoints for direct streaming.