
This skill helps you optimize ML workloads for AI accelerators by detailing hardware choices and tuning strategies across GPUs, TPUs, and edge devices.

npx playbooks add skill doanchienthangdev/omgkit --skill ai-accelerators

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
9.9 KB
---
name: ai-accelerators
description: AI hardware accelerators including GPUs, TPUs, custom silicon, and hardware-aware optimization strategies for ML workloads.
---

# AI Accelerators

Hardware acceleration for ML workloads.

## Hardware Landscape

```
┌─────────────────────────────────────────────────────────────┐
│                    AI ACCELERATOR TYPES                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  GPU (NVIDIA)         TPU (Google)        NPU/Custom        │
│  ─────────────        ────────────        ─────────         │
│  CUDA cores           Systolic array      Apple Neural      │
│  Tensor cores         BF16 native         Qualcomm Hexagon  │
│  General purpose      TPU pods            Intel Habana      │
│  PyTorch/TF native    JAX optimized       AWS Inferentia    │
│                                                              │
│  FPGA                 ASIC                Edge Accelerators │
│  ─────────────        ────────────        ─────────         │
│  Reconfigurable       Fixed function      Coral Edge TPU    │
│  Low latency          Maximum perf        Jetson (NVIDIA)   │
│  Power efficient      High volume         Intel NCS2        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```

## GPU Optimization

### CUDA Memory Management
```python
import torch
from torch.utils.data import DataLoader

# Inspect and manage GPU memory
torch.cuda.empty_cache()           # release unused cached memory back to the GPU
torch.cuda.memory_allocated()      # bytes currently held by live tensors
torch.cuda.max_memory_allocated()  # peak allocation since start (or last reset)

# Pin host memory for faster host-to-device transfers
train_loader = DataLoader(
    dataset,
    batch_size=32,
    pin_memory=True,   # required for non_blocking=True copies below
    num_workers=4
)

# Copy batches to the GPU on a side stream
def async_prefetch(loader, device):
    stream = torch.cuda.Stream()
    for batch in loader:
        with torch.cuda.stream(stream):
            batch = batch.to(device, non_blocking=True)
        # make the default stream wait for the copy before the batch is used
        torch.cuda.current_stream().wait_stream(stream)
        yield batch
```

### Tensor Core Utilization
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Tensor cores need dimensions aligned to multiples of 8 (FP16)
class TensorCoreOptimized(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        # Round both dimensions up to the next multiple of 8
        self.in_features = ((in_features + 7) // 8) * 8
        self.out_features = ((out_features + 7) // 8) * 8
        self.linear = nn.Linear(self.in_features, self.out_features)
        self.pad_in = self.in_features - in_features
        self.orig_out = out_features

    def forward(self, x):
        if self.pad_in > 0:
            x = F.pad(x, (0, self.pad_in))
        # Drop the padded output columns so callers see the requested width
        return self.linear(x)[..., :self.orig_out]

# Enable TF32 on Ampere+ GPUs (faster matmuls, small precision loss)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Run eligible ops in FP16 via autocast (mixed precision, not a global cast)
with torch.cuda.amp.autocast(dtype=torch.float16):
    output = model(input)
```

### Multi-GPU Strategies
```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

# DataParallel (simple, single-process; not recommended for training)
model = nn.DataParallel(model)

# DistributedDataParallel (recommended; needs an initialized process
# group and a local_rank, see the setup sketch below)
model = DistributedDataParallel(model, device_ids=[local_rank])

# Model parallelism: split layers across GPUs when one device is too small
class ModelParallel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.TransformerEncoder(...).to('cuda:0')
        self.decoder = nn.TransformerDecoder(...).to('cuda:1')

    def forward(self, x):
        x = self.encoder(x.to('cuda:0'))
        x = self.decoder(x.to('cuda:1'))
        return x

# Pipeline parallelism: split a Sequential across GPUs, overlap micro-batches
# (requires torch.distributed.rpc; deprecated in recent PyTorch in favor
# of torch.distributed.pipelining)
from torch.distributed.pipeline.sync import Pipe

model = nn.Sequential(
    nn.Linear(100, 200).to('cuda:0'),
    nn.ReLU().to('cuda:0'),
    nn.Linear(200, 100).to('cuda:1')
)
model = Pipe(model, chunks=8)  # chunks = number of micro-batches
```
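
The DistributedDataParallel line above assumes a process group and `local_rank` already exist. A minimal setup sketch for a script launched with `torchrun`, which sets the `LOCAL_RANK` environment variable:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

dist.init_process_group(backend='nccl')     # one process per GPU
local_rank = int(os.environ['LOCAL_RANK'])  # set by torchrun
torch.cuda.set_device(local_rank)

model = model.to(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])
```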

## TPU Optimization

```python
# JAX/TPU optimized training
import functools
import jax
import jax.numpy as jnp
import optax
from flax import linen as nn

class TPUModel(nn.Module):
    features: int

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.features)(x)
        x = nn.relu(x)
        return nn.Dense(10)(x)

# pmap for data parallelism across TPU cores; axis_name must match
# the name passed to pmean below
@functools.partial(jax.pmap, axis_name='batch')
def train_step(state, batch):
    def loss_fn(params):
        logits = state.apply_fn({'params': params}, batch['image'])
        loss = jnp.mean(optax.softmax_cross_entropy(logits, batch['label']))
        return loss

    grad_fn = jax.value_and_grad(loss_fn)
    loss, grads = grad_fn(state.params)
    grads = jax.lax.pmean(grads, axis_name='batch')  # average grads across cores
    state = state.apply_gradients(grads=grads)
    return state, loss

# PyTorch/XLA for TPU
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

device = xm.xla_device()
model = model.to(device)

# ParallelLoader preloads batches to the device; iterate the per-device view
loader = pl.ParallelLoader(train_loader, [device]).per_device_loader(device)
for data, target in loader:
    output = model(data)
    loss = criterion(output, target)  # criterion: e.g. nn.CrossEntropyLoss()
    loss.backward()
    xm.optimizer_step(optimizer)      # steps and marks the XLA graph for execution
```
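
pmap leaves two details implicit: the train state must be replicated across devices, and each batch needs a leading device axis. A minimal sketch using `flax.jax_utils`, assuming a `create_train_state` helper defined elsewhere (hypothetical here):

```python
import jax
from flax import jax_utils

# copy the same params/optimizer state to every local device
state = jax_utils.replicate(create_train_state(...))  # create_train_state: user-defined

n = jax.local_device_count()
# reshape [batch, ...] to [n_devices, batch // n, ...] so pmap can split it
sharded = {k: v.reshape(n, -1, *v.shape[1:]) for k, v in batch.items()}
state, loss = train_step(state, sharded)
```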

## Edge Accelerators

### NVIDIA Jetson
```python
# TensorRT optimization for Jetson
import tensorrt as trt

def build_engine(onnx_path, precision='fp16'):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            raise RuntimeError(f'ONNX parse failed: {parser.get_error(0)}')

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1GB (use set_memory_pool_limit in TRT 8.4+)

    if precision == 'fp16':
        config.set_flag(trt.BuilderFlag.FP16)
    elif precision == 'int8':
        config.set_flag(trt.BuilderFlag.INT8)
        # EntropyCalibrator: user-defined trt.IInt8EntropyCalibrator2 subclass
        config.int8_calibrator = EntropyCalibrator(calibration_data)

    return builder.build_engine(network, config)

# DeepStream for video inference
# gst-launch-1.0 filesrc location=video.mp4 ! \
#   decodebin ! nvvideoconvert ! \
#   nvinfer config-file-path=config.txt ! \
#   nvdsosd ! nveglglessink
```

### Coral Edge TPU
```python
from pycoral.utils import edgetpu
from pycoral.adapters import common, classify

# Load Edge TPU model
interpreter = edgetpu.make_interpreter('model_edgetpu.tflite')
interpreter.allocate_tensors()

# Inference
common.set_input(interpreter, image)
interpreter.invoke()
classes = classify.get_classes(interpreter, top_k=5)

# Compile model for Edge TPU
# edgetpu_compiler model.tflite
```

### TFLite for Mobile
```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('saved_model/')
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Option A: float16 quantization (~2x smaller, GPU-friendly)
converter.target_spec.supported_types = [tf.float16]

# Option B: full integer quantization (choose one option, not both;
# representative_dataset is a generator yielding sample inputs)
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

tflite_model = converter.convert()

# Inference
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
interpreter.set_tensor(input_details[0]['index'], input_data)  # preprocessed array
interpreter.invoke()
output = interpreter.get_tensor(interpreter.get_output_details()[0]['index'])
```

## Hardware-Aware Optimization

### Auto-Tuning
```python
# TVM auto-tuning for specific hardware
import tvm
from tvm import relay, autotvm

# mod/params come from a relay frontend import, e.g.
# mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
tasks = autotvm.task.extract_from_program(
    mod["main"], target="cuda", params=params
)

# Tune each task
for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)
    tuner.tune(
        n_trial=1000,
        measure_option=autotvm.measure_option(
            builder=autotvm.LocalBuilder(),
            runner=autotvm.LocalRunner(number=10)
        ),
        callbacks=[autotvm.callback.log_to_file('tune.log')]
    )

# Compile with best configs
with autotvm.apply_history_best('tune.log'):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="cuda", params=params)
```

### Hardware Selection Matrix
```python
def select_hardware(model_size, latency_req, batch_size, budget):
    """Recommend hardware for an ML workload.

    model_size: parameter count, latency_req: target latency in ms,
    budget: compared informally against each recommendation's 'cost'.
    """
    recommendations = []

    if model_size > 10e9:  # >10B params
        recommendations.append({
            'hardware': 'Multi-GPU (A100/H100)',
            'reason': 'Large model requires high memory bandwidth',
            'cost': 'High'
        })

    if latency_req < 10:  # <10ms
        recommendations.append({
            'hardware': 'TensorRT + GPU',
            'reason': 'Low latency requires optimized inference',
            'cost': 'Medium'
        })

    if batch_size == 1 and latency_req < 5:
        recommendations.append({
            'hardware': 'Edge TPU / Jetson',
            'reason': 'Single-sample low-latency inference',
            'cost': 'Low'
        })

    return recommendations
```
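
A quick usage example with illustrative values (a 12B-parameter model, 50 ms latency target):

```python
for rec in select_hardware(model_size=12e9, latency_req=50,
                           batch_size=32, budget='high'):
    print(f"{rec['hardware']}: {rec['reason']} (cost: {rec['cost']})")
# -> Multi-GPU (A100/H100): Large model requires high memory bandwidth (cost: High)
```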

## Benchmarking

```python
import torch
import torch.utils.benchmark as benchmark

def benchmark_model(model, input_shape, device='cuda'):
    x = torch.randn(*input_shape).to(device)
    model = model.to(device)
    model.eval()

    with torch.no_grad():
        # Warmup: trigger kernel selection and cache allocation
        for _ in range(10):
            model(x)

        # Timer handles CUDA synchronization around measurements
        timer = benchmark.Timer(
            stmt='model(x)',
            globals={'model': model, 'x': x}
        )
        result = timer.blocked_autorange(min_run_time=1)

    return {
        'mean_ms': result.mean * 1000,
        'median_ms': result.median * 1000,
        'iqr_ms': result.iqr * 1000,
        'throughput': 1.0 / result.mean  # iterations per second
    }
```

## Commands
- `/omgoptim:profile` - Profile on hardware
- `/omgdeploy:edge` - Edge deployment
- `/omgdeploy:cloud` - Cloud GPU deployment

## Best Practices

1. Profile on target hardware early
2. Use hardware-specific optimizations
3. Batch for throughput, stream for latency (see the sketch after this list)
4. Consider power consumption for edge
5. Test with production data volumes
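
To make practice 3 concrete, here is a minimal sketch using the `benchmark_model` helper from the Benchmarking section (model and input shapes are placeholders):

```python
# Latency mode: batch size 1, watch the per-request time
single = benchmark_model(model, (1, 3, 224, 224))

# Throughput mode: larger batch, watch samples per second
batched = benchmark_model(model, (32, 3, 224, 224))
samples_per_sec = 32 * batched['throughput']

print(f"p50 latency @ bs=1: {single['median_ms']:.2f} ms")
print(f"throughput @ bs=32: {samples_per_sec:.0f} samples/s")
```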

Overview

This skill surfaces practical guidance and utilities for choosing and optimizing AI hardware accelerators including GPUs, TPUs, NPUs, FPGAs, ASICs, and edge devices. It focuses on memory, compute, and latency optimizations, multi-GPU and TPU workflows, edge deployment patterns, and hardware-aware tuning. The goal is to help teams match model requirements to the right hardware and apply concrete performance improvements.

How this skill works

The skill inspects workload characteristics (model size, batch size, latency requirement, and target device) and recommends hardware types and optimization strategies. It includes patterns for CUDA memory management, tensor core alignment, mixed precision, multi-GPU/TPU parallelism, and edge toolchains like TensorRT and TFLite. It also covers auto-tuning and benchmarking workflows to validate choices on target hardware.

When to use it

  • Selecting hardware for training or inference (cloud, on-prem, or edge).
  • Optimizing GPU memory and throughput for large models or high batch sizes.
  • Designing multi-GPU, TPU, or model-parallel training pipelines.
  • Deploying low-latency models to Jetson, Edge TPU, or mobile devices.
  • Benchmarking and tuning kernels with TVM, TensorRT, or JAX/TPU toolchains.

Best practices

  • Profile on the target hardware early and measure with representative data.
  • Use hardware-specific libraries: CUDA/cuDNN for GPUs, XLA/JAX for TPUs, TensorRT/TFLite for inference.
  • Leverage mixed precision and tensor-core alignment to improve throughput with minimal accuracy loss.
  • Choose batching strategy to match latency vs throughput goals: stream for low latency, batch for high throughput.
  • Automate auto-tuning and keep an experimental log to apply best configs during builds.

Example use cases

  • Train a 10B+ parameter model across multiple A100/H100 GPUs with DistributedDataParallel and model sharding.
  • Convert and quantize a vision model to TFLite and deploy to mobile with full integer quantization.
  • Optimize inference pipeline on Jetson using TensorRT FP16 engine and DeepStream for real-time video.
  • Use JAX pmap to scale TPU training across cores with gradient pmean aggregation.
  • Run TVM autotuning for custom kernels and compile the best configuration for a specific GPU/ASIC.

FAQ

How do I choose between GPU, TPU, or edge accelerator?

Match model size, latency, throughput, and budget. Use GPUs for general training and flexibility, TPUs for large-scale JAX/TF training, and edge accelerators for single-sample low-latency inference with tight power constraints.

When should I use mixed precision?

Use mixed precision (FP16/TF32) when hardware supports tensor cores or TF32 and your model tolerates small numerical differences; it often yields large speedups and memory savings.
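
A minimal mixed-precision training step in PyTorch for illustration; `model`, `optimizer`, `loss_fn`, and `train_loader` are assumed to exist:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 underflow

for data, target in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # eligible ops run in FP16
        output = model(data.cuda())
        loss = loss_fn(output, target.cuda())
    scaler.scale(loss).backward()     # backward on the scaled loss
    scaler.step(optimizer)            # unscales grads; skips step on inf/NaN
    scaler.update()
```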