
rust-gpu skill


This skill helps optimize GPU memory management and CUDA/OpenCL workflows in Rust engineering, enabling efficient data transfer, memory access, and kernel execution.

npx playbooks add skill huiali/rust-skills --skill rust-gpu

Review the files below or copy the command above to add this skill to your agents.

Files (4)
SKILL.md
6.2 KB
---
name: rust-gpu
description: GPU memory and compute expert. Covers CUDA, OpenCL, GPU memory, compute shaders, memory coalescing, zero-copy, VRAM management, and heterogeneous computing.
---

# GPU Memory and Compute

## Core Question

**How do you manage GPU memory and heterogeneous computation efficiently in Rust?** GPU computing requires dedicated memory management strategies and synchronization mechanisms.


## GPU Memory Architecture

```
┌─────────────────────────────────────────┐
│               GPU Memory                │
├─────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐       │
│  │   Global    │  │   Shared    │       │
│  │   Memory    │  │   Memory    │       │
│  │   (VRAM)    │  │   (SMEM)    │       │
│  └─────────────┘  └─────────────┘       │
│                                         │
│  ┌─────────────┐  ┌─────────────┐       │
│  │  Constant   │  │   Local     │       │
│  │   Memory    │  │   Memory    │       │
│  └─────────────┘  └─────────────┘       │
└─────────────────────────────────────────┘
        ↓                    ↑
   CPU (via PCIe)      GPU compute units
```


## Memory Type Comparison

| Memory type | Location | Latency | Size | Typical use |
|-------------|----------|---------|------|-------------|
| Global | VRAM | High | Large | Input/output data |
| Shared | SMEM | Low | Small | Intra-block communication |
| Constant | Cached | Medium | Medium | Read-only data |
| Local | VRAM (register spill) | High | Small | Thread-private data |
| Register | SM | Lowest | Tiny | Thread-private data |
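
For intuition on why the table matters, a quick back-of-envelope sketch in Rust (the bandwidth figures are illustrative assumptions, not measurements; real numbers depend on hardware and driver):

```rust
/// Rough transfer-time estimate in milliseconds.
fn transfer_ms(bytes: usize, gb_per_s: f64) -> f64 {
    bytes as f64 / (gb_per_s * 1e9) * 1e3
}

fn main() {
    let bytes = 64 * 1024 * 1024; // a 64 MiB buffer
    // Assumed ballpark bandwidths: PCIe 4.0 x16 ~25 GB/s, GDDR6 VRAM ~400 GB/s.
    println!("PCIe copy: {:.2} ms", transfer_ms(bytes, 25.0));
    println!("VRAM read: {:.2} ms", transfer_ms(bytes, 400.0));
}
```

An order of magnitude between bus and VRAM bandwidth is why minimizing host↔device traffic dominates most GPU optimization work.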


## CUDA Memory Management (rust-cuda)

```rust
// Using the rust-cuda or cuda-sys bindings
use cuda_sys::ffi::*;
use std::os::raw::c_void;

let size: usize = 1024;
let h_data: Vec<f32> = vec![1.0; size];
let bytes = size * std::mem::size_of::<f32>();

// Allocate device memory
let mut d_ptr: *mut f32 = std::ptr::null_mut();
unsafe {
    cudaMalloc(&mut d_ptr as *mut *mut f32 as *mut *mut c_void, bytes);
}

// Host-to-device copy
unsafe {
    cudaMemcpy(
        d_ptr as *mut c_void,
        h_data.as_ptr() as *const c_void,
        bytes,
        cudaMemcpyHostToDevice,
    );
}

// Device-to-host copy
let mut h_result: Vec<f32> = vec![0.0; size];
unsafe {
    cudaMemcpy(
        h_result.as_mut_ptr() as *mut c_void,
        d_ptr as *const c_void,
        bytes,
        cudaMemcpyDeviceToHost,
    );
}

// Free device memory
unsafe {
    cudaFree(d_ptr as *mut c_void);
}
```
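
The raw FFI calls above are easy to leak or double-free. A common Rust pattern is an RAII wrapper that frees the allocation on `Drop`; a minimal sketch under the same cuda_sys binding assumptions (error handling elided; the rust-cuda project's cust crate offers a production-grade equivalent):

```rust
use cuda_sys::ffi::*;
use std::os::raw::c_void;

/// Owns a device allocation; freed automatically on Drop.
struct DeviceBuffer {
    ptr: *mut f32,
    len: usize,
}

impl DeviceBuffer {
    fn new(len: usize) -> Self {
        let mut ptr: *mut f32 = std::ptr::null_mut();
        unsafe {
            cudaMalloc(
                &mut ptr as *mut *mut f32 as *mut *mut c_void,
                len * std::mem::size_of::<f32>(),
            );
        }
        DeviceBuffer { ptr, len }
    }

    /// Copy a host slice to the device.
    fn upload(&mut self, src: &[f32]) {
        assert!(src.len() <= self.len);
        unsafe {
            cudaMemcpy(
                self.ptr as *mut c_void,
                src.as_ptr() as *const c_void,
                src.len() * std::mem::size_of::<f32>(),
                cudaMemcpyHostToDevice,
            );
        }
    }
}

impl Drop for DeviceBuffer {
    fn drop(&mut self) {
        unsafe {
            cudaFree(self.ptr as *mut c_void);
        }
    }
}
```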


## Zero-Copy Memory

```rust
// Pinned host memory: fast DMA transfers and, when mapped, direct GPU access
let mut h_ptr: *mut f32 = std::ptr::null_mut();
let bytes = size * std::mem::size_of::<f32>();

// Allocate pinned (page-locked) memory with cudaMallocHost
unsafe {
    cudaMallocHost(&mut h_ptr as *mut *mut f32 as *mut *mut c_void, bytes);
}

// Pinned memory speeds up DMA transfers; for true zero-copy GPU access it
// must also be mapped (cudaHostAlloc with cudaHostAllocMapped).
// Either way, it increases system memory pressure.

// Use cudaMemcpyAsync on a non-default stream so the copy can
// overlap with kernel execution
let mut stream: cudaStream_t = std::ptr::null_mut();
unsafe {
    cudaStreamCreate(&mut stream);
    cudaMemcpyAsync(
        d_ptr as *mut c_void,
        h_ptr as *const c_void,
        bytes,
        cudaMemcpyHostToDevice,
        stream,
    );
}

// Wait for the stream to finish
unsafe {
    cudaStreamSynchronize(stream);
}
```
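
Scaling this up, transfers and kernels can be pipelined across multiple streams so one chunk uploads while another computes. A sketch under the same FFI assumptions (the kernel launch is a hypothetical placeholder; overlap requires the host buffer to be pinned, as above):

```rust
use cuda_sys::ffi::*;
use std::os::raw::c_void;

/// Upload `n` floats in two chunks on two streams so the copy of one
/// chunk can overlap with the kernel processing the other.
unsafe fn pipelined_upload(d_ptr: *mut f32, h_pinned: *const f32, n: usize) {
    let mut streams: [cudaStream_t; 2] = [std::ptr::null_mut(); 2];
    for s in streams.iter_mut() {
        cudaStreamCreate(s);
    }

    let chunk = n / 2; // assume n is even for brevity
    for (i, &stream) in streams.iter().enumerate() {
        let off = i * chunk;
        cudaMemcpyAsync(
            d_ptr.add(off) as *mut c_void,
            h_pinned.add(off) as *const c_void,
            chunk * std::mem::size_of::<f32>(),
            cudaMemcpyHostToDevice,
            stream,
        );
        // launch_kernel_on_stream(d_ptr.add(off), chunk, stream); // placeholder
    }

    for &stream in streams.iter() {
        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
    }
}
```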


## Unified Memory

```rust
// Unified memory: CPU/GPU data migration is managed automatically
let mut unified_ptr: *mut f32 = std::ptr::null_mut();
let bytes = size * std::mem::size_of::<f32>();

unsafe {
    // Allocate unified memory
    cudaMallocManaged(&mut unified_ptr as *mut *mut f32 as *mut *mut c_void, bytes);
}

// CPU access
unsafe {
    for i in 0..size {
        *unified_ptr.add(i) = i as f32;
    }
}

// GPU access (pages migrate to the device automatically)
// Launch the CUDA kernel (placeholder for your own launcher)
launch_kernel(unified_ptr, size);

// Synchronize before the CPU touches the data again
unsafe {
    cudaDeviceSynchronize();
    // CPU reads the result (pages migrate back to the host)
    println!("Result: {}", unified_ptr.add(0).read());
}

// Free
unsafe {
    cudaFree(unified_ptr as *mut c_void);
}
```
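
Unified memory pages migrate on first touch, which can stall the kernel. If your binding crate exposes cudaMemPrefetchAsync (a CUDA 8+ runtime call; its presence in the bindings is an assumption here), migrations can be hinted ahead of time:

```rust
use cuda_sys::ffi::*;
use std::os::raw::c_void;

unsafe fn prefetch_then_launch(unified_ptr: *mut f32, size: usize) {
    let bytes = size * std::mem::size_of::<f32>();
    let device = 0; // destination GPU id

    // Migrate pages to the GPU up front instead of faulting them in
    // one page at a time when the kernel first touches them.
    cudaMemPrefetchAsync(
        unified_ptr as *const c_void,
        bytes,
        device,
        std::ptr::null_mut(), // default stream
    );

    launch_kernel(unified_ptr, size); // same placeholder as above

    // Prefetching back to cudaCpuDeviceId before host reads is the
    // mirror-image optimization.
    cudaDeviceSynchronize();
}
```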


## Coalesced Memory Access

```cuda
// Coalesced access patterns maximize global memory bandwidth
// ❌ Bad: non-coalesced, strided access
__global__ void bad_access(float* data) {
    int idx = threadIdx.x + blockIdx.x * 32;
    float value = data[idx * 32];  // each thread reads 32 * sizeof(float) apart
}

// ✅ Good: coalesced access
__global__ void coalesced_access(float* data) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    float value = data[idx];  // adjacent threads read adjacent elements
}
```
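
The access pattern is usually decided by host-side data layout. A struct-of-arrays layout gives each thread a contiguous element, while array-of-structs forces strided loads; a plain Rust sketch of the two layouts:

```rust
/// Array-of-structs: thread i reading items[i].x produces strided
/// global loads (stride = size_of::<Particle>() = 12 bytes).
struct Particle {
    x: f32,
    y: f32,
    z: f32,
}

struct ParticlesAoS {
    items: Vec<Particle>,
}

/// Struct-of-arrays: thread i reading xs[i] produces fully coalesced
/// loads, since adjacent threads touch adjacent f32s.
struct ParticlesSoA {
    xs: Vec<f32>,
    ys: Vec<f32>,
    zs: Vec<f32>,
}
```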


## Shared Memory Usage

```cuda
// Use shared memory to cut down global memory traffic
__global__ void shared_memory_reduce(float* input, float* output) {
    __shared__ float sdata[256];  // 256 floats (1 KB) of shared memory per block

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Load from global memory into shared memory
    sdata[tid] = input[idx];
    __syncthreads();

    // Tree reduction: halve the active range each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes the block's partial sum
    if (tid == 0) {
        output[blockIdx.x] = sdata[0];
    }
}
```
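
The kernel leaves one partial sum per block, so a second pass finishes the reduction. A host-side sketch of the launch math and final pass (assuming the partial sums were copied back as in the memcpy examples above):

```rust
fn finish_reduction(input_len: usize, partial_sums: &[f32]) -> f32 {
    let block_dim = 256; // must match sdata[256] in the kernel
    let grid_dim = (input_len + block_dim - 1) / block_dim;
    assert_eq!(partial_sums.len(), grid_dim);

    // With one partial sum per block, a simple CPU pass is often cheaper
    // than another kernel launch for small grids.
    partial_sums.iter().sum()
}
```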


## Memory Alignment

```cpp
// Memory alignment optimization
const size_t ALIGNMENT = 256;  // 256-byte alignment

// Pointers returned by cudaMalloc are already suitably aligned,
// but custom data structures need explicit alignment
struct alignas(256) AlignedData {
    float4 position;  // 16 bytes
    float4 normal;    // 16 bytes
    // ... padded out to 256 bytes
};

// Check alignment
assert(((uintptr_t)ptr % ALIGNMENT) == 0);
```
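
In Rust the same layout guarantee comes from `#[repr(C, align(256))]`; a minimal, runnable sketch with the equivalent checks:

```rust
/// 256-byte aligned, C-compatible layout for device transfers.
#[repr(C, align(256))]
struct AlignedData {
    position: [f32; 4], // 16 bytes
    normal: [f32; 4],   // 16 bytes
    // padded out to 256 bytes by the align attribute
}

fn main() {
    assert_eq!(std::mem::align_of::<AlignedData>(), 256);
    assert_eq!(std::mem::size_of::<AlignedData>(), 256);

    let data = AlignedData { position: [0.0; 4], normal: [0.0; 4] };
    // Same check as the C++ assert above.
    assert_eq!(&data as *const _ as usize % 256, 0);
}
```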


## Performance Optimization Checklist

| Optimization | Checkpoint |
|--------------|------------|
| Memory coalescing | Threads access contiguous memory |
| Shared memory | Reduced global memory traffic |
| Memory alignment | 256-byte aligned structures |
| Async operations | Transfers overlap with compute |
| Pinned memory | Page-locked host buffers where beneficial |
| Batching | Fewer kernel launches, lower launch overhead |
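
Most of these checkpoints are best verified by measurement. A minimal host-side timing helper using only std (for device-accurate timing, CUDA events are preferable; whether the bindings expose them is not assumed here):

```rust
use std::time::Instant;

/// Times a closure and reports effective throughput in GB/s.
fn measure<F: FnMut()>(label: &str, bytes: usize, mut f: F) {
    let start = Instant::now();
    f();
    // NOTE: for asynchronous CUDA work, synchronize inside `f`
    // (e.g. cudaStreamSynchronize); otherwise only the enqueue is timed.
    let secs = start.elapsed().as_secs_f64();
    println!("{label}: {:.2} GB/s", bytes as f64 / secs / 1e9);
}

fn main() {
    let data = vec![0u8; 64 * 1024 * 1024];
    // Stand-in workload: a host-side copy of 64 MiB.
    measure("host memcpy", data.len(), || {
        let _copy = data.clone();
    });
}
```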


## Related Skills

```
rust-gpu
    │
    ├─► rust-performance → performance optimization
    ├─► rust-unsafe → low-level memory operations
    └─► rust-embedded → no_std devices
```

Overview

This skill provides practical GPU memory and compute guidance for Rust engineers working with CUDA, OpenCL, compute shaders, and heterogeneous systems. It focuses on memory types, allocation patterns, zero-copy and unified memory, coalesced access, shared memory reduction, and alignment strategies. The content helps diagnose performance issues and design efficient data movement for high-throughput kernels.

How this skill works

The skill inspects typical GPU memory layers (global/VRAM, shared/SMEM, constant, local, registers) and maps Rust/CUDA patterns to each. It outlines allocation and copy idioms (cudaMalloc, cudaMallocHost, cudaMallocManaged, cudaMemcpyAsync) and recommends code motifs for coalescing, shared-memory reduction, and alignment. It also provides a checklist of performance checks and integration pointers for related Rust practices.

When to use it

  • Designing kernel data layouts and access patterns for high memory bandwidth
  • Implementing host↔device transfers with minimal stalls and overlapped compute
  • Choosing between page-locked (zero-copy) memory, unified memory, or explicit copies
  • Optimizing reductions, tiling, or other shared-memory algorithms
  • Diagnosing low GPU utilization or memory-bound kernels

Best practices

  • Favor coalesced global memory access: threads should read contiguous elements to maximize bandwidth
  • Use shared memory to stage data and reduce global memory traffic for intra-block communication
  • Use page-locked host memory (cudaMallocHost) only when it measurably helps; pinning increases system memory pressure
  • Use asynchronous memcpy and streams to overlap data transfer with kernel execution
  • Align structures to 256-byte boundaries when layouts cross cache lines or use wide loads/stores

Example use cases

  • Porting a CPU vectorized loop to a GPU kernel and ensuring global accesses are coalesced
  • Implementing a block-wise reduction using shared memory to reduce global traffic
  • Using pinned host buffers for real-time capture, mapped into the device address space (cudaHostAlloc with cudaHostAllocMapped) for zero-copy GPU access
  • Switching to cudaMallocManaged to simplify pointer semantics for prototyping or irregular access patterns
  • Diagnosing a memory-bound kernel by checking stride patterns, alignment, and shared-memory usage

FAQ

When should I use unified memory vs explicit copies?

Use unified memory for convenience or when working with irregular access patterns and rapid prototyping. For maximum performance and predictable transfers, prefer explicit cudaMalloc + cudaMemcpy with carefully scheduled asynchronous copies.

Does zero-copy always improve latency?

No. Zero-copy (page-locked host memory) can reduce copy overhead but may increase system memory pressure and have lower throughput than device-local VRAM for heavy workloads. Measure both approaches for your workload.

How large should shared memory tiles be?

Tile size should fit per-block shared memory limits and enable efficient warp-level parallelism. Common choices are powers of two that match blockDim.x and allow reduction loops to halve the active range each step.