---
name: cuda-kernel-refine
description: |
  Iterative CUDA kernel optimization using NVIDIA profiling tools (nsys, ncu).
  Triggers on: "optimize kernel", "refine kernel", "improve throughput",
  "kernel performance", "make kernel faster", "reduce bandwidth",
  "memory bound", "compute bound", "roofline", "compare benchmark",
  "register pressure", "occupancy"
---
# CUDA Kernel Refinement Loop
**One change per cycle:** baseline → profile → classify → optimize → verify → compare → loop. Multiple simultaneous changes make it impossible to attribute improvement. Revert on regression.
## 1. Establish Baseline
Discover the project's benchmark infrastructure before running anything:
1. Check `Makefile` for bench/profile targets (`make bench`, `make profile`, etc.)
2. Check for benchmark binaries (`cargo bench`, `pytest-benchmark`, a custom `e2e_bench` binary)
3. Look at CI scripts or README for the canonical benchmark invocation
4. If no benchmark exists, write a minimal one that exercises the hot path with measurable output
Run with enough iterations for stable results — coefficient of variation <5%.
```bash
# Save baseline (JSON or structured output preferred for diff later)
./benchmark --iterations 5 --json-output /tmp/bench-baseline.json
```
Record: primary metric (tok/s, GFLOPS, GB/s, or latency ms), stddev, and workload parameters (input size, batch size, context depth). You need all three for meaningful comparison later.
## 2. Profile with nsys
### Capture
```bash
nsys profile --stats=true -o /tmp/nsys-profile ./app [args]
```
### Quick look (JSON)
```bash
# Kernel timing summary — sort by Time(%) to find hotspot
nsys stats -r cuda_gpu_kern_sum --format json /tmp/nsys-profile.nsys-rep
# CUDA API calls (spot synchronization overhead, excessive cudaMalloc)
nsys stats -r cuda_api_sum --format json /tmp/nsys-profile.nsys-rep
# Memory operations by size (spot unnecessary small D2H/H2D transfers)
nsys stats -r cuda_gpu_mem_size_sum --format json /tmp/nsys-profile.nsys-rep
```
### Deep analysis (SQLite — the real power tool)
For custom queries, export to SQLite and use SQL directly. This is far more powerful than JSON for kernel-level analysis.
```bash
# Export nsys trace to SQLite
nsys export --type sqlite /tmp/nsys-profile.nsys-rep
# Creates /tmp/nsys-profile.sqlite
# Top kernels by time with full names
sqlite3 -header -column /tmp/nsys-profile.sqlite \
"SELECT
printf('%.1f%%', 100.0*sum(k.end-k.start)/(SELECT sum(end-start) FROM CUPTI_ACTIVITY_KIND_KERNEL)) as pct,
count(*) as calls,
printf('%.1f', sum(k.end-k.start)/1e6) as total_ms,
printf('%.1f', avg(k.end-k.start)/1e3) as avg_us,
s.value as kernel
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON k.shortName = s.id
GROUP BY s.value
ORDER BY sum(k.end-k.start) DESC
LIMIT 20"
# Per-token analysis (for inference: normalize by decode_tokens)
sqlite3 -header -column /tmp/nsys-profile.sqlite \
"SELECT
printf('%.0f', count(*) * 1.0 / <DECODE_TOKENS>) as 'launches/tok',
printf('%.1f', sum(end-start)/1e6/<DECODE_TOKENS>) as 'ms/tok',
printf('%.1f', avg(end-start)/1e3) as 'us/launch'
FROM CUPTI_ACTIVITY_KIND_KERNEL"
# Kernel launch configs (grid, block, registers) — find dimension mismatches
sqlite3 -header -column /tmp/nsys-profile.sqlite \
"SELECT s.value as kernel, k.gridX, k.gridY, k.gridZ,
k.blockX, k.blockY, k.blockZ, k.registersPerThread as regs,
count(*) as launches, printf('%.1f', avg(k.end-k.start)/1e3) as avg_us
FROM CUPTI_ACTIVITY_KIND_KERNEL k
JOIN StringIds s ON k.shortName = s.id
GROUP BY s.value, k.gridX, k.gridY, k.gridZ, k.blockX, k.blockY, k.blockZ
ORDER BY sum(k.end-k.start) DESC LIMIT 20"
```
Replace `<DECODE_TOKENS>` with the number of tokens generated in the trace.
### Interpret the profile
- **High `Instances` + moderate `Avg`** → per-layer kernel, fusion opportunity
- **Low `Instances` + high `Avg`** → single expensive kernel, optimize internals
- **Many small kernels** → launch overhead ("death by a thousand cuts"), count launches/token to quantify
- **High `cudaMalloc` count in API summary** → framework allocator overhead, consider memory pools
## 3. Check Register Usage
```bash
nvcc -cubin -arch=<target_sm> -Xptxas=-v kernel.cu -o /tmp/kernel.cubin
```
Output format:
```text
ptxas info : Compiling entry function 'kernel_name' for 'sm_121a'
ptxas info : Function properties for kernel_name
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 29 registers, used 0 barriers
```
Key signals:
| Signal | Meaning | Action |
| ---------------------- | ----------------------------------------- | -------------------------------------------- |
| Spill stores/loads > 0 | Register pressure | `__launch_bounds__` or reduce live variables |
| Registers > 64 | Occupancy limited by registers | Check ncu Occupancy section |
| Stack frame > 0 | Function calls or array-to-local demotion | Inline functions, use shared memory |
| Barriers > 0 | Shared memory synchronization | Expected with `__syncthreads()` |
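The first-row fix, `__launch_bounds__`, as a minimal sketch (the kernel body and the `(256, 2)` pair are illustrative; tune against your own ptxas output):
```cuda
// Hypothetical kernel: __launch_bounds__(256, 2) tells ptxas to fit at
// least 2 resident blocks of 256 threads per SM, which caps registers
// per thread. Re-run nvcc -Xptxas=-v afterward and confirm spill
// stores/loads stayed at 0; a too-aggressive cap trades spills for occupancy.
__global__ void __launch_bounds__(256, 2)
scale_kernel(const float* __restrict__ in, float* __restrict__ out,
             float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = alpha * in[i];
}
```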
## 4. Drill Down with ncu
**Requires GPU performance counter permissions.**
```bash
# Default sections (SOL + occupancy), fast
ncu -k "kernel_regex" -c 5 ./app [args]
# Full metrics including roofline
ncu -k "kernel_regex" --set full -o /tmp/ncu-report ./app [args]
# CSV for specific metrics
ncu --csv -k "kernel_regex" --metrics \
sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
./app [args]
```
If you get `ERR_NVGPUCTRPERM`, performance counters are restricted. Fix with:
```bash
# Option A: run ncu with sudo
sudo ncu -k "kernel_regex" -c 5 ./app [args]
# Option B: kernel module parameter (persistent, requires reboot)
sudo sh -c 'echo "options nvidia NVreg_RestrictProfilingToAdminUsers=0" > /etc/modprobe.d/ncu-permissions.conf'
```
Section sets (choose based on what you need):
- **Default** (`SpeedOfLight`, `Occupancy`) — fast, answers "is this compute or memory bound?"
- **`--set detailed`** — adds instruction mix, memory workload breakdown, warp scheduling. Use when default sections show a bottleneck but don't explain why.
- **`--set full`** — adds roofline charts. 10-50x slower. Only use when you need the visual roofline or are writing up an analysis.
## 5. Classify Bottleneck
From ncu `SpeedOfLight` section (or infer from kernel type if ncu unavailable):
| SM SOL % | Memory SOL % | Classification | Optimization Direction |
| ----------- | ------------ | ------------------ | ------------------------------------- |
| Low | High (>60%) | **Memory-bound** | Reduce traffic, wider loads, fuse ops |
| High (>60%) | Low | **Compute-bound** | Fast math, ILP, tensor cores |
| Low | Low | **Latency-bound** | Occupancy, reduce sync, overlap |
| High | High | **Well-optimized** | Algorithmic changes or accept |
### Bandwidth utilization (memory-bound kernels)
For inference workloads, compute bandwidth utilization directly — this is more actionable than abstract roofline:
```text
data_per_step = weight_bytes + activation_bytes + KV_cache_bytes (read + write)
theoretical_min_time = data_per_step / hw_bandwidth
utilization = theoretical_min_time / actual_time
```
Example: FP4 weights read 5.1 GB/token on 273 GB/s → theoretical min 18.7ms, actual 24ms → 78% utilization. The 22% gap is your optimization headroom.
### Without ncu
If no counter permissions, infer from kernel behavior:
- Elementwise ops (activation functions, normalization) → almost always memory-bound
- Reductions with few output elements (layernorm, softmax) → likely compute-bound on the reduction
- Large matmuls → compute-bound
- Many tiny kernels with low total time → latency-bound (launch overhead)
### Identify host-side bottlenecks
Not all overhead lives in GPU kernels. Check for:
- **D2H synchronization points**: search for device-to-host copies in the codebase — `cudaMemcpy(..., DeviceToHost)`, `cudaStreamSynchronize`, and framework-specific equivalents (PyTorch: `.item()`, `.cpu()`, `.numpy()`; Candle: `.to_vec1()`; CUDA C++: `cudaMemcpyAsync` followed by a stream sync). Each one stalls the GPU pipeline. Count them per inference step — eliminating unnecessary D2H syncs is often the single biggest win (see the sketch after this list).
- **Excessive allocations**: check `cudaMalloc` count in `nsys stats -r cuda_api_sum`. ML frameworks (Candle, PyTorch eager) often allocate fresh GPU buffers for every intermediate tensor. Hundreds of allocations per step is a sign of framework overhead.
- **CPU-side compute**: sampling, tokenization, or postprocessing running on CPU while GPU idles. Shows up as gaps between kernel launches in the nsys timeline.
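A sketch of the D2H anti-pattern and the batched fix, in CUDA C++ (`run_layer` and the buffer names are hypothetical stand-ins for your own launch path):
```cuda
#include <cuda_runtime.h>

// Assumed elsewhere: enqueues one layer's kernels on `stream`.
void run_layer(int layer, cudaStream_t stream);

void decode_step(int n_layers, cudaStream_t stream,
                 const float* d_logits, float* h_logits, size_t logits_bytes) {
    // Anti-pattern: a blocking D2H read inside the loop, e.g.
    //   cudaMemcpy(&probe, d_act, sizeof probe, cudaMemcpyDeviceToHost);
    // drains the entire pipeline once per layer.
    for (int layer = 0; layer < n_layers; ++layer)
        run_layer(layer, stream);  // stays asynchronous on one stream

    // Batch the single required D2H transfer to the end of the step.
    cudaMemcpyAsync(h_logits, d_logits, logits_bytes,
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);  // one sync per step, not one per layer
}
```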
## 6. Optimization Strategies
### Memory-bound
Goal: reduce bytes moved per operation.
- **Kernel fusion** (see latency-bound for prioritization): every intermediate tensor written to global memory and read back is wasted bandwidth
- **Vectorized loads**: `uint4` (128-bit) = 8 BF16 or 4 FP32 per load. Requires pointer alignment. Verify with SASS (`LDG.E.128`). See the sketch after this list.
- **Shared memory reuse**: cache data accessed by multiple threads (use stride D+1 to avoid bank conflicts)
- **Reduce precision**: FP8/FP4 weights cut memory traffic 2-4x with minimal accuracy loss. Keep accumulation in FP32.
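A minimal sketch of the vectorized-load bullet above (assumes 16-byte-aligned pointers and `n` divisible by 8; the kernel name is illustrative):
```cuda
#include <cuda_bf16.h>

// One uint4 (128-bit) load moves 8 BF16 values; check the SASS for
// LDG.E.128 / STG.E.128 to confirm the compiler kept the wide accesses.
__global__ void scale_bf16x8(const __nv_bfloat16* __restrict__ in,
                             __nv_bfloat16* __restrict__ out,
                             float alpha, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
    if (i + 7 < n) {
        uint4 v = *reinterpret_cast<const uint4*>(in + i);   // one LDG.E.128
        __nv_bfloat16* h = reinterpret_cast<__nv_bfloat16*>(&v);
        #pragma unroll
        for (int k = 0; k < 8; ++k)
            h[k] = __float2bfloat16(alpha * __bfloat162float(h[k]));
        *reinterpret_cast<uint4*>(out + i) = v;              // one STG.E.128
    }
}
```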
### Compute-bound
Goal: do more useful math per cycle.
- **Fast math intrinsics**: `__expf()`, `__rsqrtf()`, `__fdividef()` — lower precision but 2-4x faster than full-precision equivalents
- **Warp shuffle**: `__shfl_down_sync(0xffffffff, val, offset)` for intra-warp reductions — avoids shared memory round-trip (sketch after this list)
- **ILP**: interleave independent operations so the scheduler can hide latency. `#pragma unroll` for known-trip-count loops.
- **Tensor cores**: WMMA/MMA for matrix ops — 8-16x throughput over FP32 CUDA cores but requires specific data layouts and alignment
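A sketch combining the warp-shuffle and fast-math bullets (assumes all 32 lanes active; the usage kernel assumes n <= 32 and a single-warp launch):
```cuda
// Tree reduction across a warp via shuffles: no shared memory
// round-trip, no __syncthreads().
__device__ float warp_reduce_sum(float val) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 holds the warp total
}

// Illustrative use: softmax denominator with a fast-math exp.
// Assumes a <<<1, 32>>> launch and n <= 32.
__global__ void softmax_denom(const float* __restrict__ x, float* out, int n) {
    float v = (threadIdx.x < n) ? __expf(x[threadIdx.x]) : 0.0f;
    float total = warp_reduce_sum(v);
    if (threadIdx.x == 0) *out = total;
}
```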
### Latency-bound (launch overhead)
- **Kernel fusion** — the primary remedy. Prioritize candidates by:
1. `launches_per_step × layers` — fusing a 3-kernel sequence into one across a 32-layer model cuts 96 launches to 32
2. Memory traffic eliminated — fusing `residual_add + rmsnorm` saves one full tensor round-trip per layer
3. Implementation complexity — simple elementwise fusions (sigmoid×mul, silu×mul) are easy wins (see the sketch after this list); attention-level fusions require careful shared memory management
- **Eliminate D2H sync points** — move computation to GPU, batch device-to-host transfers to end of step (see section 5 "host-side bottlenecks" for how to find them)
- **Increase occupancy** — `__launch_bounds__(maxThreads, minBlocks)` when ncu shows low occupancy limiting throughput
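The silu×mul fusion named above, as a minimal sketch (`gate`/`up` are illustrative names for the two MLP projections):
```cuda
// One kernel replaces a silu launch plus a mul launch, eliminating one
// intermediate tensor write/read of global memory per layer.
__global__ void silu_mul_fused(const float* __restrict__ gate,
                               const float* __restrict__ up,
                               float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float g = gate[i];
        out[i] = (g / (1.0f + __expf(-g))) * up[i];  // SiLU(g) * up, fused
    }
}
```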
## 7. SASS Inspection (optional)
Use when profiling shows an unexplained gap between expected and actual performance, or to verify the compiler is generating what you intended (vectorized loads, no spills, etc.).
```bash
cuobjdump -sass /tmp/kernel.cubin # SASS disassembly
nvdisasm /tmp/kernel.cubin # SASS with control flow
```
Quick checks (you don't need to read every instruction):
- **Grep for `STL`/`LDL`** — local memory access means register spills. If present, reduce register pressure.
- **Grep for `LDG.E.128`/`STG.E.128`** — confirms vectorized 128-bit loads/stores are being generated. If you wrote `uint4` loads but see `LDG.E.32`, alignment or type issues prevented vectorization.
- **Count `BAR.SYNC`** — compare to your `__syncthreads()` count. More barriers than expected means the compiler inserted extra synchronization.
## 8. Verify Correctness
**Always verify before benchmarking.** A faster kernel that produces wrong results is not an optimization.
```bash
# Project tests
cargo test / pytest / make test
# Memory errors (out-of-bounds, leaks, misaligned access)
compute-sanitizer --tool memcheck ./app [args]
# Shared memory data races
compute-sanitizer --tool racecheck ./app [args]
# Uninitialized device memory reads
compute-sanitizer --tool initcheck ./app [args]
# Sync hazards (divergent __syncthreads)
compute-sanitizer --tool synccheck ./app [args]
```
`compute-sanitizer` has significant overhead (10-100x). Use small inputs for correctness checks, full inputs for benchmarks.
## 9. Compare Results
Run the **same benchmark** as baseline — same workload, same iteration count, same GPU state (no other processes competing for bandwidth).
```bash
./benchmark --iterations 5 --json-output /tmp/bench-after.json
```
Compare against the baseline numbers you recorded in step 1:
```text
Metric Baseline After Delta
───────────── ──────── ──────── ──────
Throughput 37.1 tok/s 39.8 tok/s +7.3%
Stddev ±0.4 ±0.5
Launches/token 622 478 -23%
GPU ms/token 26.2 24.1 -8.0%
```
Decision framework:
- **Delta < stddev of either run** → indistinguishable from noise. Run more iterations (10+) or use a longer workload before concluding.
- **Delta > stddev and > 3%** → real improvement, accept the change
- **Any regression in throughput** → revert, then analyze why (check if a different kernel got slower — fusion can shift pressure)
Also check for side effects:
- Peak GPU memory (fusion can increase shared memory or register usage)
- Correctness (re-run tests if not done in step 8)
- Performance at different workloads (a prefill optimization might hurt decode, or vice versa)
## Common Pitfalls
- **Grid dimension limits**: y and z max at 65535; only x supports 2^31-1. Use `blockIdx.x` for sequence/batch dimensions that can exceed 65535.
- **Shared memory bank conflicts**: 32 banks × 4 bytes. Power-of-2 column strides cause conflicts. Pad to stride D+1 (sketch at the end of this list).
- **Divergent `__syncthreads()`**: undefined behavior. All threads in a block must reach the same barrier. Use `compute-sanitizer --tool synccheck` to detect.
- **Vectorized load alignment**: `uint4` loads require 16-byte aligned pointers. Verify alignment before using wide loads.
- **Profiling overhead**: nsys adds ~5-10% overhead, ncu adds 10-50x. Profile numbers are relative (compare before/after), not absolute production performance.
- **`__launch_bounds__` vs `--maxrregcount`**: `--maxrregcount` is per-file (all kernels); `__launch_bounds__` is per-kernel. Prefer per-kernel control.
- **Occupancy chasing**: >60% occupancy is usually sufficient. Maximizing occupancy by forcing lower register counts can cause spills that hurt more than the occupancy helps.
- **Unified memory systems** (Grace Hopper, DGX Spark): no PCIe transfer overhead, but bandwidth can be 3-7x lower than discrete GDDR7. This means batch=1 decode is almost always memory-bound — prioritize weight compression (FP8/FP4) and fusion over compute optimizations.
- **Hidden D2H syncs in hot loops**: check section 5 "host-side bottlenecks" — a single unnecessary GPU→CPU sync per layer × 32 layers = 32 pipeline stalls per step. Often the biggest win before touching any kernel code.
- **Framework allocator overhead**: eager-mode ML frameworks allocate/free GPU buffers for every intermediate tensor. At 600+ allocations per inference step, `cudaMalloc` overhead itself becomes measurable. Check `cuda_api_sum` for `cudaMalloc` call counts.
- **Fusing the wrong thing**: profile first. A kernel taking 0.1% of GPU time isn't worth fusing no matter how many launches it has. Focus on the top-3 time consumers and the top launch-count contributors.
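A sketch of the D+1 padding fix from the bank-conflict pitfall above: a 32×32 tiled transpose where the extra column shifts each row by one bank, so column reads hit 32 different banks (`TILE` and the square `dim` are illustrative).
```cuda
#define TILE 32

__global__ void transpose_padded(const float* __restrict__ in,
                                 float* __restrict__ out, int dim) {
    __shared__ float tile[TILE][TILE + 1];  // +1 column: row stride 33, not 32
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < dim && y < dim)
        tile[threadIdx.y][threadIdx.x] = in[y * dim + x];
    __syncthreads();
    x = blockIdx.y * TILE + threadIdx.x;    // swap block origin for output
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < dim && y < dim)
        out[y * dim + x] = tile[threadIdx.x][threadIdx.y];  // conflict-free column read
}
```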