
# rust-performance skill


This skill helps you optimize Rust performance by guiding profiling, memory layout, and NUMA-aware strategies to reduce hot-path costs.

```bash
npx playbooks add skill huiali/rust-skills --skill rust-performance
```

---
name: rust-performance
description: Performance optimization expert covering profiling, benchmarking, memory allocation, SIMD, cache optimization, false sharing, lock contention, and NUMA-aware programming.
metadata:
  triggers:
    - performance
    - optimization
    - benchmark
    - profiling
    - allocation
    - SIMD
    - cache
    - slow
    - latency
    - throughput
---


## Optimization Priority

```
1. Algorithm choice      (10x - 1000x)   ← Biggest impact
2. Data structure        (2x - 10x)
3. Reduce allocations    (2x - 5x)
4. Cache optimization    (1.5x - 3x)
5. SIMD/parallelism      (2x - 8x)
```

**Warning**: Premature optimization is the root of all evil. Make it work first, then optimize hot paths.
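
For a sense of scale, here is a sketch of the kind of win the first priority item refers to: replacing an O(n²) pairwise scan with an O(n) hash-set pass (function names are illustrative):

```rust
use std::collections::HashSet;

// ❌ O(n²): nested scan to find duplicates
fn has_duplicate_quadratic(items: &[u64]) -> bool {
    items
        .iter()
        .enumerate()
        .any(|(i, a)| items[i + 1..].contains(a))
}

// ✅ O(n): single pass with hash-set membership
fn has_duplicate_linear(items: &[u64]) -> bool {
    let mut seen = HashSet::new();
    // `insert` returns false if the value was already present
    items.iter().any(|x| !seen.insert(x))
}
```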


## Solution Patterns

### Pattern 1: Pre-allocation

```rust
// ❌ Bad: grows dynamically
let mut vec = Vec::new();
for i in 0..1000 {
    vec.push(i);
}

// ✅ Good: pre-allocate known size
let mut vec = Vec::with_capacity(1000);
for i in 0..1000 {
    vec.push(i);
}
```

### Pattern 2: Avoid Unnecessary Clones

```rust
// ❌ Bad: unnecessary clone
fn process(item: &Item) {
    let data = item.data.clone();
    // use data...
}

// ✅ Good: use reference
fn process(item: &Item) {
    let data = &item.data;
    // use data...
}
```

### Pattern 3: Batch Operations

```rust
// ❌ Bad: multiple database calls
for user_id in user_ids {
    db.update(user_id, status)?;
}

// ✅ Good: batch update
db.update_all(user_ids, status)?;
```

### Pattern 4: Small Object Optimization

```rust
use smallvec::SmallVec;

// ✅ No heap allocation for ≤16 items
let mut vec: SmallVec<[u8; 16]> = SmallVec::new();
```
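
`SmallVec` keeps up to the inline capacity (16 bytes here) on the stack and transparently spills to the heap once it grows past that, so the common case allocates nothing.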

### Pattern 5: Parallel Processing

```rust
use rayon::prelude::*;

let sum: i32 = data
    .par_iter()
    .map(|x| expensive_computation(x))
    .sum();
```


## Profiling Tools

| Tool | Purpose |
|------|---------|
| `cargo bench` | Criterion benchmarks |
| `perf` / `flamegraph` | CPU flame graphs |
| `heaptrack` | Allocation tracking |
| `valgrind --tool=cachegrind` | Cache analysis |
| `dhat` | Heap allocation profiling |
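
For the `cargo bench` row, a minimal Criterion benchmark sketch (assumes `criterion` is a dev-dependency and a `[[bench]]` target with `harness = false`; the benchmarked function is a stand-in):

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Benchmark a simple hot path; replace with your own function under test
fn bench_sum(c: &mut Criterion) {
    let data: Vec<u64> = (0..10_000).collect();
    c.bench_function("sum_10k", |b| {
        // black_box keeps the compiler from optimizing the work away
        b.iter(|| black_box(&data).iter().sum::<u64>())
    });
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);
```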


## Common Optimizations

### Anti-Patterns to Fix

| Anti-Pattern | Why Bad | Correct Approach |
|--------------|---------|------------------|
| Clone to avoid lifetimes | Performance cost | Proper ownership design |
| Box everything | Indirection overhead | Prefer stack allocation |
| HashMap for small data | Hash overhead too high | Vec + linear search |
| String concatenation in loop | O(n²) | `with_capacity` or `format!` |
| LinkedList | Cache-unfriendly | `Vec` or `VecDeque` |
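
As one concrete illustration, the string-concatenation row in code (a minimal sketch):

```rust
let words = ["alpha", "beta", "gamma"];

// ❌ Bad: repeated reallocation as the string grows (O(n²) copying)
let mut naive = String::new();
for word in words {
    naive += word; // each append may reallocate and copy
}

// ✅ Good: reserve the final size once, then append in place
let total: usize = words.iter().map(|w| w.len()).sum();
let mut out = String::with_capacity(total);
for word in words {
    out.push_str(word);
}
```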


## Advanced: False Sharing

### Symptom

```rust
use std::sync::atomic::AtomicU64;

// ❌ Problem: both counters share one cache line, so writes from
// different threads keep invalidating each other's cached copy
struct ShardCounters {
    inflight: AtomicU64,
    completed: AtomicU64,
}
```

- One CPU core at 90%+
- High LLC miss rate in perf
- Many atomic RMW operations
- Adding threads makes it slower

### Diagnosis

```bash
# Perf analysis
perf stat -d your_program
# Look for high LLC-load-misses and heavy atomic (lock-prefixed) instruction activity

# Flamegraph
cargo flamegraph
# Find atomic fetch_add hotspots
```

### Solution: Cache Line Padding

```rust
use std::sync::atomic::AtomicU64;

// ✅ Each counter gets its own 64-byte cache line
// (crossbeam_utils::CachePadded<T> offers the same padding off the shelf)
#[repr(align(64))]
struct PaddedAtomicU64(AtomicU64);

struct ShardCounters {
    inflight: PaddedAtomicU64,
    completed: PaddedAtomicU64,
}
```


## Lock Contention Optimization

### Symptom

```rust
// ❌ All threads compete for single lock
let shared: Arc<Mutex<HashMap<String, usize>>> =
    Arc::new(Mutex::new(HashMap::new()));
```

- Most time spent in mutex lock/unlock
- Performance degrades with more threads
- High system time percentage

### Solution: Thread-Local Sharding

```rust
use std::collections::HashMap;
use std::thread;

// ✅ Each thread builds a local HashMap; merge at the end
pub fn parallel_count(data: &[String], num_threads: usize)
    -> HashMap<String, usize>
{
    // Guard against empty input and oversized thread counts
    let chunk_size = (data.len() / num_threads.max(1)).max(1);

    // Scoped threads may borrow `data`, so no 'static bound or Arc is needed
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || {
                    let mut local = HashMap::new();
                    for key in chunk {
                        *local.entry(key.clone()).or_insert(0) += 1;
                    }
                    local // return local counts
                })
            })
            .collect();

        // Merge all local results
        let mut result = HashMap::new();
        for handle in handles {
            for (k, v) in handle.join().unwrap() {
                *result.entry(k).or_insert(0) += v;
            }
        }
        result
    })
}
```


## NUMA Awareness

### Problem

```rust
// Multi-socket server: memory is allocated on one NUMA node, but
// rayon work-stealing may run the consuming task on any core, so
// remote-node accesses pay a steep latency penalty
let pool = ArenaPool::new(num_threads); // `ArenaPool` is an illustrative placeholder
```

### Solution

```rust
// 1. Bind memory to the local NUMA node
//    (`detect_numa_node` and `NumaAwarePool` are illustrative placeholders)
let numa_node = detect_numa_node();
let pool = NumaAwarePool::new(numa_node);

// 2. Use a unified allocator (jemalloc)
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

// 3. Avoid cross-NUMA object clones:
//    borrow data directly instead of copying it between nodes
```

### Tools

```bash
# Check NUMA topology
numactl --hardware

# Bind to NUMA node
numactl --cpunodebind=0 --membind=0 ./my_program
```


## Data Structure Selection

| Scenario | Choice | Reason |
|----------|--------|--------|
| High-concurrency writes | DashMap or sharding | Reduces lock contention |
| Read-heavy, few writes | RwLock<HashMap> | Read locks don't block |
| Small dataset | Vec + linear search | HashMap overhead higher |
| Fixed keys | Enum + array | Zero hash overhead |
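
A minimal sketch of the "Fixed keys" row, assuming a small closed set of counters (all names illustrative):

```rust
// Fixed key set: index an array by enum discriminant, so lookups
// compile down to a plain array access with zero hashing
#[derive(Clone, Copy)]
enum Stat {
    Hits,
    Misses,
    Evictions,
}

const NUM_STATS: usize = 3;

struct Stats {
    counts: [u64; NUM_STATS],
}

impl Stats {
    fn bump(&mut self, stat: Stat) {
        self.counts[stat as usize] += 1;
    }
}
```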

### Read-Heavy Example

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// ✅ Many reads, few updates (assumes `ConfigValue: Clone`)
struct Config {
    map: RwLock<HashMap<String, ConfigValue>>,
}

impl Config {
    pub fn get(&self, key: &str) -> Option<ConfigValue> {
        // Clone the value so the read lock is released immediately
        self.map.read().unwrap().get(key).cloned()
    }

    pub fn update(&self, key: String, value: ConfigValue) {
        self.map.write().unwrap().insert(key, value);
    }
}
```


## Common Performance Traps

| Trap | Symptom | Solution |
|------|---------|----------|
| Adjacent atomic variables | False sharing | `#[repr(align(64))]` |
| Global Mutex | Lock contention | Thread-local + merge |
| Cross-NUMA allocation | Memory migration | NUMA-aware allocation |
| Frequent small allocations | Allocator pressure | Object pooling |
| Dynamic string keys | Extra allocations | Use integer IDs |
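
For the "frequent small allocations" row, a minimal single-threaded buffer-pool sketch (illustrative only; a production pool would add synchronization and size limits):

```rust
// Reuse Vec<u8> buffers instead of allocating a fresh one per operation
pub struct BufferPool {
    free: Vec<Vec<u8>>,
    buf_size: usize,
}

impl BufferPool {
    pub fn new(buf_size: usize) -> Self {
        Self { free: Vec::new(), buf_size }
    }

    /// Hand out a recycled buffer, or allocate one if the pool is empty
    pub fn acquire(&mut self) -> Vec<u8> {
        self.free
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(self.buf_size))
    }

    /// Return a buffer to the pool; capacity is kept, contents dropped
    pub fn release(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        self.free.push(buf);
    }
}
```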


## Review Checklist

When optimizing performance:

- [ ] Profiled to identify bottleneck
- [ ] Bottleneck confirmed with measurements
- [ ] Algorithm is optimal for use case
- [ ] Data structure appropriate
- [ ] Unnecessary allocations removed
- [ ] Parallelism exploited where beneficial
- [ ] Cache-friendly data layout
- [ ] Lock contention minimized
- [ ] Benchmarks show improvement
- [ ] Code still readable and maintainable


## Verification Commands

```bash
# Benchmark
cargo bench

# Profile with perf
perf stat -d ./target/release/your_program

# Generate flamegraph (builds the release profile by default)
cargo flamegraph

# Heap profiling
valgrind --tool=dhat ./target/release/your_program

# Cache analysis
valgrind --tool=cachegrind ./target/release/your_program

# NUMA topology
numactl --hardware
```


## Common Pitfalls

### 1. Premature Optimization

**Symptom**: Optimizing before profiling

**Fix**: Profile first, optimize hot paths only

### 2. Micro-optimizing Cold Paths

**Symptom**: Spending time on code that rarely runs

**Fix**: Focus on hot loops (90% of time in 10% of code)

### 3. Trading Readability for Minimal Gains

**Symptom**: Complex code for <5% improvement

**Fix**: Only optimize if gain is significant (>20%)


## Performance Diagnostic Workflow

```
1. Identify symptom (slow, high CPU, high memory)
   ↓
2. Profile with appropriate tool
   - CPU → perf/flamegraph
   - Memory → heaptrack/dhat
   - Cache → cachegrind
   ↓
3. Find hotspot (function/line)
   ↓
4. Understand why it's slow
   - Algorithm? Data structure? Allocation?
   ↓
5. Apply targeted optimization
   ↓
6. Benchmark to confirm improvement
   ↓
7. Repeat if not fast enough
```


## Related Skills

- **rust-concurrency** - Parallel processing patterns
- **rust-async** - Async performance optimization
- **rust-unsafe** - Zero-cost abstractions with unsafe
- **rust-coding** - Writing performant idiomatic code
- **rust-anti-pattern** - Performance anti-patterns to avoid


## Localized Reference

- **Chinese version**: [SKILL_ZH.md](./SKILL_ZH.md) - Complete Chinese version containing all content

## Overview

This skill is a Rust performance optimization expert that guides profiling, benchmarking, memory allocation, SIMD, cache layout, false sharing, lock contention, and NUMA-aware programming. It focuses on measurable improvements: identify hotspots, apply targeted changes, and verify gains with benchmarks and profiles. The approach emphasizes algorithm and data-structure selection first, then lower-level optimizations only where they matter.

## How this skill works

I inspect runtime symptoms (high CPU, memory growth, latency), recommend the right profiling tool (perf, flamegraph, heaptrack, valgrind/dhat), and map hotspots to concrete corrective patterns. I then propose prioritized fixes (algorithm or data-structure changes, pre-allocation, reducing clones, cache-line layout, lock sharding, SIMD/parallelism, NUMA binding) and produce verification commands and benchmarks to confirm the improvements.

## When to use it

- The application exhibits high CPU, latency, or memory use without a clear cause
- Scaling across more cores causes slowdowns or high system time
- Profiles show high LLC misses, atomic contention, or mutex hotspots
- Deployments on multi-socket servers suffer NUMA-induced latency
- You need measurable throughput or tail-latency improvements

## Best practices

- Profile before changing code; optimize hot paths only
- Prioritize algorithm and data-structure changes before micro-optimizations
- Pre-allocate buffers and batch I/O to reduce allocator pressure
- Avoid unnecessary clones and heap indirection; prefer the stack or SmallVec for small items
- Shard data structures or use thread-local state to eliminate global lock contention
- Measure after each change with cargo bench and perf/flamegraph

## Example use cases

- Speed up a hot loop by replacing O(n²) behavior with an O(n log n) algorithm, verifying with flamegraph and benches
- Fix a threading slowdown by converting a global Mutex<HashMap> into per-thread local maps merged at the end
- Resolve false sharing by aligning atomic counters to cache-line boundaries with repr(align(64))
- Reduce allocator pressure by switching frequent small allocations to object pools or SmallVec
- Improve multi-socket throughput by binding threads to NUMA nodes and using a NUMA-aware allocator

## FAQ

### What should I optimize first?

Start with algorithm and data-structure selection; those yield the largest gains. Only optimize micro-level details after profiling confirms they’re on the hot path.

### Which profiler should I use for CPU, memory, and cache analysis?

Use perf + flamegraph for CPU hotspots, heaptrack or dhat for allocations, and valgrind --tool=cachegrind for cache analysis.