
m10-performance skill


This skill helps you optimize Rust performance by guiding measurement, profiling, and design choices that reduce allocations, improve cache locality, and parallelize work.

npx playbooks add skill zhanghandong/rust-skills --skill m10-performance


Files (2): SKILL.md (4.1 KB)
---
name: m10-performance
description: "CRITICAL: Use for performance optimization. Triggers: performance, optimization, benchmark, profiling, flamegraph, criterion, slow, fast, allocation, cache, SIMD, make it faster, 性能优化, 基准测试"
user-invocable: false
---

# Performance Optimization

> **Layer 2: Design Choices**

## Core Question

**What's the bottleneck, and is optimization worth it?**

Before optimizing:
- Have you measured? (Don't guess)
- What's the acceptable performance?
- Will optimization add complexity?

---

## Performance Decision → Implementation

| Goal | Design Choice | Implementation |
|------|---------------|----------------|
| Reduce allocations | Pre-allocate, reuse | `with_capacity`, object pools |
| Improve cache | Contiguous data | `Vec`, `SmallVec` |
| Parallelize | Data parallelism | `rayon`, threads |
| Avoid copies | Zero-copy | References, `Cow<T>` |
| Reduce indirection | Inline data | `smallvec`, arrays |
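As a minimal sketch of the first row (pre-allocation), the function name below is illustrative; the point is that `Vec::with_capacity` does one allocation up front instead of reallocating as the vector grows:

```rust
// Sketch: pre-allocation avoids repeated reallocation as a Vec grows.
fn squares(n: usize) -> Vec<u64> {
    let mut out = Vec::with_capacity(n); // one allocation up front
    for i in 0..n as u64 {
        out.push(i * i); // stays within the reserved capacity
    }
    out
}
```

The same idea applies to `String`, `HashMap::with_capacity`, and reusable object pools.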

---

## Thinking Prompt

Before optimizing:

1. **Have you measured?**
   - Profile first → flamegraph, perf
   - Benchmark → criterion, cargo bench
   - Identify actual hotspots

2. **What's the priority?**
   - Algorithm (10x-1000x improvement)
   - Data structure (2x-10x)
   - Allocation (2x-5x)
   - Cache (1.5x-3x)

3. **What's the trade-off?**
   - Complexity vs speed
   - Memory vs CPU
   - Latency vs throughput
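Before setting up criterion, a crude wall-clock check with `std::time::Instant` can confirm a suspected hotspot (the function names here are illustrative, and any such measurement must run with `--release`):

```rust
use std::time::{Duration, Instant};

// Example workload: a CPU-bound computation we suspect is hot.
fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

// Sketch: time one call and return both the result and the elapsed time.
fn time_sum(n: u64) -> (u64, Duration) {
    let start = Instant::now();
    let total = sum_of_squares(n);
    (total, start.elapsed())
}
```

This is only a sanity check; criterion adds warm-up, outlier rejection, and statistical analysis on top.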

---

## Trace Up ↑

To domain constraints (Layer 3):

```
"How fast does this need to be?"
    ↑ Ask: What's the performance SLA?
    ↑ Check: domain-* (latency requirements)
    ↑ Check: Business requirements (acceptable response time)
```

| Question | Trace To | Ask |
|----------|----------|-----|
| Latency requirements | domain-* | What's acceptable response time? |
| Throughput needs | domain-* | How many requests per second? |
| Memory constraints | domain-* | What's the memory budget? |

---

## Trace Down ↓

To implementation (Layer 1):

```
"Need to reduce allocations"
    ↓ m01-ownership: Use references, avoid clone
    ↓ m02-resource: Pre-allocate with_capacity

"Need to parallelize"
    ↓ m07-concurrency: Choose rayon or threads
    ↓ m07-concurrency: Consider async for I/O-bound

"Need cache efficiency"
    ↓ Data layout: Prefer Vec over HashMap when possible
    ↓ Access patterns: Sequential over random access
```
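The "reduce allocations" path above can be sketched as buffer reuse: allocate once, then `clear()` between iterations so the capacity is kept while the contents are dropped (function and sizes here are illustrative):

```rust
// Sketch: reuse one buffer across iterations instead of allocating per item.
fn total_trimmed_len(lines: &[&str]) -> usize {
    let mut buf = String::with_capacity(64); // allocated once, reused below
    let mut total = 0;
    for line in lines {
        buf.clear();              // keeps the capacity, drops the contents
        buf.push_str(line.trim());
        buf.make_ascii_uppercase();
        total += buf.len();
    }
    total
}
```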

---

## Quick Reference

| Tool | Purpose |
|------|---------|
| `cargo bench` | Micro-benchmarks |
| `criterion` | Statistical benchmarks |
| `perf` / `flamegraph` | CPU profiling |
| `heaptrack` | Allocation tracking |
| `valgrind` / `cachegrind` | Cache analysis |

## Optimization Priority

```
1. Algorithm choice     (10x - 1000x)
2. Data structure       (2x - 10x)
3. Allocation reduction (2x - 5x)
4. Cache optimization   (1.5x - 3x)
5. SIMD/Parallelism     (2x - 8x)
```

## Common Techniques

| Technique | When | How |
|-----------|------|-----|
| Pre-allocation | Known size | `Vec::with_capacity(n)` |
| Avoid cloning | Hot paths | Use references or `Cow<T>` |
| Batch operations | Many small ops | Collect then process |
| SmallVec | Usually small | `smallvec::SmallVec<[T; N]>` |
| Inline buffers | Fixed-size data | Arrays over Vec |
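The "avoid cloning" row can be sketched with `Cow<str>`: borrow on the common path and allocate only when the input actually needs modification (the `sanitize` name and tab-replacement rule are illustrative):

```rust
use std::borrow::Cow;

// Sketch: Cow<str> is zero-copy when no change is needed and allocates
// only when the input has to be modified.
fn sanitize(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        Cow::Owned(input.replace('\t', " ")) // allocate on the rare path
    } else {
        Cow::Borrowed(input)                 // zero-copy on the common path
    }
}
```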

---

## Common Mistakes

| Mistake | Why Wrong | Better |
|---------|-----------|--------|
| Optimize without profiling | Wrong target | Profile first |
| Benchmark in debug mode | Meaningless | Always `--release` |
| Use LinkedList | Cache unfriendly | `Vec` or `VecDeque` |
| Hidden `.clone()` | Unnecessary allocs | Use references |
| Premature optimization | Wasted effort | Make it work first |

---

## Anti-Patterns

| Anti-Pattern | Why Bad | Better |
|--------------|---------|--------|
| Clone to avoid lifetimes | Performance cost | Proper ownership |
| Box everything | Indirection cost | Stack when possible |
| HashMap for small sets | Overhead | Vec with linear search |
| String concat in loop | O(n^2) re-allocation | Pre-size with `String::with_capacity`, then `push_str` |
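The string-building row above can be sketched as follows (the `join_ids` name and the capacity estimate are illustrative): appending with `push_str` into a pre-sized buffer is linear, while re-formatting the whole string each iteration is quadratic.

```rust
// Sketch: O(n) string building via push_str into a pre-sized buffer.
fn join_ids(ids: &[u32]) -> String {
    let mut s = String::with_capacity(ids.len() * 8); // rough size estimate
    for id in ids {
        if !s.is_empty() {
            s.push(',');
        }
        s.push_str(&id.to_string());
    }
    s
}
```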

---

## Related Skills

| When | See |
|------|-----|
| Reducing clones | m01-ownership |
| Concurrency options | m07-concurrency |
| Smart pointer choice | m02-resource |
| Domain requirements | domain-* |

Overview

This skill provides targeted guidance for performance optimization in Rust projects, focusing on measurement-first approaches and practical implementation choices. It helps you decide if optimization is worth the cost, where to focus (algorithm, data structure, allocation, cache, parallelism), and which tools to use for profiling and benchmarking.

How this skill works

The skill addresses performance concerns by guiding you to measure first with profiling and benchmarks, then trace requirements up to domain constraints and down to concrete implementation changes. It maps goals (reduce allocations, improve cache locality, parallelize, avoid copies) to specific design choices and Rust techniques like with_capacity, SmallVec, Rayon, and zero-copy patterns.

When to use it

  • When a measured hotspot affects user-facing latency or throughput
  • During performance regressions discovered by CI or benchmarks
  • Before introducing complex optimizations that increase maintenance cost
  • When planning capacity for throughput or memory-constrained deployments
  • When choosing data layout or concurrency model for a hot path

Best practices

  • Profile first: use perf, flamegraph, or heaptrack to find real hotspots
  • Benchmark in release mode with criterion or cargo bench for statistical confidence
  • Prioritize algorithm and data structure changes before micro-optimizations
  • Reduce allocations via pre-allocation, object pools, and avoiding hidden clones
  • Prefer contiguous layouts (Vec/SmallVec) and batch operations for cache efficiency

Example use cases

  • Identify a CPU hotspot with flamegraph, then replace an O(n^2) algorithm with an O(n log n) approach
  • Reduce allocation pressure by switching from boxed nodes to SmallVec or inline arrays
  • Improve throughput by parallelizing a CPU-heavy loop using rayon
  • Fix latency spikes by measuring tail latencies and trimming allocation spikes
  • Choose Vec over HashMap for small collections to improve cache locality

FAQ

Should I optimize before measuring?

No. Always measure first. Optimizing without data risks wasted effort and regressions.

Which gives the biggest wins?

Algorithmic improvements often yield the largest gains (10x–1000x). Data structure and allocation changes are next.

When to use SIMD or parallelism?

Use SIMD/parallelism after ensuring algorithm and data layout are efficient; they help when CPU-bound hotspots remain.