
This skill analyzes application performance, identifies bottlenecks, and guides concrete optimizations to reduce latency and boost throughput.

npx playbooks add skill louloulin/claude-agent-sdk --skill performance-optimizer

---
name: performance-optimizer
description: "Application and infrastructure performance analysis and optimization expert"
version: "2.0.0"
author: "Performance Team <[email protected]>"
tags:
  - performance
  - optimization
  - profiling
  - monitoring
dependencies:
  - code-reviewer
  - docker-helper
---

# Performance Optimizer Skill

You are a performance optimization expert. Analyze and improve application performance.

## Performance Methodology

### The Optimization Process
```
1. Measure: Establish baseline metrics
2. Analyze: Identify bottlenecks
3. Optimize: Implement improvements
4. Verify: Measure impact
5. Iterate: Continue improvement
```
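The Measure and Verify steps can be sketched with a small timing helper. This is a minimal illustration using only the standard library; `median_runtime` and `speedup` are hypothetical names, and a real benchmark harness (such as the Criterion example later in this document) handles warm-up and statistical noise far better.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `runs` times and returns the median wall-clock duration
/// (the upper median when `runs` is even). Returns `None` when `runs` is zero.
fn median_runtime<F: FnMut()>(runs: usize, mut work: F) -> Option<Duration> {
    if runs == 0 {
        return None;
    }
    let mut samples: Vec<Duration> = (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect();
    samples.sort();
    Some(samples[samples.len() / 2])
}

/// Ratio of baseline time to candidate time; values above 1.0 mean the
/// candidate is faster than the baseline.
fn speedup(baseline: Duration, candidate: Duration) -> f64 {
    baseline.as_secs_f64() / candidate.as_secs_f64()
}
```

Measuring the same workload before and after a change, then comparing the two medians with `speedup`, gives a repeatable number for the Verify step.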

### Performance Metrics
```
Key metrics to track:
- Response time (p50, p95, p99)
- Throughput (requests per second)
- Error rate
- CPU usage
- Memory usage
- I/O operations
- Network bandwidth
- Database query time
- Cache hit rate
```
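As a rough illustration of how the response-time percentiles could be derived from raw latency samples, here is a nearest-rank sketch. The function name is hypothetical, and monitoring systems usually estimate percentiles from histograms rather than from a full sample list.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `p` percent
/// of all samples are less than or equal to it. Returns `None` for empty input
/// or when `p` is outside (0, 100].
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(p > 0.0 && p <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Multiply before dividing so whole-number percentiles stay exact.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    Some(sorted[rank - 1])
}
```

Calling `percentile(&latencies_ms, 95.0)` on one test run's samples yields the p95 value to compare against a target.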

## Profiling Tools

### Application Profiling

#### Rust Profiling
```bash
# Flamegraph generation
cargo install flamegraph
cargo flamegraph

# Heap profiling
valgrind --tool=massif ./target/release/myapp

# CPU profiling
perf record -g ./target/release/myapp
perf report
```

#### Python Profiling
```bash
# cProfile
python -m cProfile -o profile.stats myapp.py

# Inspect the results (opens the interactive pstats browser)
python -m pstats profile.stats

# Memory profiling
python -m memory_profiler myapp.py

# Line profiler
kernprof -l -v myapp.py
```

#### Node.js Profiling
```bash
# CPU profiling
node --prof app.js
node --prof-process isolate-0xnnnnnnnnnnnn-v8.log > processed.txt

# Memory profiling
node --heap-prof app.js

# Flamegraphs (0x profiles the run and writes an HTML flamegraph)
0x app.js
```

### Database Profiling
```sql
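-- Slowest statements by mean time (requires the pg_stat_statements extension;
-- the column is mean_exec_time on PostgreSQL 13+, mean_time on older versions)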
SELECT * FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Query execution plan
EXPLAIN ANALYZE SELECT * FROM users WHERE email = '[email protected]';

-- Index usage (rarely scanned indexes first)
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
```

## Optimization Strategies

### 1. Code Level

#### Algorithm Optimization
```rust
// ❌ O(n²) - Nested loops
fn find_duplicates(vec: &[i32]) -> Vec<i32> {
    let mut duplicates = Vec::new();
    for i in 0..vec.len() {
        for j in (i + 1)..vec.len() {
            if vec[i] == vec[j] {
                duplicates.push(vec[i]);
            }
        }
    }
    duplicates
}

// ✅ O(n) - HashSet
fn find_duplicates(vec: &[i32]) -> Vec<i32> {
    use std::collections::HashSet;
    let mut seen = HashSet::new();
    let mut duplicates = Vec::new();

    for &item in vec {
        if !seen.insert(item) {
            duplicates.push(item);
        }
    }
    duplicates
}
```
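The two versions above are not strictly equivalent when a value appears more than twice: for `[1, 1, 1]` the nested loops push three entries (one per matching pair) and the `HashSet` version pushes two. If each duplicated value should be reported once, a second set can track what has already been reported. The following is a sketch with a hypothetical function name:

```rust
use std::collections::HashSet;

/// Returns each value that occurs more than once, exactly once, in order of
/// its second occurrence. Expected O(n) time.
fn distinct_duplicates(values: &[i32]) -> Vec<i32> {
    let mut seen = HashSet::new();
    let mut reported = HashSet::new();
    let mut duplicates = Vec::new();
    for &value in values {
        // `insert` returns false when the value was already present.
        if !seen.insert(value) && reported.insert(value) {
            duplicates.push(value);
        }
    }
    duplicates
}
```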

#### Memory Optimization
```rust
// ❌ Unnecessary allocation
fn process_string(s: &str) -> String {
    let s2 = s.to_string(); // Unnecessary copy
    s2.to_uppercase()
}

// ✅ Avoid allocation
fn process_string(s: &str) -> String {
    s.to_uppercase() // Direct conversion
}

// ❌ Vec resizing in loop
let mut vec = Vec::new();
for i in 0..1000 {
    vec.push(i); // Multiple reallocations
}

// ✅ Pre-allocate
let mut vec = Vec::with_capacity(1000);
for i in 0..1000 {
    vec.push(i); // No reallocations
}
```

#### Caching Strategies
```rust
use std::collections::HashMap;
use lru::LruCache;

// Memoization
fn fib(n: u64, cache: &mut HashMap<u64, u64>) -> u64 {
    if n <= 1 {
        return n;
    }

    if let Some(&result) = cache.get(&n) {
        return result;
    }

    let result = fib(n - 1, cache) + fib(n - 2, cache);
    cache.insert(n, result);
    result
}

// LRU cache (lru >= 0.8 takes a NonZeroUsize capacity)
use std::num::NonZeroUsize;
use std::sync::Mutex;
use once_cell::sync::Lazy;

static CACHE: Lazy<Mutex<LruCache<String, String>>> =
    Lazy::new(|| Mutex::new(LruCache::new(NonZeroUsize::new(1000).unwrap())));

fn get_with_cache(key: &str) -> Option<String> {
    let mut cache = CACHE.lock().unwrap();
    cache.get(key).cloned() // String: Borrow<str>, so no temporary String is needed
}
```

### 2. Database Optimization

#### Query Optimization
```sql
-- ❌ N+1 query problem
SELECT * FROM users;
-- For each user:
SELECT * FROM orders WHERE user_id = ?;

-- ✅ JOIN instead
SELECT u.*, o.*
FROM users u
LEFT JOIN orders o ON o.user_id = u.id;

-- ✅ Or use bulk fetch
SELECT * FROM orders WHERE user_id IN (?, ?, ?);
```
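For the bulk fetch, the number of placeholders has to match the number of IDs. A minimal sketch of building that query text, assuming PostgreSQL-style numbered placeholders (`$1`, `$2`, ...) instead of the `?` form shown above; both function names are hypothetical, and the IDs themselves would still be bound as parameters by the database driver:

```rust
/// Builds "$1, $2, ..., $n" for a PostgreSQL-style IN list.
fn in_placeholders(n: usize) -> String {
    (1..=n)
        .map(|i| format!("${}", i))
        .collect::<Vec<_>>()
        .join(", ")
}

/// Returns the bulk-fetch query text, or `None` when there are no IDs
/// (an empty `IN ()` list is not valid SQL, and no query is needed).
fn orders_query(user_ids: &[i64]) -> Option<String> {
    if user_ids.is_empty() {
        return None;
    }
    Some(format!(
        "SELECT * FROM orders WHERE user_id IN ({})",
        in_placeholders(user_ids.len())
    ))
}
```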

#### Indexing Strategy
```sql
-- Create indexes on frequently queried columns
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_orders_user_id ON orders(user_id);
CREATE INDEX idx_orders_created_at ON orders(created_at DESC);

-- Composite index for multi-column queries
CREATE INDEX idx_orders_user_status_date
ON orders(user_id, status, created_at);

-- Partial index for specific conditions
CREATE INDEX idx_active_users
ON users(email)
WHERE active = true;
```

#### Connection Pooling
```rust
// Use connection pooling
use sqlx::postgres::PgPoolOptions;
use std::time::Duration;

let pool = PgPoolOptions::new()
    .max_connections(20) // Starting point; tune against measured load
    .min_connections(5)
    .acquire_timeout(Duration::from_secs(30)) // Named connect_timeout before sqlx 0.6
    .idle_timeout(Duration::from_secs(600))
    .max_lifetime(Duration::from_secs(1800))
    .connect("postgres://localhost/db").await?;
```

### 3. Caching Architecture

#### Multi-Level Caching
```
Level 1: Application Cache (L1)
- Fastest access
- Limited size
- In-process memory (e.g., an LRU map), optionally backed by a shared cache such as Redis or Memcached

Level 2: Database Cache (Query Cache)
- Fast but slower than L1
- Larger capacity
- Database-level caching

Level 3: CDN/Edge Cache
- Geographically distributed
- For static content
- Lowers latency for clients far from the origin

Level 4: Browser Cache
- Client-side caching
- HTTP caching headers
- Long-lived assets
```

#### Cache Patterns
```rust
// Cache-Aside Pattern
async fn get_user(id: u64) -> Result<User> {
    // Try cache first
    if let Some(user) = cache.get(&id).await? {
        return Ok(user);
    }

    // Cache miss - fetch from database
    let user = db.fetch_user(id).await?;

    // Store in cache
    cache.set(id, &user, TTL::Hour).await?;

    Ok(user)
}

// Write-Through Pattern
async fn update_user(user: User) -> Result<()> {
    // Update database
    db.update_user(&user).await?;

    // Update cache synchronously
    cache.set(user.id, &user, TTL::Hour).await?;

    Ok(())
}
```
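The snippets above leave `cache`, `db`, and `TTL` undefined. A self-contained sketch of the cache-aside read path, using only the standard library, might look like the following. `TtlCache` and its methods are hypothetical names, the `load` closure stands in for the database call, and the write path here invalidates the entry rather than writing through.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// In-memory cache whose entries expire after a fixed time-to-live.
struct TtlCache {
    ttl: Duration,
    entries: HashMap<u64, (String, Instant)>,
}

impl TtlCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Cache-aside read: return a fresh cached value, otherwise call `load`
    /// (standing in for the database) and store its result.
    fn get_or_load<F>(&mut self, id: u64, load: F) -> String
    where
        F: FnOnce(u64) -> String,
    {
        if let Some((value, stored_at)) = self.entries.get(&id) {
            if stored_at.elapsed() < self.ttl {
                return value.clone(); // cache hit
            }
        }
        let value = load(id); // miss or expired entry
        self.entries.insert(id, (value.clone(), Instant::now()));
        value
    }

    /// Write path: call after the database update succeeds so that the next
    /// read reloads the value.
    fn invalidate(&mut self, id: u64) {
        self.entries.remove(&id);
    }
}
```

Invalidating on write keeps the example short; writing the new value into the cache, as in the write-through snippet above, avoids the extra load on the next read at the cost of caching values that may never be read.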

### 4. Concurrency & Parallelism

#### Async/Await
```rust
// ❌ Sequential operations
async fn fetch_data() -> Vec<Data> {
    let data1 = fetch_api1().await;
    let data2 = fetch_api2().await;
    let data3 = fetch_api3().await;
    vec![data1, data2, data3]
}

// ✅ Concurrent operations
async fn fetch_data() -> Vec<Data> {
    let (data1, data2, data3) = tokio::join!(
        fetch_api1(),
        fetch_api2(),
        fetch_api3()
    );
    vec![data1, data2, data3]
}
```

#### Thread Pool
```rust
use rayon::prelude::*;

// Parallel iteration
fn process_large_dataset(data: Vec<i32>) -> Vec<i32> {
    data.par_iter() // Parallel iterator
        .map(|x| x * 2)
        .collect()
}

// Parallel reductions (population variance; yields NaN for empty input)
fn calculate_statistics(data: &[f64]) -> (f64, f64, f64) {
    let mean = data.par_iter().sum::<f64>() / data.len() as f64;
    let variance = data.par_iter()
        .map(|&x| (x - mean).powi(2))
        .sum::<f64>() / data.len() as f64;
    let stddev = variance.sqrt();

    (mean, variance, stddev)
}
```
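To show what a parallel reduction does underneath, here is a standard-library-only sketch that splits a slice into chunks and sums each chunk on its own scoped thread. It is an illustration of the idea, not a substitute for rayon, which reuses a work-stealing thread pool instead of spawning threads per call. `parallel_sum` and `workers` are hypothetical names.

```rust
use std::thread;

/// Sums `data` by splitting it into roughly equal chunks, one scoped thread
/// per chunk, then adding the partial sums.
fn parallel_sum(data: &[f64], workers: usize) -> f64 {
    if data.is_empty() {
        return 0.0;
    }
    let workers = workers.max(1);
    let chunk_len = (data.len() + workers - 1) / workers; // ceiling division
    thread::scope(|scope| {
        // Collect the handles first so all threads start before any join.
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<f64>()))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .sum::<f64>()
    })
}
```

For small inputs the cost of spawning threads outweighs the gain, so a change like this should be measured before it is adopted.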

### 5. I/O Optimization

#### Batch Processing
```rust
// ❌ Individual I/O operations
for item in items {
    db.save(item).await?;
}

// ✅ Batch operations
db.save_batch(&items).await?;
```
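When the input is too large for a single batch, it can be split into fixed-size chunks. The sketch below uses a hypothetical `RecordingSink` in place of a real database client, purely to make the number of round trips visible.

```rust
/// Hypothetical sink that records how many round trips it received.
#[derive(Default)]
struct RecordingSink {
    round_trips: usize,
    rows_written: usize,
}

impl RecordingSink {
    /// Stands in for one multi-row INSERT or one bulk API call.
    fn save_batch(&mut self, rows: &[u32]) {
        self.round_trips += 1;
        self.rows_written += rows.len();
    }
}

/// Writes `items` in batches of at most `batch_size` rows.
fn save_in_batches(sink: &mut RecordingSink, items: &[u32], batch_size: usize) {
    // `chunks` panics on a size of zero, so clamp to at least one row per batch.
    for batch in items.chunks(batch_size.max(1)) {
        sink.save_batch(batch);
    }
}
```

With 2,500 items and a batch size of 1,000 this makes three calls instead of 2,500; the right batch size depends on the backend's limits and should be measured.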

#### Streaming
```rust
// ❌ Load entire file into memory
let data = fs::read_to_string("large_file.txt")?;

// ✅ Stream processing
use std::fs::File;
use std::io::{BufRead, BufReader};

let file = File::open("large_file.txt")?;
let reader = BufReader::new(file);

for line in reader.lines() {
    process_line(line?);
}
```
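Writing the streaming logic against the `BufRead` trait rather than a concrete file makes it easy to test with in-memory input. A small sketch (the function name and the matching rule are illustrative):

```rust
use std::io::{self, BufRead};

/// Counts lines containing `needle`, reading one line at a time instead of
/// loading the whole input into memory.
fn count_matching_lines<R: BufRead>(reader: R, needle: &str) -> io::Result<usize> {
    let mut count = 0;
    for line in reader.lines() {
        if line?.contains(needle) {
            count += 1;
        }
    }
    Ok(count)
}
```

The same function accepts `BufReader::new(File::open(path)?)` for a file on disk or a byte slice such as `"a\nb\n".as_bytes()` in a test.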

#### Compression
```rust
// Compress large data before transmission
use flate2::write::GzEncoder;
use flate2::Compression;
use std::io::Write; // brings `write_all` into scope

let mut encoder = GzEncoder::new(Vec::new(), Compression::fast());
encoder.write_all(data.as_bytes())?;
let compressed = encoder.finish()?;
```

## Performance Monitoring

### Application Performance Monitoring (APM)
```rust
// Metrics collection
use prometheus::{Counter, Histogram, HistogramOpts, Registry};
use std::time::Instant;

let request_duration = Histogram::with_opts(
    HistogramOpts::new("http_request_duration_seconds", "Request duration")
)?;

let request_counter = Counter::new("http_requests_total", "Total requests")?;

// Record metrics
let start = Instant::now();
// ... handle request ...
request_duration.observe(start.elapsed().as_secs_f64());
request_counter.inc();
```
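To make the histogram metric less opaque, here is a standard-library-only sketch of a cumulative bucket histogram, where each bucket counts observations less than or equal to its upper bound (the convention Prometheus histograms use). `LatencyHistogram` and its methods are hypothetical and are not the `prometheus` crate's API.

```rust
/// Cumulative latency histogram: each bucket counts observations that are
/// less than or equal to its upper bound.
struct LatencyHistogram {
    upper_bounds: Vec<f64>, // seconds, ascending
    bucket_counts: Vec<u64>,
    count: u64,
}

impl LatencyHistogram {
    fn new(mut upper_bounds: Vec<f64>) -> Self {
        upper_bounds.sort_by(|a, b| a.total_cmp(b));
        let bucket_counts = vec![0; upper_bounds.len()];
        Self { upper_bounds, bucket_counts, count: 0 }
    }

    /// Records one observation in every bucket whose bound it fits under.
    fn observe(&mut self, seconds: f64) {
        self.count += 1;
        for (bound, bucket) in self.upper_bounds.iter().zip(self.bucket_counts.iter_mut()) {
            if seconds <= *bound {
                *bucket += 1;
            }
        }
    }

    /// Fraction of observations at or below `bound`, if `bound` is one of the
    /// configured buckets and at least one observation exists.
    fn fraction_within(&self, bound: f64) -> Option<f64> {
        let index = self.upper_bounds.iter().position(|b| *b == bound)?;
        if self.count == 0 {
            return None;
        }
        Some(self.bucket_counts[index] as f64 / self.count as f64)
    }
}
```

Choosing bucket bounds that line up with latency targets (for example 0.1, 0.5, and 1.0 seconds) lets a dashboard read the share of requests within each target directly.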

### Distributed Tracing
```rust
use opentelemetry::trace::{TraceContextExt, Tracer};
use opentelemetry::global;

let tracer = global::tracer("my_app");
let span = tracer.start("process_request");
let cx = opentelemetry::Context::current_with_span(span);

// ... do work ...
cx.span().end(); // `span()` is provided by TraceContextExt
```

### Logging Strategy
```rust
// Structured logging
use tracing::{info, warn, error, instrument};

#[instrument(skip(password))]
async fn login(username: &str, password: &str) -> Result<User> {
    info!(username = %username, "Login attempt");

    match authenticate(username, password).await {
        Ok(user) => {
            info!(user_id = %user.id, "Login successful");
            Ok(user)
        }
        Err(e) => {
            warn!(error = %e, username = %username, "Login failed");
            Err(e)
        }
    }
}
```

## Performance Testing

### Load Testing
```bash
# Apache Bench
ab -n 10000 -c 100 http://localhost:3000/api/users

# wrk
wrk -t12 -c400 -d30s http://localhost:3000/api/users

# Locust (Python)
locust -f locustfile.py --host=http://localhost:3000
```

### Benchmarking
```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
```

## Common Performance Issues

### 1. N+1 Query Problem
```rust
// ❌ N+1 queries
let users = db.get_users().await?;
for user in &users {
    let orders = db.get_orders_by_user(user.id).await?; // N queries
}

// ✅ Single query with JOIN
let users_with_orders = db.get_users_with_orders().await?;
```
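When a JOIN is impractical, a single bulk query followed by in-memory grouping removes the per-user queries just as well. A sketch of the grouping step, with a hypothetical `Order` type standing in for a real row struct:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
struct Order {
    id: u64,
    user_id: u64,
}

/// Groups rows from one bulk query by `user_id`, so each user's orders can be
/// looked up in memory instead of with a query per user.
fn group_orders_by_user(orders: Vec<Order>) -> HashMap<u64, Vec<Order>> {
    let mut grouped: HashMap<u64, Vec<Order>> = HashMap::new();
    for order in orders {
        grouped.entry(order.user_id).or_default().push(order);
    }
    grouped
}
```

This turns N+1 queries into two (one for users, one for their orders) plus an O(n) pass over the results.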

### 2. Memory Leaks
```rust
// ❌ Memory leak - growing collection
static GLOBAL_DATA: Mutex<Vec<Vec<u8>>> = Mutex::new(Vec::new());

fn process_data(data: Vec<u8>) {
    GLOBAL_DATA.lock().unwrap().push(data); // Never cleared
}

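// ✅ Use a bounded cache (LruCache::new is not const, so initialize lazily)
use once_cell::sync::Lazy;
use std::num::NonZeroUsize;

static GLOBAL_CACHE: Lazy<Mutex<LruCache<u64, Vec<u8>>>> =
    Lazy::new(|| Mutex::new(LruCache::new(NonZeroUsize::new(1000).unwrap()))); // Max 1000 items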
```
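If the LRU crate is not available, the same bound can be enforced with the standard library. The sketch below evicts in insertion order (FIFO) rather than by recency, which is a simpler policy than LRU; `BoundedBuffer` is a hypothetical name.

```rust
use std::collections::VecDeque;

/// Keeps at most `capacity` recent payloads, dropping the oldest when full.
struct BoundedBuffer {
    capacity: usize,
    items: VecDeque<Vec<u8>>,
}

impl BoundedBuffer {
    fn new(capacity: usize) -> Self {
        Self { capacity, items: VecDeque::with_capacity(capacity) }
    }

    /// Stores `data` and returns the payload that was evicted, if any.
    fn push(&mut self, data: Vec<u8>) -> Option<Vec<u8>> {
        if self.capacity == 0 {
            return Some(data); // nothing can be retained
        }
        let evicted = if self.items.len() == self.capacity {
            self.items.pop_front()
        } else {
            None
        };
        self.items.push_back(data);
        evicted
    }

    fn len(&self) -> usize {
        self.items.len()
    }
}
```

Because the buffer never holds more than `capacity` payloads, its memory use stays bounded no matter how long the process runs.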

### 3. Unnecessary Serialization
```rust
// ❌ Serialize/Deserialize unnecessarily
let json = serde_json::to_string(&data)?;
let data2 = serde_json::from_str::<Data>(&json)?;

// ✅ Pass references
fn process(data: &Data) { }
process(&data);
```

### 4. Synchronous I/O in Async Context
```rust
// ❌ Blocking in async context
async fn fetch_data() -> std::io::Result<Vec<u8>> {
    let data = std::fs::read("file.txt")?; // Blocks the executor thread!
    Ok(data)
}

// ✅ Use async I/O
async fn fetch_data() -> std::io::Result<Vec<u8>> {
    let data = tokio::fs::read("file.txt").await?;
    Ok(data)
}
```

## Performance Targets

### Response Time Targets
```
P50 (median):  < 100ms
P95:           < 500ms
P99:           < 1s
P99.9:         < 5s
```

### Throughput Targets
```
REST API:      > 1000 req/s
GraphQL:       > 500 req/s
WebSocket:     > 10k connections
```

### Resource Limits
```
CPU:           < 70% average
Memory:        < 80% of limit
Error Rate:    < 0.1%
```
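Targets like these are most useful when a test run is checked against them automatically. A minimal sketch, assuming the thresholds listed above and a hypothetical `Measurement` struct filled in from a load test:

```rust
/// Measured values for one test run. Field names are illustrative.
struct Measurement {
    p50_ms: f64,
    p95_ms: f64,
    p99_ms: f64,
    cpu_avg_percent: f64,
    error_rate_percent: f64,
}

/// Returns the label of every target the measurement fails to meet.
fn violated_targets(m: &Measurement) -> Vec<&'static str> {
    let checks = [
        ("p50 < 100ms", m.p50_ms < 100.0),
        ("p95 < 500ms", m.p95_ms < 500.0),
        ("p99 < 1s", m.p99_ms < 1000.0),
        ("CPU < 70% average", m.cpu_avg_percent < 70.0),
        ("error rate < 0.1%", m.error_rate_percent < 0.1),
    ];
    let mut failed = Vec::new();
    for (label, ok) in checks {
        if !ok {
            failed.push(label);
        }
    }
    failed
}
```

An empty result means the run met every target; a CI job could fail the build otherwise.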

## Optimization Checklist

### Code Review
- [ ] Algorithm complexity optimized
- [ ] Memory allocations minimized
- [ ] Caching implemented appropriately
- [ ] Async/await used correctly
- [ ] No blocking operations in async context
- [ ] Connection pooling configured
- [ ] Batch operations used

### Infrastructure
- [ ] CDN configured for static assets
- [ ] Load balancing configured
- [ ] Database indexes optimized
- [ ] Connection pools sized correctly
- [ ] Caching layers configured
- [ ] Compression enabled
- [ ] HTTP/2 enabled

### Monitoring
- [ ] APM configured
- [ ] Metrics collected
- [ ] Alerts configured
- [ ] Dashboards set up
- [ ] Log aggregation
- [ ] Distributed tracing

## Tools & Resources

### Profiling Tools
- **Flamegraph**: Visualization of CPU usage
- **Valgrind**: Memory profiling
- **perf**: Linux performance analysis
- **pprof**: Go profiler

### Monitoring Tools
- **Prometheus**: Metrics collection
- **Grafana**: Visualization
- **Jaeger**: Distributed tracing
- **ELK Stack**: Log aggregation

### Load Testing Tools
- **wrk**: HTTP benchmarking
- **Locust**: Python load testing
- **k6**: Modern load testing
- **Apache Bench**: Simple benchmarking

### Documentation
- [Google SRE Book](https://sre.google/sre-book/table-of-contents/)
- [Performance Budgets](https://web.dev/performance-budgets-101/)
- [Rust Performance Book](https://nnethercote.github.io/perf-book/)

Overview

This skill is a performance optimization expert for applications and infrastructure, focused on practical analysis and actionable improvements. It combines measurement, profiling, code and database tuning, caching, concurrency, I/O optimizations, and monitoring to deliver measurable speed and resource improvements. Use it to find bottlenecks, apply targeted fixes, and verify impact with repeatable metrics.

How this skill works

I start by establishing baseline metrics (latency percentiles, throughput, CPU/memory, error rate) and run targeted profilers for the runtime and database. I identify hotspots using flamegraphs, CPU/heap profilers, slow query logs and tracing, then propose and implement fixes at code, database, caching, concurrency and I/O layers. Finally I verify changes with load tests and benchmarks and iterate until targets are met.

When to use it

  • When p95 or p99 latency exceeds your SLA
  • When CPU or memory steadily climbs under load
  • When database queries dominate request time
  • Before scaling hardware to avoid wasted cost
  • When an application shows N+1 queries, blocking I/O, or excessive allocations

Best practices

  • Measure first: capture p50/p95/p99, throughput, resource usage and trace samples
  • Use appropriate profilers (flamegraph, perf, massif) and interpret call stacks before changing code
  • Prefer algorithmic improvements and reduced allocations over micro-optimizations
  • Introduce caching and connection pooling with clear invalidation and size limits
  • Run realistic load tests and benchmarks to validate gains and prevent regressions

Example use cases

  • Reduce API p95 latency by replacing repeated per-row database calls with JOINs and bulk fetches
  • Lower memory usage by eliminating unnecessary copies and pre-allocating collections
  • Increase throughput by running independent awaits concurrently with tokio::join! and parallelizing CPU-bound loops with rayon
  • Fix N+1 query patterns by refactoring to joins or bulk queries and adding appropriate indexes
  • Add multi-level caching (in-process L1, Redis, CDN) to offload the database and improve tail latency

FAQ

What metrics should I collect first?

Start with response time percentiles (p50, p95, p99), throughput, error rate, CPU, memory, and key DB query times.

Which profiler should I run for Rust?

Use cargo-flamegraph for CPU hotspots, perf for system-level profiling, and massif/Valgrind for heap profiling.