---
name: python-performance-optimization
description: Python performance optimization patterns using profiling, algorithmic improvements, and acceleration techniques. Use when optimizing slow Python code, reducing memory usage, or improving application throughput and latency.
---
# Python Performance Optimization
Expert guidance for profiling, optimizing, and accelerating Python applications through systematic analysis, algorithmic improvements, efficient data structures, and acceleration techniques.
## When to Use This Skill
- Code runs too slowly for production requirements
- High CPU usage or memory consumption issues
- Need to reduce API response times or batch processing duration
- Application fails to scale under load
- Optimizing data processing pipelines or scientific computing
- Reducing cloud infrastructure costs through efficiency gains
- Profile-guided optimization after measuring performance bottlenecks
## Core Concepts
**The Golden Rule**: Never optimize without profiling first. Typically, roughly 80% of execution time is spent in 20% of the code, so measure to find that 20% before changing anything.
**Optimization Hierarchy** (in priority order):
1. **Algorithm complexity** - O(n²) → O(n log n) yields gains that grow with input size
2. **Data structure choice** - List → Set for membership tests (O(n) → O(1) average case; orders of magnitude faster on large collections)
3. **Language features** - Comprehensions, built-ins, generators
4. **Caching** - Memoization for repeated calculations
5. **Compiled extensions** - NumPy, Numba, Cython for hot paths
6. **Parallelism** - Multiprocessing for CPU-bound work
**Key Principle**: Algorithmic improvements beat micro-optimizations every time.
## Quick Reference
Load detailed guides for specific optimization areas:
| Task | Load reference |
| --- | --- |
| Profile code and find bottlenecks | `skills/python-performance-optimization/references/profiling.md` |
| Algorithm and data structure optimization | `skills/python-performance-optimization/references/algorithms.md` |
| Memory optimization and generators | `skills/python-performance-optimization/references/memory.md` |
| String concatenation and file I/O | `skills/python-performance-optimization/references/string-io.md` |
| NumPy, Numba, Cython, multiprocessing | `skills/python-performance-optimization/references/acceleration.md` |
## Optimization Workflow
### Phase 1: Measure
1. **Profile with cProfile** - Identify slow functions
2. **Line profile hot paths** - Find exact slow lines
3. **Memory profile** - Check for memory bottlenecks
4. **Benchmark baseline** - Record current performance
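The measurement steps above can be sketched with `cProfile` and `pstats` from the standard library; `slow_sum` is a hypothetical hotspot used only for illustration:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately quadratic work to produce a visible hotspot
    total = 0
    for i in range(n):
        total += sum(range(i))
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(500)
profiler.disable()

# Print the 5 most expensive functions by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())
```

For line-level detail, follow up with `line_profiler` on the functions cProfile flags.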
### Phase 2: Analyze
1. **Check algorithm complexity** - Is it O(n²) or worse?
2. **Evaluate data structures** - Are you using lists for lookups?
3. **Identify repeated work** - Can results be cached?
4. **Find I/O bottlenecks** - Database queries, file operations
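The complexity check above can be approximated empirically: time the same function at n and 2n and compare the ratio. This is a rough heuristic, not a proof; `quadratic` below is an illustrative O(n²) function.

```python
import timeit

def quadratic(n):
    # Deliberately O(n^2): each membership test scans the list
    data = list(range(n))
    return sum(1 for x in data if x in data)

# Time at n and 2n; the ratio hints at the growth rate
t1 = timeit.timeit(lambda: quadratic(500), number=5)
t2 = timeit.timeit(lambda: quadratic(1000), number=5)
print(f"doubling n scaled runtime by ~{t2 / t1:.1f}x")  # ~4x suggests O(n^2)
```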
### Phase 3: Optimize
1. **Improve algorithms first** - Biggest impact
2. **Use appropriate data structures** - Set/dict for O(1) lookups
3. **Apply caching** - `@lru_cache` for expensive functions
4. **Use generators** - For large datasets
5. **Leverage NumPy/Numba** - For numerical code
6. **Parallelize** - Multiprocessing for CPU-bound tasks
### Phase 4: Validate
1. **Re-profile** - Verify improvements
2. **Benchmark** - Measure speedup quantitatively
3. **Test correctness** - Ensure optimizations didn't break functionality
4. **Document** - Explain why optimization was needed
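The validation steps above can be combined in one script with `timeit`: benchmark the old and new versions side by side and assert they produce identical results. `baseline` and `optimized` are hypothetical examples.

```python
import timeit

def baseline(n):
    result = []
    for i in range(n):
        result.append(i * 2)
    return result

def optimized(n):
    return [i * 2 for i in range(n)]

# Benchmark both versions under identical conditions
t_base = timeit.timeit(lambda: baseline(10_000), number=200)
t_opt = timeit.timeit(lambda: optimized(10_000), number=200)
print(f"speedup: {t_base / t_opt:.2f}x")

# Correctness check: the optimization must not change the output
assert baseline(10_000) == optimized(10_000)
```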
## Common Optimization Patterns
### Pattern 1: Replace List with Set for Lookups
```python
# Slow: O(n) membership test scans the whole list
if item in large_list:
    ...

# Fast: O(1) average-case lookup via hashing
if item in large_set:
    ...
```
### Pattern 2: Use Comprehensions
```python
# Slower: repeated attribute lookups and append calls
result = []
for i in range(n):
    result.append(i * 2)

# Faster: comprehensions typically run ~20-35% quicker
result = [i * 2 for i in range(n)]
```
### Pattern 3: Cache Expensive Calculations
```python
from functools import lru_cache
@lru_cache(maxsize=None)  # unbounded cache; set maxsize to cap memory
def expensive_function(n):
    # Repeated calls with the same n return the cached result
    return complex_calculation(n)
```
### Pattern 4: Use Generators for Large Data
```python
# Memory inefficient: loads the entire file into a list
def read_file(path):
    with open(path) as f:
        return [line for line in f]

# Memory efficient: streams one line at a time
def read_file(path):
    with open(path) as f:
        for line in f:
            yield line.strip()
```
### Pattern 5: Vectorize with NumPy
```python
# Pure Python: hundreds of milliseconds on a typical machine
result = sum(i**2 for i in range(1_000_000))

# NumPy: often 1-2 orders of magnitude faster
import numpy as np
result = np.sum(np.arange(1_000_000) ** 2)
```
## Common Mistakes to Avoid
1. **Optimizing before profiling** - You'll optimize the wrong code
2. **Using lists for membership tests** - Use sets/dicts instead
3. **String concatenation in loops** - Use `"".join()` or `StringIO`
4. **Loading entire files into memory** - Use generators
5. **N+1 database queries** - Use JOINs or batch queries
6. **Ignoring built-in functions** - They're C-optimized and fast
7. **Premature optimization** - Focus on algorithmic improvements first
8. **Not benchmarking** - Always measure improvements quantitatively
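Mistake 3 above deserves a concrete sketch: repeated `+=` on strings copies the accumulated string each iteration (quadratic total work), while `"".join()` or `StringIO` builds the result once.

```python
from io import StringIO

parts = [f"line {i}" for i in range(1000)]

# Avoid: s += part in a loop copies the growing string every iteration.
# Prefer: collect the pieces, then join once.
joined = "\n".join(parts)

# StringIO works well when pieces arrive incrementally
buf = StringIO()
for p in parts:
    buf.write(p)
    buf.write("\n")

assert buf.getvalue() == joined + "\n"
```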
## Decision Tree
**Start here**: Profile with cProfile to find bottlenecks
**Hot path is algorithm?**
- Yes → Check complexity, improve algorithm, use better data structures
- No → Continue
**Hot path is computation?**
- Numerical loops → NumPy or Numba
- CPU-bound → Multiprocessing
- Already fast enough → Done
**Hot path is memory?**
- Large data → Generators, streaming
- Many objects → `__slots__`, object pooling
- Caching needed → `@lru_cache` or custom cache
**Hot path is I/O?**
- Database → Batch queries, indexes, connection pooling
- Files → Buffering, streaming
- Network → Async I/O, request batching
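The "many objects" branch above can be illustrated with `__slots__`, which drops the per-instance `__dict__` and so shrinks each object; `PointDict` and `PointSlots` are toy classes for comparison.

```python
import sys

class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    __slots__ = ("x", "y")  # fixed attributes; no per-instance __dict__

    def __init__(self, x, y):
        self.x = x
        self.y = y

a = PointDict(1, 2)
b = PointSlots(1, 2)

# The slotted instance has no __dict__, saving memory per object --
# significant when millions of instances are alive at once.
print(sys.getsizeof(a.__dict__))  # dict overhead that __slots__ avoids
```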
## Best Practices
1. **Profile before optimizing** - Measure to find real bottlenecks
2. **Optimize algorithms first** - O(n²) → O(n) beats micro-optimizations
3. **Use appropriate data structures** - Set/dict for lookups, not lists
4. **Leverage built-ins** - C-implemented built-ins are faster than pure Python
5. **Avoid premature optimization** - Optimize hot paths identified by profiling
6. **Use generators for large data** - Reduce memory usage with lazy evaluation
7. **Batch operations** - Minimize overhead from syscalls and network requests
8. **Cache expensive computations** - Use `@lru_cache` or custom caching
9. **Consider NumPy/Numba** - Vectorization and JIT for numerical code
10. **Parallelize CPU-bound work** - Use multiprocessing to utilize all cores
## Resources
- **Python Performance**: https://wiki.python.org/moin/PythonSpeed
- **cProfile**: https://docs.python.org/3/library/profile.html
- **NumPy**: https://numpy.org/doc/stable/user/absolute_beginners.html
- **Numba**: https://numba.pydata.org/
- **Cython**: https://cython.readthedocs.io/
- **High Performance Python** (Book by Gorelick & Ozsvald)
## FAQ
**What profiler should I start with?**
Begin with cProfile to find slow functions, then use line_profiler for line-level hotspots and a memory profiler for allocation issues.

**When should I use NumPy vs Numba vs multiprocessing?**
Use NumPy to vectorize array computations; use Numba for JIT-compiling tight numerical loops that can't be vectorized; use multiprocessing for coarse-grained CPU-bound tasks that parallelize across data partitions.