---
name: python-performance-optimization
description: Python performance optimization patterns using profiling, algorithmic improvements, and acceleration techniques. Use when optimizing slow Python code, reducing memory usage, or improving application throughput and latency.
---
# Python Performance Optimization
Expert guidance for profiling, optimizing, and accelerating Python applications through systematic analysis, algorithmic improvements, efficient data structures, and acceleration techniques.
## When to Use This Skill
- Code runs too slowly for production requirements
- High CPU usage or memory consumption issues
- Need to reduce API response times or batch processing duration
- Application fails to scale under load
- Optimizing data processing pipelines or scientific computing
- Reducing cloud infrastructure costs through efficiency gains
- Profile-guided optimization after measuring performance bottlenecks
## Core Concepts
**The Golden Rule**: Never optimize without profiling first. Typically, roughly 80% of execution time is spent in 20% of the code, so measure to find that 20% before changing anything.
**Optimization Hierarchy** (in priority order):
1. **Algorithm complexity** - O(n²) → O(n log n) yields gains that grow with input size
2. **Data structure choice** - List → Set for membership tests (O(n) → O(1) average case; orders of magnitude faster on large collections)
3. **Language features** - Comprehensions, built-ins, generators
4. **Caching** - Memoization for repeated calculations
5. **Compiled extensions** - NumPy, Numba, Cython for hot paths
6. **Parallelism** - Multiprocessing for CPU-bound work
**Key Principle**: Algorithmic improvements beat micro-optimizations every time.
## Quick Reference
Load detailed guides for specific optimization areas:
| Task | Load reference |
| --- | --- |
| Profile code and find bottlenecks | `skills/python-performance-optimization/references/profiling.md` |
| Algorithm and data structure optimization | `skills/python-performance-optimization/references/algorithms.md` |
| Memory optimization and generators | `skills/python-performance-optimization/references/memory.md` |
| String concatenation and file I/O | `skills/python-performance-optimization/references/string-io.md` |
| NumPy, Numba, Cython, multiprocessing | `skills/python-performance-optimization/references/acceleration.md` |
## Optimization Workflow
### Phase 1: Measure
1. **Profile with cProfile** - Identify slow functions
2. **Line profile hot paths** - Find exact slow lines
3. **Memory profile** - Check for memory bottlenecks
4. **Benchmark baseline** - Record current performance
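The measurement steps above can be sketched with `cProfile` and `pstats` from the standard library; `slow_sum` is a hypothetical hotspot used only for illustration:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately quadratic work to produce a visible hotspot
    total = 0
    for i in range(n):
        total += sum(range(i))
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(500)
profiler.disable()

# Print the 5 most expensive functions by cumulative time
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
print(stream.getvalue())
```

For line-level detail, follow up with `line_profiler` on the functions cProfile flags.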
### Phase 2: Analyze
1. **Check algorithm complexity** - Is it O(n²) or worse?
2. **Evaluate data structures** - Are you using lists for lookups?
3. **Identify repeated work** - Can results be cached?
4. **Find I/O bottlenecks** - Database queries, file operations
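The complexity check above can be approximated empirically: time the same function at n and 2n and compare the ratio. This is a rough heuristic, not a proof; `quadratic` below is an illustrative O(n²) function.

```python
import timeit

def quadratic(n):
    # Deliberately O(n^2): each membership test scans the list
    data = list(range(n))
    return sum(1 for x in data if x in data)

# Time at n and 2n; the ratio hints at the growth rate
t1 = timeit.timeit(lambda: quadratic(500), number=5)
t2 = timeit.timeit(lambda: quadratic(1000), number=5)
print(f"doubling n scaled runtime by ~{t2 / t1:.1f}x")  # ~4x suggests O(n^2)
```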
### Phase 3: Optimize
1. **Improve algorithms first** - Biggest impact
2. **Use appropriate data structures** - Set/dict for O(1) lookups
3. **Apply caching** - `@lru_cache` for expensive functions
4. **Use generators** - For large datasets
5. **Leverage NumPy/Numba** - For numerical code
6. **Parallelize** - Multiprocessing for CPU-bound tasks
### Phase 4: Validate
1. **Re-profile** - Verify improvements
2. **Benchmark** - Measure speedup quantitatively
3. **Test correctness** - Ensure optimizations didn't break functionality
4. **Document** - Explain why optimization was needed
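The validation steps above can be combined in one script with `timeit`: benchmark the old and new versions side by side and assert they produce identical results. `baseline` and `optimized` are hypothetical examples.

```python
import timeit

def baseline(n):
    result = []
    for i in range(n):
        result.append(i * 2)
    return result

def optimized(n):
    return [i * 2 for i in range(n)]

# Benchmark both versions under identical conditions
t_base = timeit.timeit(lambda: baseline(10_000), number=200)
t_opt = timeit.timeit(lambda: optimized(10_000), number=200)
print(f"speedup: {t_base / t_opt:.2f}x")

# Correctness check: the optimization must not change the output
assert baseline(10_000) == optimized(10_000)
```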
## Common Optimization Patterns
### Pattern 1: Replace List with Set for Lookups
```python
# Slow: O(n) membership test scans the whole list
if item in large_list:
    ...

# Fast: O(1) average-case lookup via hashing
if item in large_set:
    ...
```
### Pattern 2: Use Comprehensions
```python
# Slower: repeated attribute lookups and append calls
result = []
for i in range(n):
    result.append(i * 2)

# Faster: comprehensions typically run ~20-35% quicker
result = [i * 2 for i in range(n)]
```
### Pattern 3: Cache Expensive Calculations
```python
from functools import lru_cache
@lru_cache(maxsize=None)  # unbounded cache; set maxsize to cap memory
def expensive_function(n):
    # Repeated calls with the same n return the cached result
    return complex_calculation(n)
```
### Pattern 4: Use Generators for Large Data
```python
# Memory inefficient: loads the entire file into a list
def read_file(path):
    with open(path) as f:
        return [line for line in f]

# Memory efficient: streams one line at a time
def read_file(path):
    with open(path) as f:
        for line in f:
            yield line.strip()
```
### Pattern 5: Vectorize with NumPy
```python
# Pure Python: hundreds of milliseconds on a typical machine
result = sum(i**2 for i in range(1_000_000))

# NumPy: often 1-2 orders of magnitude faster
import numpy as np
result = np.sum(np.arange(1_000_000) ** 2)
```
## Common Mistakes to Avoid
1. **Optimizing before profiling** - You'll optimize the wrong code
2. **Using lists for membership tests** - Use sets/dicts instead
3. **String concatenation in loops** - Use `"".join()` or `StringIO`
4. **Loading entire files into memory** - Use generators
5. **N+1 database queries** - Use JOINs or batch queries
6. **Ignoring built-in functions** - They're C-optimized and fast
7. **Premature optimization** - Focus on algorithmic improvements first
8. **Not benchmarking** - Always measure improvements quantitatively
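Mistake 3 above deserves a concrete sketch: repeated `+=` on strings copies the accumulated string each iteration (quadratic total work), while `"".join()` or `StringIO` builds the result once.

```python
from io import StringIO

parts = [f"line {i}" for i in range(1000)]

# Avoid: s += part in a loop copies the growing string every iteration.
# Prefer: collect the pieces, then join once.
joined = "\n".join(parts)

# StringIO works well when pieces arrive incrementally
buf = StringIO()
for p in parts:
    buf.write(p)
    buf.write("\n")

assert buf.getvalue() == joined + "\n"
```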
## Decision Tree
**Start here**: Profile with cProfile to find bottlenecks
**Hot path is algorithm?**
- Yes → Check complexity, improve algorithm, use better data structures
- No → Continue
**Hot path is computation?**
- Numerical loops → NumPy or Numba
- CPU-bound → Multiprocessing
- Already fast enough → Done
**Hot path is memory?**
- Large data → Generators, streaming
- Many objects → `__slots__`, object pooling
- Caching needed → `@lru_cache` or custom cache
**Hot path is I/O?**
- Database → Batch queries, indexes, connection pooling
- Files → Buffering, streaming
- Network → Async I/O, request batching
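The "many objects" branch above can be illustrated with `__slots__`, which drops the per-instance `__dict__` and so shrinks each object; `PointDict` and `PointSlots` are toy classes for comparison.

```python
import sys

class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class PointSlots:
    __slots__ = ("x", "y")  # fixed attributes; no per-instance __dict__

    def __init__(self, x, y):
        self.x = x
        self.y = y

a = PointDict(1, 2)
b = PointSlots(1, 2)

# The slotted instance has no __dict__, saving memory per object --
# significant when millions of instances are alive at once.
print(sys.getsizeof(a.__dict__))  # dict overhead that __slots__ avoids
```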
## Best Practices
1. **Profile before optimizing** - Measure to find real bottlenecks
2. **Optimize algorithms first** - O(n²) → O(n) beats micro-optimizations
3. **Use appropriate data structures** - Set/dict for lookups, not lists
4. **Leverage built-ins** - C-implemented built-ins are faster than pure Python
5. **Avoid premature optimization** - Optimize hot paths identified by profiling
6. **Use generators for large data** - Reduce memory usage with lazy evaluation
7. **Batch operations** - Minimize overhead from syscalls and network requests
8. **Cache expensive computations** - Use `@lru_cache` or custom caching
9. **Consider NumPy/Numba** - Vectorization and JIT for numerical code
10. **Parallelize CPU-bound work** - Use multiprocessing to utilize all cores
## Resources
- **Python Performance**: https://wiki.python.org/moin/PythonSpeed
- **cProfile**: https://docs.python.org/3/library/profile.html
- **NumPy**: https://numpy.org/doc/stable/user/absolute_beginners.html
- **Numba**: https://numba.pydata.org/
- **Cython**: https://cython.readthedocs.io/
- **High Performance Python** (Book by Gorelick & Ozsvald)
## FAQ
**What profiler should I start with?**
Begin with cProfile to find slow functions, then use line_profiler for line-level hotspots and a memory profiler for allocation issues.

**When should I use NumPy vs Numba vs multiprocessing?**
Use NumPy to vectorize array computations; use Numba for JIT-compiling tight numerical loops that can't be vectorized; use multiprocessing for coarse-grained CPU-bound tasks that parallelize across data partitions.