
python-performance skill


This skill helps you optimize Python performance through profiling, memory management, and high-performance techniques for faster, scalable code.

npx playbooks add skill pluginagentmarketplace/custom-plugin-python --skill python-performance

---
name: Python Performance
description: Master Python optimization techniques, profiling, memory management, and high-performance computing
version: "2.1.0"
sasmp_version: "1.3.0"
bonded_agent: 07-best-practices
bond_type: PRIMARY_BOND

# Skill Configuration
retry_strategy: exponential_backoff
observability:
  logging: true
  metrics: execution_time_improvement
---

# Python Performance Optimization

## Overview

Master performance optimization in Python. Learn to profile code, identify bottlenecks, optimize algorithms, manage memory efficiently, and leverage high-performance libraries for compute-intensive tasks.

## Learning Objectives

- Profile Python code to identify bottlenecks
- Optimize algorithms and data structures
- Manage memory efficiently
- Use compiled extensions (Cython, NumPy)
- Implement caching strategies
- Parallelize CPU-bound operations
- Benchmark and measure improvements

## Core Topics

### 1. Profiling & Benchmarking
- timeit module for micro-benchmarks
- cProfile for function-level profiling
- line_profiler for line-by-line analysis
- memory_profiler for memory usage
- py-spy for production profiling
- Flame graphs and visualization

**Code Example:**
```python
import timeit
import cProfile
import pstats

# 1. timeit for micro-benchmarks
def list_comprehension():
    return [x**2 for x in range(1000)]

def map_function():
    return list(map(lambda x: x**2, range(1000)))

# Compare performance
time_lc = timeit.timeit(list_comprehension, number=10000)
time_map = timeit.timeit(map_function, number=10000)
print(f"List comprehension: {time_lc:.4f}s")
print(f"Map function: {time_map:.4f}s")

# 2. cProfile for function profiling
def process_data():
    data = []
    for i in range(100000):
        data.append(i ** 2)
    return sum(data)

profiler = cProfile.Profile()
profiler.enable()
result = process_data()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(10)

# 3. Line profiling (requires line_profiler package)
# @profile decorator (add manually for line_profiler)
def slow_function():
    total = 0
    for i in range(1000000):
        total += i ** 2
    return total

# Run with: kernprof -l -v script.py

# 4. Memory profiling
from memory_profiler import profile

@profile
def memory_intensive():
    large_list = [i for i in range(1000000)]
    large_dict = {i: i**2 for i in range(1000000)}
    return len(large_list) + len(large_dict)

# Run with: python -m memory_profiler script.py
```
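
A single `timeit.timeit` run can be skewed by whatever else the machine is doing. A minimal sketch of a steadier comparison, assuming a hypothetical helper named `best_time` (not part of the original examples), uses `timeit.repeat` and keeps the fastest of several runs:

```python
import timeit

def best_time(func, repeat=5, number=1000):
    """Return the fastest per-call time in seconds over `repeat` runs.

    best_time is a hypothetical helper for illustration.
    """
    # timeit.repeat returns one total time per run; the minimum is the
    # run least disturbed by other activity on the machine.
    runs = timeit.repeat(func, repeat=repeat, number=number)
    return min(runs) / number

def list_comprehension():
    return [x**2 for x in range(1000)]

def map_function():
    return list(map(lambda x: x**2, range(1000)))

if __name__ == "__main__":
    print(f"List comprehension: {best_time(list_comprehension):.2e}s per call")
    print(f"Map function: {best_time(map_function):.2e}s per call")
```

Taking the minimum rather than the mean is a common convention for micro-benchmarks, since slower runs usually reflect interference rather than the code itself.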

### 2. Algorithm & Data Structure Optimization
- Choosing efficient data structures
- Time complexity analysis
- Generator expressions vs lists
- Set operations for lookups
- Deque for queue operations
- Bisect for sorted lists

**Code Example:**
```python
import bisect
from collections import deque, Counter, defaultdict
import time

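# 1. List vs Set for membership testing
# Bad: O(n) lookup
def find_in_list(items, target):
    return target in items  # Linear search

# Good: O(1) average lookup, provided the set is built once and reused.
# Building the set is itself O(n), so converting on every call would
# be no faster than searching the list.
def find_in_set(items_set, target):
    return target in items_set

items = list(range(100000))
items_set = set(items)  # One-time O(n) conversion
# Timings depend on hardware and on where the target sits in the list;
# measure with timeit rather than assuming a fixed speedup.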

# 2. Generator expressions for memory efficiency
# Bad: Creates entire list in memory
squares_list = [x**2 for x in range(1000000)]  # Several MB for the list, plus the int objects

# Good: Generates on-demand
squares_gen = (x**2 for x in range(1000000))   # Small, constant-size object regardless of length

# 3. Deque for efficient queue operations
# Bad: O(n) pop from beginning
queue_list = list(range(10000))
queue_list.pop(0)  # Slow

# Good: O(1) pop from both ends
queue_deque = deque(range(10000))
queue_deque.popleft()  # Fast

# 4. Bisect for maintaining sorted lists
# Bad: Re-sorts the whole list after every append
sorted_list = []
for i in [5, 2, 8, 1, 9]:
    sorted_list.append(i)
    sorted_list.sort()

# Good: O(log n) search for the position. The insertion itself is still
# O(n) because later elements shift, but it avoids a full re-sort.
sorted_list = []
for i in [5, 2, 8, 1, 9]:
    bisect.insort(sorted_list, i)

# 5. Counter for frequency counting
words = "the quick brown fox jumps over the lazy dog the end".split()  # Sample input

# Bad: Manual counting
word_count = {}
for word in words:
    if word in word_count:
        word_count[word] += 1
    else:
        word_count[word] = 1

# Good: Counter
word_count = Counter(words)
most_common = word_count.most_common(10)
```
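
The `bisect` example above covers insertion only. As a sketch of the lookup side, the hypothetical helpers `contains_sorted` and `count_in_range` below (names chosen for illustration) use binary search on a list that is assumed to be sorted in ascending order:

```python
import bisect

def contains_sorted(sorted_items, target):
    """Return True if target is in sorted_items, using binary search.

    Hypothetical helper; sorted_items must already be sorted ascending.
    """
    i = bisect.bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target

def count_in_range(sorted_items, low, high):
    """Count items x with low <= x <= high using two binary searches."""
    left = bisect.bisect_left(sorted_items, low)
    right = bisect.bisect_right(sorted_items, high)
    return right - left

scores = [1, 2, 5, 8, 9]
print(contains_sorted(scores, 8))    # True
print(contains_sorted(scores, 7))    # False
print(count_in_range(scores, 2, 8))  # 3
```

Both helpers run in O(log n) per query, which pays off when the list is sorted once and queried many times. For plain membership tests on unsorted data, a set is usually the simpler choice.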

### 3. Memory Management
- Memory allocation and garbage collection
- Object pooling
- Slots for memory-efficient classes
- Reference counting
- Weak references
- Memory leaks detection

**Code Example:**
```python
import gc
import sys
from weakref import WeakValueDictionary

# 1. __slots__ for memory-efficient classes
# Bad: Regular class (each instance carries a per-instance __dict__)
class RegularPoint:
    def __init__(self, x, y):
        self.x = x
        self.y = y

# Good: Slots class (no per-instance __dict__, so instances are smaller)
class SlottedPoint:
    __slots__ = ('x', 'y')

    def __init__(self, x, y):
        self.x = x
        self.y = y

# sys.getsizeof does not count the instance __dict__, so add it explicitly.
# Exact sizes vary by Python version and platform.
regular = RegularPoint(1, 2)
print(sys.getsizeof(regular) + sys.getsizeof(regular.__dict__))
print(sys.getsizeof(SlottedPoint(1, 2)))

# 2. Object pooling for expensive objects
class ObjectPool:
    def __init__(self, factory, max_size=10):
        self.factory = factory
        self.max_size = max_size
        self.pool = []

    def acquire(self):
        if self.pool:
            return self.pool.pop()
        return self.factory()

    def release(self, obj):
        if len(self.pool) < self.max_size:
            self.pool.append(obj)

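# Usage (DatabaseConnection is a placeholder for any expensive-to-create object)
class DatabaseConnection:
    pass

db_pool = ObjectPool(DatabaseConnection, max_size=5)
conn = db_pool.acquire()
try:
    pass  # Use connection
finally:
    db_pool.release(conn)  # Always return the object to the pool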

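# 3. Weak references to prevent memory leaks
# Entries disappear once no strong reference to the value remains.
# Values must support weak references (instances of most classes do; int, str, and tuple do not).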
class Cache:
    def __init__(self):
        self._cache = WeakValueDictionary()

    def get(self, key):
        return self._cache.get(key)

    def set(self, key, value):
        self._cache[key] = value

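# 4. Manual garbage collection for large operations
# Reference counting already frees most objects immediately. gc.collect()
# only helps when batches leave behind reference cycles, and each call has
# a cost, so confirm the benefit with a profiler.
def process_large_dataset(batches, process_batch):
    for batch in batches:
        process_batch(batch)
        gc.collect()  # Reclaim cyclic garbage left by this batch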

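# 5. Context managers for resource cleanup
# allocate_resource is a placeholder for whatever acquires the resource;
# the returned object is assumed to provide a cleanup() method.
class ManagedResource:
    def __init__(self, allocate_resource):
        self._allocate = allocate_resource

    def __enter__(self):
        self.resource = self._allocate()
        return self.resource

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.resource.cleanup()
        return False  # Do not suppress exceptions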
```

### 4. High-Performance Computing
- NumPy vectorization
- Numba JIT compilation
- Cython for C extensions
- Multiprocessing for parallelism
- Concurrent.futures
- Performance comparison

**Code Example:**
```python
import numpy as np
from numba import jit
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

# 1. NumPy vectorization
# Bad: Python loops (slow)
def python_sum(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

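# Good: NumPy vectorization (often far faster for large n)
def numpy_sum(n):
    # Explicit 64-bit integers; sums beyond that range overflow silently
    arr = np.arange(n, dtype=np.int64)
    return np.sum(arr ** 2)

# Benchmark both with timeit on your own machine: the speedup depends
# on n, hardware, and NumPy version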

# 2. Numba JIT compilation
@jit(nopython=True)  # Compile to machine code
def fast_function(n):
    total = 0
    for i in range(n):
        total += i ** 2
    return total

# First call: compilation + execution
# Subsequent calls reuse the compiled code and are typically much faster
# than pure Python; benchmark to quantify the gain

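# 3. Multiprocessing for CPU-bound tasks
def cpu_intensive_task(bounds):
    start, stop = bounds
    return sum(i * i for i in range(start, stop))

# The __main__ guard is required where worker processes are spawned
# (the default on Windows and macOS)
if __name__ == "__main__":
    # Single process
    result = cpu_intensive_task((0, 10000000))

    # Multiple processes: split the same range into four equal chunks
    chunks = [(i * 2500000, (i + 1) * 2500000) for i in range(4)]
    with ProcessPoolExecutor(max_workers=4) as executor:
        total = sum(executor.map(cpu_intensive_task, chunks))

    assert total == result  # Same answer as the single-process run
    # Up to about 4x speedup on 4 cores, less process startup and pickling overhead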

# 4. Caching for expensive computations
from functools import lru_cache

@lru_cache(maxsize=128)
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

# fibonacci(100) without cache: ~forever
# fibonacci(100) with cache: instant

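# 5. Memory views for zero-copy operations
def process_array(data):
    # Bad: Slicing bytes or bytearray creates a copy
    subset = data[1000:2000]

    # Good: Zero-copy view over the same buffer
    # (NumPy slices are already views, so this matters for bytes-like objects)
    view = memoryview(data)[1000:2000]
    return view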
```
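
To check whether a cache is actually being hit, functions wrapped with `functools.lru_cache` expose `cache_info()` and `cache_clear()`. A minimal sketch, reusing the Fibonacci example with an unbounded cache (the `maxsize=None` setting is a choice made here for illustration):

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # Unbounded cache; prefer a finite maxsize for large key spaces
def fibonacci(n):
    if n < 2:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(30))           # 832040
info = fibonacci.cache_info()
print(info.hits, info.misses)  # 28 31
fibonacci.cache_clear()        # Reset the cache, e.g. between benchmark runs
```

Each of the 31 distinct arguments (0 through 30) is computed once and counted as a miss; every other call is served from the cache. A low hit count on real workloads suggests the cache key space is too large or the function is rarely called with repeated arguments.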

## Hands-On Practice

### Project 1: Performance Profiler
Build a comprehensive profiling tool.

**Requirements:**
- CPU profiling with cProfile
- Memory profiling
- Line-by-line analysis
- Visualization (flame graphs)
- HTML report generation
- Bottleneck identification

**Key Skills:** Profiling tools, visualization, analysis

### Project 2: Data Processing Pipeline
Optimize data processing pipeline.

**Requirements:**
- Load large CSV files (1GB+)
- Transform and clean data
- Aggregate statistics
- Compare Python/NumPy/Pandas approaches
- Measure memory usage
- Optimize to <2GB RAM

**Key Skills:** NumPy, memory optimization, benchmarking
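
One way to approach the memory requirement is to stream rows rather than load the whole file. The sketch below uses only the standard library; `iter_rows` and `mean_by_key` are hypothetical helper names, and the column names and sample data are invented for illustration.

```python
import csv
import io

def iter_rows(lines):
    """Yield one parsed row (a dict) at a time instead of loading the file."""
    yield from csv.DictReader(lines)

def mean_by_key(rows, key_field, value_field):
    """Keep a running sum and count per key in a single pass over rows."""
    totals = {}
    for row in rows:
        try:
            value = float(row[value_field])
        except (TypeError, ValueError):
            continue  # Skip rows with missing or malformed values
        total, count = totals.get(row[key_field], (0.0, 0))
        totals[row[key_field]] = (total + value, count + 1)
    return {key: total / count for key, (total, count) in totals.items()}

# Hypothetical sample data; with a real file, pass open(path, newline="")
sample = io.StringIO("city,temp\nOslo,4\nCairo,30\nOslo,6\nCairo,bad\n")
print(mean_by_key(iter_rows(sample), "city", "temp"))
# {'Oslo': 5.0, 'Cairo': 30.0}
```

Memory use here grows with the number of distinct keys rather than the number of rows, which is the property the RAM target depends on.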

### Project 3: Parallel Computing
Implement parallel algorithms.

**Requirements:**
- Matrix multiplication
- Image processing
- Monte Carlo simulation
- Compare threading/multiprocessing/asyncio
- Measure speedup
- Handle shared state

**Key Skills:** Parallelism, performance measurement
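
As a starting point for the Monte Carlo requirement, the sketch below estimates pi by splitting the samples into independent chunks. It is written so the executor class is a parameter, which makes the threading versus multiprocessing comparison a one-argument change. `count_hits` and `estimate_pi` are hypothetical names, and seeding each chunk with its index is a simplification chosen for reproducibility.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def count_hits(args):
    """Count random points that fall inside the unit quarter circle."""
    seed, samples = args
    rng = random.Random(seed)  # Per-chunk generator, so no shared state
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(samples, workers, executor_cls):
    """Split the work into one chunk per worker and combine the counts."""
    chunk = samples // workers
    tasks = [(seed, chunk) for seed in range(workers)]
    with executor_cls(max_workers=workers) as executor:
        hits = sum(executor.map(count_hits, tasks))
    return 4.0 * hits / (chunk * workers)

if __name__ == "__main__":
    print(estimate_pi(400_000, 4, ThreadPoolExecutor))
```

Passing `ProcessPoolExecutor` instead runs each chunk in its own process. On standard CPython builds with the GIL, threads are not expected to speed up this CPU-bound loop, so timing both variants with `time.perf_counter` gives the speedup measurement the project asks for.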

## Assessment Criteria

- [ ] Profile code to identify bottlenecks
- [ ] Choose appropriate data structures
- [ ] Optimize algorithms for time complexity
- [ ] Manage memory efficiently
- [ ] Use vectorization where applicable
- [ ] Implement effective caching
- [ ] Parallelize CPU-bound operations

## Resources

### Official Documentation
- [Python Performance Tips](https://wiki.python.org/moin/PythonSpeed/PerformanceTips) - Official tips
- [NumPy Docs](https://numpy.org/doc/) - NumPy documentation
- [Numba Docs](https://numba.pydata.org/) - JIT compilation

### Learning Platforms
- [High Performance Python](https://www.oreilly.com/library/view/high-performance-python/9781492055013/) - O'Reilly book
- [Python Performance](https://realpython.com/python-performance/) - Real Python guide
- [Optimizing Python](https://www.youtube.com/watch?v=zQeYx87mfyw) - PyCon talks

### Tools
- [cProfile](https://docs.python.org/3/library/profile.html) - CPU profiling
- [memory_profiler](https://pypi.org/project/memory-profiler/) - Memory profiling
- [py-spy](https://github.com/benfred/py-spy) - Sampling profiler
- [Scalene](https://github.com/plasma-umass/scalene) - CPU/GPU/memory profiler

## Next Steps

After mastering Python performance, explore:
- **Cython** - C extensions for Python
- **PyPy** - Alternative Python interpreter
- **Dask** - Parallel computing library
- **CUDA** - GPU programming with Python

Overview

This skill teaches practical Python performance optimization: profiling, bottleneck analysis, memory management, algorithm tuning, and high-performance libraries. It focuses on measurable improvements using profiling tools, vectorization, JIT/compiled extensions, and parallelism. Hands-on projects guide you from micro-benchmarks to production-ready optimizations.

How this skill works

The skill inspects code with profiling tools (timeit, cProfile, line_profiler, memory_profiler, py-spy) to locate CPU and memory hotspots. It teaches replacing slow patterns with efficient data structures, generator usage, NumPy vectorization, Numba/Cython acceleration, and multiprocessing or concurrent.futures for parallelism. You learn to benchmark changes, generate visual reports (flame graphs, HTML) and validate real speedups and memory reductions.

When to use it

  • You need to reduce latency or remove throughput bottlenecks in production code
  • Memory use grows without bound or you suspect leaks
  • A data pipeline or numeric workload is slow and can benefit from vectorization or JIT
  • You must scale CPU-bound tasks across multiple cores or machines
  • Before rewriting code in a lower-level language, to find the real hotspots

Best practices

  • Profile before optimizing: measure first, change second
  • Prefer algorithmic changes and better data structures over micro-optimizations
  • Use generators and memory views to avoid unnecessary copies
  • Apply NumPy/Numba/Cython only when profiling shows Python-level loops are the bottleneck
  • Benchmark with representative data and repeat tests to avoid noise
  • Limit global state, use context managers for deterministic cleanup, and use object pools to reuse objects that are expensive to create

Example use cases

  • Speed up a numeric kernel by replacing Python loops with NumPy or a Numba JIT function
  • Reduce peak RAM of a CSV ETL job using streaming, generators, and chunked processing
  • Identify and fix a memory leak using memory_profiler, gc, and weak references
  • Parallelize a Monte Carlo simulation with ProcessPoolExecutor for near-linear speedup on CPU cores
  • Build a profiling report with cProfile and flame graphs to justify architectural changes

FAQ

How do I know where to start optimizing?

Start with a profiler: cProfile or py-spy to find the functions consuming the most CPU time, and memory_profiler to find the ones allocating the most memory. Optimize the top hotspots first.

When should I use NumPy vs Numba vs Cython?

Use NumPy for vectorizable array work, Numba for numeric loops that can be JIT-compiled easily, and Cython when you need C-level control or the best possible speed after profiling.