
This skill helps accelerate computational tasks by parallelizing work across cores using joblib for grid search and batch processing.

```
npx playbooks add skill benchflow-ai/skillsbench --skill parallel-processing
```

---
name: parallel-processing
description: Parallel processing with joblib for grid search and batch computations. Use when speeding up computationally intensive tasks across multiple CPU cores.
---

# Parallel Processing with joblib

Speed up computationally intensive tasks by distributing work across multiple CPU cores.

## Basic Usage

```python
from joblib import Parallel, delayed

def process_item(x):
    """Process a single item."""
    return x ** 2

# Sequential
results = [process_item(x) for x in range(100)]

# Parallel (uses all available cores)
results = Parallel(n_jobs=-1)(
    delayed(process_item)(x) for x in range(100)
)
```

## Key Parameters

- **n_jobs**: `-1` for all cores, `1` for sequential, or specific number
- **verbose**: `0` (silent), `10` (progress), `50` (detailed)
- **backend**: `'loky'` (separate processes, best for CPU-bound work; the default) or `'threading'` (threads, best for I/O-bound work); see the sketch below
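
The backend choice matters most when tasks spend their time waiting rather than computing. As a rough illustration (the URLs and the use of `requests` here are hypothetical placeholders), threads suit network-bound work because they avoid process start-up and argument serialization:

```python
import requests  # hypothetical I/O-bound workload
from joblib import Parallel, delayed

def fetch_status(url):
    """Fetch one URL; the task is network-bound, so threads fit well."""
    return requests.get(url, timeout=10).status_code

urls = [f"https://example.com/page/{i}" for i in range(20)]  # placeholder URLs

# 'threading' avoids process start-up and serialization costs for I/O-bound work
statuses = Parallel(n_jobs=8, backend='threading', verbose=10)(
    delayed(fetch_status)(u) for u in urls
)
```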

## Grid Search Example

```python
from joblib import Parallel, delayed
from itertools import product

def evaluate_params(param_a, param_b):
    """Evaluate one parameter combination; return None on failure."""
    try:
        score = expensive_computation(param_a, param_b)
    except Exception:
        return None  # failed combinations are dropped below
    return {'param_a': param_a, 'param_b': param_b, 'score': score}

# Define parameter grid
params = list(product([0.1, 0.5, 1.0], [10, 20, 30]))

# Parallel grid search
results = Parallel(n_jobs=-1, verbose=10)(
    delayed(evaluate_params)(a, b) for a, b in params
)

# Drop failed evaluations (None results), then pick the best
results = [r for r in results if r is not None]
best = max(results, key=lambda x: x['score'])
```

## Pre-computing Shared Data

When all tasks need the same data, pre-compute it once:

```python
# Pre-compute once
shared_data = load_data()

def process_with_shared(params, data):
    return compute(params, data)

# Pass shared data to each task
results = Parallel(n_jobs=-1)(
    delayed(process_with_shared)(p, shared_data)
    for p in param_list
)
```
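
With the default `loky` backend, joblib also memory-maps large NumPy array arguments automatically (those bigger than `max_nbytes`, 1 MB by default), so workers read them from a shared file instead of each holding a full copy. A minimal sketch, assuming a NumPy workload:

```python
import numpy as np
from joblib import Parallel, delayed

shared_array = np.random.rand(10_000, 100)  # ~8 MB, above the 1 MB default threshold

def row_sum(i, data):
    """Sum one row of the shared array."""
    return data[i].sum()

# Arrays larger than max_nbytes are memmapped and shared across workers
row_sums = Parallel(n_jobs=-1, max_nbytes='1M')(
    delayed(row_sum)(i, shared_array) for i in range(100)
)
```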

## Performance Tips

- Only worth it when each task takes roughly >0.1 s; below that, scheduling and serialization overhead can dominate (see the timing sketch below)
- Watch memory usage: process-based backends pass each worker a copy of its arguments (large NumPy arrays are memmapped instead, as noted above)
- Use `verbose=10` to monitor progress on long runs
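
A simple way to check whether parallelism pays off is to time both paths on a small sample. A sketch with a stand-in workload (replace `work` with your real per-item function):

```python
import time
from joblib import Parallel, delayed

def work(x):
    """Stand-in CPU-bound task; substitute your real computation."""
    return sum(i * i for i in range(100_000)) + x

items = range(100)

start = time.perf_counter()
sequential = [work(x) for x in items]
t_seq = time.perf_counter() - start

start = time.perf_counter()
parallel = Parallel(n_jobs=-1)(delayed(work)(x) for x in items)
t_par = time.perf_counter() - start

print(f"sequential: {t_seq:.2f}s  parallel: {t_par:.2f}s")
```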

## Overview

This skill provides practical patterns for parallel processing with joblib to speed up grid search and batch computations across multiple CPU cores. It focuses on simple, reliable recipes for parallel loops, grid searches, and sharing pre-computed data to reduce redundant work. Use it to cut runtime for CPU-bound tasks while controlling memory and overhead.

## How this skill works

The skill shows how to wrap individual tasks with `joblib.delayed` and execute them with `Parallel(n_jobs=...)`. It covers choosing a backend (`'loky'` for CPU-bound work, `'threading'` for I/O-bound work), setting verbosity to monitor progress, and collecting results. It also demonstrates pre-computing shared data once and passing it into worker calls to avoid repeated loading or computation.

## When to use it

- Speeding up independent, CPU-bound computations that each take noticeably longer than ~0.1 s.
- Running grid searches or parameter sweeps where each configuration can be evaluated independently.
- Batch processing of many items (data transforms, simulations, model evaluations).
- When you have a multi-core machine and want a simple parallelism layer without rewriting code for `multiprocessing`.

## Best practices

- Use `n_jobs=-1` to use all cores, or set a specific number to leave room for other processes.
- Prefer `backend='loky'` for CPU-bound tasks and `backend='threading'` for I/O-bound tasks.
- Pre-compute and pass shared data to avoid redundant loading; watch memory, since workers may copy data.
- Only parallelize tasks that are heavy enough (roughly >0.1 s each) to amortize joblib's overhead.
- Enable verbosity (e.g., `verbose=10`) on long runs to monitor progress and diagnose stalls.

## Example use cases

- Grid searching hyperparameters where each evaluation runs an expensive simulation or training step.
- Applying a costly transformation to millions of records by splitting the work into chunks across cores.
- Running independent model evaluations in ensemble or cross-validation workflows in parallel.
- Pre-processing large datasets by loading shared reference data once and applying per-item computations.

## FAQ

### What n_jobs value should I pick?

Start with `n_jobs=-1` to use all cores. If you need to keep the machine responsive or run other heavy processes, use `n_jobs=-2` (all cores but one) or compute an explicit cap, as sketched below.
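
A minimal sketch of capping the worker count with joblib's own `cpu_count` helper:

```python
from joblib import Parallel, cpu_count, delayed

def square(x):
    return x ** 2

# Leave one core free for the rest of the system
n_jobs = max(1, cpu_count() - 1)
results = Parallel(n_jobs=n_jobs)(delayed(square)(x) for x in range(100))
```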

### How do I avoid high memory usage?

Pre-compute only essential shared data, use smaller data slices per task, and reduce `n_jobs` so fewer worker processes hold copies concurrently. One common pattern is chunking, shown below.
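
A sketch of the chunking pattern: batching items into larger tasks reduces per-call overhead and the number of argument copies in flight at once (`process_chunk` and the sizes here are illustrative):

```python
from joblib import Parallel, delayed

def process_chunk(chunk):
    """Process a slice of items so each task carries one payload."""
    return [x ** 2 for x in chunk]

items = list(range(1_000_000))
chunk_size = 10_000
chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

# Fewer, larger tasks: less scheduling overhead, fewer copies in memory at once
chunked_results = Parallel(n_jobs=4)(delayed(process_chunk)(c) for c in chunks)
results = [x for chunk in chunked_results for x in chunk]
```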