
python-data-reviewer skill

/skills/python-data-reviewer

This skill helps review Pandas/NumPy code for vectorization, memory usage, validation, and performance to optimize data processing workflows.

npx playbooks add skill physics91/claude-vibe --skill python-data-reviewer

Review the files below or copy the command above to add this skill to your agents.

Files (1)
SKILL.md
8.2 KB
---
name: python-data-reviewer
description: |
  WHEN: Pandas/NumPy code review, data processing, vectorization, memory optimization
  WHAT: Vectorization patterns + Memory efficiency + Data validation + Performance optimization + Best practices
  WHEN NOT: Web framework → fastapi/django/flask-reviewer, General Python → python-reviewer
---

# Python Data Reviewer Skill

## Purpose
Reviews data science code for Pandas/NumPy efficiency, memory usage, and best practices.

## When to Use
- Pandas/NumPy code review
- Data processing optimization
- Memory efficiency check
- "Why is my data code slow?"
- ETL pipeline review

## Project Detection
- `pandas`, `numpy` in requirements.txt
- `.ipynb` Jupyter notebooks
- `data/`, `notebooks/` directories
- DataFrame operations in code
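
A minimal detection sketch covering these signals (the paths and heuristics here are illustrative, not the skill's actual logic):

```python
from pathlib import Path

def looks_like_data_project(root: str = ".") -> bool:
    """Heuristic: does this project appear to use Pandas/NumPy?"""
    root_path = Path(root)

    # Dependency files mentioning pandas or numpy
    for dep_file in ("requirements.txt", "pyproject.toml"):
        dep_path = root_path / dep_file
        if dep_path.is_file() and any(
            lib in dep_path.read_text() for lib in ("pandas", "numpy")
        ):
            return True

    # Jupyter notebooks or conventional data directories
    if any(root_path.rglob("*.ipynb")):
        return True
    return any((root_path / d).is_dir() for d in ("data", "notebooks"))
```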

## Workflow

### Step 1: Analyze Project
```
**Pandas**: 2.0+
**NumPy**: 1.24+
**Other**: polars, dask, vaex
**Visualization**: matplotlib, seaborn, plotly
**ML**: scikit-learn, xgboost
```

### Step 2: Select Review Areas
**AskUserQuestion:**
```
"Which areas to review?"
Options:
- Full data code review (recommended)
- Vectorization and performance
- Memory optimization
- Data validation
- Code organization
multiSelect: true
```

## Detection Rules

### Vectorization
| Check | Recommendation | Severity |
|-------|----------------|----------|
| iterrows() loop | Use vectorized operations | CRITICAL |
| apply() with simple func | Use built-in vectorized | HIGH |
| Manual loop over array | Use NumPy broadcasting | HIGH |
| List comprehension on Series | Use .map() or vectorize | MEDIUM |

```python
# BAD: iterrows (extremely slow)
for idx, row in df.iterrows():
    df.loc[idx, "total"] = row["price"] * row["quantity"]

# GOOD: Vectorized operation
df["total"] = df["price"] * df["quantity"]

# BAD: apply with simple operation
df["upper_name"] = df["name"].apply(lambda x: x.upper())

# GOOD: Built-in string method
df["upper_name"] = df["name"].str.upper()

# BAD: apply with condition
df["status"] = df["score"].apply(lambda x: "pass" if x >= 60 else "fail")

# GOOD: np.where or np.select
df["status"] = np.where(df["score"] >= 60, "pass", "fail")

# Multiple conditions
conditions = [
    df["score"] >= 90,
    df["score"] >= 60,
    df["score"] < 60,
]
choices = ["A", "B", "F"]
df["grade"] = np.select(conditions, choices)
```
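
One more pattern from the table above: a list comprehension over a Series (column names here are hypothetical):

```python
# BAD: List comprehension over a Series (Python loop, drops index alignment)
df["name_length"] = [len(x) for x in df["name"]]

# GOOD: .map() or the built-in string accessor
df["name_length"] = df["name"].map(len)
df["name_length"] = df["name"].str.len()
```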

### Memory Optimization
| Check | Recommendation | Severity |
|-------|----------------|----------|
| int64 for small ints | Use int8/int16/int32 | MEDIUM |
| object dtype for categories | Use category dtype | HIGH |
| Loading full file | Use chunks or usecols | HIGH |
| Keeping unused columns | Drop early | MEDIUM |

```python
# BAD: Default dtypes waste memory
df = pd.read_csv("large_file.csv")  # All int64, object

# GOOD: Specify dtypes
dtype_map = {
    "id": "int32",
    "age": "int8",
    "status": "category",
    "price": "float32",
}
df = pd.read_csv("large_file.csv", dtype=dtype_map)

# GOOD: Load only needed columns
df = pd.read_csv(
    "large_file.csv",
    usecols=["id", "name", "price"],
    dtype=dtype_map,
)

# GOOD: Process in chunks
chunks = pd.read_csv("huge_file.csv", chunksize=100_000)
result = pd.concat([process_chunk(chunk) for chunk in chunks])

# Memory check
print(df.info(memory_usage="deep"))

# Convert object to category if low cardinality
for col in df.select_dtypes(include=["object"]).columns:
    if df[col].nunique() / len(df) < 0.5:  # < 50% unique
        df[col] = df[col].astype("category")
```
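
For frames that are already loaded, `pd.to_numeric` with `downcast` shrinks numeric columns without hand-writing a dtype map (a sketch; check value ranges and precision requirements before applying broadly):

```python
# Downcast numeric columns after loading
for col in df.select_dtypes(include=["int64"]).columns:
    df[col] = pd.to_numeric(df[col], downcast="integer")
for col in df.select_dtypes(include=["float64"]).columns:
    df[col] = pd.to_numeric(df[col], downcast="float")
```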

### Data Validation
| Check | Recommendation | Severity |
|-------|----------------|----------|
| No null check | Validate nulls early | HIGH |
| No dtype validation | Assert expected types | MEDIUM |
| No range validation | Check value bounds | MEDIUM |
| Silent data issues | Raise or log warnings | HIGH |

```python
# GOOD: Data validation function
def validate_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Validate and clean input DataFrame."""

    # Required columns
    required_cols = ["id", "name", "price", "quantity"]
    missing = set(required_cols) - set(df.columns)
    if missing:
        raise ValueError(f"Missing columns: {missing}")

    # Null checks
    null_counts = df[required_cols].isnull().sum()
    if null_counts.any():
        logger.warning(f"Null values found:\n{null_counts[null_counts > 0]}")

    # Type validation
    assert df["id"].dtype in ["int32", "int64"], "id must be integer"
    assert df["price"].dtype in ["float32", "float64"], "price must be float"

    # Range validation
    invalid_prices = df[df["price"] < 0]
    if len(invalid_prices) > 0:
        logger.warning(f"Found {len(invalid_prices)} negative prices")
        df = df[df["price"] >= 0]

    # Duplicate check
    duplicates = df.duplicated(subset=["id"])
    if duplicates.any():
        logger.warning(f"Removing {duplicates.sum()} duplicates")
        df = df.drop_duplicates(subset=["id"])

    return df
```

### NumPy Patterns
| Check | Recommendation | Severity |
|-------|----------------|----------|
| Python loop over array | Use broadcasting | CRITICAL |
| np.append in loop | Pre-allocate array | HIGH |
| Repeated array creation | Reuse arrays | MEDIUM |
| Copy when not needed | Use views | MEDIUM |

```python
# BAD: Python loop
result = []
for i in range(len(arr)):
    result.append(arr[i] * 2 + 1)
result = np.array(result)

# GOOD: Vectorized
result = arr * 2 + 1

# BAD: np.append in loop (creates new array each time)
result = np.array([])
for x in data:
    result = np.append(result, process(x))

# GOOD: Pre-allocate
result = np.empty(len(data))
for i, x in enumerate(data):
    result[i] = process(x)

# ALTERNATIVE: np.vectorize for readability (still a Python-level loop
# under the hood, so it offers convenience, not speed)
result = np.vectorize(process)(data)

# BEST: Express the logic as pure NumPy operations where possible
result = np.sqrt(data) + np.log(data)

# BAD: Redundant copy (boolean indexing already returns a new array)
subset = arr[arr > 0].copy()

# GOOD: Boolean indexing alone is enough; a true no-copy view only comes
# from basic slicing (e.g. arr[10:100]), not from boolean/fancy indexing
subset = arr[arr > 0]
```
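
For the "repeated array creation" row, most NumPy ufuncs accept an `out=` argument so one buffer can be reused across iterations (a sketch that assumes all batches share a shape; `consume` is a hypothetical downstream step):

```python
# BAD: Every iteration allocates a fresh result array
for batch in batches:
    consume(batch * 2.0 + 1.0)

# GOOD: Reuse one pre-allocated buffer via the ufunc out= argument
buf = np.empty_like(batches[0])
for batch in batches:
    np.multiply(batch, 2.0, out=buf)
    np.add(buf, 1.0, out=buf)
    consume(buf)  # copy first if the consumer keeps a reference
```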

### Pandas Best Practices
| Check | Recommendation | Severity |
|-------|----------------|----------|
| Chained indexing | Use .loc/.iloc | HIGH |
| inplace=True | Assign result instead | MEDIUM |
| reset_index() abuse | Keep index when useful | LOW |
| df.append (removed in pandas 2.0) | Use pd.concat | HIGH |

```python
# BAD: Chained indexing (unpredictable)
df[df["price"] > 100]["status"] = "premium"  # May not work!

# GOOD: Use .loc
df.loc[df["price"] > 100, "status"] = "premium"

# BAD: df.append (removed in pandas 2.0)
result = pd.DataFrame()
for chunk in chunks:
    result = result.append(chunk)

# GOOD: pd.concat
result = pd.concat(chunks, ignore_index=True)

# BAD: Repeated reassignment (harder to read; intermediates still copied)
df = df.dropna()
df = df.reset_index(drop=True)
df = df.sort_values("date")

# GOOD: Method chaining (clearer, no named intermediates left in scope)
df = (
    df
    .dropna()
    .reset_index(drop=True)
    .sort_values("date")
)
```
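
The `inplace=True` row from the table, illustrated:

```python
# BAD: inplace=True (no memory or speed benefit, and it breaks method chaining)
df.dropna(inplace=True)
df.sort_values("date", inplace=True)

# GOOD: Assign the result
df = df.dropna().sort_values("date")
```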

## Response Template
```
## Python Data Code Review Results

**Project**: [name]
**Pandas**: 2.1 | **NumPy**: 1.26 | **Size**: ~1M rows

### Vectorization
| Status | File | Issue |
|--------|------|-------|
| CRITICAL | process.py:45 | iterrows() loop - use vectorized ops |

### Memory
| Status | File | Issue |
|--------|------|-------|
| HIGH | load.py:23 | object dtype for 'status' - use category |

### Data Validation
| Status | File | Issue |
|--------|------|-------|
| HIGH | etl.py:67 | No null value handling |

### NumPy
| Status | File | Issue |
|--------|------|-------|
| HIGH | calc.py:12 | np.append in loop - pre-allocate |

### Recommended Actions
1. [ ] Replace iterrows with vectorized operations
2. [ ] Convert low-cardinality columns to category
3. [ ] Add data validation before processing
4. [ ] Pre-allocate NumPy arrays in loops
```

## Performance Tips
```python
# Profile memory usage
df.info(memory_usage="deep")

# Profile time
%timeit df.apply(func)  # Jupyter magic
%timeit df["col"].map(func)

# Use query() for complex filters
filtered = df.query("price > 100 and category == 'A'")

# Use eval() for complex column expressions (not in-place by default, so assign the result)
df = df.eval("profit = revenue - cost")
```
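
Outside a notebook, the standard-library `timeit` module gives the same kind of measurement (a minimal sketch, assuming a `df` with the `price`/`quantity` columns used earlier):

```python
import timeit

vectorized = timeit.timeit(lambda: df["price"] * df["quantity"], number=100)
applied = timeit.timeit(
    lambda: df.apply(lambda row: row["price"] * row["quantity"], axis=1),
    number=100,
)
print(f"vectorized: {vectorized:.3f}s  apply: {applied:.3f}s")
```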

## Best Practices
1. **Vectorize**: Avoid loops, use broadcasting
2. **Dtypes**: Use smallest sufficient type
3. **Validate**: Check data quality early
4. **Profile**: Measure before optimizing
5. **Chunk**: Process large files incrementally

## Integration
- `python-reviewer`: General Python patterns
- `perf-analyzer`: Performance profiling
- `coverage-analyzer`: Test coverage for data code

Overview

This skill reviews Pandas and NumPy data code for performance, memory usage, and data quality. It identifies non-vectorized patterns, inefficient dtype choices, missing validation, and common anti-patterns. The goal is actionable recommendations to speed processing, reduce memory, and make pipelines more robust.

How this skill works

The reviewer scans project files and notebooks for Pandas/NumPy patterns, detecting loops, apply/iterrows usage, dtype issues, and unsafe indexing. It flags costly operations such as np.append inside loops, object dtypes for low-cardinality data, and missing null/type checks. For each finding it suggests a concrete replacement (vectorized ops, dtype maps, chunking, validation functions) and a severity level to prioritize fixes.
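
As a rough illustration of the kind of pattern matching involved (a simplified sketch, not the skill's actual implementation):

```python
import re
from pathlib import Path

# A few of the anti-patterns described above, as simple regexes
ANTI_PATTERNS = {
    r"\.iterrows\(\)": "CRITICAL: iterrows() loop - prefer vectorized ops",
    r"np\.append\(": "HIGH: np.append in a loop - pre-allocate instead",
    r"inplace\s*=\s*True": "MEDIUM: inplace=True - assign the result instead",
}

def scan_file(path: Path) -> list[str]:
    """Return findings as 'file:line: message' strings."""
    findings = []
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        for pattern, message in ANTI_PATTERNS.items():
            if re.search(pattern, line):
                findings.append(f"{path}:{lineno}: {message}")
    return findings
```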

When to use it

  • Reviewing ETL or preprocessing notebooks and scripts that use pandas/numpy
  • Diagnosing slow data transformations or excessive memory use
  • Optimizing production data pipelines before deployment
  • Auditing data validation and quality checks in ingestion code
  • Refactoring code to use vectorized operations and smaller dtypes

Best practices

  • Prefer vectorized Pandas/NumPy operations over iterrows/apply loops
  • Specify dtypes on read and convert low-cardinality object columns to category
  • Validate required columns, nulls, ranges, and duplicates early
  • Process very large files in chunks and load only needed columns
  • Avoid chained indexing; use .loc/.iloc and assign results instead

Example use cases

  • Convert iterrows/apply loops to vectorized arithmetic or np.where
  • Reduce memory by mapping int64->int32/int8 and object->category
  • Add upfront validate_dataframe(df) to check columns, nulls, and ranges
  • Replace np.append in loops with pre-allocated arrays or pure NumPy ops
  • Profile hotspots with df.info(memory_usage='deep') and %timeit before changes

FAQ

How do I know when to use category dtype?

If a column is object/string and has low cardinality (example: unique count < ~50% of rows), converting to category typically cuts memory and speeds joins/groupbys.
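
A quick way to check the effect on a specific column (the `status` column name is hypothetical):

```python
before = df["status"].memory_usage(deep=True)
after = df["status"].astype("category").memory_usage(deep=True)
print(f"object: {before:,} bytes -> category: {after:,} bytes")
```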

Is apply always bad?

Not always. apply is fine for complex row-wise logic that cannot be vectorized, but for simple element-wise transforms use built-in vectorized methods (.str, np.where, arithmetic) for much better performance.
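
For example (column names and the `quote_shipping` helper are hypothetical):

```python
# Reasonable apply: genuinely row-wise logic across several columns
df["shipping"] = df.apply(
    lambda row: quote_shipping(row["weight"], row["country"], row["express"]),
    axis=1,
)

# Simple element-wise transform: use the vectorized accessor instead of apply
df["domain"] = df["email"].str.split("@").str[1]
```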