
This skill helps you clean messy tabular data by deduplicating, imputing missing values, processing text, and handling outliers to prepare datasets for analysis.

npx playbooks add skill benchflow-ai/skillsbench --skill data_cleaning

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
4.2 KB
---
name: data_cleaning
description: Clean messy tabular datasets with deduplication, missing value imputation, outlier handling, and text processing. Use when dealing with dirty data that has duplicates, nulls, or inconsistent formatting.
---

# Data Cleaning Framework

Comprehensive, modular data cleaning framework for tabular datasets. Provides strategy-based cleaning operations including deduplication, missing value imputation, text processing, and outlier handling through a configurable pipeline.

## Core Components

### CleaningStrategies
Collection of static methods for specific cleaning operations:

#### Deduplication
- `remove_duplicates(df, subset=None)` - Remove duplicate rows. Decide from the prompt and the dataset's context whether duplicates are genuinely redundant; carefully evaluate whether it still makes sense to let them remain in the downstream analysis. Always inspect the data first and remove duplicates based on the metadata and the downstream goal (see the sketch below).
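
A minimal sketch of such a deduplication strategy, assuming a pandas DataFrame (the bundled implementation may add its own logging and options):

```python
import pandas as pd

def remove_duplicates(df: pd.DataFrame, subset=None) -> pd.DataFrame:
    """Drop duplicate rows, optionally comparing only `subset` columns.

    Sketch only: the shipped strategy may use different keep/logging behavior.
    """
    before = len(df)
    cleaned = df.drop_duplicates(subset=subset, keep="first").reset_index(drop=True)
    print(f"remove_duplicates: dropped {before - len(cleaned)} rows")
    return cleaned
```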

#### Missing Value Handling
Logic: If a column is not critical to the downstream task and dropping the whole row would be an unnecessary cost, try to impute the missing values; otherwise drop the rows (see the sketch after this list).
- `drop_missing(df, columns)` - Drop rows with missing values in the specified columns when those values are critical to the downstream task's performance.
- `impute_median(df, columns)` - Impute numerical columns with their median.
- `impute_knn(df, target_features, n_neighbors=5)` - KNN-based imputation for columns whose values can plausibly be inferred from other columns. Carefully consider the semantic and logical relationships between columns (in real-world terms) before choosing this method; it only improves imputation when such relationships exist.
- `impute_mode(df, columns)` - Impute categorical columns with their mode when the target column is largely independent of other columns and simply needs an acceptable fill value.
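
A minimal sketch of these imputation strategies, assuming pandas and scikit-learn's KNNImputer; the `target_features` shape follows the usage example below, and the actual implementation may differ (for instance, it may encode categorical features before KNN):

```python
import pandas as pd
from sklearn.impute import KNNImputer

def drop_missing(df: pd.DataFrame, columns) -> pd.DataFrame:
    # Drop rows where any critical column is null.
    return df.dropna(subset=columns).reset_index(drop=True)

def impute_median(df: pd.DataFrame, columns) -> pd.DataFrame:
    # Fill numeric columns with their median value.
    out = df.copy()
    for col in columns:
        out[col] = out[col].fillna(out[col].median())
    return out

def impute_mode(df: pd.DataFrame, columns) -> pd.DataFrame:
    # Fill categorical columns with their most frequent value.
    out = df.copy()
    for col in columns:
        mode = out[col].mode()
        if not mode.empty:
            out[col] = out[col].fillna(mode.iloc[0])
    return out

def impute_knn(df: pd.DataFrame, target_features, n_neighbors=5) -> pd.DataFrame:
    # Infer missing target values from related columns.
    # Assumes the listed feature columns are numeric (or already encoded).
    out = df.copy()
    imputer = KNNImputer(n_neighbors=n_neighbors)
    for target, spec in target_features.items():
        cols = [target] + list(spec["features"])
        out[cols] = imputer.fit_transform(out[cols])
    return out
```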

#### Text Processing
- `process_text(df, columns, operation)` - Text operations: some columns may need text processing to extract numbers or remove unnecessary content. Preview the column values against the downstream goal to determine whether this is needed (operations: extract_numbers, clean_whitespace, extract_email, lowercase, remove_special; see the sketch below).
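
A minimal sketch of how such operations might be dispatched, assuming pandas string methods; the operation names come from the list above, but the exact regexes here are illustrative:

```python
import pandas as pd

def process_text(df: pd.DataFrame, columns, operation: str) -> pd.DataFrame:
    """Apply a named text operation to one or more string columns (sketch)."""
    out = df.copy()
    for col in columns:
        s = out[col].astype("string")
        if operation == "extract_numbers":
            # Pull the first numeric token, e.g. "$1,299.00" -> 1299.00
            out[col] = pd.to_numeric(
                s.str.replace(",", "", regex=False).str.extract(r"(\d+\.?\d*)")[0],
                errors="coerce",
            )
        elif operation == "clean_whitespace":
            out[col] = s.str.strip().str.replace(r"\s+", " ", regex=True)
        elif operation == "extract_email":
            out[col] = s.str.extract(r"([\w\.-]+@[\w\.-]+\.\w+)")[0]
        elif operation == "lowercase":
            out[col] = s.str.lower()
        elif operation == "remove_special":
            out[col] = s.str.replace(r"[^\w\s]", "", regex=True)
        else:
            raise ValueError(f"Unknown operation: {operation}")
    return out
```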

#### Outlier Handling
Logic: Again, weigh how costly outlier removal is for the sample size and for the critical columns under investigation. Either keep the rows by capping the outlier values at more reasonable bounds, or remove the whole rows if doing so has negligible impact (see the sketch after this list).
- `cap_outliers_iqr(df, columns, multiplier=1.5)` - Winsorize using the IQR to retain reasonable signal in the data
- `remove_outliers_iqr(df, columns, multiplier=1.5)` - Remove outliers using IQR
- `remove_outliers_zscore(df, columns, threshold=3.0)` - Remove outliers using Z-score
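
A minimal sketch of the outlier handlers, assuming pandas; the bounds below are the standard IQR/Z-score formulations and may differ from the shipped code:

```python
import pandas as pd

def cap_outliers_iqr(df: pd.DataFrame, columns, multiplier: float = 1.5) -> pd.DataFrame:
    # Winsorize: clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the bounds.
    out = df.copy()
    for col in columns:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - multiplier * iqr, upper=q3 + multiplier * iqr)
    return out

def remove_outliers_iqr(df: pd.DataFrame, columns, multiplier: float = 1.5) -> pd.DataFrame:
    # Drop rows whose value falls outside the IQR fence in any listed column.
    out = df.copy()
    for col in columns:
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out = out[out[col].between(q1 - multiplier * iqr, q3 + multiplier * iqr)]
    return out.reset_index(drop=True)

def remove_outliers_zscore(df: pd.DataFrame, columns, threshold: float = 3.0) -> pd.DataFrame:
    # Drop rows more than `threshold` standard deviations from the column mean.
    out = df.copy()
    for col in columns:
        z = (out[col] - out[col].mean()) / out[col].std()
        out = out[z.abs() <= threshold]
    return out.reset_index(drop=True)
```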

### DataCleaningPipeline
Orchestrates multiple cleaning steps with logging and progress tracking.
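
A minimal sketch of how such an orchestrator could be structured; the bundled pipeline may record additional metadata in its log:

```python
import pandas as pd

class DataCleaningPipeline:
    """Chains strategy functions and records what each step did (sketch)."""

    def __init__(self, name: str = "pipeline"):
        self.name = name
        self.steps = []   # (func, kwargs, description)
        self._log = []    # one record per executed step

    def add_step(self, func, description: str = "", **kwargs):
        # Return self so steps can be chained fluently.
        self.steps.append((func, kwargs, description or func.__name__))
        return self

    def execute(self, df: pd.DataFrame, verbose: bool = False) -> pd.DataFrame:
        out = df.copy()
        for func, kwargs, description in self.steps:
            rows_before = len(out)
            out = func(out, **kwargs)
            self._log.append({
                "step": func.__name__,
                "description": description,
                "rows_before": rows_before,
                "rows_after": len(out),
            })
            if verbose:
                print(f"[{self.name}] {description}: {rows_before} -> {len(out)} rows")
        return out

    def get_log(self) -> pd.DataFrame:
        return pd.DataFrame(self._log)
```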

## Usage

```python
from data_cleaning import CleaningStrategies, DataCleaningPipeline

# Create pipeline
pipeline = DataCleaningPipeline(name="Transaction Data")

# Add cleaning steps
pipeline.add_step(
    CleaningStrategies.remove_duplicates,
    subset=['customer_id', 'order_date'],
    description="Remove duplicate orders"
).add_step(
    CleaningStrategies.drop_missing,
    columns=['customer_id', 'amount'],
    description="Drop rows with missing critical fields"
).add_step(
    CleaningStrategies.impute_knn,
    target_features={
        'income': {'features': ['age', 'education'], 'type': 'numeric'}
    },
    description="KNN imputation for income"
).add_step(
    CleaningStrategies.cap_outliers_iqr,
    columns=['amount'],
    multiplier=1.5,
    description="Cap price outliers"
)

# Execute pipeline
df_clean = pipeline.execute(raw_df, verbose=True)

# Get execution log
log_df = pipeline.get_log()
```

## Input
- Any tabular DataFrame requiring cleaning

## Output
- Cleaned DataFrame with execution log available

## Best Practices
- Chain multiple steps using method chaining
- Use verbose=True to monitor pipeline progress
- Review execution log for quality assurance
- Extend CleaningStrategies with domain-specific methods as needed

Overview

This skill cleans messy tabular datasets by applying deduplication, missing-value imputation, outlier handling, and text processing in a configurable pipeline. It provides strategy-based operations and an orchestrator that logs each step so you can reproduce and review cleaning decisions. Use it to standardize data before analysis, modeling, or reporting.

How this skill works

The skill exposes a set of CleaningStrategies (static operations) such as remove_duplicates, impute_median, impute_knn, process_text, and outlier handlers. A DataCleaningPipeline composes these strategies into an ordered sequence of steps with descriptions, parameters, and verbose execution logging. Running the pipeline on a DataFrame returns a cleaned DataFrame and an execution log for auditing.

When to use it

  • Preparing raw tabular data for ML training or statistical analysis
  • Fixing datasets with duplicate rows, nulls, inconsistent formatting, or noisy text
  • Imputing missing values when dropping rows would harm sample size
  • Removing or capping outliers before aggregations or modeling
  • Standardizing text fields (emails, numbers, casing) for joins or feature extraction

Best practices

  • Preview columns and decide which fields are critical before dropping missing values
  • Prefer imputation (median, mode, or KNN) when preserving sample size matters
  • Log each pipeline step and review the execution log for reproducibility
  • Use IQR winsorization (cap_outliers_iqr) to retain signal when outliers are sparse
  • Extend CleaningStrategies with domain-specific rules (e.g., phone or SKU parsing)

Example use cases

  • Remove duplicate orders by customer and date, then drop rows missing customer_id or amount
  • Impute missing income using KNN based on age and education before training a classifier
  • Cap extreme transaction amounts using IQR winsorization to stabilize revenue summaries
  • Clean text columns by lowercasing, stripping whitespace, and extracting numeric identifiers
  • Apply mode imputation for categorical fields that are independent of other features

FAQ

What input formats are supported?

Any tabular DataFrame (e.g., pandas) that supports row/column operations can be processed.

How can I audit what the pipeline did?

The DataCleaningPipeline produces an execution log with step descriptions, parameters, and outcomes to review and export.
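
For example, assuming `get_log()` returns a pandas DataFrame as in the usage snippet above (the column names below are illustrative):

```python
# Review the per-step audit trail and persist it alongside the cleaned data.
log_df = pipeline.get_log()
print(log_df)                                  # step, description, row counts, etc.
log_df.to_csv("cleaning_log.csv", index=False)
```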