
This skill helps you engineer dataset features for machine learning by encoding categoricals, scaling numerics, generating interactions, and selecting relevant features.

npx playbooks add skill benchflow-ai/skillsbench --skill feature_engineering

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
5.5 KB
---
name: feature_engineering
description: Engineer dataset features before ML or Causal Inference. Methods include encoding categorical variables, scaling numerics, creating interactions, and selecting relevant features.
---

# Feature Engineering Framework

Comprehensive, modular feature engineering framework for general tabular datasets. Provides strategy-based operations including numerical scaling, categorical encoding, polynomial features, and feature selection through a configurable pipeline.

## Core Components

### FeatureEngineeringStrategies
Collection of static methods for feature engineering operations:

#### Numerical Features (if interpretability is not a concern)
- `scale_numerical(df, columns, method)` - Scale using 'standard', 'minmax', or 'robust'
- `create_bins(df, columns, n_bins, strategy)` - Discretize using 'uniform', 'quantile', or 'kmeans'
- `create_polynomial_features(df, columns, degree)` - Generate polynomial and interaction terms
- `create_interaction_features(df, column_pairs)` - Create multiplication interactions
- `create_log_features(df, columns)` - Log-transform for skewed distributions
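These strategies wrap standard pandas/NumPy operations. As a rough illustration (not the skill's internals), a log feature and a quantile bin could be built directly; `income` and `age` are placeholder column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [25_000, 40_000, 1_200_000], "age": [23, 35, 61]})

# Log-transform a right-skewed column; log1p handles zeros safely.
df["income_log"] = np.log1p(df["income"])

# Quantile binning into three roughly equal-sized buckets (integer labels 0-2),
# analogous to create_bins(..., strategy='quantile').
df["age_bin"] = pd.qcut(df["age"], q=3, labels=False)
```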

#### Categorical Features
- `encode_categorical(df, columns, method)` - Encode using 'onehot', 'label', 'frequency', or 'hash'
- `create_category_aggregations(df, categorical_col, numerical_cols, agg_funcs)` - Group-level statistics
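For intuition, here is a plain-pandas sketch of what frequency encoding and group-level aggregation produce; the column names are hypothetical, and this is not the library's implementation:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "SF"],
    "spend": [10.0, 20.0, 15.0, 30.0],
})

# Frequency encoding: replace each category with its relative frequency,
# analogous to encode_categorical(..., method='frequency').
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

# Group-level statistic broadcast back onto each row, analogous to
# create_category_aggregations(df, 'city', ['spend'], ['mean']).
df["spend_mean_by_city"] = df.groupby("city")["spend"].transform("mean")
```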

#### Binary Features
- `convert_to_binary(df, columns)` - Convert Yes/No, True/False to 0/1 (cast to int)
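A minimal sketch of the same conversion in plain pandas (illustrative only; it assumes Yes/No and True/False tokens):

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["Yes", "No", "Yes"], "active": [True, False, True]})

# Map truthy/falsy tokens to 0/1 and cast to int so downstream
# validation sees a numeric dtype.
mapping = {"Yes": 1, "No": 0, True: 1, False: 0}
for col in ["smoker", "active"]:
    df[col] = df[col].map(mapping).astype(int)
```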

#### Data Quality Validation
- `validate_numeric_features(df, exclude_cols)` - Verify all features are numeric (except ID columns)
- `validate_no_constants(df, exclude_cols)` - Remove constant columns with no variance
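To replicate these checks outside the pipeline, standalone equivalents are straightforward; the function names below are hypothetical, not the skill's API:

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame, exclude_cols=()) -> pd.DataFrame:
    """Drop columns with a single unique value, keeping excluded (ID) columns."""
    constants = [
        c for c in df.columns
        if c not in exclude_cols and df[c].nunique(dropna=False) <= 1
    ]
    return df.drop(columns=constants)

def assert_all_numeric(df: pd.DataFrame, exclude_cols=()) -> None:
    """Raise if any non-excluded column is still string/object typed."""
    bad = df.drop(columns=list(exclude_cols)).select_dtypes(exclude="number").columns
    if len(bad) > 0:
        raise TypeError(f"Non-numeric feature columns remain: {list(bad)}")
```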

#### Feature Selection
- `select_features_variance(df, columns, threshold)` - Remove low-variance features (default threshold: 0.01). Columns whose values are nearly all identical add little information, so dropping them is a cheap way to reduce dimensionality.
- `select_features_correlation(df, columns, threshold)` - Remove highly correlated features
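As a sketch of the underlying logic (illustrative, not the skill's implementation), both filters can be expressed in a few lines of pandas:

```python
import pandas as pd

def drop_low_variance(df: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
    """Drop numeric columns whose variance falls below the threshold."""
    variances = df.select_dtypes(include="number").var()
    return df.drop(columns=variances[variances < threshold].index.tolist())

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """For each pair with |corr| above the threshold, drop the later column."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    to_drop = {
        cols[j]
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] > threshold
    }
    return df.drop(columns=list(to_drop))
```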

### FeatureEngineeringPipeline
Orchestrates multiple feature engineering steps with logging.

**CRITICAL REQUIREMENTS:**
1. **ALL output features MUST be numeric (int or float)** - downstream analysis (e.g., DID) cannot use string/object columns
2. **Preview data types BEFORE processing**: `df.dtypes` and `df.head()` to check actual values
3. **Encode ALL categorical variables** - strings like "degree", "age_range" must be converted to numbers
4. **Verify output**: Final dataframe should have `df.select_dtypes(include='number').shape[1] == df.shape[1] - 1` (excluding ID column)
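The numeric check in requirement 4 can be run as a one-line assertion once the pipeline has executed (assuming `df_complete` and the single ID column from the usage example below):

```python
# All columns except the one ID column must be numeric before modeling.
numeric_cols = df_complete.select_dtypes(include="number").shape[1]
assert numeric_cols == df_complete.shape[1] - 1, "Encode remaining string columns first."
```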

## Usage Example

```python
from feature_engineering import FeatureEngineeringStrategies, FeatureEngineeringPipeline

# Create pipeline
pipeline = FeatureEngineeringPipeline(name="Demographics")

# Add feature engineering steps
pipeline.add_step(
    FeatureEngineeringStrategies.convert_to_binary,
    columns=['<column5>', '<column2>'],
    description="Convert binary survey responses to 0/1"
).add_step(
    FeatureEngineeringStrategies.encode_categorical,
    columns=['<column3>', '<column7>'],
    method='onehot',
    description="One-hot encode categorical features"
).add_step(
    FeatureEngineeringStrategies.scale_numerical,
    columns=['<column10>', '<column1>'],
    method='standard',
    description="Standardize numerical features"
).add_step(
    FeatureEngineeringStrategies.validate_numeric_features,
    exclude_cols=['<ID Column>'],
    description="Verify all features are numeric before modeling"
).add_step(
    FeatureEngineeringStrategies.validate_no_constants,
    exclude_cols=['<ID Column>'],
    description="Remove constant columns with no predictive value"
).add_step(
    FeatureEngineeringStrategies.select_features_variance,
    columns=[],  # Empty = auto-select all numerical
    threshold=0.01,
    description="Remove low-variance features"
)

# Execute pipeline
# df_complete contains the original columns plus the engineered features
df_complete = pipeline.execute(your_cleaned_df, verbose=True)

# Shortcut: keep only the ID column plus the engineered features
engineered_features = pipeline.get_engineered_features()
df_id_pure_features = df_complete[['<ID Column>'] + engineered_features]

# Get execution log
log_df = pipeline.get_log()
```

## Input
- A cleaned DataFrame: any data processing, imputation, or row/column drops MUST be completed before feature engineering

## Output
- DataFrame with both original and engineered columns
- Engineered feature names accessible via `pipeline.get_engineered_features()`
- Execution log available via `pipeline.get_log()`

## Key Features
- Multiple encoding methods for categorical variables
- Automatic handling of high-cardinality categoricals
- Polynomial and interaction feature generation
- Built-in feature selection for dimensionality reduction
- Pipeline pattern for reproducible transformations

## Best Practices
- **Always validate data types** before downstream analysis: Use `validate_numeric_features()` after encoding
- **Check for constant columns** that provide no information: Use `validate_no_constants()` before modeling
- Convert binary features before other transformations
- Use one-hot encoding for low-cardinality categoricals
- Use KNN imputation when missing values can be inferred from other, related columns
- Use hash encoding for high-cardinality features (IDs, etc.)
- Apply variance threshold to remove constant features
- Check correlation matrix before modeling to avoid multicollinearity
- MAKE SURE ALL ENGINEERED FEATURES ARE NUMERICAL

Overview

This skill engineers dataset features for machine learning and causal inference workflows. It applies encoding, scaling, interactions, binning, and selection through a configurable pipeline to produce numeric-ready feature sets. The pipeline logs each step and returns both the transformed DataFrame and engineered feature names.

How this skill works

The skill provides a set of strategy methods (scaling, encoding, binning, polynomial/interaction creation, log transforms, binary conversion, and validation). Users preview dtypes and sample values, then assemble these strategies into a FeatureEngineeringPipeline that runs steps in order, enforces that all output features are numeric, and records an execution log. Built-in selection methods remove low-variance or highly correlated features to reduce dimensionality before modeling.

When to use it

  • Preparing tabular data for supervised learning or causal analysis
  • Converting categorical and binary data into numeric formats required by models
  • Creating interaction and polynomial features to capture nonlinear relationships
  • Reducing dimensionality by removing constant, low-variance, or highly correlated features
  • Standardizing or robust-scaling numeric features before model training

Best practices

  • Preview df.dtypes and df.head() before any transformations
  • Convert binary columns to 0/1 first, then encode other categoricals
  • Use one-hot for low-cardinality categories and hash encoding for high-cardinality IDs
  • Apply variance threshold and correlation-based selection to avoid multicollinearity
  • Validate that final DataFrame contains only numeric columns (excluding ID) before modeling

Example use cases

  • Encode survey and demographic fields, convert Yes/No answers to binary, and scale numeric covariates for a classifier
  • Create polynomial and interaction terms to improve predictive power for a regression model
  • Aggregate numerical statistics by category (group-level means, counts) to add informative features
  • Run a reproducible pipeline that logs every transformation and outputs engineered feature names for model training
  • Remove low-variance and highly correlated features prior to causal effect estimation

FAQ

What must I check before running the pipeline?

Ensure the input DataFrame is cleaned and imputed; preview df.dtypes and df.head() so you know which columns need encoding.

How do I guarantee the model-ready output is numeric?

Convert binary fields, encode all categorical columns, and use validate_numeric_features. The pipeline enforces that the final feature set is numeric (except the ID column).