
This skill helps you engineer dataset features for machine learning by encoding categoricals, scaling numerics, generating interactions, and selecting relevant features.

npx playbooks add skill benchflow-ai/skillsbench --skill feature_engineering

Review the files below or copy the command above to add this skill to your agents.

Files (2)
SKILL.md
5.5 KB
---
name: feature_engineering
description: Engineer dataset features before ML or Causal Inference. Methods include encoding categorical variables, scaling numerics, creating interactions, and selecting relevant features.
---

# Feature Engineering Framework

Comprehensive, modular feature engineering framework for general tabular datasets. Provides strategy-based operations including numerical scaling, categorical encoding, polynomial features, and feature selection through a configurable pipeline.

## Core Components

### FeatureEngineeringStrategies
Collection of static methods for feature engineering operations:

#### Numerical Features (if interpretability is not a concern)
- `scale_numerical(df, columns, method)` - Scale using 'standard', 'minmax', or 'robust'
- `create_bins(df, columns, n_bins, strategy)` - Discretize using 'uniform', 'quantile', or 'kmeans'
- `create_polynomial_features(df, columns, degree)` - Generate polynomial and interaction terms
- `create_interaction_features(df, column_pairs)` - Create multiplication interactions
- `create_log_features(df, columns)` - Log-transform for skewed distributions
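These strategies wrap standard pandas/NumPy operations. As a rough illustration (not the skill's internals), a log feature and a quantile bin could be built directly; `income` and `age` are placeholder column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [25_000, 40_000, 1_200_000], "age": [23, 35, 61]})

# Log-transform a right-skewed column; log1p handles zeros safely.
df["income_log"] = np.log1p(df["income"])

# Quantile binning into three roughly equal-sized buckets (integer labels 0-2),
# analogous to create_bins(..., strategy='quantile').
df["age_bin"] = pd.qcut(df["age"], q=3, labels=False)
```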

#### Categorical Features
- `encode_categorical(df, columns, method)` - Encode using 'onehot', 'label', 'frequency', or 'hash'
- `create_category_aggregations(df, categorical_col, numerical_cols, agg_funcs)` - Group-level statistics
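For intuition, here is a plain-pandas sketch of what frequency encoding and group-level aggregation produce; the column names are hypothetical, and this is not the library's implementation:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "SF"],
    "spend": [10.0, 20.0, 15.0, 30.0],
})

# Frequency encoding: replace each category with its relative frequency,
# analogous to encode_categorical(..., method='frequency').
df["city_freq"] = df["city"].map(df["city"].value_counts(normalize=True))

# Group-level statistic broadcast back onto each row, analogous to
# create_category_aggregations(df, 'city', ['spend'], ['mean']).
df["spend_mean_by_city"] = df.groupby("city")["spend"].transform("mean")
```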

#### Binary Features
- `convert_to_binary(df, columns)` - Convert Yes/No, True/False to 0/1 (cast to int)
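A minimal sketch of the same conversion in plain pandas (illustrative only; it assumes Yes/No and True/False tokens):

```python
import pandas as pd

df = pd.DataFrame({"smoker": ["Yes", "No", "Yes"], "active": [True, False, True]})

# Map truthy/falsy tokens to 0/1 and cast to int so downstream
# validation sees a numeric dtype.
mapping = {"Yes": 1, "No": 0, True: 1, False: 0}
for col in ["smoker", "active"]:
    df[col] = df[col].map(mapping).astype(int)
```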

#### Data Quality Validation
- `validate_numeric_features(df, exclude_cols)` - Verify all features are numeric (except ID columns)
- `validate_no_constants(df, exclude_cols)` - Remove constant columns with no variance
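To replicate these checks outside the pipeline, standalone equivalents are straightforward; the function names below are hypothetical, not the skill's API:

```python
import pandas as pd

def drop_constant_columns(df: pd.DataFrame, exclude_cols=()) -> pd.DataFrame:
    """Drop columns with a single unique value, keeping excluded (ID) columns."""
    constants = [
        c for c in df.columns
        if c not in exclude_cols and df[c].nunique(dropna=False) <= 1
    ]
    return df.drop(columns=constants)

def assert_all_numeric(df: pd.DataFrame, exclude_cols=()) -> None:
    """Raise if any non-excluded column is still string/object typed."""
    bad = df.drop(columns=list(exclude_cols)).select_dtypes(exclude="number").columns
    if len(bad) > 0:
        raise TypeError(f"Non-numeric feature columns remain: {list(bad)}")
```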

#### Feature Selection
- `select_features_variance(df, columns, threshold)` - Remove low-variance features (default threshold: 0.01). Columns whose values are nearly all identical add little information, so dropping them is a cheap way to reduce dimensionality.
- `select_features_correlation(df, columns, threshold)` - Remove highly correlated features
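As a sketch of the underlying logic (illustrative, not the skill's implementation), both filters can be expressed in a few lines of pandas:

```python
import pandas as pd

def drop_low_variance(df: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
    """Drop numeric columns whose variance falls below the threshold."""
    variances = df.select_dtypes(include="number").var()
    return df.drop(columns=variances[variances < threshold].index.tolist())

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """For each pair with |corr| above the threshold, drop the later column."""
    corr = df.select_dtypes(include="number").corr().abs()
    cols = list(corr.columns)
    to_drop = {
        cols[j]
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] > threshold
    }
    return df.drop(columns=list(to_drop))
```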

### FeatureEngineeringPipeline
Orchestrates multiple feature engineering steps with logging.

**CRITICAL REQUIREMENTS:**
1. **ALL output features MUST be numeric (int or float)** - downstream analysis (e.g., DID) cannot use string/object columns
2. **Preview data types BEFORE processing**: `df.dtypes` and `df.head()` to check actual values
3. **Encode ALL categorical variables** - strings like "degree", "age_range" must be converted to numbers
4. **Verify output**: Final dataframe should have `df.select_dtypes(include='number').shape[1] == df.shape[1] - 1` (excluding ID column)
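The numeric check in requirement 4 can be run as a one-line assertion once the pipeline has executed (assuming `df_complete` and the single ID column from the usage example below):

```python
# All columns except the one ID column must be numeric before modeling.
numeric_cols = df_complete.select_dtypes(include="number").shape[1]
assert numeric_cols == df_complete.shape[1] - 1, "Encode remaining string columns first."
```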

## Usage Example

```python
from feature_engineering import FeatureEngineeringStrategies, FeatureEngineeringPipeline

# Create pipeline
pipeline = FeatureEngineeringPipeline(name="Demographics")

# Add feature engineering steps
pipeline.add_step(
    FeatureEngineeringStrategies.convert_to_binary,
    columns=['<column5>', '<column2>'],
    description="Convert binary survey responses to 0/1"
).add_step(
    FeatureEngineeringStrategies.encode_categorical,
    columns=['<column3>', '<column7>'],
    method='onehot',
    description="One-hot encode categorical features"
).add_step(
    FeatureEngineeringStrategies.scale_numerical,
    columns=['<column10>', '<column1>'],
    method='standard',
    description="Standardize numerical features"
).add_step(
    FeatureEngineeringStrategies.validate_numeric_features,
    exclude_cols=['<ID Column>'],
    description="Verify all features are numeric before modeling"
).add_step(
    FeatureEngineeringStrategies.validate_no_constants,
    exclude_cols=['<ID Column>'],
    description="Remove constant columns with no predictive value"
).add_step(
    FeatureEngineeringStrategies.select_features_variance,
    columns=[],  # Empty = auto-select all numerical
    threshold=0.01,
    description="Remove low-variance features"
)

# Execute pipeline
# df_complete contains the original columns plus the engineered features
df_complete = pipeline.execute(your_cleaned_df, verbose=True)

# Shortcut: keep only the ID column plus the engineered features
engineered_features = pipeline.get_engineered_features()
df_id_pure_features = df_complete[['<ID Column>'] + engineered_features]

# Get execution log
log_df = pipeline.get_log()
```

## Input
- A cleaned DataFrame: any data processing, imputation, or row/column drops MUST be completed before feature engineering

## Output
- DataFrame with both original and engineered columns
- Engineered feature names accessible via `pipeline.get_engineered_features()`
- Execution log available via `pipeline.get_log()`

## Key Features
- Multiple encoding methods for categorical variables
- Automatic handling of high-cardinality categoricals
- Polynomial and interaction feature generation
- Built-in feature selection for dimensionality reduction
- Pipeline pattern for reproducible transformations

## Best Practices
- **Always validate data types** before downstream analysis: Use `validate_numeric_features()` after encoding
- **Check for constant columns** that provide no information: Use `validate_no_constants()` before modeling
- Convert binary features before other transformations
- Use one-hot encoding for low-cardinality categoricals
- Use KNN imputation when missing values can be inferred from other, related columns
- Use hash encoding for high-cardinality features (IDs, etc.)
- Apply variance threshold to remove constant features
- Check correlation matrix before modeling to avoid multicollinearity
- MAKE SURE ALL ENGINEERED FEATURES ARE NUMERICAL

Overview

This skill engineers dataset features for machine learning and causal inference workflows. It applies encoding, scaling, interactions, binning, and selection through a configurable pipeline to produce numeric-ready feature sets. The pipeline logs each step and returns both the transformed DataFrame and engineered feature names.

How this skill works

The skill provides a set of strategy methods (scaling, encoding, binning, polynomial/interaction creation, log transforms, binary conversion, and validation). Users preview dtypes and sample values, then assemble these strategies into a FeatureEngineeringPipeline that runs steps in order, enforces that all output features are numeric, and records an execution log. Built-in selection methods remove low-variance or highly correlated features to reduce dimensionality before modeling.

When to use it

  • Preparing tabular data for supervised learning or causal analysis
  • Converting categorical and binary data into numeric formats required by models
  • Creating interaction and polynomial features to capture nonlinear relationships
  • Reducing dimensionality by removing constant, low-variance, or highly correlated features
  • Standardizing or robust-scaling numeric features before model training

Best practices

  • Preview df.dtypes and df.head() before any transformations
  • Convert binary columns to 0/1 first, then encode other categoricals
  • Use one-hot for low-cardinality categories and hash encoding for high-cardinality IDs
  • Apply variance threshold and correlation-based selection to avoid multicollinearity
  • Validate that final DataFrame contains only numeric columns (excluding ID) before modeling

Example use cases

  • Encode survey and demographic fields, convert Yes/No answers to binary, and scale numeric covariates for a classifier
  • Create polynomial and interaction terms to improve predictive power for a regression model
  • Aggregate numerical statistics by category (group-level means, counts) to add informative features
  • Run a reproducible pipeline that logs every transformation and outputs engineered feature names for model training
  • Remove low-variance and highly correlated features prior to causal effect estimation

FAQ

What must I check before running the pipeline?

Ensure the input DataFrame is cleaned and imputed; preview df.dtypes and df.head() so you know which columns need encoding.

How do I guarantee the model-ready output is numeric?

Convert binary fields, encode all categorical columns, and use validate_numeric_features. The pipeline enforces that the final feature set is numeric (except the ID column).