
feature-engineering-kit skill

/feature-engineering-kit

This skill auto-generates features for ML pipelines, including categorical encodings, scaling, polynomial terms, date and text feature extraction, and missing-value handling.

npx playbooks add skill dkyazzentwatwa/chatgpt-skills --skill feature-engineering-kit

Review the files below or copy the command above to add this skill to your agents.

Files (3)
SKILL.md
804 B
---
name: feature-engineering-kit
description: Auto-generate features with encodings, scaling, polynomial features, and interaction terms for ML pipelines.
---

# Feature Engineering Kit

Automated feature engineering with encodings, scaling, and transformations.

## Features

- **Encodings**: One-hot, label, target encoding
- **Scaling**: Standard, min-max, robust scaling
- **Polynomial Features**: Generate interactions
- **Binning**: Discretize continuous features
- **Date Features**: Extract time-based features
- **Text Features**: TF-IDF, word counts
- **Missing Value Handling**: Imputation strategies

## CLI Usage

```bash
python feature_engineering.py --data train.csv --output engineered.csv --config config.json
```

## Dependencies

- scikit-learn>=1.3.0
- pandas>=2.0.0
- numpy>=1.24.0

Overview

This skill auto-generates machine learning features, including encodings, scaling, polynomial interactions, and other common transformations. It produces ready-to-use engineered datasets or pipeline components for training and inference. The toolkit handles categorical and numerical columns, extracts date and text features, and supports configurable imputation strategies. Use it to accelerate feature preparation and reduce manual engineering work.

How this skill works

The skill inspects input columns and applies selected transformations such as one-hot, label, or target encoding for categoricals and standard/min-max/robust scaling for numerics. It can create polynomial and interaction terms, discretize continuous variables, extract date-time parts, and compute text features like TF-IDF or word counts. Missing values are handled with configurable imputers before transformations, and the output is written as an engineered dataset or pipeline artifact. Configuration is provided via a JSON file or CLI flags to control columns and transformation parameters.
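
The configuration schema is not documented in this listing, so the sketch below shows only one plausible layout, written as a Python script that emits config.json. Every key and value here is an assumption illustrating the kinds of options described above, not the tool's actual schema.

```python
import json

# Hypothetical config.json layout -- the real schema is defined by
# feature_engineering.py and may differ. Every key below is illustrative.
config = {
    "categorical": {"columns": ["city", "plan"], "encoding": "one-hot"},
    "numeric": {"columns": ["age", "income"], "scaling": "robust"},
    "datetime": {"columns": ["signup_date"], "parts": ["year", "month", "day"]},
    "text": {"columns": ["bio"], "method": "tfidf", "max_features": 500},
    "imputation": {"numeric": "median", "categorical": "most_frequent"},
    "polynomial": {"degree": 2, "interaction_only": True},
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```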

When to use it

  • Preparing tabular data for supervised learning to improve model signal.
  • Automating repetitive encoding and scaling steps across experiments.
  • Creating interaction or polynomial features for non-linear models.
  • Generating date/time and text-derived predictors from raw inputs.
  • Standardizing preprocessing across training and production pipelines.
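
For the last point, here is a minimal scikit-learn sketch of the fit-on-train, reuse-everywhere pattern the skill describes. The column names ("age", "income", "city", "plan") are illustrative assumptions, not the skill's actual interface.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["age", "income"]      # assumed column names
categorical_cols = ["city", "plan"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # impute before scaling
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

train = pd.read_csv("train.csv")                 # matches the CLI example
X_train = preprocess.fit_transform(train)        # fit on training data only
# X_valid = preprocess.transform(valid)          # reuse the fitted objects later
```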

Best practices

  • Define column groups (categorical, numeric, datetime, text) explicitly in config to avoid unexpected transforms.
  • Use cross-validated target encoding or holdout schemes to prevent leakage (see the sketch after this list).
  • Limit polynomial degree and interaction pairs to control feature explosion and overfitting.
  • Fit imputers and encoders on training data only and apply the same objects to validation/test sets.
  • Profile resulting feature set size and sparsity when using one-hot or TF-IDF to manage memory.
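
As a concrete illustration of the leakage-safe target encoding mentioned above, here is a generic out-of-fold sketch with a made-up DataFrame; it shows the idea, not the skill's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(df, col, target, n_splits=5, seed=0):
    """Encode `col` with fold-wise target means so no row sees its own target."""
    encoded = pd.Series(np.nan, index=df.index)
    global_mean = df[target].mean()
    for train_idx, val_idx in KFold(n_splits, shuffle=True, random_state=seed).split(df):
        # Category means are computed only on the training fold.
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        encoded.iloc[val_idx] = df[col].iloc[val_idx].map(fold_means).to_numpy()
    return encoded.fillna(global_mean)  # unseen categories fall back to the global mean

df = pd.DataFrame({"cat": ["a", "b", "a", "c", "b", "a"], "y": [1, 0, 1, 0, 1, 0]})
df["cat_te"] = oof_target_encode(df, "cat", "y")
```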

Example use cases

  • Quickly produce a cleaned CSV with encodings and scaled features for baseline model training.
  • Build a preprocessing pipeline that extracts year/month/day parts and encodes them for a time-series model (sketched after this list).
  • Generate TF-IDF vectors and word-count features for text columns alongside numeric features.
  • Create polynomial and interaction terms for a gradient boosting model to capture non-linear effects.
  • Impute missing values and apply robust scaling before feeding data to models sensitive to outliers.
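
A small pandas sketch of the date-part extraction use case above; the "signup_date" column is a made-up example, and the skill's output naming may differ.

```python
import pandas as pd

df = pd.DataFrame({"signup_date": ["2023-01-15", "2023-06-30", "2024-02-01"]})
dt = pd.to_datetime(df["signup_date"])

df["signup_year"] = dt.dt.year
df["signup_month"] = dt.dt.month
df["signup_day"] = dt.dt.day
df["signup_dayofweek"] = dt.dt.dayofweek  # 0 = Monday
```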

FAQ

What input formats are supported?

CSV files and pandas DataFrames; configuration is provided via JSON or CLI options.

How does the skill avoid target leakage with target encoding?

It supports cross-validated and holdout target-encoding schemes, so each row's encoding is learned from folds that exclude that row's target value.