Simplify Your Data Preparation With These 4 Lesser-Known Scikit-Learn Classes
Data preparation is famously the least-loved aspect of Data Science. If done right, however, it needn't be such a headache.
While scikit-learn has fallen out of vogue as a modelling library in recent years given the meteoric rise of PyTorch, LightGBM, and XGBoost, it's still easily one of the best data preparation libraries out there.
And I'm not just talking about that old chestnut, train_test_split. If you're prepared to dig a little deeper, you'll find a treasure trove of helpful tools for more advanced data preparation techniques, all of which are perfectly compatible with using other libraries like lightgbm, xgboost and catboost for subsequent modelling.
In this article, I'll walk through four scikit-learn classes which significantly speed up my data preparation workflows in my day-to-day job as a Data Scientist.
1. Pipeline: Seamlessly combine preprocessing steps
Scikit-learn's Pipeline class enables you to combine different preprocessors or models into a single, callable chunk of code.
Pipelines can be composed of two different things:
- Transformer: any object with the fit() and transform() methods. You can think of a transformer as an object that's used for processing your data, and you will commonly have multiple transformers in your data preparation workflow. E.g., you might use one transformer to impute missing values, and another one to scale features or one-hot encode your categorical variables. MinMaxScaler(), SimpleImputer() and OneHotEncoder() are all examples of transformers.
- Estimator: In scikit-learn lingo, an "estimator" usually means a Machine Learning model; i.e. an object with the fit() and predict() methods. LinearRegression() and RandomForestClassifier() are examples of estimators.
In a pipeline, you can chain together as many transformers as you like, enabling you to apply different data preprocessing steps sequentially. If you like, you can also add on an estimator (ML model) at the end in order to make predictions using the newly transformed data, but it's not compulsory.
For example, you could build a pipeline that first imputes missing values with zeros and then one-hot encodes your variables:
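Here's a rough sketch of what that might look like (the step names and exact configuration are just illustrative):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Step 1: replace missing values with zeros; Step 2: one-hot encode the result
zero_impute_ohe = Pipeline(steps=[
    ('impute_zero', SimpleImputer(strategy='constant', fill_value=0)),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])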
Or, if you wanted to directly include the modelling in the pipeline itself, you could build a pipeline that imputes missing values with the mean, scales the features and then makes predictions using a RandomForestRegressor():
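Again as a rough, illustrative sketch (the step names and default hyperparameters are mine, not prescriptive):
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor

# Two preprocessing steps followed by an estimator as the final step
impute_scale_forest = Pipeline(steps=[
    ('impute_mean', SimpleImputer(strategy='mean')),
    ('rescale', MinMaxScaler()),
    ('model', RandomForestRegressor())
])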
Building a pipeline with scikit-learn is remarkably simple.
To illustrate this, I'll first load some data and split it into training and testing sets. In this example, I'll use the diabetes dataset provided by scikit-learn, which contains ten predictor variables (age, sex, body mass index, average blood pressure, and six blood serum measurements) for 442 diabetes patients and a response variable representing the progression of each patient's diabetes one year after these predictor variables were recorded.
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
# Load diabetes dataset into pandas DataFrames
X, y = load_diabetes(scaled=False, return_X_y=True, as_frame=True)
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
display(X_train.head())
display(y_train.head())
Next, we define our Pipeline. For now, I'll just define a simple preprocessing Pipeline that includes two steps – impute missing values with the mean, and rescale all features – and I won't include an estimator/model. The principles, however, are the same regardless of whether or not you include an estimator.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn import set_config
# Return pandas DataFrames instead of numpy arrays
set_config(transform_output="pandas")
# Build pipeline
pipe = Pipeline(steps=[
    ('impute_mean', SimpleImputer(strategy='mean')),
    ('rescale', MinMaxScaler())
])
Once we've defined our Pipeline, we "fit" it to our training dataset, and use it to transform both the training and testing datasets:
# Fit the pipeline to the training data
pipe.fit(X_train)
# Transform data using the fitted pipeline
X_train_transformed = pipe.transform(X_train)
X_test_transformed = pipe.transform(X_test)
This will give us two preprocessed datasets (X_train_transformed and X_test_transformed), ready for any subsequent steps like modelling or feature selection.
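For example, you could now fit any model you like on the transformed data. Here's a quick sketch using LightGBM's scikit-learn API (assuming lightgbm is installed – any scikit-learn-compatible estimator would work the same way):
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error

# Train a LightGBM model on the preprocessed training data
model = LGBMRegressor()
model.fit(X_train_transformed, y_train)

# Evaluate on the preprocessed test data
preds = model.predict(X_test_transformed)
print(mean_squared_error(y_test, preds))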
The advantage of using a Pipeline to handle these preprocessing steps is twofold:
- Protect against leakage: Because the preprocessor is fitted to the training dataset X_train, no information about the test set is "leaked" when imputing missing values or creating one-hot encoded features.
- Avoid duplication: If we didn't use a Pipeline to handle these preprocessing steps, we'd end up transforming the X_test dataset multiple times (every time we wanted to apply a preprocessing step). At this small scale, the repetition might not seem too bad. But in complex ML workflows you can easily grow to 5, 10, or even 20 preprocessing steps. Using a Pipeline makes this easy because we can add in as many steps as we like and still only have to transform X_train and X_test once:
preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', MinMaxScaler()),
    ('step_3', ...),
    ('step_4', ...),
    ...,
    ('step_k', ...)
])

preprocessor.fit(X_train)
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
2. ColumnTransformer: Apply separate transformers to different feature subsets
In the previous example, we applied the same preprocessing steps to all features. But what if we have heterogeneous data types and want to apply different preprocessors to different features? For example, what if we only want to rescale the numerical features, or only one-hot encode the categorical features?
This is where ColumnTransformer steps in. A ColumnTransformer allows you to apply different transformers to different columns of an array or pandas DataFrame.
In the code below, we start by defining the different groups of columns and, for each group, we use a Pipeline to build a preprocessor that will act on that specific group. Finally, we chain together all of the transformers in a single ColumnTransformer.
# This code will only work if you've already run the code in the previous sections
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
# Categorical columns transformer - (a) impute NAs with the mode, and (b) one-hot encode
categorical_features = ['sex']
categorical_transformer = Pipeline(steps=[
    ('impute_mode', SimpleImputer(strategy='most_frequent')),
    # handle_unknown='ignore' ensures that any categories not seen in the training data are
    # ignored at transform time (i.e. all one-hot columns are set to zero for that row)
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first'))
])
# Numerical columns transformer - (a) impute NAs with the mean, and (b) rescale
numerical_features = ['bp', 'bmi', 's1', 's2', 's3', 's4', 's5', 's6'] # All except 'age' and 'sex'
numerical_transformer = Pipeline(steps=[
    ('impute_mean', SimpleImputer(strategy='mean')),
    ('rescale', MinMaxScaler())
])
# Combine the individual transformers into a single ColumnTransformer
preprocessor = ColumnTransformer(
    # Chain together the individual transformers
    transformers=[
        ('categorical_transformer', categorical_transformer, categorical_features),
        ('numerical_transformer', numerical_transformer, numerical_features),
    ],
    # By default, columns which are not transformed by the ColumnTransformer
    # will be dropped. By setting remainder='passthrough', we ensure that
    # these columns are retained, in their original form.
    remainder='passthrough',
    # Prefix feature names with the name of the transformer that generated them (optional)
    verbose_feature_names_out=True
)
# Get visual representation of the preprocessing/feature engineering pipeline
preprocessor
To apply the ColumnTransformer to our data, we use the same code as we did to apply our first Pipeline:
# Fit the preprocessor to the training data
preprocessor.fit(X_train)
# Transform data using the fitted preprocessor
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
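If you want to sanity-check the result, you can inspect the transformed DataFrame or ask the fitted ColumnTransformer for its output feature names; because we set verbose_feature_names_out=True, each column name is prefixed with the name of the transformer that produced it:
# Inspect the transformed data and the generated feature names
display(X_train_transformed.head())
print(preprocessor.get_feature_names_out())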
3. FeatureUnion: Apply multiple transformers in parallel
Pipeline and ColumnTransformer are awesome tools, but they have a significant limitation. Did you spot it?
They can only apply transformers sequentially.
In other words, when you transform a feature Column1 using a Pipeline/ColumnTransformer, scikit-learn will first apply transformer_1 to Column1, then apply transformer_2 to the transformed version of Column1, and so on. This is fine when we want to preprocess our data in a sequential manner (e.g. "first impute missing values, then one-hot encode"), but it's not ideal in cases where we want to apply different preprocessing steps in parallel (e.g. "create two new features from the same underlying column at the same time"). In these cases, a standard Pipeline or ColumnTransformer won't suffice because the original "raw" values of Column1 will be lost as soon as the first transformer in the sequence is applied.
If we want to apply multiple transformations to the same underlying features in parallel, we need to use another tool: FeatureUnion.
We can think of FeatureUnion as a tool that creates a "copy" of your underlying data, applies transformers to those copies in parallel, and then stitches the results together. Each transformer is passed the raw, underlying data, so we don't experience the problem of sequential transformation.
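As a quick, standalone sketch of the idea (purely illustrative – the integration with our existing preprocessor follows below), here's a FeatureUnion that applies PCA and TruncatedSVD to the same numerical columns and stitches the outputs together:
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA, TruncatedSVD

# Both transformers receive the same raw input; their outputs are concatenated column-wise
demo_union = FeatureUnion(transformer_list=[
    ("pca", PCA(n_components=1)),
    ("svd", TruncatedSVD(n_components=2))
])

demo_features = demo_union.fit_transform(X_train[['bp', 'bmi', 's1', 's2', 's3', 's4', 's5', 's6']])
print(demo_features.shape)  # (n_rows, 3): 1 PCA component + 2 SVD components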
To use FeatureUnion, we just need to add a few lines of code:
# This code will only work if you've already run the code in the previous sections
from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA, TruncatedSVD
# Define a feature_union object which will create reduced-dimensionality features
union = FeatureUnion(transformer_list=[
    ("pca", PCA(n_components=1)),
    ("svd", TruncatedSVD(n_components=2))
])
# Adapt the numerical transformer so that it includes the FeatureUnion
numerical_features = ['bp', 'bmi', 's1', 's2', 's3', 's4', 's5', 's6'] # All except 'age' and 'sex'
numerical_transformer = Pipeline(steps=[
    ('impute_mean', SimpleImputer(strategy='mean')),
    ('rescale', MinMaxScaler()),
    ('reduce_dimensionality', union)
])
# Categorical columns transformer - same as above
categorical_features = ['sex']
categorical_transformer = Pipeline(steps=[
    ('impute_mode', SimpleImputer(strategy='most_frequent')),
    # handle_unknown='ignore' ensures that any categories not seen in the training data are
    # ignored at transform time (i.e. all one-hot columns are set to zero for that row)
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first'))
])
# Build the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('categorical_transformer', categorical_transformer, categorical_features),
        ('numerical_transformer', numerical_transformer, numerical_features),
    ],
    remainder='passthrough',
    verbose_feature_names_out=True
)
preprocessor
In this diagram, we can see that the FeatureUnion steps are applied in parallel, rather than sequentially. Just like before, we fit the preprocessor to our training data and then use it to transform any dataset we want to use for modelling/prediction.
# Fit the preprocessor to the training data
preprocessor.fit(X_train)
# Transform data using the fitted preprocessor
X_train_transformed = preprocessor.transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
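If you inspect the transformed output at this point, you should see the parallel features sitting side by side – with this particular setup, that's roughly one one-hot encoded column for sex, one PCA component, two SVD components, and the passed-through age column:
# Check the resulting columns: the PCA and SVD features appear side by side
print(X_train_transformed.columns.tolist())
print(X_train_transformed.shape)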
4. FunctionTransformer: Seamlessly integrate feature engineering
All of the transformers and tools discussed above use pre-built classes in scikit-learn to apply standard transformations to your data (e.g., scaling, one-hot encoding, imputing, etc.).
If you want to apply a custom function – for example during feature engineering – then you'll want to use FunctionTransformer. Personally, I love this class – it makes it super easy to integrate custom functions into your Pipeline without having to write new transformer classes from scratch.
Creating a FunctionTransformer is really simple. You start by defining your functions in the standard Pythonic style, and then create a pipeline. Here, I define two simple functions: one that adds together two columns, and another that subtracts two columns. (The column names feature_1 to feature_4 are just placeholders; in practice, your functions need to reference columns that actually exist at that point in the pipeline.)
from sklearn.preprocessing import FunctionTransformer
def add_features(X):
    # Work on a copy so we don't mutate the caller's DataFrame in place
    X = X.copy()
    X['feature_1_2'] = X['feature_1'] + X['feature_2']
    return X

def subtract_features(X):
    X = X.copy()
    X['feature_3_4'] = X['feature_3'] - X['feature_4']
    return X
# Put into a pipeline
feature_engineering = Pipeline(steps=[
    ('add_features', FunctionTransformer(add_features)),
    ('subtract_features', FunctionTransformer(subtract_features))
])
To simplify things even further, you could include multiple transformations within the same function:
def add_subtract_features(X):
    # Work on a copy so we don't mutate the caller's DataFrame in place
    X = X.copy()
    X['feature_1_2'] = X['feature_1'] + X['feature_2']  # Add features
    X['feature_3_4'] = X['feature_3'] - X['feature_4']  # Subtract features
    return X

# Put into a pipeline
feature_engineering = Pipeline(steps=[
    ('add_subtract_features', FunctionTransformer(add_subtract_features)),
])
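To sanity-check the feature engineering pipeline on its own, you can apply it to a small dummy DataFrame (the column names feature_1 to feature_4 here are just the placeholders used in the functions above):
import pandas as pd

# Toy data using the placeholder column names expected by the custom functions
toy = pd.DataFrame({
    'feature_1': [1, 2],
    'feature_2': [10, 20],
    'feature_3': [5, 7],
    'feature_4': [1, 3],
})

# The FunctionTransformers are stateless, so fit_transform simply applies the functions
print(feature_engineering.fit_transform(toy))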
Finally, add the feature_engineering pipeline to the preprocessing pipeline we defined earlier:
# Combine preprocessing and feature engineering in a single pipeline
pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('feature_engineering', feature_engineering),
])
pipe
And use this new pipeline to apply the same preprocessing/feature engineering steps to all your datasets:
# Fit the full pipeline to the training data
pipe.fit(X_train)
# Transform data using the fitted pipeline
X_train_transformed = pipe.transform(X_train)
X_test_transformed = pipe.transform(X_test)
Bonus: Save your pipelines for truly reproducible workflows
In enterprise applications of machine learning, it's very rare to use a model or preprocessing workflow only once. More often, you'll be required to rerun your model each week or month and generate new predictions for new data.
In these situations, rather than writing a new preprocessing pipeline from scratch each time, you can reuse the same pipeline. Once you've developed your pipeline, use the joblib library to save it, so that you can rerun the exact same transformations on future datasets:
import joblib
# Save pipeline
joblib.dump(pipe, "pipe.pkl")
# Assume that the below steps are applied in another notebook/script
# Load pipeline
pretrained_pipe = joblib.load("pipe.pkl")
# Apply pipeline to a new dataset, X_test_new
X_test_new_transformed = pretrained_pipe.transform(X_test_new)
Conclusion
To recap:
- Pipeline provides a quick way to sequentially apply different preprocessing transformers to your data
- Using a ColumnTransformer is a fantastic way to apply separate preprocessing steps to different feature subsets
- FeatureUnion enables you to apply different preprocessing transformations in parallel
- FunctionTransformer provides a super-simple way to write custom feature engineering functions and integrate them within your pipelines
If you use these tools, my promise to you is that they'll help you write code that is more elegant, reproducible, and pythonic. Your Machine Learning Engineers will love you!
If you liked this article, it would mean a lot if you followed me. If you'd like to get unlimited access to all of my stories (and the rest of Medium.com), you can sign up via my referral link for $5 per month. It adds no extra cost to you vs. signing up via the general signup page, and helps to support my writing as I get a small commission.
Thanks for reading!