A Brief Introduction to SciKit Pipelines


Have you ever trained a Machine Learning model whose predictions looked too good to be true, only to realize that you had data leakage between your training and testing data?

Or have you had so many pre-processing steps that it was difficult to transfer them from model training into production to make actual predictions?

Or does your pre-processing become messy, making it hard to share your code in a readable and easy-to-understand manner?

Then you might want to try scikit-learn's Pipeline. The Pipeline is an elegant solution to set up your workflow for ML training, testing, and production, making your life easier and your results more reproducible.

But what is a pipeline, what are the benefits, and how do you set up a pipeline? I will go through these questions and give you some code examples of the building blocks. By combining these building blocks you can build more sophisticated pipelines, which are tailored to your needs.


What is a Pipeline?

A pipeline allows you to assemble several steps in your ML workflow that sequentially transform your data before passing the data to an estimator. Hence, a pipeline can consist of pre-processing, feature engineering and feature selection steps before passing the data to a final estimator for classification or regression tasks.


Why should I use a Pipeline?

In general, using a pipeline makes your life easier and speeds up the development of your ML models. This is because a pipeline

  • leads to cleaner and more understandable code
  • makes data workflows easy to replicate and understand
  • is easier to read and adjust
  • speeds up data preparation, as the pipeline automates it
  • helps avoid data leakage
  • allows hyperparameter optimization to be run over all estimators and parameters in the pipeline at once
  • is convenient as you only have to call fit() and predict() once to run your entire data pipeline

After you have trained and optimized your model and are happy with the results, you can easily save the trained pipeline. Then, whenever you want to run your model, just load the pre-trained pipeline and you are ready to do some inference. With this you can easily share your model in a very clean way, which is easy to replicate and understand.
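
One common way to persist a pipeline is with joblib, which scikit-learn's documentation recommends for model persistence. A minimal sketch (the filename is an arbitrary example):

import joblib

# Save the fitted pipeline to disk
joblib.dump(pipeline, "pipeline.joblib")

# Later, e.g. in production: load it and run inference right away
loaded_pipeline = joblib.load("pipeline.joblib")
y_pred = loaded_pipeline.predict(X_test)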


How do I set up a Pipeline?

Setting up a pipeline with scikit-learn is very simple and straightforward.

scikit-learn's Pipeline takes a list of (key, value) tuples, where the values are the transformers you want to apply to your data and the keys are names you can choose arbitrarily. The keys can be used to access the parameters of the transformers, for example, when running a grid search during a hyperparameter optimization. As the transformers are stored in a list, you can also access them by indexing.

To fit your pipeline and make predictions you can then run fit() and predict() just as you would with any transformer or regressor in scikit-learn.

A very simple pipeline could look like this:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

pipeline = Pipeline(
   steps=[("imputer", SimpleImputer()), 
          ("scaler", MinMaxScaler()), 
          ("regression", LinearRegression())
   ]
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
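
Because the steps are named, you can also access individual transformers directly, either by key or by list index, as mentioned above. A small sketch based on the pipeline just defined:

# Access a step by its key or by its position in the steps list
scaler = pipeline.named_steps["scaler"]
same_scaler = pipeline[1]

# The keys also address nested parameters via the double-underscore syntax
pipeline.set_params(imputer__strategy="median")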

scikit-learn, however, makes your life even easier if you do not want to name your transformers yourself. Instead, you can just use the make_pipeline() function, and scikit-learn sets the names based on each transformer's class name. Note that make_pipeline() takes the transformers directly as arguments rather than a steps list:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression

pipeline = make_pipeline(
    SimpleImputer(), 
    MinMaxScaler(), 
    LinearRegression()
)

That's it. With this you have quickly set up a simple pipeline that you can start using to train a model and run predictions with. If you want to see what your pipeline looks like, you can display the pipeline in a notebook and scikit-learn shows you an interactive view of it.
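
As a quick sketch, you can enable (or verify) this rendering via set_config; depending on your scikit-learn version, the diagram display may already be the default in notebooks:

from sklearn import set_config

set_config(display="diagram")  # render estimators as an HTML diagram in notebooks
pipeline  # displaying the object in a Jupyter cell shows the interactive view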

But what if you want to build something more complex and customizable? For example, handle categorical and numerical values differently, add features or transform the target value.

No worries, scikit-learn provides additional functionality with which you can create more custom pipelines and bring your pipelines to the next level. These functions are:

  • ColumnTransformer
  • FeatureUnion
  • TransformedTargetRegressor

I will go through them and show you examples of how to use them.

Transforming selected features

If you have different kinds of features, e.g., continuous and categorical ones, you probably want to transform them differently. For example, scale the continuous features while one-hot encoding the categorical features.

You could do these pre-processing steps before passing your features into the pipeline. But then you would not be able to include these steps and their parameters in your hyperparameter search later. Also, including them in the pipeline makes handling your ML model much easier.

To apply a transformation, or even a sequence of transformations, only to selected columns you can use the ColumnTransformer. Its use is very similar to Pipeline, except that instead of passing (key, value) pairs to steps, we pass (key, value, columns) triplets to transformers. We can then include the created transformer as one step in our pipeline.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_transformer = ColumnTransformer(
    transformers=[("encode", OneHotEncoder(), ["col_name"])]
)

pipeline = Pipeline(steps=[
    ("categorical", categorical_transformer)
    ]
)

Since we only want to run the transformation on certain columns, we need to pass these columns to the ColumnTransformer. Moreover, we can tell the ColumnTransformer what to do with the remaining columns. For example, if you want to keep the columns that are not changed by the transformer, you need to set remainder to "passthrough"; otherwise, they get dropped. Instead of keeping or dropping the remaining columns, you can also transform them by passing a transformer as remainder.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

categorical_transformer = ColumnTransformer(
    transformers=[("encode", OneHotEncoder(), ["col_name"])],
    remainder="passthrough"
)

categorical_transformer = ColumnTransformer(
    transformers=[("encode", OneHotEncoder(), ["col_name"])],
    remainder=MinMaxScaler()
)

Since scikit-learn allows pipeline stacking, we could even pass a Pipeline to the ColumnTransformer instead of stating each transformation in the ColumnTransformer itself.

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

categorical_transformer = Pipeline(steps=[("encode", OneHotEncoder())])
numerical_transformer = Pipeline(
   steps=[("imputation", SimpleImputer()), ("scaling", MinMaxScaler())]
)

preprocessor = ColumnTransformer(
   transformers=[
     ("numeric", numerical_transformer, ["num_col_name"]),  # placeholder column lists
     ("categoric", categorical_transformer, ["col_name"]),
   ]
)

pipeline = Pipeline(steps=[("preprocessing", preprocessor)])

Combining features

Now you are able to run different pre-processing steps on different columns, but what if you want to derive new features from the data and add them to your feature set?

For this, you can use FeatureUnion, which combines several transformer objects into a single new transformer. Running a pipeline with a FeatureUnion fits each transformer independently and then joins their outputs.

For example, assume we want to add a moving average as a feature; we could do this (MovingAverage is the custom transformer defined further below, and numerical_transformer is the numerical pipeline from the previous example):

from sklearn.pipeline import FeatureUnion, Pipeline

preprocessor = FeatureUnion(
   [
     ("moving_average", MovingAverage(window=30)),
     ("numerical", numerical_transformer),
   ]
)

pipeline = Pipeline(steps=[("preprocessing", preprocessor)])

Transforming the target value

If you have a regression problem, it can sometimes help to transform the target before fitting the regression.

You can include such a transformation using the TransformedTargetRegressor class. With this class you can either use transformers provided by scikit-learn like a MinMax scaler or write your own transformation functions.

One huge advantage of the TransformedTargetRegressor is that it automatically maps the predictions back to the original space by an inverse transform. So, you do not need to care about this later on when you move from model training to making predictions in production.

import numpy as np

from sklearn.compose import TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

model = LinearRegression()  # any regressor works here

regressor = TransformedTargetRegressor(
    regressor=model, 
    func=np.log1p, 
    inverse_func=np.expm1
)

pipeline = Pipeline(
   steps=[
      ("imputer", SimpleImputer()), 
      ("scaler", MinMaxScaler()), 
      ("regressor", regressor)
    ]
)

pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
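
And if you prefer using a scikit-learn transformer, like the MinMax scaler mentioned above, instead of a pair of functions, you can pass it via the transformer argument. A minimal sketch:

from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Scale the target instead of applying custom functions;
# the inverse transform is again applied automatically after predict()
regressor = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=MinMaxScaler()
)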

Building your own custom functions

Sometimes it is not enough to use the pre-processing methods scikit-learn provides. This, however, should not hold you back from using Pipelines. You can easily create your own functions that you can then include in the pipeline.

For this, you need to build a class that contains a fit() and a transform() method, as these are called when running the pipeline. However, these methods do not necessarily need to do anything. Moreover, we can let the class inherit from scikit-learn's BaseEstimator and TransformerMixin classes to get some basic functionality that our pipeline needs.

For example, assume we want to make predictions on a time series and we want to smooth all features by a moving average. For this, we just set up a class with a transform method that contains the smoothing part.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class MovingAverage(BaseEstimator, TransformerMixin):

    def __init__(self, window=30):
        self.window = window

    def fit(self, X, y=None):
        # Nothing to learn here, but fit() must return self
        return self

    def transform(self, X, y=None):
        # Smooth every feature; expects X to be a pandas DataFrame
        return X.rolling(window=self.window, min_periods=1, center=False).mean()

model = LinearRegression()  # any regressor works here

pipeline = Pipeline(
   steps=[
       ("ma", MovingAverage(window=30)),
       ("imputer", SimpleImputer()),
       ("scaler", MinMaxScaler()),
       ("regressor", model),
   ]
)

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

What else is there to know?

By default, transformers in scikit-learn return a numpy array. This can lead to problems in your pipeline if a later step needs to apply a transformation only to certain features, e.g., only the categorical ones, since the column names are lost in the numpy array.

However, to prevent your pipeline from breaking, you can change the default return value of all transformers to a pandas DataFrame (available since scikit-learn 1.2) by stating

from sklearn import set_config
set_config(transform_output="pandas")

When running a hyperparameter optimization or when checking single parameters of your pipeline, it can be helpful to access the parameters directly. For this you can use the <step name>__<parameter name> syntax with a double underscore. For example, in the moving average example above, we could set the window width of the MovingAverage transformer by calling pipeline.set_params(ma__window=7).
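
The same syntax lets you search over pipeline parameters during hyperparameter optimization. Below is a minimal sketch using GridSearchCV with the moving-average pipeline from above; the step names come from that pipeline, and the candidate values are arbitrary examples:

from sklearn.model_selection import GridSearchCV

# Step names ("ma", "imputer") refer to the pipeline defined above;
# the searched values are arbitrary examples
param_grid = {
    "ma__window": [7, 14, 30],
    "imputer__strategy": ["mean", "median"],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)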


Conclusion

Using scikit-learn's Pipeline can make your life a lot easier when developing new ML models and setting up pre-processing steps. Besides having many benefits, a Pipeline is also simple and straightforward to set up. At the same time, you can build sophisticated and customizable pre-processing Pipelines in which only your creativity sets the boundaries.

If you liked this article or have any questions, feel free to leave a comment or reach out to me. I am also interested in your experiences with scikit-learn's Pipeline.

