A Sklearn Pipeline Tutorial - Machine Learning in Python

In the past couple of weeks, I started to use sklearn pipelines more intensively. It seemed like a good project to find out more about them and share my experiences in a blog post. So here it is: a sklearn pipeline tutorial.

See the full code!

For this blog post, I use the donors dataset that can be found on Kaggle. You can find the complete code in my Google Colab notebook.

Why pipelines?

In most machine learning projects, you won’t start modeling before you have gone through a whole load of preprocessing steps, such as handling missing data and engineering new features to feed to the model. Because you’ll want to work with a train, (validation) and test set, you’ll go through those preprocessing steps multiple times. If you don’t structure your code carefully, the result will be a huge notebook with many repetitions. Enter: pipelines.

Pipelines are a great way to apply sequential transformations to your data and to feed the result to a classifier. A pipeline is an end-to-end procedure that forces you to structure your code and thought process in a specific way.
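To make the idea concrete, here is a minimal sketch of a two-step pipeline. It is not taken from the notebook: the synthetic data, the choice of LogisticRegression and the name demo_pipeline are assumptions made purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the post itself uses the donors dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

demo_pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),           # transformation step
    ('classifier', LogisticRegression()),  # final estimator
])

# fit() fits the scaler on the training data, transforms that data,
# and then fits the classifier on the transformed result.
demo_pipeline.fit(X_train, y_train)

# predict() reuses the already fitted scaler before calling the classifier.
predictions = demo_pipeline.predict(X_test)
```

Every step except the last one has to offer fit and transform; only the last step needs to be able to predict.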

I wouldn’t recommend a pipeline as your main tool in the exploratory phase of your project. However, I tend to build one in parallel: every time I finish an exploratory step, I add it to the pipeline. Structuring your machine learning project in a pipeline has some benefits:

The possibility to tune the hyperparameters of not only the classifier, but also the feature engineering process
A standardized workflow — i.e. less chance you’ll have data leakage
Readability, interpretability and reproducibility
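The data leakage point can be sketched with cross-validation. This is an illustration under assumptions (synthetic data, a scaler plus LogisticRegression, the name leak_free_pipeline), not code from the notebook. Because the whole pipeline is passed to cross_val_score, the scaler is refitted on the training part of every fold and never sees the held-out part.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

leak_free_pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('classifier', LogisticRegression()),
])

# For each of the 5 folds, a fresh copy of the pipeline (scaler included)
# is fitted on the training part only and scored on the held-out part.
scores = cross_val_score(leak_free_pipeline, X, y, cv=5)
mean_accuracy = scores.mean()
```

Scaling the full dataset first and cross-validating only the classifier would let statistics from the held-out folds leak into training.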

The BaseEstimator

If you want to use scikit-learn like a pro, you should think like scikit-learn, because that is what is expected of you. Every step in your workflow should be structured as a fit and a transform. The easiest way to do that is to create new classes that inherit from the BaseEstimator class in scikit-learn.

“All estimators in the main scikit-learn codebase should inherit from sklearn.base.BaseEstimator.”
— Developing scikit-learn estimators

In the donors dataset, there are some missing values. Since there are only a few, a quick imputation method is to simply use the mode as the imputation value. To prevent data leakage, you need to calculate the mode on the train set only, and then apply the transformation to both the train and the validation set. That’s why the two steps live in two separate methods.

In the following chunk of code, the PreProcessor class inherits from the BaseEstimator. The fit method calculates the mode, and the transform method fills the missing values with it.

import statistics

from sklearn.base import BaseEstimator


class PreProcessor(BaseEstimator):
    def __init__(self):
        print('PreProcessor initiated.')

    def fit(self, x, y=None):
        # Learn the mode from the data passed to fit (the train set), ignoring missing values.
        self.teacher_prefix_mode = statistics.mode(x['teacher_prefix'].dropna())
        return self

    def transform(self, x):
        # Work on a copy so the caller's DataFrame is left untouched.
        x_dataset = x.copy()
        x_dataset['teacher_prefix'] = x_dataset['teacher_prefix'].fillna(self.teacher_prefix_mode)
        return x_dataset
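As a variation on the same pattern, here is a sketch of a more general imputer together with a small usage example. The class name ModeImputer, its column parameter and the toy DataFrames are hypothetical and not part of the notebook. It also uses pandas' Series.mode instead of the statistics module, and adds TransformerMixin so that fit_transform is available.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ModeImputer(TransformerMixin, BaseEstimator):
    """Hypothetical generalisation of PreProcessor: the column is a parameter."""

    def __init__(self, column='teacher_prefix'):
        self.column = column

    def fit(self, x, y=None):
        # Series.mode() ignores missing values by default; take the first mode on ties.
        self.mode_ = x[self.column].mode().iloc[0]
        return self

    def transform(self, x):
        # Work on a copy so the input DataFrame is not modified.
        x_dataset = x.copy()
        x_dataset[self.column] = x_dataset[self.column].fillna(self.mode_)
        return x_dataset


# Toy data, made up for illustration.
train = pd.DataFrame({'teacher_prefix': ['Mrs.', 'Mrs.', 'Mr.', None]})
validation = pd.DataFrame({'teacher_prefix': [None, 'Ms.']})

imputer = ModeImputer().fit(train)                 # mode learned from train only
train_filled = imputer.transform(train)
validation_filled = imputer.transform(validation)  # the train mode is reused here
```

Storing the constructor argument under the same attribute name is what lets BaseEstimator expose it through get_params and set_params.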

The Classifier

All the steps in my machine learning project come together in the pipeline. The syntax is as follows: each step is a tuple of (1) a name and (2) an object that follows the scikit-learn fit/transform interface. To get an overview of all the steps I took, please take a look at the notebook. The cool thing about this chunk of code is that it only takes you a couple of seconds to understand all the steps in your machine learning project.

The final object in the pipeline is a voting classifier. It is an ensemble of a random forest, an AdaBoost and a K-nearest-neighbour model. Why? For demonstration purposes. I want to show that pipelines are very flexible: you can have one classifier, stacked classifiers, feature classifiers, a voting classifier, etc.

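from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# PreProcessor through PostProcessor are the custom classes defined in the notebook.
model_pipeline = Pipeline(
    steps=[
        ('pre_process', PreProcessor()),
        ('datetime_features', DateTimeFeatureCreator()),
        ('essay_features', EssayFeatureCreator()),
        ('essay_ngrams', EssayNGramCreator()),
        ('resources_ngrams', ResourcesNGramCreator()),
        ('OHE_categoricals', OneHotTransformer()),
        ('post_process', PostProcessor()),
        ('scale', StandardScaler()),
        ('voting_classifier', VotingClassifier(
            estimators=[('rf', RandomForestClassifier()),
                        ('ab', AdaBoostClassifier()),
                        ('knn', KNeighborsClassifier())],
            voting='hard')),
    ]
)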
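Once a pipeline like this is defined, it behaves like a single estimator. The sketch below shows that on a shortened stand-in: the custom feature steps are left out, the data is synthetic, and the names small_pipeline and val_accuracy are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the real pipeline runs on the donors dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

small_pipeline = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('voting_classifier', VotingClassifier(
        estimators=[('rf', RandomForestClassifier(random_state=0)),
                    ('ab', AdaBoostClassifier(random_state=0)),
                    ('knn', KNeighborsClassifier())],
        voting='hard')),
])

# One call fits every step in order on the train set.
small_pipeline.fit(X_train, y_train)

# score() pushes the validation data through the fitted steps and returns accuracy.
val_accuracy = small_pipeline.score(X_val, y_val)

# Fitted steps can be looked up by the name given in the pipeline.
fitted_scaler = small_pipeline.named_steps['scale']

# Nested parameters are addressed as <step>__<estimator>__<parameter>.
has_rf_param = 'voting_classifier__rf__n_estimators' in small_pipeline.get_params()
```

The step names are more than labels: they are the handles through which each step and its parameters can be reached later on.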

One thing is missing from this blog post: hyperparameter tuning. In a future blog post, I want to show you how you can tune both your learning algorithm and the feature engineering steps.

I hope you enjoyed this sklearn pipeline tutorial. Great success!
