
A step-by-step guide on training and scoring machine learning models in KNIME using PyCaret

PyCaret

PyCaret is an open-source, low-code machine learning library and end-to-end model management tool built in Python to automate machine learning workflows. Its ease of use, simplicity, and ability to quickly and efficiently build and deploy end-to-end machine learning pipelines will amaze you.
As a low-code library, PyCaret can replace hundreds of lines of code with only a few, which makes the experiment cycle exponentially faster and more efficient.

PyCaret is simple and easy to use. All the operations performed in PyCaret are sequentially stored in a Pipeline that is fully automated for deployment. Whether it’s imputing missing values, one-hot-encoding, transforming categorical data, feature engineering, or even hyperparameter tuning, PyCaret automates all of it.
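As a small illustration of that automation, here is a hedged sketch of what a typical setup call looks like. The file name, target column, and preprocessing options are placeholders rather than part of this tutorial, and the parameter names follow PyCaret 2.x:

# minimal sketch: setup() bundles the preprocessing steps mentioned above into one call
from pycaret.regression import setup
import pandas as pd

data = pd.read_csv('your_data.csv')      # placeholder: any tabular dataset with a numeric target
s = setup(
    data,
    target='your_target',                # placeholder: column to predict
    numeric_imputation='mean',           # missing numeric values -> column mean
    categorical_imputation='mode',       # missing categorical values -> most frequent value
    normalize=True,                      # z-score scaling of numeric features
    silent=True                          # skip the interactive dtype confirmation prompt
)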

KNIME

KNIME Analytics Platform is open-source software for creating data science. Intuitive, open, and continuously integrating new developments, KNIME makes understanding data and designing data science workflows and reusable components accessible to everyone.

KNIME Analytics Platform is one of the most popular open-source platforms for automating the data science process. Its node repository contains thousands of nodes that you can drag and drop onto the KNIME workbench. A collection of interconnected nodes forms a workflow that can be executed locally, or in the KNIME web portal after the workflow has been deployed to the KNIME Server.

Installation

For this tutorial, you will need two things. The first is the KNIME Analytics Platform, desktop software that you can download from here. Second, you need Python.

The easiest way to get started with Python is to download Anaconda Distribution. To download, click here.

Once you have both the KNIME Analytics Platform and Python installed, you need to create a separate Conda environment in which we will install PyCaret. Open the Anaconda prompt and run the following commands:

# create a conda environment
conda create --name knimeenv python=3.6

# activate environment
conda activate knimeenv

# install pycaret
pip install pycaret
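
Before moving on, you can optionally confirm that PyCaret imports cleanly in this environment (the exact version attribute may vary across PyCaret releases):

# optional: verify the installation from the activated knimeenv environment
python -c "import pycaret; print(pycaret.__version__)"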

Now open the KNIME Analytics Platform and go to File → Install KNIME Extensions, then under KNIME & Extensions select KNIME Python Extension and install it.

Once installation completes, go to File → Preferences → KNIME → Python and select your Python 3 environment. Notice that in my case the name of the environment is “powerbi”. If you have followed the commands above, the name of the environment is “knimeenv”.

We are ready now

Click on “New KNIME Workflow” and a blank canvas will open.

On the left-hand side, there are tools that you can drag and drop onto the canvas; you execute the workflow by connecting the components to one another. All the actions available in the repository on the left side are called Nodes.

Dataset

For this tutorial, I am using a regression dataset from PyCaret’s repository called ‘insurance’. You can download the data from here.

I will create two separate workflows. The first is for model training and selection, and the second is for scoring new data using the trained pipeline.

Model Training & Selection

Let’s first read the CSV file with the CSV Reader node, followed by a Python Script node. Inside the Python Script node, execute the following code:

# init setup, prepare data
from pycaret.regression import *
s = setup(input_table_1, target = 'charges', silent=True)

# model training and selection
best = compare_models()

# store the results, print and save
output_table_1 = pull()
output_table_1.to_csv('c:/users/moezs/pycaret-demo-knime/results.csv', index = False)

# finalize best model and save
best_final = finalize_model(best)
save_model(best_final, 'c:/users/moezs/pycaret-demo-knime/pipeline')

This script imports the regression module from PyCaret and then initializes the setup function, which automatically handles the train-test split and all the data preparation tasks such as missing value imputation, scaling, and feature engineering. compare_models trains and evaluates all the estimators using k-fold cross-validation and returns the best model. The pull function returns the performance metrics of the trained models as a DataFrame, which is then saved as results.csv on a local drive. Finally, save_model saves the entire transformation pipeline and model as a pickle file.
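If you want more control over the comparison, compare_models accepts a few useful arguments. The example below is an optional variation rather than part of the workflow above; it assumes the same setup session as the script, and the model IDs and sort metric follow PyCaret 2.x:

# optional variation: compare only selected estimators, sort by MAE, use 5 folds
top3 = compare_models(include=['lr', 'rf', 'gbr'],   # Linear, Random Forest, Gradient Boosting
                      sort='MAE',                    # rank models by mean absolute error
                      fold=5,                        # 5-fold cross-validation
                      n_select=3)                    # return the top 3 models as a list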

When you successfully execute this workflow, you will generate the pipeline.pkl and results.csv files in the defined folder.

This is what results.csv contains:

These are the cross-validated metrics for all the models. The best model, in this case, is Gradient Boosting Regressor.
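If you want to squeeze a bit more performance out of the winning model before finalizing it, PyCaret also provides tune_model. This is an optional step not used in the KNIME workflow above; the sketch assumes the same Python session as the training script, and the optimize and n_iter arguments follow PyCaret 2.x:

# optional: hyperparameter tuning of the selected model before finalize_model
from pycaret.regression import tune_model, finalize_model
tuned = tune_model(best, optimize='MAE', n_iter=25)   # random search over 25 candidates
best_final = finalize_model(tuned)                     # refit on the full dataset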

Model Scoring

We can now use our pipeline.pkl to score new data. Since I do not have a separate test dataset for insurance.csv, I will drop the target column from the same file, just to demonstrate scoring.

I have used the Column Filter node to remove the target column, i.e. charges. In the Python Script node, execute the following code:

# load pipeline
from pycaret.regression import load_model, predict_model
pipeline = load_model('c:/users/moezs/pycaret-demo-knime/pipeline')

# generate predictions and save to csv
output_table_1 = predict_model(pipeline, data = input_table_1)
output_table_1.to_csv('c:/users/moezs/pycaret-demo-knime/predictions.csv', index=False)

When you successfully execute this workflow, it will generate predictions.csv.
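If you want to sanity-check the output outside of KNIME, the file can be inspected with pandas. Note that PyCaret 2.x appends the predictions in a column named Label, and the feature column names below assume the insurance dataset:

# optional: quick look at the scored file produced by the workflow
import pandas as pd
preds = pd.read_csv('c:/users/moezs/pycaret-demo-knime/predictions.csv')
print(preds[['age', 'bmi', 'Label']].head())   # 'Label' holds the predicted charges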

I hope that you will appreciate the ease of use and simplicity of PyCaret. When used within an analytics platform like KNIME, it can save you many hours of writing code and then maintaining that code in production. With less than 10 lines of code, I have trained and evaluated multiple models using PyCaret and deployed an ML pipeline in KNIME.