Introduction
In this tutorial, I will show you how to train and deploy machine learning pipelines in Alteryx, a very popular ETL tool, using PyCaret, an open-source, low-code machine learning library in Python. The learning goals of this tutorial are:
- What is PyCaret and how to get started?
- What is Alteryx Designer and how to set it up?
- Train an end-to-end machine learning pipeline in Alteryx Designer, including data preparation such as missing value imputation, one-hot encoding, scaling, transformations, etc.
- Deploy the trained pipeline and generate predictions during ETL.
PyCaret
PyCaret is an open-source, low-code machine learning library and end-to-end model management tool built in Python to automate machine learning workflows. PyCaret is known for its ease of use, simplicity, and ability to quickly and efficiently build and deploy end-to-end machine learning pipelines. To learn more about PyCaret, check out their GitHub.
Alteryx Designer
Alteryx Designer is a proprietary tool developed by Alteryx and is used for automating every step of analytics, including data preparation, blending, reporting, predictive analytics, and data science. You can access any data source, file, application, or data type, and experience the simplicity and power of a self-service platform with 260+ drag-and-drop building blocks. You can download the one-month free trial version of Alteryx Designer from here.
Tutorial Pre-Requisites:
For this tutorial, you will need two things. The first is Alteryx Designer, a desktop application that you can download from here. The second is Python. The easiest way to get Python is to download the Anaconda Distribution. To download that, click here.
Open Alteryx Designer and click on File → New Workflow
At the top, there are tools that you can drag and drop onto the canvas. You build a workflow by connecting the components to one another, and then you execute it.
Dataset
For this tutorial, I am using a regression dataset from PyCaret’s repository called insurance. You can download the data from here.
I will create two separate Alteryx workflows: the first for model training and selection, and the second for scoring new data using the trained pipeline.
Model Training & Selection
Let’s first read the CSV file with the Input Data tool and connect it to a Python tool. Inside the Python tool, execute the following code:
# install pycaret
from ayx import Package
Package.installPackages('pycaret')
# read data from input data tool
from ayx import Alteryx
data = Alteryx.read("#1")
# init setup, prepare data
# silent=True skips the interactive data-type confirmation prompt
from pycaret.regression import *
s = setup(data, target='charges', silent=True)
# model training and selection
best = compare_models()
# store the results, print and save
results = pull()
results.to_csv('c:/users/moezs/pycaret-demo-alteryx/results.csv', index=False)
Alteryx.write(results, 1)
# finalize best model and save
best_final = finalize_model(best)
save_model(best_final, 'c:/users/moezs/pycaret-demo-alteryx/pipeline')
This script imports the regression module from pycaret and then calls the setup function, which automatically handles the train/test split and all the data preparation tasks such as missing value imputation, scaling, feature engineering, etc. compare_models trains and evaluates all the estimators using k-fold cross-validation and returns the best model. pull returns the model performance metrics as a DataFrame, which is then saved as results.csv on disk and also written to the first output anchor of the Python tool in Alteryx (so that you can view the results on screen). Finally, save_model saves the entire transformation pipeline, including the best model, as a pickle file.
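To make the model comparison step more concrete, here is a toy sketch of k-fold cross-validation used to rank two candidate models by mean absolute error, written with only the standard library. It is not PyCaret's implementation. The two "models" (a mean predictor and a one-feature least-squares line), the helper names, and the synthetic data are all assumptions made for illustration.

```python
import random
from statistics import mean

def fit_mean(xs, ys):
    """Baseline 'model': always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m

def fit_line(xs, ys):
    """One-feature least-squares line y = a*x + b."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    b = my - a * mx
    return lambda x: a * x + b

def kfold_mae(fit, xs, ys, k=5, seed=0):
    """Average held-out mean absolute error of `fit` over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(mean(abs(model(xs[i]) - ys[i]) for i in held_out))
    return mean(scores)

def compare(candidates, xs, ys, k=5):
    """Return (name, score) pairs sorted from best (lowest MAE) to worst."""
    scored = [(name, kfold_mae(fit, xs, ys, k)) for name, fit in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1])

# synthetic data (an assumption): y is linear in x plus a little noise
rng = random.Random(42)
xs = [float(i) for i in range(50)]
ys = [3.0 * x + 10.0 + rng.uniform(-1, 1) for x in xs]

ranking = compare({"mean baseline": fit_mean, "linear fit": fit_line}, xs, ys)
best_name, best_score = ranking[0]
print(best_name, round(best_score, 3))
```

Every candidate is scored only on rows it did not see during fitting, and the candidates are ranked on the average of those held-out scores. That is the idea behind the cross-validated metrics in the comparison table.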
When you successfully execute this workflow, it will generate the pipeline.pkl and results.csv files. You can also see the output of the best models and their cross-validated metrics on screen.
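The hard-coded c:/users/moezs/pycaret-demo-alteryx paths in the script only exist on my machine. One way to make the script portable is to build both paths from a single output folder and create that folder before writing. Below is a minimal sketch using pathlib. The helper name is hypothetical, and a temporary directory stands in for your own project folder. Note that save_model takes the name without an extension and appends .pkl itself, which is why 'pipeline' ends up as pipeline.pkl.

```python
from pathlib import Path
import tempfile

def make_output_paths(out_dir):
    """Create out_dir if needed and return the paths the training script writes to."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    return {
        "results_csv": out / "results.csv",
        # pass this to save_model WITHOUT ".pkl"; PyCaret adds the extension
        "pipeline_stem": out / "pipeline",
    }

# demo folder (an assumption): replace with your own project directory
demo_dir = Path(tempfile.mkdtemp()) / "pycaret-demo-alteryx"
paths = make_output_paths(demo_dir)
print(paths["results_csv"])
print(paths["pipeline_stem"])
```

In the training script you would then write results.to_csv(paths["results_csv"], index=False) and save_model(best_final, str(paths["pipeline_stem"])).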
This is what results.csv contains:
These are the cross-validated metrics for all the models. The best model, in this case, is Gradient Boosting Regressor.
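If you want to pick the winner programmatically rather than by eye, the saved table can be re-ranked on any metric column. Here is a small sketch with the standard csv module. The function name is hypothetical, and the sample rows are made-up placeholders rather than the results from this tutorial. It assumes the file has a Model column plus numeric metric columns such as MAE and R2.

```python
import csv
import io

def rank_models(csv_text, metric="MAE", higher_is_better=False):
    """Return the rows of a results table sorted from best to worst on `metric`."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return sorted(rows, key=lambda r: float(r[metric]), reverse=higher_is_better)

# placeholder table with made-up numbers, NOT the results from this tutorial
sample = """Model,MAE,R2
Model A,3100.0,0.80
Model B,2500.0,0.85
Model C,4200.0,0.74
"""

by_mae = rank_models(sample, metric="MAE")                      # lower is better
by_r2 = rank_models(sample, metric="R2", higher_is_better=True)  # higher is better
print(by_mae[0]["Model"], by_r2[0]["Model"])
```

To use it on the real file, read results.csv into a string first and pass that in place of the sample text.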
Model Scoring
We can now use our pipeline.pkl to score a new dataset. Since I do not have a separate, unlabeled dataset for insurance.csv, I will drop the target column, charges, and then generate predictions using the trained pipeline.
I have used the Select tool to remove the target column, charges. In the Python tool, execute the following code:
# read data from the input tool
from ayx import Alteryx
data = Alteryx.read("#1")
# load pipeline
from pycaret.regression import load_model, predict_model
pipeline = load_model('c:/users/moezs/pycaret-demo-alteryx/pipeline')
# generate predictions and save to csv
predictions = predict_model(pipeline, data)
predictions.to_csv('c:/users/moezs/pycaret-demo-alteryx/predictions.csv', index=False)
# display in alteryx
Alteryx.write(predictions, 1)
When you successfully execute this workflow, it will generate predictions.csv.
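Because the saved pipeline expects the same feature columns it was trained on, it can help to check the incoming data before scoring. Here is a minimal sketch of such a check. The helper name is hypothetical, and the column list is my assumption of what the public insurance dataset contains, so verify it against your own file.

```python
TARGET = "charges"
# assumed columns of the insurance dataset; check against your file
TRAIN_COLUMNS = ["age", "sex", "bmi", "children", "smoker", "region", "charges"]

def check_scoring_columns(columns, train_columns=TRAIN_COLUMNS, target=TARGET):
    """Return (missing, unexpected) column names for a scoring dataset."""
    expected = [c for c in train_columns if c != target]
    missing = [c for c in expected if c not in columns]
    unexpected = [c for c in columns if c not in expected]
    return missing, unexpected

# in the Python tool you would pass list(data.columns) instead of these lists
ok = check_scoring_columns(["age", "sex", "bmi", "children", "smoker", "region"])
bad = check_scoring_columns(["age", "sex", "bmi", "charges"])
print(ok)
print(bad)
```

Two empty lists mean the data matches the training features. Anything in the first list is a feature the pipeline needs but did not receive, and anything in the second is a column, such as a leftover target, that was not a training feature.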