

I am excited to announce PyCaret 2.2 — update for the month of Oct 2020.

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the machine learning experiment cycle and makes you more productive.

Compared with other open-source machine learning libraries, PyCaret is an alternate low-code library that can replace hundreds of lines of code with only a few words. This makes experiments much faster and more efficient.

Release Notes:


Installing PyCaret

Installing PyCaret is very easy and takes only a few minutes. We strongly recommend using a virtual environment to avoid potential conflicts with other libraries. See the following example code to create a conda environment and install pycaret within that conda environment:

# create a conda environment
conda create --name yourenvname python=3.6

# activate conda environment
conda activate yourenvname

# install pycaret
pip install pycaret

# install notebook kernel
python -m ipykernel install --user --name yourenvname --display-name "display-name"

PyCaret’s default installation is a slim version of pycaret which only installs hard dependencies that are listed here. To install the full version of pycaret, use the following code:

# install the full version
pip install pycaret[full]

When you install the full version of pycaret, all the optional dependencies as listed here are also installed.

Installing the nightly build

PyCaret is evolving very fast. Often, you want to have access to the latest features but want to avoid compiling PyCaret from source or waiting for the next release. Fortunately, you can now install pycaret-nightly using pip.

# install the nightly build
pip install pycaret-nightly

# or install the full version of the nightly build
pip install pycaret-nightly[full]

PyCaret 2.2 Feature Summary

GPU Enabled Training

PyCaret 2.2 provides the option to use GPU for select model training and hyperparameter tuning. The API itself does not change; however, in some cases additional libraries must be installed because they are not included in either the default slim version or the full version. The following models can now be trained on GPU.

  • Extreme Gradient Boosting (requires no further installation)
  • CatBoost (requires no further installation)
  • Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, K Neighbors Regressor, Support Vector Machine, Linear Regression, Ridge Regression, Lasso Regression, K-Means Clustering, and Density-Based Spatial Clustering (requires cuML >= 0.15)

To enable Light Gradient Boosting Machine on GPU, you will have to install the GPU-enabled version of LightGBM. The official step-by-step tutorial to do that is here.

If you are using Google Colab, you can install Light Gradient Boosting Machine for GPU, but you must first uninstall the CPU version of LightGBM. Before doing that, ensure that GPU is enabled in your Colab session. Use the following code to install GPU-enabled LightGBM:

# uninstall lightgbm CPU
pip uninstall lightgbm -y

# install lightgbm GPU
pip install lightgbm --install-option=--gpu --install-option="--opencl-include-dir=/usr/local/cuda/include/" --install-option="--opencl-library=/usr/local/cuda/lib64/"

As of today, cuML 0.15 is not supported on Google Colab. This may change in the future but for now, you can use blazingSQL Notebook for free which comes pre-installed with cuML 0.15.

Once you sign in to your account, start a Python 3 notebook and use the following code to install pycaret:

# install pycaret on blazingSQL
!/opt/conda-environments/rapids-stable/bin/python -m pip install --upgrade pycaret

Alternatively, if you have GPU on your local machine or you are planning to use any other cloud service with GPU, you can follow the official installation guide for cuML.

Assuming the installation is successful, the only thing that needs to be done to train models on GPU is to enable GPU when initializing the setup function.

# import dataset
from pycaret.datasets import get_data
data = get_data('poker')

# initialize the setup
from pycaret.classification import *
clf = setup(data, target = 'CLASS', use_gpu = True)

That’s it. You can now use pycaret exactly the way you would on the CPU. It automatically uses the GPU for model training where possible and otherwise falls back to the CPU-equivalent algorithms. Even before you start training, you can check which models are enabled on GPU by using the following command:

# check models available on GPU
models(internal=True)[['Name', 'GPU Enabled']]

Benchmark Comparisons CPU vs GPU (Time in Seconds)

Hyperparameter Tuning

New methods for hyperparameter tuning are now available. Up until PyCaret 2.1, the only way to tune the hyperparameters of your model in PyCaret was Random Grid Search from scikit-learn. New methods added in 2.2 are:

  • scikit-learn (grid)
  • scikit-optimize (bayesian)
  • tune-sklearn (random, grid, bayesian, hyperopt, bohb)
  • optuna (random, tpe)

To use these new methods, two new parameters ‘search_library’ and ‘search_algorithm’ have been added.

# train dt using default hyperparameters
dt = create_model('dt')

# tune hyperparameters with scikit-learn (default)
tuned_dt_sklearn = tune_model(dt)

# tune hyperparameters with scikit-optimize
tuned_dt_skopt = tune_model(dt, search_library = 'scikit-optimize')

# tune hyperparameters with optuna
tuned_dt_optuna = tune_model(dt, search_library = 'optuna')

# tune hyperparameters with tune-sklearn
tuned_dt_tuneskl = tune_model(dt, search_library = 'tune-sklearn')

The available values of search_algorithm depend on the search_library. The following search algorithms are available for the respective search libraries:

  • scikit-learn → random (default), grid
  • scikit-optimize → bayesian (default)
  • tune-sklearn → random (default), grid, bayesian, hyperopt, bohb
  • optuna → random, tpe (default)
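
The original examples only vary search_library, so here is a minimal sketch of how the two parameters fit together. The mapping is copied from the list above; the helper tuner_kwargs is hypothetical (it is not part of PyCaret) and only validates a library/algorithm pair before you pass it on to tune_model.

```python
# Library -> (default algorithm, allowed algorithms), copied from the list above.
SEARCH_OPTIONS = {
    "scikit-learn": ("random", {"random", "grid"}),
    "scikit-optimize": ("bayesian", {"bayesian"}),
    "tune-sklearn": ("random", {"random", "grid", "bayesian", "hyperopt", "bohb"}),
    "optuna": ("tpe", {"random", "tpe"}),
}


def tuner_kwargs(search_library="scikit-learn", search_algorithm=None):
    """Hypothetical helper: check the pair and build keyword arguments for tune_model."""
    if search_library not in SEARCH_OPTIONS:
        raise ValueError(f"Unknown search_library: {search_library!r}")
    default, allowed = SEARCH_OPTIONS[search_library]
    # Fall back to the library's default algorithm when none is given.
    algorithm = search_algorithm or default
    if algorithm not in allowed:
        raise ValueError(
            f"{algorithm!r} is not available for {search_library!r}; "
            f"choose one of {sorted(allowed)}"
        )
    return {"search_library": search_library, "search_algorithm": algorithm}


kwargs = tuner_kwargs("optuna", "random")
print(kwargs)

# Inside a PyCaret session (after setup and create_model) this would be used as:
# tuned_dt = tune_model(dt, **kwargs)
```

Failing early on an invalid pair is just a convenience here; tune_model performs its own checks.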

Early stopping is also supported for estimators with the partial_fit attribute. Read more about it in the release notes.

Benchmark Comparisons of different tuners

Memory and Performance Improvements

PyCaret 2.2 is all about performance and functionality. A significant amount of code was refactored to improve memory footprint and optimize performance without impacting user-experience.

One example is that all numeric data is now dynamically cast to 32-bit from 64-bit, which reduces the memory footprint significantly. Another is that cross-validation in all functions is now automatically parallelized across multiple cores, whereas previously folds were trained sequentially.
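
To get a feel for why the 32-bit cast matters, the sketch below compares the storage needed for the same number of values at both widths. It uses Python's standard array module purely as an illustration; it is not PyCaret's internal code, and the row count is an arbitrary assumption.

```python
from array import array

n_values = 100_000  # arbitrary size, for illustration only

as_float64 = array("d", [0.0]) * n_values  # 'd' = C double (64-bit)
as_float32 = array("f", [0.0]) * n_values  # 'f' = C float (32-bit)

bytes64 = as_float64.itemsize * len(as_float64)
bytes32 = as_float32.itemsize * len(as_float32)
print(f"64-bit: {bytes64} bytes, 32-bit: {bytes32} bytes")

# The trade-off is precision: a 32-bit float cannot represent 0.1 as closely
# as a 64-bit float, so the round-tripped value differs slightly.
round_tripped = array("f", [0.1])[0]
print(round_tripped == 0.1)
```

Halving the width halves the storage for numeric columns, at the cost of roughly seven significant decimal digits instead of about sixteen.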

We have compared the performance of all released versions of PyCaret on 5M sampled rows from the famous New York Taxi Dataset. The figure below compares the time taken to complete the setup initialization:

All comparisons were done on an AMD64 machine with 8 CPU cores.

Adding Custom Metrics

You can now fully customize (add or remove) the metrics evaluated during cross-validation. This means that you are no longer limited to PyCaret’s default model evaluation metrics. Three new functions, get_metrics, add_metric, and remove_metric, have been added. The usage is super simple. See the example code:

# import dataset
from pycaret.datasets import get_data
data = get_data('juice')

# initialize the setup
from pycaret.classification import *
clf = setup(data, target = 'Purchase')

# check all metrics used for model evaluation
get_metrics()

# add log loss metric in pycaret
from sklearn.metrics import log_loss
add_metric('logloss', 'LogLoss', log_loss, greater_is_better=False)

# compare baseline models
best = compare_models()

Notice that a new column “LogLoss” (all new metrics are added on the right, before TT) is added in the compare_models score grid because we added the metric using the add_metric function. You can use any metric available in scikit-learn or you can create your own using the make_scorer function. You can remove the metric using the following command:

# remove custom metric
remove_metric('logloss')
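
If no ready-made scikit-learn metric fits, you can write your own scoring function. The sketch below defines a hypothetical specificity metric in plain Python. It assumes add_metric accepts any function that takes the true and predicted labels in that order, as log_loss does in the example above, and that the negative class is encoded as 0.

```python
def specificity(y_true, y_pred, negative_label=0):
    """True negative rate: TN / (TN + FP). Hypothetical example metric."""
    pairs = list(zip(y_true, y_pred))
    true_neg = sum(1 for t, p in pairs if t == negative_label and p == negative_label)
    false_pos = sum(1 for t, p in pairs if t == negative_label and p != negative_label)
    total_neg = true_neg + false_pos
    # Return 0.0 when there are no negative samples, to avoid dividing by zero.
    return true_neg / total_neg if total_neg else 0.0


# Three actual negatives, two of them predicted correctly.
score = specificity([0, 0, 1, 1, 0], [0, 1, 1, 0, 0])
print(round(score, 4))

# Inside a PyCaret session (after setup) it could be registered like the
# log loss example above:
# add_metric('specificity', 'Specificity', specificity)
```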

Iterative Imputation

Iterative imputation is a technique of imputing missing data using regression and classification estimators to model each feature as a function of other features. Each feature is imputed in a round-robin fashion, previous predictions being used in new ones. This process is repeated several times in order to increase the quality of imputation. Compared to simple imputation, it can create synthetic values that are closer to real values, at a cost of additional processing time. Staying true to the spirit of PyCaret, the usage is super simple:

# initialize setup
from pycaret.classification import *
clf = setup(data, target = 'Class', imputation_type="iterative")

By default, it uses Light Gradient Boosting Machine as the estimator for both categorical features (classification) and numeric features (regression). You can change these estimators with the categorical_iterative_imputer and numeric_iterative_imputer parameters in the setup.
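
To make the round-robin idea concrete, here is a deliberately tiny sketch in plain Python. It is not PyCaret's implementation: it assumes exactly two numeric columns, uses a simple least-squares line instead of LightGBM, and the function names and sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def iterative_impute(rows, n_iter=10):
    """Two-column round-robin imputation; None marks a missing value."""
    missing = [[value is None for value in row] for row in rows]
    filled = [list(row) for row in rows]

    # Step 1: start from simple mean imputation.
    for col in range(2):
        observed = [row[col] for row in rows if row[col] is not None]
        col_mean = sum(observed) / len(observed)
        for i in range(len(rows)):
            if missing[i][col]:
                filled[i][col] = col_mean

    # Step 2: repeatedly model each column as a function of the other one,
    # reusing the previous round's predictions as inputs.
    for _ in range(n_iter):
        for col in range(2):
            other = 1 - col
            train = [i for i in range(len(rows)) if not missing[i][col]]
            intercept, slope = fit_line(
                [filled[i][other] for i in train],
                [filled[i][col] for i in train],
            )
            for i in range(len(rows)):
                if missing[i][col]:
                    filled[i][col] = intercept + slope * filled[i][other]
    return filled


data = [(1.0, 2.0), (2.0, 4.1), (3.0, None), (None, 8.2), (5.0, 9.9)]
completed = iterative_impute(data)
for row in completed:
    print(row)
```

Only the missing cells are ever overwritten; observed values pass through unchanged, which is also how the real technique behaves.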

Benchmark comparisons of iterative imputation vs. simple imputation

To compare the results of the simple mean imputation with iterative imputation we have used the horse colic dataset that contains a large number of missing values. The figure below compares the performance of the Logistic Regression with different imputation methods.

Using Iterative Imputer with KNN as an estimator for both categorical and numeric features improved the mean AUC score by 0.014 (1.59%) compared to simple mean imputation. To learn more about this feature, you can read the complete blog post here.

Fold Strategy

PyCaret 2.2 provides flexibility to define the fold strategy. Up until PyCaret 2.1, you could not define the cross-validation strategy. It used ‘StratifiedKFold’ for Classification and ‘KFold’ for Regression, which limited the use of PyCaret for certain use cases, for example, time-series data.

To overcome this problem, a new parameter ‘fold_strategy’ has been added to the setup function. It can take the following values:

  • kfold for KFold CV;
  • stratifiedkfold for Stratified KFold CV;
  • groupkfold for Group KFold CV;
  • timeseries for TimeSeriesSplit CV; or
  • a custom CV generator object compatible with scikit-learn.
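
For the last option, a "custom CV generator object compatible with scikit-learn" is, roughly, any object with split and get_n_splits methods. The class below is a hypothetical, dependency-free sketch of an expanding-window splitter for ordered data. The class name and its parameters are illustrative, and scikit-learn splitters normally yield NumPy index arrays rather than lists, so treat this as a shape to adapt rather than a drop-in.

```python
class ExpandingWindowSplit:
    """Hypothetical splitter: train on everything before each test window."""

    def __init__(self, n_splits=3, test_size=2):
        self.n_splits = n_splits
        self.test_size = test_size

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n_samples = len(X)
        first_test_start = n_samples - self.n_splits * self.test_size
        if first_test_start <= 0:
            raise ValueError("Not enough samples for the requested splits.")
        for fold in range(self.n_splits):
            test_start = first_test_start + fold * self.test_size
            # Training indices always precede the test window, so no future
            # rows leak into training.
            train_idx = list(range(test_start))
            test_idx = list(range(test_start, test_start + self.test_size))
            yield train_idx, test_idx


cv = ExpandingWindowSplit(n_splits=3, test_size=2)
splits = list(cv.split(list(range(10))))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)

# In a PyCaret session it would be passed through the new parameter, e.g.:
# clf = setup(data, target = 'Purchase', fold_strategy = cv)
```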

Compare Models on the hold-out set

If you have used PyCaret before, you are probably familiar with its most used function, compare_models. This function trains and evaluates the performance of all estimators available in the model library using cross-validation. However, if you are dealing with very large datasets, compare_models can take a very long time to finish, because it fits 10 folds for each estimator in the model library. For Classification, this means 15 x 10 = 150 estimators in total.

In PyCaret 2.2 we have introduced a new parameter, cross_validation, in the compare_models function. When set to False, it evaluates all metrics on the hold-out set instead of cross-validating. While it may not be advisable to rely solely on hold-out metrics, especially when the dataset is small, it is a huge time saver when working with large datasets.

To quantify the impact, we have compared the performance of compare_models in both scenarios (with cross_validation = True and cross_validation = False). The dataset used for this comparison is here (45K x 50).

With Cross-Validation (It took 7 min 13s):

Without Cross-Validation (It took 1 min 19s):
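
Putting the two timings above side by side, a quick back-of-the-envelope calculation shows the size of the saving. The numbers are the ones reported above; nothing else is measured here.

```python
with_cv_seconds = 7 * 60 + 13     # 7 min 13 s with cross-validation
without_cv_seconds = 1 * 60 + 19  # 1 min 19 s on the hold-out set only

speedup = with_cv_seconds / without_cv_seconds
saved_seconds = with_cv_seconds - without_cv_seconds

print(f"Speed-up: about {speedup:.1f}x")
print(f"Time saved: {saved_seconds // 60} min {saved_seconds % 60} s")
```

The speed-up is smaller than the 10x you might expect from dropping 10 folds, presumably because each estimator is still trained once on the training set and other fixed overhead remains.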

Custom Transformations

This is a home run when it comes to flexibility. A new parameter custom_pipeline has been added to the setup function, which can take any transformer and append it to the preprocessing pipeline of PyCaret. All custom transformations are applied after train_test_split on each CV fold separately to avoid the risk of target leakage. The usage is super simple:

# import dataset
from pycaret.datasets import get_data
data = get_data('juice')

# create custom transformations
from imblearn.over_sampling import SMOTE
from sklearn.decomposition import PCA
custom_pp = [("PCA",PCA()),("smote", SMOTE())]

# initialize setup
from pycaret.classification import *
clf = setup(data, target = 'Purchase', custom_pipeline = custom_pp)

Separate Train and Test Set

This is long-awaited and one of the most requested features since the first release. Now you can pass a separate test set instead of relying on pycaret’s internal train_test_split. A new parameter ‘test_data’ has been added to the setup. When a DataFrame is passed into the test_data, it is used as a test set and the train_size parameter is ignored. test_data must be labeled. See the example code below:

# loading dataset
import pandas as pd
train_data = pd.read_csv('/path/train.csv')
test_data = pd.read_csv('/path/test.csv')

# initializing setup
from pycaret.classification import *
clf = setup(data = train_data, test_data = test_data)

Disable Preprocessing

If you don’t want to use PyCaret’s default preprocessing pipeline, or you already have a transformed dataset and just want to use PyCaret’s modeling capabilities, this wasn’t possible before, but now we’ve got you covered. Simply turn off the ‘preprocess’ parameter in the setup. When preprocess is set to False, no transformations are applied except for train_test_split and custom transformations passed in the custom_pipeline parameter.

# initializing setup
from pycaret.classification import *
clf = setup(data = train_data, preprocess = False)

However, when turning off preprocessing in the setup, you have to ensure that your data is modeling-ready (i.e. no missing values, no dates/timestamps, categorical data is encoded, etc.).

Other Changes

  • New plots ‘lift’, ‘gain’, and ‘tree’ have been added in the plot_model.
  • CatBoost is now compatible with the plot_model function. It requires catboost >= 0.23.2.
  • In order to make both the usage and development easier, type hints have been added to all updated pycaret functions, in accordance with best practices. Users can leverage those by using an IDE with support for type hints.

To learn more about all the updates in PyCaret 2.2, please see the release notes.