Introduction
I am excited to announce the release of PyCaret 2.0 today.
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the machine learning experiment cycle and makes you more productive.
In comparison with other open-source machine learning libraries, PyCaret is an alternative low-code library that can replace hundreds of lines of code with only a few words. This makes experiments exponentially faster and more efficient.
See detailed release notes for PyCaret 2.0.
Why use PyCaret?
Installing PyCaret 2.0
Installing PyCaret is very easy and takes only a few minutes. We strongly recommend using a virtual environment to avoid potential conflicts with other libraries. See the following example code to create a conda environment and install pycaret within it:
# create a conda environment
conda create --name yourenvname python=3.6
# activate conda environment
conda activate yourenvname
# install pycaret
pip install pycaret==2.0
# install notebook kernel
python -m ipykernel install --user --name yourenvname --display-name "display-name"
All hard dependencies are automatically installed when you install PyCaret using pip. Click here to see the complete list of dependencies.
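If you want to confirm that the installation succeeded inside the environment, a quick sanity check (assuming the version helper in pycaret.utils, available in 2.x releases) is:
# verify the installation from within Python
from pycaret.utils import version
version()  # prints the installed PyCaret version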
Getting Started with PyCaret 2.0
The first step of any machine learning experiment in PyCaret is to set up the environment by importing the relevant module and initializing the setup function, passing the data frame and the name of the target variable. See the example code:
# import module
from pycaret.classification import *
# initialize setup (in notebook)
clf1 = setup(data, target = 'target-variable')
# initialize setup (outside notebook)
clf1 = setup(data, target = 'target-variable', html = False)
# initialize setup (in silent mode)
clf1 = setup(data, target = 'target-variable', html = False, silent = True)
All the preprocessing transformations are applied within the setup function. PyCaret provides over 20 different preprocessing transformations that can be defined within the setup function. Click here to learn more about PyCaret’s preprocessing abilities.
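A few of these transformations can be switched on directly through setup arguments. Below is a minimal sketch assuming the standard parameter names for normalization, feature transformation, and multicollinearity removal:
# initialize setup with a few optional preprocessing steps
clf1 = setup(data, target = 'target-variable',
             normalize = True,                 # z-score scale numeric features
             transformation = True,            # apply a power transform to make features more Gaussian-like
             remove_multicollinearity = True,  # drop one feature from each highly correlated pair
             multicollinearity_threshold = 0.9)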
Compare Models
This is the first step we recommend in any supervised machine learning task. This function trains all the models in the model library using default hyperparameters and evaluates performance metrics using cross-validation. It returns the best trained model object (or a list of the top n models when n_select is used). The evaluation metrics used are:
- Classification: Accuracy, AUC, Recall, Precision, F1, Kappa, MCC
- Regression: MAE, MSE, RMSE, R2, RMSLE, MAPE
Here are a few ways you can use the compare_models function:
# import classification module
from pycaret.classification import *
# init setup
clf1 = setup(data, target = 'name-of-target')
# return best model
best = compare_models()
# return best model based on Recall metric
best = compare_models(sort = 'Recall')
# include certain models
best_specific = compare_models(whitelist = ['dt','rf','xgboost'])
# exclude certain models
best_specific = compare_models(blacklist = ['catboost','svm'])
# return top 3 models based on Accuracy
top3 = compare_models(n_select = 3)
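The scoring table that compare_models displays can also be captured programmatically; here is a minimal sketch using the pull utility (covered under Util functions below):
# capture the comparison scoring grid as a pandas DataFrame
best = compare_models()
comparison_grid = pull()
print(comparison_grid.head())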
Create Model
The create_model function trains a model using default hyperparameters and evaluates performance metrics using cross-validation. This function is the base for almost all other functions in PyCaret. It returns the trained model object. Here are a few ways you can use this function:
# import classification module
from pycaret.classification import *
# init setup
clf1 = setup(data, target = 'name-of-target')
# train logistic regression model
lr = create_model('lr') #lr is the id of the model
# check the model library to see all estimators
models()
# train rf model using 5 fold CV
rf = create_model('rf', fold = 5)
# train svm model without CV
svm = create_model('svm', cross_validation = False)
# train xgboost model with max_depth = 10
xgboost = create_model('xgboost', max_depth = 10)
# train xgboost model on gpu
xgboost_gpu = create_model('xgboost', tree_method = 'gpu_hist', gpu_id = 0) #0 is gpu-id
# train multiple lightgbm models with different learning rates
import numpy as np
lgbms = [create_model('lightgbm', learning_rate = i) for i in np.arange(0.1,1,0.1)]
# train custom model
from gplearn.genetic import SymbolicClassifier
symclf = SymbolicClassifier(generations = 50)
sc = create_model(symclf)
To learn more about the create_model function, click here.
Tune Model
The tune_model function tunes the hyperparameters of the model passed as an estimator. It uses random grid search with pre-defined tuning grids that are fully customizable. Here are a few ways you can use this function:
# import classification module
from pycaret.classification import *
# init setup
clf1 = setup(data, target = 'name-of-target')
# train a decision tree model
dt = create_model('dt')
# tune hyperparameters of decision tree
tuned_dt = tune_model(dt)
# tune hyperparameters with increased n_iter
tuned_dt = tune_model(dt, n_iter = 50)
# tune hyperparameters to optimize AUC
tuned_dt = tune_model(dt, optimize = 'AUC') #default is 'Accuracy'
# tune hyperparameters with custom_grid
params = {"max_depth": np.random.randint(1, (len(data.columns)*.85),20), "max_features": np.random.randint(1, len(data.columns),20), "min_samples_leaf": [2,3,4,5,6], "criterion": ["gini", "entropy"]}
tuned_dt_custom = tune_model(dt, custom_grid = params)
# tune multiple models dynamically
top3 = compare_models(n_select = 3)
tuned_top3 = [tune_model(i) for i in top3]
To learn more about the tune_model function, click here.
Ensemble Model
There are a few functions available for ensembling base learners; ensemble_model, blend_models, and stack_models are three of them. Here are a few ways you can use these functions:
# import classification module
from pycaret.classification import *
# init setup
clf1 = setup(data, target = 'name-of-target')
# train a decision tree model
dt = create_model('dt')
# train a bagging classifier on dt
bagged_dt = ensemble_model(dt, method = 'Bagging')
# train an adaboost classifier on dt with 100 estimators
boosted_dt = ensemble_model(dt, method = 'Boosting', n_estimators = 100)
# train a voting classifier on all models in the library
blender = blend_models()
# train a voting classifier on specific models
dt = create_model('dt')
rf = create_model('rf')
adaboost = create_model('ada')
blender_specific = blend_models(estimator_list = [dt,rf,adaboost], method = 'soft')
# train a voting classifier dynamically
blender_top5 = blend_models(compare_models(n_select = 5))
# train a stacking classifier
stacker = stack_models(estimator_list = [dt,rf], meta_model = adaboost)
# stack multiple models dynamically
top7 = compare_models(n_select = 7)
stacker = stack_models(estimator_list = top7[1:], meta_model = top7[0])
To learn more about ensemble models in PyCaret, click here.
Predict Model
As the name suggests, this function is used for inference / prediction. Here is how you can use it:
# train a catboost model
catboost = create_model('catboost')
# predict on holdout set (when no data is passed)
pred_holdout = predict_model(catboost)
# predict on a new dataset
import pandas as pd
new_data = pd.read_csv('new-data.csv')
pred_new = predict_model(catboost, data = new_data)
Plot Model
The plot_model function is used to evaluate the performance of a trained machine learning model. Here is an example:
# import classification module
from pycaret.classification import *
# init setup
clf1 = setup(data, target = 'name-of-target')
# train adaboost model
adaboost = create_model('ada')
# AUC plot
plot_model(adaboost, plot = 'auc')
# Decision Boundary
plot_model(adaboost, plot = 'boundary')
# Precision Recall Curve
plot_model(adaboost, plot = 'pr')
# Validation Curve
plot_model(adaboost, plot = 'vc')
Click here to learn more about the different visualizations in PyCaret. Alternatively, you can use the evaluate_model function to see plots via a user interface within the notebook.
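For example, calling it on the adaboost model trained above opens the interactive plot selector:
# interactive plot selection inside the notebook
evaluate_model(adaboost)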
Util functions
PyCaret 2.0 includes several new util functions that come in handy when managing your machine learning experiments with PyCaret. Some of them are shown below:
# select and finalize the best model in the active run
best_model = automl() #returns the best model based on CV score
# select and finalize the best model based on 'F1' on hold_out set
best_model_holdout = automl(optimize = 'F1', use_holdout = True)
# save model
save_model(model, 'c:/path-to-directory/model-name')
# load model
model = load_model('c:/path-to-directory/model-name')
# retrieve score grid as pandas df
dt = create_model('dt')
dt_results = pull()
# get global environment variable
X_train = get_config('X_train')
seed = get_config('seed')
# set global environment variable
set_config('seed', 999)
# get experiment logs as csv file
logs = get_logs()
# get system logs for audit
system_logs = get_system_logs()
To see all the new functions implemented in PyCaret 2.0, see the release notes.
Experiment Logging
PyCaret 2.0 embeds the MLflow tracking component as a backend API and UI for logging parameters, code versions, metrics, and output files when running your machine learning code, and for visualizing the results later. Here is how you can log your experiment in PyCaret:
# import classification module
from pycaret.classification import *
# init setup
clf1 = setup(data, target = 'name-of-target', log_experiment = True, experiment_name = 'exp-name-here')
# compare models
best = compare_models()
# start mlflow server (with notebook)
!mlflow ui
Output (on http://localhost:5000)
Putting it all together — Create your own AutoML
Using all the functions above, let’s create a simple command line program that trains multiple models with default parameters, tunes the hyperparameters of the top candidate models, tries different ensembling techniques, and returns/saves the best model. Here is the command line script:
# import libraries
import pandas as pd
import sys
# define command line parameters
data = sys.argv[1]
target = sys.argv[2]
# load data (replace this part with your own script)
from pycaret.datasets import get_data
input_data = get_data(data)
# init setup
from pycaret.classification import *
clf1 = setup(data = input_data, target = target, log_experiment = True)
# compare baseline models
top5 = compare_models(n_select = 5)
# tune top5 models
tuned_top5 = [tune_model(i) for i in top5]
# ensemble top5 tuned models
bagged_tuned_top5 = [ensemble_model(i, method = 'Bagging') for i in tuned_top5]
# blend top5 models
blender = blend_models(estimator_list = top5)
# stack top5 models
stacker = stack_models(estimator_list = top5[1:], meta_model = top5[0])
# select best model based on recall
best_model = automl(optimize = 'Recall')
# save model
save_model(best_model, 'c:/path-to-directory/final-model')
This script will dynamically select and save the best model. In just a few lines of code, you have developed your own AutoML software with a full-fledged logging system and even a UI presenting a beautiful leaderboard.
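Once the script is saved to a file (the file name below is just a placeholder), it can be run from the command line with the dataset name and target column as arguments; for instance, using the 'juice' dataset shipped with PyCaret, whose target column is 'Purchase':
# run the AutoML script (assumed saved as automl.py)
python automl.py juice Purchase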
There is no limit to what you can achieve using this lightweight workflow automation library in Python. If you find it useful, please do not forget to give us a ⭐️ on our GitHub repo.
To hear more about PyCaret, follow us on LinkedIn and YouTube.