Introduction
In our last post, we demonstrated how to build an anomaly detector in Power BI by integrating it with PyCaret, allowing analysts and data scientists to add a layer of machine learning to their reports and dashboards without any additional license costs.
In this post, we will see how we can implement Clustering Analysis in Power BI using PyCaret. If you haven’t heard about PyCaret before, please read this announcement to learn more.
Learning Goals of this Tutorial
- What is Clustering? Types of Clustering.
- Train and implement an unsupervised Clustering model in Power BI.
- Analyze results and visualize information in a dashboard.
- How to deploy the Clustering model in Power BI production?
Before we start
If you have used Python before, it is likely that you already have Anaconda Distribution installed on your computer. If not, click here to download Anaconda Distribution with Python 3.7 or greater.
Setting up the Environment
Before we start using PyCaret’s machine learning capabilities in Power BI, we have to create a virtual environment and install pycaret. It’s a three-step process:
Step 1: Create an anaconda environment
Open Anaconda Prompt from the Start menu and execute the following code:
# create new conda environment
conda create --name myenv python=3.7
# activate it so that the next step installs into this environment
conda activate myenv
Step 2: Install PyCaret
Execute the following code in Anaconda Prompt:
# install pycaret
pip install pycaret
Installation may take 15–20 minutes. If you are having issues with installation, please see our GitHub page for known issues and resolutions.
Step 3: Set Python Directory in Power BI
The virtual environment created must be linked with Power BI. This can be done using Global Settings in Power BI Desktop (File → Options → Global → Python scripting). By default, Anaconda environments are created under:
C:\Users\username\AppData\Local\Continuum\anaconda3\envs\myenv
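Before pointing Power BI at this folder, it can help to confirm that the environment really contains PyCaret. The snippet below is a minimal sketch using only the Python standard library; the helper name `environment_report` is our own and not part of PyCaret or Power BI. Run it from the activated environment.

```python
import importlib.util
import sys


def environment_report(package="pycaret"):
    """Return the interpreter path and whether `package` can be imported."""
    return {
        "python_executable": sys.executable,
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "package": package,
        # find_spec returns None when a top-level package is not installed
        "package_found": importlib.util.find_spec(package) is not None,
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `python_executable` sits inside the `myenv` folder and `package_found` is `True`, the folder containing that executable is the one to enter in the Python scripting settings.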
What is Clustering?
Clustering is a technique that groups data points with similar characteristics. These groupings are useful for exploring data, identifying patterns, and analyzing a subset of data. Organizing data into clusters helps identify underlying structures in the data and finds applications across many industries. Some common business use cases for clustering are:
- Customer segmentation for the purpose of marketing.
- Customer purchasing behavior analysis for promotions and discounts.
- Identifying geo-clusters in an epidemic outbreak such as COVID-19.
Types of Clustering
Given the subjective nature of clustering tasks, there are various algorithms that suit different types of problems. Each algorithm has its own rules and its own mathematics for calculating clusters.
This tutorial is about implementing a clustering analysis in Power BI using a Python library called PyCaret. Discussion of the specific algorithmic details and mathematics behind these algorithms is out-of-scope for this tutorial.
In this tutorial we will use the K-Means algorithm, which is one of the simplest and most popular unsupervised machine learning algorithms. If you would like to learn more about K-Means, you can read this paper.
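For readers who want an intuition without the mathematics, the core loop of K-Means can be sketched in a few lines of plain Python. This is an illustration only and not PyCaret's implementation: the function name `kmeans` is our own, the data points are made up, and the first k points are used as starting centroids to keep the result deterministic, whereas real implementations choose starting points more carefully.

```python
def kmeans(points, k, iterations=10):
    """Plain K-Means: assign each point to its nearest centroid, then move
    each centroid to the mean of its assigned points, and repeat."""
    centroids = [tuple(p) for p in points[:k]]  # simplification: first k points as seeds
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Made-up two-dimensional points forming two obvious groups
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```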
Setting the Business Context
In this tutorial we will use the current health expenditure dataset from the World Health Organization’s Global Health Expenditure database. The dataset contains health expenditure as a % of national GDP for over 200 countries from 2000 through 2017.
Our objective is to find patterns and groups in this data by using a K-Means clustering algorithm.
Let’s get started
Now that you have set up the Anaconda Environment, installed PyCaret, understand the basics of Clustering Analysis, and have the business context for this tutorial, let’s get started.
Get Data
The first step is importing the dataset into Power BI Desktop. You can load the data using a web connector. (Power BI Desktop → Get Data → From Web).
Link to csv file:
https://github.com/pycaret/powerbi-clustering/blob/master/clustering.csv
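Note that the link above points at GitHub's file viewer page. If you want to read the same file from a script, or if a connector returns page markup instead of a table, the raw file address is usually what is needed. The sketch below derives it with the standard library only; `github_blob_to_raw` is a hypothetical helper of ours, and it assumes the branch name contains no slashes.

```python
from urllib.parse import urlsplit, urlunsplit


def github_blob_to_raw(url):
    """Convert a github.com '/blob/' page URL to its raw.githubusercontent.com form."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    # Expected shape: <owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {url}")
    owner, repo, _, branch, *file_path = segments
    raw_path = "/" + "/".join([owner, repo, branch, *file_path])
    return urlunsplit(("https", "raw.githubusercontent.com", raw_path, "", ""))


print(github_blob_to_raw(
    "https://github.com/pycaret/powerbi-clustering/blob/master/clustering.csv"
))
```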
Model Training
To train a clustering model in Power BI we will have to execute a Python script in Power Query Editor (Power Query Editor → Transform → Run Python script). Run the following code as a Python script:
from pycaret.clustering import *
dataset = get_clusters(dataset, num_clusters=5, ignore_features=['Country'])
We have excluded the ‘Country’ column from training using the ignore_features parameter. There can be many reasons why you might not want to use certain columns for training a machine learning algorithm.
PyCaret lets you hide unneeded columns from the algorithm instead of dropping them from the dataset, since you might need those columns for later analysis. In this case, we don’t want to use ‘Country’ for training, so we have passed it under ignore_features.
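The idea of hiding a column rather than dropping it can be sketched with pandas alone. Everything here is illustrative: the small table is made up, and the labelling rule is a deliberately trivial placeholder standing in for a real clustering step, not what PyCaret does internally.

```python
import pandas as pd

# Hypothetical stand-in for the `dataset` DataFrame that Power BI hands to the script
dataset = pd.DataFrame(
    {
        "Country": ["A", "B", "C"],
        "2000": [2.0, 9.0, 2.5],
        "2001": [2.2, 9.4, 2.4],
    }
)
ignore_features = ["Country"]

# Only the remaining columns are shown to the algorithm
training_view = dataset.drop(columns=ignore_features)

# Placeholder for a real clustering step: split rows on their average value
row_means = training_view.mean(axis=1)
labels = ["high" if m > row_means.mean() else "low" for m in row_means]

# The label is attached to the full table, so 'Country' is still available
result = dataset.assign(Cluster=labels)
print(result)
```

The point is the shape of the result: the ignored column never reaches the training step, yet it is still present next to the label for reporting.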
By default, PyCaret trains a K-Means Clustering model with 4 clusters (the script above overrides this with num_clusters=5). Default values can be changed easily:
- To change the model type, use the model parameter within get_clusters().
- To change the cluster number, use the num_clusters parameter.
The following example code trains a K-Modes Clustering model with 6 clusters.
from pycaret.clustering import *
dataset = get_clusters(dataset, model='kmodes', num_clusters=6, ignore_features=['Country'])
A new column that contains the cluster label is attached to the original dataset. All the year columns are then unpivoted to normalize the data so it can be used for visualization in Power BI.
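The unpivot step can also be expressed in pandas, which is handy when testing the reshaping outside Power BI. This is a sketch on a made-up miniature table; the column names `Cluster`, `Year`, and `Expenditure`, and the label values, are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical miniature of the clustered table: one row per country,
# one column per year, plus the cluster label column
wide = pd.DataFrame(
    {
        "Country": ["A", "B"],
        "2000": [4.1, 7.9],
        "2001": [4.3, 8.2],
        "Cluster": ["Cluster 0", "Cluster 1"],
    }
)

# Turn the year columns into rows: one row per country and year
long_format = wide.melt(
    id_vars=["Country", "Cluster"],
    var_name="Year",
    value_name="Expenditure",
)
print(long_format)
```

The long format gives Power BI a single Year field and a single value field, which is what most visuals expect for axes and slicers.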
Here’s what the final output looks like in Power BI.
Dashboard
Once you have cluster labels in Power BI, here’s an example of how you can visualize them in a dashboard to generate insights:
You can download the PBIX file and the data set from our GitHub.
Implementing Clustering in Production
What we demonstrated above is one simple way to implement Clustering in Power BI. However, it is important to note that this method re-trains the clustering model every time the Power BI dataset is refreshed. This may be a problem for two reasons:
- When the model is re-trained with new data, the cluster labels may change (e.g., some data points that were labeled as Cluster 1 earlier might be labeled as Cluster 2 when re-trained).
- You don’t want to spend hours every day re-training the model.
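The first point is easy to underestimate, because cluster names are arbitrary: two training runs can find exactly the same groups and still number them differently. The sketch below uses made-up labels and a hypothetical helper, `same_grouping`, to show that a plain comparison reports a change even when the grouping itself is identical.

```python
def same_grouping(labels_a, labels_b):
    """True if two labelings split the rows into identical groups,
    even when the cluster names differ."""
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label in one run must map to exactly one label in the other
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


first_refresh = ["Cluster 0", "Cluster 0", "Cluster 1", "Cluster 2"]
second_refresh = ["Cluster 2", "Cluster 2", "Cluster 0", "Cluster 1"]

print(first_refresh == second_refresh)               # False
print(same_grouping(first_refresh, second_refresh))  # True
```

A dashboard filter or color rule tied to a specific label, such as "Cluster 1", would silently point at a different group after such a refresh.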
A more productive way to implement clustering in Power BI is to use a pre-trained model for generating cluster labels instead of re-training the model every time.
Training the Model Beforehand
You can use any Integrated Development Environment (IDE) or Notebook for training machine learning models. In this example, we have used Visual Studio Code to train a clustering model.
A trained model is then saved as a pickle file and imported into Power Query for generating cluster labels.
If you would like to learn more about implementing Clustering Analysis in a Jupyter notebook using PyCaret, watch this 2-minute video tutorial:
Using the pre-trained model
Execute the code below as a Python script to generate labels from the pre-trained model.
from pycaret.clustering import *
dataset = predict_model('c:/.../clustering_deployment_20052020', data = dataset)
The output of this will be the same as the one we saw above. The difference is that when you use a pre-trained model, the label is generated on a new dataset using the same model instead of re-training the model.
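The train-once, reuse-on-every-refresh pattern described in this section can be illustrated with Python's pickle module alone. This is a toy sketch, not PyCaret's saved model format: the "model" is just a dictionary of made-up centroids, the file name is hypothetical, and `nearest_centroid` is our own helper.

```python
import os
import pickle
import tempfile


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


# Pretend these centroids came out of a one-off training run (made-up values)
trained = {"centroids": [(1.0, 1.0), (8.0, 8.0)]}

# Save once, at training time
model_path = os.path.join(tempfile.mkdtemp(), "toy_clustering_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(trained, f)

# Load on every refresh and label new rows without re-training
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

new_rows = [(0.8, 1.3), (7.5, 8.4)]
labels = [nearest_centroid(row, loaded["centroids"]) for row in new_rows]
print(labels)  # [0, 1]
```

Because the centroids are fixed in the saved file, the same row always receives the same label, which is what keeps the dashboard stable between refreshes.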
Making it work on Power BI Service
Once you’ve uploaded the .pbix file to the Power BI service, a couple more steps are necessary to enable seamless integration of the machine learning pipeline into your data pipeline. These include:
- Enable scheduled refresh for the dataset — to enable a scheduled refresh for the workbook that contains your dataset with Python scripts, see Configuring scheduled refresh, which also includes information about Personal Gateway.
- Install the Personal Gateway — you need a Personal Gateway installed on the machine where the file is located, and where Python is installed; the Power BI service must have access to that Python environment. You can get more information on how to install and configure Personal Gateway.
If you are interested in learning more about Clustering Analysis, check out our Notebook Tutorial.