In my last post, we demonstrated how to implement clustering analysis in Power BI by integrating it with PyCaret, thus allowing analysts and data scientists to add a layer of machine learning to their reports and dashboards without any additional license costs.
In this post, we will see how we can implement topic modeling in Power BI using PyCaret. If you haven’t heard about PyCaret before, please read this announcement to learn more.
Learning Goals of this Tutorial
- What is Natural Language Processing?
- What is Topic Modeling?
- Train and implement a Latent Dirichlet Allocation model in Power BI.
- Analyze results and visualize information in a dashboard.
Before we start
If you have used Python before, it is likely that you already have Anaconda Distribution installed on your computer. If not, click here to download Anaconda Distribution with Python 3.7 or greater.
Setting up the Environment
Before we start using PyCaret’s machine learning capabilities in Power BI we have to create a virtual environment and install pycaret. It’s a four-step process:
Step 1: Create an anaconda environment
Open Anaconda Prompt from the Start menu and execute the following code:
# create new conda environment
conda create --name powerbi python=3.7
“powerbi” is the name of the environment we have chosen. You can use any name you like.
Step 2: Install PyCaret
Execute the following code in Anaconda Prompt:
# activate the environment so that pycaret is installed into it
conda activate powerbi
# install pycaret
pip install pycaret
Installation may take 15–20 minutes. If you are having issues with installation, please see our GitHub page for known issues and resolutions.
Step 3: Set Python Directory in Power BI
The virtual environment created must be linked with Power BI. This can be done using Global Settings in Power BI Desktop (File → Options → Global → Python scripting). Anaconda Environment by default is installed under:
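The exact folder varies from machine to machine. As a quick check (a minimal sketch, not part of the original setup steps), you can run the following with the Python interpreter of the activated environment to print the folder it lives in. `environment_home` is a hypothetical helper name used only for illustration.

```python
import sys
from pathlib import Path


def environment_home() -> Path:
    """Return the root folder of the Python environment running this script."""
    # sys.prefix points at the root of the active environment. On Windows,
    # a conda environment keeps python.exe directly in this folder.
    return Path(sys.prefix).resolve()


if __name__ == "__main__":
    print("Environment folder:", environment_home())
    print("Python executable: ", sys.executable)
```

Run it after `conda activate powerbi`; the folder it prints is the one to point Power BI at.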
Step 4: Install Language Model
To perform NLP tasks, you must download the language model. Activate your conda environment in Anaconda Prompt, then execute the download commands:
# activate conda env
conda activate powerbi
# download the spaCy English language model
python -m spacy download en_core_web_sm
# download the TextBlob corpora
python -m textblob.download_corpora
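If you want to confirm that the installation steps worked, one option is a small check like the one below. It is only a sketch: it tests whether each package can be found for import in the current environment, and it does not verify that the TextBlob corpora were downloaded. The helper name `missing_packages` is hypothetical.

```python
from importlib.util import find_spec

# Packages this tutorial relies on. en_core_web_sm is the spaCy language
# model, which is installed as a regular Python package.
REQUIRED = ["pycaret", "spacy", "textblob", "en_core_web_sm"]


def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Not installed in this environment:", ", ".join(missing))
    else:
        print("All required packages were found.")
```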
What is Natural Language Processing?
Natural language processing (NLP) is a subfield of computer science and artificial intelligence that is concerned with the interactions between computers and human languages. In particular, NLP covers a broad range of techniques for programming computers to process and analyze large amounts of natural language data.
NLP-powered software helps us in our daily lives in various ways and it is likely that you have been using it without even knowing. Some examples are:
- Personal assistants: Siri, Cortana, Alexa.
- Auto-complete: In search engines (e.g., Google, Bing, Baidu, Yahoo).
- Spell checking: Almost everywhere, in your browser, your IDE (e.g., Visual Studio), desktop apps (e.g., Microsoft Word).
- Machine Translation: Google Translate.
- Document Summarization Software: Text compactor, Autosummarizer.
Topic Modeling is a type of statistical model used for discovering abstract topics in text data. It is one of many practical applications within NLP.
What is Topic Modeling?
A topic model is a type of statistical model that falls under unsupervised machine learning and is used for discovering abstract topics in text data. The goal of topic modeling is to automatically find the topics / themes in a set of documents.
Some common use-cases for topic modeling are:
- Summarizing large text data by classifying documents into topics (the idea is pretty similar to clustering).
- Exploratory Data Analysis to gain an understanding of data such as customer feedback forms, Amazon reviews, survey results, etc.
- Feature Engineering: creating features for supervised machine learning experiments such as classification or regression.
There are several algorithms used for topic modeling. Some common ones are Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-Negative Matrix Factorization (NMF). Each algorithm has its own mathematical details which will not be covered in this tutorial. We will implement a Latent Dirichlet Allocation (LDA) model in Power BI using PyCaret’s NLP module.
If you are interested in learning the technical details of the LDA algorithm, you can read this paper.
Text preprocessing for Topic Modeling
To get meaningful results from topic modeling, text data must be processed before it is fed to the algorithm. This is common to almost all NLP tasks. Text preprocessing differs from the classical preprocessing techniques used in machine learning for structured data (data in rows and columns).
PyCaret automatically preprocesses text data by applying over 15 techniques such as stop word removal, tokenization, lemmatization, bi-gram/tri-gram extraction, etc. If you would like to learn more about all the text preprocessing features available in PyCaret, click here.
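To give a feel for what a few of these steps do, here is a deliberately simplified sketch in plain Python. It is not PyCaret's implementation: the stop word list is a tiny made-up sample, lemmatization is left out, and the bi-gram step keeps every adjacent pair, whereas real pipelines typically keep only pairs that occur frequently. All function names are hypothetical.

```python
import re

# A tiny illustrative stop word list; real lists contain hundreds of words.
STOP_WORDS = {"a", "an", "and", "for", "her", "his", "in", "is", "of", "the", "to", "will"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little topical meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def bigrams(tokens):
    """Join each token with its successor: ['a', 'b', 'c'] gives ['a_b', 'b_c']."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


def preprocess(text):
    """Tokenize, remove stop words, and append bi-grams to the token list."""
    tokens = remove_stop_words(tokenize(text))
    return tokens + bigrams(tokens)


if __name__ == "__main__":
    sample = "Maria will use the loan to buy seeds for her farm."
    print(preprocess(sample))
```

The output of a step like this, a list of cleaned tokens per document, is the kind of input a topic modeling algorithm works with.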
Setting the Business Context
Kiva is an international non-profit founded in 2005 in San Francisco. Its mission is to expand financial access to underserved communities in order to help them thrive.
In this tutorial, we will use the open dataset from Kiva which contains loan information on 6,818 approved loan applicants. The dataset includes information such as loan amount, country, gender, and some text data which is the application submitted by the borrower.
Our objective is to analyze the text data in the ‘en’ column to find abstract topics and then use them to evaluate the effect of certain topics (or certain types of loans) on the default rate.
Let’s get started
Now that you have set up the Anaconda Environment, understand topic modeling, and have the business context for this tutorial, let’s get started.
The first step is importing the dataset into Power BI Desktop. You can load the data using a web connector. (Power BI Desktop → Get Data → From Web).
To train a topic model in Power BI, we have to execute a Python script in Power Query Editor (Power Query Editor → Transform → Run Python script). Run the following code as a Python script:
from pycaret.nlp import *
dataset = get_topics(dataset, text='en')
By default, PyCaret trains a Latent Dirichlet Allocation (LDA) model with 4 topics. Default values can be changed easily:
- To change the model type use the model parameter within get_topics().
- To change the number of topics, use the num_topics parameter.
The following example trains a Non-Negative Matrix Factorization model with 6 topics:
from pycaret.nlp import *
dataset = get_topics(dataset, text='en', model='nmf', num_topics=6)
New columns containing topic weights are attached to the original dataset. Here’s what the final output looks like in Power BI once you apply the query.
Once you have topic weights in Power BI, here’s an example of how you can visualize them in a dashboard to generate insights:
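The kind of insight mentioned in the business context, the default rate per topic, can also be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the calculation behind the dashboard: the rows, the topic names `Topic_0` and `Topic_1`, and the `defaulted` flag are all made up, and each document is simply assigned to the topic with its largest weight.

```python
from collections import defaultdict


def dominant_topic(weights):
    """Return the name of the topic with the largest weight in one row."""
    return max(weights, key=weights.get)


def default_rate_by_topic(rows):
    """Group rows by dominant topic and return the share of defaults per group."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for row in rows:
        topic = dominant_topic(row["weights"])
        totals[topic] += 1
        defaults[topic] += row["defaulted"]
    return {topic: defaults[topic] / totals[topic] for topic in totals}


if __name__ == "__main__":
    # Made-up rows: topic weights per document plus a 0/1 default flag.
    rows = [
        {"weights": {"Topic_0": 0.8, "Topic_1": 0.2}, "defaulted": 1},
        {"weights": {"Topic_0": 0.6, "Topic_1": 0.4}, "defaulted": 0},
        {"weights": {"Topic_0": 0.1, "Topic_1": 0.9}, "defaulted": 0},
    ]
    print(default_rate_by_topic(rows))
```

In Power BI itself, the same grouping can be done with a visual that aggregates the default flag by topic, so no extra script is required.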
You can download the PBIX file and the data set from our GitHub.
If you would like to learn more about implementing Topic Modeling in a Jupyter notebook using PyCaret, watch this 2-minute video tutorial:
If you are interested in learning more about Topic Modeling, you can also check out our NLP 101 Notebook Tutorial for beginners.