Sentiment Analysis of telegram chat history using Decision Tree Classifier model
As a continuation of my earlier blog “How to extract question and answer pairs from telegram chat using Python pandas?”, I am here to explain sentiment analysis on the same Telegram group chat history.
The assignment is to find the satisfied and unsatisfied members in the “Eradicate Diabetes” Telegram group and design a decision tree classifier model using the data.
A short introduction to “Eradicate Diabetes” (ED): ED is a community chat group that brings people together to combat diabetes using the power of crowdsourced healthcare. The group helps people reverse their Type 2 diabetes by providing information and support. It also believes in the holistic treatment of organs such as the liver, kidneys, and heart that have been damaged over years of abuse. It advises herbal-based treatments combined with dietary and lifestyle modifications that have been proven to successfully reverse diabetes.
As this is a continuation of my previous blog, I strongly recommend going through “How to extract question and answer pairs from telegram chat using Python pandas?” for more details on:
Extracting chat history from a telegram group as a JSON file
Data preprocessing steps to extract required data from the chat history messages
With that as the foundation, let’s get started with the coding for sentiment analysis of ED chat history, and let’s see how we arrived at the decision tree model for it.
1. Retrieve the required features for the model
Step 1: Import the required libraries. You need pandas and Python's built-in json module, because the input is a JSON file and the data is processed with pandas.
Step 2: Load the JSON file and retrieve only the messages from it into a pandas data frame.
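Steps 1 and 2 can be sketched roughly as follows. This is only a minimal sketch: it assumes the export keeps its messages under a top-level "messages" key, and the inline sample_export string and the member names Asha and Ravi are made up so that the snippet runs without the real chat file.

```python
import json

import pandas as pd

# Made-up stand-in for the exported chat history. With a real export you would
# read the file instead, for example with json.load() on the opened file.
sample_export = """
{
  "name": "Eradicate Diabetes",
  "messages": [
    {"id": 1, "type": "message", "from": "Tim", "text": "Welcome everyone"},
    {"id": 2, "type": "message", "from": "Asha", "text": "Thanks, this is great"},
    {"id": 3, "type": "message", "from": "Ravi", "text": "ok"}
  ]
}
"""

data = json.loads(sample_export)

# Keep only the list of messages and turn it into a data frame,
# one row per message and one column per message field.
df = pd.DataFrame(data["messages"])
print(df[["id", "from", "text"]])
```

Each key of a message (id, type, from, text, and so on) becomes a column, which is what the later filtering steps rely on.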
Step 3: Create a new data frame from the main data frame (which has all the messages), keeping only the relevant columns and only the messages that satisfy the criteria below. The new data frame will have:
Only the id, from, and text columns
No messages from “Tim”, “Raj”, “Dia”, or “Trupti”
Only messages with a text length of more than 2 (this is to eliminate emojis)
Only messages whose text does not contain greetings such as good morning, good night, etc.
The following code snippet retrieves all messages that satisfy all the above-mentioned criteria.
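A minimal sketch of such a filter is given here. The four excluded names and the two greetings come from the criteria above; the sample rows, the other member names, and the variable names admins, greetings, and df_new are assumptions made for illustration.

```python
import re

import pandas as pd

# Made-up sample of the main data frame (a real one has many more columns).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "from": ["Tim", "Asha", "Ravi", "Meena", "Asha"],
    "text": [
        "Please follow the diet chart",
        "Thanks, the diet is great",
        "ok",
        "Good morning all",
        "I feel weakness after lunch",
    ],
    "type": ["message"] * 5,
})

admins = ["Tim", "Raj", "Dia", "Trupti"]
greetings = ["good morning", "good night"]  # extend this list as needed

# Work on a string view of the text so that .str methods are always available.
text = df["text"].astype(str)
greeting_pattern = "|".join(re.escape(g) for g in greetings)

mask = (
    ~df["from"].isin(admins)                                # not from the four names
    & (text.str.len() > 2)                                  # drops very short texts
    & ~text.str.contains(greeting_pattern, case=False)      # drops greetings
)

# Keep only the three relevant columns for the rows that pass every check.
df_new = df.loc[mask, ["id", "from", "text"]]
print(df_new)
```

Combining the three conditions into one boolean mask keeps the criteria readable and lets you switch any of them off while experimenting.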
Step 4: The next step is to retrieve all the “satisfied” messages addressed to “Tim” and “Raj”, that is, positive replies from other members of the group. The selection is based on:
List of words that should be present in the message text like great, thanks, awesome, etc.
List of words that should not be present in the message text like weakness, hungry, etc. which shows a bit of dissatisfaction or negative emotion.
Add a new column to the df_happy data frame called “emotion” with the value 1 for all the satisfied messages.
The following code snippet retrieves all messages that satisfy all the above-mentioned criteria, so the df_happy data frame contains only satisfied messages.
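One way this selection might look is sketched here. It applies only the two word lists; it does not check whether a message is actually a reply to “Tim” or “Raj”. The sample rows and the names happy_words and unhappy_words are assumptions, and the word lists are just the examples mentioned above.

```python
import pandas as pd

# Made-up sample of the filtered data frame from Step 3.
df_new = pd.DataFrame({
    "id": [2, 5, 6, 7],
    "from": ["Asha", "Asha", "Ravi", "Meena"],
    "text": [
        "Thanks, the diet is great",
        "I feel weakness after lunch",
        "Thanks, but I am still hungry",
        "Awesome results this week",
    ],
})

happy_words = ["great", "thanks", "awesome"]
unhappy_words = ["weakness", "hungry"]

# Case-insensitive substring match against each word list.
has_happy = df_new["text"].str.contains("|".join(happy_words), case=False)
has_unhappy = df_new["text"].str.contains("|".join(unhappy_words), case=False)

# Satisfied: at least one happy word and no unhappy word.
df_happy = df_new[has_happy & ~has_unhappy].copy()
df_happy["emotion"] = 1
print(df_happy)
```

Note that a message containing words from both lists (such as the third sample row) is left out, which is the effect of the "should not be present" rule.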
Step 5: The next step is to retrieve all the “unsatisfied” messages addressed to “Tim” and “Raj”, that is, negative replies from other members of the group. The selection is based on:
List of words that should be present in the message text like hungry, weakness, confusing, etc.
List of words that should not be present in the message text like thanks, great, etc., which show satisfaction or a happy emotion.
Add a new column to the df_unhappy data frame called “emotion” with the value 0 for all the unsatisfied messages.
The following code snippet retrieves all messages that satisfy all the above-mentioned criteria, so the df_unhappy data frame contains only unsatisfied messages.
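The unsatisfied selection mirrors the satisfied one with the two word lists swapped. Again, this is a sketch under assumptions: the sample rows and variable names are made up, the word lists are only the examples named above, and no reply detection is done.

```python
import pandas as pd

# Made-up sample of the filtered data frame from Step 3.
df_new = pd.DataFrame({
    "id": [2, 5, 6, 8],
    "from": ["Asha", "Asha", "Ravi", "Meena"],
    "text": [
        "Thanks, the diet is great",
        "I feel weakness after lunch",
        "Thanks, but I am still hungry",
        "The new plan is confusing",
    ],
})

unhappy_words = ["hungry", "weakness", "confusing"]
happy_words = ["thanks", "great"]

has_unhappy = df_new["text"].str.contains("|".join(unhappy_words), case=False)
has_happy = df_new["text"].str.contains("|".join(happy_words), case=False)

# Unsatisfied: at least one unhappy word and no happy word.
df_unhappy = df_new[has_unhappy & ~has_happy].copy()
df_unhappy["emotion"] = 0
print(df_unhappy)
```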
Step 6: Now that we retrieved satisfied and unsatisfied messages based on required criteria, let us merge both the data frames row-wise to get a data frame of both types of emotions. To do that:
Let’s create an empty list.
Append both the data frames df_happy and df_unhappy to the list.
Use pd.concat() to merge both data frames row-wise and extract only the 2 columns (text and emotion) that are required for the decision tree model.
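The three bullet points above can be sketched as follows. The sample rows and the names frames and df_emotion are assumptions for illustration.

```python
import pandas as pd

# Made-up stand-ins for the data frames built in Steps 4 and 5.
df_happy = pd.DataFrame({
    "id": [2, 7],
    "text": ["Thanks, the diet is great", "Awesome results this week"],
    "emotion": [1, 1],
})
df_unhappy = pd.DataFrame({
    "id": [5, 8],
    "text": ["I feel weakness after lunch", "The new plan is confusing"],
    "emotion": [0, 0],
})

# Collect both data frames in a list ...
frames = []
frames.append(df_happy)
frames.append(df_unhappy)

# ... stack them row-wise and keep only the two columns the model needs.
# ignore_index=True renumbers the rows 0..n-1 instead of repeating old indexes.
df_emotion = pd.concat(frames, ignore_index=True)[["text", "emotion"]]
print(df_emotion)
```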
We have now got the 2 features text and emotion that can be used to build the decision tree classifier model.
2. Building the Decision tree Classifier model using the features text and emotion
The 2 features considered here to build a model for sentiment analysis are text and emotion. Message text is an independent variable, so it will be X and since emotion is dependent on the text, emotion will be considered as Y as shown below:
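In pandas terms, this assignment is just a column selection. A small sketch, with a made-up df_emotion data frame standing in for the merged data:

```python
import pandas as pd

# Made-up stand-in for the merged data frame from the previous section.
df_emotion = pd.DataFrame({
    "text": ["Thanks, the diet is great", "I feel weakness after lunch"],
    "emotion": [1, 0],
})

X = df_emotion["text"]      # independent variable: the message text
Y = df_emotion["emotion"]   # dependent variable: 1 = satisfied, 0 = unsatisfied
print(X.shape, Y.shape)
```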
Since the algorithm works only on numeric data, string values cannot be used for prediction directly, so we have to convert the text data to numeric data using CountVectorizer.
What is a CountVectorizer?
In order to use textual data for predictive modeling, the text must first be split into individual words (tokens); this process is called tokenization. These words then need to be encoded as integers, or floating-point values, for use as inputs in machine learning algorithms. This process is called feature extraction (or vectorization).
CountVectorizer is a great tool provided by the scikit-learn library in Python. It transforms a given text into a vector on the basis of the frequency (count) of each word that occurs in the text. This is helpful when we have multiple such texts and wish to convert each text into a vector of word counts (to use in further text analysis).
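A tiny example may make this concrete. The three messages are made up; the vocabulary lookup goes through the fitted vectorizer's vocabulary_ attribute, which maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages, just to show what the vectorizer produces.
messages = [
    "thanks great diet",
    "feeling hungry and weak",
    "great great results",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)   # sparse matrix: one row per message

# Words learned from the messages, ordered by their column index.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```

Each row of the printed array is one message, each column is one word from the vocabulary, and each cell is how many times that word occurs in that message. For instance, the third row has a 2 in the "great" column.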
OK, now let’s put all of this into code.
Step 1: Let’s import the required libraries to use CountVectorizer and decision tree classifier functions.
Step 2: Create the train and test data from the available data using the train_test_split() function. The training message text (the X values) then has to be transformed into an array of numeric values using fit_transform() of CountVectorizer.
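A minimal sketch of this step, covering the imports and the split. The sample messages, the 25% test size, the random_state value, and the use of stratify are assumptions chosen for the toy data, not settings taken from the original analysis.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up labelled messages (1 = satisfied, 0 = unsatisfied).
df_emotion = pd.DataFrame({
    "text": [
        "Thanks, the diet is great",
        "Awesome results this week",
        "Great advice, thanks a lot",
        "Thanks for the support",
        "I feel weakness after lunch",
        "The new plan is confusing",
        "Always hungry on this diet",
        "Weakness and hungry all day",
    ],
    "emotion": [1, 1, 1, 1, 0, 0, 0, 0],
})
X = df_emotion["text"]
Y = df_emotion["emotion"]

# Hold out 25% of the messages for testing; stratify keeps both classes in each part.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=Y
)

# Learn the vocabulary from the training text and convert it to word counts.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
print(X_train_counts.shape)
```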
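Step 3: Use the transformed X values and the Y values (emotion feature values) to fit the decision tree classifier model. Then convert the Xtest values to an array of numeric values with the same CountVectorizer, this time using transform() rather than fit_transform() so that the test data uses the vocabulary learned from the Xtrain data, and pass the result to the model's predict() function to predict the emotions.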
Thus we have built a model to predict the emotion of a text message. We can use metrics.accuracy_score() to find the accuracy of the prediction.
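Putting the steps together, the whole flow might look like the sketch below. The eight messages are made up, so the accuracy it prints says nothing about the real chat data; the split settings and random_state values are likewise assumptions.

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up labelled messages (1 = satisfied, 0 = unsatisfied).
df_emotion = pd.DataFrame({
    "text": [
        "Thanks, the diet is great",
        "Awesome results this week",
        "Great advice, thanks a lot",
        "Thanks for the support",
        "I feel weakness after lunch",
        "The new plan is confusing",
        "Always hungry on this diet",
        "Weakness and hungry all day",
    ],
    "emotion": [1, 1, 1, 1, 0, 0, 0, 0],
})

X_train, X_test, Y_train, Y_test = train_test_split(
    df_emotion["text"], df_emotion["emotion"],
    test_size=0.25, random_state=42, stratify=df_emotion["emotion"],
)

# Fit the vocabulary on the training text only.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)

# Fit the decision tree on the word counts and their emotion labels.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train_counts, Y_train)

# Reuse the fitted vocabulary for the test text, then predict and score.
X_test_counts = vectorizer.transform(X_test)
Y_pred = model.predict(X_test_counts)
accuracy = metrics.accuracy_score(Y_test, Y_pred)
print(accuracy)
```

Words in the test messages that never appeared in the training messages are simply ignored by transform(), which is why the test matrix has the same number of columns as the training matrix.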
An accuracy score of 0.98 means that the model we built predicted the correct emotion for 98% of the messages in the test data, which is really good.
Why Sentiment Analysis?
It’s estimated that 80% of the world’s data is unstructured, that is, unorganized. Huge volumes of text data are created every day through emails, support tickets, chats, social media conversations, surveys, articles, documents, etc. Analyzing, understanding, and sorting through these huge volumes is hard, time-consuming, and expensive.
Sentiment analysis helps businesses make sense of all this unstructured text by processing it automatically.
The main benefits of sentiment analysis include:
It helps businesses process huge amounts of data in an efficient and cost-effective way.
It can identify critical issues in real time, for example, why customers are leaving a brand.
It gives companies better insights into their customers.
Sentiment analysis can be applied to many aspects of business, from brand monitoring and product analytics to customer service and market research. Not only does sentiment analysis enable us to get new insights, but it also helps us to better understand our customers and to empower our own teams so that they do better and more productive work.