As a follow-up to my earlier blog “How to extract question and answer pairs from telegram chat using Python pandas?”, this post explains sentiment analysis on the same Telegram group chat history.
The assignment is to find the satisfied and unsatisfied members in the “Eradicate Diabetes” telegram group and design a decision tree classifier model using the data.
A short introduction to “Eradicate Diabetes” (ED): ED is a community chat group that brings people together to combat diabetes using the power of crowdsourced healthcare. They help people reverse their Type 2 diabetes by providing information and support. They not only help in reversing Type 2 diabetes, but also believe in holistic treatment of organs like the liver, kidneys and heart that have been damaged over years of abuse. They advise herbal-based treatments combined with dietary and lifestyle modifications that have been proven to successfully reverse diabetes.
As this is a continuation of my previous blog, I strongly recommend going through “How to extract question and answer pairs from telegram chat using Python pandas?” for more details on:
Extracting chat history from a telegram group as a JSON file
Data preprocessing steps to extract required data from the chat history messages
With that as the foundation, let’s get started with the coding for sentiment analysis of ED chat history and let’s see how we arrived at the decision tree model for it.
1. Retrieve the required features for the model
Step 1: Import the required libraries. You need pandas and json, since the input is a JSON file and the processing is done in pandas.
Step 2: Load the JSON file and copy only its messages into a pandas dataframe, with one row per message.
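Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the original code: the sample export and the member name “Asha” are made up, and it assumes the export keeps its messages in a list under a "messages" key, with "id", "from" and "text" fields on each message.

```python
import io
import json

import pandas as pd

# Hypothetical miniature of a Telegram chat export; a real export is far larger.
sample_export = """
{
  "name": "Eradicate Diabetes",
  "messages": [
    {"id": 1, "type": "message", "from": "Tim", "text": "Welcome to the group"},
    {"id": 2, "type": "message", "from": "Asha", "text": "Thanks, this is great"}
  ]
}
"""

# With a real export you would open the file instead, for example:
#   with open("result.json", encoding="utf-8") as f:   # file name is an assumption
#       data = json.load(f)
data = json.load(io.StringIO(sample_export))

# Keep only the list of messages and turn it into one row per message.
df = pd.DataFrame(data["messages"])
print(df[["id", "from", "text"]])
```

In a real export the text field is not always a plain string, so some cleaning may be needed before the later steps.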
Step 3: Create a new dataframe from the main dataframe, keeping only the relevant columns and only the messages that meet the criteria below. The new dataframe will have:
Only the id, from and text columns
No messages from “Tim”, “Raj”, “Dia” or “Trupti”
Only messages with a text length of more than 2 (this eliminates emojis)
No messages whose text contains greetings like good morning, good night etc.
The following code snippet retrieves all messages that satisfy the above criteria.
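A possible way to express these four criteria with boolean masks is sketched below. The sample rows, the member names “Asha” and “Ravi”, and the exact greeting list are assumptions for illustration.

```python
import re

import pandas as pd

# Hypothetical sample of the main dataframe.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "from": ["Tim", "Asha", "Ravi", "Asha", "Ravi"],
    "text": [
        "Please share your readings",
        "Good morning all",
        "ok",
        "My sugar levels are better, thanks",
        "Feeling hungry on this diet",
    ],
    "type": ["message"] * 5,
})

excluded_senders = ["Tim", "Raj", "Dia", "Trupti"]
greetings = ["good morning", "good night"]  # assumed list; extend as needed
greeting_pattern = "|".join(re.escape(g) for g in greetings)

text = df["text"].astype(str)
mask = (
    ~df["from"].isin(excluded_senders)                                # not from the excluded senders
    & (text.str.len() > 2)                                            # longer than 2 characters
    & ~text.str.contains(greeting_pattern, case=False, regex=True)   # no greetings
)

# Keep only the three relevant columns for the rows that pass every check.
df_members = df.loc[mask, ["id", "from", "text"]]
print(df_members)
```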
Step 4: The next step is to retrieve all the “satisfied” messages sent to “Tim” and “Raj”, that is, replies from other members of the group. The selection is based on:
A list of words that should be present in the message text, like great, thanks, awesome etc.
A list of words that should not be present in the message text, like weakness, hungry etc., which show some dissatisfaction or negative emotion.
Add a new column called “emotion” to the df_happy dataframe, with value 1 for all the satisfied messages.
The following code snippet retrieves all messages that satisfy the above criteria, so the df_happy dataframe contains only satisfied messages.
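Here is a rough sketch of the keyword rules for satisfied messages. It applies only the word lists. How replies to “Tim” and “Raj” are identified is not covered here, and the sample rows and member names are made up.

```python
import pandas as pd

# Hypothetical filtered messages from Step 3.
df_members = pd.DataFrame({
    "id": [4, 5, 6],
    "from": ["Asha", "Ravi", "Meena"],
    "text": [
        "My sugar levels are better, thanks",
        "Feeling hungry on this diet",
        "Thanks, but the weakness is still there",
    ],
})

positive_words = ["great", "thanks", "awesome"]  # must be present
negative_words = ["weakness", "hungry"]          # must be absent

has_positive = df_members["text"].str.contains("|".join(positive_words), case=False)
has_negative = df_members["text"].str.contains("|".join(negative_words), case=False)

# Satisfied = at least one positive word and no negative word.
df_happy = df_members[has_positive & ~has_negative].copy()
df_happy["emotion"] = 1
print(df_happy)
```

Note that this is plain substring matching, so a word like “great” also matches “greatly”.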
Step 5: The next step is to retrieve all the “unsatisfied” messages sent to “Tim” and “Raj”, that is, negative replies from other members of the group. The selection is based on:
A list of words that should be present in the message text, like hungry, weakness, confusing etc.
A list of words that should not be present in the message text, like thanks, great etc., which show satisfaction or a happy emotion.
Add a new column called “emotion” to the df_unhappy dataframe, with value 0 for all the unsatisfied messages.
The following code snippet retrieves all messages that satisfy the above criteria, so the df_unhappy dataframe contains only unsatisfied messages.
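The unsatisfied rule mirrors the satisfied one with the two word lists swapped, so it can be sketched with a small helper. The helper name contains_any, the sample rows and the member names are hypothetical, and the reply check is again left out.

```python
import re

import pandas as pd


def contains_any(text_column, words):
    """Return a boolean Series that is True where the text contains any of the words."""
    pattern = "|".join(re.escape(word) for word in words)
    return text_column.str.contains(pattern, case=False, regex=True)


# Hypothetical filtered messages from Step 3.
df_members = pd.DataFrame({
    "id": [4, 5, 6, 7],
    "from": ["Asha", "Ravi", "Meena", "Kiran"],
    "text": [
        "My sugar levels are better, thanks",
        "Feeling hungry on this diet",
        "Thanks, but the weakness is still there",
        "The diet chart is confusing",
    ],
})

negative_words = ["hungry", "weakness", "confusing"]  # must be present
positive_words = ["thanks", "great"]                  # must be absent

is_unhappy = (
    contains_any(df_members["text"], negative_words)
    & ~contains_any(df_members["text"], positive_words)
)

df_unhappy = df_members[is_unhappy].copy()
df_unhappy["emotion"] = 0
print(df_unhappy)
```

Messages that contain words from both lists, like the third row here, end up in neither dataframe.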
Step 6: Now that we have retrieved the satisfied and unsatisfied messages, let us merge the two dataframes row-wise to get a single dataframe with both types of emotion. To do that:
Let’s create an empty list.
Append both the dataframes df_happy and df_unhappy to the list.
Use pd.concat() to merge the two dataframes row-wise and extract only the 2 columns, text and emotion, that are required for the decision tree model.
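The three bullets above can be sketched as follows, with tiny stand-ins for df_happy and df_unhappy. The name df_model for the merged dataframe is an assumption.

```python
import pandas as pd

# Hypothetical stand-ins for the dataframes built in Steps 4 and 5.
df_happy = pd.DataFrame({
    "id": [4], "from": ["Asha"],
    "text": ["My sugar levels are better, thanks"], "emotion": [1],
})
df_unhappy = pd.DataFrame({
    "id": [5], "from": ["Ravi"],
    "text": ["Feeling hungry on this diet"], "emotion": [0],
})

# Collect both dataframes in a list.
frames = []
frames.append(df_happy)
frames.append(df_unhappy)

# Stack them row-wise, renumber the index, and keep only the two needed columns.
df_model = pd.concat(frames, ignore_index=True)[["text", "emotion"]]
print(df_model)
```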
We now have the 2 columns, text and emotion, that can be used to build the decision tree classifier model.
2. Building the decision tree classifier model using the features text and emotion
The 2 columns considered here to build a model for sentiment analysis are text and emotion. Message text is the independent variable, so it will be X. Since emotion depends on the text, emotion will be Y.
Since algorithms work only on numeric data, string values cannot be used for prediction directly, so we have to convert the text data to numbers using CountVectorizer.
What is a CountVectorizer?
In order to use textual data for predictive modeling, the text must first be split into individual words (tokens); this process is called tokenization. These words then need to be encoded as integers, or floating-point values, for use as inputs in machine learning algorithms. This process is called feature extraction (or vectorization).
CountVectorizer is a great tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the text. This is helpful when we have multiple such texts and want to convert each of them into a vector of word counts for use in further text analysis.
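To make this concrete, here is a small sketch of CountVectorizer on three made-up messages. Each row of the resulting matrix is one message, and each column is the count of one word from the learned vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example messages.
messages = [
    "thanks this is great",
    "feeling hungry and weak",
    "great great results",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(messages)  # learns the vocabulary and counts words

# vocabulary_ maps each word to its column index in the count matrix.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(counts.toarray())

# The word "great" appears twice in the third message.
great_column = vectorizer.vocabulary_["great"]
print(counts.toarray()[2][great_column])
```

With the default settings, CountVectorizer lowercases the text and ignores single-character tokens.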
OK, now let’s put all of this into code.
Step 1: Let’s import the required libraries to use CountVectorizer and decision tree classifier functions.
Step 2: Create the train and test data from the available data using the train_test_split() function. The training message text, which is the X value, then has to be transformed into an array of numeric values using fit_transform() of CountVectorizer.
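A minimal sketch of this step is shown below. The sample messages and the test_size, random_state and stratify settings are assumptions, not the values used in the original code.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical labelled messages (1 = satisfied, 0 = unsatisfied).
df_model = pd.DataFrame({
    "text": [
        "thanks a lot this is great",
        "awesome results thanks",
        "great advice my readings improved",
        "thanks the diet is working",
        "feeling hungry all day",
        "too much weakness since last week",
        "the instructions are confusing",
        "hungry and tired with weakness",
    ],
    "emotion": [1, 1, 1, 1, 0, 0, 0, 0],
})

X = df_model["text"]     # independent variable
Y = df_model["emotion"]  # dependent variable

# Hold out 25% of the messages for testing, keeping both classes in each part.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=42, stratify=Y
)

# Learn the vocabulary from the training text and turn it into word counts.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
print(X_train_counts.shape)
```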
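Step 3: Use the transformed X values and the Y values (emotion values) of the training data to fit the decision tree classifier model. Then convert the Xtest values to an array of numeric values with the same fitted CountVectorizer, using transform() rather than fit_transform() so that the vocabulary learned from the Xtrain data is reused, and pass the result to the predict() function to predict the emotion of each test message.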
Thus we have built a model to predict the emotion of a text message. We can use metrics.accuracy_score() to find the accuracy of the prediction.
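Fitting, predicting and scoring can be sketched end to end as below. The sample data and the split and random_state settings are assumptions, so the accuracy printed by this toy example says nothing about the score on the real chat history.

```python
import pandas as pd
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labelled messages (1 = satisfied, 0 = unsatisfied).
df_model = pd.DataFrame({
    "text": [
        "thanks a lot this is great",
        "awesome results thanks",
        "great advice my readings improved",
        "thanks the diet is working",
        "feeling hungry all day",
        "too much weakness since last week",
        "the instructions are confusing",
        "hungry and tired with weakness",
    ],
    "emotion": [1, 1, 1, 1, 0, 0, 0, 0],
})

X_train, X_test, Y_train, Y_test = train_test_split(
    df_model["text"], df_model["emotion"],
    test_size=0.25, random_state=42, stratify=df_model["emotion"],
)

vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)  # learn vocabulary from training text
X_test_counts = vectorizer.transform(X_test)        # reuse that vocabulary for test text

# Fit the decision tree on the training counts and labels.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train_counts, Y_train)

# Predict the emotion of the held-out messages and compare with the true labels.
predictions = model.predict(X_test_counts)
accuracy = metrics.accuracy_score(Y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```

Calling transform() on the test text keeps the test matrix at the same number of columns as the training matrix, which predict() requires.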
The accuracy score of 0.98 means that the model we built predicted the emotion correctly for 98% of the test messages, which is really good.
Why Sentiment Analysis?
It’s estimated that 80% of the world’s data is unstructured, that is, unorganized. Huge volumes of text data are created every day through emails, support tickets, chats, social media conversations, surveys, articles, documents, etc. It is time-consuming, expensive and, more importantly, hard to analyze, understand, and sort through these huge volumes.
Sentiment analysis helps businesses make sense of all this unstructured text by processing it as required.
The main benefits of sentiment analysis include:
It helps businesses process huge amounts of data in an efficient and cost-effective way.
It can identify critical issues in real time, for example, why customers are leaving a brand.
It gives companies better insights into their customers.
Sentiment analysis can be applied to many aspects of business, from brand monitoring and product analytics to customer service and market research. Not only does sentiment analysis give us new insights, it also helps us better understand our customers and empower our own teams so that they do better and more productive work.