As a follow-up to my earlier blog post “How to extract question and answer pairs from telegram chat using Python pandas?”, this post explains how to run sentiment analysis on the same Telegram group chat history.
The assignment is to find the satisfied and unsatisfied members in the “Eradicate Diabetes” Telegram group and to build a decision tree classifier model from that data.
A short introduction to “Eradicate Diabetes” (ED): ED is a community chat group that brings people together to combat diabetes using the power of crowdsourced healthcare. The group helps people reverse their Type 2 diabetes by providing information and support. Beyond reversing Type 2 diabetes, they also believe in the holistic treatment of organs such as the liver, kidneys and heart that have been damaged over years of abuse. They advise herbal-based treatments combined with dietary and lifestyle modifications that have been proven to successfully reverse diabetes.
As this is a continuation of my previous post, I strongly recommend reading “How to extract question and answer pairs from telegram chat using Python pandas?” first for more details on:
Extracting chat history from a Telegram group as a JSON file
Data preprocessing steps to extract required data from the chat history messages
With that as the foundation, let’s get started with the coding for sentiment analysis of ED chat history, and let’s see how we arrived at the decision tree model for it.
1. Retrieve the required features for the model
Step 1: Import the required libraries. We need pandas and json, because the input is a JSON file that we load into a pandas data frame.
Step 2: Load the JSON file and copy only the messages from it into a pandas data frame. The data frame then holds one row per message.
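Steps 1 and 2 might look like the sketch below. It assumes the export follows the Telegram Desktop layout, where the chat history sits under a top-level "messages" key. The inline sample (with made-up member names and texts) stands in for the real file, and the file name result.json in the comment is an assumption too.

```python
import json

import pandas as pd

# Tiny stand-in for the exported chat history (made-up sample data).
# With a real export you would read the file instead, for example:
#     with open("result.json", encoding="utf-8") as f:
#         data = json.load(f)
raw = """
{
  "name": "Eradicate Diabetes",
  "messages": [
    {"id": 1, "type": "message", "from": "Tim", "text": "Please share your readings."},
    {"id": 2, "type": "message", "from": "Asha", "text": "Thanks, my sugar is down."},
    {"id": 3, "type": "message", "from": "Ravi", "text": "I feel weakness after lunch."}
  ]
}
"""
data = json.loads(raw)

# Keep only the list of messages; each message becomes one row.
df = pd.DataFrame(data["messages"])
print(df[["id", "from", "text"]])
```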
Step 3: Create a new data frame from the main data frame that keeps only the relevant columns and only the messages that satisfy the criteria below. The new data frame will have:
Only the id, from and text columns
Only messages whose sender is not “Tim”, “Raj”, “Dia” or “Trupti”
Only messages with a text length of more than 2 (this eliminates emojis)
Only messages whose text does not contain greetings such as good morning, good night, etc.
The following code snippet retrieves all messages that satisfy all the above-mentioned criteria.
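A minimal sketch of this filter, assuming the message text is already a plain string in every row. The sample rows, the greetings list and the name df_members are illustrative, not taken from the original notebook.

```python
import re

import pandas as pd

# Made-up sample of the main data frame.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "from": ["Tim", "Asha", "Ravi", "Meena", "Raj", "Ravi"],
    "text": [
        "Please share your sugar readings.",
        "ok",
        "Good morning everyone",
        "Thanks, my fasting sugar is down to 110.",
        "Great progress!",
        "I feel weakness after lunch.",
    ],
    "type": ["message"] * 6,
})

excluded_senders = ["Tim", "Raj", "Dia", "Trupti"]
greetings = ["good morning", "good night"]  # assumed list, extend as needed
greeting_pattern = "|".join(re.escape(g) for g in greetings)

# Combine the three row-level criteria into one boolean mask.
mask = (
    ~df["from"].isin(excluded_senders)
    & (df["text"].str.len() > 2)
    & ~df["text"].str.contains(greeting_pattern, case=False, regex=True)
)

# Keep only the matching rows and the three relevant columns.
df_members = df.loc[mask, ["id", "from", "text"]].reset_index(drop=True)
print(df_members)
```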
Step 4: The next step is to retrieve all the “satisfied” messages addressed to “Tim” and “Raj”, that is, the replies from other members of the group. The selection is based on:
A list of words that should be present in the message text, like great, thanks, awesome, etc.
A list of words that should not be present in the message text, like weakness, hungry, etc., which indicate some dissatisfaction or negative emotion.
Add a new column called “emotion” to the df_happy data frame, with the value 1 for all the satisfied messages.
The following code snippet retrieves all messages that satisfy the above criteria; the resulting df_happy data frame contains only the satisfied messages.
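One way to sketch this step is shown below. It works on a data frame of member messages (here a made-up df_members) and matches whole words case-insensitively. The helper word_pattern and the exact word lists are assumptions for illustration.

```python
import re

import pandas as pd

# Made-up member messages (senders such as Tim and Raj already removed).
df_members = pd.DataFrame({
    "id": [4, 6, 7, 8],
    "from": ["Meena", "Ravi", "Asha", "Meena"],
    "text": [
        "Thanks, my fasting sugar is down.",
        "I feel weakness after lunch.",
        "Thanks, but I still feel hungry at night.",
        "The diet plan is awesome!",
    ],
})

happy_words = ["great", "thanks", "awesome"]
unhappy_words = ["weakness", "hungry"]


def word_pattern(words):
    """Build a regex that matches any of the words as a whole word."""
    return r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"


has_happy = df_members["text"].str.contains(
    word_pattern(happy_words), case=False, regex=True
)
has_unhappy = df_members["text"].str.contains(
    word_pattern(unhappy_words), case=False, regex=True
)

# Satisfied: at least one happy word and no unhappy word.
df_happy = df_members[has_happy & ~has_unhappy].copy()
df_happy["emotion"] = 1
print(df_happy)
```

Note that the third message contains both “thanks” and “hungry”, so it is left out of df_happy.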
Step 5: The next step is to retrieve all the “unsatisfied” messages addressed to “Tim” and “Raj”, that is, the negative replies from other members of the group. The selection is based on:
A list of words that should be present in the message text, like hungry, weakness, confusing, etc.
A list of words that should not be present in the message text, like thanks, great, etc., which indicate satisfaction or a happy emotion.
Add a new column called “emotion” to the df_unhappy data frame, with the value 0 for all the unsatisfied messages.
The following code snippet retrieves all messages that satisfy the above criteria; the resulting df_unhappy data frame contains only the unsatisfied messages.
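The mirror-image filter could be sketched like this, again on made-up member messages and with assumed word lists. A message that contains words from both lists is treated as ambiguous and dropped.

```python
import re

import pandas as pd

# Made-up member messages.
df_members = pd.DataFrame({
    "id": [4, 6, 7, 9],
    "text": [
        "Thanks, my fasting sugar is down.",
        "I feel weakness after lunch.",
        "Thanks, but I still feel hungry at night.",
        "The dosage chart is confusing.",
    ],
})

unhappy_words = ["hungry", "weakness", "confusing"]
happy_words = ["thanks", "great"]

# Whole-word, case-insensitive patterns for both word lists.
unhappy_pattern = r"\b(?:" + "|".join(map(re.escape, unhappy_words)) + r")\b"
happy_pattern = r"\b(?:" + "|".join(map(re.escape, happy_words)) + r")\b"

text = df_members["text"]
has_unhappy = text.str.contains(unhappy_pattern, case=False, regex=True)
has_happy = text.str.contains(happy_pattern, case=False, regex=True)

# Unsatisfied: at least one unhappy word and no happy word.
df_unhappy = df_members[has_unhappy & ~has_happy].copy()
df_unhappy["emotion"] = 0
print(df_unhappy)
```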
Step 6: Now that we have retrieved the satisfied and unsatisfied messages based on the required criteria, let us merge the two data frames row-wise to get a single data frame with both types of emotion. To do that:
Let’s create an empty list.
Append both the data frames df_happy and df_unhappy to the list.
Use pd.concat() to merge both data frames row-wise and extract only the 2 columns (text and emotion) that are required for the decision tree model.
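The three sub-steps above can be sketched as follows. The two input data frames are small made-up stand-ins, and the names frames and df_emotions are illustrative.

```python
import pandas as pd

# Made-up stand-ins for the outputs of steps 4 and 5.
df_happy = pd.DataFrame({
    "id": [4, 8],
    "text": ["Thanks, my fasting sugar is down.", "The diet plan is awesome!"],
    "emotion": [1, 1],
})
df_unhappy = pd.DataFrame({
    "id": [6, 9],
    "text": ["I feel weakness after lunch.", "The dosage chart is confusing."],
    "emotion": [0, 0],
})

# Collect both data frames in a list, then stack them row-wise.
frames = []
frames.append(df_happy)
frames.append(df_unhappy)

# ignore_index=True gives the merged frame a fresh 0..n-1 index.
df_emotions = pd.concat(frames, ignore_index=True)[["text", "emotion"]]
print(df_emotions)
```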
We now have the 2 columns, text and emotion, that can be used to build the decision tree classifier model.
2. Building the decision tree classifier model using the features text and emotion
The 2 columns considered here to build a model for sentiment analysis are text and emotion. The message text is the independent variable, so it becomes X; emotion depends on the text, so it becomes Y, the target the model learns to predict.
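In pandas terms the split is a pair of column selections. The sketch below uses a made-up df_emotions data frame in place of the merged data from Step 6.

```python
import pandas as pd

# Made-up stand-in for the merged data frame from Step 6.
df_emotions = pd.DataFrame({
    "text": [
        "Thanks, my fasting sugar is down.",
        "The diet plan is awesome!",
        "I feel weakness after lunch.",
        "The dosage chart is confusing.",
    ],
    "emotion": [1, 1, 0, 0],
})

X = df_emotions["text"]     # independent variable: the message text
Y = df_emotions["emotion"]  # dependent variable: 1 = satisfied, 0 = unsatisfied
print(X.shape, Y.shape)
```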
Since the classifier needs numeric input and cannot work on raw strings, we have to convert the text data to numeric features using CountVectorizer, which turns each message into a vector of word counts.
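A minimal sketch of the vectorization, assuming scikit-learn's CountVectorizer with its default settings. The sample data is made up, and the final fit on a DecisionTreeClassifier is only there to show that the count matrix is what the model consumes; there is no train/test split or tuning here.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for the text/emotion data.
df_emotions = pd.DataFrame({
    "text": [
        "Thanks, my fasting sugar is down.",
        "The diet plan is awesome!",
        "I feel weakness after lunch.",
        "The dosage chart is confusing.",
    ],
    "emotion": [1, 1, 0, 0],
})
X = df_emotions["text"]
Y = df_emotions["emotion"]

# Learn the vocabulary and convert each message into a row of word counts.
vectorizer = CountVectorizer()
X_counts = vectorizer.fit_transform(X)  # sparse matrix: messages x vocabulary
print(X_counts.shape)

# Fit a decision tree on the numeric features.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_counts, Y)

# New text must go through the same fitted vectorizer before prediction.
new_counts = vectorizer.transform(["Thanks, the plan is awesome"])
print(model.predict(new_counts))
```

The important detail is that new messages are passed through transform() on the already fitted vectorizer, not fit_transform(), so that their columns line up with the vocabulary the tree was trained on.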