How to extract question and answer pairs from telegram chat using Python pandas?

What? Extract questions and answers pairs from telegram chats? How is it possible? How to do it? These were some of the few questions that came up to my mind when this assignment was given to us as a part of a training course. It seemed impossible to me, because I never thought that I could extract chat history and manipulate the data. I never believed that it would be this easy to extract chat history and filter data using various criteria until I tried it out and saw the desired output.

Amazing and Interesting isn’t it?

So, without wasting much time let’s get into the steps of how to go about extracting the chat history from telegram and filtering question and answer pairs from it.

First, let’s talk about the problem statement of the assignment:

The assignment was to extract the chat history from “Eradicates Diabetes” a community chat group and filter out only the question and answer pairs that were answered by the group admins “Tim” and “Raj”.

Some insight about “Eradicate Diabetes(ED)” - ED is a community chat group that unites the masses together to combat the problem using the power of crowdsourced healthcare. They help people in reversing their Type 2 diabetes by providing information and support. They not only help in reversing Type 2 diabetes but also believe in the holistic treatment of the organs like the liver, kidneys & heart that have been damaged over years of abuse. They have been advising on Herbal based treatments combined with dietary and lifestyle modifications that have been proven to successfully reverse diabetes.

Following are the steps I followed to go about achieving the required results:

Step1: Extracting the chat history into a JSON file

Go to the telegram, and select the chat (I have used ED, but you can use any chat) from which you want to export the chat history, and click on the three dots in the right side top corner. A small menu opens up and you can see an option called “Export chat history” as shown below:

Click on the “Export chat history” option. A small window pops up with export options and file format selection etc. as shown below:

In the popup, select all the types of files that are required to be extracted from the chat. Telegram provides 2 file formats : HTML and JSON. By default, it extracts in HTML format, if you want to extract it in JSON format, click on Format: HTML. It will open up another popup where you can select JSON file format and click on “Save”. Once all the settings are selected, click on “Export” at the bottom corner.

The downloaded files can be found in the Downloads folder on the computer or the device.

Step 2: Analyzing the extracted JSON file to decide on what information is relevant

Data analysis and data cleansing are the most important step in any project in data science. To do that, first,

Analyze the data, how it is structured, what are the various attributes captured, and how each and every attribute are related to each other.
Analyze the data stored in those attributes, that is you need to understand the kind of data stored like integer, strings, etc.
Analyze and separate the relevant data attributes and not-so-important ones.
Remove unwanted data if required

These are some of the steps I followed, but you can add more according to the requirement. Remember, please do not jump immediately to start coding, without this most important step. If you forget to do this step, then you may end up spending more time to get the o/p as the understanding of data is more important than anything else in data science.

Step 3: Working on the JSON file to get the required output using python pandas

According to me, the smaller and cleaner the dataset, the output will be more accurate and can be validated easily. So, I would like to make my dataset smaller as much as possible according to the requirement and then start applying the logic to get the desired output.

After analyzing the data thoroughly, you can start code. I used Kaggle notebooks to do this assignment. Any IDE that supports python pandas can be used.

Open a new notebook in kaggle.com
Give a proper name for the notebook. Eg, ED Telegram chat extraction
Click on “ Add data” in the right pane top and add the input file which is nothing but the extracted JSON file.

Please check the screenshot given below:

The extracted data stored in JSON file has the following format. We are interested only in the messages tag and every message has a large set of attributes.

Let’s start coding and follow the below-mentioned steps:

Step 1: Import required libraries. You have to import pandas and JSON libraries as we are using pandas and JSON file as input. And also to write the output to a file we have to import sys and OS libraries too.

import json
import pandas as pd
import os,sys

Step 2: Load the JSON file and retrieve only the messages from the JSON file to a pandas data frame. The data frame will be as shown below:

#Opening and loading the json 
f = open('../input/result2/result2.json', encoding='utf8')
data = json.load(f)

#Retreive only the messages tag from json
msgs = data['messages']
dmain = pd.DataFrame(msgs)

Output:

Step 3: I am creating a new data frame with only relevant columns from the main data frame that has all the messages and all of its’ attributes. So, the new data frame will have [id, from, reply_to_message_id and text] columns as shown below:

#Create a new dataframe with only required columns (id,from,reply_to_message_id and text)
df_new = dmain.filter(items = ['id','from', 'reply_to_message_id' ,'text'])

Output:

Step 4: Next, let's retrieve all the messages that satisfy the following criteria:

Messages only from “Tim” or “Raj”
Message text length more than 8
Message text that does not have a list of words like good morning, awesome, etc., and also messages that have “?”
“Reply_to_message_id” field has a value greater than zero.

The following code snippet retrieves all messages that satisfy all the above-mentioned criteria, and the df_ans data frame contains messages that are nothing but answers.

#Retrieve all the answers by Tim and Raj
#Retrieve only if the length of text is greater than 8 and if it is a reply to another message, basically chk if reply_to_message_id > 0
df_ans = df_new.loc[df_new['from'].isin(['Tim','Raj']) & (df_new['text'].str.len() > 8) & (df_new["reply_to_message_id"] > 0), ['id','from', 'reply_to_message_id' ,'text']]

#List of words to be searched in the df_ans to eliminate those answers
dropList = ['how','when','which','what','good morning','good evening','goodafternoon','awesome','click','thanks','share','introduce','welcome','great','unfortunately','congratulations',congrats','wow','wife','makhan']

#Eliminate all messages which has above words
df_ans = df_ans[~df_ans['text'].str.lower().str.contains('|'.join(dropList),na=False)]

#eliminate "?"
df_ans = df_ans[~df_ans['text'].str.contains("?", regex=False)]

Output:

Step 5: Next step is to retrieve all the messages that have been asked to “Tim” and “Raj”, basically questions by others, that will be based on:

List of words that should be present in the message text like how, what, why, etc.
Messages that are not from “Tim” or “Raj”
List of words that should not be present in the message text like good morning, thanks, etc.

The following code snippet retrieves all messages that satisfy all the above-mentioned criteria, and the df_ques data frame contains messages that are nothing but questions.

#Retrieve all the questions
#List of words to be searched to retrieve qtns asked to Tim and Raj
include_list =['tell','can','what','where','when','how','which','who','why','suggest','take']

df_ques = df_new.loc[~df_new['from'].isin(['Tim','Raj']) & (df_new['text'].str.lower().str.contains('|'.join(include_list),na=False)) , ['id','from', 'reply_to_message_id' ,'text']]

#List of words to be searched to remove unwanted qtns
remove_list =['pin','great','wife','mrs','joined','whatsapp','thanks','good morning','good evening', 'gud nite', 'good night' 'good afternoon','dcotor','meant','hope','doctor','feature','qualification','forward']

df_ques = df_ques[~df_ques['text'].str.lower().str.contains('|'.join(remove_list),na=False)]

Output:

Step 6: Now that we retrieved questions and answers based on the required criteria, let us merge both the data frame to get Q&A pair. The answers data frame has a column called “reply_to_message_id” which is related to the “id” column of the question data frame.

I used inner join to relate these 2 columns to get the matching records.

#merge both the question and answer dataframe with reply_to_message_id from answer DF and Id col  
qtns = pd.merge(df_ques, df_ans, left_on='id', right_on='reply_to_message_id')
# qtns

#Extract only the required columns from the merged o/p
qtns[['id_x','from_x','text_x','id_y','from_y','reply_to_message_id_y','text_y']]

Output:

Step 7: From the resultant data frame, I just retrieved 2 text columns (text_x and text_y) which are nothing but questions and answers and using a for loop, looped through the data frame and the output was written to a .txt file with Question in one line and it’s the corresponding answer in the next line.

#extracting only 2 cols Qs and As(text_x = question and text_y = answer) and loop thru and print out Q and A
qtns = qtns[['text_x','text_y']]

#Open the txt for writing
f = open(r'ED_QA.txt','a', encoding='utf8'); 
sys.stdout=f
#using for loop
for index, row in qtns.iterrows():
    print("Question : " + str(row['text_x']))
    print("Answer : " + str(row['text_y']))

Step 8: The Output file will be created under the Output folder in the right pane of the Kaggle notebook as shown below:

Step 9: The output file can be downloaded to any folder by clicking on the 3 dots next to the file in the right pane. When you click on the dots, “Download” options pop up as shown below. On clicking “Download”, the “Save As” dialog opens up and you can save the output file to any of the folders.

Step 10: The sample output of this program is as follows:

I hope, I was able to explain my logic clearly and I also think that this blog would be useful for those who are trying to do the same exercise that I did and benefit from it.

My 2 cents:

Just wanted to share a simple tip that I followed to validate the output of this program. Usually, chat history files have a large set of data. So, I just took a subset of the data (say 1000 messages) and I manually analyzed all the 1000 messages, wrote down the ids of both questions and answers that were relevant. I then ran the program logic on the same set of data and compared the output with the result that I had got through manual analysis. The results were almost the same, thus verifying my logic too. Once this was successful, I ran the same program with the entire dataset and the output was mind-blowing, which was a valid set of question and answer pairs.

Thanks for reading!

How to extract question and answer pairs from telegram chat using Python pandas?

Recent Posts