How to extract question and answer pairs from telegram chat using Python pandas?

Updated: Oct 10, 2020

What? Extract questions and answers pairs from telegram chats? How is it possible? How to do it? These were some of the few questions that came up to my mind when this assignment was given to us as a part of a training course. It seemed impossible to me, because I never thought that I could extract chat history and manipulate the data. I never believed that it would be this easy to extract chat history and filter data using various criteria until I tried it out and saw the desired output.

Amazing and Interesting isn’t it?

So, without wasting much time let’s get into the steps of how to go about extracting the chat history from telegram and filtering question and answer pairs from it.

First, let’s talk about the problem statement of the assignment:

The assignment was to extract the chat history from “Eradicates Diabetes” a community chat group and filter out only the question and answer pairs that were answered by the group admins “Tim” and “Raj”.

Some insight about “Eradicate Diabetes(ED)” - ED is a community chat group which unites the masses together to combat the problem using the power of crowdsourced healthcare. They help people in reversing their Type 2 diabetes by providing information and support. They not only help in reversing Type 2 diabetes, but also believe in holistic treatment of the organs like the liver, kidneys & heart that have been damaged over years of abuse. They have been advising on Herbal based treatments combined with dietary and lifestyle modifications that have been proven to successfully reverse diabetes.

Following are the steps I followed to go about achieving the required results:

Step1: Extracting the chat history in to JSON file

Go to telegram, and select the chat (I have used ED, but you can use any chat) from which you want to export the chat history, and click on the three dots in the right side top corner. A small menu opens up and you can see an option called “Export chat history” as shown below:

  • Click on the “Export chat history” option. A small window pops up with export options and file format selection etc. as shown below:

In the popup, select all the types of files that are required to be extracted from the chat. Telegram provides 2 file formats : HTML and JSON. By default, it extracts in HTML format, if you want to extract it in JSON format, click on Format: HTML . It will open up another popup where you can select JSON file format and click on “Save”. Once all the settings are selected, click on “Export” at the bottom corner.

The downloaded files can be found in the downloads folder in the computer or the device.

Step 2: Analyzing the extracted JSON file to decide on what information is relevant

Data analysis and data cleansing are the most important step in any project in data science. To do that, first,

  • Analyze the data, how it is structured, what are the various attributes captured and how each and every attribute are related to each other.

  • Analyze the data stored in those attributes, that is you need to understand the kind of data stored like integer, strings etc.

  • Analyze and separate the relevant data attributes and not so important ones.

  • Remove unwanted data if required

These are some of the steps I followed, but you can add more according to the requirement. Remember, please do not jump immediately to start coding, without this most important step. If you forget to do this step, then you may end up spending more time to get the o/p as the understanding of data is more important than anything else in data science.

Step 3: Working on the JSON file to get the required output using python pandas

According to me, the smaller and cleaner the dataset, the output will be more accurate and can be validated easily. So, I would like to make my dataset smaller as much as possible according to the requirement and then start applying the logic to get the desired output.

After analyzing the data thoroughly, you can start code. I used Kaggle notebooks to do this assignment. Any IDE that supports python pandas can be used.

  • Open a new notebook in kaggle.com

  • Give a proper name for the notebook. Eg, ED Telegram chat extraction

  • Click on “ Add data” in the right pane top and add the input file which is nothing but the extracted JSON file.

Please check the screenshot given below:

The extracted data stored in JSON file has the following format. We are interested only in the messages tag and every message has a large set of attributes.


Let’s start coding and follow the below mentioned steps:

Step 1: Import required libraries. You have to import pandas and JSON libraries as we are using pandas and JSON file as input. And also to write the output to a file we have to import sys and OS libraries too.

import json
import pandas as pd
import os,sys

Step 2: Load the JSON file and retrieve only the messages from the JSON file to a pandas dataframe. The dataframe will be as shown below:

#Opening and loading the json 
f = open('../input/result2/result2.json', encoding='utf8')
data = json.load(f)
#Retreive only the messages tag from json
msgs = data['messages']
dmain = pd.DataFrame(msgs)

Output:


Step 3: I am creating a new dataframe with only relevant columns from the main dataframe that has all the messages and all of its’ attributes. So, the new dataframe will have [id, from, reply_to_message_id and text] columns as shown below:

#Create a new dataframe with only required columns (id,from,reply_to_message_id and text)
df_new = dmain.filter(items = ['id','from', 'reply_to_message_id' ,'text'])

Output:

Step 4: Next , let's retrieve all the messages that satisfies the following criteria:

  1. Messages only from “Tim” or “Raj”

  2. Message text length more than 8

  3. Message text that does not have a list of words like good morning, awesome etc. and also messages that has “?”

  4. “Reply_to_message_id” field has a value greater than zero.

The following code snippet retrieves all messages that satisfies all the above mentioned criteria, and the df_ans dataframe contains messages that are nothing but answers.

#Retrieve all the answers by Tim and Raj
#Retrieve only if the length of text is greater than 8 and if it is a reply to another message, basically chk if reply_to_message_id > 0
df_ans = df_new.loc[df_new['from'].isin(['Tim','Raj']) & (df_new['text'].str.len() > 8) & (df_new["reply_to_message_id"] > 0), ['id','from', 'reply_to_message_id' ,'text']]

#List of words to be searched in the df_ans to eliminate those answers
dropList = ['how','when','which','what','good morning','good evening','goodafternoon','awesome','click','thanks','share','introduce','welcome','great','unfortunately','congratulations',congrats','wow','wife','makhan']

#Eliminate all messages which has above words
df_ans = df_ans[~df_ans['text'].str.lower().str.contains('|'.join(dropList),na=False)]

#eliminate "?"
df_ans = df_ans[~df_ans['text'].str.contains("?", regex=False)]

Output:


Step 5: Next step is to retrieve all the messages that has been asked to “Tim” and “Raj”, basically questions by others, that will be based on:

  1. List of words that should be present in the message text like how, what, why etc.

  2. Messages that are not from “Tim” or “Raj”

  3. List of words that should not be present in the message text like good morning, thanks etc.

The following code snippet retrieves all messages that satisfies all the above mentioned criteria, and the df_ques dataframe contains messages that are nothing but questions.

#Retrieve all the questions
#List of words to be searched to retrieve qtns asked to Tim and Raj
include_list =['tell','can','what','where','when','how','which','who','why','suggest','take']

df_ques = df_new.loc[~df_new['from'].isin(['Tim','Raj']) & (df_new['text'].str.lower().str.contains('|'.join(include_list),na=False)) , ['id','from', 'reply_to_message_id' ,'text']]

#List of words to be searched to remove unwanted qtns
remove_list =['pin','great','wife','mrs','joined','whatsapp','thanks','good morning','good evening', 'gud nite', 'good night' 'good afternoon','dcotor','meant','hope','doctor','feature','qualification','forward']

df_ques = df_ques[~df_ques['text'].str.lower().str.contains('|'.join(remove_list),na=False)]

Output:


Step 6: Now that we retrieved questions and answers based on required criteria, let us merge both the dataframe to get Q&A pair. The answers dataframe have a column called “reply_to_message_id” which is related to the “id” column of the question dataframe.

I used inner join to relate these 2 columns to get the matching records.


#merge both the question and answer dataframe with reply_to_message_id from answer DF and Id col  
qtns = pd.merge(df_ques, df_ans, left_on='id', right_on='reply_to_message_id')
# qtns

#Extract only the required columns from the merged o/p
qtns[['id_x','from_x','text_x','id_y','from_y','reply_to_message_id_y','text_y']]

Output:

Step 7: From the resultant dataframe , I just retrieved 2 text columns (text_x and text_y) which are nothing but questions and answers and using a for loop, looped through the dataframe and the output was written to a .txt file with Question in one line and it’s corresponding answer in the next line.

#extracting only 2 cols Qs and As(text_x = question and text_y = answer) and loop thru and print out Q and A
qtns = qtns[['text_x','text_y']]

#Open the txt for writing
f = open(r'ED_QA.txt','a', encoding='utf8'); 
sys.stdout=f
#using for loop
for index, row in qtns.iterrows():
    print("Question : " + str(row['text_x']))
    print("Answer : " + str(row['text_y']))    

Step 8: The Output file will be created under Output folder in the right pane of the kaggle notebook as shown below:

Step 9: The output file can be downloaded to any folder by clicking on the 3 dots next to the file in the right pane. When you click on the dots, “Download” options pops up as shown below. On clicking “Download”, “Save As” dialog opens up and you can save the output file to any of the folders.

Step 10: The sample output of this program is as follows:


Hope, I was able to explain my logic clearly and I also think that this blog would be useful for those who are trying to do the same exercise that I did, and benefit from it.

My 2 cents:

Just wanted to share a simple tip that I followed to validate the output of this program. Usually chat history files have a large set of data. So, I just took a subset of the data (say 1000 messages) and I manually analyzed all the 1000 messages, wrote down the ids of both questions and answers that were relevant. I then ran the program logic on the same set of data and compared the output with the result that I had got through manual analysis. The results were almost same , thus verifying my logic too. Once this was successful, I ran the same program with the entire dataset and the output was mind blowing, which was a valid set of question and answers pairs.

Thanks for reading!


194 views0 comments

Recent Posts

See All
 

© Numpy Ninja.