How to extract question and answer pairs from telegram chat using Python pandas?

What? Extract questions and answers pairs from telegram chats? How is it possible? How to do it? These were some of the few questions that came up to my mind when this assignment was given to us as a part of a training course. It seemed impossible to me, because I never thought that I could extract chat history and manipulate the data. I never believed that it would be this easy to extract chat history and filter data using various criteria until I tried it out and saw the desired output.

Amazing and Interesting isn’t it?

So, without wasting much time let’s get into the steps of how to go about extracting the chat history from telegram and filtering question and answer pairs from it.

First, let’s talk about the problem statement of the assignment:

The assignment was to extract the chat history from “Eradicates Diabetes” a community chat group and filter out only the question and answer pairs that were answered by the group admins “Tim” and “Raj”.

Some insight about “Eradicate Diabetes(ED)” - ED is a community chat group that unites the masses together to combat the problem using the power of crowdsourced healthcare. They help people in reversing their Type 2 diabetes by providing information and support. They not only help in reversing Type 2 diabetes but also believe in the holistic treatment of the organs like the liver, kidneys & heart that have been damaged over years of abuse. They have been advising on Herbal based treatments combined with dietary and lifestyle modifications that have been proven to successfully reverse diabetes.

Following are the steps I followed to go about achieving the required results:

Step1: Extracting the chat history into a JSON file

Go to the telegram, and select the chat (I have used ED, but you can use any chat) from which you want to export the chat history, and click on the three dots in the right side top corner. A small menu opens up and you can see an option called “Export chat history” as shown below:

Click on the “Export chat history” option. A small window pops up with export options and file format selection etc. as shown below:

In the popup, select all the types of files that are required to be extracted from the chat. Telegram provides 2 file formats : HTML and JSON. By default, it extracts in HTML format, if you want to extract it in JSON format, click on Format: HTML. It will open up another popup where you can select JSON file format and click on “Save”. Once all the settings are selected, click on “Export” at the bottom corner.

The downloaded files can be found in the Downloads folder on the computer or the device.

Step 2: Analyzing the extracted JSON file to decide on what information is relevant

Data analysis and data cleansing are the most important step in any project in data science. To do that, first,

  • Analyze the data, how it is structured, what are the various attributes captured, and how each and every attribute are related to each other.

  • Analyze the data stored in those attributes, that is you need to understand the kind of data stored like integer, strings, etc.

  • Analyze and separate the relevant data attributes and not-so-important ones.

  • Remove unwanted data if required

These are some of the steps I followed, but you can add more according to the requirement. Remember, please do not jump immediately to start coding, without this most important step. If you forget to do this step, then you may end up spending more time to get the o/p as the understanding of data is more important than anything else in data science.

Step 3: Working on the JSON file to get the required output using python pandas

According to me, the smaller and cleaner the dataset, the output will be more accurate and can be validated easily. So, I would like to make my dataset smaller as much as possible according to the requirement and then start applying the logic to get the desired output.

After analyzing the data thoroughly, you can start code. I used Kaggle notebooks to do this assignment. Any IDE that supports python pandas can be used.

  • Open a new notebook in kaggle.com

  • Give a proper name for the notebook. Eg, ED Telegram chat extraction

  • Click on “ Add data” in the right pane top and add the input file which is nothing but the extracted JSON file.

Please check the screenshot given below:

The extracted data stored in JSON file has the following format. We are interested only in the messages tag and every message has a large set of attributes.


Let’s start coding and follow the below-mentioned steps:

Step 1: Import required libraries. You have to import pandas and JSON libraries as we are using pandas and JSON file as input. And also to write the output to a file we have to import sys and OS libraries too.

import json
import pandas as pd
import os,sys

Step 2: Load the JSON file and retrieve only the messages from the JSON file to a pandas data frame. The data frame will be as shown below:

#Opening and loading the json 
f = open('../input/result2/result2.json', encoding='utf8')
data = json.load(f)
#Retreive only the messages tag from json
msgs = data['messages']
dmain = pd.DataFrame(msgs)

Output:


Step 3: I am creating a new data frame with only relevant columns from the main data frame that has all the messages and all of its’ attributes. So, the new data frame will have [id, from, reply_to_message_id and text] columns as shown below:

#Create a new dataframe with only required columns (id,from,reply_to_message_id and text)
df_new = dmain.filter(items = ['id','from', 'reply_to_message_id' ,'text'])

Output:

Step 4: Next, let's retrieve all the messages that satisfy the following criteria:

  1. Messages only from “Tim” or “Raj”

  2. Message text length more than 8

  3. Message text that does not have a list of words like good morning, awesome, etc., and also messages that have “?”

  4. “Reply_to_message_id” field has a value greater than zero.

The following code snippet retrieves all messages that satisfy all the above-mentioned criteria, and the df_ans data frame contains messages that are nothing but answers.

#Retrieve all the answers by Tim and Raj
#Retrieve only if the length of text is greater than 8 and if it is a reply to another message, basically chk if reply_to_message_id > 0
df_ans = df_new.loc[df_new['from'].isin(['Tim','Raj']) & (df_new['text'].str.len() > 8) & (df_new["reply_to_message_id"] > 0), ['id','from', 'reply_to_message_id' ,'