I have been part of a Telegram group, ‘Eradicate Diabetes’, which was created to advise people on LCHF diets and on how to reduce sugar levels naturally and lead a healthy life.
It holds a lot of information in the form of questions and answers from two experts, ‘Raj’ and ‘Tim’, along with plenty of chit-chat, food images, and recipes.
The requirement was to extract the relevant information from this group and produce a .csv file of question-answer pairs.
It was a vague requirement. In the steps ahead you will see how we worked towards a solution, dealing with the requirements and challenges along the way. Yes, it was a journey.
Step 1 :
To extract the chats of the group in .json format. We zeroed in on the JSON format as it is easy to parse.
To read the JSON file with Python, for which the code was written as below:
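The original code is not shown here, so the following is only a minimal sketch of how the export could be loaded. It uses Python's standard `json` module rather than a data frame library, and both the helper name `load_export` and the file name `result.json` are assumptions.

```python
import json


def load_export(path):
    """Load a Telegram chat export (a JSON file) into a Python dict."""
    # Be explicit about UTF-8 so emoji and non-English text are read correctly.
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Usage (the file name is an assumption):
# data = load_export("result.json")
```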
Identify the data that is of use to us. For this we printed the dataframe df as below:
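The original printed a data frame named df. As a dependency-free stand-in, this sketch summarizes the top-level keys of the parsed export instead. The miniature `data` dict is hypothetical and only mimics the general shape of an export; `summarize` is an illustrative helper, not part of any library.

```python
# Hypothetical miniature of a parsed export; real exports are much larger.
data = {
    "name": "Eradicate Diabetes",
    "id": 1,
    "messages": [
        {"id": 1, "from": "Member A", "text": "Good morning"},
    ],
}


def summarize(export):
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in export.items()}


# Shows at a glance that the chats live under the "messages" key as a list.
print(summarize(data))
```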
We observed that we wanted to pull the messages out of it, so we wrote:
To get the length of the whole data frame we can say:
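A minimal sketch of both steps, assuming the parsed export is a dict whose `messages` key holds a list of message dicts. The sample data and the helper name `get_messages` are hypothetical.

```python
def get_messages(export):
    """Return the list of message dicts, or an empty list if the key is absent."""
    return export.get("messages", [])


# Hypothetical sample standing in for the parsed export.
data = {
    "name": "Eradicate Diabetes",
    "messages": [
        {"id": 1, "from": "Member A", "text": "Good morning"},
        {"id": 2, "from": "Member B", "text": "Hello all"},
    ],
}

messages = get_messages(data)
# Number of messages in the export.
print(len(messages))
```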
To identify the relevant fields to build up the logic.
Now if we open the JSON file, it is in the form below:
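The original sample is not shown, so here is an illustrative approximation of the shape, wrapped in Python so it can be parsed and inspected. Every value is made up, and a real export contains more fields than these.

```python
import json

# Illustrative only: not real data from the group.
SAMPLE = """
{
  "name": "Eradicate Diabetes",
  "messages": [
    {"id": 101, "from": "Member A",
     "text": "What can I eat for breakfast?"},
    {"id": 102, "from": "Raj", "reply_to_message_id": 101,
     "text": "Example answer from an expert."}
  ]
}
"""

export = json.loads(SAMPLE)
for message in export["messages"]:
    # A message that is not a reply has no reply_to_message_id, so .get() gives None.
    print(message["id"], message.get("reply_to_message_id"), message["from"])
```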
Fields that we have to consider:
1) ‘reply_to_message_id’: This field is present whenever a message is a reply to another message, and we use it to extract answers. Its value is the id of the message that holds the question being answered.
2) ‘id’: This is the message id, which we use to locate where a question was asked.
3) ‘text’: This is the message text belonging to that id.
4) ‘from’: This indicates who sent the message, i.e. who asked the question or gave the answer.
Making the logic to extract relevant data.
Now, using all the fields identified above, we made our logic.
The code below keeps only the messages that have a ‘reply_to_message_id’ field, have some text, and were sent by Tim or Raj, our experts. This extracts all the answers given by Tim and Raj.
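A minimal sketch of that filter, assuming the sender names in the ‘from’ field are exactly "Tim" and "Raj" (real exports carry full display names, so the set would need adjusting). The sample messages and the helper name `extract_answers` are hypothetical.

```python
EXPERTS = {"Tim", "Raj"}  # assumption: matches the "from" field exactly


def extract_answers(messages, experts=EXPERTS):
    """Keep messages that are replies, have non-empty text, and come from an expert."""
    return [
        message for message in messages
        if "reply_to_message_id" in message
        and message.get("text")
        and message.get("from") in experts
    ]


# Hypothetical sample messages.
messages = [
    {"id": 1, "from": "Member A", "text": "What can I eat for breakfast?"},
    {"id": 2, "from": "Raj", "reply_to_message_id": 1, "text": "Example answer."},
    {"id": 3, "from": "Member B", "reply_to_message_id": 1, "text": "Same question here."},
    {"id": 4, "from": "Tim", "text": "Good morning"},
]

answers = extract_answers(messages)
print([answer["id"] for answer in answers])
```

Only message 2 survives: message 3 is a reply but not from an expert, and message 4 is from an expert but not a reply.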
Now we need to extract the questions asked, so we take the id given in the ‘reply_to_message_id’ field and look up the message with that id, which holds the question.
For the logic above, the code can be written as below. One added condition is that the question should not come from Tim or Raj, as we only want their answers to queries asked by other people:
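One way to sketch this lookup is to index all messages by id first, so each answer finds its question directly instead of scanning upward. The names `pair_questions` and `EXPERTS` and the sample data are hypothetical.

```python
EXPERTS = {"Tim", "Raj"}  # assumption: matches the "from" field exactly


def pair_questions(messages, answers, experts=EXPERTS):
    """Match each answer to the non-expert question it replies to."""
    by_id = {m["id"]: m for m in messages if "id" in m}
    pairs = []
    for answer in answers:
        question = by_id.get(answer["reply_to_message_id"])
        # Skip replies whose target is missing, has no text, or was written by an expert.
        if not question or not question.get("text") or question.get("from") in experts:
            continue
        pairs.append((question["text"], answer["text"]))
    return pairs


# Hypothetical sample: message 3 replies to an expert, so it yields no pair.
messages = [
    {"id": 1, "from": "Member A", "text": "What can I eat for breakfast?"},
    {"id": 2, "from": "Raj", "reply_to_message_id": 1, "text": "Example answer."},
    {"id": 3, "from": "Tim", "reply_to_message_id": 2, "text": "Agreed."},
]
answers = [m for m in messages if "reply_to_message_id" in m]
pairs = pair_questions(messages, answers)
print(pairs)
```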
Once this is done, we can print the questions and answers with the code below:
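A minimal sketch that prints the pairs and also writes them as CSV, since a .csv file was the stated goal. The column names `question` and `answer` and the helper name `write_pairs` are assumptions, as is the sample pair.

```python
import csv
import io


def write_pairs(pairs, out):
    """Write (question, answer) tuples as CSV rows to an open text stream."""
    writer = csv.writer(out)
    writer.writerow(["question", "answer"])
    writer.writerows(pairs)


# Hypothetical pair standing in for the extracted data.
pairs = [("What can I eat for breakfast?", "Example answer.")]

for question, answer in pairs:
    print("Q:", question)
    print("A:", answer)

# Written to an in-memory buffer here; a real run would pass an open file.
buffer = io.StringIO()
write_pairs(pairs, buffer)
csv_text = buffer.getvalue()
```

To write to disk instead, the stream could be opened with `open("qa_pairs.csv", "w", newline="", encoding="utf-8")`, where the file name is again only an example.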
Data Cleaning - Remove the irrelevant.
Now we get the required file, but it has a lot of issues, such as:
1) Lots of ‘Good Morning’ messages and replies to such wishes.
2) Little of the data was relevant.
3) Lots of junk characters.
We had to filter these out in a meaningful manner. The output files looked as below:
a) To keep only relevant questions and answers, we put a filter on the questions: only strings containing one of the keywords 'can ', 'what ', 'where ', 'when ', 'how ', 'which ', 'who ', 'why ', 'suggest ' are kept as questions, and the rest are removed.
Notice the space after each word in the list. It ensures we capture the word itself and not a longer word that merely starts with it (‘cant’, ‘whichever’, ‘whoever’, etc.).
The code for this is as below:
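A minimal sketch of the keyword filter, using the keyword list from the text. Lowercasing the message before matching is an assumption added here so that "Can" and "can" are treated alike, and `is_question` is a hypothetical helper name.

```python
KEYWORDS = ('can ', 'what ', 'where ', 'when ', 'how ', 'which ', 'who ', 'why ', 'suggest ')


def is_question(text, keywords=KEYWORDS):
    """Return True if the text contains any keyword (trailing space included)."""
    lowered = text.lower()  # assumption: match case-insensitively
    return any(keyword in lowered for keyword in keywords)


# Hypothetical sample strings.
samples = ["Can I eat rice at night?", "Good morning everyone", "I cant sleep well"]
print([s for s in samples if is_question(s)])
```

Because this is a plain substring test, a word that merely ends in a keyword (for example "American " contains "can ") would also match. A word-boundary regular expression would be stricter.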
b) To remove curly braces and the text between them, we used the following code:
To use the above code, remember to import re (Python's regular expression library).
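A minimal sketch of such a cleanup, assuming the braces are not nested. The pattern and the helper name `strip_braces` are illustrative and may differ from the code originally used.

```python
import re

# Non-greedy match from "{" to the nearest "}"; DOTALL lets it span line breaks.
BRACES = re.compile(r"\{.*?\}", flags=re.DOTALL)


def strip_braces(text):
    """Remove every {...} span, braces included, from the text."""
    return BRACES.sub("", text)


# Hypothetical junk of the kind that appears when formatted text is turned into a string.
cleaned = strip_braces("Read this {'type': 'link'} today")
print(cleaned)
```

The removal leaves the surrounding spaces in place, so a follow-up step that collapses repeated whitespace may be useful.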
Now the data we get is much cleaner and contains only relevant questions and answers.