By Manasi Desai

Bird Says Tweet Tweet: Decoding the Twitter Melody Using R - Part 1

In this blog we will quickly go through some simple functions and arguments readily available in R for extracting and analysing Twitter data. For easy understanding I have divided this blog into two parts: first, we'll see ways of extracting Twitter data, and then I'll show how to derive some basic yet vital information from the components of the extracted data. Before we begin, we need to install the popular Twitter API packages available in R.


1. twitteR

2. rtweet


I prefer and will be working with the rtweet package, as it allows you to access both the REST and streaming APIs. Apart from the packages mentioned above, you will need to install the readr package so that R can read large Twitter JSON files. You can install and load the packages in RStudio using the following code.


install.packages("rtweet")

install.packages("readr")

install.packages("dplyr")

library(rtweet)

library(readr)

library(dplyr)
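One thing to set up before any of the extraction functions will work: rtweet needs Twitter API credentials from a Twitter developer account. A minimal sketch using rtweet's create_token() is shown below; the app name and all the key strings are placeholders, not real credentials.

```r
# Sketch: authenticate with the Twitter API via rtweet.
# Replace every placeholder string with your own developer credentials.
token <- create_token(
  app             = "my_r_app",             # hypothetical app name
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)
```

Once created, the token is cached by rtweet and picked up automatically by the functions used in the rest of this post.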


So let's begin with the first part: the extraction of tweets. There are different ways (or functions, technically speaking) to extract tweets in R, and a few of them are as follows:


1. stream_tweets() function:

This function can be used when you want to extract live tweets. By default it randomly samples 1% of all available live tweets over a 30-second window. You can change the default duration with the timeout argument, as in the example below.


# Extracting live tweets using the stream_tweets() function with the
# timeout argument, and storing them in the l_tweet dataframe
l_tweet <- stream_tweets("", timeout = 60)
# Viewing the dimensions of the dataframe
dim(l_tweet)

# Output:
[1] 3341   90
# 3341 rows of tweets with 90 columns were generated in the 60-second window
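For longer streams it is safer not to hold everything in memory. As a sketch, rtweet can write the raw JSON to disk with the file_name and parse arguments, and the file can then be parsed afterwards with parse_stream(); the file name here is just an example.

```r
# Sketch: stream for 5 minutes, writing raw JSON to disk instead of
# parsing in memory, then parse the saved file afterwards.
stream_tweets("", timeout = 300, file_name = "stream.json", parse = FALSE)
l_tweet <- parse_stream("stream.json")
dim(l_tweet)
```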

2. search_tweets() function:

This function can be used to extract tweets matching a search query. It returns a maximum of 18,000 tweets per request. In the example below we use search_tweets() to extract tweets on the recent BlackLivesMatters movement, looking for tweets containing #BlackLivesMatters.


# Extracting tweets on #BlackLivesMatters, including retweets, posted in
# English only. The number of tweets extracted is restricted to 5000.
t_BLM <- search_tweets("#BlackLivesMatters", n = 5000,
                       include_rts = TRUE, lang = "en")
dim(t_BLM)

# Quick look at the first 4 columns and 20 rows of the dataframe
head(t_BLM[, 1:4], 20)

Output:
> dim(t_BLM) 
[1] 4948   90

> head(t_BLM[,1:4], 20)
# A tibble: 20 x 4
   user_id  status_id  created_at          screen_name
   <chr>    <chr>      <dttm>              <chr>      
 1 1234567~ 122456789~ 2020-07-08 18:22:38 ABCDEF~
     
# Note: the user_id, status_id and screen_name values have been changed
# for privacy, and only one row of output is shown to save space.
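A quick way to get a feel for search results like these is to plot how the tweets are spread over time. The sketch below uses rtweet's ts_plot() helper on the t_BLM dataframe extracted above; it is built on ggplot2, so ggplot2 must be installed for it to work.

```r
# Sketch: plot the frequency of the extracted #BlackLivesMatters tweets,
# aggregated per hour, using rtweet's ts_plot() helper (requires ggplot2).
ts_plot(t_BLM, by = "hours")
```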

3. get_timeline() function:

This function is used to extract tweets posted by a specific user. It returns a maximum of 3,200 tweets at a time. In the code below we extract tweets posted by Donald Trump.


# Extracting the maximum number of tweets posted by Donald Trump
gt_trump <- get_timeline("@realDonaldTrump", n = 3200)
dim(gt_trump)
# View the first 5 columns of the first row
head(gt_trump[, 1:5], 1)

# In the gt_trump dataset you will see that he has tweeted on several
# topics, such as the economy and job growth, COVID-19, etc.

# Output:
> dim(gt_trump)
[1] 600  90
> head(gt_trump[,1:5], 1)
# A tibble: 1 x 5
    user_id status_id created_at          screen_name
  <chr>   <chr>     <dttm>              <chr>      
1 234567~ 1234567~ 2020-07-10 20:04:12 realDonald~
# ... with 1 more variable: text <chr>
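A timeline like gt_trump also lends itself to simple base-R questions, for instance which of the extracted tweets was retweeted the most. A sketch using the retweet_count and text columns returned by rtweet:

```r
# Sketch: find the single most-retweeted tweet in the extracted timeline
# by locating the row with the largest retweet_count.
top_idx <- which.max(gt_trump$retweet_count)
gt_trump$text[top_idx]           # the tweet text
gt_trump$retweet_count[top_idx]  # its retweet count
```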

Once we have extracted the Twitter data, we need to understand its components before we can use them to gain meaningful insight. The Twitter API returns tweets and their components as JSON, which uses named attributes and values to describe them. These attributes are then converted into dataframe columns. We can view the components of tweets using the code below.


# Viewing the components of tweets
t_BLM <- search_tweets("#BlackLivesMatters")
# Viewing the column names
names(t_BLM)

# output
 [1] "user_id"                                                                                                                                                          
 [2] "status_id"              
 [3] "created_at"             
 [4] "screen_name"            
 [5] "text"                   
 [6] "source"                 
 [7] "display_text_width"     
 [8] "reply_to_status_id"     
 [9] "reply_to_user_id"       
[10] "reply_to_screen_name"  ................[90]

The output contains 90 column names covering tweets and their components; I have shown just 10 of them above. We will now see how some of these components can be used to gain meaningful and useful insight.


1] "screen_name": used to understand a user's interests, which can subsequently be used to promote specific events and/or products (targeted marketing). Using the code below we can view the top users who have tweeted the most about Black Lives Matter. These screen names could be used to promote positive Black Lives Matter events, or to set up meet-ups where people can be made aware of racial discrimination, bringing about a much-needed positive change in society. This is one way we can use this information.

twt_BL <- search_tweets("#BlackLivesMatters", n = 5000)
Screen_N <- table(twt_BL$screen_name)
head(Screen_N)

# Output: screen_name in the 1st row and its tweet count in the 2nd row
___Abc_Vu_____  ___Defgrrr      __HIJKL__ 
              1               1               1 
 __mnopqr__4       __Stuvw__ __XYabcd_or 
              1               1               1 
# Sorting the table in descending order of tweet count and viewing it

srt <- sort(Screen_N, decreasing = TRUE)
head(srt)

# Output: the screen_name Gogost has tweeted the most (35 times) on BLM

         Gogost   abcdefghijklmn97     _qrstuvwxy_y 
             35              12              11 
      abcdklmnt88  ddoonnxxyyzz   dbbbccceeefffg 
             11              11              10 
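Since dplyr was loaded at the start, the same top-tweeters table can be produced more directly with its count() verb, which tallies a column and, with sort = TRUE, orders the result in descending order. A sketch using the twt_BL dataframe from above:

```r
# Sketch: tally tweets per screen_name with dplyr and keep the top 5.
library(dplyr)
twt_BL %>%
  count(screen_name, sort = TRUE) %>%
  head(5)
```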

2] "followers_count": as the name suggests, this column stores the number of followers a given Twitter account has. This information can be used to assess the popularity and influence of an account. From the output of the code below we can see that CNN is the most-followed account, with 48,971,178 followers. One could use such an account to spread awareness about racial discrimination and ways to address the issue.


# Extract the user data for prominent news channels using lookup_users()
Nws_ch <- lookup_users(c("CBS", "MSNBC", "CNN", "FOXNews"))
# Create a dataframe with the screen_name and followers_count columns
N_df <- Nws_ch[, c("screen_name", "followers_count")]
# And then view the follower counts for comparison
N_df

# output
  screen_name     followers_count
  <chr>                     <int>
1 CBS                 1120478
2 MSNBC               3459963
3 CNN                48971178
4 FoxNews            19416362
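Comparisons like this are often easier to read as a chart. As a sketch, base R's barplot() can visualise the N_df dataframe built above, with the counts scaled to millions for readability:

```r
# Sketch: a quick base-R bar chart of follower counts, in millions.
barplot(N_df$followers_count / 1e6,
        names.arg = N_df$screen_name,
        ylab = "Followers (millions)",
        main = "Follower counts of news channels")
```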

Quick tip: just as we can filter by language in search_tweets(), we can also filter for original tweets, retweets and popular tweets using different arguments. The sample code below is for you to try; before starting, you will need to install the plyr package, whose count() function is used below.


# [a] Filtration based on popularity of tweet related to BlackLivesMatters

# Extracting tweets with minimum of 100 retweets & 100 favorites w.r.t BLM
twt_Br <- search_tweets("BlackLivesMatters min_retweets:100 AND min_faves:100")
# Creating dataframe counts to check retweet & favorite count
counts <- twt_Br[c("retweet_count", "favorite_count")]
head(counts)
# Viewing tweets
head(twt_Br$text)


# [b] Filtering for original tweets, i.e. tweets that are not retweets,
# quotes or replies.

# Extracting original tweets of BlackLivesMatters
twts_Oblm <- search_tweets("BlackLivesMatters
                            -filter:retweets 
                            -filter:quote 
                            -filter:replies", 
                            n = 100)

# Use the code below to check for the presence of replies
count(twts_Oblm$reply_to_screen_name)
# Use this code to check for the presence of quotes
count(twts_Oblm$is_quote)
# Similarly, check for the presence of retweets
count(twts_Oblm$is_retweet)
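If you prefer not to add plyr, the same checks can be sketched in base R on the twts_Oblm dataframe: for truly original tweets, the two logical columns should contain only FALSE and the reply column only NA.

```r
# Sketch: base-R verification that the filtered tweets are original.
table(twts_Oblm$is_retweet)                  # expect only FALSE
table(twts_Oblm$is_quote)                    # expect only FALSE
sum(!is.na(twts_Oblm$reply_to_screen_name))  # expect 0
```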
 

Conclusion

We have just scratched the surface of gaining useful insights from Twitter data.

Stay tuned as we explore it further in my upcoming blogs.



© Numpy Ninja.