In this blog we will quickly go through some simple functions and arguments readily available in R for extracting and analysing Twitter data using its data components. For easy understanding I have divided this blog into two parts. First, we'll look at ways of extracting Twitter data; then I'll show how to derive some basic yet vital information from the components of the extracted data. Before we begin, we need to download one of the popular Twitter API packages available in R.
I prefer and will be working with the rtweet package, as it gives access to both the REST and streaming APIs. In addition to rtweet, you will need to install the readr package so that R can read large Twitter JSON files. You can install the packages and load them in RStudio with install.packages() and library().
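For reference, here is a minimal sketch of that setup step. It assumes both packages are installed from CRAN and that a CRAN mirror is already configured; rtweet also needs an authorised Twitter account before any of the calls later in this blog will return data.

```r
# Install the packages once (assumes a CRAN mirror is configured)
install.packages(c("rtweet", "readr"))

# Load them at the start of every new R session
library(rtweet)
library(readr)
```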
Let's begin with the first part: the extraction of tweets. There are different ways (or functions, technically speaking) to extract tweets in R; a few of them are as follows:
1. stream_tweets() function:
This function can be used when you want to extract live tweets. By default it randomly samples approximately 1% of all publicly available live tweets over a 30-second window. You can change the default duration with the timeout argument, which is given in seconds. An example is given below.
# Extracting live tweets using the stream_tweets() function with the timeout argument
# and storing them in the l_tweet data frame
l_tweet <- stream_tweets("", timeout = 60)
# Viewing the dimensions of the data frame
dim(l_tweet)
# Output: 3341 90
# We can see that 3341 rows of tweets with 90 columns were collected in the 60-second window
2. search_tweets() function:
This function extracts tweets based on a search query. It returns at most 18,000 tweets per 15-minute rate-limit window. In the example below we use search_tweets() to extract tweets on the recent Black Lives Matter movement, looking for tweets containing #BlackLivesMatters.
# Extracting tweets on #BlackLivesMatters, including retweets, posted in English only.
# The number of tweets extracted is restricted to 5000.
t_BLM <- search_tweets("#BlackLivesMatters", n = 5000, include_rts = TRUE, lang = "en")
dim(t_BLM)
# Quick look at the first 4 columns and 20 rows of the data frame
head(t_BLM[, 1:4], 20)

Output:
> dim(t_BLM)
4948 90
> head(t_BLM[,1:4], 20)
# A tibble: 20 x 4
  user_id  status_id  created_at          screen_name
  <chr>    <chr>      <dttm>              <chr>
1 1234567~ 122456789~ 2020-07-08 18:22:38 ABCDEF~
# Important: user_id, status_id and screen_name have been changed for privacy reasons,
# and only one row of output is shown to save space.
3. get_timeline() function:
This function extracts tweets posted by a specific user. It returns a maximum of 3200 tweets at a time. In the code below we extract tweets posted by Donald Trump.
# Extracting the maximum number of tweets posted by Donald Trump
gt_trump <- get_timeline("@realDonaldTrump", n = 3200)
dim(gt_trump)
# View the first 5 columns and 1 row of the output
head(gt_trump[, 1:5], 1)
# In the gt_trump dataset you will see that he has tweeted on several topics
# such as the economy and job growth, covid19, etc.

Output:
> dim(gt_trump)
600 90
> head(gt_trump[,1:5], 1)
# A tibble: 1 x 5
  user_id status_id created_at          screen_name
  <chr>   <chr>     <dttm>              <chr>
1 234567~ 1234567~  2020-07-10 20:04:12 realDonald~
# ... with 1 more variable: text <chr>
Once we have extracted the Twitter data, we need to understand its components before we can use them to gain meaningful insight. The Twitter API returns tweets and their components as JSON, which uses named attributes and values to describe them. These attributes are then converted into data frame columns. We can see the components of tweets using the code given below.
# Viewing the components of tweets
t_BLM <- search_tweets("#BlackLivesMatters")
# Viewing the column names
names(t_BLM)
# Output:
"user_id" "status_id" "created_at" "screen_name" "text"
"source" "display_text_width" "reply_to_status_id" "reply_to_user_id" "reply_to_screen_name"
................
The output contains 90 column names covering the tweets and their components; I have shown only the first 10 above. We will now see how some of these components can be used to gain meaningful and useful insight.
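To make the attribute-to-column idea concrete, here is a small sketch in base R. The two records and their values are made up for illustration, and this is not how rtweet parses the API response internally; it only shows how named attribute/value pairs line up as data frame columns.

```r
# Two hypothetical tweets written as named attribute/value pairs,
# mirroring the shape of the JSON returned by the API
tweets <- list(
  list(user_id = "111", screen_name = "user_a", text = "first tweet"),
  list(user_id = "222", screen_name = "user_b", text = "second tweet")
)

# Turn each record into a one-row data frame, then stack the rows
tweet_df <- do.call(
  rbind,
  lapply(tweets, as.data.frame, stringsAsFactors = FALSE)
)

# Each attribute name has become a column name
names(tweet_df)
```

Each attribute name (user_id, screen_name, text) ends up as one column, with one row per tweet.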
1] "screen_name": This column can be used to understand users' interests, which can subsequently be used to promote specific events and/or products (targeted marketing). Using the code below we can view the users who have tweeted the most about Black Lives Matter. These users could be approached to promote positive Black Lives Matter events, or to set up meet-ups where people are made aware of racial discrimination, to help bring about a much-needed positive change in society. This is just one way of using this information.
twt_BL <- search_tweets("#BlackLivesMatters", n = 5000)
Screen_N <- table(twt_BL$screen_name)
head(Screen_N)
# Output: screen names are in the 1st row and their counts in the 2nd row
# ___Abc_Vu_____     ___Defgrrr      __HIJKL__
#              1              1              1
#    __mnopqr__4      __Stuvw__    __XYabcd_or
#              1              1              1

# Sort the table in descending order of tweet count and view it
srt <- sort(Screen_N, decreasing = TRUE)
head(srt)
# Output: screen name Gogost has tweeted the most (35 times) on BLM
#           Gogost abcdefghijklmn97   _qrstuvwxy_y
#               35               12             11
#      abcdklmnt88     ddoonnxxyyzz dbbbccceeefffg
#               11               11             10
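If you find yourself repeating the table-and-sort step, it can be wrapped in a small helper. The sketch below is my own illustration: top_tweeters() is a hypothetical name, not an rtweet function, and the screen names are made up so that the example runs without calling the API.

```r
# Hypothetical helper: return the k most frequent screen names with their counts.
# `tweets` is any data frame that has a screen_name column.
top_tweeters <- function(tweets, k = 5) {
  counts <- sort(table(tweets$screen_name), decreasing = TRUE)
  head(counts, k)
}

# Made-up screen names standing in for a search_tweets() result
demo <- data.frame(
  screen_name = c("user_a", "user_b", "user_a", "user_c", "user_a", "user_b"),
  stringsAsFactors = FALSE
)

# The two most active accounts in the demo data
top_tweeters(demo, k = 2)
```

With real data you would pass twt_BL instead of demo and pick whatever k suits you.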
2] "followers_count": As the name suggests, this column stores the number of followers a given Twitter account has. The information in this column can be used to assess the popularity and influence of that account. From the output of the code below we can see that CNN is the most followed of these accounts, with 48,971,178 followers. One could use this account to run awareness ads about racial discrimination and ways to address the issue.
# Extract the user data for prominent news channels using the lookup_users() function
Nws_ch <- lookup_users(c("CBS", "MSNBC", "CNN", "FOXNews"))
# Create a data frame with the screen_name and followers_count columns
N_df <- Nws_ch[, c("screen_name", "followers_count")]
# View the followers count for comparison
N_df
# Output:
#   screen_name followers_count
#   <chr>                 <int>
# 1 CBS                 1120478
# 2 MSNBC               3459963
# 3 CNN                48971178
# 4 FoxNews            19416362
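With only four accounts the comparison is easy to do by eye, but for a longer list it helps to sort. Here is a small sketch that rebuilds the table by hand from the output above (so it is a snapshot of those numbers, not live data) and ranks the accounts with base R's order().

```r
# Follower counts copied from the output above (a snapshot, not live data)
N_df <- data.frame(
  screen_name = c("CBS", "MSNBC", "CNN", "FoxNews"),
  followers_count = c(1120478L, 3459963L, 48971178L, 19416362L),
  stringsAsFactors = FALSE
)

# Order the rows from most to least followed
ranked <- N_df[order(N_df$followers_count, decreasing = TRUE), ]

# The most followed account comes first
ranked$screen_name[1]
```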
Quick tip: Just as we can filter by language in the search_tweets() function, we can also filter for original tweets, retweets and popular tweets by adding different search operators to the query. The sample code below is for you to try; before starting, you will need to install the plyr package, whose count() function is used in part [b].
# [a] Filtering by popularity of tweets related to BlackLivesMatters
# Extracting tweets on BLM with a minimum of 100 retweets and 100 favorites
twt_Br <- search_tweets("BlackLivesMatters min_retweets:100 AND min_faves:100")
# Creating the data frame counts to check the retweet and favorite counts
counts <- twt_Br[c("retweet_count", "favorite_count")]
head(counts)
# Viewing the tweets
head(twt_Br$text)
# [b] Filtering by originality, meaning tweets that are not retweets, quotes or replies
# count() comes from the plyr package
library(plyr)
# Extracting original tweets on BlackLivesMatters
twts_Oblm <- search_tweets("BlackLivesMatters -filter:retweets -filter:quote -filter:replies", n = 100)
# Check for the presence of replies
count(twts_Oblm$reply_to_screen_name)
# Check for the presence of quotes
count(twts_Oblm$is_quote)
# Similarly, check for the presence of retweets
count(twts_Oblm$is_retweet)
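If you would like to see what these two checks are looking for without calling the API, here is a small base R sketch on a made-up data frame. The column names follow the rtweet output shown earlier in this blog; the four rows and the reply target "user_x" are invented for illustration.

```r
# Hypothetical stand-in for a search_tweets() result with the relevant columns
tweets <- data.frame(
  retweet_count = c(250L, 120L, 40L, 500L),
  favorite_count = c(300L, 90L, 800L, 150L),
  is_retweet = c(FALSE, FALSE, TRUE, FALSE),
  is_quote = c(FALSE, TRUE, FALSE, FALSE),
  reply_to_screen_name = c(NA, NA, NA, "user_x"),
  stringsAsFactors = FALSE
)

# [a] Popular: at least 100 retweets and at least 100 favorites
popular <- tweets[tweets$retweet_count >= 100 & tweets$favorite_count >= 100, ]

# [b] Original: not a retweet, not a quote and not a reply
original <- tweets[!tweets$is_retweet & !tweets$is_quote &
                     is.na(tweets$reply_to_screen_name), ]

nrow(popular)
nrow(original)
```

On real results from the filtered queries above, every row should pass the matching check, so a mismatch between nrow() of the result and nrow() of the subset is a quick sign that a filter did not behave as expected.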
We have just scratched the surface of gaining useful insight from Twitter data.
Stay tuned as we explore it further in my upcoming blogs.