library(lexRankr)
library(tidyverse)
library(stringr)
library(httr)
library(jsonlite)
In this document we retrieve tweets through the Twitter API and then analyze them with lexRankr to find a user’s most representative tweets. If you don’t care about interacting with the Twitter API, you can skip ahead to the lexrank analysis.
Before we can analyze tweets, we need some tweets to analyze. We’ll use Twitter’s API, which requires an account and a set of credentials: a consumer key, a consumer secret, a token, and a token secret. The code below shows how to set up these credentials for use in this vignette.
# set api tokens/keys/secrets as environment vars
# Sys.setenv(cons_key = 'my_cons_key')
# Sys.setenv(cons_secret = 'my_cons_sec')
# Sys.setenv(token = 'my_token')
# Sys.setenv(token_secret = 'my_token_sec')
#sign oauth
auth <- httr::oauth_app("twitter", key=Sys.getenv("cons_key"), secret=Sys.getenv("cons_secret"))
sig <- httr::sign_oauth1.0(auth, token=Sys.getenv("token"), token_secret=Sys.getenv("token_secret"))
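Because Sys.getenv() returns an empty string for a variable that is not set, a missing credential would otherwise only surface later as a failed request. Below is a minimal sketch of an up-front check, assuming the four variable names used above; the helper missing_env_vars is hypothetical and not part of httr.

```r
# Hypothetical helper: return the names of environment variables that are unset or empty.
missing_env_vars <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[values == ""]
}

# The four variable names assumed by this vignette.
needed <- c("cons_key", "cons_secret", "token", "token_secret")
not_set <- missing_env_vars(needed)
if (length(not_set) > 0) {
  message("Set these before signing requests: ", paste(not_set, collapse = ", "))
}
```

Running the check before building the signature makes it clear which value is missing without printing any of the secrets themselves.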
Now that our credentials are set up, let’s write a function to get a user’s tweets from the API. The function get_timeline_df, defined below, takes a user’s Twitter handle, the number of tweets to get from the API, and the credentials we just set up. It returns a data frame with the columns id, text, favorite_count, retweet_count, and created_at. The Twitter API returns at most 200 tweets per request, so the function loops until it has requested the desired number of tweets.
get_timeline_df <- function(user, n_tweets=200, oauth_sig) {
  base_url <- "https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name="
  i <- 0
  n_left <- n_tweets
  timeline_df <- NULL
  #loop until all n_tweets have been requested
  while (n_left > 0) {
    n_to_get <- min(200, n_left)
    i <- i+1
    #incorporate max id in get_url (so as not to download same 200 tweets repeatedly)
    if (i==1) {
      get_url <- paste0(base_url, user, "&count=", n_to_get)
    } else {
      #format() keeps the large numeric id out of scientific notation in the url
      get_url <- paste0(base_url, user, "&count=", n_to_get,
                        "&max_id=", format(max_id, scientific=FALSE))
    }
    #GET tweets and stop with an error if the request failed
    response <- httr::GET(get_url, oauth_sig)
    httr::stop_for_status(response)
    #extract content and clean up
    response_content <- httr::content(response)
    json_content <- jsonlite::toJSON(response_content)
    #clean out evil special chars
    json_conv <- iconv(json_content, "UTF-8", "ASCII", sub = "") %>%
      stringr::str_replace_all("\003", "") #special character (^C) not caught by above clean
    timeline_list <- jsonlite::fromJSON(json_conv)
    #stop paging if the api returned no tweets
    if (length(timeline_list) == 0) break
    #extract desired fields
    fields_i_care_about <- c("id", "text", "favorite_count", "retweet_count", "created_at")
    timeline_df <- purrr::map(fields_i_care_about, ~unlist(timeline_list[[.x]])) %>%
      purrr::set_names(fields_i_care_about) %>%
      tibble::as_tibble() %>%
      dplyr::bind_rows(timeline_df) %>%
      dplyr::distinct() #max_id is inclusive, so drop the tweet fetched twice
    #store min id (oldest tweet) to set as max id for next GET
    #note: ids are handled as doubles here, so very large ids may lose precision
    max_id <- min(purrr::map_dbl(timeline_list$id, 1))
    #update number of tweets left
    n_left <- n_left-n_to_get
  }
  return(timeline_df)
}
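The paging logic can be tried without credentials. The sketch below swaps the API for a made-up in-memory timeline; fake_timeline, fetch_page, and collect_tweets are hypothetical names used only for illustration. Unlike get_timeline_df, it steps max_id to one below the oldest id seen instead of dropping duplicates afterwards, and it stops early when the timeline runs out.

```r
# Made-up timeline standing in for the API: ids 450 (newest) down to 1 (oldest).
fake_timeline <- data.frame(id = 450:1, text = paste("tweet", 450:1))

# Hypothetical stand-in for one GET: up to `count` tweets with id <= max_id, newest first.
fetch_page <- function(count, max_id = Inf) {
  page <- fake_timeline[fake_timeline$id <= max_id, ]
  head(page, count)
}

# Page through the timeline until n_tweets are collected or no tweets remain.
collect_tweets <- function(n_tweets, page_size = 200) {
  collected <- NULL
  max_id <- Inf
  while (NROW(collected) < n_tweets) {
    page <- fetch_page(min(page_size, n_tweets - NROW(collected)), max_id)
    if (nrow(page) == 0) break      # timeline exhausted
    collected <- rbind(collected, page)
    max_id <- min(page$id) - 1      # max_id is inclusive, so step below the oldest id seen
  }
  collected
}

tweets <- collect_tweets(300)
```

Subtracting 1 is safe here only because the made-up ids are small integers. Real tweet ids are 64-bit values that a double cannot always represent exactly, so against the real API it would be more robust to work from the id_str field that the API also returns.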
We can now use our function to gather a user’s tweets along with each tweet’s date-time, favorite count, and retweet count. Let’s use one of the most famous Twitter accounts as of late: @realDonaldTrump.
tweets_df <- get_timeline_df("realDonaldTrump", 600, sig) %>%
  mutate(text = str_replace_all(text, "\n", " ")) #clean out newlines for display
tweets_df %>%
  head(n=3) %>%
  select(text, created_at) %>%
  knitr::kable()
text | created_at |
---|---|
Yes, it is true - Carlos Slim, the great businessman from Mexico, called me about getting together for a meeting. We met, HE IS A GREAT GUY! | Tue Dec 20 20:27:57 +0000 2016 |
especially how to get people, even with an unlimited budget, out to vote in the vital swing states ( and more). They focused on wrong states | Tue Dec 20 13:09:18 +0000 2016 |
Bill Clinton stated that I called him after the election. Wrong, he called me (with a very nice congratulations). He “doesn’t know much” … | Tue Dec 20 13:03:59 +0000 2016 |
We now have a data frame that contains a column of tweets, which will be the subject of the rest of the analysis. With the data in this format, we only need to call the bind_lexrank function to apply the lexrank algorithm to the tweets. The function adds a column of lexrank scores; the higher the score, the more representative the tweet is of the tweets we downloaded.
Note: typically one would parse documents into sentences before applying lexrank (see ?unnest_sentences); however, we will treat each tweet as a sentence for this analysis.
tweets_df %>%
  bind_lexrank(text, id, level="sentences") %>%
  arrange(desc(lexrank)) %>%
  head(n=5) %>%
  select(text, lexrank) %>%
  knitr::kable(caption = "Most Representative @realDonaldTrump Tweets")
text | lexrank |
---|---|
MAKE AMERICA GREAT AGAIN! | 0.0087551 |
Well, the New Year begins. We will, together, MAKE AMERICA GREAT AGAIN! | 0.0085258 |
HAPPY PRESIDENTS DAY - MAKE AMERICA GREAT AGAIN! | 0.0082361 |
Happy Thanksgiving to everyone. We will, together, MAKE AMERICA GREAT AGAIN! | 0.0060486 |
Hopefully, all supporters, and those who want to MAKE AMERICA GREAT AGAIN, will go to D.C. on January 20th. It will be a GREAT SHOW! | 0.0059713 |
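To build intuition for why near-duplicate tweets rise to the top, here is a simplified base R sketch of the idea behind lexrank: score each text by its centrality in a graph whose edges link similar texts. It uses plain term-frequency cosine similarity and a hand-rolled PageRank on a made-up four-text corpus. The threshold of 0.2 and damping of 0.85 are illustrative choices, and lexRankr’s own similarity measure and internals may differ from this version.

```r
# Made-up mini corpus; each string plays the role of one tweet.
docs <- c(
  "make america great again",
  "together we will make america great again",
  "happy thanksgiving make america great again",
  "the weather is lovely today"
)

# Term-frequency matrix: one row per text, one column per word.
tokens <- lapply(strsplit(tolower(docs), "[^a-z]+"), function(tok) tok[tok != ""])
vocab <- sort(unique(unlist(tokens)))
tf <- t(vapply(tokens,
               function(tok) as.numeric(table(factor(tok, levels = vocab))),
               numeric(length(vocab))))

# Cosine similarity between every pair of texts.
norms <- sqrt(rowSums(tf^2))
sim <- (tf %*% t(tf)) / (norms %o% norms)

# Keep similarities above the threshold as edge weights; drop self-links.
weights <- sim * (sim > 0.2)
diag(weights) <- 0

# PageRank by power iteration over the weighted similarity graph.
pagerank <- function(w, damping = 0.85, iters = 100) {
  n <- nrow(w)
  out_weight <- rowSums(w)
  # Row-normalise; a text with no links jumps uniformly to every text.
  trans <- w / ifelse(out_weight == 0, 1, out_weight)
  trans[out_weight == 0, ] <- 1 / n
  rank <- rep(1 / n, n)
  for (k in seq_len(iters)) {
    rank <- (1 - damping) / n + damping * as.vector(t(trans) %*% rank)
  }
  rank
}

scores <- pagerank(weights)
```

In this toy corpus the short slogan shared by the three similar texts gets the highest score and the unrelated weather text gets the lowest, which mirrors the pattern in the table above: a tweet scores well when many other tweets resemble it.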
With our get_timeline_df function we can easily repeat this analysis for other users. Below we repeat the whole analysis in a single magrittr pipeline.
get_timeline_df("dog_rates", 600, sig) %>%
  mutate(text = str_replace_all(text, "\n", " ")) %>%
  bind_lexrank(text, id, level="sentences") %>%
  arrange(desc(lexrank)) %>%
  head(n=5) %>%
  select(text, lexrank) %>%
  knitr::kable(caption = "Most Representative @dog_rates Tweets")
text | lexrank |
---|---|
@Lin_Manuel good day good dog | 0.0167123 |
Please keep loving | 0.0099864 |
Here we h*ckin go | 0.0085708 |
Last day to get anything from our Valentine’s Collection by Valentine’s Day! | 0.0077583 |
Even if I tried (which I would never), I’d last like 17 seconds | 0.0073899 |