The general idea

Data transfer is highly controlled. The key notions are authentication and protocol.

Downloading tweets with rtweet

There are several packages that interface with Twitter: rtweet, RTwitterAPI, streamR and twitteR.
Recent packages are preferable: firms regularly update their API policies (and access rules), so older packages may rely on protocols that no longer work!

First things first

First, the packages. Download…

if(!require(rtweet)){install.packages("rtweet")}

… and activate.

library(tidyverse)
library(plotly)
library(rtweet)

Authentication

Second, you need Twitter credentials (and therefore a Twitter account). You also need a developer account: https://developer.twitter.com/en/apply-for-access Log in to Twitter and go to: https://developer.twitter.com

The next step is crucial: we need to retrieve identification credentials.
To do that, you need to create a Twitter app. Below, you can see mine. To create one, simply click on the “Create an app” button (on the right).

If you click on the “details” of an app, you can see this:

The second tab is called “Keys and tokens”: that’s where the credentials are!

Now we are ready to proceed. The lines below open the connection with the API.

consumer_key <- "your_consumer_key"
consumer_secret <- "your_consumer_secret"
access_token <- "your_access_token"
access_secret <- "your_access_secret"

create_token(app = "the_name_of_your_app",
             consumer_key = consumer_key, 
             consumer_secret = consumer_secret, 
             access_token = access_token, 
             access_secret = access_secret
             )
<Token>
<oauth_endpoint>
 request:   https://api.twitter.com/oauth/request_token
 authorize: https://api.twitter.com/oauth/authenticate
 access:    https://api.twitter.com/oauth/access_token
<oauth_app> Big Doudou
  key:    xqMm10Vwwl1XAx31CAzYNiqoi
  secret: <hidden>
<credentials> oauth_token, oauth_token_secret
---

Authentication is an important part of the process. For more info on that:
- https://cran.r-project.org/web/packages/googlesheets/vignettes/managing-auth-tokens.html
- https://httr.r-lib.org/reference/index.html (section Authentication)
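
Hard-coding keys in a script makes them easy to leak (for instance when the script is shared). A minimal sketch of one common alternative is to read them from environment variables. The variable names and the helper `read_credential()` are hypothetical, and `Sys.setenv()` is only called here to make the example runnable: in practice the values would typically live in a file such as `~/.Renviron`.

```r
# Hypothetical variable names, set here only so that the sketch runs as-is
Sys.setenv(TWITTER_CONSUMER_KEY    = "your_consumer_key",
           TWITTER_CONSUMER_SECRET = "your_consumer_secret")

# Hypothetical helper: fetch one credential, fail loudly if it is missing
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = NA)
  if (is.na(value) || value == "") {
    stop("Missing environment variable: ", name)
  }
  value
}

consumer_key    <- read_credential("TWITTER_CONSUMER_KEY")
consumer_secret <- read_credential("TWITTER_CONSUMER_SECRET")
```

The resulting objects can then be passed to create_token() exactly as in the code above.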

Extraction

If no error appears, we are ready to query. Depending on the number of requested tweets, this can take some time.

search_term <- "confinement"
tweets <- search_tweets(
  search_term,          # What to search for
  n = 2000,             # Number of tweets to download
  include_rts = FALSE   # Exclude re-tweets
)

For large queries, the progress bar helps.
Note that many options are available, such as excluding retweets or limiting the search to particular geographical zones (within a given radius).
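
As an illustration of the geographical option, search_tweets() has a geocode argument that takes a single "latitude,longitude,radius" string. The sketch below builds such a string with a hypothetical helper, `make_geocode()`; the coordinates (central Paris) and the radius are arbitrary choices. The actual query is left commented out because it needs a valid token.

```r
# Hypothetical helper: build the "latitude,longitude,radius" string
make_geocode <- function(lat, lon, radius, unit = c("km", "mi")) {
  unit <- match.arg(unit)
  stopifnot(lat >= -90, lat <= 90, lon >= -180, lon <= 180, radius > 0)
  paste0(lat, ",", lon, ",", radius, unit)
}

paris <- make_geocode(48.8566, 2.3522, 50)   # 50 km around central Paris
paris

# Not run: requires rtweet and a valid token
# tweets_paris <- search_tweets(search_term, n = 500,
#                               include_rts = FALSE, geocode = paris)
```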

Text mining

References

The reference book is: https://www.tidytextmining.com
A great interactive tutorial: https://juliasilge.shinyapps.io/learntidytext/
And the package is:

if(!require(tidytext)){install.packages("tidytext", repos = "https://cloud.r-project.org/")}
library(tidytext)

(see also: https://quanteda.io/index.html)

Data retrieval

Now, let’s move forward to simple text analysis. First, we need to prepare the data! (as usual)

tokens <- tweets %>% 
    mutate(id = row_number()) %>%    # This creates a tweet id
    select(id, text) %>%             # Keeps only id and text of the tweet
    unnest_tokens(word, text)        # Creates tokens!
tokens
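
To see what unnest_tokens() does without querying the API, the same pipeline can be run on a toy data frame (the two sentences are made up). By default, words are lower-cased and punctuation is dropped.

```r
library(dplyr)
library(tidytext)

# Made-up texts standing in for tweets
toy <- tibble(text = c("Stay home, stay SAFE!", "Home sweet home."))

toy_tokens <- toy %>%
  mutate(id = row_number()) %>%   # Tweet id
  select(id, text) %>%
  unnest_tokens(word, text)       # One row per word
toy_tokens
```

The result has one row per word and keeps the id of the text each word comes from.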

Let’s have a look at word frequencies.

tokens %>%
    count(word, sort = TRUE)

This is polluted by small words. Let’s filter that (FIRST METHOD).

tokens %>% mutate(length = nchar(word))

Data frequencies

Now let’s omit the small words (smaller than 5 characters).
NOTE: all the thresholds below depend on the sample!

tokens %>%
    mutate(length = nchar(word)) %>%
    filter(length > 4) %>%             # Keep words with at least 5 characters
    count(word, sort = TRUE) %>%       # Count words
    head(21) %>%                       # Keep only the top 21 words
    ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words")

A better way to proceed is to remove “stop words” like “a”, “I”, “of”, “the”, etc. (SECOND METHOD). It also makes sense to remove the search term and “https”.

data("stop_words")
tidy_tokens <- tokens %>% 
    anti_join(stop_words, by = "word")       # Remove irrelevant terms
tidy_tokens %>%
    count(word, sort = TRUE) %>%             # Count words
    head(20) %>%                             # Keep only top 20 words
    ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words")
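
The mechanics of anti_join() are easy to check on a small example. The sketch below uses a made-up token list and a hypothetical custom stop list that also contains the search term and “https”; it only illustrates the join and is not a replacement for the stop_words data.

```r
library(dplyr)

# Made-up tokens
toy_words <- tibble(word = c("the", "confinement", "is", "long", "https", "long"))

# Hypothetical custom list: common words, the search term and URL debris
custom_stop <- tibble(word = c("the", "is", "https", "confinement"))

toy_counts <- toy_words %>%
  anti_join(custom_stop, by = "word") %>%   # Drop every row that matches the list
  count(word, sort = TRUE)
toy_counts
```

Only “long” survives, with a count of 2.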

Problem: strange characters remain. We are going to remove them by converting the text to ASCII format and omitting the resulting NA data.

tidy_tokens <- tokens %>% 
    anti_join(stop_words) %>%                            # Remove irrelevant terms
    mutate(word = iconv(word, from = "UTF-8", to = "ASCII")) %>% # Convert to ASCII; non-convertible words become NA
    na.omit() %>%                                        # Remove missing
    filter(nchar(word) > 2,                              # Remove small words
           !(word %in% c("https", "t.co", search_term))  # search_term defined above
    )
tidy_tokens %>%
    count(word, sort = TRUE) %>%         # Count words
    head(20) %>%                         # Keep only top words
    ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words")
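
To check which tokens contain non-ASCII characters (and are therefore candidates for removal), one possible base R sketch tests the code points of each word directly. This is a different technique from the iconv() conversion used above; `is_ascii()` is a hypothetical helper and the word list is made up.

```r
# Hypothetical helper: TRUE when every character of a word is plain ASCII
is_ascii <- function(x) {
  vapply(x, function(w) all(utf8ToInt(w) < 128L), logical(1), USE.NAMES = FALSE)
}

# Made-up tokens: one accented word and one emoji among ASCII words
words <- c("confinement", "caf\u00e9", "\U0001F600", "masque", "covid19")
kept <- words[is_ascii(words)]
kept
```

Here the accented word and the emoji are flagged, and the three other words are kept.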

Perfect!

Word cloud

This data can also be shown with a word cloud. We simply use the wordcloud package: https://cran.r-project.org/web/packages/wordcloud/index.html

The package wordcloud2 adds a few features: https://cran.r-project.org/web/packages/wordcloud2/vignettes/wordcloud.html

if(!require(wordcloud)){install.packages("wordcloud")}
library(wordcloud)
cloud_data <- tidy_tokens %>% count(word)
wordcloud(words = cloud_data$word, 
          freq = cloud_data$n, min.freq = 10,
          max.words = 100, random.order = FALSE, rot.per = 0.15, 
          colors = brewer.pal(8, "Dark2"))

n-grams

See https://www.tidytextmining.com/ngrams.html

tweets %>% 
    mutate(id = row_number()) %>%      # This creates a tweet id
    select(id, text, created_at) %>%   # Keeps id, text and date of the tweet
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    count(bigram, sort = TRUE) %>%     # Count bigrams
    head(20) %>%
    ggplot(aes(y = reorder(bigram, n), x = n)) + geom_col()

Again: same issue with stop words! So we must remove them again. But it’s more complicated now. We can use the separate() function to help us.

tweets %>% 
    mutate(id = row_number()) %>%      # This creates a tweet id
    select(id, text, created_at) %>%   # Keeps id, text and date of the tweet
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>%
    filter(!(word1 %in% c(stop_words$word, "https", search_term)),
           !(word2 %in% c(stop_words$word, "https", search_term))) %>%
    count(bigram, sort = TRUE) %>%     # Count bigrams
    head(20) %>%
    ggplot(aes(y = reorder(bigram, n), x = n)) + geom_col() + ylab("Bi-gram")
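
The same steps can be followed on a toy example, which makes the role of separate() easier to see: a bigram is dropped as soon as one of its two words is a stop word. The three texts and the two-word stop list `toy_stop` are made up for the illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Made-up texts standing in for tweets
toy <- tibble(text = c("the lockdown is hard",
                       "lockdown rules change",
                       "new lockdown rules"))
toy_stop <- c("the", "is")   # Hypothetical stand-in for stop_words$word

toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>%
  filter(!(word1 %in% toy_stop), !(word2 %in% toy_stop)) %>%
  count(bigram, sort = TRUE)
toy_bigrams
```

All three bigrams of the first text contain a stop word and disappear; “lockdown rules” appears twice and comes first.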

Sentiment

This section is inspired from: https://www.tidytextmining.com/sentiment.html
Sometimes, you may be asked during the process whether you really want to download data (lexicons).
Just say yes in the console (type the correct answer: if not, you will be stuck).

First, we need to load some sentiment lexicon. AFINN is one such sentiment database.

if(!require(textdata)){install.packages("textdata", repos = "https://cloud.r-project.org/")}
library(tidytext)
library(textdata)
afinn <- get_sentiments("afinn")
afinn

To create a nice visualization, we need to extract the time of the tweets.

tokens_time <- tweets %>% 
    mutate(id = 1:nrow(tweets)) %>%    # This creates a tweet id
    select(id, text, created_at) %>%   # Keeps id, text and date of the tweet
    unnest_tokens(word, text)          # Creates tokens!
tokens_time

We then use inner_join() to merge the two sets. This function drops the rows for which no match is found, i.e., the words that are not in the lexicon.

library(lubridate)
sentiment <- tokens_time %>% 
  inner_join(afinn) %>%
  mutate(day = day(created_at),
         hour = hour(created_at) / 24,
         minute = minute(created_at) / 60 / 24,
         time = day + hour + minute)
sentiment
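
Note that the day + hour + minute encoding above discards the month and the year. A possible alternative, which is not the method used in this section, is to round the timestamps directly with floor_date() from lubridate. The three timestamps below are made up.

```r
library(lubridate)

# Made-up timestamps (ymd_hms() uses UTC by default)
created_at <- ymd_hms(c("2020-04-15 13:47:12",
                        "2020-04-15 13:05:59",
                        "2020-04-16 00:30:00"))

hourly <- floor_date(created_at, unit = "hour")   # Round down to the hour
hourly
```

The first two timestamps fall in the same hourly bin, so they would be averaged together in a group_by() on this column.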

We then compute the average sentiment, minute-by-minute.
Of course, average sentiment can be misleading. Indeed, if a text contains the terms “I’m not happy”, then only “happy” will be tagged, which is the opposite of the intended meaning.

sentiment %>%
  group_by(time, day, hour, minute) %>%
  summarise(avg_sentiment = mean(value), .groups = "drop") %>%
  # Year and month are hard-coded: adjust them to the dates of your sample
  mutate(time = make_datetime(year = 2020, month = 4, day = day,
                              hour = round(hour*24),        # round() guards against floating-point error
                              min = round(minute*24*60))) %>%
  ggplot(aes(x = time, y = avg_sentiment)) + geom_col() 

There are 24 bars per day, but the y-axis is not optimal…
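
The caveat about negation mentioned above can be verified on a toy example. The sketch below uses a tiny made-up lexicon, `toy_lexicon` (the scores are hypothetical and are not AFINN values), so that it runs without downloading anything.

```r
library(dplyr)
library(tidytext)

# Hypothetical scores, for illustration only
toy_lexicon <- tibble(word = c("happy", "sad"), value = c(3, -2))
toy <- tibble(id = 1:2, text = c("I'm happy", "I'm not happy"))

toy_scores <- toy %>%
  unnest_tokens(word, text) %>%
  inner_join(toy_lexicon, by = "word") %>%   # Only "happy" matches
  group_by(id) %>%
  summarise(score = sum(value), .groups = "drop")
toy_scores
```

Both texts receive the same score, because “not” is simply ignored by a word-by-word match.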

What about emotions? The NRC lexicon assigns words to emotion categories. Below, we order these categories. The main split is the dichotomy between positive and negative emotions.

nrc <- get_sentiments("nrc")
nrc <- nrc %>%
  mutate(sentiment = as.factor(sentiment),
         sentiment = recode_factor(sentiment,
                                   joy = "joy",
                                   trust = "trust",
                                   surprise = "surprise",
                                   anticipation = "anticipation",
                                   positive = "positive",
                                   negative = "negative",
                                   sadness = "sadness",
                                   anger = "anger",
                                   fear = "fear",
                                   disgust = "disgust",
                                   .ordered = TRUE))

We then create the merged dataset.

emotions <- tokens_time %>% 
  inner_join(nrc) %>%                  # Merge data with sentiment
  mutate(day = day(created_at),
         hour = hour(created_at)/24,
         minute = minute(created_at)/24/60,
         time = day + hour + minute)   # Create day column
emotions                               # Show the result

The merge has reduced the size of the dataset, but enough data remains to pursue the study.
Finally, we move to the pivot table that counts emotions for each time stamp.

g <- emotions %>% 
  group_by(time, sentiment, day, hour, minute) %>%
  summarise(intensity = n(), .groups = "drop") %>%
  # Year and month are hard-coded: adjust them to the dates of your sample
  mutate(time = make_datetime(year = 2020, month = 10, day = day,
                              hour = round(hour*24), min = round(minute*24*60))) %>%
  ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col() + 
  theme(axis.text.x = element_text(angle = 80, 
                                   size = 10,
                                   hjust = 1)) + xlab("Time") +
  scale_fill_viridis_d(option = "magma", direction = -1)  # Discrete viridis scale from ggplot2
ggplotly(g)

This can also be shown in percentage format.

g <- emotions %>% 
  group_by(time, sentiment, day, hour, minute) %>%
  summarise(intensity = n(), .groups = "drop") %>%
  # Year and month are hard-coded: adjust them to the dates of your sample
  mutate(time = make_datetime(year = 2020, month = 10, day = day,
                              hour = round(hour*24), min = round(minute*24*60))) %>%
  ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
  theme(axis.text.x = element_text(angle = 80, 
                                   size = 10,
                                   hjust = 1)) + xlab("Time") +
  scale_fill_viridis_d(option = "magma", direction = -1)  # Discrete viridis scale from ggplot2
ggplotly(g)

Finally, the emotions can be aggregated into two groups, positive and negative, using the ordering of the sentiment factor.

emotions %>% 
  mutate(sentiment = if_else(sentiment < "negative", "positive", "negative")) %>% 
  group_by(time, sentiment, day, hour, minute) %>%
  summarise(intensity = n(), .groups = "drop") %>%
  # Year and month are hard-coded: adjust them to the dates of your sample
  mutate(time = make_datetime(year = 2020, month = 10, day = day,
                              hour = round(hour*24), min = round(minute*24*60))) %>%
  ggplot(aes(x = time, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
  theme(axis.text.x = element_text(angle = 80, 
                                   size = 10,
                                   hjust = 1)) + xlab("Time") +
  scale_fill_manual(values = c("#001144", "#FFDD99"))
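
The comparison sentiment < "negative" only makes sense because the factor is ordered. A small base R sketch shows the mechanism: the levels follow the order chosen for the NRC categories earlier in this section, and the five values are made up.

```r
# Same order as the one chosen for the NRC categories
emotion_levels <- c("joy", "trust", "surprise", "anticipation", "positive",
                    "negative", "sadness", "anger", "fear", "disgust")

x <- factor(c("joy", "fear", "positive", "negative", "trust"),
            levels = emotion_levels, ordered = TRUE)

# Every level placed before "negative" is treated as positive
polarity <- ifelse(x < "negative", "positive", "negative")
polarity
```

With an unordered factor, the same comparison would return NA (with a warning), so the ordering step is essential.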

Advanced sentiment

The problem with the preceding methods is that they don’t take into account valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions). If a tweet says “not happy”, counting the word “happy” as positive is not a good idea! The package sentimentr is built to circumvent these issues: have a look at https://github.com/trinker/sentimentr
(see also: https://www.sentometrics.org and the book Supervised Machine Learning for Text Analysis in R hosted at https://smltar.com)

I haven’t tested aws.comprehend, but it seems promising: https://github.com/cloudyr/aws.comprehend

if(!require(sentimentr)){install.packages("sentimentr")}
if(!require(textcat)){install.packages("textcat")}
library(sentimentr)
library(textcat)

First, let’s keep only the tweets written in English!

tweets_en <- tweets %>%
  mutate(language = textcat(text)) %>%
  filter(language == "english") %>%
  dplyr::select(created_at, text)

NOTE: the code above is only meant to illustrate the function textcat(). The language is already coded in the tweets via the lang column, so it suffices to keep the rows for which lang == "en".
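
A minimal sketch of this simpler filter, on a made-up data frame that mimics the three relevant columns (created_at, text and lang) of the tweets object:

```r
library(dplyr)

# Made-up data with the same column names as the downloaded tweets
toy_tweets <- tibble(
  created_at = as.POSIXct(c("2020-10-30 10:00:00", "2020-10-30 10:05:00"), tz = "UTC"),
  text       = c("Lockdown again", "Le confinement recommence"),
  lang       = c("en", "fr")
)

toy_en <- toy_tweets %>%
  filter(lang == "en") %>%           # Keep English tweets only
  dplyr::select(created_at, text)
toy_en
```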

Next, we compute advanced sentiment.

tweet_sent <- tweets_en$text %>%
  get_sentences() %>%  # Intermediate function
  sentiment()          # Sentiment!
tweet_sent
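
To get a feel for how sentimentr treats negation, the same two functions can be applied to a pair of made-up sentences. This is only a sketch: the exact scores depend on the lexicon shipped with the installed version of the package, so no values are reported here.

```r
library(sentimentr)

# Made-up sentences: the second one negates the first
toy_text <- c("I am happy.", "I am not happy.")

toy_sent <- sentiment(get_sentences(toy_text))   # One row per sentence
toy_sent
```

The negated sentence should receive a lower score than the first one, which is the behaviour that word-by-word lexicon matching cannot reproduce.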

NOTE: the right time scale depends on how frequent the tweets are. Daily or hourly aggregation is usually preferable; if a term is very popular, there are enough tweets for higher frequencies (e.g., minute-by-minute) to be relevant.

Resources

Below, a short list of resources (to access third-party data):

Possibly deprecated:
- Facebook: https://cran.r-project.org/web/packages/Rfacebook/index.html
- Instagram: https://cran.r-project.org/web/packages/instaR/index.html

