The general idea

Access to platform data is tightly controlled. The two key notions are authentication (proving who you are) and protocol (sending requests in the format the platform expects).

Downloading toots with rtoot

There are several packages that interface with Twitter: rtweet, RTwitterAPI, streamR and twitteR.
But since Auth V2, we need RTwitterV2, which only runs on R v4.2!
Documentation: https://github.com/MaelKubli/RTwitterV2
Recent packages are preferable because firms regularly update their API policies (and access rules), so older packages may rely on protocols that no longer work.
Unfortunately, the Twitter API is no longer free!
Hence, in this notebook, we will test a competitor: Mastodon.
The package for this is rtoot.

First things first

First, the packages. Download…

if(!require(rtoot)){install.packages("rtoot")}

… and activate.

library(tidyverse)
library(plotly)
library(rtoot)

Authentication

Second: authentication. You have to choose a particular instance of the network. Personally, I am registered on “sciences.social”; the largest one is “mastodon.social” (see https://mastodonservers.net/servers/top). If you are prompted in the console, write the answer without the quotation marks and choose a public token.

rtoot::auth_setup(
  instance = "mastodon.social",
  type = "public"
)
Token of type "public" for instance mastodon.social is valid
<mastodon bearer token> for instance: mastodon.social of type: public 
# get_timeline_hashtag(hashtag = "rstats", 
#                      instance = "mastodon.social",
#                      limit = 200)

Authentication can be an important part of the process. For more info on that:
- https://cran.r-project.org/web/packages/googlesheets/vignettes/managing-auth-tokens.html
- https://httr.r-lib.org/reference/index.html (section Authentication)
- https://blog.r-hub.io/2021/01/25/oauth-2.0/

Extraction

If no error appears, we are ready to query. Depending on the number of requested toots, this can take some time.

The package allows different types of queries.
For instance, below we use the get_timeline_hashtag function to access toots that include one particular term, the “hashtag”.

search_term <- "election"
toots <- get_timeline_hashtag(hashtag = search_term, 
                              instance = "mastodon.social",
                              limit = 2000)

Text mining

References

The reference book is: https://www.tidytextmining.com
A great interactive tutorial: https://juliasilge.shinyapps.io/learntidytext/
And the package is:

if(!require(tidytext)){install.packages("tidytext", repos = "https://cloud.r-project.org/")}
library(tidytext)

(see also: https://quanteda.io/index.html)

Data retrieval

Now, let’s move forward to simple text analysis. First, we need to prepare the data! (as usual)

tokens <- toots %>% 
    select(id, content) %>%             # Keeps only id and text/content of the toot
    unnest_tokens(word, content)        # Creates tokens!
tokens
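
To see what unnest_tokens() does, here is a minimal sketch on a made-up table (demo_toots and its contents are illustrative assumptions, not data downloaded from Mastodon). By default, the function lower-cases the text and strips punctuation, so each row of the output holds one clean word.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for the toots table
demo_toots <- tibble(id = c("1", "2"),
                     content = c("Hello world! Hello R.", "Text mining is fun"))

demo_tokens <- demo_toots %>%
    unnest_tokens(word, content)        # One row per word, lower-cased, no punctuation

demo_tokens %>% count(word, sort = TRUE)   # "hello" appears twice, the rest once
```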

Let’s have a look at word frequencies.

tokens %>%
    count(word, sort = TRUE)

This is polluted by short words. Let’s filter them out based on word length (FIRST METHOD). We start by computing the length of each word.

tokens %>% mutate(length = nchar(word))

Data frequencies

Now let’s omit the short words (fewer than 5 characters).
NOTE: all the thresholds below depend on the sample!

tokens %>%
    mutate(length = nchar(word)) %>%
    filter(length > 4) %>%             # Keep words with more than 4 characters
    count(word, sort = TRUE) %>%       # Count words
    head(21) %>%                       # Keep only top 21 words
    ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words") + theme_bw()

A better way to proceed is to remove “stop words” like “a”, “I”, “of”, “the”, etc. (SECOND METHOD). Also, it would make sense to remove the search term and “https”.

data("stop_words")
tidy_tokens <- tokens %>% 
    anti_join(stop_words)                    # Remove irrelevant terms
tidy_tokens %>%
    count(word, sort = TRUE) %>%             # Count words
    head(20) %>%                             # Keep only top 20 words
    ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words") + theme_bw()

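Problem: strange characters remain. We are going to remove them by converting the text to ASCII format and omitting the resulting NA values. We also add a custom list of stop words, most of which come from the HTML markup of the toots.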

new_stop_words <- c("https", "span", "class", "href", "target", "_blank", "rel", "tag",
                    "mastodon.social", "ellipsis", "mastodon.online", "mstdn.social", "amp",
                    "http", "invisible", "03", search_term, tolower(search_term), "d0", "src",
                    "tags", "mention", "noreferrer", "noopener", "nofollow", "hashtag", "translate",
                    "www", "url", "die", "der", "und", "a", "p", "br", "1", "2", "01", "02")
tidy_tokens <- tokens %>% 
    anti_join(stop_words) %>%                            # Remove irrelevant terms
    mutate(word = iconv(word, from = "UTF-8", to = "ASCII")) %>% # Convert to ASCII (non-convertible words become NA)
    na.omit() %>%                                        # Remove missing
    filter(nchar(word) > 2,                              # Remove small words
           !(word %in% new_stop_words)  # search_term defined above
    )
tidy_tokens %>%
    count(word, sort = TRUE) %>%         # Count words
    head(30) %>%                         # Keep only top words
    ggplot(aes(y = reorder(word,n), x = n)) + geom_col() + ylab("Words") + theme_bw()

Perfect!
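
Many of the custom stop words above (“span”, “href”, “class”, …) appear because the content column holds HTML. An alternative is to strip the markup before tokenizing. The sketch below is only a rough, regex-based approach: the helper strip_html() and the sample string are hypothetical, and a dedicated HTML parser would be more robust.

```r
# Regex-based sketch: removes anything that looks like an HTML tag
strip_html <- function(x) {
    x <- gsub("<[^>]+>", " ", x)                # Replace each tag by a space
    x <- gsub("&amp;", "&", x, fixed = TRUE)    # Decode the most common entity
    x <- gsub("\\s+", " ", x)                   # Collapse repeated whitespace
    trimws(x)                                   # Drop leading/trailing spaces
}

# Made-up example mimicking the markup of a toot
demo_html <- "<p>Vote on <a href=\"https://example.org\" class=\"mention hashtag\">#election</a> day &amp; beyond</p>"
strip_html(demo_html)
```

It could be applied with mutate(content = strip_html(content)) just before unnest_tokens(); the list of custom stop words would then likely be much shorter.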

Word cloud

This data can also be shown with a word cloud. We simply use the wordcloud package: https://cran.r-project.org/web/packages/wordcloud/index.html

The package wordcloud2 adds a few features: https://cran.r-project.org/web/packages/wordcloud2/vignettes/wordcloud.html

if(!require(wordcloud)){install.packages("wordcloud")}
library(wordcloud)
cloud_data <- tidy_tokens %>% count(word)
wordcloud(words = cloud_data$word, 
          freq = cloud_data$n, min.freq = 10,
          max.words = 82, random.order = FALSE, rot.per = 0.15, 
          colors = brewer.pal(8, "Dark2")) 

n-grams

See https://www.tidytextmining.com/ngrams.html

toots %>% 
    mutate(id = row_number()) %>%         # This creates a toot id
    select(id, content, created_at) %>%   # Keeps id, text and date of the toot
    unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
    filter(!(bigram %in% c("span a", "a href", "a a", "_blank span", "href https", "span class",
                           "a p", "p p", "br a", "target _blank", "noreferrer target", "span span",
                           "class mention", "rel nofollow", "mention hashtag",
                           "class invisible", "invisible https", "p a",
                           "nofollow noopener", "noopener noreferrer", "hashtag rel"))) %>%
    count(bigram, sort = TRUE) %>%        # Count bigrams
    head(20) %>%
    ggplot(aes(y = reorder(bigram, n), x = n)) + 
  geom_col() + theme_bw() + ylab("bigrams")

Again: same issue with stop words! So we must remove them again. But it’s more complicated now. We can use the separate() function to help us.

toots %>% 
    mutate(id = row_number()) %>%         # This creates a toot id
    select(id, content, created_at) %>%   # Keeps id, text and date of the toot
    unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
    mutate(bigram = iconv(bigram, from = "UTF-8", to = "ASCII")) %>%
    na.omit() %>%
    separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>%
    filter(!(word1 %in% c(new_stop_words, stop_words$word)),
           !(word2 %in% c(new_stop_words, stop_words$word))) %>%
    group_by(bigram) %>%
    count(sort = TRUE) %>%
    head(24) %>%
    ggplot(aes(y = reorder(bigram, n), x = n)) + geom_col() + ylab("Bi-gram") +
    theme_bw()
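
The logic of the bigram filter is easier to follow on a single sentence. The sketch below uses a made-up sentence and a tiny, hypothetical stop-word list (demo_stop): a bigram is kept only if neither of its two words is a stop word.

```r
library(dplyr)
library(tidyr)
library(tidytext)

demo_text <- tibble(id = 1, content = "the quick brown fox jumps over the lazy dog")
demo_stop <- c("the", "over")             # Hypothetical, tiny stop-word list

demo_bigrams <- demo_text %>%
    unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%          # 8 bigrams
    separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>% # Split, keep bigram
    filter(!(word1 %in% demo_stop),       # Drop if first word is a stop word
           !(word2 %in% demo_stop))       # Drop if second word is a stop word
demo_bigrams$bigram
```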

cloud_data <- toots %>% 
    mutate(id = row_number()) %>%         # This creates a toot id
    select(id, content, created_at) %>%   # Keeps id, text and date of the toot
    unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
    mutate(bigram = iconv(bigram, from = "UTF-8", to = "ASCII")) %>%
    na.omit() %>%
    separate(bigram, c("word1", "word2"), sep = " ", remove = FALSE) %>%
    filter(!(word1 %in% c(new_stop_words, stop_words$word)),
           !(word2 %in% c(new_stop_words, stop_words$word))) %>%
    group_by(bigram) %>%
    count(sort = TRUE) %>%
    mutate(length = nchar(bigram)) %>%    # Length of the bigram, in characters
    filter(length < 15)                   # Keep short bigrams so they fit in the cloud
  
cloud_data
wordcloud(words = cloud_data$bigram, 
          freq = cloud_data$n, min.freq = 10,
          max.words = 35, random.order = FALSE, rot.per = 0.10, 
          colors = brewer.pal(8, "Dark2")) 

Sentiment

This section is inspired by: https://www.tidytextmining.com/sentiment.html
Sometimes, you may be asked in the process if you really want to download data (lexicons).
Just say yes in the console (type the correct answer, otherwise you will be stuck).

First, we need to load some sentiment lexicon. AFINN is one such sentiment database.

if(!require(textdata)){install.packages("textdata", repos = "https://cloud.r-project.org/")}
Loading required package: textdata
library(tidytext)
library(textdata)
afinn <- get_sentiments("afinn")
afinn
afinn |> filter(value > 3)

To create a nice visualization, we need to extract the time of the toots.

tokens_time <- toots %>% 
    mutate(id = row_number()) %>%         # This creates a toot id
    select(id, content, created_at) %>%   # Keeps id, text and date of the toot
    unnest_tokens(word, content)          # Creates tokens!
tokens_time

We then use inner_join() to merge the two sets. This function removes the cases when a match does not occur.

library(lubridate)
sentiment <- tokens_time %>% 
    inner_join(afinn) %>%
    mutate(day = day(created_at),
           hour = hour(created_at) / 24,
           minute = minute(created_at) / 60 / 24,
           time = day + hour + minute)
Joining with `by = join_by(word)`
sentiment

We then compute the average sentiment, minute-by-minute, or day-by-day, depending on frequency.
Of course, average sentiment can be misleading. Indeed, if a text contains the terms “I’m not happy”, then only “happy” will be tagged, which is the opposite of the intended meaning.

sentiment %>%
  mutate(date = as.Date(created_at)) |>
    group_by(date) %>%
    #filter(year(date)==2024) |>
    summarise(avg_sentiment = mean(value)) %>%
    ggplot(aes(x = date, y = avg_sentiment)) + geom_col() + theme_bw()
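
The negation problem mentioned above can be illustrated on one sentence. The sketch below relies on a hypothetical two-word lexicon (demo_lexicon, in the same shape as AFINN) and an assumed, very incomplete list of negators. It flips the score of a word when the word just before it is a negator; this is a naive correction for illustration, not a full treatment of negation.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini-lexicon in the same shape as AFINN (word, value)
demo_lexicon <- tibble(word = c("happy", "sad"), value = c(3, -2))
negators <- c("not", "no", "never")       # Assumed list, far from complete

demo_scores <- tibble(id = 1, content = "I am not happy") %>%
    unnest_tokens(word, content) %>%
    mutate(previous = lag(word)) %>%                   # Word just before the current one
    inner_join(demo_lexicon, by = "word") %>%          # Keeps scored words only
    mutate(adjusted = if_else(previous %in% negators, -value, value))
demo_scores                                # value is 3, adjusted is -3
```

With several toots, the lag() step should be preceded by group_by(id), so that the last word of one toot is not treated as the predecessor of the first word of the next.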

What about emotions? The NRC lexicon associates words with emotions. Below, we order the emotions, from the positive ones to the negative ones. The most important distinction is the dichotomy between positive & negative emotions.

nrc <- get_sentiments("nrc")
nrc <- nrc %>%
    mutate(sentiment = as.factor(sentiment),
           sentiment = recode_factor(sentiment,
                                     joy = "joy",
                                     trust = "trust",
                                     surprise = "surprise",
                                     anticipation = "anticipation",
                                     positive = "positive",
                                     negative = "negative",
                                     sadness = "sadness",
                                     anger = "anger",
                                     fear = "fear",
                                     disgust = "disgust",
                                     .ordered = TRUE))
nrc

We then create the merged dataset.

emotions <- tokens_time %>% 
    inner_join(nrc) %>%                     # Merge data with sentiment
    mutate(date = as.Date(created_at))      # Create day column
Joining with `by = join_by(word)`
Warning: Detected an unexpected many-to-many relationship between `x` and `y`.
emotions                                    # Show the result

The merging has reduced the size of the dataset, but there still remains enough to pursue the study.
Finally, we move to the pivot-table that counts emotions for each day.

g <- emotions %>% 
    group_by(date, sentiment) %>%
    summarise(intensity = n()) %>%
    filter(year(date) == 2024) |>
    ggplot(aes(x = date, y = intensity, fill = sentiment)) + geom_col() + 
    theme_bw() +                          # Complete theme first, otherwise it resets the tweaks below
    theme(axis.text.x = element_text(angle = 80, 
                                     size = 10,
                                     hjust = 1)) + xlab("Time") +
    scale_fill_viridis_d(option = "magma", direction = -1)
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
ggplotly(g)

This can also be shown in percentage format.

g <- emotions %>% 
    group_by(date, sentiment) %>%
    filter(year(date) == 2024) |>
    summarise(intensity = n()) %>%
    ggplot(aes(x = date, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
    theme_bw() +                          # Complete theme first, otherwise it resets the tweaks below
    theme(axis.text.x = element_text(angle = 80, 
                                     size = 10,
                                     hjust = 1)) + xlab("Time") +
    scale_fill_viridis_d(option = "magma", direction = -1) + 
    geom_hline(yintercept = 0.5, linetype = 2)
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
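ggplotly(g)

Finally, we collapse the emotions into two groups. Because sentiment is an ordered factor, every level ranked before “negative” counts as positive.

emotions %>% 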
    mutate(sentiment = if_else(sentiment < "negative", "positive", "negative")) %>% 
    group_by(date, sentiment) %>%
    summarise(intensity = n()) %>%
    ggplot(aes(x = date, y = intensity, fill = sentiment)) + geom_col(position = "fill") +
    theme_bw() + 
    theme(axis.text.x = element_text(angle = 80, 
                                     size = 10,
                                     hjust = 1)) + xlab("Time") +
    geom_hline(yintercept = 0.5) + 
    scale_fill_manual(values = c("#223333", "#FFBB99")) 
`summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
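
The comparison sentiment < "negative" used in the last chunk works because the factor is ordered: R compares positions in the level ordering, not alphabetical order. A minimal base-R sketch, with a made-up vector of four emotions and the same level ordering as in the recoding of the NRC lexicon:

```r
emotion_levels <- c("joy", "trust", "surprise", "anticipation", "positive",
                    "negative", "sadness", "anger", "fear", "disgust")

# Made-up vector of emotions, stored as an ordered factor
demo_emotions <- factor(c("joy", "fear", "positive", "negative"),
                        levels = emotion_levels, ordered = TRUE)

demo_emotions < "negative"                                   # Position-based comparison
ifelse(demo_emotions < "negative", "positive", "negative")   # Two-group recoding
```

With an unordered factor, the same comparison would return NA with a warning, which is why the .ordered argument matters in recode_factor().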

Advanced sentiment

The problem with the preceding methods is that they don’t take into account valence shifters (i.e., negators, amplifiers (intensifiers), de-amplifiers (downtoners), and adversative conjunctions). If a toot says “not happy”, counting the word “happy” as positive is not a good idea! The package sentimentr is built to circumvent these issues: have a look at https://github.com/trinker/sentimentr
(see also: https://www.sentometrics.org and the book Supervised Machine Learning for Text Analysis in R hosted at https://smltar.com)

I haven’t tested aws.comprehend, but it seems promising: https://github.com/cloudyr/aws.comprehend

if(!require(sentimentr)){install.packages(c("sentimentr", "textcat"))}
library(sentimentr)
library(textcat)

First, let’s keep only the toots written in English!

# toots_en <- toots %>%
#     mutate(language = textcat(content)) %>%
#     filter(language == "english") %>%
#     dplyr::select(created_at, content)

toots_en <- toots |> filter(language == "en")

NOTE: the commented code above only illustrates the function textcat. The language is already coded in the toots via the language column/variable, so it suffices to keep the instances for which language == “en”.

Next, we compute advanced sentiment.

tweet_sent <- toots_en$content %>%
    get_sentences() %>%  # Intermediate function
    sentiment()          # Sentiment!
tweet_sent
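
To check that valence shifters are indeed taken into account, here is a small sketch on two made-up sentences. Only the sign of the scores is discussed, since the exact values depend on the lexicon shipped with the installed version of sentimentr.

```r
library(sentimentr)

# Two made-up sentences that differ only by a negator
demo_sentences <- c("I am happy.", "I am not happy.")

demo_sent <- sentiment(get_sentences(demo_sentences))
demo_sent   # One row per sentence: element_id, sentence_id, word_count, sentiment
```

The first score should be positive and the second negative, whereas a plain lexicon join would give both sentences the same score.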

NOTE: depending on how many toots are available, it is better to aggregate at the daily or hourly scale. If a term is very popular, then higher frequencies (e.g., hourly) become relevant.

tweet_sent %>%
    as_tibble() %>%
    mutate(date = as.Date(toots_en$created_at[element_id])) %>%  # Assumed reconstruction: date of the parent toot
    group_by(date) %>%
    summarise(avg_sent = mean(sentiment)) %>%                    # Average sentiment per day
    ggplot(aes(x = date, y = avg_sent)) + geom_col() 
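
Regarding the choice of time scale mentioned in the note above, one possible way to switch from daily to hourly aggregation is floor_date() from lubridate. The sketch below uses three made-up timestamps and scores (demo_sentiment is an illustrative assumption, not the output of the chunks above).

```r
library(dplyr)
library(lubridate)

# Made-up timestamps and sentiment values
demo_sentiment <- tibble(
    created_at = ymd_hms(c("2024-03-01 10:05:00",
                           "2024-03-01 10:55:00",
                           "2024-03-01 11:10:00")),
    value = c(2, -1, 3))

demo_hourly <- demo_sentiment %>%
    mutate(hour = floor_date(created_at, unit = "hour")) %>%   # Round down to the hour
    group_by(hour) %>%
    summarise(avg_sentiment = mean(value), n = n())            # Average and count per hour
demo_hourly
```

Replacing unit = "hour" with unit = "day" gives the daily scale; the n column helps spot periods with too few observations for the average to be meaningful.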

Resources

Below, a short list of resources (to access third-party data):

Possibly deprecated:
- Facebook: https://cran.r-project.org/web/packages/Rfacebook/index.html
- Instagram: https://cran.r-project.org/web/packages/instaR/index.html

