R / RStudio and the tidyverse

What?

R and RStudio (1/2)

R is a programming language
RStudio is the most widespread IDE associated with R
With Python, R is one of the two top data science languages
Millions of R users worldwide and the number increases rapidly
Examples of use cases:
https://github.com/ThinkR-open/companies-using-r
https://posit.co/about/customer-stories/

Roughly speaking (very personal opinion):

R is usually better for statistics;
Python is preferred by computer scientists and people working on deep learning;
both are amazing for data science, graphs and simple machine learning.

R and RStudio (2/2)

Beyond statistical analyses and data science, you can do lots of things with R: reports, websites, books, applications, and slides (like these ones).
+ doc & ppt (see https://ardata-fr.github.io/officeverse/), but the latter are usually ugly.

R & RStudio can easily be combined with other languages (Python, C/C++, SQL, JavaScript, etc.).

Moreover, the R community is very inclusive and kind.

In short, R is pretty cool. 😎

The learning curve

“The greatest teacher, failure is.” Master Yoda in The Last Jedi.

“Failure is an option. If things are not failing, you are not innovating enough.”

The end of the learning curve

What does “where I want you to be” mean?

Here are 2 examples of some of my students’ dynamic dashboards:

https://tmespe.shinyapps.io/Games_dashboard/
https://jitrayupunrattanapongs.shinyapps.io/dist/

My role/hope is to make you code similar objects.

YOU CAN DO IT!

But you need to start working hard early.

The philosophy of the course

Knowledge is not free: you have to pay attention.

Learning comes from YOU! More than 80% of what you will get from the course will come from your efforts. Passive listening \(\neq\) learning.
Remote teaching is a major barrier! Because attention spans are short, especially in front of pedagogical videos!
Remote teaching is not a major hurdle! Because anyway, progress will only be made by practice (on your computer, outside the video sessions).
I will always be there to help. Google, Stack Overflow & ChatGPT are your best friends. I’m next on the list.
To optimize my feedback, be as precise as possible. (best solution: send me files & code, not screenshots!)

About LLMs

Large Language Models (LLMs) such as GPT & co can be useful.
But the wise student should know the difference between asking for help and outsourcing.
The goal is to learn to code, not to prompt!
If ChatGPT can do a job: you won’t get the job!
Also: ChatGPT makes a lot of errors. You need to (double) check.

Errors, errors, errors

source: https://github.com/allisonhorst/stats-illustrations

=> Debugging!

Learning!

The data science workflow

In this course, we will mostly overlook the Model part.

My only goal: that Excel becomes marginal in your workflow! 😉

How shall we proceed?

Course structure (1/2)

In-class sessions: mix between slides & tutorials
Exercises: practice is the most important
A personal project, in two steps:

A short presentation of the project + dataset search & formatting ( ~10-15h work), due 2024-02-18;
The full report, code & deployment ( ~20-60h work), due 2024-04-07;
Please do not ask for extensions (the deadlines are comfortable).

Questions & Interactions!

They are crucial: don’t be shy, there is no such thing as a bad question.
There are only bad teachers :)
Seriously, don’t be shy. 😊
Often, seemingly ridiculous questions are not ridiculous at all.
R is new to you, it can be overwhelming (and I know it).

Course structure (2/2)

Introduction to R/RStudio & the tidyverse
Baseline R & data structures + import/export
Plots + options
Shiny 1 - User interface & Server
Shiny 2 - UI layout (organization: tabs, rows, columns, boxes, menus, etc.)
Shiny 3 - Deployment + further options (CSS, themes & advanced tricks)
Geocomputing / Text Mining / APIs
Leveraging ChatGPT / Advanced modelling / + options \(\rightarrow\) possibly to be defined together

\(\rightarrow\) Don’t hesitate to submit ideas or wishes!

What RStudio looks like

The greatness of notebooks

About packages

One of the great strengths of R is a (very) large collection of packages.

Packages are libraries that expand the capabilities of R.

The most important one is in fact a collection of packages called the tidyverse.

It includes:

ggplot2: the best plotting engine in the world (seriously)
dplyr: for data manipulation
tidyr: helps you work with tidy data (more on that soon)
readr and readxl: import rectangular data files (CSV, Excel, etc.)
and more!

Install packages

There are two steps: first you need to download the package (only once, like pip in Python). The packages are downloaded from servers all around the world. So we start by choosing one.

# chooseCRANmirror() to set downloading location,
chooseCRANmirror(ind = 1)    #  see getCRANmirrors() for geographical details

install.packages("tidyverse") # Install (download) package: only once

NOTE (reminder): it’s easy to install packages directly in RStudio (“Tools” menu, then “Install Packages…”). The .libPaths() function shows where packages are installed on your computer.

Then, if you want to work with it, you need to load/activate it (for every session).

library(tidyverse)            # Loading, like "import..." in Python
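A possible way to avoid re-downloading packages at every run is to check first whether they are already installed. The sketch below is only one option among others; missing_packages() is a hypothetical helper written for this example, not a function from R or the tidyverse.

```r
# Hypothetical helper: returns the names in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!is_installed]
}

missing_packages(c("stats", "utils"))   # both ship with R, so nothing is missing

# Typical use at the top of a script (downloads only what is missing):
# to_install <- missing_packages(c("tidyverse", "gapminder"))
# if (length(to_install) > 0) install.packages(to_install)
```

The install lines are left as comments so that the example does not download anything by itself.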

“Unofficial” packages

Sometimes, packages are not available on CRAN (the Comprehensive R Archive Network).
They are simply hosted on GitHub.
To install them, use the devtools package:

devtools::install_github("xvrdm/ggrough")

But you have to install devtools first!

Comments and output

One hashtag # precedes a comment in R code.
Two hashtags ## precede output from a code sequence in the slides or notebooks.

1+1 # Test!

## [1] 2

In these slides, code will appear in grey areas (rectangles).

The assignment operator <-

In R, we don’t use “=” to create variables, but an arrow sign “<-”.
Though “=” works, too.

a <- 6 # This creates a variable but does not show it!
a      # If you want to see it, ask for it!

## [1] 6

b <- 11:42  # This creates a variable but does not show it!
b          # If you want to see it, ask for it!

##  [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## [26] 36 37 38 39 40 41 42

The brackets (ex: [26]) indicate the position of the first element in the row.
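To connect the two ideas above (assignment and positions), here is a small sketch in base R. It reuses the variables a and b from the slides and reads individual elements by position with square brackets.

```r
a = 6        # "=" also assigns, but "<-" is the usual R convention
b <- 11:42   # all integers from 11 to 42

length(b)    # 32 elements in total
b[1]         # first element: 11
b[26]        # 26th element: 36, the value printed next to [26] in the output
```

Note that positions start at 1 in R (not at 0 as in Python).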

Tidy data via the package tidyr

Many ways to organize data in rectangles, but:

source: Allison Horst & Julia Stewart Lowndes https://www.openscapes.org/blog/2020/10/12/tidy-data/

Instances vs variables

The diamonds dataset is included in the tidyverse (it ships with ggplot2). The head() function shows the first lines of a dataset. The tail() function shows the last lines.

head(diamonds, 4) # The number gives the number of rows shown

carat	cut	color	clarity	depth	table	price	x	y	z
0.23	Ideal	E	SI2	61.5	55	326	3.95	3.98	2.43
0.21	Premium	E	SI1	59.8	61	326	3.89	3.84	2.31
0.23	Good	E	VS1	56.9	65	327	4.05	4.07	2.31
0.29	Premium	I	VS2	62.4	58	334	4.20	4.23	2.63

One instance = one observation = one row.
One variable = one unique characteristic = one column.
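As a quick illustration of rows versus columns, the sketch below uses mtcars, a dataset that ships with base R (chosen here only because it requires no package; it is not used elsewhere in these slides).

```r
first_rows <- head(mtcars, 4)   # first 4 instances (rows)
last_rows  <- tail(mtcars, 2)   # last 2 instances (rows)

nrow(mtcars)    # number of instances: 32
ncol(mtcars)    # number of variables: 11
names(mtcars)   # the variable (column) names
```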

Variable types

number (numerical): integer (int) or double (dbl)
character: text (chr)
factor: categorical, ordered (ord) or not (fct)
boolean (called logical in R): TRUE or FALSE / T or F (lgl)
date: day precision (date) or second precision (dttm): stored internally as a count from 1970-01-01

NOTE: we are only concerned with rectangular / structured datasets.
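The types listed above can be inspected with class() and typeof(). The values below are made up for illustration; only base R is used.

```r
n_int  <- 3L                                   # integer (note the L suffix)
n_dbl  <- 3.14                                 # double
txt    <- "hello"                              # character
cat_f  <- factor(c("low", "high", "low"))      # unordered factor
cat_o  <- factor(c("low", "high", "low"),
                 levels = c("low", "high"),
                 ordered = TRUE)               # ordered factor
flag   <- TRUE                                 # logical (boolean)
day    <- as.Date("2024-02-18")                # date, day precision
moment <- as.POSIXct("2024-02-18 09:30:00", tz = "UTC")  # date-time, second precision

class(n_int)     # "integer"
typeof(n_dbl)    # "double"
class(txt)       # "character"
class(cat_o)     # "ordered" "factor"
class(flag)      # "logical"
class(day)       # "Date"

as.numeric(as.Date("1970-01-02"))   # 1: dates are counted in days from 1970-01-01
```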

Tidiness structures data & thought

source: Allison Horst & Julia Stewart Lowndes

Tidy data benefits from hundreds of tools

source: Allison Horst & Julia Stewart Lowndes

Tidy data: example via gapminder

install.packages("gapminder") # Install (download) package: only once

Tidy data satisfies the (row = instance) & (column = variable) structure.

library(gapminder)            # Activate: each time you launch RStudio
head(gapminder, 3)            # Have a look!

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007

Tidy data: counter-example

The table below shows the evolution of population of countries.

pivot_wider(gapminder[c(1:4,13:16,25:28), c(1,3,5)],  # Don't look at this code!
            names_from = "country", values_from = "pop")

year	Afghanistan	Albania	Algeria
1952	8425333	1282697	9279525
1957	9240934	1476505	10270856
1962	10267083	1728137	11000948
1967	11537966	1984060	12760499

\(\rightarrow\) Not tidy! The columns are not VARIABLES!
(This is typically the Excel format.)

Tidy tools (illustration)

Tidy tools

The tidyverse has two functions that switch from matrix/excel format to tidy data and back:

pivot_longer(): from matrix/excel to tidy data (wide_to_long/melt in pandas)
pivot_wider(): from tidy data to matrix/excel (pivot in pandas)

Below, one example of data (called “not_tidy_pop”) in ‘Excel’ format.

Year	France	Germany	UK
1970	52	61	56
1990	59	80	57
2010	65	82	63

BE VERY CAREFUL: R is case-sensitive (upper vs lower case matters)!
Continent \(\neq\) continent.
When referring to a variable (column names), a mistake will lead to an error.
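The slides do not show how not_tidy_pop was created. One possible way, assuming the values of the table above are typed by hand, is sketched below; it also shows that column names are case-sensitive.

```r
# Assumption: the table is built manually; in practice it would be imported from a file
not_tidy_pop <- data.frame(
  Year    = c(1970, 1990, 2010),
  France  = c(52, 59, 65),
  Germany = c(61, 80, 82),
  UK      = c(56, 57, 63)
)

names(not_tidy_pop)              # "Year" "France" "Germany" "UK"
"Year" %in% names(not_tidy_pop)  # TRUE
"year" %in% names(not_tidy_pop)  # FALSE: lower-case "year" is a different name
```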

Tidy tool: pivot_longer()! From wide to long.

pivot_longer() stacks the columns that belong to the same variable into a single column (the older gather() function did the same job).

tidy_pop <- pivot_longer(not_tidy_pop, cols = -Year, names_to = "Country", values_to = "Population")

tidy_pop[1:7,]  # First 7 lines (only) shown

Year	Country	Population
1970	France	52
1970	Germany	61
1970	UK	56
1990	France	59
1990	Germany	80
1990	UK	57
2010	France	65

The syntax is the following:

pivot_longer(data,
    cols = columns_to_tidy,
    names_to = name_of_the_new_variable,
    values_to = name_of_the_column_with_values
)

names_to = "Country" because the columns are all countries.

values_to = "Population" because the data pertains to population values.

We use -Year because the Year variable is excluded from the pivoting.

Visually…

Source: software carpentry

Tidy tools: pivot_wider()! From long to wide.

The reverse operation (no need for the “cols” argument this time).

pivot_wider(tidy_pop, names_from = "Country", values_from = "Population")

Year	France	Germany	UK
1970	52	61	56
1990	59	80	57
2010	65	82	63

pivot_wider(data,
    names_from = variable_to_be_put_in_columns,
    values_from = where_to_get_values
)
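To check that the two functions really are the reverse of each other, here is a self-contained round trip. It assumes the tidyr package is installed and rebuilds the small population table by hand (how the table was originally created is not shown in the slides).

```r
library(tidyr)

# Assumption: the wide table is typed manually from the values shown earlier
not_tidy_pop <- data.frame(
  Year    = c(1970, 1990, 2010),
  France  = c(52, 59, 65),
  Germany = c(61, 80, 82),
  UK      = c(56, 57, 63)
)

# Wide -> long: one row per (Year, Country) pair
tidy_pop <- pivot_longer(not_tidy_pop, cols = -Year,
                         names_to = "Country", values_to = "Population")

# Long -> wide: back to one column per country
wide_again <- pivot_wider(tidy_pop, names_from = "Country", values_from = "Population")

dim(tidy_pop)     # 9 rows (3 years x 3 countries) and 3 columns
dim(wide_again)   # 3 rows and 4 columns, like the starting table
```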

Data manipulation via the package dplyr

filter() rows - Part I

Often, analyses are performed on subsets of data (query in pandas).

filter(gapminder, lifeExp > 81.5) # Countries where people live long lives on average

country	continent	year	lifeExp	pop	gdpPercap
Hong Kong, China	Asia	2007	82.208	6980412	39724.98
Iceland	Europe	2007	81.757	301931	36180.79
Japan	Asia	2002	82.000	127065841	28604.59
Japan	Asia	2007	82.603	127467972	31656.07
Switzerland	Europe	2007	81.701	7554661	37506.42

filter() rows - Part II

Filters can be combined (with commas preferably, the & operator works, too).

filter(gapminder, country == "Japan", year > 2000)

country	continent	year	lifeExp	pop	gdpPercap
Japan	Asia	2002	82.000	127065841	28604.59
Japan	Asia	2007	82.603	127467972	31656.07

Only two observations for Japan post-2000.
NOTE: as in most languages, there are TWO EQUAL SIGNS (==) for the comparison.
One “=” is like the arrow (<-) and is used to assign values.
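The difference between the comma and the & operator, and between == and =, can be tried on a tiny table. The sketch below assumes dplyr is installed; the data frame long_lived is typed by hand from a few values shown in the gapminder output above.

```r
library(dplyr)

# Assumption: a small hand-typed extract, so that the example runs without gapminder
long_lived <- data.frame(
  country = c("Hong Kong, China", "Iceland", "Japan", "Japan", "Switzerland"),
  year    = c(2007, 2007, 2002, 2007, 2007),
  lifeExp = c(82.208, 81.757, 82.000, 82.603, 81.701)
)

with_comma <- filter(long_lived, country == "Japan", year > 2000)  # conditions separated by a comma
with_and   <- filter(long_lived, country == "Japan" & year > 2000) # same conditions joined by &

nrow(with_comma)   # 2 rows
nrow(with_and)     # 2 rows as well

x = 5       # one "=": assignment
x == 5      # two "=": comparison, gives TRUE
```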

select() columns

Sometimes, you might want to keep just a few variables to ease readability.

select(gapminder[1:5,], country, year, pop)

country	year	pop
Afghanistan	1952	8425333
Afghanistan	1957	9240934
Afghanistan	1962	10267083
Afghanistan	1967	11537966
Afghanistan	1972	13079460

Use select(data, -variable) to remove variable: the minus sign!
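A short sketch of both ways of using select(), keeping columns by name or dropping them with the minus sign. It assumes dplyr is installed; the table small is typed by hand from the first two gapminder rows shown earlier.

```r
library(dplyr)

# Assumption: hand-typed extract of two rows, for illustration only
small <- data.frame(
  country = c("Afghanistan", "Afghanistan"),
  year    = c(1952, 1957),
  lifeExp = c(28.801, 30.332),
  pop     = c(8425333, 9240934)
)

kept    <- select(small, country, year)   # keep only these two columns
dropped <- select(small, -lifeExp)        # keep everything except lifeExp

names(kept)      # "country" "year"
names(dropped)   # "country" "year" "pop"
```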

Sort via arrange()

This is when you want to order your data (sort_values in pandas). Here, from smallest pop to largest.

head(arrange(gapminder, pop)) # Alternative: arrange(gapminder, desc(lifeExp)); desc() is for descending

country	continent	year	lifeExp	pop	gdpPercap
Sao Tome and Principe	Africa	1952	46.471	60011	879.5836
Sao Tome and Principe	Africa	1957	48.945	61325	860.7369
Djibouti	Africa	1952	34.812	63149	2669.5295
Sao Tome and Principe	Africa	1962	51.893	65345	1071.5511
Sao Tome and Principe	Africa	1967	54.425	70787	1384.8406
Djibouti	Africa	1957	37.328	71851	2864.9691
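Ascending and descending orders can be compared on a three-row table. The sketch assumes dplyr is installed; the values are typed by hand from the 2007 rows shown in the filter() example.

```r
library(dplyr)

# Assumption: hand-typed extract (year 2007), for illustration only
small <- data.frame(
  country = c("Japan", "Iceland", "Switzerland"),
  lifeExp = c(82.603, 81.757, 81.701),
  pop     = c(127467972, 301931, 7554661)
)

by_pop      <- arrange(small, pop)             # smallest population first
by_life_exp <- arrange(small, desc(lifeExp))   # highest life expectancy first

by_pop$country        # "Iceland" "Switzerland" "Japan"
by_life_exp$country   # "Japan" "Iceland" "Switzerland"
```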

Create new columns via mutate()

With population and gdpPercap you can infer total GDP!

head(mutate(gapminder, gdp = pop * gdpPercap))

country	continent	year	lifeExp	pop	gdpPercap	gdp
Afghanistan	Asia	1952	28.801	8425333	779.4453	6567086330
Afghanistan	Asia	1957	30.332	9240934	820.8530	7585448670
Afghanistan	Asia	1962	31.997	10267083	853.1007	8758855797
Afghanistan	Asia	1967	34.020	11537966	836.1971	9648014150
Afghanistan	Asia	1972	36.088	13079460	739.9811	9678553274
Afghanistan	Asia	1977	38.438	14880372	786.1134	11697659231

Piping: %>%

(or |>)

Definition: sequences of operations

Very often, one simple analysis will require several steps. They can be combined via the %>% or |> operators.

A fake sequence:

me %>%
    wake_up(time = "06:20") %>%
    shower(temp = 40) %>%
    go_to(place = "baker", with = "scooter") %>%
    buy(item = "bread") %>%
    go_to(place = "home", with = "scooter") %>%
    breakfast(drink = "hot_chocolate", eat = c("toast", "kiwi")) %>%
    toothbrush(duration = 2)
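The sequence above is fictional, but the mechanism is real: the pipe passes the result on its left as the first argument of the function on its right. A minimal runnable sketch with base R functions (the native |> pipe requires R 4.1 or later):

```r
values <- c(1, 4, 9)

nested <- sum(sqrt(values))             # read from the inside out
piped  <- values |> sqrt() |> sum()     # read from left to right

nested          # 6 (1 + 2 + 3)
piped           # 6 as well
nested == piped # TRUE
```

Both lines compute exactly the same thing; the piped version simply lists the steps in the order in which they happen.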

Example (short)

select(filter(diamonds, carat > 4), carat, price, clarity) # Yuck!

diamonds |>
    filter(carat > 4) |>
    select(carat, price, clarity)  # So simple!

carat	price	clarity
4.01	15223	I1
4.01	15223	I1
4.13	17329	I1
5.01	18018	I1
4.50	18531	I1

Example (long)

diamonds %>% 
    filter(carat > 2, cut == "Ideal") %>%      # First we filter
    mutate(car_price_ratio = carat/price) %>%  # Then, we create a new column
    arrange(desc(car_price_ratio)) %>%         # We order the data
    select(-x, -y, -z) %>%                     # We take out some superfluous columns
    head(4)                                    # Finally, we ask for the top 4 instances

carat	cut	color	clarity	depth	table	price	car_price_ratio
3.50	Ideal	H	I1	62.8	57	12587	0.0002781
3.22	Ideal	I	I1	62.6	55	12545	0.0002567
2.16	Ideal	H	I1	62.2	56	8709	0.0002480
2.25	Ideal	E	I1	61.4	54	9072	0.0002480

Pivot tables

Definition

“A pivot table is a table of statistics that summarizes the data of a more extensive table.”

— Wikipedia

There are two dimensions in a pivot table:
- which variable(s) we want to analyze (gender, continent/country, size, etc.);
- which statistic we want to compute (mean, min, max, number of instances, variance etc.).

In R, these two dimensions are handled by two separate functions: group_by() and summarise().

Example I

diamonds |>
    group_by(clarity, cut) |>               # Define the variables
    summarise(avg_price = mean(price),      # Define the statistics
              max_price = max(price),
              avg_carat = mean(carat),
              max_carat = max(carat)) |>   
    head(3)

clarity	cut	avg_price	max_price	avg_carat	max_carat
I1	Fair	3703.533	18531	1.361000	5.01
I1	Good	3596.635	11548	1.203021	3.00
I1	Very Good	4078.226	15984	1.281905	4.00

Example II

You can even pipe inside a function!

gapminder %>%
    group_by(continent, year) %>%
    summarise(avg_lifeExp = mean(lifeExp) %>% round(2)) %>%
    head(4)

continent	year	avg_lifeExp
Africa	1952	39.14
Africa	1957	41.27
Africa	1962	43.32
Africa	1967	45.33

The round() function rounds numbers to a given number of decimals.
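The "number of instances" statistic mentioned in the definition can be obtained with n(). Below is a minimal sketch on invented data (the sales table and its column names are hypothetical), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical data, invented for this example
sales <- data.frame(
  region = c("North", "North", "South", "South", "South"),
  amount = c(10, 20, 5, 15, 25)
)

pivot <- sales |>
    group_by(region) |>                    # Define the variable
    summarise(n_rows     = n(),            # Number of instances per group
              avg_amount = mean(amount),   # Average per group
              max_amount = max(amount))    # Maximum per group

pivot   # North: 2 rows, average 15, max 20; South: 3 rows, average 15, max 25
```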

Bonus: tutors!

visualize tidyverse code: https://tidydatatutor.com
equivalent for pandas: https://pandastutor.com

Takeaway

Summary

R is an incredibly powerful tool for data science. The preferred environment is the tidyverse. As its name indicates, the core concept is TIDY DATA!

In short, it’s just a question of functions:
- pivot_longer() and pivot_wider() to work with tidy data;
- filter(), select(), arrange() and mutate() for wrangling/manipulation;
- group_by() and summarise() (or summarize(), both spellings work) for pivot tables;
- head() and tail() to see the first and last lines of a dataset.

That’s it!
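To see most of these functions working together, here is one possible end-to-end sketch on the small population table used earlier. It assumes dplyr and tidyr are installed; the table is typed by hand and the column names is_large, avg_pop and n_large are invented for the example.

```r
library(dplyr)
library(tidyr)

# Assumption: the wide table is typed manually from the values shown earlier
not_tidy_pop <- data.frame(
  Year    = c(1970, 1990, 2010),
  France  = c(52, 59, 65),
  Germany = c(61, 80, 82),
  UK      = c(56, 57, 63)
)

summary_tbl <- not_tidy_pop |>
    pivot_longer(cols = -Year,
                 names_to = "Country",
                 values_to = "Population") |>    # Make the data tidy
    filter(Year >= 1990) |>                      # Keep recent years only
    mutate(is_large = Population > 60) |>        # New column: TRUE if above 60
    group_by(Country) |>                         # One group per country
    summarise(avg_pop = mean(Population),        # Average over the kept years
              n_large = sum(is_large)) |>        # How many years above 60
    arrange(desc(avg_pop))                       # Largest average first

summary_tbl   # Germany (81), then France (62), then UK (60)
```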

Resources / Links

an online series of exercises with solutions!:
https://gcoqueret.shinyapps.io/Exercises/
the Bible of Data Science with R:
http://r4ds.hadley.nz/
Two great books on data science:
https://ubc-dsci.github.io/introduction-to-datascience/
https://rafalab.github.io/dsbook/
Shiny use cases:
https://shiny.posit.co/r/gallery/
One other app (marketing):
https://mdancho84.shinyapps.io/shiny-app/

What are your questions?

A ‘bad’ question is better than no question (there are no bad questions).

Programming AND learning to code is HARD:
Ask questions so I can make it (slightly) easier.
I can’t guess your questions, don’t be shy!
Nothing will replace practice & making mistakes.

Tip: for your project, choose a dataset with a mixture of numerical and categorical data. It is the best combination to create a nice looking dashboard!