What?

R and RStudio (1/2)

Roughly speaking (very personal opinion):

  • R is usually better for statistics;
  • Python is preferred by computer scientists and people working on deep learning;
  • both are amazing for data science, graphs and simple machine learning.

R and RStudio (2/2)

Beyond statistical analyses and data science, you can do lots of things with R: reports, websites, books, applications, and slides (like these ones).
You can also produce Word and PowerPoint files (see https://ardata-fr.github.io/officeverse/), but these are usually ugly.

R & RStudio can easily be combined with other languages (Python, C/C++, SQL, JavaScript, etc.).

Moreover, the R community is very inclusive and kind.

In short, R is pretty cool. 😎

The learning curve

“The greatest teacher, failure is.” Master Yoda in The Last Jedi.

“Failure is an option. If things are not failing, you are not innovating enough.”

The end of the learning curve

The philosophy of the course

Knowledge is not free: you have to pay attention.

  • Learning comes from YOU! More than 80% of what you will get from the course will come from your efforts. Passive listening \(\neq\) learning.
  • Remote teaching is a major barrier! Because attention spans are short, especially in front of pedagogical videos!
  • Remote teaching is not a major hurdle! Because anyway, progress will only be made by practice (on your computer, outside the video sessions).
  • I will always be there to help. Google, Stack Overflow & ChatGPT are your best friends. I’m next on the list.
  • To optimize my feedback, be as precise as possible. (best solution: send me files & code, not screenshots!)

About LLMs

  • Large Language Models (LLMs) such as GPT & co can be useful.
  • But the wise student should know the difference between asking for help and outsourcing.
  • The goal is to learn to code, not to prompt!
  • If ChatGPT can do a job, you won’t get the job!
  • Also: ChatGPT makes a lot of errors. You need to (double) check.

Errors, errors, errors

=> Debugging!

Learning!

The data science workflow

In this course, we will mostly overlook the Model part.

My only goal: that Excel becomes marginal in your workflow! 😉

How shall we proceed?

Course structure (1/2)

  1. In-class sessions: mix between slides & tutorials
  2. Exercises: practice is the most important
  3. A personal project, in two steps:
     • A short presentation of the project + dataset search & formatting (~10-15h of work), due 2024-02-18;
     • The full report, code & deployment (~20-60h of work), due 2024-04-07;
     • Please do not ask for extensions (the deadlines are comfortable).

Questions & Interactions!

They are crucial: don’t be shy, there is no such thing as a bad question.
There are only bad teachers :)
Seriously, don’t be shy. 😊
Often, seemingly ridiculous questions are not ridiculous at all.
R is new to you, it can be overwhelming (and I know it).

Course structure (2/2)

  1. Introduction to R/RStudio & the tidyverse
  2. Baseline R & data structures + import/export
  3. Plots + options
  4. Shiny 1 - User interface & Server
  5. Shiny 2 - UI layout (organization: tabs, rows, columns, boxes, menus, etc.)
  6. Shiny 3 - Deployment + further options (CSS, themes & advanced tricks)
  7. Geocomputing / Text Mining / APIs
  8. Leveraging ChatGPT / Advanced modelling / + options \(\rightarrow\) possibly to be defined together

\(\rightarrow\) Don’t hesitate to submit ideas or wishes!

What RStudio looks like

The greatness of notebooks

About packages

One of the great strengths of R is a (very) large collection of packages.

Packages are libraries that expand the capabilities of R.

The most important one is in fact a collection of packages called the tidyverse.

It includes:

  • ggplot2: the best plotting engine in the world (seriously)
  • dplyr: for data manipulation
  • tidyr: helps you work with tidy data (more on that soon)
  • readr and readxl: import rectangular data files (excel, etc.)
  • and more!

Install packages

There are two steps: first you need to download the package (only once, like pip in Python). The packages are downloaded from servers all around the world. So we start by choosing one.

# chooseCRANmirror() to set downloading location,
chooseCRANmirror(ind = 1)    #  see getCRANmirrors() for geographical details
install.packages("tidyverse") # Install (download) package: only once

NOTE (reminder): it’s easy to install packages directly in RStudio (“Tools” menu, “Install Packages…”). The .libPaths() function shows where packages are installed.

Then, if you want to work with it, you need to load/activate it (for every session).

library(tidyverse)            # Loading, like "import..." in Python

“Unofficial” packages

Sometimes, packages are not published on CRAN (the Comprehensive R Archive Network).
They are simply hosted on GitHub.
To install them, use the devtools package:

devtools::install_github("xvrdm/ggrough")

But you have to install devtools first!
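
A minimal sketch of a "check before you install" pattern, assuming you want to avoid downloading packages that are already on your machine. The helper name missing_packages is hypothetical (it is not part of R or devtools); requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper: returns the names of the packages that are NOT installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("devtools", "tidyverse"))
to_install   # Empty if both packages are already installed

# Uncomment the next line to download whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The install line is left commented out on purpose, so that running the sketch never triggers a download by accident.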

Comments and output

  • One hashtag # precedes a comment in R code.
  • Two hashtags ## precede output from a code sequence in the slides or notebooks.
1+1 # Test!
## [1] 2

In these slides, code will appear in grey areas (rectangles).

The assignment operator <-

In R, we don’t use “=” to create variables, but an arrow sign “<-”.
Though “=” works, too.

a <- 6 # This creates a variable but does not show it!
a      # If you want to see it, ask for it!
## [1] 6
b <- 11:42  # This creates a variable but does not show it!
b          # If you want to see it, ask for it!
##  [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## [26] 36 37 38 39 40 41 42

The number in brackets (e.g. [26]) gives the position of the first element printed on that line.
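
A short sketch to try in the console, using only base R. The variable names x, y and z are arbitrary illustrations.

```r
x <- 5            # Preferred style: the arrow
y = 5             # Also valid
identical(x, y)   # TRUE: both lines created the same value

(z <- x + y)      # Wrapping the assignment in parentheses assigns AND prints
length(11:42)     # 32 elements, which is why a second printed line is needed
```

The parentheses trick is just a shortcut for typing the variable name on the next line.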

Tidy data via the package tidyr

Many ways to organize data in rectangles, but:

Instances vs variables

The diamonds dataset is included in the tidyverse (it ships with ggplot2). The head() function shows the first lines of a dataset. The tail() function shows the last lines.

head(diamonds, 4) # The number gives the number of rows shown
carat cut color clarity depth table price x y z
0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63

One instance = one observation = one row.
One variable = one unique characteristic = one column.

Variable types

  • number (numerical): integer (int) or double (dbl)
  • character: text (chr)
  • factor: categorical, ordered (ord) or not (fct)
  • logical (boolean): TRUE or FALSE / T or F (lgl)
  • date: day precision (date) or second precision (dttm); both are stored as a count (of days or seconds) from 1970-01-01

NOTE: we are only concerned with rectangular / structured datasets.
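
A quick way to check the type of a value is the base R function class(). The sketch below uses made-up values only for illustration.

```r
class(3L)                            # "integer"
class(3.14)                          # "numeric" (stored as a double)
class("hello")                       # "character"
class(factor(c("a", "b")))           # "factor"
class(factor("a", ordered = TRUE))   # "ordered" "factor"
class(TRUE)                          # "logical"
class(as.Date("2024-02-18"))         # "Date"
as.numeric(as.Date("1970-01-02"))    # 1: dates count days since 1970-01-01
```

When a dataset is printed as a tibble, these types appear as abbreviations under the column names (int, dbl, chr, fct, ord, lgl, date, dttm).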

Tidiness structures data & thought

source: Allison Horst & Julia Stewart Lowndes

Tidy data benefits from hundreds of tools

source: Allison Horst & Julia Stewart Lowndes

Tidy data: example via gapminder

install.packages("gapminder") # Install (download) package: only once

Tidy data satisfies the (row = instance) & (column = variable) structure.

library(gapminder)            # Activate: each time you launch RStudio
head(gapminder, 3)            # Have a look!
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007

Tidy data: counter-example

The table below shows the evolution of population of countries.

pivot_wider(gapminder[c(1:4,13:16,25:28), c(1,3,5)],  # Don't look at this code!
            names_from = "country", values_from = "pop")
year Afghanistan Albania Algeria
1952 8425333 1282697 9279525
1957 9240934 1476505 10270856
1962 10267083 1728137 11000948
1967 11537966 1984060 12760499

\(\rightarrow\) Not tidy! The columns are not VARIABLES!
(This is typically the Excel format.)

Tidy tools (illustration)

Tidy tools

The tidyverse has two functions that switch from matrix/excel format to tidy data and back:

  • pivot_longer(): from matrix/excel to tidy data (wide_to_long/melt in pandas)
  • pivot_wider(): from tidy data to matrix/excel (pivot in pandas)

Below, one example of data (called “not_tidy_pop”) in ‘Excel’ format.
Year France Germany UK
1970 52 61 56
1990 59 80 57
2010 65 82 63

BE VERY CAREFUL: letter case matters in R (the language is case-sensitive)!
Continent \(\neq\) continent.
When referring to a variable (a column name), a mistake in the case will lead to an error.
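
The slides do not show how not_tidy_pop is created. A minimal sketch, assuming a plain base R data frame filled with the values of the table above:

```r
not_tidy_pop <- data.frame(
  Year    = c(1970, 1990, 2010),
  France  = c(52, 59, 65),
  Germany = c(61, 80, 82),
  UK      = c(56, 57, 63)
)

not_tidy_pop$France               # Works: 52 59 65
is.null(not_tidy_pop$france)      # TRUE: lowercase "france" is not a column
"year" %in% names(not_tidy_pop)   # FALSE: the column is called "Year"
```

Note that the base `$` operator quietly returns NULL for a misspelled column, whereas tidyverse functions usually stop with an error, which is easier to spot.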

Tidy tool: pivot_longer()! From wide to long.

pivot_longer() (the successor of gather()) stacks columns which belong to the same variable.

tidy_pop <- pivot_longer(not_tidy_pop, cols = -Year, names_to = "Country", values_to = "Population")
tidy_pop[1:7,]  # First 7 lines (only) shown
Year Country Population
1970 France 52
1970 Germany 61
1970 UK 56
1990 France 59
1990 Germany 80
1990 UK 57
2010 France 65

The syntax is the following:

pivot_longer(data,
    cols = columns_to_tidy,
    names_to = name_of_the_new_variable,
    values_to = name_of_the_column_with_values
)

names_to = "Country" because the columns are all countries.

values_to = "Population" because the data pertains to population values.

We use cols = -Year because the Year variable is excluded from the pivoting.

Visually…

Source: Software Carpentry

Tidy tools: pivot_wider()! From long to wide.

The reverse operation (no need for the “cols” argument this time).

pivot_wider(tidy_pop, names_from = "Country", values_from = "Population")
Year France Germany UK
1970 52 61 56
1990 59 80 57
2010 65 82 63

pivot_wider(data,
    names_from = variable_to_be_put_in_columns,
    values_from = where_to_get_values
)
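
To check that the two functions really are the reverse of each other, here is a small round-trip sketch. It assumes the tidyr package is installed; the data frame wide and its values are a made-up excerpt used only for illustration.

```r
library(tidyr)

wide <- data.frame(Year    = c(1970, 1990),
                   France  = c(52, 59),
                   Germany = c(61, 80))

# Wide to long: one row per (Year, Country) pair
long <- pivot_longer(wide, cols = -Year,
                     names_to = "Country", values_to = "Population")
nrow(long)   # 4 rows: 2 years x 2 countries

# Long back to wide: the countries become columns again
back <- pivot_wider(long, names_from = "Country", values_from = "Population")
all.equal(as.data.frame(back), wide)   # Compare with the starting table
```

The as.data.frame() call is only there because the pivot functions return a tibble, while the starting table is a plain data frame.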

Data manipulation via the package dplyr

filter() rows - Part I

Often, analyses are performed on subsets of data (query() in pandas).

filter(gapminder, lifeExp > 81.5) # Countries where people live long lives on average
country           continent  year  lifeExp  pop        gdpPercap
Hong Kong, China  Asia       2007  82.208   6980412    39724.98
Iceland           Europe     2007  81.757   301931     36180.79
Japan             Asia       2002  82.000   127065841  28604.59
Japan             Asia       2007  82.603   127467972  31656.07
Switzerland       Europe     2007  81.701   7554661    37506.42

filter() rows - Part II

Filters can be combined (with commas preferably, the & operator works, too).

filter(gapminder, country == "Japan", year > 2000) 
country continent year lifeExp pop gdpPercap
Japan Asia 2002 82.000 127065841 28604.59
Japan Asia 2007 82.603 127467972 31656.07


Only two observations for Japan post-2000.
NOTE: as in most languages, the comparison uses TWO EQUAL SIGNS (==).
One “=” is like the arrow (<-) and is used to assign values.
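
A small sketch comparing the two ways of combining conditions. It assumes the dplyr package is installed; the data frame toy is made up for illustration.

```r
library(dplyr)

toy <- data.frame(country = c("Japan", "Japan", "Iceland"),
                  year    = c(1997, 2007, 2007))

with_comma <- filter(toy, country == "Japan", year > 2000)
with_and   <- filter(toy, country == "Japan" & year > 2000)

identical(with_comma, with_and)   # TRUE: both spellings keep the same rows
nrow(with_comma)                  # 1: only Japan in 2007 passes both conditions
```

If you type a single "=" inside filter() by mistake, dplyr stops with an error instead of filtering, so the typo is usually easy to catch.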

select() columns

Sometimes, you might want to keep just a few variables to ease readability.

select(gapminder[1:5,], country, year, pop)
country year pop
Afghanistan 1952 8425333
Afghanistan 1957 9240934
Afghanistan 1962 10267083
Afghanistan 1967 11537966
Afghanistan 1972 13079460

Use select(data, -variable) to remove variable: the minus sign!

Sort via arrange()

This is when you want to order your data (sort_values in pandas). Here, from smallest pop to largest.

head(arrange(gapminder, pop)) # Alternative: arrange(gapminder, desc(lifeExp)); desc() is for descending
country                continent  year  lifeExp  pop    gdpPercap
Sao Tome and Principe  Africa     1952  46.471   60011  879.5836
Sao Tome and Principe  Africa     1957  48.945   61325  860.7369
Djibouti               Africa     1952  34.812   63149  2669.5295
Sao Tome and Principe  Africa     1962  51.893   65345  1071.5511
Sao Tome and Principe  Africa     1967  54.425   70787  1384.8406
Djibouti               Africa     1957  37.328   71851  2864.9691

Create new columns via mutate()

With population and gdpPercap you can infer total GDP!

head(mutate(gapminder, gdp = pop * gdpPercap))
country continent year lifeExp pop gdpPercap gdp
Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330
Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670
Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797
Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150
Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274
Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231

Piping: %>% (or |>)

Definition: sequences of operations

Very often, one simple analysis will require several steps. They can be combined via the %>% or |> operators.

A fake sequence:

me %>%
    wake_up(time = "06:20") %>%
    shower(temp = 40) %>%
    go_to(place = "baker", with = "scooter") %>%
    buy(item = "bread") %>%
    go_to(place = "home", with = "scooter") %>%
    breakfast(drink = "hot_chocolate", eat = c("toast", "kiwi")) %>%
    toothbrush(duration = 2)
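
The fake sequence above cannot be run, so here is a tiny runnable sketch of the same idea with base R functions only. It uses the native pipe |>, which requires R 4.1 or later; the vector x is arbitrary.

```r
x <- c(4, 9, 16)

sqrt(x)        # Classic call: 2 3 4
x |> sqrt()    # Same result: the pipe sends x in as the first argument

# Nested calls are read inside-out...
round(mean(sqrt(x)), 1)

# ...whereas a pipeline is read from top to bottom
x |>
  sqrt() |>
  mean() |>
  round(1)     # 3
```

Both versions compute exactly the same thing; the pipe only changes the order in which you read (and write) the steps.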

Example (short)

select(filter(diamonds, carat > 4), carat, price, clarity) # Yuck!
diamonds |>
    filter(carat > 4) |>
    select(carat, price, clarity)  # So simple!
carat price clarity
4.01 15223 I1
4.01 15223 I1
4.13 17329 I1
5.01 18018 I1
4.50 18531 I1

Example (long)

diamonds %>% 
    filter(carat > 2, cut == "Ideal") %>%      # First we filter
    mutate(car_price_ratio = carat/price) %>%  # Then, we create a new column
    arrange(desc(car_price_ratio)) %>%         # We order the data
    select(-x, -y, -z) %>%                     # We take out some superfluous columns
    head(4)                                    # Finally, we ask for the top 4 instances
carat cut color clarity depth table price car_price_ratio
3.50 Ideal H I1 62.8 57 12587 0.0002781
3.22 Ideal I I1 62.6 55 12545 0.0002567
2.16 Ideal H I1 62.2 56 8709 0.0002480
2.25 Ideal E I1 61.4 54 9072 0.0002480

Pivot tables

Definition

A pivot table is a table of statistics that summarizes the data of a more extensive table.

— Wikipedia

There are two dimensions in a pivot table:
- which variable(s) we want to analyze (gender, continent/country, size, etc.);
- which statistic we want to compute (mean, min, max, number of instances, variance etc.).

In R, these two steps are separated via two functions: group_by() and summarise()
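
Before moving to real data, here is a sketch on a toy table that is small enough to check by hand. It assumes the dplyr package is installed; the data frame sales and its columns are invented for illustration.

```r
library(dplyr)

sales <- data.frame(shop  = c("A", "A", "B", "B", "B"),
                    price = c(10, 20, 5, 10, 15))

pivot <- sales |>
  group_by(shop) |>                     # Which variable we want to analyze
  summarise(avg_price = mean(price),    # Which statistics we want to compute
            n = n())                    # n() counts the rows in each group

pivot$avg_price   # 15 10
pivot$n           # 2 3
```

Shop A has two rows (average 15) and shop B has three (average 10), so the result has one line per group.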

Example I

diamonds |>
    group_by(clarity, cut) |>               # Define the variables
    summarise(avg_price = mean(price),      # Define the statistics
              max_price = max(price),
              avg_carat = mean(carat),
              max_carat = max(carat)) |>   
    head(3)
clarity  cut        avg_price  max_price  avg_carat  max_carat
I1       Fair       3703.533   18531      1.361000   5.01
I1       Good       3596.635   11548      1.203021   3.00
I1       Very Good  4078.226   15984      1.281905   4.00

Example II

You can even pipe inside a function!

gapminder %>%
    group_by(continent, year) %>%
    summarise(avg_lifeExp = mean(lifeExp) %>% round(2)) %>%
    head(4)
continent year avg_lifeExp
Africa 1952 39.14
Africa 1957 41.27
Africa 1962 43.32
Africa 1967 45.33

The round() function rounds numbers to a given number of decimal places (here, 2).
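
A few base R calls to see how round() behaves; the numbers are arbitrary.

```r
round(39.1356, 2)     # 39.14
round(39.1312, 2)     # 39.13: rounds to the NEAREST value, not always upwards
round(39.1356)        # 39: by default, no decimals are kept
round(1234.567, -2)   # 1200: a negative number of digits rounds to hundreds
```

The second argument is the number of decimals to keep, which is what the 2 in the pipe example stands for.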

Bonus: tutors!

Takeaway

Summary

R is an incredibly powerful tool for data science. The preferred environment is the tidyverse. As its name indicates, the core concept is TIDY DATA!

In short, it’s just a question of functions:
- pivot_longer() and pivot_wider() to work with tidy data;
- filter(), select(), arrange() and mutate() for wrangling/manipulation;
- group_by() and summarise() (or summarize(), both spellings work) for pivot tables;
- head() and tail() to see the first and last lines of a dataset.

That’s it!

Resources / Links

What are your questions?


A ‘bad’ question is better than no question (there are no bad questions).

  • Programming AND learning to code is HARD:
  • Ask questions so I can make it (slightly) easier.
  • I can’t guess your questions, don’t be shy!
  • Nothing will replace practice & making mistakes.

Tip: for your project, choose a dataset with a mixture of numerical and categorical data. It is the best combination to create a nice looking dashboard!