Basic principle: STOP whenever you have an error message: it’s useless to continue!
A large majority of commands use an arrow “<-”. The following line

a <- b

means that the software will put value b inside variable a.
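
As a minimal illustration of the arrow (the variable names and values below are made up for this example):

```r
b <- 5        # store the value 5 in b
a <- b        # copy the value of b into a
a             # typing the name prints the value: 5
a <- a + 1    # the right-hand side is evaluated first, then stored in a
a             # now 6
```

Note that b is not affected by the last assignment: it still holds 5.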

1 - BASICS

Packages

You need to install a package only once, but you need to activate it each time you start a new R session. The hash sign (#) is used to append comments to the code.

if (!require("tidyverse")) install.packages('tidyverse') # This line to install, if it has not already been done.
library(tidyverse)                                       # This line to activate. Note: quotes are unnecessary here.

Working directory

R works in one particular folder, called the working directory. You can set it in the Files pane in RStudio, or with the setwd() function. To see the current working directory, type getwd().
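
A short sketch of both functions. The temporary folder returned by tempdir() is used here only as a safe destination for the example; in practice you would give the path of your own project folder.

```r
old_dir <- getwd()      # remember the current working directory
setwd(tempdir())        # move to R's temporary folder (an arbitrary choice for this example)
getwd()                 # check that the working directory has changed
setwd(old_dir)          # go back to where we started
```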

Variables vs functions

Two major items in R: the functions that you are going to use (like in Excel: sum(), min(), etc.) and the variables that you will manipulate. There is a MAJOR difference between the two! In terms of code, there is only one small (but important!) difference: functions work with round brackets () and data variables work with square brackets [].
For a function, for instance the square root function sqrt(), the arguments go inside the round brackets: they are the elements on which the function will work. sqrt(5) will produce the square root of five. A few functions, like getwd(), need no argument, but the round brackets are still required. For a variable, the numbers inside the square brackets relate to indexing (more on that below).
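
To make the difference concrete, here is a small sketch with a made-up vector v (the c() function used to build it is presented in the next section):

```r
v <- c(4, 9, 16)   # a variable holding three numbers
sqrt(v)            # round brackets: apply the function sqrt() to v -> 2 3 4
v[2]               # square brackets: take the second element of v -> 9
sqrt(v[2])         # both together: square root of the second element -> 3
```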

Importing data

This is usually done directly in the user interface, or with packages like openxlsx or readxl (to import Excel files) with the function read.xlsx() or read_excel(). The basic case: test_data <- read.xlsx("MyFile.xlsx") or test_data <- read_excel("MyFile.xlsx").
This stores your data in the test_data variable. It assumes that the corresponding package is activated and that the Excel file "MyFile.xlsx" exists in your working directory.

2 - CREATING DATA

Simple sequences

You can create data from scratch, using the colon operator for instance.

1:10
 [1]  1  2  3  4  5  6  7  8  9 10
3:17
 [1]  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

More generally, the c() function concatenates and encapsulates numbers (or text):

c(2,5,7)
[1] 2 5 7
c(1:6,12:20)
 [1]  1  2  3  4  5  6 12 13 14 15 16 17 18 19 20
c("R", " is ", "awesome")
[1] "R"       " is "    "awesome"

Another way to combine data is to use the row-bind and column-bind functions rbind() and cbind().

rbind(c(2,5,7),c(3,1,8)) 
     [,1] [,2] [,3]
[1,]    2    5    7
[2,]    3    1    8
cbind(c(2,5,7),c(3,1,8)) 
     [,1] [,2]
[1,]    2    3
[2,]    5    1
[3,]    7    8

You can also fill in matrices:

m <- matrix(1:20, nrow = 4) 
m2 <- matrix(1:20, nrow = 4, byrow = TRUE) # Two ways to fill: by column (the default) or by row
m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    5    9   13   17
[2,]    2    6   10   14   18
[3,]    3    7   11   15   19
[4,]    4    8   12   16   20
m2
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
[3,]   11   12   13   14   15
[4,]   16   17   18   19   20

R is great for generating random data.

runif(10) # uniform distribution: 10 samples
 [1] 0.06688124 0.82541667 0.28137797 0.11063758 0.53751033 0.38189383 0.08916188 0.86453049 0.61392825 0.68673409
rnorm(20) # Gaussian distribution (parameters could be specified, see online manual): 20 data points
 [1]  1.58870173  0.86578773  0.13029713 -0.97993171 -1.64267374 -0.40144518  0.43771547 -0.16043997 -1.27166377  1.09629147
[11]  0.51176613  1.41954010 -0.06755030 -0.14813283  0.40070418  0.36982084  0.70984827  0.49217370  0.69688766  0.02829686
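
Random numbers change at every run. If you need the same values each time, you can fix the seed with set.seed() first. The sketch below also shows how the parameters of each distribution are specified; the seed 42 and the parameter values are arbitrary choices for this example.

```r
set.seed(42)                          # fix the random seed (42 is an arbitrary choice)
x <- rnorm(5, mean = 100, sd = 15)    # 5 Gaussian draws with mean 100 and standard deviation 15
set.seed(42)                          # same seed again...
y <- rnorm(5, mean = 100, sd = 15)    # ...gives exactly the same draws
identical(x, y)                       # TRUE
u <- runif(3, min = 10, max = 20)     # 3 uniform draws between 10 and 20
```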

Dataframes

Datasets often mix text and numbers. R can do that too, with data frames. Let’s create one with the data.frame() function. We use the round() function, which rounds numbers to the nearest integer by default.

nb_gender <- 7                                              # Number of people of each gender
Gender <- rep(c("Male"),nb_gender)                          # nb_gender men in total
Weight <- rnorm(nb_gender, mean = 70, sd = 8) %>% round()   # in kilos
Height <- rnorm(nb_gender, mean = 178, sd = 10) %>% round() # in cm
Age <- rnorm(nb_gender, mean = 40, sd = 7)  %>% round()  
data <- data.frame(Gender,Weight,Height,Age)                # data with only men
Gender <- rep(c("Female"),nb_gender)                        # nb_gender women in total
Weight <-  rnorm(nb_gender, 60, sd = 8)  %>% round()        # in kilos
Height <-  rnorm(nb_gender, 167, sd = 10)  %>% round()      # in cm
Age <- rnorm(nb_gender, mean = 40, sd = 7)  %>% round()  
data <- rbind(data, data.frame(Gender,Weight,Height,Age))   # grouping women with men
data

You can use rownames() or colnames() to get or set the names of rows or columns: colnames(data).
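
A minimal sketch of getting and setting column names, on a small made-up data frame (df and its column names are only for illustration):

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))  # small hypothetical data frame
colnames(df)                                     # get the names: "a" "b"
colnames(df) <- c("Id", "Label")                 # set both names at once
colnames(df)[2] <- "Code"                        # rename only the second column
colnames(df)                                     # "Id" "Code"
```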

Dimensions

You can obtain the dimensions of a matrix or data frame with the dim() function: dim(data) returns the number of rows, then the number of columns. Each dimension can be obtained separately with nrow() and ncol(). For vectors, the number of elements can be found with the length() function.

dim(data)  # Be careful with the order: rows first, then columns
[1] 14  4
nrow(data) # Number of rows
[1] 14
ncol(data) # Number of columns
[1] 4
length(3:35) # Number of elements (best used for a vector)
[1] 33

Boolean (TRUE/FALSE) data

In R, it is useful to perform tests. For instance, given the sequence 1:12, we want to know which values are strictly greater than 6. The simple command 1:12>6 will provide the answer: the statement is false for the first six elements (1 to 6) and true for the last six (7 to 12).

1:12>6
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
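
Logical vectors can be counted, located and combined. A short sketch on the same sequence (the names x and big are made up for this example):

```r
x <- 1:12
big <- x > 6              # logical vector: FALSE for 1 to 6, TRUE for 7 to 12
sum(big)                  # TRUE counts as 1, so this is the number of TRUE values: 6
which(big)                # positions of the TRUE values: 7 8 9 10 11 12
x > 6 & x <= 9            # AND: TRUE only for 7, 8 and 9
x < 3 | x > 10            # OR: TRUE for 1, 2, 11 and 12
```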

3 - HANDLING DATA IN PURE R

Extracting data

Accessing the values of a variable can be done with the square brackets [] thanks to indexing. For instance, the value in the third row and second column of data is data[3,2].
When columns have names, it is possible to use them to isolate a particular column with the dollar $ operator:

data$Age
 [1] 42 29 34 37 45 38 45 40 54 46 36 35 44 28

Another way to proceed is to use the column number and leave the row number empty: since Height is the third column of data, data[,3] gives the same result as data$Height. This gives you all of the third column. Likewise, data[3,] will return all of the third row.

data[,3] # Third column
 [1] 165 184 188 177 158 180 174 168 154 169 151 171 173 163
data[3,] # Third row

You can extract data with Boolean vectors! For instance, to select the people who are older than 42, the test

data$Age>42
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE

returns a logical vector that is TRUE for the matching rows. To extract the data, you just need to select these rows and all columns:

data[data$Age>42,]

Only the TRUE rows are kept. As we will see, the filter() function of the tidyverse does just that.
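
The same mechanism, step by step, on a small made-up data frame (people and its values are hypothetical, so that the result is easy to check by eye; the printed text assumes R 4.0 or later, where text columns stay as characters):

```r
people <- data.frame(Name = c("Ann", "Bob", "Cid", "Dee"),   # hypothetical data
                     Age  = c(45, 30, 52, 41))
keep <- people$Age > 42        # logical vector: TRUE FALSE TRUE FALSE
which(keep)                    # the corresponding row numbers: 1 3
people[keep, ]                 # rows where the test is TRUE, all columns
people[keep, "Name"]           # same rows, but only the Name column: "Ann" "Cid"
```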

Writing / Replacing values

Writing on data frames, vectors, or matrices can be done with the arrow operator:

data[3,2] <- 99
data[c(7,9),3] <- 166        # Replace 2 cells at a time! Seventh and ninth row on the third column.
data[c(6,8),3] <- c(199,177) # Same, but with 2 different values. 
data                         # CHECK where the new values are!
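
Indexing with a test also works for writing, which is handy to replace all the values that meet a condition. A minimal sketch on a made-up vector:

```r
scores <- c(12, 25, 7, 31, 18)   # hypothetical values
scores[2] <- 20                  # replace one element by position
scores[scores > 30] <- 30        # replace every value above 30 by 30 (capping)
scores                           # 12 20 7 30 18
```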

Seeing data

Unlike in Excel, the data is not directly shown in R. You have to ask for it! To see the content of a variable, you have to type its name and press ENTER.
The head() function shows the first 6 lines and the tail() function shows the last 6 lines.

head(data, 8) # First n lines, with n = 6 by default

The summary() function very often gives useful (statistical) information:

summary(data) # Descriptive statistics
    Gender      Weight          Height           Age       
 Male  :7   Min.   :48.00   Min.   :151.0   Min.   :28.00  
 Female:7   1st Qu.:59.25   1st Qu.:165.2   1st Qu.:35.25  
            Median :65.50   Median :170.0   Median :39.00  
            Mean   :68.07   Mean   :171.9   Mean   :39.50  
            3rd Qu.:71.75   3rd Qu.:177.0   3rd Qu.:44.75  
            Max.   :99.00   Max.   :199.0   Max.   :54.00  

Date management

The best package for date management is lubridate. Text can be converted into dates using the base R function as.Date(), and years, months and days can be retrieved using the year(), month() and day() functions of lubridate.

if(!require(lubridate)){install.packages("lubridate")}
library(lubridate)
d <- as.Date("2000-04-08")
year(d)  # Gives the year
[1] 2000
month(d) # Gives the month
[1] 4
day(d)   # Gives the day
[1] 8
make_date(year = 2017, month = 6, day = 12) # Creates a date with specified YMD
[1] "2017-06-12"
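
Once values are stored as dates, simple arithmetic works on them. The sketch below uses base R only (no package needed); the two dates are arbitrary examples.

```r
d1 <- as.Date("2000-04-08")
d2 <- as.Date("2000-05-01")
d2 - d1                        # time difference: 23 days
as.numeric(d2 - d1)            # the same as a plain number: 23
d1 + 30                        # adding days gives a new date: "2000-05-08"
format(d1, "%d/%m/%Y")         # display in another format: "08/04/2000"
```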

4 - HANDLING DATA WITH THE TIDYVERSE

Manipulation

Filtering items is incredibly easy via filter(). The %in% operator can be useful when testing for several values.

filter(data, Age > 42)                        # All people older than 42
filter(data, Gender == "Male", Weight > 70)   # All guys heavier than 70 
filter(diamonds, cut %in% c("Fair", "Good"))  # Diamonds with Fair or Good cut
filter(diamonds, color %in% c("E", "F"))      # Diamonds with E or F color

Selecting the rows with the largest (or smallest) values of a particular variable is performed with top_n(). In the same vein, arrange() sorts the whole dataset according to one or several variables.

top_n(data, 3, Height)   # Tallest 3 individuals
top_n(data, -4, Weight)  # Lightest 4 individuals
arrange(data, Gender, Height)       # First sorts individuals by gender, and then by height within each gender
arrange(data, Gender, desc(Weight)) # Same, but with descending weight.
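
In recent versions of dplyr (1.0.0 and later), top_n() still works but is superseded by slice_max() and slice_min(). A short sketch, assuming such a version is installed and using a made-up data frame:

```r
library(dplyr)   # assumption: dplyr 1.0.0 or later, which provides slice_max() and slice_min()
people <- data.frame(Name   = c("Ann", "Bob", "Cid", "Dee", "Eve"),   # hypothetical data
                     Height = c(165, 184, 177, 158, 171))
slice_max(people, Height, n = 2)   # the 2 tallest: Bob (184) and Cid (177)
slice_min(people, Height, n = 1)   # the shortest: Dee (158)
arrange(people, desc(Height))      # everybody, sorted from tallest to shortest
```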

Selecting a few columns with select().

select(data, Gender, Age) # Keeping only these 2 columns.

Adding new columns can be performed with the mutate() function. Below, we compute the price/carat ratio of diamonds. The mutate() function is THE best choice to add columns that are easily calculated.

diamonds %>% 
    select(-x, -y, -z, - depth, -table) %>% # Getting rid of less useful variables
    mutate(P_C_ratio = price/carat)         # Create a new column: the price/carat ratio

Piping

Very often, several operations are required before the desired output is obtained. There is an elegant way to successively combine functions. It is called piping and works with the operator %>%, which we used just above.

data %>% select(Gender, Height, Age) %>% filter(Age > 40) # The two functions select() and filter() are applied successively
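
To see what the pipe really does, the sketch below writes the same operation twice, with and without %>%, on a small made-up data frame: x %>% f(y) is simply another way to write f(x, y).

```r
library(dplyr)   # provides %>%, select() and filter()
people <- data.frame(Name = c("Ann", "Bob", "Cid"),   # hypothetical data
                     Age  = c(45, 30, 52),
                     City = c("Lyon", "Paris", "Lille"))
piped  <- people %>% select(Name, Age) %>% filter(Age > 40)   # read left to right
nested <- filter(select(people, Name, Age), Age > 40)         # same thing, read inside out
identical(piped, nested)                                      # TRUE
```

The piped version is usually easier to read because the steps appear in the order in which they are executed.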

Pivot tables

Pivot tables (PT) are very simply obtained via the combination of group_by() and summarise(). group_by() determines along which variables the PT will be computed and summarise() specifies the metric/indicator of interest.

diamonds %>% 
    group_by(cut, color, clarity) %>% # Grouping by cut, then color and then clarity
    summarise(med_price = median(price), med_carat = median(carat)) # For each subgroup, computing 2 indicators: median price and median carat
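
The diamonds table is large, so the result is hard to check by eye. Here is the same idea on a tiny made-up table (sales, Shop and Amount are hypothetical names), where each figure can be verified by hand:

```r
library(dplyr)   # provides group_by(), summarise() and n()
sales <- data.frame(Shop   = c("A", "A", "B", "B", "B"),   # hypothetical data
                    Amount = c(10, 30, 5, 15, 40))
pivot <- sales %>%
    group_by(Shop) %>%                      # one group per shop
    summarise(n_sales    = n(),             # number of rows in each group: A has 2, B has 3
              total      = sum(Amount),     # A: 40, B: 60
              med_amount = median(Amount))  # A: 20, B: 15
pivot                                       # one line per shop
```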

Graphs

In R, there is one major function for graphical representation: ggplot(). In ggplot, aes() describes the aesthetics of the plot. Usually, we need to define the x-axis variable and often, the y-axis variable (for scatter plots and lines). Also, representations can allow for color, shape and size variations, especially for scatter plots. The type of the plot is defined by a ‘geom’. More details can be found here: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html

ggplot(data) + geom_point(aes(x = Height, y = Weight, color = Age, shape = Gender, size = Age)) # geom_point = scatter plot

ggplot(diamonds) + geom_bar(aes(x = clarity, fill = cut)) # geom_bar = barplot

diamonds %>% 
    group_by(clarity) %>% 
    summarise(med_carat = median(carat)) %>%
    ggplot(aes(x = clarity, y = med_carat)) + geom_bar(stat = "identity")

 # When providing a "y" for a barplot, you must specify the stat="identity". This graph shows that large diamonds are much less "pure" than smaller ones. Which makes sense.

5 - LOOPS

There are usually several types of loops, but we will focus on the for loop. Its structure is simple: the idea is to repeat a task a finite number of times. This makes it possible to automate the changes in a variable. For instance, the Fibonacci sequence:

nb <- 20  # Number of desired numbers
Fib <- 1  # Initialise the output (it will be incrementally augmented): first value
Fib[2] <- 1 # Initialisation: second value
for(k in 3:nb){
   Fib[k] <- Fib[k-1] + Fib[k-2] # the kth value is the sum of the 2 previous ones
}
Fib # Show the sequence
 [1]    1    1    2    3    5    8   13   21   34   55   89  144  233  377  610  987 1597 2584 4181 6765
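
A second sketch with the same structure: a running total computed with a for loop, on a made-up vector x. For this particular task R already has a built-in function, cumsum(), which is shown for comparison.

```r
x <- c(3, 1, 4, 1, 5)          # hypothetical input
running <- numeric(length(x))  # create the output vector first, filled with zeros
running[1] <- x[1]             # first value
for(k in 2:length(x)){
   running[k] <- running[k-1] + x[k]  # add the kth value to the previous total
}
running                        # 3 4 8 9 14
cumsum(x)                      # the built-in function gives the same result
```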

6 - MISC. FUNCTIONS

Statistics

Below, we present a few useful functions.

rbind(c(2,5,7),c(3,1,8)) 
     [,1] [,2] [,3]
[1,]    2    5    7
[2,]    3    1    8
rbind(c(2,5,7),c(3,1,8)) %>% t()    # Transpose (a vector, a matrix, a dataframe)
     [,1] [,2]
[1,]    2    3
[2,]    5    1
[3,]    7    8
sqrt(5)                             # Square root
[1] 2.236068
mean(c(2:6, 8:43))                  # Average value
[1] 22.87805
sd(c(2:6, 8:43))                    # Standard deviation, use var() for the variance
[1] 12.17004
v <- rnorm(18)                      # We generate a random vector
min(v)                              # Minimum
[1] -1.223596
max(v)                              # Maximum
[1] 1.775057
v                                   # A look at the whole vector
 [1]  0.62139521  0.75263857  0.54648395 -1.22359589  0.10757450  0.42116621  1.77505745  1.39276542  0.48365785 -0.04642858
[11]  0.78928712 -0.39740584 -0.34888729 -1.13734942  0.09880032 -0.07831536  0.67318915 -0.70351280
lag(1:10)                           # The lag() function of dplyr (loaded with the tidyverse): shifts data to the right
 [1] NA  1  2  3  4  5  6  7  8  9
mean(c(NA,1:10), na.rm = TRUE)      # When computing means, if values are missing, NA will be returned. na.rm = TRUE solves this.
[1] 5.5
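
Missing values deserve a closer look. The sketch below, on a made-up vector, shows how to detect and count them, and that na.rm works the same way for var() and sd():

```r
x <- c(4, NA, 8, 6)          # one missing value
mean(x)                      # NA: the missing value contaminates the result
mean(x, na.rm = TRUE)        # 6: mean of 4, 8 and 6
is.na(x)                     # FALSE TRUE FALSE FALSE
sum(is.na(x))                # number of missing values: 1
var(x, na.rm = TRUE)         # 4, which is sd(x, na.rm = TRUE) squared
```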

Changing modes

Usual modes for variables are:
- logical (Boolean, TRUE or FALSE),
- numeric (numbers),
- character (text),
- factor (unordered category) and
- ordered factor (ordered category)
It is sometimes possible to switch from one to another, but not always: for instance, text that does not represent a number (like "abc") cannot be translated into a number.
Some examples below.

c("3", "8", "7")                    # Numbers viewed as text 
[1] "3" "8" "7"
c("3", "8", "7") %>% as.numeric()   # Change the above into *true* numbers
[1] 3 8 7
c(3,4,6) %>% as.character()         # The opposite: change fields into characters
[1] "3" "4" "6"
data$Age %>% as.factor() %>% summary() # as.factor() transforms the fields into categories, the final step computes the number of elements in each category
28 29 34 35 36 37 38 40 42 44 45 46 54 
 1  1  1  1  1  1  1  1  1  1  2  1  1 
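
A few conversions that often surprise beginners, as a short sketch on made-up values. The factor case is a classic trap: as.numeric() applied directly to a factor returns the internal category codes, not the values.

```r
as.numeric(c("3", "abc"))        # 3 NA, with a warning: "abc" is not a number
as.numeric(TRUE)                 # 1 (and FALSE gives 0)
f <- factor(c("10", "20", "10")) # a factor whose categories look like numbers
as.numeric(f)                    # 1 2 1: the category codes, NOT the values
as.numeric(as.character(f))      # 10 20 10: go through text to get the values
class(f)                         # "factor": class() tells you what kind of object you have
```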
