---
title: "Supervised learning"
output:
  html_notebook:
    toc: true
    toc_float: true
---

<style type="text/css">
.table {
    width: 50%;
    margin-left:auto; 
    margin-right:auto;
}
</style>

# Introduction

In the broad field of machine learning, there are three main types of algorithms. The classical classification is:   
- **supervised learning**: try to explain (and then predict) one variable ($y$) with other variables ($X$);    
- **unsupervised learning**: try to make sense of a dataset on its own; often, this amounts to finding clusters;   
- **reinforcement learning**: learn via trial and error; perform actions, get rewards and improve!     


This notebook is dedicated to supervised learning. In supervised learning, there are two key elements: 

|Label|Feature|
|---:|---:|
|y|X|
|Dependent variable|Independent variable|
|Output|Input|
|Predicted variable|Predictor|

The core equation has the form:
$$y = f(X)+e,$$
because we try to make sense of $y$ with the knowledge of $X$. By "*make sense*", we mean: find a **magical formula** ($f$) such that the related errors $e$ are not too big (see below). When $y$ is a number, we talk about **regression**; when $y$ is a category or class, the exercise is a **classification**. For simplicity, we only focus on regression in this notebook, but the ideas are very similar for classification.

The kind of questions we can answer are for instance:

- can we explain the price ($y$) of a diamond with its size (carat), cut, color and clarity? All these variables are included/stacked into $X$.     
- can we predict the popularity ($y$) of a song with $X$ being its length, valence, danceability, loudness, etc.?


# Linear models: the genesis, the building blocks
## The basic idea

The simplest functions are linear. For diamonds, the formula would be:

$$price = b + b_{carat} \times carat + b_{cut} \times cut + b_{color} \times color + b_{clarity} \times clarity,$$

and the whole challenge is to find the magical numbers $b_{xxx}$. It is customary to include a constant in the model - here it is $b$. Naturally, the above formulation assumes that all variables are **numerical**: it is impossible to multiply text or categories. This is no big deal for ordered categories: we can simply transform them into numbers. For instance, the **cut** values  of the diamonds will be coded as 1 for **Fair**, 2 for **Good**, 3 for **Very Good**, 4 for **Premium** and 5 for **Ideal**.
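
A quick way to see this coding in R (a short sketch based on the diamonds data from the ggplot2 package, which is used throughout this notebook):

```{r, message = FALSE}
library(ggplot2)
levels(diamonds$cut)              # The ordered levels: Fair, Good, Very Good, Premium, Ideal
as.numeric(head(diamonds$cut))    # as.numeric() maps these levels to 1, 2, ..., 5
```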


Obviously, if we want to find a formula for a large quantity of diamonds, say 10,000, it will be impossible to find values of the coefficients $b_{xxx}$ that exactly match the prices of the diamonds. We will have to admit that the model makes some **errors**. Hence, the formula will be

$$price_i = b + b_{carat} \times carat_i + b_{cut} \times cut_i + b_{color} \times color_i + b_{clarity} \times clarity_i + error_i,$$

where we use the index $i$ to show that we model the price of diamond number $i$ with the carat number $i$, the cut number $i$, and so on. For each diamond, the model will make an error, also indexed by $i$. Note that the coefficients $b_{xxx}$ are not indexed by $i$: they are common to all diamonds.

The key question now is: how do we determine these **coefficients**?

The answer is simple: we want the errors to be as small as possible! A simple objective would be to minimize the sum of errors. This is not a good idea because positive errors would offset negative errors (i.e. overpricing some diamonds would compensate underpricing others). Instead, it is more relevant to **minimize the sum of squared errors**:

$$\underset{b_{xxx}}{\text{min}} \ \sum_{i=1}^N(error_i)^2$$

Remember that the errors depend directly on the $b_{xxx}$ numbers. 
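
To make this concrete, here is a minimal sketch on invented data: we minimize the sum of squared errors directly with optim() and check that the result matches the coefficients returned by the standard routine (lm(), introduced in the next section).

```{r}
set.seed(42)
x <- runif(100)                                     # One explanatory variable (invented)
y <- 1 + 2 * x + rnorm(100, sd = 0.1)               # True model: b = 1, b_x = 2, plus noise
sse <- function(b) sum((y - (b[1] + b[2] * x))^2)   # Sum of squared errors as a function of b
optim(c(0, 0), sse)$par                             # Numerical minimisation of the sum of squared errors
coef(lm(y ~ x))                                     # Nearly identical numbers, obtained directly by lm()
```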

## Some practice

Simple linear models are obtained in R via the function lm(). The syntax is a bit strange at first because of the tilde ~, but it is very common in other statistical packages. First, we convert the diamonds database to a fully numerical format. For technical reasons, we also divide the price of the diamonds by 1,000. Machine learning algorithms prefer when numbers have **homogeneous magnitudes** and are not too big.

```{r, message = FALSE, warning = FALSE}
library(tidyverse)
diams <- diamonds %>%
    select(price, carat, cut, color, clarity) %>%  # Keep the relevant columns
    mutate_if(is.factor, as.numeric) %>%           # Code the ordered factors as numbers
    mutate(price = price / 1000)                   # Rescale prices into thousands of dollars
head(diams)
```

We confirm that all variables are numerical.   
We can then proceed to the regression.

```{r}
lm(price ~ carat + cut + color + clarity, data = diams)
```

The (Intercept) is the constant $b$. The model is thus:

$$price_i = -4.66  + 8.78 \times carat_i + 0.16 \times cut_i-0.32 \times color_i + 0.52 \times clarity_i+ error_i$$

Beyond this simple output, many interesting quantities can be computed that further characterize the regression ($R^2$, $t$-statistics, $p$-values, etc.). To obtain them, it suffices to pipe the output into summary(). A detailed analysis of these numbers is beyond the scope of this introductory session.

```{r}
lm(price ~ carat + cut + color + clarity, data = diams) %>% summary()
```

Now, let us imagine a small diamond, say with 0.4 carat, a Fair cut (coded 1), a color code of 2 and an average clarity of 3 (corresponding to SI1). Let us compute the model value (assuming zero error) for this diamond:

$$price = -4.66  + 8.78 \times 0.4 + 0.16 \times 1-0.32 \times 2 + 0.52 \times 3 = -0.07,$$

that is: the predicted price is negative. Hum hum....   
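
A quick sketch with predict() confirms this (the model is refitted and stored first):

```{r}
fit_lm <- lm(price ~ carat + cut + color + clarity, data = diams)                # Store the fitted model
predict(fit_lm, newdata = tibble(carat = 0.4, cut = 1, color = 2, clarity = 3))  # The small diamond from above
```
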
Linear relationships are too limited.   
But it's possible to build models that are more complex... 


# Neural networks

## Introduction

Often, neural networks (NNs) are represented with figures like the one below:

```{r, echo = FALSE, fig.align='center'}
knitr::include_graphics("nn.png")
```

This is a very simplistic view of what NNs are. In fact, inside each circle, there is a linear model! And between two consecutive colors, a non-linear transformation (the activation) is applied. The outputs of the blue models are thus altered and sent as inputs to the green linear models. The same process is repeated from green to red. Each color represents what is called **a layer**. The original input is transformed and passed through the layers, from left to right, until it reaches the last layer, which is simply the output.
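
For illustration, here is a minimal sketch of what happens inside a single circle (unit); the weights below are invented:

```{r}
x <- c(0.4, 1, 2, 3)                      # Inputs of the unit (e.g. carat, cut, color, clarity)
w <- c(0.5, -0.2, 0.1, 0.3)               # Weights of the unit (invented values)
b <- 0.1                                  # Constant (bias)
relu <- function(z) max(0, z)             # A common non-linear activation
relu(sum(w * x) + b)                      # Output, which is passed on to the next layer
```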

The exact structure is detailed below for a very small network.

```{r, echo = FALSE, fig.align='center'}
knitr::include_graphics("NN_scheme.png")
```

Needless to say: NNs can be very complicated functions with lots of parameters. In neural networks, the parameters are all the coefficients (called weights) in all linear units in the network. 

The crucial step is thus to find the *best* parameters. 
Just like for regressions, the model aims at minimizing the sum of squared errors. This quantity is called the **loss**.

Finding the optimal parameters of a complex function like a NN cannot be done with an explicit formula, as is the case for regressions.
The process that finds good parameters is progressive and iterative. 
The idea is the following:

```{r, echo = FALSE}
knitr::include_graphics("Newton.png")
```
The aim is to reduce the total error (the loss). To do so, we compute the derivative of the loss with respect to the parameters and take small steps in the direction that decreases it, until we reach a minimum point. 
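
Here is a minimal sketch of this idea on a one-parameter toy problem (invented numbers): the single coefficient b is repeatedly updated by following the derivative of the loss.

```{r}
x <- c(0.5, 1.0, 1.5); y <- c(1.1, 2.0, 2.9)       # Tiny invented dataset
b <- 0                                              # Initial guess for the coefficient
for (i in 1:100) {
    grad <- sum(-2 * x * (y - b * x))               # Derivative of the squared-error loss w.r.t. b
    b <- b - 0.05 * grad                            # Small step against the gradient (learning rate 0.05)
}
b                                                   # Close to the least-squares value sum(x * y) / sum(x^2)
```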

The process is complicated at the scale of the whole network and relies on the **chain rule** for differentiation.
It starts from the right of the network and then diffuses back progressively to the early nodes of the network.
That's why it's called **backpropagation**.
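
In symbols, for a single weight $w$ located in an early layer, the chain rule decomposes the derivative of the loss into a product of simpler derivatives computed layer by layer, from the output back to $w$ (schematically):

$$\frac{\partial \, loss}{\partial \, w} = \frac{\partial \, loss}{\partial \, output} \times \frac{\partial \, output}{\partial \, hidden} \times \frac{\partial \, hidden}{\partial \, w}$$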

## Example: diamonds

We show how to implement NNs in R via the keras package. 
But first, we must prepare the data.

```{r}
even <- 2*(1:20000)                   # Even numbers
odd <- even + 1                       # Odd numbers
diam_train <- diams[even,]            # Training set
train_label <- diam_train$price       # Training label
train_features <- diam_train %>%      # Training features
  select(-price) %>%
  as.matrix()
diam_test <- diams[odd,]              # Testing sample
test_label <- diam_test$price         # Testing label
test_features <- diam_test %>%        # Testing features
  select(-price) %>%
  as.matrix()
```

The construction of the network requires 3 steps:   
1. Definition of the structure of the network (number of layers & units per layer)    
2. Setting of the loss function and learning process    
3. Training by specifying the batch sizes and number of rounds   

The **batch size** is how many instances are used to compute the gradients in the backpropagation at each updating step. The number of **epochs** (rounds) is how many times the algorithm sees the whole training sample.
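
For instance, with the 20,000 diamonds of the training sample above and a batch size of 512, one epoch corresponds to roughly 40 weight updates (20,000 / 512, rounded up).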

The lines below rely on **Keras**, a deep learning framework originally coded in Python. They require a Python installation as well as the keras package, see the installation section of https://keras.rstudio.com.

```{r, message = FALSE, warning = FALSE}
library(keras)
# install_keras() # To complete installation
model <- keras_model_sequential()
model %>%   # This defines the structure of the network, i.e. how layers are organized
    layer_dense(units = 16, activation = 'relu', input_shape = ncol(train_features)) %>%
    layer_dense(units = 8, activation = 'sigmoid') %>%
    layer_dense(units = 1) # No activation means linear activation: f(x) = x.
```

We can plot this architecture via tools such as: http://alexlenail.me/NN-SVG/index.html.


```{r}
model %>% compile(                             # Model specification
    loss = 'mean_squared_error',               # Loss function
    optimizer = optimizer_rmsprop(),           # Optimisation method (weight updating)
    metrics = c('mean_absolute_error')         # Output metric
)
summary(model)                                 # Model architecture
```
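
For each dense layer, the number of parameters reported by summary() is (number of inputs) $\times$ (number of units) + (number of units), the last term counting the constants (biases): here $4 \times 16 + 16 = 80$, then $16 \times 8 + 8 = 136$ and $8 \times 1 + 1 = 9$, i.e. 225 parameters in total.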

```{r, message = FALSE, warning = FALSE}
fit_NN <- model %>% 
    fit(train_features,                                          # Training features
        train_label,                                             # Training labels
        epochs = 50, batch_size = 512,                           # Training parameters
        validation_data = list(test_features, test_label),       # Test data
        verbose = 0                                              # No messages
        ) 
plot(fit_NN)                                                     # Plot, evidently!
```


Let's test the model on the fake diamond mentioned before (0.4 carat, 1 cut, 2 color, 3 clarity):

```{r}
example <- c(0.4, 1, 2, 3) %>% t()  # Transpose into a one-row matrix: one row = one diamond
predict(model, example)
```
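
Remember that prices were divided by 1,000, so the prediction is expressed in thousands of dollars. Contrary to the simple linear model, it should now come out positive.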

## Example: predicting song popularity

Recycling a model structure is very easy! Let's try this on the songs dataset. First, we import the data and normalize some columns.

```{r}
load("songs.RData")
songs <- songs %>%
  mutate(duration = duration / 60,
         popularity = popularity / 100,
         tempo = tempo / 100)
```

Let's have a look at the first row.

```{r}
songs[1,]
```
We want to be able to predict this popularity of 73 (i.e. 0.73 after the rescaling above) with a neural network. Let's go!
We build the model on all songs except the first one. First, we construct the training data.

```{r, message = FALSE, warning = FALSE}
feature_songs <- c("duration", "danceability", "energy", "loudness", "speechiness", "tempo", "valence")
train_songs <- songs[2:13070,]
songs_features <- train_songs %>%
  select(feature_songs) %>%
  as.matrix()
songs_label <- train_songs$popularity 
```

Next, we rebuild the model. 

```{r}
model2 <- keras_model_sequential()
model2 %>%   # This defines the structure of the network, i.e. how layers are organized
    layer_dense(units = 16, activation = 'relu', input_shape = length(feature_songs)) %>% # The original input size is different (7 features)
    layer_dense(units = 8, activation = 'sigmoid') %>%
    layer_dense(units = 1) # No activation means linear activation: f(x) = x.

model2 %>% compile(                             # Model specification
    loss = 'mean_squared_error',               # Loss function
    optimizer = optimizer_rmsprop(),           # Optimisation method (weight updating)
    metrics = c('mean_absolute_error')         # Output metric
)
summary(model2)                                 # Model architecture
```

There are more parameters because the input now contains 7 features instead of 4. 
Now let's turn to the actual learning: don't forget to change the inputs.

```{r}
fit_NN2 <- model2 %>% 
    fit(songs_features,                                          # Training features
        songs_label,                                             # Training labels
        epochs = 30, batch_size = 256,                           # Training parameters
        validation_data = list(songs_features, songs_label),     # Validation data (here, simply the training sample again)
        verbose = 0                                              # No messages
        ) 
plot(fit_NN2)                                                    # Plot, evidently!
```



```{r}
predict(model2, songs[1,] %>% select(feature_songs) %>% as.matrix())
```

The prediction is not great. In addition, it is partly random, because the initial weights of the network are drawn randomly.

For diamonds, the NN works well because the variables have an obvious direct impact on the price: size, purity, etc.   
For songs, this is not obvious; hence the lower precision. 
