This is the companion notebook for the session on Neural Networks applied to factor investing.
The implementation is done using Keras and the TensorFlow environment, both of which run on Python.

These three items (Keras, TensorFlow, and Python) need to be installed on the computer before you proceed.

In the financial applications below, the choice of dependent variable is not optimal. It was chosen to minimise code complexity and ease understanding of basic principles. Results may seem disappointing, but they are realistic! Neural networks are no magic wands in asset management. It takes experience/experimentation to make them deliver good results. That’s up to you!

\(\rightarrow\) ABOUT THE NOTEBOOK:

  • the code chunks are sequential. They must be executed in the correct order.
  • text between stars appears in bold in html format.

1 Activation functions

We start slowly with plots of activation functions. All of them, except the absence of activation (linear/identity function) and the ReLu, are bounded, either in [0,1] or [-1,1].

library(tidyverse)
# Definition of functions below
ReLu <- function(x){return(pmax(0,x))}
Heaviside <- function(x){return((sign(x)+1)/2)}
Sigmoid <- function(x){return(1/(1+exp(-x)))}
Identity <- function(x){return(x)}

# The graph
ggplot(data.frame(x = c(-1.5,1.5)), aes(x)) +
  stat_function(fun = Identity, aes(color = "Identity")) +
  stat_function(fun = ReLu, aes(color = "ReLu"), linetype = 2, size = 1.2) +
  stat_function(fun = Heaviside, aes(color = "Heaviside")) + 
  stat_function(fun = Sigmoid, aes(color = "Sigmoid")) + 
  stat_function(fun = tanh, aes(color = "TanH")) +
  coord_fixed(0.6) + theme_light() +
    theme(legend.position = c(0.8,0.2))

2 A first example

2.1 Preparation

Data preparation takes time, but proper formatting is imperative. From data, we create data_f, which has normalised data: each feature has a uniform distribution at each date.

load("data.RData")
tick <- unique(data$Tick)               # Vector of ticker symbols
data <- data %>% arrange(Date, Tick)    # Order the data

data <- data %>% 
  select(-ESG_rank) %>%                             # Remove ESG: not enough points!
  group_by(Tick) %>% 
  mutate(F_Return = lead(Close) / Close - 1) %>%    # Forward monthly return 
  ungroup() %>%                                     # Remove grouping
  na.omit()                                         # Remove missing data

We created returns above. Below, we rescale the features.

normalise <-  function(v){  # This is a function that 'uniformalises' a vector
  v <- v %>% as.matrix()
  return(ecdf(v)(v))
}

data_f <- data %>%                        # Creating data with normalised values
  select(-Close) %>%                      # Closing prices are not predictors
  group_by(Date) %>%                      # Normalisation takes place for each given date
  mutate_if(is.numeric, normalise) %>%    # Apply normalisation on numerical columns
  ungroup() %>%                           # Ungroup
  arrange(Date, Tick)                     # Just making sure the order is the same as before

data_f$F_Return <- data$F_Return          # Keep original returns: better for interpretation
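As a quick illustration of what normalise() does (a toy example with made-up values): the function returns the empirical CDF evaluated at each point, i.e. ranks rescaled to (0,1].

normalise(c(10, 2, 7, 5))   # Returns 1.00 0.25 0.75 0.50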

Then, further formatting is required! We prepare the data so that it matches the input format expected by Keras.

features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg") # Predictor columns
split_date <- "2015-06-15"                      # Splitting point train versus test
train <- data_f %>% filter(Date < split_date)   # Training data
test <- data_f %>% filter(Date > split_date)    # Testing data
# Below, we split these sets into features versus labels
train_features <- select(train, all_of(features)) %>% as.matrix() # Matrix format is important
train_labels <- train$F_Return
test_features <- select(test, all_of(features)) %>% as.matrix()   # Matrix format is important
test_labels <- test$F_Return

2.2 Model parametrisation

Next step: the model architecture and the training details. We keep the layer dimensions low to reduce computational cost and also because a large number of parameters would be:

  1. useless given the small number of predictors;
  2. a direct shortcut to overfitting.

if(!require(keras)){install.packages("keras")} # Install Keras
# The user must also install Python & Tensorflow
library(keras)
# install_keras() # To complete installation
model <- keras_model_sequential()
model %>%   # This defines the structure of the network, i.e. how layers are organized
  layer_dense(units = 4, activation = 'relu', input_shape = ncol(train_features)) %>%
  layer_dense(units = 3, activation = 'sigmoid') %>%
  layer_dense(units = 1) # No activation means linear activation: f(x) = x.

model %>% compile(                           # Model specification
  loss = 'mean_squared_error',               # Loss function
  optimizer = optimizer_rmsprop(),           # Optimisation method (weight updating)
  metrics = c('mean_absolute_error')         # Output metric
)
summary(model)                               # Model architecture
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_2 (Dense)                     (None, 4)                       24          
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 3)                       15          
## ________________________________________________________________________________
## dense (Dense)                       (None, 1)                       4           
## ================================================================================
## Total params: 43
## Trainable params: 43
## Non-trainable params: 0
## ________________________________________________________________________________

The number of parameters is indeed equal to \[\mathcal{N}=\left(\sum_{l=1}^L(U_{l-1}+1)U_l\right)+U_L+1,\] where \(U_l\) is the number of units of layer l and \(U_0\) is the number of features. Here, \[(5+1)\times 4 + (4+1)\times 3+ (3+1) = 24 + 15 + 4 = 43.\]
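As a quick sanity check, this count can be computed directly in R (a small sketch; the units vector below simply lists \(U_0\), the layer sizes, and the output unit):

units <- c(5, 4, 3, 1)                        # U_0 = nb of features, then the three layer sizes
sum((units[-length(units)] + 1) * units[-1])  # (5+1)*4 + (4+1)*3 + (3+1)*1 = 43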

2.3 Model implementation

Finally, training and testing!

fit <- model %>% fit(train_features,               # Training features
                     train_labels,                 # Training labels
                     epochs = 10, batch_size = 64, # Training parameters
                     validation_data = list(test_features, test_labels) # Test data
)
plot(fit) + theme_light()              # Plot, evidently!

mean(abs(test_labels)) # Error if zero prediction
## [1] 0.05318341

Training errors are usually smaller, but sometimes that may not be the case. The agnostic bet (predicting a zero return for every stock) reaches an absolute error of 5.3% on average. The performance here is not great.
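For comparison, the test error of the trained model can be obtained directly with Keras' evaluate() function (a quick sketch; the exact figures depend on the random initialisation):

model %>% evaluate(test_features, test_labels, verbose = 0)  # Returns the MSE loss and the MAE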

An important step: extracting predicted values! We extract the predictions for the first three instances of the test set below.

predict(model, test_features[1:3,]) # Predictions for the first three test instances
##             [,1]
## [1,] 0.007683896
## [2,] 0.010230683
## [3,] 0.012292013

These are the predictions from the neural network for the first three test instances.

2.4 Model options

Below, we discuss four options (among so many!):

  • in the fitting phase, early stopping halts training when a performance metric fails to improve.
  • weight initialisation pertains to the first values of weights and biases before the first round of backpropagation. Many choices exist: some are deterministic (constant) and some are random.
  • weight regularisation is one technique used to prevent overfitting by penalising the magnitude of weights in the units of the network.
  • dropout is another option to reduce overfitting: during training, some units are randomly ‘deleted’ and omitted by the network. We refer to the original paper by Srivastava et al. (2014) for the intuition behind this idea.

We discuss these four extensions below.

Sometimes, it is hard to know what an appropriate number of epochs might be. Luckily, it is possible to stop the training when the learning process stagnates. This feature is embedded in a callback function in Keras, as is shown below. In the code below, training stops when the validation loss (val_loss) fails to improve by at least 0.01 over 3 consecutive epochs.

fit <- model %>% fit(train_features,                        # Training features
                     train_labels,                          # Training labels
                     epochs = 10, batch_size = 32,          # Training parameters
                     validation_data = list(test_features, test_labels), # Test data
                     verbose = 0,                           # No comments
                     callbacks = list(callback_early_stopping(
                       monitor = "val_loss",                # Monitor validation loss
                       min_delta = 0.01,                    # Improvement threshold
                       patience = 3,                        # Nb of epochs with no improvement
                       verbose = 0                          # No warnings
                     )
                     )
)
plot(fit) + theme_light() # Plot

Indeed, the training stops early!

VERY IMPORTANT NOTE: the model does not reinitialise the weights. It starts from the fitted values obtained during the previous training! Hence, this new round of training only marginally improves on the previous one. We have simply changed the batch size, so this makes sense: no additional data was provided to the model.
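If a fresh start is really needed, one simple option (a sketch only; model_fresh is introduced here for illustration) is to rebuild and recompile an identical model, which draws new initial weights:

model_fresh <- keras_model_sequential()   # Same architecture, fresh random initialisation
model_fresh %>% 
  layer_dense(units = 4, activation = 'relu', input_shape = ncol(train_features)) %>%
  layer_dense(units = 3, activation = 'sigmoid') %>%
  layer_dense(units = 1)
model_fresh %>% compile(loss = 'mean_squared_error', optimizer = optimizer_rmsprop(),
                        metrics = c('mean_absolute_error'))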

We then turn to weight initialisation and penalisation. In Keras, weights can either be set to (small) random values or to constant values. We refer to the dedicated reference page for more details on the subject: https://keras.rstudio.com/reference/index.html#section-initializers

Moreover, weights can be penalised. This is done easily in the objective function: \[O=\sum_{i=1}^I \text{loss}(y_i,\tilde{y}_i) + \sum_{n=1}^N\left( \lambda_n^1 \| \mathbf{w}_n \|_1+\lambda_n^2\| \mathbf{w}_n \|_2^2\right),\]

where \(\mathbf{w}_n\) are the weights of unit \(n\) and \(\lambda_n^1\) and \(\lambda_n^2\) are the scaling factors related to the \(L^1\) and \(L^2\) norms of the weights. Note: another option is to directly constrain the norm of weights, see: https://keras.rstudio.com/reference/constraints.html.
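To connect the formula to the code (a sketch only; model_pen is introduced for illustration), \(\lambda_n^1\) and \(\lambda_n^2\) correspond to the l1 and l2 arguments of regularizer_l1_l2(), which is attached to a layer's kernel (its weight matrix):

model_pen <- keras_model_sequential() %>% 
  layer_dense(units = 3, activation = 'relu', input_shape = ncol(train_features),
              kernel_regularizer = regularizer_l1_l2(l1 = 0.01, l2 = 0.01)) %>% # L1 & L2 penalties
  layer_dense(units = 1)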

We provide one example below. Note that fixed weight initialisations have the advantage of reproducibility. Random seeds are not easy to manage given the two underlying systems (R and Python).

Finally, and for the sake of exhaustiveness, we mention the idea of dropout, which is another technique used to avoid overfitting. The dropout line randomly omits a fraction of the outputs of the layer that precedes it (equivalently, of the inputs fed to the layer that follows).

model_options <- keras_model_sequential()
model_options %>%   # This defines the structure of the network, i.e. how layers are organized
  layer_dense(units = 8, activation = 'relu', input_shape = ncol(train_features)) %>%
  layer_dropout(rate = 0.25) %>%  # Dropout: 25% of units (1/4) will be omitted 
  layer_dense(units = 4, activation = 'sigmoid', kernel_initializer = "random_normal") %>%
  layer_dense(units = 3,                                     # A layer with many options
              activation = 'relu',                           # Activation: classic
              bias_initializer = initializer_constant(0.2),  # Bias initialisation
              kernel_regularizer = regularizer_l2(0.01)) %>% # Weight regularisation
  layer_dense(units = 1, activation = 'tanh') 

model_options %>% compile(                      # Model specification
  loss = 'mean_squared_error',                # Loss function
  optimizer = optimizer_rmsprop(),            # Optimisation method (weight updating)
  metrics = c('mean_absolute_error')          # Output metric
)
summary(model_options)                          # Model structure
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_6 (Dense)                     (None, 8)                       48          
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 8)                       0           
## ________________________________________________________________________________
## dense_5 (Dense)                     (None, 4)                       36          
## ________________________________________________________________________________
## dense_4 (Dense)                     (None, 3)                       15          
## ________________________________________________________________________________
## dense_3 (Dense)                     (None, 1)                       4           
## ================================================================================
## Total params: 103
## Trainable params: 103
## Non-trainable params: 0
## ________________________________________________________________________________

Below, we train this model and make a quick prediction.

fit <- model_options %>% fit(train_features,       # Training features
                             train_labels,                 # Training labels
                             epochs = 10, batch_size = 64, # Training parameters
                             validation_data = list(test_features, test_labels) # Test data
)
plot(fit) + theme_light() +                               # Plot, evidently!
  scale_color_manual(values = c("#EE9911", "#1199BB")) +
  theme(legend.position = c(0.8,0.8))

A comment here: the training error decreases rapidly, but the testing error does not; it remains relatively flat (though below the training error). The chances that the model generalises well are slim.

The prediction syntax is unchanged.

predict(model_options, test_features[1:3,]) # Predictions for the first three test instances
##            [,1]
## [1,] 0.01127903
## [2,] 0.01587060
## [3,] 0.01737709

2.5 Classification

We now turn to classification. We again resort to a ternary coding of returns. We start by preparing the data. Unlike xgboost, Keras requires one-hot encoding.

r_thresh <- 0.02                                                        # Return threshold
data_f$FC_Return <- data_f$F_Return                                     # Duplicate return
data_f$FC_Return[data_f$F_Return <= (-r_thresh)] <- 0                   # Low return
data_f$FC_Return[(data_f$F_Return > (-r_thresh)) & (data_f$F_Return < r_thresh)] <- 1 # Soso return
data_f$FC_Return[data_f$F_Return >= r_thresh] <- 2                      # High return
train_class <- data_f %>% filter(Date < split_date)                     # Training data
test_class <- data_f %>% filter(Date > split_date)                      # Testing data
# Below, we split these sets into features versus labels
train_features_class <- select(train_class, all_of(features)) %>% as.matrix() 
train_labels_class <- model.matrix(~0+as.factor(train_class$FC_Return)) # One-hot encoding
test_features_class <- select(test_class, all_of(features)) %>% as.matrix()
test_labels_class <- model.matrix(~0+as.factor(test_class$FC_Return))   # One-hot encoding
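As an aside, Keras ships its own helper for one-hot encoding. The line below (train_labels_class_alt is introduced only for illustration) produces the same 0/1 indicator matrix as the model.matrix() call, since the classes are integers starting at 0:

train_labels_class_alt <- to_categorical(train_class$FC_Return, num_classes = 3) # Same encoding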

We then build a classification model. Several changes occur in the code:

  • the final layer must end with a number of units equal to the number of classes, and the activation function is usually softmax;
  • the loss function must also be adjusted. Several choices are possible, but cross-entropy is the mainstream option;
  • the performance metric can be changed. We use accuracy, which is also a classic choice.

model_class <- keras_model_sequential()
model_class %>%   # This defines the structure of the network, i.e. how layers are organized
  layer_dense(units = 4, activation = 'relu', input_shape = ncol(train_features_class)) %>%
  layer_dense(units = 3, activation = 'softmax') # Softmax for classification

model_class %>% compile(                           # Model specification
  loss = 'categorical_crossentropy',               # Loss function
  optimizer = optimizer_rmsprop(),                 # Optimisation method (weight updating)
  metrics = c('categorical_accuracy')              # Output metric
)
summary(model_class)                               # Model structure
## Model: "sequential_2"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense_8 (Dense)                     (None, 4)                       24          
## ________________________________________________________________________________
## dense_7 (Dense)                     (None, 3)                       15          
## ================================================================================
## Total params: 39
## Trainable params: 39
## Non-trainable params: 0
## ________________________________________________________________________________

And finally, the training!

fit <- model_class %>% fit(train_features_class,                   # Training features
                           train_labels_class,                     # Training labels
                           epochs = 10, batch_size = 32,           # Training parameters
                           validation_data = list(test_features_class, test_labels_class) # Test data
)
plot(fit) + theme_light()                                                   # Plot, evidently!

predict(model_class, t(test_features_class[1,])) # Prediction for the first test instance; t() yields the required 1-row matrix
##          [,1]      [,2]      [,3]
## [1,] 0.328903 0.2590789 0.4120181
colMeans(test_labels_class)                      # Class imbalance  
## as.factor(test_class$FC_Return)0 as.factor(test_class$FC_Return)1 
##                        0.3073529                        0.2759804 
## as.factor(test_class$FC_Return)2 
##                        0.4166667

We get a vector of probabilities for each class. The learning pattern is close to a plateau, which is again disappointing.
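To turn these probabilities into class predictions and a hit ratio, one possibility (a sketch) is the following:

probs <- predict(model_class, test_features_class)  # Probabilities for all test instances
pred_class <- max.col(probs) - 1                    # Most likely class, coded 0/1/2
mean(pred_class == test_class$FC_Return)            # Accuracy (hit ratio) on the test set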

3 Recurrent networks

A whole section is dedicated to recurrent networks because they are much tougher to handle. For simplicity, we will implement a Gated Recurrent Unit (GRU) model.

3.1 Data formatting - again!

The implementation is slightly more complicated for an RNN because:

  1. chronology matters.
  2. the training is usually performed on the unfolded network.

First, we must reshape the inputs, using arrays.

nb_ticks <- length(tick)                            # Nb stocks 
nb_chars <- length(features)                        # Nb features
nb_dates_train <- nrow(train_features) / nb_ticks   # Nb training dates (size of training sample)
nb_dates_test <- nrow(test_features) / nb_ticks     # Nb testing dates

# Below, we morph the inputs in array/tensor format
train_features_rnn <- array(train_features, dim = c(nb_ticks, nb_dates_train, nb_chars))
test_features_rnn <- array(test_features, dim = c(nb_ticks, nb_dates_test, nb_chars))
train_labels_rnn <- matrix(train_labels, nrow = nb_ticks) %>% 
  array(dim = c(nb_ticks, nb_dates_train,1))
test_labels_rnn <- matrix(test_labels, nrow = nb_ticks) %>% 
  array(dim = c(nb_ticks, nb_dates_test,1))
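A quick check of the resulting dimensions (stocks x dates x features) helps avoid silent reshaping mistakes:

dim(train_features_rnn)   # Should be nb_ticks x nb_dates_train x nb_chars
dim(train_labels_rnn)     # Should be nb_ticks x nb_dates_train x 1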

3.2 Training

The dimensions are crucial. In Keras, they are defined for RNNs as:

  1. the size of the batch. In our case, it could be either the number of assets or the number of dates;
  2. the timesteps. In our case, it could be just one or the number of dates;
  3. the number of features. In our case, there is only one possible figure: the number of predictors.

The problem is that these choices are binding when turning to the forecasting stage (as of Jan. 2019). Typically, when the training sample has a given number of timesteps, the testing sample must have the same number. Since we are aiming at one-step-ahead predictions, it will be more convenient to set timesteps = 1.

This is not a big issue for two reasons: first, Keras models are mutable, hence calling fit() several times will update the training of the model. Second, there is a feature in Keras RNNs that allows values from a previous batch to be carried over as initial states for the current batch. This option is activated when stateful = TRUE in the model arguments.

The solution we employ is thus to loop the training across the different assets and to use batch sizes equal to the number of chronological points in the sample. Nevertheless, to avoid undesired overlapping of hidden states when switching from one batch (i.e., asset) to another while training, we resort to the command reset_states() after each batch.

model <- keras_model_sequential() %>% 
  layer_gru(units = 16,                                          # Nb units in hidden layer
            batch_input_shape = c(nb_dates_train, 1, nb_chars),  # Dimensions = tricky part!
            activation = 'relu',                                 # Activation function
            stateful = TRUE) %>%                                 # Passing last state to next batch
  layer_dense(units = 1)                                         # Final aggregation layer

model %>% compile(
  loss = 'mean_squared_error',           # Loss = quadratic
  optimizer = optimizer_rmsprop(),       # Backprop
  metrics = c('mean_absolute_error')     # Output metric MAE
)

for(i in 1:nb_ticks){                      # Loop over assets
  model %>% fit(array(train_features_rnn[i,,], dim = c(nb_dates_train, 1, nb_chars)),        
                train_labels_rnn[i,,],              # Training labels
                epochs = 10,                        # Number of rounds
                batch_size = nb_dates_train,        # Length of sequences
                verbose = 0                         # Do not display progress bar
  )
  model %>% reset_states()   # Reset hidden states in the network when changing asset
}

In the above example, we roll the training via the stateful option. It would be possible to avoid this by swapping the batch size and the timesteps, but the label size must then be handled differently. Because the batch size would be one, the label should consist of only one point (the last one) and the architecture would be somewhat different.
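For the curious reader, below is a minimal sketch of this alternative (alt_model, x_alt and y_alt are introduced only for illustration): one asset is fed as a single batch containing one long sequence, and the target is reduced to the last label of that sequence.

alt_model <- keras_model_sequential() %>% 
  layer_gru(units = 16,                                # Same hidden size as above
            input_shape = c(nb_dates_train, nb_chars), # timesteps x features; batch size left free
            activation = 'relu') %>%                   # Only the last state is passed on
  layer_dense(units = 1)
alt_model %>% compile(loss = 'mean_squared_error', optimizer = optimizer_rmsprop())
x_alt <- array(train_features_rnn[1, , ], dim = c(1, nb_dates_train, nb_chars)) # First asset only
y_alt <- train_labels_rnn[1, nb_dates_train, 1]                                 # Last label only
alt_model %>% fit(x_alt, y_alt, epochs = 5, batch_size = 1, verbose = 0)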

For a detailed account of all the properties of GRU networks in Keras, we recommend the reference page: https://keras.rstudio.com/reference/layer_gru.html

3.3 Predictions

Below, we create a new model with the parameters of the one we have just trained. Given the input requirements of Keras, we choose (!) to work with a double loop. Indeed, the handling of batch sizes is not straightforward, so we proceed with single batches, one step at a time.

new_model <- keras_model_sequential() %>% 
  layer_gru(units = 16, batch_input_shape = c(1, 1, nb_chars), # NEW BATCH SIZE: ONE ITEM ONLY!
            activation = 'relu',                               # Activation function
            stateful = TRUE) %>%                               # Passing last state to next batch
  layer_dense(units = 1) 
new_model %>% set_weights(get_weights(model))

forecasts <- matrix(0, nrow = nb_dates_test, ncol = nb_ticks)
for(j in 1:nb_ticks){               # Loop on stocks
  for(i in 1:nb_dates_test){        # Loop on test dates
    test_sample <- test_features_rnn[j,i,] %>% array(dim = c(1, 1, nb_chars)) # Temp test sample
    forecasts[i,j] <- predict(new_model, test_sample, batch_size = 1)         # Prediction
  }
}
forecasts <- forecasts %>% data.frame()     # Reformatting the results
colnames(forecasts) <- tick
head(forecasts)                             # Displaying the results
##         AAPL        BA        BAC         C       CSCO        CVS        CVX
## 1 0.01949294 0.1238271 0.06203436 0.1339080 0.12036975 0.11682026 0.01977338
## 2 0.02365539 0.1459534 0.09767368 0.1354819 0.10002992 0.11945204 0.03210038
## 3 0.03345459 0.1409788 0.10650982 0.1172802 0.11369056 0.10393780 0.03851758
## 4 0.04446844 0.1326413 0.12706353 0.1234715 0.10177889 0.09102554 0.05417507
## 5 0.04890548 0.1214298 0.11671874 0.1258701 0.09734003 0.07978674 0.04877361
## 6 0.05294538 0.1139696 0.12207721 0.1205379 0.11232569 0.07107325 0.05154252
##            D        DIS          F          GE         HD         IBM
## 1 0.08462397 0.04617634 0.06466951 -0.02943616 0.05651163 0.005473061
## 2 0.08870284 0.03499522 0.07626037 -0.06653591 0.05382570 0.018538706
## 3 0.08739338 0.05773829 0.06301197 -0.07642568 0.05148355 0.038852725
## 4 0.08094107 0.06522466 0.05196173 -0.05488770 0.04876354 0.045998901
## 5 0.07125250 0.05381625 0.05262602 -0.04663152 0.04824727 0.049476229
## 6 0.07674345 0.05238795 0.02854374 -0.04711435 0.03413342 0.054076016
##         INTC        JNJ        JPM          K        MCK        MRK
## 1 0.08360042 0.10979138 0.03852692 0.13752125 0.09829006 0.03465381
## 2 0.10941762 0.08284923 0.05040085 0.14351591 0.09505129 0.02757927
## 3 0.11979502 0.07167646 0.05541964 0.13038389 0.09392039 0.03120726
## 4 0.11808260 0.05501388 0.06449872 0.11614393 0.09699351 0.05998521
## 5 0.10988586 0.04390071 0.06273219 0.09954669 0.10594976 0.06630570
## 6 0.10702608 0.03607112 0.06193262 0.08743370 0.10409075 0.06360394
##            MSFT       ORCL          PFE           PG            T        UNH
## 1  0.0124983741 0.10099673 0.0008221548  0.012940611  0.028254710 0.03116279
## 2 -0.0082691582 0.09977308 0.0101430742  0.011836113  0.020050183 0.05357419
## 3 -0.0185379088 0.08653473 0.0289592668 -0.003246208  0.005723811 0.06629346
## 4  0.0008776253 0.07808002 0.0332365818 -0.001323950 -0.014161185 0.06943801
## 5  0.0237668119 0.07187674 0.0361789837  0.001643203 -0.024276286 0.06797642
## 6  0.0427880175 0.08039701 0.0450083539  0.007681711 -0.034014985 0.06659734
##          UPS           VZ        WFC         WMT          XOM
## 1 0.07078810  0.003322459 0.01036216  0.04370237 -0.030562453
## 2 0.09271787 -0.004052998 0.02455194 -0.01037045 -0.002536726
## 3 0.10198719 -0.009198244 0.03961971 -0.03069826  0.022028757
## 4 0.10231321 -0.013529894 0.05484089 -0.04052212  0.023464072
## 5 0.09630442 -0.019007489 0.05764509 -0.02398769  0.017434955
## 6 0.08929661 -0.019625962 0.05295432 -0.02582015  0.022202801

In the above table, the figures can serve as input for investment decisions. In the next section, we implement one such NN-based strategy.

4 Portfolio choice

We now turn to the embedding of neural networks in portfolio backtesting. We plan to compare a NN-based strategy to the usual equally-weighted (EW) portfolio. We use the neural network to forecast future returns and form an EW portfolio of the ten stocks with the highest predictions. All parameters of the NN are hard-coded in the weights function (which is pretty bad!). The predictors are the same as above and gathered in the variable features.

weights_multi <- function(train_data, test_data, j){
  N <- test_data$Tick %>% unique() %>% length() # Number of stocks
  features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg") # Hard-coded
  if(j == 1){                             # j = 1 => EW
    return(rep(1/N,N))
  }
  if(j == 2){                             # j = 2 => NN-based
    model <- keras_model_sequential()   # We directly define the model
    model %>%   
      layer_dense(units = 4, activation = 'relu', input_shape = length(features)) %>%
      layer_dense(units = 3, activation = 'sigmoid', kernel_initializer = "random_normal") %>%
      layer_dense(units = 1)                     # Linear activation: f(x) = x
    model %>% compile(                           # Model specification
      loss = 'mean_squared_error',               # Loss function
      optimizer = optimizer_rmsprop(),           # Optimisation method (weight updating)
      metrics = c('mean_absolute_error')         # Output metric
    )
    # Below, we quickly format the data
    train_features <- select(train_data, all_of(features)) %>% as.matrix() 
    train_labels <- train_data$F_Return
    # And now, it's ready for its training purpose!
    fit <- model %>% fit(train_features,               # Training features
                         train_labels,                 # Training labels
                         epochs = 5, batch_size = 256, # Training parameters
                         verbose = 0
    )
    test_features <- select(test_data, all_of(features)) %>% as.matrix() # Test input
    pred <- predict(model, test_features)
    return((pred>quantile(pred, 2/3))/10)     # Top third of predictions (10 of the 30 stocks), equally weighted
  }
}

Then we fix the backtesting canvas, with a focus on dates, essentially.

sep_date <- as.Date("2013-01-01")             # This date separates in-sample vs out-of-sample
t_oos <- data$Date[data$Date > sep_date] %>%  # Out-of-sample dates (i.e., testing set)
  unique() %>%
  as.Date(origin = "1970-01-01")
Tt <- length(t_oos) - 1                                         # Nb of dates, avoid T = TRUE
nb_port <- 2                                                    # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick)))   # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port)           # Store returns

We stop the backtest before the last date because on this date, there is no forward return. Also, we avoid assigning a value to T, as it is reserved for the short form of the boolean TRUE.

We can now turn to the dynamic implementation. Given the computation time for one neural network, the iteration over all out-of-sample dates will take time even though the dataset is rather small. The usual solution is to resort to GPU computing. Neural nets are notoriously slow, compared to tree methods.

for(t in 1:(length(t_oos)-1)){                        # Stop before the last date: no fwd return!
  if(t%%12==0){print(t_oos[t])}                       # Just checking the date status
  train_data <- data_f %>% filter(Date < t_oos[t])    # Expanding window: all past values!!!
  test_data <- data_f %>% filter(Date == t_oos[t])    # Current values
  realized_returns <- data %>%                        # Computing returns via
    filter(Date ==  t_oos[t]) %>%                     # Filtering
    select(F_Return)                                  # Selecting
  for(j in 1:nb_port){                                     
    portf_weights[t,j,] <- weights_multi(train_data, test_data, j)  # Hard-coded params, bad!
    portf_returns[t,j] <- sum(portf_weights[t,j,] * realized_returns)
  }
}
## [1] "2013-12-31"
## [1] "2014-12-31"
## [1] "2015-12-31"
## [1] "2016-12-30"
## [1] "2017-12-29"
## [1] "2018-12-31"
## [1] "2019-12-31"
## [1] "2020-12-31"

We recycle the performance metric functions we already worked with in previous sessions (here, perf_met_multi()). They are sourced from a separate file in the working directory.

source("perf_met.R")                           # Import external code/script
returns <- data %>%                            # Compute return matrix: start from original data
  filter(Date %in% t_oos) %>%                  # Keep out-of-sample dates
  select(Date, Tick, F_Return) %>%             # Keep 3 attributes
  spread(key = Tick, value = F_Return)         # Shape in matrix format
perf_met_multi(portf_returns = portf_returns,  # Performance metrics (using imported function)
               weights = portf_weights, 
               asset_returns = returns,
               t_oos = t_oos,
               strat_name = c("EW", "NN"))
##        avg_ret        vol Sharpe_ratio       VaR_5       turn
## EW 0.010228461 0.03932889    0.2600750 -0.05506972 0.03884991
## NN 0.009584362 0.04532412    0.2114627 -0.06855076 1.32234583

On paper, the NN-based strategy seems roughly comparable to the EW benchmark, though its Sharpe ratio is slightly lower. The big issue in optimised portfolios is always turnover: asset rotation is much too high. This is a classical problem in quantitative portfolio management. There are many ways to mitigate this drawback and the core ideas are either to smooth the allocation scheme or to enforce hard constraints (e.g., min/max weights).
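As an illustration of the smoothing idea (a sketch only; smooth_weights is not used in the backtest above), the new allocation can simply be blended with the previous one:

smooth_weights <- function(w_new, w_old, lambda = 0.5){  # lambda = 1 recovers the raw allocation
  lambda * w_new + (1 - lambda) * w_old                  # Convex combination curbs turnover
}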

5 Exercises

5.1 Parameter properties

  1. What is the impact of the batch size on training duration? Test different values in one example.

5.2 Weight initialisation

  1. Train a model with random weights and make a prediction.
  2. Try again: is the prediction the same?