This is the companion notebook for the session on validating and tuning machine learning models for factor investing.

In the financial applications below, the choice of dependent variable is not optimal. It was chosen to minimise code complexity and to ease the understanding of basic principles. Results may seem disappointing, but they are realistic! It takes experience and experimentation to make ML algorithms deliver good results. That’s up to you!

→ ABOUT THE NOTEBOOK:

  • the code chunks are sequential. They must be executed in the correct order.
  • text between stars appears in bold in html format.

1 Variance-bias tradeoff

In this section, we illustrate how model complexity affects the average error (bias) and the variance of the predictions. For this purpose, we use simple trees.

1.1 Data preparation

We start by loading packages and importing the data.

library(tidyverse)      # Data wrangling (and so much more) package
library(rpart)          # Simple decision tree package
library(rpart.plot)     # Tree plot package
load("data.RData")      # Load the data
data <- data  %>%       # Compute the returns
  select(-ESG_rank) %>% # Remove ESG: not enough points!
  group_by(Tick) %>%                          
  mutate(F_Return = lead(Close) / Close - 1,              # Adding future 1M returns
         FL_Return = lead(Close, 12) / Close - 1) %>%     # Adding future 1Y returns
  ungroup() %>%
  group_by(Date) %>% 
  mutate(FR_Return = F_Return - mean(F_Return),           # Relative return vs EW market
         FLR_Return = FL_Return - mean(FL_Return)) %>%    # Relative return vs EW market (long)
  ungroup()
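
Before going further, it can be useful to glance at the newly created columns (a quick check; the output is not shown here).

data %>%                                                # Quick look at the new return columns
  select(Tick, Date, Close, F_Return, FLR_Return) %>%   # Keep identifiers & computed returns
  head()                                                # First six rows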

Then, we format the inputs that we will feed to the tree algorithm.

normalise <-  function(v){  # This is a function that 'uniformalises' a vector
  v <- v %>% as.matrix()
  return(ecdf(v)(v))
}
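
# To see what normalise() does, a tiny illustration with made-up (purely illustrative) values:
# the output is the empirical CDF evaluated at each point, i.e., ranks divided by the sample size
normalise(c(10, 2, 7, 5))   # Returns 1.00 0.25 0.75 0.50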

data_f <- data %>%                                  # Data with normalised values
  select(-Close) %>%                                # Closing prices are not predictors
  group_by(Date) %>%                                # Normalisation takes place for each given date
  mutate_at(vars(Vol_1M:Prof_Marg), normalise) %>%  # Apply normalisation on numerical columns
  ungroup() 

sep_date <- as.Date("2012-01-01")                   # Train vs test separation date
train_sample <- data_f %>%                          # Train sample
  filter(Date <= sep_date)
test_sample <- data_f %>%                           # Test sample
  filter(Date > sep_date) %>%
  na.omit()                                         # Remove missing data

1.2 Simple tree

Finally, we train and test simple trees.

fit <- rpart(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Model
             data = train_sample,                                   # Data source
             cp = 0.0001,                                           # Tree complexity
             maxdepth = 2                                           # Nb levels
)                                                                   # The model is then trained!
rpart.plot(fit)                                                     # Plot the tree

In the above code, we impose a very small complexity (cp) so that we can manually control the depth (and hence the complexity) of the tree. Given the trained parameters (splits), we test the model on the more recent data.

pred <- predict(fit, test_sample)           # Predictions
mean(pred)-mean(test_sample$FLR_Return)     # Bias
## [1] 0.008821996
var(pred)                                   # Variance
## [1] 0.009755871

The first number shows the error in average prediction while the second one measures the dispersion of predictions.
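
To put the two quantities in perspective, we can juxtapose them with the overall out-of-sample error. The short sketch below reuses pred and test_sample from the chunks above; it makes no claim of an exact decomposition and simply displays the three figures side by side.

mse <- mean((pred - test_sample$FLR_Return)^2)            # Out-of-sample mean squared error
bias2 <- (mean(pred) - mean(test_sample$FLR_Return))^2    # Squared bias (same figure as above)
c(MSE = mse, squared_bias = bias2, variance = var(pred))  # Side-by-side comparison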

1.3 A deeper tree

Below, we test the same configuration, but with a maximum depth of five (beyond five, it is hard to read the values inside the tree).

fit <- rpart(FLR_Return ~ Mkt_Cap + P2B + Vol_1M + D2E + Prof_Marg, # Model
             data = train_sample,                                   # Data source
             cp = 0.0001,                                           # Tree complexity
             maxdepth = 5                                           # Nb levels
)                                                                   # The model is then trained!
rpart.plot(fit)                                                     # Plot the tree

Graphically, the model is evidently more complex, and the predictions follow suit.

pred <- predict(fit, test_sample)
mean(pred)-mean(test_sample$FLR_Return) # Bias
## [1] -0.007871907
var(pred)                               # Variance
## [1] 0.01415666

The changes in the two values follow opposite paths. The magnitude of the bias is divided by roughly 1.12 (from 0.88% to 0.79%), while the variance is multiplied by 1.45 (from 0.98% to 1.42%). This is clear evidence of the variance-bias tradeoff.

3 Hyperparameters through time

3.1 Setup

In this section, we investigate the impact of time on the validity of hyperparameters. We split the sample into three parts:

Part     Training starts   Validation starts   Validation ends
Part 1   2000-01-01        2004-01-01          2006-01-01
Part 2   2005-01-01        2009-01-01          2011-01-01
Part 3   2010-01-01        2014-01-01          2016-01-01

The parts are lagged by five years. The training sample is four years long, while validation is performed over the following two years. Hence, each part requires six years of data. We will separate the validation stage from the testing stage.

train_start = c("2000-01-01", "2005-01-01", "2010-01-01")
val_start = c("2004-01-01", "2009-01-01", "2014-01-01")
test_start = c("2006-01-01", "2011-01-01", "2016-01-01")
stop_date = c("2009-01-01", "2014-01-01", "2019-01-01")
# Below, the parameters
nrounds <- c(5, 10, 20, 30, 50)         # nrounds values (number of boosting rounds)
lambda <- c(0.01, 0.1, 1, 10, 50)       # lambda values
pars <- expand.grid(nrounds, lambda)    # Exploring all combinations!
nrounds <- pars[,1]
lambda <- pars[,2]

Working with time series adds one dimension, hence we restrict our study to only two parameters: nrounds and lambda. This will ease the visual representation of the results. We fix the learning rate eta to a reasonable value of 0.5.

3.2 Validation

Because we want to compare outcomes at three different points in time, we work with the hit-ratio, whose average level does not depend on economic cycles (unlike average returns, for instance). Hence, we recycle the grid_par_hit() function built above. For simplicity, we resort to a simple loop over the three periods.
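
As a reminder (the exact definition appears in the earlier grid-search section, and the xgboost package is assumed to be loaded), grid_par_hit() may look like the following sketch: it trains a boosted tree with the given hyperparameters and returns the hit ratio on the validation sample.

grid_par_hit <- function(eta, nrounds, lambda,              # Hyperparameters (in pmap() order)
                         train_matrix,                      # Training data in xgb.DMatrix format
                         test_features, test_label){        # Validation features & labels
  fit <- xgb.train(data = train_matrix,                     # Train the boosted tree
                   eta = eta, nrounds = nrounds, lambda = lambda,
                   objective = "reg:squarederror", verbose = 0)
  pred <- predict(fit, xgb.DMatrix(data = test_features))   # Predictions on the validation sample
  mean(pred * as.matrix(test_label) > 0)                    # Hit ratio: prediction & return share the same sign
}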

grd <- c()       # Initiate output
for(i in 1:3){   # Loop over the parts
  # FIRST, WE PREPARE THE DATA
  train_features <- data_f %>%                               # Data
    filter(Date >= train_start[i], Date < val_start[i]) %>%  # Date filter
    select(features) %>% as.matrix()                         # Keep features
  train_label <- data_f %>%                                  # Data
    filter(Date >= train_start[i], Date < val_start[i]) %>%  # Date filter
    select(FLR_Return) %>% as.matrix()                       # Keep dependent variable
  train_matrix <- xgb.DMatrix(data = train_features, label = train_label)  # XGB format
  val_features <- data_f %>%                                 # Data
    filter(Date >= val_start[i], Date < test_start[i]) %>%   # Date filter
    select(features) %>% as.matrix()                         # Keep features
  val_label <- data_f %>%                                    # Data        
    filter(Date >= val_start[i], Date < test_start[i]) %>%   # Date filter
    select(FR_Return)                                        # Keep dep. variable: BIG TRICK (1M relative return, not the 12M one used for training)!
  # SECOND, WE COMBINE pmap() AND grid_par_hit() & AGGREGATE ITERATIVELY
  grd_temp <- pmap(list(0.5, nrounds, lambda),              # Parameters: eta fixed at 0.5, then nrounds & lambda
                   grid_par_hit,                            # Function of the grid search
                   train_matrix = train_matrix,             # Input for function: training data
                   test_features = val_features,            # Input for function: test features
                   test_label = val_label                   # Input for function: test labels (returns)
  )
  grd <- rbind(grd,                                                         # Output of previous part(s)
               data.frame(part = i, nrounds, lambda, hit = unlist(grd_temp))) # Output of latest part
}

Finally, we format and plot.

grd %>% 
  mutate_at(vars(part:lambda), as.factor) %>%  # Change numbers into factors for plotting
  ggplot(aes(x = part, y = hit)) + 
  geom_point() + theme_light() +
  ylim(0.48,0.6) +
  geom_hline(yintercept = 0.5, color = "grey") +
  facet_grid(nrounds ~ lambda)

grd %>% 
  mutate_at(vars(part:lambda), as.factor) %>%  # Change numbers into factors for plotting
  ggplot(aes(x = nrounds, y = hit, color = part)) + 
  geom_point() + theme_light() + 
  ylim(0.48,0.6) +
  geom_hline(yintercept = 0.5, color = "grey") +
  facet_grid(part ~ lambda)

We provide two alternative representations of the results. In the first one, the x-axis follows the chronology, which allows us to assess the stability of results over time. In the second one, parts are grouped by colour so that we can assess which choices would have been made in a ‘live’ situation.

From the first graph, we see that apart from lambda = 1, the values are relatively stable. It also shows that lambda = 0.01 is not the best choice we can make.

From the second graph, we infer the best choices at the three different periods:

  • in the first period (red points), the best choices are nrounds = 5-10 and lambda = 0.1-10. The hit ratio reaches 55% and is surrounded by values firmly above 52%.
  • in the second period (green points), the highest point is reached for nrounds = 5-10 and lambda = 0.1-10. The value is slightly below 55%.
  • finally, in the third period (blue points), the best choice is lambda = 10.

3.3 Test

For simplicity, we will work with constant parameters in what follows. A fully rigorous treatment would require that the validation be carried out at each period (month) of the backtest in order to choose the optimal hyperparameters. We posit that there is some stability in the performance of hyperparameter values and start by looking at the average values over the three periods.

In order to choose the parameters, we compute averages.

grd %>% group_by(nrounds, lambda) %>% 
  summarise(avg_hit = mean(hit)) %>%
  pivot_wider(names_from = "lambda", values_from = "avg_hit")
## `summarise()` has grouped output by 'nrounds'. You can override using the
## `.groups` argument.
## # A tibble: 5 × 6
## # Groups:   nrounds [5]
##   nrounds `0.01` `0.1`   `1`  `10`  `50`
##     <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
## 1       5  0.526 0.530 0.517 0.535 0.510
## 2      10  0.531 0.536 0.538 0.543 0.532
## 3      20  0.526 0.534 0.526 0.534 0.519
## 4      30  0.527 0.544 0.538 0.527 0.528
## 5      50  0.529 0.539 0.536 0.535 0.531

Three stable values for nrounds are 10, 20 and 30. The best choices for lambda are 0.1, 1 and 10.

To avoid look-ahead bias, we will only work with data posterior to the last validation point (December 2015). But since we have been working with a 12M forward dependent variable, we must shift the starting date of the test to January 2017. This removes any possible overlap!

Below, we recycle code from the session on Decision Trees and start with data preparation and initialisation.

tick <- unique(data$Tick)                     # Set of assets
sep_date <- as.Date("2017-01-01")             # This date separates in-sample vs out-of-sample
t_oos <- data$Date[data$Date > sep_date] %>%  # Out-of-sample dates (i.e., testing set)
  unique() %>%
  as.Date(origin = "1970-01-01")
Tt <- length(t_oos) - 1                                         # Nb of dates, avoid T = TRUE
nb_port <- 2                                                    # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick)))   # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port)           # Store returns

Below, we adjust the weight function, with the equally-weighted (EW) portfolio as benchmark. We will invest in the 15 stocks (i.e., the top half of the universe) with the highest predicted relative 12M outperformance.

weights_multi <- function(train_data, test_data, j){
  N <- test_data$Tick %>% n_distinct()                                       # Number of stocks
  features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg")              # Hard-coded
  if(j == 1){                                                                # j = 1 => EW
    return(rep(1/N,N))
  }
  if(j == 2){                                                                # j = 2 => tree-based
    # Below, we quickly format the data
    train_features <- train_data %>% select(features) %>% as.matrix()        # Indep. variable
    train_label <- train_data$FLR_Return                                     # Dependent variable
    train_matrix <- xgb.DMatrix(data = train_features, label = train_label)  # XGB format
    fit <- train_matrix %>% 
      xgb.train(data = .,                       # Data source (pipe input)
                eta = 0.6,                      # Learning rate
                objective = "reg:squarederror", # Number of random trees
                max_depth = 4,                  # Maximum depth of trees
                nrounds = 20,                   # Number of trees used
                lambda = 1,                     # Fixed value of hyperparameter
                gamma = 1                       # Fixed value of hyperparameter
      )
    xgb_test <- test_data %>%                   # Test sample => XGB format
      select(features) %>% 
      as.matrix() %>%
      xgb.DMatrix()
    
    pred <- predict(fit, xgb_test)              # Single prediction
    return((pred>median(pred))/N*2)             # Fifteen largest values, equally-weighted
  }
}
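
The last line of the tree-based branch deserves a word: (pred > median(pred)) flags the top half of the predictions, and the scaling by 2/N gives each selected stock a weight of 2/N, so that the weights sum to one. A quick sanity check with made-up predictions (pred_demo is purely illustrative):

pred_demo <- rnorm(30)                               # Hypothetical predictions for 30 stocks
w_demo <- (pred_demo > median(pred_demo)) / 30 * 2   # Top half at 2/N, bottom half at zero
sum(w_demo)                                          # Equals 1: 15 stocks weighted 2/30 each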

And we proceed to the backtesting loop.

train_window <- 365 * 6                               # We train on rolling windows of 6-1=5 years
for(t in 1:Tt){                                       # Loop on dates!
  train_data <- data_f %>% 
    filter(Date < t_oos[t] - 365,                     # One-year offset because FLR_Return looks 12 months ahead
           Date >= t_oos[t]- train_window)            # Rolling window !!!
  test_data <- data_f %>% filter(Date == t_oos[t])    # Current values
  realized_returns <- data %>%                        # Computing returns via:
    filter(Date ==  t_oos[t]) %>%                     # Filtering
    select(F_Return)                                  # Selecting
  for(j in 1:nb_port){                                     
    portf_weights[t,j,] <- weights_multi(train_data, test_data, j)  # Hard-coded params, beware!
    portf_returns[t,j] <- sum(portf_weights[t,j,] * realized_returns)
  }
}
apply(portf_returns, 2, mean) / apply(portf_returns, 2, sd)         # Sharpe ratios!
## [1] 0.2092178 0.2454613

The ML-driven strategy is well above the EW portfolio in terms of raw Sharpe ratio.
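
These figures are per-period (the rebalancing dates are monthly); a common convention, assuming independent monthly returns, is to annualise by multiplying by the square root of 12.

apply(portf_returns, 2, mean) / apply(portf_returns, 2, sd) * sqrt(12)  # Annualised Sharpe ratios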

Finally, below, we plot the evolution of the two portfolios.

plot_data <- (portf_returns+1) %>% apply(2,cumprod) # Values
plot_data <- rbind(1, plot_data)                    # Initiate at one
plot_data <- data.frame(t_oos, plot_data)           # Adding the date
colnames(plot_data) <- c("Date", "EW", "ML")        # Changing column names
plot_data %>%                                       # Final formatting & plot
  pivot_longer(-Date, names_to = "strategy", values_to = "value") %>%
  ggplot(aes(x = Date, y = value, color = strategy)) + 
  geom_line() + theme_light() + geom_point() +
  scale_color_manual(values = c("#EE6611", "#2299EE"))

4 Exercises

4.1 Variance-bias trade-off

Train a neural network with roughly 30 parameters and compute the bias and variance on a testing set. Perform the same operation on a network with 300 parameters and see the difference.
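
As a hint on sizing (one possible architecture among many, assuming the keras package is installed; any other neural network library works similarly): with the five features used above, a single hidden layer of four units already yields roughly 30 parameters.

library(keras)                                 # Assumed to be installed
model_small <- keras_model_sequential() %>% 
  layer_dense(units = 4, activation = "relu",  # 5*4 weights + 4 biases = 24 parameters
              input_shape = 5) %>% 
  layer_dense(units = 1)                       # 4*1 weights + 1 bias  =  5 parameters
summary(model_small)                           # 29 parameters in total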

4.2 Random Forests

Explore the sensitivity of performance of random forest using grid search for the two key parameters: ntree (number of trees in the forest) and mtry (number of variables). We recommend to:

  1. work on the large dataset (10 features)
  2. keep mtry between 4 and 8 and ntree between 10 and 200.

Don’t forget to change/update the feature space!
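
A possible starting point for the grid (a sketch only; the randomForest package and the exact values are assumptions, not prescriptions):

library(randomForest)                               # Assuming this package for the forests
rf_grid <- expand.grid(mtry  = c(4, 6, 8),          # Nb of variables sampled at each split
                       ntree = c(10, 50, 100, 200)) # Nb of trees in the forest
rf_grid                                             # 12 combinations to loop over (e.g., with pmap())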

4.3 Boosted tree: the forgotten parameters

Complete the grid search by incorporating nrounds and max_depth in the grid.
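
A minimal sketch of such an extended grid (illustrative values only; max_depth is the xgboost argument for the maximum tree depth):

pars_full <- expand.grid(nrounds   = c(10, 20, 30),   # Number of boosting rounds
                         lambda    = c(0.1, 1, 10),   # Penalisation intensity
                         max_depth = c(3, 4, 5))      # Maximum tree depth
nrow(pars_full)                                       # 27 combinations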