1 Introduction

This notebook is dedicated to the application of decision trees in (low-frequency) quantitative investment. It requires the following non-standard packages:

  1. rpart for simple trees
  2. rpart.plot for plotting simple trees
  3. randomForest for random forests
  4. xgboost for boosted trees

In the financial applications below, the choice of dependent variable is not optimal. It was chosen to minimise code complexity and ease the understanding of basic principles. Results may seem disappointing, but they are realistic! It takes experience and experimentation to make ML algorithms deliver good results. That’s up to you!

\(\rightarrow\) ABOUT THE NOTEBOOK:

  • the code chunks are sequential. They must be executed in the correct order.
  • text between stars appears in bold in html format.

1.1 Principle

First, we provide some intuition on how decision trees work, using the diamonds dataset. We show a first simple tree and then explain how it is built. We start by looking at the data.

if(!require(rpart)){install.packages(c("rpart", "rpart.plot"))}  # The packages for trees!
library(tidyverse)          # Data manipulation package
library(rpart)              # Tree packages
library(rpart.plot)         # Tree plots package
head(diamonds)              # Having a look at the diamonds database (embedded in ggplot2/tidyverse)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
summary(diamonds)           # Descriptive stats of relevant columns
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

Briefly, there are some very small diamonds (0.2 carat) and some very large ones (5.01 carats), and prices range from $326 to $18,823. Diamonds are mostly characterised by their cut, clarity and color, which are (ordered) categorical variables. We now turn to the first tree.

fit <- rpart(price ~ carat + color + clarity + cut,     # First tree: simple syntax!
             data = diamonds,                           # Data source
             maxdepth = 3,                              # Number of levels of splits 
             cp = 0.001)                                # Complexity parameter
rpart.plot(fit)                                         # Plot

We then provide elements of interpretation. First, the data is split in two: small diamonds (less than 1 carat, to the left of the tree) and big diamonds (more than 1 carat, to the right). The first group makes up 65% of the original sample, versus 35% for the other one. In the first group, the average price is 1633 while in the second one, it is 8142.

In the rpart representation, the splitting condition is written below the cluster that is being split. When the condition is true, the instances go to the cluster on the left; when it is false, they go to the cluster on the right.

Each group is then separately split in two. The first group (small diamonds) is again split according to carat weight:
- the very small diamonds (less than 0.63 carat, with average price of 1052 and representing 46% of the whole sample)
- the small diamonds (between 0.63 and 1 carat, with average price of 3059 and accounting for 19% of the whole sample)

Large diamonds are split in two:
- very large diamonds (more than 1.5 carats, 12% of sample, with average price of 12000)
- large diamonds (between 1 and 1.5 carats, 24% of sample, with average price of 6140)

The interesting part is that the last group is then split according to another variable: clarity. The large diamonds are homogeneous in size (no ‘big’ difference between 1 and 1.5 carats), hence quality comes into play:
- large low quality diamonds: I1 (stands for included, with many big imperfections), SI2, SI1 and VS2, have an average price of 5371
- large high quality diamonds: VS1, VVS1, VVS2 and IF (internally flawless) have an average price of 8535

Now the question is: how did the rpart() function grow this tree? Let’s start by looking at how to choose a splitting value for one variable. For simplicity, we work with the carat variable only: we will thus split the diamonds according to this variable: small diamonds versus large diamonds.

The core idea of decision trees is to create homogeneous clusters. Here, we want clusters of diamonds with prices that have the smallest possible dispersion. But let’s not forget that there have to be two clusters, so we must take into account the dispersion over both of them: \[V(c)= \underbrace{\sum_{i:carat> c} \left( p_i-\bar{p}_{carat > c} \right)^2}_{\text{large diamond cluster }(>c)}+ \underbrace{ \sum_{i:carat< c} \left( p_i-\bar{p}_{carat < c} \right)^2}_{\text{small diamond cluster }(<c)},\] where \(\bar{p}_{carat > c}\) is the average price of all diamonds with weight larger than \(c\) and \(\bar{p}_{carat < c}\) is the average price of the small diamond cluster. Hence, one natural way of choosing \(c\) is to pick the value that minimises \(V(c)\).

Below, we code a didactic version of the computation of V(c) - i.e., using a loop syntax.

total_var <- c()                                       # Initialization
threshold <- seq(0.3, 4, by = 0.1)                     # Defining carat values
for(i in 1:length(threshold)){                         # Loop over the threshold values
    small <- filter(diamonds, carat <= threshold[i])     # Diamonds below the threshold (in carat)
    var_small <- var(small$price) * nrow(small)          # Corresponding dispersion in diamond price
    large <- filter(diamonds, carat > threshold[i])      # Diamonds above the threshold (in carat)
    var_large <- var(large$price) * nrow(large)          # Corresponding dispersion in diamond price
    total_var[i] <- var_small + var_large                # Total variation
}
ggplot(data.frame(threshold, total_var)) + theme_light() + 
    geom_line(aes(x = threshold, y = total_var))         # Plot

Here we go! The minimum is indeed reached for carat = 1! This partly explains the first split in the first tree above. The full explanation requires an analysis of the cross-section of predictors. Indeed, why was carat chosen, and not cut, color or clarity?

The answer is that the above study was also carried out for these variables, but they were no match for carat.

We now introduce some additional mathematical notation. We are given a sample (\(Y_i\),\(X_i^{(j)}\)) of size \(N\), where \(Y_i\) is the predicted variable (the diamond price) and the \(X_i^{(j)}\) are the predictors: carat, color, cut and clarity. \(i\) is the index of the diamond and \(j\) the index of the predictor.

A regression tree seeks the splitting point as \(\underset{c}{\text{argmin}} \ V_N^{(j)}(c)\) with \[\begin{equation} V_N^{(j)}(c)= \sum_{X_i^{(j)}<c}\left(Y_i-m_N^{(j)-}(c) \right)^2 + \sum_{X_i^{(j)}>c}\left(Y_i-m_N^{(j)+}(c) \right)^2, \label{eqn:node} \end{equation}\] where \(m_N^{(j)-}(c)=\frac{1}{\#\{i,X_i^{(j)}<c\}}\sum_{X_i^{(j)}<c}Y_i\) and \(m_N^{(j)+}(c)=\frac{1}{\#\{i,X_i^{(j)}>c\}}\sum_{X_i^{(j)}>c}Y_i\) are the average values of \(Y\), conditional on \(X^{(j)}\) being smaller or larger than \(c\). The minimisation can be performed in two steps:

  1. find the optimal \(c\) for each individual predictor in the above equation,
  2. then choose the one that minimises \(V_N^{(j)}(c)\) globally over \(j\), i.e., over all variables.

This is how carat ends up as the winner splitting variable: it is the one that creates the most homogeneous clusters.
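
To make the two-step procedure concrete, here is a minimal sketch on the diamonds data (it mimics the logic, not the actual rpart internals). Ordered factors (cut, color, clarity) are replaced by their integer codes so that they can be scanned with numeric thresholds, and best_split() is a small helper introduced here only for illustration: for one predictor, it returns the candidate threshold with the lowest total variation.

best_split <- function(x, y){                            # Step 1: best threshold for one predictor
    pts <- quantile(x, probs = seq(0.05, 0.95, by = 0.05)) %>% unique()   # Candidate thresholds
    v <- sapply(pts, function(thr){                      # Total variation for each candidate
        var(y[x <= thr]) * sum(x <= thr) + var(y[x > thr]) * sum(x > thr)
    })
    c(threshold = unname(pts[which.min(v)]), total_var = min(v, na.rm = TRUE))
}
diam <- diamonds %>% mutate(across(c(cut, color, clarity), as.integer))   # Ordered factors -> codes
sapply(c("carat", "cut", "color", "clarity"),            # Step 2: compare the best splits across predictors
       function(j) best_split(diam[[j]], diam$price))

The column with the smallest total_var identifies the winning variable; carat should come out on top, in line with the tree plotted above.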

This process is re-iterated at each node of the tree until sufficient precision is reached or until the maximum depth requirement is satisfied. An additional complexity parameter cp can also be used to determine how far the algorithm should go.
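
To get a feel for the role of cp (rpart attempts a split only if it improves the overall fit by at least a fraction cp), here is an illustrative sketch; fit_loose and fit_tight are new object names used only for this comparison.

fit_loose <- rpart(price ~ carat + color + clarity + cut, data = diamonds, cp = 0.001)  # Permissive
fit_tight <- rpart(price ~ carat + color + clarity + cut, data = diamonds, cp = 0.05)   # Strict
sum(fit_loose$frame$var == "<leaf>")    # Number of leaves with a low cp
sum(fit_tight$frame$var == "<leaf>")    # Number of leaves with a high cp
printcp(fit_tight)                      # Complexity table: the gain in fit associated with each split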

1.2 Application to financial data

We now turn to the construction of trees on our small financial dataset.

We start by looking at the conditional average of future returns, given the market capitalisation. Conditional averages can be useful for detecting patterns. For instance, if the variable of interest is monotonic in the predictor, this probably means that the predictor is valuable.

The highly non-linear nature of trees nonetheless makes it difficult to discard predictors. Indeed, a predictor may be very important inside one particular lower-level cluster, even if it seems uninformative at the top of the tree.

load("data.RData")                                        # Load the data
data <- data %>% arrange(Date,Tick)                       # Just making sure all is in order
tick <- unique(data$Tick)                                 # Set of assets
data <- data  %>% 
    select(-ESG_rank) %>%                                   # Remove ESG: too few points
    group_by(Tick) %>%                          
    mutate(F_Return = lead(Close) / Close - 1) %>%          # Adding forward/future returns
    na.omit() %>%                                           # Take out missing data
    ungroup()

data %>% ggplot(aes(x = Mkt_Cap, y = F_Return)) + #geom_point() +
    geom_smooth() + theme_light()                           # Plot

Problem: scale effects. There are very few points with capitalisations above 500B$, thus resulting in high uncertainty (shown by the grey area around the blue curve).
Solution: we normalise the characteristics.

normalise <-  function(v){  # This is a function that 'uniformises' a vector
    v <- v %>% as.matrix()
    return(ecdf(v)(v))
}

data_f <- data %>%                        # Data with normalised values
    dplyr::select(-Close) %>%               # Closing prices are not predictors
    group_by(Date) %>%                      # Normalisation takes place for each given date
    mutate_if(is.numeric, normalise) %>%    # Apply normalisation on numerical columns
    ungroup() %>%                           # Ungroup
    arrange(Date, Tick)                     # Just making sure the order is still the same as before
data_f$F_Return <- data$F_Return          # We retrieve the original returns: better for interpretation

data_f %>% ggplot(aes(x = Mkt_Cap, y = F_Return)) + 
    geom_smooth() + theme_light()

data_f %>% ggplot(aes(x = D2E, y = F_Return)) + 
    geom_smooth() + theme_light()

The pattern is quite different in this case. The overall trend is clearly descending, as the Size anomaly would suggest, even though our very small sample can by no means support or contradict this stylised fact.

It’s not obvious to the eye which binary partition will be the best (i.e., where to set the cursor in terms of Mkt_Cap to find two homogeneous clusters). Just like for diamonds, we must resort to brute force and span all possibilities.

Below, we create 91 partitions depending on an increasing threshold. In each partition, there will be two groups: the small caps and the large caps. The small (resp. large) caps are the firms with capitalisation score below (resp. above) the threshold. What we are interested in is the dispersion of future returns inside each group and more importantly in the sum of the dispersions of the two groups (just like we did for diamonds).

While we could develop a dedicated function to perform this task, we choose to keep the very old-fashioned loop approach.

total_var <- c()
threshold <- (5:95)/100    # Looking at all percents from 5% to 95%
for(i in 1:length(threshold)){
    small <- filter(data_f, Mkt_Cap <= threshold[i])     # Firms below the threshold (in Mkt_Cap)
    var_small <- var(small$F_Return) * nrow(small)       # Corresponding dispersion in future return
    large <- filter(data_f, Mkt_Cap > threshold[i])      # Firms above the threshold (in Mkt_Cap)
    var_large <- var(large$F_Return) * nrow(large)       # Corresponding dispersion in future return
    total_var[i] <- var_small + var_large                # Total variation
}
ggplot(data.frame(threshold, total_var)) + theme_light() +
    geom_line(aes(x = threshold, y = total_var))         # Plot

What this graph says is that the highest level of intra-cluster homogeneity is obtained when splitting on Mkt_Cap around 0.48.
Now, we can perform this exercise on the other predictors. Below, we create a function that tests all of them and plots the result.

pts <- (5:95)/100                                               # Splitting points
features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg")   # Predictor list
tot_var <- function(indep, dep, points, data){
    total_var <- c()
    for(i in 1:length(points)){
        low <- filter(data, (!!sym(indep)) <= points[i])        # Firms below the threshold 
        var_low <- var(select(low,dep)) * nrow(low)
        high <- filter(data, (!!sym(indep)) > points[i])        # Firms above the threshold 
        var_high <- var(select(high,dep)) * nrow(high)
        total_var[i] <- var_low + var_high
    }
    return(total_var)
}
tot_var_res <- map(features, tot_var, points = pts, dep = "F_Return", data = data_f) %>%
    unlist() %>%
    matrix(ncol = length(features)) %>%
    data.frame(split = pts)
colnames(tot_var_res) <- c(features, "Split")

tot_var_res %>% 
    pivot_longer(names_to = "Feature", values_to = "Total_variation", -Split) %>%
    ggplot(aes(x = Split, y = Total_variation, color = Feature)) + 
    theme_light() + geom_line() # Plot

Clearly, the patterns are not as marked as for Mkt_Cap. For some variables (e.g., Prof_Marg), the impact of the split location is minuscule. Hence, if we had to choose only one splitting variable, Mkt_Cap would be the best pick.
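
The winner can also be read off numerically rather than from the graph: the sketch below extracts the minimising threshold of each feature (it assumes a recent version of dplyr that provides slice_min()).

tot_var_res %>% 
    pivot_longer(names_to = "Feature", values_to = "Total_variation", -Split) %>%
    group_by(Feature) %>% 
    slice_min(Total_variation, n = 1) %>%    # Best split point for each feature
    ungroup() %>% 
    arrange(Total_variation)                 # The overall winner comes first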

As we underlined above, in order to build the best possible tree, the algorithm needs to perform such a study for every variable (predictor) and pick the one that achieves the smallest total dispersion in future returns.

1.3 A first financial tree

We pursue our use of the package rpart to build and plot decision trees. We compare two trees: one is built on the original dataset and the other on the normalised data.

fit <- data %>%                     # Original data
    select(-Tick, -Date, -Close) %>%  # Take out the non-predictive variables
    rpart(F_Return ~ .,               # Model: F_Return as a function of all other variables
          cp = 0.001,                 # Complexity: lower means more nodes/levels
          maxdepth = 2,               # Maximum depth (nb levels) of the tree
          data = .)                   # Data source: from the pipe
rpart.plot(fit)                     # Plot

Again, the most important part is the interpretation. The cluster information is synthesised in the rounded rectangles, which display two figures: the first is the average value of the dependent variable (the one we want to forecast) and the second is the proportion of the sample inside the cluster. The cluster at the top is the whole sample.

The bold text details the splitting variables and levels; it is displayed just below the cluster that it divides in two. The first split is made according to the price-to-book (P2B) variable at the level 0.23. The instances with P2B larger than or equal to 0.23 go to the cluster on the left and those with P2B strictly smaller than 0.23 go to the cluster on the right.
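
The same splits can also be printed as plain-text rules, which is sometimes easier than reading the plot (a one-line sketch; rpart.rules() is available in recent versions of rpart.plot).

rpart.rules(fit)    # One line per leaf: fitted value, coverage and the splitting rules that lead to it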

Sadly, this choice creates two very imbalanced clusters: one with almost all the data (left) and one with a handful of instances (right). Looking at what happens is instructive (see below): the split captures a micro-phenomenon related to two banks, Citigroup and Bank of America, namely their rebound after the subprime crisis. There are only 8 observations in this set, which is a perfect illustration of the variance-bias trade-off: the bias is small (we stick to the data), but the variance is huge. This is the problem with non-normalised data: it allows the formation of clusters with very high average values (a 0.3 return!) which will not be meaningful in terms of out-of-sample generalisation.

data %>% filter(P2B < 0.23)
## # A tibble: 8 × 9
##   Tick  Date       Close Vol_1M Mkt_Cap   P2B   D2E Prof_Marg F_Return
##   <chr> <date>     <dbl>  <dbl>   <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1 BAC   2009-01-30  5.73  178.   41837. 0.206  390.    -11.4   -0.400 
## 2 BAC   2009-02-27  3.44  237.   25115. 0.124  390.    -11.4    0.731 
## 3 C     2009-02-27 13.2   244.    8175. 0.101  657.   -106.     0.687 
## 4 BAC   2009-03-31  5.96  212.   43657. 0.229  387.     11.9    0.309 
## 5 C     2009-03-31 22.3   269.   13855. 0.176  512.      4.95   0.206 
## 6 C     2009-04-30 26.8   142.   16814. 0.212  512.      4.95   0.220 
## 7 C     2009-06-30 26.1    39.7  16373. 0.184  474.     11.6    0.0673
## 8 C     2009-07-31 27.9    71.7  35954. 0.197  474.     11.6    0.577

Thus, below, we perform the same analysis on the formatted/normalised data. We also add one layer to the tree.

fit <- data_f %>%                 # Normalised data
    select(-Tick, -Date) %>%        # Take out the non-predictive variables
    rpart(F_Return~.,               # Model: F_Return as a function of all other variables
          cp = 0.001,               # Complexity: lower means more nodes/levels
          maxdepth = 4,             # Maximum depth (nb levels) of the tree
          data = .)                 # Data source: from the pipe
rpart.plot(fit)                   # Plot!

In this case, the initial split is much more reasonable, as the two resulting groups encompass 53% and 47% of the original sample, respectively. We are not surprised by the splitting value of 0.48 for Mkt_Cap, as we calculated it just above. Since the dependent variable is exactly the future return, the values in the rounded rectangles show the average return in each cluster.

The average values are coded with color intensity: white (very light blue) for low values to blue for high values. Note that even after the normalisation of features, some very small clusters persist: the second split leads to one such group (the one to the right).

Finally, the scheme shows the highly nonlinear nature of the tree: the first three splits are performed via three different variables.

1.4 Variable importance

Machine learning tools are often referred to as black boxes but simple decision trees are quite transparent. In order to characterise the way they work, researchers have introduced metrics that evaluate the preponderance of variables in the models, i.e., in the prediction process - once the models have been trained (in the case of trees, we refer to Atkinson and Therneau (2015) for a deeper treatment). Each time a variable is chosen for a split, it reduces the (training) loss generated by the model. The variable importance sums these gains for each variable.

Below, we show how to extract such information using rpart. We recycle the last model that was trained.

fit$variable.importance                              # Immediate values
## Prof_Marg       D2E   Mkt_Cap       P2B    Vol_1M 
## 0.3450010 0.2442460 0.2408473 0.1479721 0.1320031
y <- fit$variable.importance                          # Named vector, sorted by decreasing importance
tree_importance <- data.frame(features = names(y),    # Take the names from the model: the order differs from 'features'!
                              importance = y/sum(y),  # Normalised importance
                              model = "Simple tree")  # Model name  
tree_importance %>%                                  # Graph
    ggplot(aes(x = features, y = importance)) + 
    geom_col() + theme_light()

Given our sample, Prof_Marg is the most important variable in the simple tree model, with D2E and Mkt_Cap roughly tied just behind it.

NOTE: often, variable importance only makes sense if the features have the same scale. This is yet another incentive for pre-processing and rescaling.

1.5 Predictions

Finally, for the sake of completeness, we provide the syntax for predictions below. We again rely on the previously trained model.

test <- data_f[c(1,1000,3333),]    # Test sample (in sample, though)
predict(fit, test)                 # Prediction
##          1          2          3 
## 0.01007559 0.01007559 0.03057492

This means that, according to the model’s forecast, these three (in-sample) test instances are expected to have returns slightly above zero.

1.6 Classification

Trees also work well for classification tasks. The error is no longer measured with the MSE but with impurity scores: the goal is to obtain clusters that are as pure as possible. Several choices are possible, and they allow one to fine-tune the relative importance of errors: the misclassification error, the Gini index and the cross-entropy are common examples. Below, we show how to implement trees on categorical data; R handles this so smoothly that the syntax barely changes.
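
As a tiny illustration of these impurity measures, the snippet below computes them for a hypothetical cluster with class proportions p; the purer the cluster, the smaller all three measures.

p <- c(0.7, 0.2, 0.1)       # Toy class proportions inside one cluster
1 - max(p)                  # Misclassification error
sum(p * (1 - p))            # Gini index
-sum(p * log(p))            # Cross-entropy (natural logarithm)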

data_c <- data_f %>% mutate(FC_Return = F_Return > 0)   # Binary categories: + vs - returns
fit <- rpart(FC_Return ~ . - Tick - Date - F_Return,    # The model: remove the irrelevant variables
             method = "class",                          # Classificartion task
             data = data_c,                             # Data source
             cp = 0.001,                                # Complexity
             maxdepth = 4                               # Maximum number of levels
)
rpart.plot(fit)

In the above syntax, ‘method’ can be omitted, in which case the function guesses it from the type of the dependent variable. For a regression analysis, enforce method = “anova”.

The only difference with regression trees is that the first number within the rounded rectangles is the proportion of items that are TRUE, i.e., with a positive future return. Note that the variable is not perfectly balanced in the whole sample, since 55% of instances fall in this category. Again, the first split is performed on the Mkt_Cap variable, but the secondary splits involve other features. The colour intensity of each cluster codes its purity.

predict(fit, data_c[c(1, 1000, 3333),])
##       FALSE      TRUE
## 1 0.4122939 0.5877061
## 2 0.5735294 0.4264706
## 3 0.4122939 0.5877061
predict(fit, data_c[c(1, 1000, 3333),], type = "vector")
## 1 2 3 
## 2 1 2

By default, the prediction yields the probabilities associated with each class; with type = "vector", it returns the index of the most likely class.

Below, we show an example with three classes.

data_c2 <- data_f %>% 
    mutate(FC_Return = if_else(F_Return > 0.02, "Buy", 
                               if_else(F_Return < -0.02, "Sell", "Hold")))   
fit <- rpart(FC_Return ~ . - Tick - Date - F_Return,    # The model: remove the irrelevant variables
             method = "class",                          # Classificartion task
             data = data_c2,                            # Data source
             cp = 0.001,                                # Complexity
             maxdepth = 4                               # Maximum number of levels
)
rpart.plot(fit)

predict(fit, data_c[c(1, 1000, 3333),])
##         Buy      Hold      Sell
## 1 0.4676949 0.2263443 0.3059608
## 2 0.3785311 0.1581921 0.4632768
## 3 0.4676949 0.2263443 0.3059608
predict(fit, data_c[c(1, 1000, 3333),], type = "vector")
## 1 2 3 
## 1 3 1

1.7 Practical discussion

For decision trees, the maximum number of splitting levels (i.e., the depth) is usually around 6 (corresponding to at most 2^6 = 64 clusters). When the investment universe comprises many assets (say 1,000), all instances are allocated to one of these 64 groups (assuming the tree does indeed grow 64 leaves). Some leaves will host a handful of instances and others a lot, but all stocks falling in the same leaf share the same prediction. This is not practical in terms of investment decisions: typically, if the portfolio policy sets a fixed number of assets for diversification purposes, this will cause problems because the cluster sizes are dictated by the tree.
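
To illustrate this forecast-sharing effect on our data, the sketch below fits a deeper tree (fit_deep is a new, purely illustrative model) and counts the number of distinct forecasts it can produce.

fit_deep <- data_f %>% 
    select(-Tick, -Date) %>% 
    rpart(F_Return ~ ., data = ., maxdepth = 6, cp = 0.0001)   # A deeper tree than before
pred <- predict(fit_deep, data_f)       # One forecast per instance
length(pred)                            # Number of instances
length(unique(pred))                    # Number of distinct forecasts: at most one per leaf (<= 2^6)

All the instances that fall into the same leaf receive exactly the same forecast, whatever the size of the universe.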

In the next section, we present enhancements and generalisations of trees that help solve these issues and that are also known to perform better in prediction exercises.

2 Extensions

Decision trees have been extended and improved in two directions related to ensemble learning: the idea of combining learning/prediction tools in order to diversify (and reduce) the sources of error.

2.1 Random forests

Random forests, as their name suggests, are compilations of decision trees built on randomly chosen predictors (and observations). In our sample, we have 5 predictors (Mkt_Cap, P2B, Vol_1M, D2E, Prof_Marg). The idea of random forests is to grow many trees based on subsets of these 5 variables. Obviously, when the initial number of predictors is small, all combinations could be tested, but with hundreds of factors this becomes infeasible. For each instance, the final output is a mix (average or voting decision) of the outputs of all the trees in the forest.

We provide a simple/compact example of implementation below. We show the syntax using the randomForest package for prediction and variable importance.

if(!require(randomForest)){install.packages("randomForest")}
library(randomForest)
set.seed(42)                      # Fixing random seed
fit <- data_f %>%                 # Normalised data
    select(-Tick, -Date) %>%        # Take out non predictive variables
    randomForest(F_Return~.,        # Same formula as for simple trees!
                 data = .,          # Data source (pipe input)
                 ntree = 12,        # Number of random trees
                 mtry = 4,          # Number of predictive variables used for each tree
                 replace = FALSE,   # All different instances
                 sampsize = 3000    # Training sample size for each tree 
    )
predict(fit, test)                # Prediction
##           1           2           3 
## -0.05031840 -0.03475235  0.22706953
fit$importance                    # Variable importance
##           IncNodePurity
## Vol_1M         3.274516
## Mkt_Cap        3.004441
## P2B            2.818915
## D2E            2.576378
## Prof_Marg      2.896019
RF_importance <- data.frame(features = rownames(fit$importance),  # We store these results for later on (names from the model)
                            importance = fit$importance[,1]/sum(fit$importance),
                            model = "Random Forest")  # Model name

Both the prediction and the variable importances are rather different from those of the simple tree. Note that random forests can provide suboptimal results because they give the same weight to all trees, even though some are more relevant than others. When a tree is built on variables that are not great predictors, its forecasts will perform poorly, yet they carry the same weight as those of the more consistent trees. Boosted trees, which are presented below, do not suffer from this drawback.
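
To make the equal weighting visible, the forecasts of the individual trees can be extracted and averaged by hand (a sketch; predict.all = TRUE is an option of predict() for randomForest objects).

pred_all <- predict(fit, test, predict.all = TRUE)  # Aggregate + per-tree forecasts
pred_all$individual[, 1:4]                          # Forecasts of the first 4 trees for the 3 test instances
rowMeans(pred_all$individual)                       # Equal-weight average over the 12 trees...
pred_all$aggregate                                  # ...which is exactly the forest's prediction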

Let’s have a look at classification.

set.seed(42)                             # Fixing random seed
fit_c <- data_c %>%                      # Normalised data
    select(-Tick, -Date, -F_Return) %>%  # Take out non predictive variables
    mutate(FC_Return = as.factor(FC_Return)) %>%
    randomForest(FC_Return~.,            # Same formula as for simple trees!
                 data = .,               # Data source (pipe input)
                 ntree = 12,             # Number of random trees
                 mtry = 4,               # Number of predictive variables used for each tree
                 replace = FALSE,        # All different instances
                 sampsize = 3000         # Training sample size for each tree 
    )
predict(fit_c, test)                   # Prediction
##     1     2     3 
##  TRUE FALSE  TRUE 
## Levels: FALSE TRUE
predict(fit_c, test, type = "prob")    # Prediction
##        FALSE      TRUE
## 1 0.25000000 0.7500000
## 2 0.66666667 0.3333333
## 3 0.08333333 0.9166667
## attr(,"class")
## [1] "matrix" "array"  "votes"
fit_c$importance                       # Variable importance
##           MeanDecreaseGini
## Vol_1M            366.4950
## Mkt_Cap           256.0144
## P2B               283.9017
## D2E               247.1468
## Prof_Marg         299.1531

With type = "prob", the predictions are the probabilities of a negative (FALSE) versus positive (TRUE) future return; the default output is the predicted class.

2.2 Boosted trees: regression

While random forests aggregate trees in a naive fashion, boosted trees are constructed to minimise the training error: each new tree is added/constructed in an optimal way. In this section, we will work with one such approach called eXtreme Gradient boosting (XGBoost). We refer to the original paper for details (Chen and Guestrin (2016)), as well as to the didactic explanations of https://xgboost.readthedocs.io/en/latest/tutorials/model.html

The full derivation for quadratic loss is given in the last section of the notebook.
Below, we provide an example of how xgboost can be used.

In contrast with the previous tools, additional formatting is required for xgboost (with functions provided by the package).

if(!require(xgboost)){install.packages("xgboost")}                      # Install package
library(xgboost)                                                        # Load package
train_features <- data_f %>% select(features) %>% as.matrix()           # Independent variable
train_label <- data_f$F_Return                                          # Dependent variable
train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format!

And here we go!

fit <- train_matrix %>% 
    xgb.train(data = .,                       # Data source (pipe input)
              eta = 0.4,                      # Learning rate
              objective = "reg:squarederror", # Objective function
              max_depth = 4,                  # Maximum depth of trees
              lambda = 0.1,                   # Penalisation of leaf values
              gamma = 0.1,                    # Penalisation of number of leaves
              subsample = 0.8,                # Proportion of sample used for new trees
              colsample_bytree = 0.8,         # Proportion of columns used
              nrounds = 50,                   # Number of trees used
              verbose = 0                     # No comment
    )
xgb_test <- test %>%                        # Test sample => XGB format
    select(features) %>% 
    as.matrix() %>%
    xgb.DMatrix()

predict(fit, xgb_test)                      # Single prediction
## [1]  0.017173918 -0.007077186  0.048499655

We refer to the derivation in the appendix for details on the role of eta, lambda and gamma.
The predicted values are different from those of the first two models! Variable importance is also supported in xgboost, as we show below. Be careful: if the gamma penalisation is too harsh, no tree will be built and the variable importance will fail (variables do not matter in this case: the model is constant and equal to the sample average).
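
Before moving on to variable importance, here is a quick illustrative check of the role of eta (fit_slow is a new model trained only for this comparison): with the number of trees held fixed, a smaller learning rate fits the training sample less aggressively.

fit_slow <- xgb.train(data = train_matrix, eta = 0.1,              # Lower learning rate, other settings as above
                      objective = "reg:squarederror", max_depth = 4,
                      lambda = 0.1, gamma = 0.1, subsample = 0.8,
                      colsample_bytree = 0.8, nrounds = 50, verbose = 0)
sqrt(mean((predict(fit, train_matrix) - train_label)^2))           # In-sample RMSE with eta = 0.4
sqrt(mean((predict(fit_slow, train_matrix) - train_label)^2))      # In-sample RMSE with eta = 0.1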

importance <- xgb.importance(model = fit)           # Variable importance
XGB_importance <- data.frame(features = importance$Feature,  # Same as above (names from the model, sorted by gain)
                             importance = importance$Gain,
                             model = "XGBoost")      # Model name
Importance <- rbind(tree_importance, RF_importance, XGB_importance)       # All models!
Importance %>% ggplot(aes(x = features, y = importance, fill = model)) +  # Plot
    geom_col(position = "dodge") + theme_light()

For the simple tree, variable importance is fairly concentrated, while for the random forest it is more uniformly distributed (which makes sense mechanically, since each tree only uses a subset of the predictors). The boosted tree lies somewhere in between.

For a detailed account of the richness of possibilities/parameters of xgboost, we recommend the documentation page: https://xgboost.readthedocs.io/en/latest/parameter.html

2.3 Boosted trees: classification

Below, we show the syntax to perform classification tasks with xgboost. If the dependent variable is an ordered factor, the simplest approach is to translate the categories into the numbers 0, 1, 2, etc. (starting at zero is important). We take the more general, unordered approach here, even though we arguably should not: performance is naturally ordered, and monotonic responses matter for the structure of the model.

Thus, we set a threshold and build 3 categories.

r_thresh <- 0.02                                                        # Return threshold
data_f$FC_Return <- data_f$F_Return                                     # Duplicate return
data_f$FC_Return[data_f$F_Return <= (-r_thresh)] <- 0                   # Low return
data_f$FC_Return[(data_f$F_Return > (-r_thresh)) & (data_f$F_Return < r_thresh)] <- 1 # Soso return
data_f$FC_Return[data_f$F_Return >= r_thresh] <- 2                      # High return
train_features <- data_f %>% select(features) %>% as.matrix()           # Independent variable
train_matrix <- xgb.DMatrix(data = train_features,                      # XGB matrix format!
                            label = data_f$FC_Return,
                            silent = TRUE) 

In any case, the label must be converted into a numerical format. When dealing with categorical labels, they have to be exactly 0, 1, 2, etc. The difference with regression analysis lies in the definition of the objective function. There are many objectives coded in xgboost; we list a few of them:

  • for regression analysis, we recommend reg:squarederror, used above (reg:linear is its deprecated alias)
  • for binary classification, the default choice is usually binary:logistic
  • for multiclass classification, multi:softprob yields a vector with the probabilities for each class while multi:softmax yields only the class with maximum probability.
fit <- train_matrix %>% 
    xgb.train(data = .,                       # Data source (pipe input)
              eta = 0.3,                      # Learning rate
              objective = "multi:softprob",   # Objective function
              num_class = 3,                  # Number of classes
              max_depth = 4,                  # Maximum depth of trees
              nrounds = 10,                   # Number of trees used
              verbosity = 0                   # No warning message
    )
xgb_test <- test %>%                        # Test sample => XGB format
    select(features) %>% 
    as.matrix() %>%
    xgb.DMatrix

predict(fit, xgb_test) %>%                   #Predictions
    matrix(data = ., nrow = nrow(xgb_test), byrow = TRUE)
##           [,1]      [,2]      [,3]
## [1,] 0.3525243 0.1776711 0.4698047
## [2,] 0.4430076 0.1766661 0.3803263
## [3,] 0.2909596 0.2133605 0.4956798

Since we chose multi:softprob as objective function, the output gives the probabilities for each class.
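
If single labels are needed instead of probabilities, the most likely class can be recovered from the matrix above (a short sketch; the 'low'/'soso'/'high' labels simply mirror the 0/1/2 coding defined earlier).

prob <- predict(fit, xgb_test) %>% 
    matrix(nrow = nrow(xgb_test), byrow = TRUE)   # One row per instance, columns = classes 0, 1, 2
c("low", "soso", "high")[max.col(prob)]           # Label of the most likely class for each instance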

3 Portfolios with trees

We now turn to applications of tree-based methods in portfolio construction. We proceed in three steps:

  1. Definition of weight function (core of allocation process)
  2. Formatting of data (and output initialisation)
  3. Backtesting loop

When the number of assets is large, simple regression trees may not be a good solution because the number of leaves will be smaller than the number of assets. Hence, several (possibly many) stocks will share the same forecast. This is of course not a problem for categorical data, but it can cause numerical problems when working with continuous variables (typically: portfolio sizes). Hence, we directly work with boosted trees, which solves this issue.

3.1 Data preprocessing

Just as in the previous sessions, we start by formatting the data and creating all necessary variables.

sep_date <- as.Date("2008-01-01")             # This date separates in-sample vs out-of-sample
t_oos <- data$Date[data$Date > sep_date] %>%  # Out-of-sample dates (i.e., testing set)
    unique() %>%
    as.Date(origin = "1970-01-01")
Tt <- length(t_oos) - 1                                         # Nb of dates, avoid T = TRUE
nb_port <- 2                                                    # Nb of portfolios
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick)))   # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port)           # Store returns

3.2 The weight function

We now turn to the definition of the weighting scheme. We wish to compare a tree-based strategy to the usual equally-weighted (EW) portfolio. We use the xgboost algorithm to forecast future returns and form an EW portfolio of the ten stocks with the highest predictions (or equivalently, the top 33%, because the sample has 30 stocks).

weights_multi <- function(train_data, test_data, j){
    N <- test_data$Tick %>% unique() %>% length() # Number of stocks
    features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg") # Hard-coded, BEWARE!
    if(j == 1){                             # j = 1 => EW
        return(rep(1/N,N))
    }
    if(j == 2){                             # j = 2 => tree-based
        # Below, we quickly format the data
        train_features <- train_data %>% select(features) %>% as.matrix()       # Indep. variable
        train_label <- train_data$F_Return                                      # Dependent variable
        train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
        fit <- train_matrix %>% 
            xgb.train(data = .,                       # Data source (pipe input)
                      eta = 0.5,                      # Learning rate
                      objective = "reg:squarederror", # Number of random trees
                      max_depth = 4,                  # Maximum depth of trees
                      nrounds = 15                    # Number of trees used
            )
        xgb_test <- test_data %>%                   # Test sample => XGB format
            select(features) %>% 
            as.matrix() %>%
            xgb.DMatrix()
        
        pred <- predict(fit, xgb_test)              # Single prediction
        w <- pred > quantile(pred, 2/3)
        return(w / sum(w))                          # Ten largest values, equally-weighted
    }
}

3.3 Dynamic allocation

We are ready to launch the backtesting loop.

for(t in 1:(length(t_oos)-1)){                        # Stop before the last date: no fwd return!
    if(t%%12==0){print(t_oos[t])}                       # Just checking the date status
    train_data <- data_f %>% filter(Date < t_oos[t])    # Expanding window: all past values!!!
    test_data <- data_f %>% filter(Date == t_oos[t])    # Current values
    realized_returns <- data %>%                        # Computing returns via:
        filter(Date ==  t_oos[t]) %>%                     # Filtering
        select(F_Return)                                  # Selecting
    for(j in 1:nb_port){                                     
        portf_weights[t,j,] <- weights_multi(train_data, test_data, j)  # Hard-coded params, beware!
        portf_returns[t,j] <- sum(portf_weights[t,j,] * realized_returns)
    }
}
## [1] "2008-12-31"
## [1] "2009-12-31"
## [1] "2010-12-31"
## [1] "2011-12-30"
## [1] "2012-12-31"
## [1] "2013-12-31"
## [1] "2014-12-31"
## [1] "2015-12-31"
## [1] "2016-12-30"
## [1] "2017-12-29"
## [1] "2018-12-31"
## [1] "2019-12-31"
## [1] "2020-12-31"

And finally, a brief look at relative performance.

apply(portf_returns, 2, mean)                               # Average monthly return
## [1] 0.00952691 0.01179321
apply(portf_returns, 2, sd)                                 # Monthly volatility
## [1] 0.04694945 0.05411423
apply(portf_returns, 2, mean) / apply(portf_returns, 2, sd) # Sharpe ratio
## [1] 0.2029185 0.2179318
t_test <- t.test(portf_returns[,2]-portf_returns[,1])       # t-test
t_test
## 
##  One Sample t-test
## 
## data:  portf_returns[, 2] - portf_returns[, 1]
## t = 1.488, df = 155, p-value = 0.1388
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  -0.0007422896  0.0052748976
## sample estimates:
##   mean of x 
## 0.002266304

The two strategies are quite comparable: a t-stat of 1.488 is not statistically significant. This is not surprising, given the small differences in the (above) classical indicators. The boosted portfolio has a higher average return, but also a higher volatility. All in all, it does not strongly improve the allocation process.
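
A complementary (purely illustrative) way to compare the two strategies is to plot the cumulative value of one dollar invested in each of them; 'EW' and 'XGB' are labels we assign here to portfolios 1 and 2.

perf <- apply(1 + portf_returns, 2, cumprod) %>%      # Compounding the monthly returns
    data.frame(Date = t_oos[1:Tt])                    # Adding the corresponding dates
colnames(perf) <- c("EW", "XGB", "Date")              # Portfolio 1 = EW, portfolio 2 = tree-based
perf %>% 
    pivot_longer(-Date, names_to = "Strategy", values_to = "Value") %>%
    ggplot(aes(x = Date, y = Value, color = Strategy)) + 
    geom_line() + theme_light()                       # Plot: value of $1 invested in each strategy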

4 Exercises

  1. Try to perform a ternary classification prediction with randomForest (for instance on the sign of the future return).

Use the reference guide if need be: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf

  2. Build a simple tree-based strategy that invests only in stocks with
  • positive return or
  • a return above the cross-sectional median or mean

Reminder: one important degree of freedom is the horizon on which this return is computed.

5 APPENDIX: boosted trees

In this section, we provide more details on how xgboost works. We assume a quadratic loss function.

5.1 Objective function

What we seek to minimise is
\[O=\underbrace{\sum_{i=1}^I \text{loss}(y_i,\tilde{y}_i)}_{\text{error term}}+ \underbrace{\sum_{j=1}^J\Omega(T_j)}_{\text{regularisation term}}\] The first term (over all instances) measures the distance between the true label and the output from the model. The second term (over all trees) penalises models that are too complex.

For simplicity, we will work with \(\text{loss}(y,\tilde{y})=(y-\tilde{y})^2\). Writing the updated model as \(\tilde{y}_i=m_{J-1}(\mathbf{x}_i)+T_J(\mathbf{x}_i)\), i.e., the previous model plus the new tree \(T_J\), the objective becomes \[O=\sum_{i=1}^I \left(y_i-m_{J-1}(\mathbf{x}_i)-T_J(\mathbf{x}_i)\right)^2+ \sum_{j=1}^J\Omega(T_j)\]

5.2 Managing Loss

We have built the trees up to \(T_{J-1}\) (and hence the model \(m_{J-1}\)): how do we choose tree \(T_J\)? \[\begin{align*} O&=\sum_{i=1}^I \left(y_i-m_{J-1}(\mathbf{x}_i)-T_J(\mathbf{x}_i)\right)^2+ \sum_{j=1}^J\Omega(T_j) \\ &=\sum_{i=1}^I\left\{y_i^2+m_{J-1}(\mathbf{x}_i)^2+T_J(\mathbf{x}_i)^2 \right\} + \sum_{j=1}^{J-1}\Omega(T_j)+\Omega(T_J) \\ & \quad -2 \sum_{i=1}^I\left\{y_im_{J-1}(\mathbf{x}_i)+y_iT_J(\mathbf{x}_i)-m_{J-1}(\mathbf{x}_i) T_J(\mathbf{x}_i)\right\}\quad \text{cross terms} \\ &= \sum_{i=1}^I\left\{T_J(\mathbf{x}_i)^2-2\left(y_i-m_{J-1}(\mathbf{x}_i)\right)T_J(\mathbf{x}_i) \right\} +\Omega(T_J) + c \end{align*}\] All terms that do not involve \(T_J\) (those depending only on \(y_i\) and \(m_{J-1}\)) are gathered in the constant \(c\): they do not affect the choice of \(T_J\).

\(\rightarrow\) things are simple with quadratic loss. For more complicated loss functions, Taylor expansions are used (see the original paper).

5.3 Penalisation

We now need to take care of the penalisation. For a given tree \(T\), we write \(T(\mathbf{x})=w_{q(\mathbf{x})}\), where \(w\) is the output value of some leaf and \(q(\cdot)\) is the function that maps an input to its final leaf. We write \(l=1,\dots,L\) for the indices of the leaves of the tree. In XGBoost, complexity is defined as: \[\Omega(T)=\gamma L+\frac{\lambda}{2}\sum_{l=1}^Lw_l^2,\] where

  • the first term penalises the total number of leaves.
  • the second term penalises the magnitude of output values (this helps reduce variance).

Below, we focus on the last tree, \(T_J\), whose leaves are indexed by \(l=1,\dots,L\).

5.4 Aggregation

We aggregate both sections of the objective. We write \(I_l\) for the set of the indices of the instances belonging to leaf \(l\). Then,
\[\begin{align*} O&= \sum_{i=1}^I\left\{T_J(\mathbf{x}_i)^2-2\left(y_i-m_{J-1}(\mathbf{x}_i)\right) T_J(\mathbf{x}_i) \right\} + \gamma L+\frac{\lambda}{2}\sum_{l=1}^Lw_l^2 \\ &=\sum_{i=1}^I\left\{w_{q(\mathbf{x}_i)}^2-2\left(y_i-m_{J-1}(\mathbf{x}_i)\right)w_{q(\mathbf{x}_i)} \right\} + \gamma L+\frac{\lambda}{2}\sum_{l=1}^Lw_l^2 \\ &= \sum_{l=1}^L \left(-2w_l\sum_{i\in I_l}\left(y_i -m_{J-1}(\mathbf{x}_i)\right)+ w_l^2\left(\#\{i\in I_l\}+\frac{\lambda}{2}\right)\right)+ \gamma L \end{align*}\]
For each leaf, the function is of the form \(a_lw_l+\frac{b_l}{2}w_l^2\) with \(a_l=-2\sum_{i\in I_l}(y_i-m_{J-1}(\mathbf{x}_i))\) and \(b_l=2\left(\#\{i\in I_l\}+\frac{\lambda}{2}\right)\); it reaches its minimum value \(-\frac{a_l^2}{2b_l}\) at the point \(w_l=-a_l/b_l\). Thus, writing #(.) for the cardinal function that counts the number of items in a set, \[\begin{align*} \mathbf{\rightarrow} \quad w^*_l&=\frac{\sum_{i\in I_l}(y_i -m_{J-1}(\mathbf{x}_i))}{\#\{i\in I_l\}+\frac{\lambda}{2}}, \\ O_L(q)&=-\sum_{l=1}^L \frac{\left(\sum_{i\in I_l}(y_i -m_{J-1}(\mathbf{x}_i))\right)^2}{\#\{i\in I_l\}+\frac{\lambda}{2}}+\gamma L, \end{align*}\] where we make explicit the dependence of the objective on both \(q\) (the structure of the tree) and \(L\) (the number of leaves). Indeed, the meta-shape of the tree remains to be determined. The optimal leaf value is simply the average residual inside the leaf, shrunk by the penalisation \(\lambda\).
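
As a quick sanity check of the closed-form leaf value (a toy numerical sketch, with made-up residuals), the per-leaf objective derived above can be minimised numerically and compared to the formula.

set.seed(42)
res <- rnorm(20)                                   # Toy residuals y_i - m_{J-1}(x_i) inside one leaf
lambda <- 0.5                                      # Penalisation of leaf values
obj <- function(w) -2 * w * sum(res) + w^2 * (length(res) + lambda / 2)  # Per-leaf objective
optimize(obj, interval = c(-5, 5))$minimum         # Numerical minimiser
sum(res) / (length(res) + lambda / 2)              # Closed-form value derived above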

5.5 Tree structure

Final problem: the tree structure! This is essentially encoded in the \(q\) function: what is the best depth? Do we continue to grow the tree or not?

  • we proceed node by node
  • for each node, we look at whether a split is useful (in terms of objective) or not: \[\text{Gain}=\text{Gain}_L+\text{Gain}_R-\text{Gain}_O -\gamma\]
  • each gain is computed with respect to the instances in each bucket: \[\text{Gain}_\mathcal{X}= \frac{\left(\sum_{i\in I_\mathcal{X}}(y_i -m_{J-1}(\mathbf{x}_i))\right)^2}{\#\{i\in I_\mathcal{X}\}+\frac{\lambda}{2}},\] where \(I_\mathcal{X}\) is the set of instances within cluster \(\mathcal{X}\).

\(\text{Gain}_O\) is the gain of the original, unsplit node, which is why it is subtracted. As for the \(-\gamma\) adjustment: the split creates a net total of one additional leaf (two new leaves minus the old one), hence \(\Delta L =1\).

NOTE: XGBoost also applies a learning rate: the contribution of each new tree is scaled by a factor \(\eta \in (0,1]\), which slows down the learning and helps avoid overfitting.