This is the companion notebook for the session on extensions of machine learning tools for factor investing.

The extensions go in four very different directions:

  • a short implementation of SVMs, a new (in this course) tool for regression or classification
  • one example of a model ensemble in which the combination of models is determined by the covariance of their errors
  • two examples of local interpretability, namely LIME and Shapley values
  • finally, we discuss a test for Sharpe ratios of batches of strategies

\(\rightarrow\) ABOUT THE NOTEBOOK:

  • the code chunks are sequential. They must be executed in the correct order.
  • text between stars appears in bold in html format.

1 Support vector machines

We start by loading the data and the required packages. For SVMs, e1071 is a good choice as it is built on the well-known libsvm library.

if(!require(e1071)){install.packages("e1071")} # ML package used for SVM modelling
library(tidyverse)
library(e1071)                                        # Package for SVMs
load("data.RData")                                    # Load data
data <- data  %>%  
  select(-ESG_rank) %>%                               # We will include ESG later on
  group_by(Tick) %>%                                  # Returns are computed asset-by-asset
  mutate(F_Return = lead(Close) / Close - 1) %>%      # Adding forward/future returns
  ungroup() %>%                                       # Ungroup
  na.omit()                                           # Remove missing data

Below, we also prepare the raw inputs. We start with the normalisation of features and with the train/test split. For simplicity, we work with a suboptimal dependent variable, namely simple future returns.

normalise <-  function(v){  # This is a function that 'uniformalises' a vector
  v <- v %>% as.matrix()
  return(ecdf(v)(v))
}
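As a quick illustration of what normalise() does (a side check, not part of the pipeline): each value is mapped to its empirical quantile, so the output always lies in (0,1].

normalise(c(10, 2, 7, 1))   # Returns the empirical quantiles: 1.00 0.50 0.75 0.25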

data_f <- data %>%                                        # Data with normalised values
  group_by(Date) %>%                                      # Normalisation takes place for each date
  mutate_at(vars(Vol_1M:Prof_Marg), normalise) %>%        # Apply normalisation on features
  ungroup()                                               # Ungroup
sep_date <- as.Date("2010-01-01")                         # Separates in-sample vs out-of-sample
t_oos <- data$Date[data$Date > sep_date] %>%              # Out-of-sample dates (i.e., testing set)
  unique() %>%                                            # Remove duplicates
  as.Date(origin = "1970-01-01")                          # Put in date format
features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg")  # Predictors
train_sample <- data_f %>% filter(Date <= sep_date)       # Train sample
train_features <- select(train_sample, all_of(features))  # Train features
test_sample <- data_f %>% filter(Date > sep_date)         # Test sample
test_features <- select(test_sample, all_of(features))    # Test features

We directly move to the implementation of an SVM-based model. We recall that there are two important options (among others): the kernel that is applied upfront to the inputs and the regularisation intensity, C (for cost), which is a scaling factor for the sum of slack errors in the objective function.

  • For the kernel, the package offers four options: linear, radial, polynomial and sigmoid.
  • The package also offers options with respect to the task: C-classification, nu-classification, eps-regression and nu-regression. We refer to the LIBSVM documentation for more details on the differences between these approaches (https://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf).

fit <- svm(y = train_sample$F_Return, # Train label
           x = train_features,        # Train features
           type = "nu-regression",    # SVM task type
           kernel = "radial",         # SVM kernel
           gamma = 0.5,               # Constant in the radial kernel 
           cost = 0.1)                # Slack variable penalisation
pred <- predict(fit, test_features)   # Store predictions
mean(pred * test_sample$F_Return > 0) # Compute hit ratio
## [1] 0.5401003
mean((pred-test_sample$F_Return)^2)   # MSE
## [1] 0.004526179

The hit ratio is slightly above 0.50, which means that the signal is not of outstanding quality.
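If one wants to check whether other hyperparameters fare better, e1071 also ships a simple cross-validated grid search, tune.svm(). Below is a minimal sketch; the candidate values are arbitrary and the search can be slow on the full training sample (output not shown).

tune_svm <- tune.svm(x = train_features,         # Same features as above
                     y = train_sample$F_Return,  # Same label as above
                     type = "nu-regression",     # Passed on to svm()
                     kernel = "radial",          # Passed on to svm()
                     gamma = c(0.1, 0.5, 1),     # Candidate kernel constants (arbitrary)
                     cost = c(0.1, 1, 10))       # Candidate cost values (arbitrary)
tune_svm$best.parameters                         # Cross-validated best (gamma, cost) pair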

Below is the corresponding classification code, in which the label is the sign of the future return.

fit <- svm(y = as.factor(train_sample$F_Return > 0), # Train label
           x = train_features,                       # Train features
           type = "nu-classification",               # SVM task type
           kernel = "radial",                        # SVM kernel
           gamma = 0.5,                              # Constant in the radial kernel 
           cost = 0.1)                               # Slack variable penalisation
pred <- predict(fit, test_features)                  # Store predictions
mean(as.logical(pred) * test_sample$F_Return > 0)    # Share of obs. with a positive call AND a positive return
## [1] 0.3135338

2 Ensemble learning

In this section, we implement one optimal combination of ‘independent’ learners. We start by defining the prediction error function in which we stack all models (random forest, boosted trees and SVM). We also add a weights argument to the function: it computes the error obtained when the output is a (weighted) linear combination of the three algorithms.

library(randomForest)
library(xgboost)
pred_multi <- function(train_data, test_data, weights){
  set.seed(42)    # Sets the pseudo-random generator seed to fix the results (for RF)
  features <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg")     # Hard-coded
  train_features <- train_data %>% 
    select(all_of(features)) %>% as.matrix()                        # Indep. variable
  train_label <- train_data$F_Return                                # Dependent variable
  test <- test_data %>%                                             # Test sample 
    select(all_of(features)) %>% 
    as.matrix()
  # 1. Random Forest
  fit <- randomForest(y = train_label,        # Label
                      x = train_features,     # Features
                      ntree = 10,             # Number of random trees
                      mtry = 4                # Nb of predictors sampled at each split
  )
  err_RF <- predict(fit, test) - test_data$F_Return # Return errors
  # 2. Boosted Trees
  train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format
  fit <- train_matrix %>% 
    xgb.train(data = .,                       # Data source (pipe input)
              eta = 0.5,                      # Learning rate
              objective = "reg:squarederror", # Objective: squared-error regression loss
              max_depth = 4,                  # Maximum depth of trees
              nrounds = 10                    # Number of trees used
    )
  xgb_test <- test_data %>%                     # Test sample => XGB format
    select(all_of(features)) %>% 
    as.matrix() %>%
    xgb.DMatrix()
  err_XGB <- predict(fit, xgb_test) - test_data$F_Return # Return errors
  # 3. SVM
  fit <- svm(y = train_label,                 # Train label
             x = train_features,              # Train features
             type = "nu-regression",          # SVM task type
             kernel = "radial",               # SVM kernel
             gamma = 0.5,                     # Constant in the radial kernel 
             cost = 0.1)                      # Slack variable penalisation
  err_SVM <- predict(fit, test) - test_data$F_Return                      # Return errors
  err_AGG <- weights[1]*err_RF + weights[2]*err_XGB + weights[3]*err_SVM  # Aggregate error
  return(data.frame(err_RF, err_XGB, err_SVM, err_AGG))                   # Return all errors
}

2.1 Simple combination

We are now ready to compute the errors across all three methods (+ the linear combination). The error is computed on the training samples!

errors <-  pred_multi(train_data = train_sample,
                      test_data = train_sample, # On training data first!
                      weights = rep(1,3)/3)     # Equal weights at first
errors %>% colMeans()                           # Average error
##       err_RF      err_XGB      err_SVM      err_AGG 
## 0.0003423143 0.0004667837 0.0007595887 0.0005228955
errors %>% apply(2, sd)                         # Standard deviation of errors
##     err_RF    err_XGB    err_SVM    err_AGG 
## 0.05486171 0.08661017 0.09329150 0.07584463

We then move to the ‘optimal’ combination. We compute the covariance of errors and then the derived optimal weights.
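For reference, these weights come from minimising the variance \(\mathbf{w}'\mathbf{\Sigma}\mathbf{w}\) of the aggregate error under the budget constraint \(\mathbf{1}'\mathbf{w}=1\) (assuming the individual errors are unbiased), which yields \[\mathbf{w}^*=\frac{\mathbf{\Sigma}^{-1}\mathbf{1}}{\mathbf{1}'\mathbf{\Sigma}^{-1}\mathbf{1}}.\] This is exactly what rowSums(solve(Sigma)), rescaled by its sum, computes below.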

cor(errors[,1:3])                       # Corr. matrix of errors of first 3 models (singular if 4)
##            err_RF   err_XGB   err_SVM
## err_RF  1.0000000 0.8514612 0.8582702
## err_XGB 0.8514612 1.0000000 0.9703092
## err_SVM 0.8582702 0.9703092 1.0000000
Sigma <- cov(errors[,1:3])              # => stored in Sigma (for future use)
w <- rowSums(solve(Sigma))              # Row sums of the inverse covariance matrix of errors
w / sum(w)                              # Optimal weights
##     err_RF    err_XGB    err_SVM 
##  1.4304300  0.2378478 -0.6682778

The optimal weights are risky because they amount to an arbitrage: go heavily long the RF forecast and short the SVM one. This comes from the high correlation between the models’ errors.

Let’s see how an EW combination performs on the testing set.

errors_2 <- pred_multi(train_data = train_sample,
                       test_data = test_sample,
                       weights = rep(1,3)/3)
errors_2 %>% colMeans()                  # Average error
##        err_RF       err_XGB       err_SVM       err_AGG 
## -0.0002450426  0.0024838701 -0.0049413085 -0.0009008270
errors_2 %>% apply(2, sd)                # Standard deviation of errors
##     err_RF    err_XGB    err_SVM    err_AGG 
## 0.08121525 0.08378682 0.06710358 0.07163709

Because of the randomness of the first method (the random forest), rerunning the code would normally generate new metrics for the first and last columns. This does not happen here because we fixed the random number generator by freezing the seed. In terms of bias, it is hard to conclude. In terms of variance, the aggregate forecast usually has a very small advantage.

2.2 ‘Optimal’ aggregation

Let’s see how the optimised combination performs on the testing set.

errors_3 <- pred_multi(train_data = train_sample,
                       test_data = test_sample,
                       weights = w / sum(w))  # Change weights!
errors_3 %>% colMeans()                       # Average error
##        err_RF       err_XGB       err_SVM       err_AGG 
## -0.0002450426  0.0024838701 -0.0049413085  0.0035424332
errors_3 %>% apply(2, sd)                     # Standard deviation of errors
##     err_RF    err_XGB    err_SVM    err_AGG 
## 0.08121525 0.08378682 0.06710358 0.09581434

Because of the negative (hence risky) weights, the dispersion of errors of the aggregate prediction is far from competitive. This is a well-known fact: diversification can only work when the underlying building blocks tell different stories, which is not the case here. Working with variations in feature values could help solve this issue (see the exercises).

2.3 Stacked ensemble

It is in fact well known that unconstrained optimisation of learner combinations is risky. In 1996, Breiman introduced the notion of stacked regressions and made the case that enforcing positivity of the weights (leading to convex combinations) is the right way to go.

More generally, this has led to a family of sophisticated approaches called stacked ensembles. One implementation is provided in the ML library h2o: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html

Below, we illustrate this idea using the optimisation package quadprog. If we write \(\mathbf{\Sigma}\) for the covariance matrix of errors, we seek \[\mathbf{w}^*=\underset{\mathbf{w}}{\text{argmin}} \ \mathbf{w}'\mathbf{\Sigma}\mathbf{w}, \quad \text{subject to} \quad \mathbf{1}'\mathbf{w}=1, \quad w_i\ge 0.\] The constraints will be handled as:

\[\mathbf{A} \mathbf{w}= \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \mathbf{w} \hspace{9mm} \text{ compared to} \hspace{9mm} \mathbf{b}=\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \]

where the first line will be an equality and the last three will be inequalities. In the notation of solve.QP(), the program minimises \(\frac{1}{2}\mathbf{w}'\,\text{Dmat}\,\mathbf{w}-\text{dvec}'\mathbf{w}\) subject to \(\text{Amat}'\mathbf{w}\ge \text{bvec}\), with the first meq constraints treated as equalities; hence Dmat = \(\mathbf{\Sigma}\), dvec = 0, Amat = \(\mathbf{A}'\) and meq = 1 below.

if(!require(quadprog)){install.packages("quadprog")}
library(quadprog)
nb_mods <- nrow(Sigma)
w_stacked <- solve.QP(Dmat = Sigma,
                      dvec = rep(0, nb_mods),
                      Amat = rbind(rep(1, nb_mods), diag(nb_mods)) %>% t(),
                      bvec = c(1,rep(0, nb_mods)),
                      meq = 1
)
w_stacked$solution
## [1] 1.000000e+00 0.000000e+00 1.496491e-16

The optimal combination consists of taking RF only! A more diversified combination would require more models, or models that capture different patterns (i.e., models with uncorrelated or negatively correlated errors).

3 Interpretability

3.1 Introduction

In attempts to white-box complex machine learning models, one dichotomy stands out:

  • Global models seek to determine the relative role of features in the construction of the prediction once the model has been trained. This is for instance the case of variable importance in tree-based methods.
  • Local models aim to characterise how the model behaves around one particular instance by considering small variations around this instance. The way these variations are handled by the original model makes it possible to simplify it, i.e., to approximate it (e.g., in a linear fashion) and thereby determine the sign and magnitude of the impact of each relevant feature in the vicinity of the original instance.

Below, we implement the LIME (Local Interpretable Model-agnostic Explanations from Ribeiro et al. (2016)) algorithm from the eponymous R package. We detail the approach below (we omit the subtlety of the interpretable representations in the original paper as it is not relevant in our context).

The original (complex) model is \(f\) and it is approximated at instance \(x\) by the interpretable model \(g\) that belongs to a large class \(G\). The vicinity of \(x\) is denoted \(\pi_x\) and the complexity of \(g\) is written \(\Omega(g)\). LIME seeks an interpretation of the form \[\xi(x)=\underset{g \in G}{\text{argmin}} \, \mathcal{L}(f,g,\pi_x)+\Omega(g),\] where \(\mathcal{L}(f,g,\pi_x)\) is the loss function (error/imprecision) induced by \(g\) in the vicinity \(\pi_x\) of \(x\).

The penalisation \(\Omega(g)\) is for instance the number of leaves or depth of a tree or the number of predictors in a linear regression.

The vicinity of \(x\) is defined by \(\pi_x(z)=e^{-D(x,z)^2/\sigma^2},\) where \(D\) is some distance measure. Note: this function decreases when \(z\) shifts away from \(x\).
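As a toy illustration of this weighting scheme (not part of the lime package; the Euclidean distance and the bandwidth \(\sigma = 1\) are arbitrary choices here):

vicinity_weight <- function(x, z, sigma = 1){  # Toy version of pi_x(z) with Euclidean distance
  exp(-sum((x - z)^2) / sigma^2)
}
vicinity_weight(c(0.5, 0.5), c(0.6, 0.4))      # Nearby point  => weight close to 1
vicinity_weight(c(0.5, 0.5), c(0.9, 0.1))      # Distant point => smaller weight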

The tricky part is the loss function. In order to minimise it, LIME generates artificial samples close to \(x\) and sums the weighted errors that the simple representation makes on the label: \[\mathcal{L}(f,g,\pi_x)=\sum_z \pi_x(z)\left(f(z)-g(z)\right)^2,\] where the squared errors are weighted by \(\pi_x(z)\), i.e., down-weighted as \(z\) moves away from the initial instance \(x\).

3.2 Global importance of ESG

The question here is: how important is ESG in the determination of future returns?

library(rpart)
library(rpart.plot)
load("data_esg.RData")
data_esg <- data_esg %>%
  group_by(Tick) %>%
  mutate(F_Return = lead(Close)/Close - 1) %>%          # Return for each stock
  ungroup()   %>%
  group_by(Date) %>%                                    # Normalisation takes place for each date
  mutate_at(vars(Vol_1M:ESG_rank), normalise) %>%       # Apply normalisation on features
  ungroup() 
fit_esg <- rpart(F_Return ~ Vol_1M + Mkt_Cap + P2B + D2E + ESG_rank,
                 data = data_esg, 
                 cp = 0.001,
                 maxdepth = 5)
rpart.plot(fit_esg)

fit_esg$variable.importance %>%
  data.frame(importance = .) %>%
  rownames_to_column(var = "variable")
##   variable importance
## 1      D2E  1.1431022
## 2      P2B  0.8451331
## 3  Mkt_Cap  0.6669449
## 4 ESG_rank  0.4196548
## 5   Vol_1M  0.3295422
fit_esg$variable.importance %>%
  data.frame(importance = .) %>%
  rownames_to_column(var = "variable") %>%
  ggplot(aes(x = importance, y = reorder(variable, importance))) + 
  geom_col() + theme_light() + ylab("Importance")

Thus, according to this data & decision tree, ESG is not the most prominent driver of future returns.

3.3 LIME in practice

One all-purpose package for machine learning in R is caret (see, e.g., http://topepo.github.io/caret/index.html); it is in some sense the ancestor of the tidymodels suite. We have avoided working with it up to now (we chose to work directly with individual, more narrowly-scoped packages), but LIME works well with caret, so it is time to test it! Also note that caret in fact relies on pre-existing packages, like randomForest for instance. Finally, caret includes a built-in tool for grid search over hyperparameters and another one that computes variable importance: caret does pretty much everything!

if(!require(lime)){install.packages(c("lime", "caret"))}
library(lime)
library(caret)

grid_default <- expand.grid( # Must include ALL XGB PARAMETERS! If not => error.
  nrounds = 10,
  max_depth = 5,
  eta = 0.8,
  gamma = 0.1,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
fit <- train(y = train_sample$F_Return,       # Train label
             x = as.matrix(train_features),   # Train features
             method = "xgbTree",              # For boosted trees
             tuneGrid = grid_default,
             objective = "reg:squarederror"
)

explainer <- lime(train_features, fit)   # First step in LIME

The list of all models available in caret can be found here (there are many, roughly 240): https://topepo.github.io/caret/available-models.html
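As a side note, since we mentioned caret's built-in variable importance above, it can be called directly on the fitted object (a quick illustration, output not shown):

varImp(fit)   # caret's variable importance for the boosted trees trained above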

The lime package proceeds in two steps. First, the lime() function (used above) embeds the model and the distribution of features. This allows the second step (below) to generate random points around the chosen instance. This second step is performed by the explain() function, which has many options that we do not review here. We refer to https://uc-r.github.io/lime for a detailed overview.

explanation <- explain(
  x = train_features[1:2,],              # We look at the first two instances in train_features 
  explainer = explainer,                 # Input: the explainer variable created above 
  n_permutations = 1000,                 # Nb samples used for the loss function
  dist_fun = "gower",                    # Distance function. "euclidean" is one alternative
  n_features = 4
)
plot_features(explanation)               # Plot!

In the arguments of explain(), we asked for explanations for two instances (the first two in train_features) and for exactly four features. The graph above shows these features for the two instances. A blue/red colour indicates a positive/negative effect of the feature, and the size of each rectangle represents its relative importance (features are ordered from most important at the top to least important at the bottom).

3.4 Shapley values

There is another way to quantify the relative importance of features in the determination of a local prediction. It stems from cooperative game theory. The idea is to investigate combinations of players and the payoffs that these combinations generate. The ‘value’ of a player can then be assessed by looking at what happens to the payoffs when the player is removed from the combinations. In an ML framework, the players are the features and the payoffs are the model outputs.

We give the formula below and explain it shortly after.

\[\phi_k=\sum_{S \subseteq \{x_1,\dots,x_K \} \backslash x_k}\underbrace{\frac{|S|!(K-|S|-1)!}{K!}}_{\text{weight of coalition}}\underbrace{\left(f_{S \cup \{x_k\}}(S \cup \{x_k\})-f_S(S)\right)}_{\text{gain when adding } x_k}\]

\(S\) is any subset of the coalition that doesn’t include feature \(k\); its size/cardinal is \(|S|\).
In the equation above, the model \(f\) must be altered because it's impossible to evaluate \(f\) when features are missing. In this case, there are several options:

  • set the missing value to its average or median value (over the whole sample) so that its effect is the ‘average’ effect
  • directly compute an average value \(\int_\mathbb{R} f(x_1,\dots,x_k,\dots,x_K)d\mathbb{P}_{x_k}\), where \(d\mathbb{P}_{x_k}\) is the empirical distribution of \(x_k\) in the sample
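As a quick numerical check of the combinatorial weight in the formula above: with \(K=5\) features (as in our data) and a coalition of size \(|S|=2\), the weight is \(\frac{2!\,(5-2-1)!}{5!}=\frac{2\times 2}{120}=\frac{1}{30}\).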

Below, we use the iml package (for Interpretable Machine Learning). The package has several functionalities, such as variable importance, but we only resort to Shapley values.

if(!require(iml)){install.packages("iml")}
library(iml)                       # Interpretable ML package
predictor <- Predictor$new(fit,    # This wraps the model & data
                           data = train_features, 
                           y = train_sample$F_Return)
shapley <- Shapley$new(predictor, x.interest = train_features[1,]) # Compute the Shapley values
shapley$plot() + theme_light()                                     # Plot the values!

Compared to the LIME interpretation, there are some discrepancies…

4 Sustainable investing (short)

Green finance has soared recently and we refer to www.esgperspectives.com for a review on the topic. One central question pertains to the link between sustainability (green & social firms) and financial performance. Below, we use the data_esg dataset to investigate this question.

We resort to a panel model:

\[r_{t+1,n}= a_n+ \textbf{x}_{t,n}\textbf{b}+e_{t+1,n}\]

to explain returns. In the predictor variables \(\textbf{x}_{t,n}\), we include an ESG component.

if(!require(plm)){install.packages("plm")} # This is a package for panel models
library(plm)
plm(F_Return ~ Vol_1M + Mkt_Cap + P2B + D2E + ESG_rank, 
    data = data_esg,
    model = "within",
    index = "Tick") %>%
    summary()
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = F_Return ~ Vol_1M + Mkt_Cap + P2B + D2E + ESG_rank, 
##     data = data_esg, model = "within", index = "Tick")
## 
## Balanced Panel: n = 384, T = 84, N = 32256
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -0.86378356 -0.04239699  0.00081299  0.04243840  2.07170700 
## 
## Coefficients:
##            Estimate Std. Error  t-value  Pr(>|t|)    
## Vol_1M    0.0021091  0.0022124   0.9533    0.3404    
## Mkt_Cap  -0.0838442  0.0061635 -13.6033 < 2.2e-16 ***
## P2B      -0.0334246  0.0053720  -6.2220 4.970e-10 ***
## D2E       0.0237141  0.0043055   5.5078 3.661e-08 ***
## ESG_rank -0.0040270  0.0038532  -1.0451    0.2960    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    227.46
## Residual Sum of Squares: 224.44
## R-Squared:      0.01328
## Adj. R-Squared: 0.0012665
## F-statistic: 85.7808 on 5 and 31867 DF, p-value: < 2.22e-16

Sadly, we see that the ESG_rank column does not help to explain future returns.
If we resort to a simple sorting procedure instead…

data_esg %>%
    group_by(Date) %>%
    mutate(type = if_else(ESG_rank > 0.5, "green", "brown")) %>%
    group_by(type) %>%
    summarise(m = mean(F_Return, na.rm = T))
## # A tibble: 2 × 2
##   type       m
##   <chr>  <dbl>
## 1 brown 0.0107
## 2 green 0.0113

The green assets have average returns barely above those of the brown ones.

5 Deflated Sharpe ratios

5.1 Theory

This section is based on Bailey and Lopez de Prado (2014). The problem we address is that of backtest overfitting. When a manager or analyst tests a strategy on past data, a major risk is that a false positive emerges: the strategy will seem profitable in the test, but will lose money in live trading. In practice the analyst will test many strategies and only select a few: those that performed best in-sample. In order to correct for in-sample results that seem too optimistic, it is customary to shrink Sharpe ratios by some factor, usually close to 0.5.

Bailey and Lopez de Prado (2014) propose an approach to evaluate the statistical significance of Sharpe ratios when \(M\) strategies have been tested. Usually, the one with maximum SR is favoured. But is it a statistically sound choice?

The following test procedure assesses its relevance: \[t = \phi\left(\frac{(SR-SR^*)\sqrt{T-1}}{\sqrt{1-\gamma_3SR+\frac{\gamma_4-1}{4}SR^2}} \right),\]

where \(\gamma_3\) and \(\gamma_4\) are the skewness and kurtosis of the returns of the strategy and \(SR^*\) is the theoretical average value of the maximum SR over \(M\) trials:

\[SR^*=\mathbb{E}[SR_m]+\sqrt{\mathbb{V}[SR_m]}\left((1-\gamma)\phi^{-1}\left(1-\frac{1}{M}\right)+\gamma \phi^{-1}\left(1-\frac{1}{Me}\right) \right),\]

where \(\phi\) is the cdf of the standard Gaussian law and \(\gamma\) is the Euler-Mascheroni constant. The index \(m\) refers to a given strategy trial and \(M\) is the total number of tested strategies. If \(t\) defined above is below a certain threshold (e.g., 0.95), then the \(SR\) cannot be deemed significant.

5.2 Practice

We now illustrate this result.

Our (very bad) research strategy is to test factor strategies based on thresholds. We pick one feature (e.g., Mkt_Cap) and test a policy that invests uniformly (EW portfolios) only in the stocks that are the most tilted in one direction, i.e., above or below a particular threshold. In the Mkt_Cap case, this means investing either in small firms or in large firms. We then repeat the operation over all features. This generates a large array of strategies. Each strategy has three components:

  • the choice of the feature
  • the decision threshold (beyond or below which level do we invest in the stock)
  • the decision direction (do we invest when the feature is above or below the threshold)

We create the corresponding function below.

strat <- function(data, feature, thresh, direction){
  data_tmp <- select(data, all_of(feature), Date, F_Return)      # Data
  colnames(data_tmp)[1] <- "feature"                             # Colname
  data_tmp %>% 
    mutate(decision = direction * feature > direction * thresh) %>% # Investment decision
    group_by(Date) %>%                                          # Date-by-date  analysis    
    mutate(nb = sum(decision),                                  # Nb assets in portfolio
           w = decision / nb,                                   # Weights of assets
           return = w * F_Return) %>%                           # Asset contribution
    summarise(p_return = sum(return)) %>%                       # Portfolio return
    summarise(avg = mean(p_return), sd = sd(p_return), SR = avg/sd) %>% # Perf. metrics
    return()
}

We then test the function on two possible strategies (i.e., two choices of the triplet).

strat(data_f, "Mkt_Cap", 0.5, 1)   # Large cap
## # A tibble: 1 × 3
##       avg     sd    SR
##     <dbl>  <dbl> <dbl>
## 1 0.00473 0.0423 0.112
strat(data_f, "Mkt_Cap", 0.5, -1)  # Small cap
## # A tibble: 1 × 3
##      avg     sd    SR
##    <dbl>  <dbl> <dbl>
## 1 0.0128 0.0492 0.261

The first line tests a strategy that invests when Mkt_Cap is larger than 0.5 (note: we work with data_f, hence the feature values all lie in [0,1]). The second does the opposite (i.e., invests in small firms) and is much more profitable.

We then test a large panel of configurations via a grid approach: this is brute-force data mining, which is rarely the best idea.

feature <- c("Mkt_Cap", "P2B", "Vol_1M", "D2E", "Prof_Marg")            # Features
thresh <- seq(0.2,0.8, by = 0.1)                                        # Threshold values
direction <- c(1,-1)                                                    # Decision direction
pars <- expand.grid(feature, thresh, direction)                         # The grid
feature <- pars[,1] %>% as.character()                                  # Features, one per grid row
thresh <- pars[,2]                                                      # Thresholds, one per grid row
direction <- pars[,3]                                                   # Directions, one per grid row

Finally, we run the loop inside the pmap() wrapper.

grd <- pmap(list(feature, thresh, direction),      # Parameters for the grid search
            strat,                                 # Function on which to apply the grid search
            data = data_f                          # Data source/input
) %>% 
  unlist() %>%
  matrix(ncol = 3, byrow = T)
grd <- data.frame(feature, thresh, direction, grd)              # Gather & reformat results 
colnames(grd)[4:6] <- c("mean", "sd", "SR")                     # Change colnames
grd <- grd %>% mutate_at(vars(direction), as.factor)            # Change type: factor (for plot)
grd %>% ggplot(aes(x = thresh, y = SR, color = feature)) +      # Plot!
  geom_point() + geom_line() + facet_grid(direction~.) + theme_light()

The biggest effect of direction is observed for Mkt_Cap. As usual, this is the most decisive variable in our sample.

Note: these are SR at the monthly frequency. There is a time-scale effect with the SR: the numerator is proportional to time while the denominator is proportional to its square root. Hence a simple proxy for annualised SR is the monthly SR times \(\sqrt{12}\).
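For instance, the small-cap strategy shown earlier has a monthly SR of roughly 0.26, which corresponds to an annualised SR of about \(0.26\times\sqrt{12}\approx 0.90\).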

The best strategy in the above set is to invest in firms that are below the 0.3 quantile of market capitalisation. This would be our choice. Is it relevant? Let’s find out using the deflated SR test. First, we code the function.

DSR <- function(SR, Tt, M, g3, g4, SR_m, SR_v){ # First, we build the function
  gamma <- -digamma(1)                        # Euler-Mascheroni constant
  SR_star <- SR_m + sqrt(SR_v)*((1-gamma)*qnorm(1-1/M) + gamma*qnorm(1-1/M/exp(1))) # Avg. Max SR
  num <- (SR-SR_star) * sqrt(Tt-1)            # Numerator
  den <- sqrt(1 - g3*SR + (g4-1)/4*SR^2)      # Denominator
  return(pnorm(num/den))
}

Next, we compute its arguments… and pass them to DSR().

# Next, the parameters:
M <- nrow(pars)             # Number of strategies we tested
SR <- max(grd$SR)           # The SR we want to test
SR_m <- mean(grd$SR)        # Average SR across all strategies
SR_v <- var(grd$SR)         # Variance of SR across all strategies
# Below, we compute the returns of the strategy by recycling the code of the strat() function
data_tmp <- select(data_f, "Mkt_Cap", Date, F_Return) # Mkt_Cap is the retained feature
colnames(data_tmp)[1] <- "feature"
returns <-  data_tmp %>% 
  mutate(decision = feature < 0.3) %>% # Investment decision: 0.3 is the best threshold
  group_by(Date) %>%                   # Date-by-date computations
  mutate(nb = sum(decision),           # Nb assets in portfolio
         w = decision / nb,            # Portfolio weights
         return = w * F_Return) %>%    # Asset contribution to return
  summarise(p_return = sum(return))    # Portfolio return
g3 <- skewness(returns$p_return)        # Skewness (function from the e1071 package)
g4 <- kurtosis(returns$p_return) + 3    # Raw kurtosis (e1071 returns excess kurtosis, hence +3)
Tt <- nrow(returns)                     # Number of dates
DSR(SR, Tt, M, g3, g4, SR_m, SR_v)      # The sought value!
## [1] 0.45605

This value is very far from the 0.95 confidence threshold. The SR of our best strategy is not statistically significant: it does not lie sufficiently far to the right of the distribution of all tested SRs.

6 Exercises

6.1 Optimised forecasts

We have seen that the correlations between errors are too high in our base-case ensemble model (with three learners). Try to build a similar aggregation model, but with the predictors being the variations of the raw predictors (\(\Delta X_t=X_t-X_{t-1}\)). Use the mutate() or mutate_at() and lag() functions to create the variations, as in the sketch below.
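A possible starting point is sketched below (it assumes the features vector defined earlier in the notebook; re-estimating the ensemble on data_delta is left to you):

data_delta <- data_f %>% 
  group_by(Tick) %>%                        # Variations are computed stock-by-stock
  mutate_at(features, ~ . - lag(.)) %>%     # Delta X_t = X_t - X_{t-1}
  ungroup() %>% 
  na.omit()                                 # The first date of each stock is lost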

6.2 SVM portfolios

Create a portfolio strategy based on SVM forecasts. Choose your variables wisely.

6.3 Stacked ensemble

Create an ensemble with 5 or 6 learners (you can add LASSO and ridge regressions to the ones used above) and see if the combination involves more than one model.
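For the two penalised regressions, a minimal sketch with the glmnet package could look as follows (the lambda values are arbitrary; the resulting errors can then be stacked alongside those of pred_multi()):

if(!require(glmnet)){install.packages("glmnet")}    # Package for LASSO & ridge
library(glmnet)
fit_lasso <- glmnet(x = as.matrix(train_features),  # LASSO: alpha = 1
                    y = train_sample$F_Return,
                    alpha = 1, lambda = 0.01)       # lambda chosen arbitrarily
fit_ridge <- glmnet(x = as.matrix(train_features),  # Ridge: alpha = 0
                    y = train_sample$F_Return,
                    alpha = 0, lambda = 0.01)       # lambda chosen arbitrarily
err_LASSO <- predict(fit_lasso, as.matrix(test_features)) - test_sample$F_Return  # Errors
err_RIDGE <- predict(fit_ridge, as.matrix(test_features)) - test_sample$F_Return  # Errors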