\(\rightarrow\) ABOUT THE NOTEBOOK:
This is the companion notebook dedicated to data processing. This stage can easily be overlooked but is crucial in machine learning. We briefly discuss categorical variables and logistic regression. We also take a few steps towards ML-driven factor investing in the last section.
First, one of the most important issues: MISSING DATA!
Below, we import packages & the data.
if(!require(naniar)){install.packages(c("naniar", "visdat", "zoo"))}
library(tidyverse) # Loading THE package/environment
library(naniar) # Missing data exploration
library(visdat) # Missing data visualization
library(zoo) # Great package for time-series
load("data_new.RData")
data$p2b[999:1010] <- NA # Removing data to illustrate imputation
Let’s determine which variables are the most affected… (this step takes a bit of time because we use the large dataset)
gg_miss_var(data) # Proportion of missing points per column
vis_dat(data, warn_large_data = F) # Visualization of data types per column
The two most affected columns are indeed revenue and scope_1 (emissions). Below, we illustrate two different ways of handling missing values: time-series imputation and cross-sectional imputation, using the p2b ratios (from which we artificially removed a few points above) as a testbed. First, let us take a closer look.
data %>% filter(is.na(p2b)) # Where are the missing p2b ratios?
## # A tibble: 133 × 9
## date symbol close vol_1y mkt_cap p2b d2e revenue scope_1
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2011-01-31 C 48.2 0.296 137446045212. NA 392. NA 39120
## 2 2011-02-28 C 46.8 0.303 137446045212. NA 392. NA 39120
## 3 2011-03-31 C 44.2 0.257 137446045212. NA 392. NA 39120
## 4 2011-04-30 C 45.9 0.248 137446045212. NA 392. NA 39120
## 5 2011-05-31 C 41.2 0.253 137446045212. NA 392. NA 39120
## 6 2011-06-30 C 41.6 0.245 137446045212. NA 392. NA 39120
## 7 2011-07-31 C 38.3 0.244 137446045212. NA 392. NA 39120
## 8 2011-08-31 C 31.0 0.297 137446045212. NA 392. NA 39120
## 9 2011-09-30 C 25.6 0.330 137446045212. NA 392. NA 39120
## 10 2011-10-31 C 31.6 0.413 137446045212. NA 392. NA 39120
## # ℹ 123 more rows
data %>%
filter(symbol == "C") %>%
ggplot(aes(x = date, y = p2b)) + geom_line() + theme_light()
For time-series imputation, we will use the na.locf function from the zoo package. Let’s see what it does.
seq_na <- sin(1:24) %>% round(3)
seq_na[10:12] <- NA
seq_na
## [1] 0.841 0.909 0.141 -0.757 -0.959 -0.279 0.657 0.989 0.412 NA
## [11] NA NA 0.420 0.991 0.650 -0.288 -0.961 -0.751 0.150 0.913
## [21] 0.837 -0.009 -0.846 -0.906
na.locf(seq_na)
## [1] 0.841 0.909 0.141 -0.757 -0.959 -0.279 0.657 0.989 0.412 0.412
## [11] 0.412 0.412 0.420 0.991 0.650 -0.288 -0.961 -0.751 0.150 0.913
## [21] 0.837 -0.009 -0.846 -0.906
The missing points were replaced by the last preceding value. Handy!
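One caveat (a quick sketch below): na.locf cannot fill values at the very start of a series, since there is no preceding observation. This is why we keep na.rm = FALSE in the panel code that follows, so that vector lengths are preserved.
seq_na2 <- c(NA, seq_na) # Add a leading NA
na.locf(seq_na2, na.rm = FALSE) # The leading NA remains: there is nothing to carry forward
na.locf(seq_na2, fromLast = TRUE) # Backward fill (next observation carried back) is one alternative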
data %>%
group_by(symbol) %>% # Perform the operation for EACH stock
mutate(p2b = na.locf(p2b, na.rm = F)) %>% # Replace p2b by imputed p2b
ungroup() %>% # Ungroup
filter(symbol == "C", # Let's check!
date > "2005-01-01",
date < "2015-01-01") %>%
ggplot(aes(x = date, y = p2b)) + geom_line() + theme_light() +
geom_point(data = data |> filter(symbol == "C", year(date) == 2011), color = "red")
No more holes!
Another way to proceed is to assign a mean or median score, as if the missing value for the company is equal to some “average” or representative value. The median is more robust than the mean because it is less affected by outliers.
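As a quick aside (toy numbers only), the chunk below illustrates this robustness; we then apply the cross-sectional (median) imputation to the panel.
x_toy <- c(1, 2, 3, 100) # One extreme value
mean(x_toy) # 26.5: pulled towards the outlier
median(x_toy) # 2.5: barely affected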
data %>%
group_by(date) %>% # Perform the operation for EACH date
mutate(med_p2b = median(p2b, na.rm = T)) %>% # Compute the median p2b
ungroup() %>% # Ungroup
mutate(p2b = if_else(is.na(p2b), med_p2b, p2b)) %>% # If p2b is NA, replace it! If not, don't.
filter(symbol == "C", # Let's check!
date > "2005-01-01",
date < "2015-01-01") %>%
ggplot(aes(x = date, y = p2b)) + geom_line() + theme_light() +
geom_point(data = data |> filter(symbol == "C", year(date) == 2011), color = "red")
Here again, the holes have been filled, but the new values are very different.
We choose to skip the complicated topic of outlier detection to focus directly on feature engineering (we refer to Aggarwal (2016) for an in-depth treatment of outlier analysis). Below, we start with artificially generated data and introduce simple rescaling methods.
On the particular topic of feature engineering, we recommend the book http://www.feat.engineering.
Length <- 100 # Length of the sequence
x <- exp(sin(1:Length)) # Original data
data_simu <- data.frame(index = 1:Length, x = x) # Data framed into dataframe
ggplot(data_simu, aes(x = index, y = x)) + # Plot
geom_col() + theme_light()
Nothing special about this sample. We choose to work with positive values, but this is without much loss of generality. Below, we scale the variable using 3 methods:
standard <- (x - mean(x)) / sd(x) # Standardisation
norm_0_1 <- (x - min(x)) / (max(x)-min(x)) # [0,1] reduction
unif <- ecdf(x)(x) # Uniformisation
data_norm <- data.frame( # Formatting the data
index = 1:Length, # Index of point/instance
standard = standard,
norm_0_1 = norm_0_1,
unif = unif) %>% pivot_longer(-index, names_to = "Type", values_to = "value")
ggplot(data_norm, aes(x = index, y = value, fill = Type)) + # Plot!
geom_bar(stat = "identity", position = "dodge") + theme_light()
ggplot(data_norm, aes(x = index, y = value, fill = Type)) + # Plot!
geom_col() + theme_light() +
facet_grid(Type~.) # This option creates 3 concatenated graphs to ease comparison
By definition, standardisation produces both positive and negative values. In terms of distribution of the data, this gives the following differences:
ggplot(data_norm, aes(x = value, fill = Type)) +
geom_histogram(position = "dodge") + theme_light() +
facet_grid(Type ~.)
With respect to shape, the green and red distributions are close to the original one. It is only the support that changes: the min/max rescaling ensures all values lie in the [0,1] interval. In both cases, the smallest values (on the left) display a spike in distribution. By construction, this spike disappears under the uniformisation: the points are evenly distributed over the unit interval.
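As a quick sanity check (one line, assuming no ties in the sample), the uniformised values are simply normalised ranks, which is why their histogram is flat.
all.equal(unif, rank(x, ties.method = "max") / Length) # ecdf(x)(x) coincides with normalised ranks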
In this section, we investigate the impact of these techniques and choices on simple analyses. We compare the impact of rescaling on basic regression results.
We start by importing the smaller dataset and computing returns (our dependent variable).
data <- data %>% arrange(date,symbol) # Just making sure all is in order
tick <- unique(data$symbol) # Set of assets
data <- data %>%
group_by(symbol) %>% # Perform analysis stock-by-stock
mutate(f_return = lead(close) / close - 1) %>% # Adding forward/future returns
select(-scope_1) %>% # Too many missing points for emissions
na.omit() %>% # Take out missing data
ungroup()
summary(data) # Descriptive statistics
## date symbol close vol_1y
## Min. :2000-11-30 Length:7389 Min. : 0.2525 Min. :0.05741
## 1st Qu.:2007-03-31 Class :character 1st Qu.: 29.3100 1st Qu.:0.15773
## Median :2013-03-31 Mode :character Median : 52.5312 Median :0.20503
## Mean :2013-03-09 Mean : 75.0324 Mean :0.23624
## 3rd Qu.:2019-02-28 3rd Qu.: 92.2700 3rd Qu.:0.28197
## Max. :2025-01-31 Max. :960.0200 Max. :1.59204
## mkt_cap p2b d2e revenue
## Min. :5.024e+09 Min. : 0.7486 Min. : 0.00 Min. :4.827e+09
## 1st Qu.:5.598e+10 1st Qu.: 2.1620 1st Qu.: 30.48 1st Qu.:3.759e+10
## Median :1.318e+11 Median : 3.2933 Median : 59.19 Median :6.661e+10
## Mean :1.856e+11 Mean : 13.1589 Mean : 246.84 Mean :1.027e+11
## 3rd Qu.:2.110e+11 3rd Qu.: 5.9361 3rd Qu.: 155.28 3rd Qu.:1.299e+11
## Max. :3.463e+12 Max. :540.0111 Max. :5814.97 Max. :6.810e+11
## f_return
## Min. :-0.578846
## 1st Qu.:-0.032787
## Median : 0.004775
## Mean : 0.007198
## 3rd Qu.: 0.046769
## Max. : 1.273764
First, the descriptive statistics. Looking at the dispersion between the median and both the minimum and maximum gives a first impression of outliers. For instance, the maximum d2e is almost 100 times the median: this is a big number. The same observation holds for p2b. Sometimes, it is worthwhile to understand whether these values are errors or simply due to very small values in the denominator. For instance, the outlier in p2b comes from very low book values for Boeing in 2017.
data %>%
filter(symbol == "BA") %>%
ggplot(aes(x = date, y = p2b)) + geom_line() + theme_bw()
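Although we skip outlier detection, here is a minimal winsorisation sketch (the 1%/99% caps are an arbitrary choice and the result is not reused afterwards; in practice, one would typically apply it date by date, like the scalings that follow).
winsorise <- function(v, probs = c(0.01, 0.99)){ # Clamp values beyond the chosen quantiles
  q <- quantile(v, probs = probs, na.rm = TRUE)
  pmin(pmax(v, q[1]), q[2])
}
data %>%
  mutate(d2e_w = winsorise(d2e)) %>% # Winsorised debt-to-equity
  summarise(max_raw = max(d2e), max_winsorised = max(d2e_w))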
Then, we rescale the predictors (features). We use two normalising functions. For consistency reasons, we will compare the two methods that scale inputs into the [0,1] interval. One secondary reason for that is that some ML tools work well/better with bounded inputs (neural networks for instance). Below, we create the two dataframes used in the analysis.
NOTE: it is imperative to rescale the data AT EACH DATE.
norm_unif <- function(v){ # This function uniformises a vector (maps values to empirical quantiles)
  v <- v %>% as.matrix()
  return(ecdf(v)(v))
}
norm_0_1 <- function(v){ # This function rescales a vector to the [0,1] interval (min-max)
  return((v - min(v)) / (max(v) - min(v)))
}
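To see why the note above matters, here is a toy contrast (a sketch using the freshly defined norm_0_1): pooled scaling mixes all dates, whereas per-date scaling only compares firms within the same cross-section.
data %>%
  mutate(pooled_cap = norm_0_1(mkt_cap)) %>% # Pooled scaling: uses the whole sample at once
  group_by(date) %>%
  mutate(cross_cap = norm_0_1(mkt_cap)) %>% # Per-date scaling: ranks firms within each date
  ungroup() %>%
  filter(symbol == "AAPL") %>%
  select(date, pooled_cap, cross_cap) %>%
  head()
In a backtest, the pooled version would also rely on minima and maxima observed at future dates, which amounts to look-ahead bias.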
data_0 <- data %>% # Data with normalised values
group_by(date) %>% # Normalisation takes place for each given date
mutate_if(is.numeric, norm_0_1) %>% # Apply normalisation on numerical columns
ungroup() %>% # Ungroup
select(-symbol, -date, -close) # Take out superfluous columns
data_0$f_return <- data$f_return # We recover the original returns
data_u <- data %>% # Data with normalised values
group_by(date) %>% # Normalisation takes place for each given date
mutate_if(is.numeric, norm_unif) %>% # Apply normalisation on numerical columns
ungroup() %>% # Ungroup
select(-symbol, -date, -close) # Take out superfluous columns
data_u$f_return <- data$f_return # We recover the original returns
Below, we check the distribution of the variables.
data_0 %>%
select(-f_return) %>% # Remove returns
pivot_longer(cols = everything(),
names_to = "Attribute", values_to = "Value") %>% # Convert to 'compact' column mode
ggplot(aes(x = Value, fill = Attribute)) +
geom_histogram() + theme_light() + # Plot histograms
facet_grid(Attribute~., scales = "free") # Stack the histograms
data_u %>%
select(-f_return) %>% # Remove returns
pivot_longer(cols = everything(),
names_to = "Attribute", values_to = "Value") %>% # Convert to 'compact' mode
ggplot(aes(x = Value, fill = Attribute)) +
geom_histogram(bins = 10) + theme_light() + # Plot histograms
facet_grid(Attribute~.) # Stack the histograms
Apart from a minor issue with d2e (due to outliers?), the scaling seems fine. But clearly, the distributions of all variables are strongly impacted by the scaling choice.
We are now ready to perform the regressions. We start with the scaled data.
fit <- lm(f_return ~ ., data = data_0) # Regression on first set
summary(fit) # Regression output
##
## Call:
## lm(formula = f_return ~ ., data = data_0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59505 -0.03959 -0.00165 0.03995 1.25733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0066686 0.0019433 3.432 0.000603 ***
## vol_1y 0.0072119 0.0035422 2.036 0.041785 *
## mkt_cap 0.0007097 0.0042091 0.169 0.866112
## p2b -0.0073691 0.0036211 -2.035 0.041883 *
## d2e 0.0052430 0.0044268 1.184 0.236303
## revenue -0.0055841 0.0043143 -1.294 0.195587
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07516 on 7383 degrees of freedom
## Multiple R-squared: 0.001487, Adjusted R-squared: 0.0008107
## F-statistic: 2.199 on 5 and 7383 DF, p-value: 0.05161
If we set the significance threshold at 5%, then two variables stand out, albeit weakly: vol_1y (positive effect) and p2b (negative effect). Note that mkt_cap is not significant with this scaling; with the uniformised data below, however, future returns are negatively related to firm size, a pattern that will resurface in later analyses (with trees, notably).
We now turn to the ‘uniformised’ data.
fit <- lm(f_return ~ ., data = data_u) # Regression on second set
summary(fit) # Regression output
##
## Call:
## lm(formula = f_return ~ ., data = data_u)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59001 -0.03975 -0.00182 0.04003 1.26167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0077726 0.0037887 2.052 0.0403 *
## vol_1y 0.0063118 0.0031379 2.011 0.0443 *
## mkt_cap -0.0088907 0.0042861 -2.074 0.0381 *
## p2b 0.0058436 0.0036268 1.611 0.1072
## d2e -0.0047801 0.0036521 -1.309 0.1906
## revenue 0.0004112 0.0038159 0.108 0.9142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07514 on 7383 degrees of freedom
## Multiple R-squared: 0.001867, Adjusted R-squared: 0.001191
## F-statistic: 2.762 on 5 and 7383 DF, p-value: 0.01695
In this case, mkt_cap becomes (weakly) significant with a negative sign and vol_1y remains a weak driver, while the impact of p2b fades. This underlines the importance of feature engineering. For nonlinear models, we refer to Galili and Meilijson (2016 - arxiv) for an analysis of the impact of feature transformation on tree splits.
To illustrate the difference between the two rescaling methods, we build a simple dataset, comprising 3 firms and 3 dates.
firm <- c(rep(1,3), rep(2,3), rep(3,3)) # Firms (3 lines for each)
date <- rep(c(1,2,3),3) # dates
cap <- c(10, 50, 100, # Market capitalisation
15, 10, 15,
200, 120, 80)
return <- c(0.06, 0.01, -0.06, # Return values
-0.03, 0.00, 0.02,
-0.04, -0.02,0.00)
data_toy <- data.frame(firm, date, cap, return) # Aggregation of data
data_toy <- data_toy %>% # Transformation of data
group_by(date) %>%
mutate(cap_0 = norm_0_1(cap), cap_u = norm_unif(cap))
data_toy # Display the data
## # A tibble: 9 × 6
## # Groups: date [3]
## firm date cap return cap_0 cap_u
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 10 0.06 0 0.333
## 2 1 2 50 0.01 0.364 0.667
## 3 1 3 100 -0.06 1 1
## 4 2 1 15 -0.03 0.0263 0.667
## 5 2 2 10 0 0 0.333
## 6 2 3 15 0.02 0 0.333
## 7 3 1 200 -0.04 1 1
## 8 3 2 120 -0.02 1 1
## 9 3 3 80 0 0.765 0.667
Let’s briefly comment on this synthetic data. We assume that dates are ordered chronologically and far away: each date stands for a year or the beginning of a decade, but the (forward) returns are computed on a monthly basis. The first firm is hugely successful and multiplies its cap ten times over the periods. The second firm remains stable cap-wise, while the third one plummets. If we look at ‘local’ future returns, they are strongly negatively related to size for the first and third firms. For the second one, there is no clear pattern.
Date-by-date, the analysis is fairly similar, though slightly nuanced. While the relationship is not always perfectly monotonic, there seems to be a link between size and return: typically, investing in the smallest firm would have been a very good strategy with this sample.
Now let’s look at the regressions.
lm(return ~ cap_0, data = data_toy) %>% summary() # First regression (min-max rescaling)
##
## Call:
## lm(formula = return ~ cap_0, data = data_toy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044970 -0.016278 0.003722 0.013425 0.043722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01628 0.01374 1.185 0.2746
## cap_0 -0.04970 0.02137 -2.326 0.0529 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02867 on 7 degrees of freedom
## Multiple R-squared: 0.4359, Adjusted R-squared: 0.3553
## F-statistic: 5.409 on 1 and 7 DF, p-value: 0.05294
lm(return ~ cap_u, data = data_toy) %>% summary() # Second regression (uniformised feature)
##
## Call:
## lm(formula = return ~ cap_u, data = data_toy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02667 -0.02000 0.00000 0.01667 0.03333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.06000 0.01981 3.028 0.01916 *
## cap_u -0.10000 0.02752 -3.634 0.00835 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02247 on 7 degrees of freedom
## Multiple R-squared: 0.6536, Adjusted R-squared: 0.6041
## F-statistic: 13.21 on 1 and 7 DF, p-value: 0.008351
In terms of p-value (last column), the first estimation for the cap coefficient is above 5% while the second is below 1%. One possible explanation for this discrepancy is the standard deviation of the variables. The deviations are equal to 0.47 and 0.29 for cap_0 and cap_u, respectively. Values like market capitalisations can have very large ranges and are thus subject to substantial deviations (even after scaling). Working with uniformised variables reduces dispersion and can help solve this problem.
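These dispersions can be checked directly (a one-line sketch):
data_toy %>% ungroup() %>% summarise(sd_cap_0 = sd(cap_0), sd_cap_u = sd(cap_u)) # Roughly 0.47 vs 0.29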
Note that this is a double-edged sword: while it can help avoid false negatives, it can also lead to false positives.
In supervised learning, regression tasks consist of explaining or predicting real numbers, while classification exercises rely on categorical data, which are often harder to handle. Below, we talk interchangeably of categories or classes.
First, we start from the scaled sample data_u, duplicate it and create a categorical variable, fc_return (forward categorical). We use a simple thresholding function for that purpose. It is defined as \[\tilde{y}_{t,i}=\left\{ \begin{array}{rll} -1 & \text{ if } & y_{t,i} < -r \\ 0 & \text{ if } & y_{t,i} \in [-r,r] \\ +1 & \text{ if } & y_{t,i} > r \\ \end{array} \right.\] for some threshold value \(r>0\). Intuitively, if the dependent variable y is a measure of profitability, a +1 value signals a buying intention, a -1 value a selling intention, while a zero value recommends the status quo.
thresh_coding <- function(x, r){ # The function that creates the categorical data
z <- x # Duplicate original data
z[x < (-r)] <- (-1) # Sell!
z[x > r] <- 1 # Buy!
z[x >= (-r) & x <= r] <- 0 # Hold!
return(z)
}
r_thresh <- 0.02 # Decision threshold
data_c <- round(data_u,3) # Duplicate data and round numbers (3 decimals is enough)
data_c <- data_c %>% mutate(fc_return = thresh_coding(f_return, r_thresh))
head(data_c) # Let's check!
## # A tibble: 6 × 7
## vol_1y mkt_cap p2b d2e revenue f_return fc_return
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.958 0.125 0.25 0.208 0.125 -0.098 -1
## 2 0.208 0.333 0.292 0.625 0.75 -0.044 -1
## 3 0.5 0.25 0.375 0.292 0.25 0.054 1
## 4 0.125 0.375 0.125 0.583 0.792 0.031 1
## 5 0.25 0.083 0.083 0.875 0.042 0.117 1
## 6 0.625 0.5 0.333 0.458 0.375 0 0
The head of the sample confirms the coding of the categorical variable. We now turn it into a proper factor and look at the class counts.
data_c$fc_return <- data_c$fc_return %>% as.factor() # Turn it into real category!
summary(data_c$fc_return)
## -1 0 1
## 2308 2172 2909
The last line shows the relative importance of the categories, which greatly depends on r_thresh. Our choice of r_thresh gives a rather balanced distribution for the categories. Smaller values of r_thresh would decrease the number of zeros and increase the number of \(\pm 1\).
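To gauge this sensitivity (a quick sketch, with arbitrary threshold values), we can tabulate the class counts for a few candidate thresholds. The next chunk then one-hot encodes the categorical variable.
sapply(c(0.01, 0.02, 0.05), # Candidate thresholds
       function(r) table(thresh_coding(data_u$f_return, r))) # Class counts for each choice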
library(fastDummies) # One package for one-hot encoding
one_hot <- dummy_cols(data_c$fc_return) # The encoding
colnames(one_hot) <- c("Original", "Sell", "Hold", "Buy")
data_c <- cbind(data_c, one_hot) # Combine the two
data_c %>% # Let's see!
select(-vol_1y, -mkt_cap) %>% # Remove a few variables first..
head()
## p2b d2e revenue f_return fc_return Original Sell Hold Buy
## 1 0.250 0.208 0.125 -0.098 -1 -1 1 0 0
## 2 0.292 0.625 0.750 -0.044 -1 -1 1 0 0
## 3 0.375 0.292 0.250 0.054 1 1 0 0 1
## 4 0.125 0.583 0.792 0.031 1 1 0 0 1
## 5 0.083 0.875 0.042 0.117 1 1 0 0 1
## 6 0.333 0.458 0.375 0.000 0 0 0 1 0
The number of additional columns is logically equal to the number of categories. One-hot encoding will mostly be used for neural networks (decision trees handle categorical variables quite well).
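A base-R alternative (a quick sketch): model.matrix produces the same kind of indicator columns without an extra package (the - 1 removes the intercept so that all three levels appear).
head(model.matrix(~ fc_return - 1, data = data_c)) # One indicator column per class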
When there are only two classes (e.g., buy vs sell), it is possible to extend the concept of regression analysis to categorical data. This classical tool is referred to as logistic regression because it is linked to the logistic function \((1+e^{-x})^{-1}\). This function takes values in [0,1], which is convenient because the output can be interpreted as the probability of belonging to one class. If we use the simple model \[\frac{p}{1-p}=e^{\beta_0+\sum_{i=1}^N\beta_ix_i},\]
where \(p\) is the probability to buy, then \(p=\left(1+e^{-\beta_0-\sum_{i=1}^N\beta_ix_i}\right)^{-1}\). The \(\beta_j\) are usually estimated via maximum likelihood (assuming i.i.d. Bernoulli samples).
Below, we show how this kind of model can be very simply implemented in R.
data_c$fb_Return <- (data_c$f_return>0) # Binary variable: 1 (true) if positive return, 0 if not
fit <- glm(fb_Return ~ mkt_cap + p2b + vol_1y + d2e + revenue, # Generalised linear model
family = "binomial", # Binary variable
data = data_c) # Data source
summary(fit) # Output of regression
##
## Call:
## glm(formula = fb_Return ~ mkt_cap + p2b + vol_1y + d2e + revenue,
## family = "binomial", data = data_c)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.27768 0.10108 2.747 0.00601 **
## mkt_cap -0.13271 0.11431 -1.161 0.24568
## p2b 0.10731 0.09672 1.109 0.26725
## vol_1y -0.01910 0.08370 -0.228 0.81946
## d2e -0.12762 0.09738 -1.310 0.19004
## revenue -0.15802 0.10174 -1.553 0.12039
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10223 on 7388 degrees of freedom
## Residual deviance: 10212 on 7383 degrees of freedom
## AIC: 10224
##
## Number of Fisher Scoring iterations: 3
Each unit increase in mkt_cap multiplies the odds of a positive return by a factor \(e^{-0.133}\approx 0.876\), i.e., decreases them (though the coefficient is not statistically significant here). Finally, we show the syntax for prediction.
predict(fit, data_c[1:3,]) # Predicts the log-odds
## 1 2 3
## 0.22332137 0.06257379 0.19842360
predict(fit, data_c[1:3,], type = "response") # Gives the probability
## 1 2 3
## 0.5555995 0.5156383 0.5494438
1/(1+exp(-predict(fit, data_c[1:3,]))) # Check via logistic function
## 1 2 3
## 0.5555995 0.5156383 0.5494438
Several output formats are possible. The log-odds are \(\log\left(\frac{p}{1-p}\right)\). The probability of buying is obtained via the logistic function.
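Relatedly (a one-liner, as a sketch), exponentiating the coefficients yields the multiplicative impact of each feature on the odds of a positive return.
exp(coef(fit)) # Odds ratios; the mkt_cap entry matches the 0.876 factor mentioned above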
Refinements in labelling include:
(subtitle: ANOTHER STEP TOWARDS ML-BASED FACTOR INVESTING)
The core idea of the course is to translate data about firms into investment decisions. We seek to infer correlation (‘causal’ would be too strong) patterns between corporations’ characteristics and their subsequent returns. The base model is \[\mathbf{y}=f(\mathbf{X})+\mathbf{\epsilon},\]
where y is some proxy for future performance while X embeds the current values of firm attributes. More precisely, for one given date, t, we will try to estimate \(f_t\) \[y_{s+1,j}=f_t\left(x_{s,j}^{(1)}, \dots, x_{s,j}^{(K)}\right)+\mathbf{\epsilon}_{s,j}, \quad s= t-1, \ t-2,\dots\]
where j is the index of the firm and s the index of time. In the above equation, there are K features or firm attributes. To make things simple, we will consider that y is the return of the asset. This (future) return will thus be a function of the current values of stocks’ characteristics.
If our sample comprises only one date (s = t-1), then we will try to build a model that maps the features of t-1 to the returns computed from t-1 to t (i.e., after the features are observed). This stage, where we determine the optimal (in some sense) model \(f_t\) according to past observations, is called training: the parameters of the model are adjusted so that the predicted values are as close as possible to those that actually occurred.
Using only one date will probably be insufficient and the model will lack robustness. Hence, it will be preferable to train the model on many dates so that the model learns from different states of the market.
Before we move to a simple application of this principle using a linear \(f_t\), we illustrate the concept of learning on increasingly large samples.
Below, we use the gganimate package to demonstrate the simple idea behind training.
if(!require(gganimate)){install.packages("gganimate")}
library(gganimate)
Iteration <- 16 # Number of steps in the animation
reg_data <- data.frame( # Create dataframe with:
x = sin(1:Iteration), # x-axis (independent/explanatory variable)
iteration = Iteration, # Index of the step
status = c(rep("Past",Iteration-1),"Current") # Which point is new versus old
)
reg_data <- reg_data %>% mutate(y = x + rnorm(Iteration)/3) # Linear Data Generator
tmp <- reg_data # Copy of the data with fewer points to pass to animation
for (j in 1:(Iteration-2)) { # Index of animation step
tmp <- tmp[-nrow(tmp),] # Take out one point
tmp$iteration <- Iteration - j # Change animation step number
tmp$status[nrow(tmp)] <- "Current" # Change label of current point
reg_data <- rbind(reg_data, tmp) # Aggregate
}
p <- ggplot(reg_data) + theme_light() +
geom_point(aes(x = x, y = y, color = status), size = 4) + # Points
stat_smooth(aes(x = x, y = y), color = "black", method = "lm", se = FALSE) + # Linear model
transition_manual(iteration) # Animation variable
#anim <- animate(p, duration = 20) # 20 second animation
animate(p, duration = 5, fps = 20, renderer = gifski_renderer())
anim_save(filename = "reg.gif") # Saving for export/include
Each time a new (red) point is added to the training data, the linear model is adjusted: the black line comes slightly closer to the new point. As more data accumulates, the impact of each new point becomes less and less pronounced because the parameter values converge. If the data generating process exhibits some stationarity properties (samples follow a recurrent pattern) and if the model type is well chosen, learning will be effective.
In this section, we will consider a linear model: \[f_t\left(x_{s,j}^{(1)}, \dots, x_{s,j}^{(K)}\right)=\alpha_t+\sum_{k=1}^K\beta_{t}^{(k)}x_{s,j}^{(k)},\]
where the parameters \(\alpha_t\) and \(\beta_t^{(k)}\) are estimated via simple Ordinary Least Squares. In the above expression, we do not refer to the error term, but do not forget that it exists!
First, we have a look at the full sample (using data_u).
lm(f_return ~ ., data = data_u) %>% summary() # Regression output
##
## Call:
## lm(formula = f_return ~ ., data = data_u)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59001 -0.03975 -0.00182 0.04003 1.26167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0077726 0.0037887 2.052 0.0403 *
## vol_1y 0.0063118 0.0031379 2.011 0.0443 *
## mkt_cap -0.0088907 0.0042861 -2.074 0.0381 *
## p2b 0.0058436 0.0036268 1.611 0.1072
## d2e -0.0047801 0.0036521 -1.309 0.1906
## revenue 0.0004112 0.0038159 0.108 0.9142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07514 on 7383 degrees of freedom
## Multiple R-squared: 0.001867, Adjusted R-squared: 0.001191
## F-statistic: 2.762 on 5 and 7383 DF, p-value: 0.01695
As a confirmation of our previous exercise, mkt_cap stands out (negatively): the presence of the other predictors does not alter its link with future returns.
We now turn to portfolio choice.
First, we create the weights function. The strategy forecasts future returns and invests equally in the assets whose forecasts lie above a cross-sectional quantile of the predictions (the threshold is set to 0.8 in the backtest below, i.e., roughly the top 20%). This is in fact a particular case of the penalised predictive regression seen in the previous (LASSO) session, except that here the regression is not penalised.
We first rebuild the original data with data imputation.
load("data_new.RData") # Loading the data
data <- data %>% arrange(date, symbol) # Just making sure all is in order
data <- data %>%
select(-scope_1) %>% # Remove ESG: too many missing points
group_by(symbol) %>% # Group asset by asset
mutate(p_return = close / lag(close) - 1) %>% # Adding past returns
mutate(f_return = lead(p_return)) %>% # Adding forward returns: one could do better
ungroup() # Ungroup
impute_ <- function(v){
v[is.na(v)] <- median(v, na.rm = T)
v
}
data <- data |>
group_by(date) |>
mutate(across(where(is.numeric), impute_)) |>
ungroup() |>
na.omit()
data_u <- data %>% # Data with normalised values
group_by(date) %>% # Normalisation takes place for each given date
mutate_if(is.numeric, norm_unif) %>% # Apply normalisation on numerical columns
ungroup() %>% # Ungroup
select(-symbol, -date, -close) # Take out superfluous columns
## `mutate_if()` ignored the following grouping variables:
## • Column `date`
data_u$f_return <- data$f_return # We recover the original returns
weights_multi <- function(past_data, curr_data, j, thresh){
if(j == 1){ # j = 1 => regression-based
fit <- lm(f_return ~ ., data = past_data) # Training on past data
pred <- predict(fit, curr_data) # Prediction
w <- pred > quantile(pred, thresh) # Investment decision: TRUE = 1
return(w / sum(w))
}
if(j == 2){ # j = 2 => EW
N <- nrow(curr_data)
return(rep(1/N,N))
}
}
We then initiate the backtesting variables.
t_all <- unique(data$date) # Set of dates
sep_date <- as.Date("2010-01-01") # This date separates in-sample vs out-of-sample
t_oos <- t_all[t_all > sep_date] # Out-of-sample dates (i.e., testing set)
nb_port <- 2 # Nb of portfolios
Tt <- length(t_oos) # Nb dates
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick))) # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Store returns
Finally, we are ready for the backtesting loop.
thresh <- 0.8
data_u <- data.frame(data_u, date = data$date) # We recover the dates from original file
for(t in 1:length(t_oos)){
past_data <- data_u %>% # The source is data_u!
filter(date < t_oos[t]) %>% # Past dates: expanding window!
select(-date) # Take out dates
curr_data <- data_u %>% # The source is data_u!
filter(date == t_oos[t]) %>% # Current date
select(-date) # Take out dates
for(j in 1:nb_port){ # The loop on the strategies
portf_weights[t,j,] <- weights_multi(past_data, # Portfolio weights...
curr_data,
j, # The weights' function is indexed by j
thresh) # The threshold
realised_returns <- data_u %>% # Realised returns: take data_u
filter(date == t_oos[t]) %>% # Filter the right date
select(f_return) # Take the returns only
portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns) # Portfolio returns
}
}
12*portf_returns %>% apply(2, mean) # Average return
## [1] 0.1075491 0.1109401
sqrt(12)*portf_returns %>% apply(2, sd) # Volatility
## [1] 0.1861675 0.1360854
sqrt(12)*portf_returns %>% apply(2, mean) / portf_returns %>% apply(2, sd)
## [1] 0.5777008 0.8152241
The regression-based allocation is slightly more profitable, but also more volatile: in the end, both Sharpe ratios are close. In order to statistically assess whether a strategy is significantly preferable, it is possible to perform a t-test on the difference of returns.
t_test <- t.test(portf_returns[,1]-portf_returns[,2])
t_test
##
## One Sample t-test
##
## data: portf_returns[, 1] - portf_returns[, 2]
## t = -0.1341, df = 180, p-value = 0.8935
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.004440608 0.003875441
## sample estimates:
## mean of x
## -0.0002825835
The difference is not statistically significant: a t-stat of -0.134 (p-value of 0.89) is far from any usual rejection threshold, so we cannot conclude that one strategy dominates the other.
Inputs in financial backtesting can all be considered as time-series. There are many ways these series can be characterised, but we underline one important property: persistence. One simple way to assess persistence is to compute autocorrelations.
acf_2 <- function(v){ # We build a function that keeps the first order correlation
return(acf(v, lag.max = 1, plot = FALSE)$acf[2])
}
data %>%
group_by(symbol) %>%
summarise_if(is.numeric, acf_2) # We apply this function to features
## # A tibble: 30 × 9
## symbol close vol_1y mkt_cap p2b d2e revenue p_return f_return
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.979 0.956 0.978 0.973 0.991 0.991 0.0445 0.0634
## 2 BA 0.983 0.959 0.984 0.988 0.986 0.985 0.0538 0.0515
## 3 BAC 0.980 0.977 0.969 0.985 0.985 0.991 0.108 0.102
## 4 C 0.989 0.978 0.968 0.977 0.985 0.991 0.0554 0.0548
## 5 CSCO 0.972 0.970 0.909 0.906 0.940 0.990 -0.0661 -0.0691
## 6 CVS 0.988 0.941 0.987 0.946 0.982 0.987 -0.116 -0.0900
## 7 CVX 0.975 0.949 0.974 0.955 0.982 0.970 -0.109 -0.107
## 8 D 0.986 0.942 0.988 0.973 0.964 0.967 -0.108 -0.0971
## 9 DIS 0.986 0.950 0.985 0.970 0.968 0.985 -0.00278 -0.00258
## 10 F 0.940 0.962 0.962 0.956 0.975 0.964 -0.0120 -0.0130
## # ℹ 20 more rows
Almost all attributes are highly autocorrelated at the monthly frequency. This is obviously true for mkt_cap and vol_1y, which are indeed expected to show some memory over time.
The main issue is that the last column, which is the predicted variable, is not persistent at all! This is a major problem: it is unlikely that much good will come out of a model, even a very sophisticated one. If the returns oscillate very quickly while the predictors have very smooth trajectories, then there is a framing issue. What will probably happen is that the model/algorithm will find spurious relationships/arbitrages between variables that will fail out-of-sample.
Two solutions (at least) exist: either make the features less persistent, for instance by working with their variations, or smooth the dependent variable, for instance by using longer-horizon returns.
Below, we provide examples of implementation for each alternative. We start with the second one.
data_1 <- data %>%
group_by(symbol) %>% # Grouping
mutate(f_return = lead(close) / close - 1, # Adding 1M future returns
fl_return = lead(close, 12) / close - 1) %>% # Adding 12M (Long-term) future returns
na.omit() %>% # Take out missing data
ungroup() # Ungrouping
data_1 %>% group_by(symbol) %>%
summarise(autocor = acf_2(fl_return)) # Showing the results
## # A tibble: 30 × 2
## symbol autocor
## <chr> <dbl>
## 1 AAPL 0.925
## 2 BA 0.918
## 3 BAC 0.848
## 4 C 0.881
## 5 CSCO 0.865
## 6 CVS 0.901
## 7 CVX 0.889
## 8 D 0.881
## 9 DIS 0.903
## 10 F 0.896
## # ℹ 20 more rows
By looking 12 months ahead, we are smoothing the dependent variable, but we are also losing some data points. The result (in terms of autocorrelation patterns) is nonetheless satisfactory.
Finally, we turn to feature differences, which is the first option.
data_2 <- data %>%
group_by(symbol) %>% # Group by stock
mutate(D_mkt_cap = mkt_cap / lag(mkt_cap) - 1, # Relative variation: scale issue
D_p2b = p2b - lag(p2b), # Absolute variation for the rest
D_vol_1y = vol_1y - lag(vol_1y),
D_d2e = d2e - lag(d2e),
D_rev = revenue - lag(revenue)) %>%
na.omit() %>% # Take out missing data
ungroup() # Ungroup!
data_2 %>% group_by(symbol) %>%
summarise_if(is.numeric, acf_2) %>%
mutate_if(is.numeric, round, digits = 3) # Show the results
## # A tibble: 30 × 14
## symbol close vol_1y mkt_cap p2b d2e revenue p_return f_return D_mkt_cap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.979 0.944 0.978 0.973 0.991 0.99 0.063 0.096 -0.029
## 2 BA 0.982 0.959 0.985 0.988 0.986 0.985 0.051 0.054 -0.006
## 3 BAC 0.981 0.977 0.968 0.985 0.985 0.991 0.102 0.106 -0.009
## 4 C 0.987 0.978 0.968 0.977 0.985 0.991 0.055 0.058 -0.002
## 5 CSCO 0.972 0.97 0.902 0.896 0.94 0.99 -0.07 -0.073 0
## 6 CVS 0.987 0.943 0.987 0.943 0.982 0.987 -0.116 -0.09 -0.005
## 7 CVX 0.975 0.949 0.973 0.954 0.981 0.97 -0.108 -0.107 -0.01
## 8 D 0.985 0.938 0.988 0.973 0.964 0.966 -0.098 -0.091 -0.009
## 9 DIS 0.986 0.954 0.985 0.97 0.968 0.985 -0.003 -0.003 -0.005
## 10 F 0.928 0.962 0.962 0.956 0.975 0.964 -0.013 -0.013 -0.003
## # ℹ 20 more rows
## # ℹ 4 more variables: D_p2b <dbl>, D_vol_1y <dbl>, D_d2e <dbl>, D_rev <dbl>
The difference operator did wipe out the autocorrelation of the transformed features.
Finally, do not hesitate to work with relative performance with respect to a benchmark. For instance, removing the cross-sectional average return at each point in time will single out the stocks with above-average performance. Moreover, the cross-sectional average of this indicator is constant through time (equal to zero) and hence not subject to macro-economic fluctuations.
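A minimal sketch of such benchmark-relative labels (here the benchmark is simply the equally-weighted cross-sectional average, an assumption made for illustration):
data %>%
  group_by(date) %>%
  mutate(rel_return = f_return - mean(f_return, na.rm = TRUE)) %>% # Return in excess of the cross-sectional average
  ungroup() %>%
  summarise(grand_mean = mean(rel_return, na.rm = TRUE)) # Zero by construction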
Feature engineering is not all: labelling is also very important. Recycle the code above to change the dependent variable (future returns) into:
Create a weighting scheme (function) that is based on the categorical output (-1, 0, 1). The weights are initialised uniformly, then:
* if the output is 1, the prior weight is one
* if the output is -1, the prior weight is zero
* if the output is 0, the prior weight is the sign of the previous weight
* finally, normalise the prior weights so that they sum to one (a possible sketch follows).
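Here is one possible reading of this scheme (the function name and arguments are illustrative, not part of the course material):
weights_cat <- function(signal, prev_weights){ # signal takes values in {-1, 0, 1}
  w <- rep(1 / length(signal), length(signal)) # Uniform initialisation
  w[signal == 1] <- 1 # Buy signal: prior weight of one
  w[signal == -1] <- 0 # Sell signal: prior weight of zero
  w[signal == 0] <- sign(prev_weights[signal == 0]) # Hold: sign of the previous weight
  if(sum(w) == 0) w <- rep(1 / length(w), length(w)) # Guard against an all-zero allocation
  return(w / sum(w)) # Normalise so that the weights sum to one
}
weights_cat(signal = c(1, -1, 0, 0), prev_weights = c(0.4, 0.2, 0.3, 0.1)) # Toy call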
Verify if the persistence of attributes such as market capitalisation or volatility remains after the rescaling of features.
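A possible sketch for this check (it reuses norm_unif and acf_2 from above; the choice of columns is ours):
data %>%
  group_by(date) %>%
  mutate(across(c(mkt_cap, vol_1y), norm_unif)) %>% # Uniformise the features at each date
  group_by(symbol) %>% # Then look at each stock's time-series
  summarise(across(c(mkt_cap, vol_1y), acf_2)) # First-order autocorrelation of the rescaled features
Independently, the final chunk below revisits the backtest and maps the average returns of both strategies over a grid of threshold values.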
f_thresh <- function(thresh){
for(t in 1:length(t_oos)){
past_data <- data_u %>% # The source is data_u!
filter(date < t_oos[t]) %>% # Past dates: expanding window!
select(-date) # Take out dates
curr_data <- data_u %>% # The source is data_u!
filter(date == t_oos[t]) %>% # Current date
select(-date) # Take out dates
for(j in 1:nb_port){ # The loop on the strategies
portf_weights[t,j,] <- weights_multi(past_data, # Portfolio weights...
curr_data,
j, # The weights' function is indexed by j
thresh) # The threshold
realised_returns <- data_u %>% # Realised returns: take data_u
filter(date == t_oos[t]) %>% # Filter the right date
select(f_return) # Take the returns only
portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns) # Porfolio returns
}
}
return(portf_returns %>% apply(2, mean))
}
map(seq(0.1,0.9, by = 0.1),
f_thresh)
## [[1]]
## [1] 0.009024008 0.009245011
##
## [[2]]
## [1] 0.009085861 0.009245011
##
## [[3]]
## [1] 0.008910866 0.009245011
##
## [[4]]
## [1] 0.009602975 0.009245011
##
## [[5]]
## [1] 0.009707873 0.009245011
##
## [[6]]
## [1] 0.010056054 0.009245011
##
## [[7]]
## [1] 0.009161366 0.009245011
##
## [[8]]
## [1] 0.008962428 0.009245011
##
## [[9]]
## [1] 0.009414032 0.009245011