\(\rightarrow\) ABOUT THE NOTEBOOK:
This is the companion notebook dedicated to data processing. This stage can easily be overlooked but is crucial in machine learning. We briefly discuss categorical variables and logistic regression. We also take a few steps towards ML-driven factor investing in the last section.
First, one of the most important issues: MISSING DATA!
Below, we import packages & the data.
if(!require(naniar)){install.packages(c("naniar", "visdat", "zoo"))}
library(tidyverse) # Loading THE package/environment
library(naniar) # Missing data exploration
library(visdat) # Missing data visualization
library(zoo) # Great package for time-series
load("data_new.RData")
data$p2b[999:1010] <- NA # Removing data to illustrate imputation
Let’s determine which variables are the most affected… (this step takes a bit of time because we use the large dataset)
gg_miss_var(data) # Proportion of missing points per column
vis_dat(data, warn_large_data = F) # Visualization of data types per column
The two most affected columns are indeed revenue and scope_1 (emissions). Below, we illustrate two different ways of handling missing values: time-series imputation and cross-sectional imputation, using the p2b ratios (from which we artificially removed a few points above) as a testbed. First, let us take a closer look.
data %>% filter(is.na(p2b)) # Where are the missing p2b ratios?
## # A tibble: 133 × 9
## date symbol close vol_1y mkt_cap p2b d2e revenue scope_1
## <date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2011-01-31 C 48.2 0.296 137446045212. NA 392. NA 39120
## 2 2011-02-28 C 46.8 0.303 137446045212. NA 392. NA 39120
## 3 2011-03-31 C 44.2 0.257 137446045212. NA 392. NA 39120
## 4 2011-04-30 C 45.9 0.248 137446045212. NA 392. NA 39120
## 5 2011-05-31 C 41.2 0.253 137446045212. NA 392. NA 39120
## 6 2011-06-30 C 41.6 0.245 137446045212. NA 392. NA 39120
## 7 2011-07-31 C 38.3 0.244 137446045212. NA 392. NA 39120
## 8 2011-08-31 C 31.0 0.297 137446045212. NA 392. NA 39120
## 9 2011-09-30 C 25.6 0.330 137446045212. NA 392. NA 39120
## 10 2011-10-31 C 31.6 0.413 137446045212. NA 392. NA 39120
## # ℹ 123 more rows
data %>%
filter(symbol == "C") %>%
ggplot(aes(x = date, y = p2b)) + geom_line() + theme_light()
For time-series imputation, we will use the na.locf function from the zoo package. Let’s see what it does.
seq_na <- sin(1:24) %>% round(3)
seq_na[10:12] <- NA
seq_na
## [1] 0.841 0.909 0.141 -0.757 -0.959 -0.279 0.657 0.989 0.412 NA
## [11] NA NA 0.420 0.991 0.650 -0.288 -0.961 -0.751 0.150 0.913
## [21] 0.837 -0.009 -0.846 -0.906
na.locf(seq_na)
## [1] 0.841 0.909 0.141 -0.757 -0.959 -0.279 0.657 0.989 0.412 0.412
## [11] 0.412 0.412 0.420 0.991 0.650 -0.288 -0.961 -0.751 0.150 0.913
## [21] 0.837 -0.009 -0.846 -0.906
The missing points were replaced by the last preceding value. Handy!
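One caveat (a quick sketch below): na.locf cannot fill values at the very start of a series, since there is no preceding observation. This is why we keep na.rm = FALSE in the panel code that follows, so that vector lengths are preserved.
seq_na2 <- c(NA, seq_na) # Add a leading NA
na.locf(seq_na2, na.rm = FALSE) # The leading NA remains: there is nothing to carry forward
na.locf(seq_na2, fromLast = TRUE) # Backward fill (next observation carried back) is one alternative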
data %>%
group_by(symbol) %>% # Perform the operation for EACH stock
mutate(p2b = na.locf(p2b, na.rm = F)) %>% # Replace p2b by imputed p2b
ungroup() %>% # Ungroup
filter(symbol == "C", # Let's check!
date > "2005-01-01",
date < "2015-01-01") %>%
ggplot(aes(x = date, y = p2b)) + geom_line() + theme_light() +
geom_point(data = data |> filter(symbol == "C", year(date) == 2011), color = "red")
No more holes!
Another way to proceed is to assign a mean or median score, as if the missing value for the company is equal to some “average” or representative value. The median is more robust than the mean because it is less affected by outliers.
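As a quick aside (toy numbers only), the chunk below illustrates this robustness; we then apply the cross-sectional (median) imputation to the panel.
x_toy <- c(1, 2, 3, 100) # One extreme value
mean(x_toy) # 26.5: pulled towards the outlier
median(x_toy) # 2.5: barely affected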
data %>%
group_by(date) %>% # Perform the operation for EACH date
mutate(med_p2b = median(p2b, na.rm = T)) %>% # Compute the median p2b
ungroup() %>% # Ungroup
mutate(p2b = if_else(is.na(p2b), med_p2b, p2b)) %>% # If p2b is NA, replace it! If not, don't.
filter(symbol == "C", # Let's check!
date > "2005-01-01",
date < "2015-01-01") %>%
ggplot(aes(x = date, y = p2b)) + geom_line() + theme_light() +
geom_point(data = data |> filter(symbol == "C", year(date) == 2011), color = "red")
Here again, the holes have been filled, but the new values are very different.
We choose to skip the complicated topic of outlier detection to focus directly on feature engineering (we refer to Aggarwal (2016) for an in-depth treatment of outlier analysis). Below, we start with artificially generated data and introduce simple rescaling methods.
On the particular topic of feature engineering, we recommend the book http://www.feat.engineering.
Length <- 100 # Length of the sequence
x <- exp(sin(1:Length)) # Original data
data_simu <- data.frame(index = 1:Length, x = x) # Data framed into dataframe
ggplot(data_simu, aes(x = index, y = x)) + # Plot
geom_col() + theme_light()
Nothing special about this sample. We choose to work with positive values, but this is without much loss of generality. Below, we scale the variable using 3 methods:
standard <- (x - mean(x)) / sd(x) # Standardisation
norm_0_1 <- (x - min(x)) / (max(x)-min(x)) # [0,1] reduction
unif <- ecdf(x)(x) # Uniformisation
data_norm <- data.frame( # Formatting the data
index = 1:Length, # Index of point/instance
standard = standard,
norm_0_1 = norm_0_1,
unif = unif) %>% pivot_longer(-index, names_to = "Type", values_to = "value")
ggplot(data_norm, aes(x = index, y = value, fill = Type)) + # Plot!
geom_bar(stat = "identity", position = "dodge") + theme_light()
ggplot(data_norm, aes(x = index, y = value, fill = Type)) + # Plot!
geom_col() + theme_light() +
facet_grid(Type~.) # This option creates 3 concatenated graphs to ease comparison
By definition, standardisation produces both positive and negative values. In terms of distribution of the data, this gives the following differences:
ggplot(data_norm, aes(x = value, fill = Type)) +
geom_histogram(position = "dodge") + theme_light() +
facet_grid(Type ~.)
With respect to shape, the green and red distributions are close to the original one. It is only the support that changes: the min/max rescaling ensures all values lie in the [0,1] interval. In both cases, the smallest values (on the left) display a spike in distribution. By construction, this spike disappears under the uniformisation: the points are evenly distributed over the unit interval.
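As a quick sanity check (one line, assuming no ties in the sample), the uniformised values are simply normalised ranks, which is why their histogram is flat.
all.equal(unif, rank(x, ties.method = "max") / Length) # ecdf(x)(x) coincides with normalised ranks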
In this section, we investigate the impact of these techniques and choices on simple analyses. We compare the impact of rescaling on basic regression results.
We start by importing the smaller dataset and computing returns (our dependent variable).
data <- data %>% arrange(date,symbol) # Just making sure all is in order
tick <- unique(data$symbol) # Set of assets
data <- data %>%
group_by(symbol) %>% # Perform analysis stock-by-stock
mutate(f_return = lead(close) / close - 1) %>% # Adding forward/future returns
select(-scope_1) %>% # Too many missing points for emissions
na.omit() %>% # Take out missing data
ungroup()
summary(data) # Descriptive statistics
## date symbol close vol_1y
## Min. :2000-11-30 Length:7389 Min. : 0.2525 Min. :0.05741
## 1st Qu.:2007-03-31 Class :character 1st Qu.: 29.3100 1st Qu.:0.15773
## Median :2013-03-31 Mode :character Median : 52.5312 Median :0.20503
## Mean :2013-03-09 Mean : 75.0324 Mean :0.23624
## 3rd Qu.:2019-02-28 3rd Qu.: 92.2700 3rd Qu.:0.28197
## Max. :2025-01-31 Max. :960.0200 Max. :1.59204
## mkt_cap p2b d2e revenue
## Min. :5.024e+09 Min. : 0.7486 Min. : 0.00 Min. :4.827e+09
## 1st Qu.:5.598e+10 1st Qu.: 2.1620 1st Qu.: 30.48 1st Qu.:3.759e+10
## Median :1.318e+11 Median : 3.2933 Median : 59.19 Median :6.661e+10
## Mean :1.856e+11 Mean : 13.1589 Mean : 246.84 Mean :1.027e+11
## 3rd Qu.:2.110e+11 3rd Qu.: 5.9361 3rd Qu.: 155.28 3rd Qu.:1.299e+11
## Max. :3.463e+12 Max. :540.0111 Max. :5814.97 Max. :6.810e+11
## f_return
## Min. :-0.578846
## 1st Qu.:-0.032787
## Median : 0.004775
## Mean : 0.007198
## 3rd Qu.: 0.046769
## Max. : 1.273764
First, the descriptive statistics. Looking at the dispersion between the median and both the minimum and maximum gives a first impression of outliers. For instance, the maximum d2e is almost 100 times the median: this is a big number. The same observation holds for p2b. Sometimes, it is worthwhile to understand whether these values are errors or simply due to very small values in the denominator. For instance, the outlier in p2b comes from very low book values for Boeing in 2017.
data %>%
filter(symbol == "BA") %>%
ggplot(aes(x = date, y = p2b)) + geom_line() + theme_bw()
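Although we skip outlier detection, here is a minimal winsorisation sketch (the 1%/99% caps are an arbitrary choice and the result is not reused afterwards; in practice, one would typically apply it date by date, like the scalings that follow).
winsorise <- function(v, probs = c(0.01, 0.99)){ # Clamp values beyond the chosen quantiles
  q <- quantile(v, probs = probs, na.rm = TRUE)
  pmin(pmax(v, q[1]), q[2])
}
data %>%
  mutate(d2e_w = winsorise(d2e)) %>% # Winsorised debt-to-equity
  summarise(max_raw = max(d2e), max_winsorised = max(d2e_w))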
Then, we rescale the predictors (features). We use two normalising functions. For consistency reasons, we will compare the two methods that scale inputs into the [0,1] interval. One secondary reason for that is that some ML tools work well/better with bounded inputs (neural networks for instance). Below, we create the two dataframes used in the analysis.
NOTE: it is imperative to rescale the data AT EACH DATE.
norm_unif <- function(v){ # This function uniformises a vector (maps values to empirical quantiles)
  v <- v %>% as.matrix()
  return(ecdf(v)(v))
}
norm_0_1 <- function(v){ # This function rescales a vector to the [0,1] interval (min-max)
  return((v - min(v)) / (max(v) - min(v)))
}
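To see why the note above matters, here is a toy contrast (a sketch using the freshly defined norm_0_1): pooled scaling mixes all dates, whereas per-date scaling only compares firms within the same cross-section.
data %>%
  mutate(pooled_cap = norm_0_1(mkt_cap)) %>% # Pooled scaling: uses the whole sample at once
  group_by(date) %>%
  mutate(cross_cap = norm_0_1(mkt_cap)) %>% # Per-date scaling: ranks firms within each date
  ungroup() %>%
  filter(symbol == "AAPL") %>%
  select(date, pooled_cap, cross_cap) %>%
  head()
In a backtest, the pooled version would also rely on minima and maxima observed at future dates, which amounts to look-ahead bias.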
data_0 <- data %>% # Data with normalised values
group_by(date) %>% # Normalisation takes place for each given date
mutate_if(is.numeric, norm_0_1) %>% # Apply normalisation on numerical columns
ungroup() %>% # Ungroup
select(-symbol, -date, -close) # Take out superfluous columns
data_0$f_return <- data$f_return # We recover the original returns
data_u <- data %>% # Data with normalised values
group_by(date) %>% # Normalisation takes place for each given date
mutate_if(is.numeric, norm_unif) %>% # Apply normalisation on numerical columns
ungroup() %>% # Ungroup
select(-symbol, -date, -close) # Take out superfluous columns
data_u$f_return <- data$f_return # We recover the original returns
Below, we check the distribution of the variables.
data_0 %>%
select(-f_return) %>% # Remove returns
pivot_longer(cols = everything(),
names_to = "Attribute", values_to = "Value") %>% # Convert to 'compact' column mode
ggplot(aes(x = Value, fill = Attribute)) +
geom_histogram() + theme_light() + # Plot histograms
facet_grid(Attribute~., scales = "free") # Stack the histograms
data_u %>%
select(-f_return) %>% # Remove returns
pivot_longer(cols = everything(),
names_to = "Attribute", values_to = "Value") %>% # Convert to 'compact' mode
ggplot(aes(x = Value, fill = Attribute)) +
geom_histogram(bins = 10) + theme_light() + # Plot histograms
facet_grid(Attribute~.) # Stack the histograms
Apart from a minor issue with d2e (due to outliers?), the scaling seems fine. But clearly, the distributions of all variables are strongly impacted by the scaling choice.
We are now ready to perform the regressions. We start with the scaled data.
fit <- lm(f_return ~ ., data = data_0) # Regression on first set
summary(fit) # Regression output
##
## Call:
## lm(formula = f_return ~ ., data = data_0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59505 -0.03959 -0.00165 0.03995 1.25733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0066686 0.0019433 3.432 0.000603 ***
## vol_1y 0.0072119 0.0035422 2.036 0.041785 *
## mkt_cap 0.0007097 0.0042091 0.169 0.866112
## p2b -0.0073691 0.0036211 -2.035 0.041883 *
## d2e 0.0052430 0.0044268 1.184 0.236303
## revenue -0.0055841 0.0043143 -1.294 0.195587
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07516 on 7383 degrees of freedom
## Multiple R-squared: 0.001487, Adjusted R-squared: 0.0008107
## F-statistic: 2.199 on 5 and 7383 DF, p-value: 0.05161
If we set the significance threshold at 5%, then two variables stand out, albeit weakly: vol_1y (positive effect) and p2b (negative effect). Note that mkt_cap is not significant with this scaling; with the uniformised data below, however, future returns are negatively related to firm size, a pattern that will resurface in later analyses (with trees, notably).
We now turn to the ‘uniformised’ data.
fit <- lm(f_return ~ ., data = data_u) # Regression on second set
summary(fit) # Regression output
##
## Call:
## lm(formula = f_return ~ ., data = data_u)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59001 -0.03975 -0.00182 0.04003 1.26167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0077726 0.0037887 2.052 0.0403 *
## vol_1y 0.0063118 0.0031379 2.011 0.0443 *
## mkt_cap -0.0088907 0.0042861 -2.074 0.0381 *
## p2b 0.0058436 0.0036268 1.611 0.1072
## d2e -0.0047801 0.0036521 -1.309 0.1906
## revenue 0.0004112 0.0038159 0.108 0.9142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07514 on 7383 degrees of freedom
## Multiple R-squared: 0.001867, Adjusted R-squared: 0.001191
## F-statistic: 2.762 on 5 and 7383 DF, p-value: 0.01695
In this case, mkt_cap becomes (weakly) significant with a negative sign and vol_1y remains a weak driver, while the impact of p2b fades. This underlines the importance of feature engineering. For nonlinear models, we refer to Galili and Meilijson (2016 - arxiv) for an analysis of the impact of feature transformation on tree splits.
To illustrate the difference between the two rescaling methods, we build a simple dataset, comprising 3 firms and 3 dates.
firm <- c(rep(1,3), rep(2,3), rep(3,3)) # Firms (3 lines for each)
date <- rep(c(1,2,3),3) # dates
cap <- c(10, 50, 100, # Market capitalisation
15, 10, 15,
200, 120, 80)
return <- c(0.06, 0.01, -0.06, # Return values
-0.03, 0.00, 0.02,
-0.04, -0.02,0.00)
data_toy <- data.frame(firm, date, cap, return) # Aggregation of data
data_toy <- data_toy %>% # Transformation of data
group_by(date) %>%
mutate(cap_0 = norm_0_1(cap), cap_u = norm_unif(cap))
data_toy # Display the data
## # A tibble: 9 × 6
## # Groups: date [3]
## firm date cap return cap_0 cap_u
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 10 0.06 0 0.333
## 2 1 2 50 0.01 0.364 0.667
## 3 1 3 100 -0.06 1 1
## 4 2 1 15 -0.03 0.0263 0.667
## 5 2 2 10 0 0 0.333
## 6 2 3 15 0.02 0 0.333
## 7 3 1 200 -0.04 1 1
## 8 3 2 120 -0.02 1 1
## 9 3 3 80 0 0.765 0.667
Let’s briefly comment on this synthetic data. We assume that dates are ordered chronologically and far away: each date stands for a year or the beginning of a decade, but the (forward) returns are computed on a monthly basis. The first firm is hugely successful and multiplies its cap ten times over the periods. The second firm remains stable cap-wise, while the third one plummets. If we look at ‘local’ future returns, they are strongly negatively related to size for the first and third firms. For the second one, there is no clear pattern.
Date-by-date, the analysis is fairly similar, though slightly nuanced. While the relationship is not always perfectly monotonic, there seems to be a link between size and return: typically, investing in the smallest firm would have been a very good strategy with this sample.
Now let’s look at the regressions.
lm(return ~ cap_0, data = data_toy) %>% summary() # First regression (min-max rescaling)
##
## Call:
## lm(formula = return ~ cap_0, data = data_toy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.044970 -0.016278 0.003722 0.013425 0.043722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01628 0.01374 1.185 0.2746
## cap_0 -0.04970 0.02137 -2.326 0.0529 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02867 on 7 degrees of freedom
## Multiple R-squared: 0.4359, Adjusted R-squared: 0.3553
## F-statistic: 5.409 on 1 and 7 DF, p-value: 0.05294
lm(return ~ cap_u, data = data_toy) %>% summary() # Second regression (uniformised feature)
##
## Call:
## lm(formula = return ~ cap_u, data = data_toy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.02667 -0.02000 0.00000 0.01667 0.03333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.06000 0.01981 3.028 0.01916 *
## cap_u -0.10000 0.02752 -3.634 0.00835 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02247 on 7 degrees of freedom
## Multiple R-squared: 0.6536, Adjusted R-squared: 0.6041
## F-statistic: 13.21 on 1 and 7 DF, p-value: 0.008351
In terms of p-value (last column), the first estimation for the cap coefficient is above 5% while the second is below 1%. One possible explanation for this discrepancy is the standard deviation of the variables. The deviations are equal to 0.47 and 0.29 for cap_0 and cap_u, respectively. Values like market capitalisations can have very large ranges and are thus subject to substantial deviations (even after scaling). Working with uniformised variables reduces dispersion and can help solve this problem.
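These dispersions can be checked directly (a one-line sketch):
data_toy %>% ungroup() %>% summarise(sd_cap_0 = sd(cap_0), sd_cap_u = sd(cap_u)) # Roughly 0.47 vs 0.29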
Note that this is a double-edged sword: while it can help avoid false negatives, it can also lead to false positives.
In supervised learning, regression tasks consist of explaining or predicting real numbers, while classification exercises rely on categorical data, which are often harder to handle. Below, we talk interchangeably of categories or classes.
First, we start from the scaled sample data_u, duplicate it and create a categorical variable, fc_return (forward categorical). We use a simple thresholding function for that purpose. It is defined as \[\tilde{y}_{t,i}=\left\{ \begin{array}{rll} -1 & \text{ if } & y_{t,i} < -r \\ 0 & \text{ if } & y_{t,i} \in [-r,r] \\ +1 & \text{ if } & y_{t,i} > r \\ \end{array} \right.\] for some threshold value \(r>0\). Intuitively, if the dependent variable y is a measure of profitability, a +1 value signals a buying intention, a -1 value a selling intention, while a zero value recommends the status quo.
thresh_coding <- function(x, r){ # The function that creates the categorical data
z <- x # Duplicate original data
z[x < (-r)] <- (-1) # Sell!
z[x > r] <- 1 # Buy!
z[x >= (-r) & x <= r] <- 0 # Hold!
return(z)
}
r_thresh <- 0.02 # Decision threshold
data_c <- round(data_u,3) # Duplicate data and round numbers (3 decimals is enough)
data_c <- data_c %>% mutate(fc_return = thresh_coding(f_return, r_thresh))
head(data_c) # Let's check!
## # A tibble: 6 × 7
## vol_1y mkt_cap p2b d2e revenue f_return fc_return
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.958 0.125 0.25 0.208 0.125 -0.098 -1
## 2 0.208 0.333 0.292 0.625 0.75 -0.044 -1
## 3 0.5 0.25 0.375 0.292 0.25 0.054 1
## 4 0.125 0.375 0.125 0.583 0.792 0.031 1
## 5 0.25 0.083 0.083 0.875 0.042 0.117 1
## 6 0.625 0.5 0.333 0.458 0.375 0 0
The head of the sample confirms the coding of the categorical variable. We now turn it into a proper factor and look at the class counts.
data_c$fc_return <- data_c$fc_return %>% as.factor() # Turn it into real category!
summary(data_c$fc_return)
## -1 0 1
## 2308 2172 2909
The last line shows the relative importance of the categories, which greatly depends on r_thresh. Our choice of r_thresh gives a rather balanced distribution for the categories. Smaller values of r_thresh would decrease the number of zeros and increase the number of \(\pm 1\).
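To gauge this sensitivity (a quick sketch, with arbitrary threshold values), we can tabulate the class counts for a few candidate thresholds. The next chunk then one-hot encodes the categorical variable.
sapply(c(0.01, 0.02, 0.05), # Candidate thresholds
       function(r) table(thresh_coding(data_u$f_return, r))) # Class counts for each choice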
library(fastDummies) # One package for one-hot encoding
one_hot <- dummy_cols(data_c$fc_return) # The encoding
colnames(one_hot) <- c("Original", "Sell", "Hold", "Buy")
data_c <- cbind(data_c, one_hot) # Combine the two
data_c %>% # Let's see!
select(-vol_1y, -mkt_cap) %>% # Remove a few variables first..
head()
## p2b d2e revenue f_return fc_return Original Sell Hold Buy
## 1 0.250 0.208 0.125 -0.098 -1 -1 1 0 0
## 2 0.292 0.625 0.750 -0.044 -1 -1 1 0 0
## 3 0.375 0.292 0.250 0.054 1 1 0 0 1
## 4 0.125 0.583 0.792 0.031 1 1 0 0 1
## 5 0.083 0.875 0.042 0.117 1 1 0 0 1
## 6 0.333 0.458 0.375 0.000 0 0 0 1 0
The number of additional columns is logically equal to the number of categories. One-hot encoding will mostly be used for neural networks (decision trees handle categorical variables quite well).
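A base-R alternative (a quick sketch): model.matrix produces the same kind of indicator columns without an extra package (the - 1 removes the intercept so that all three levels appear).
head(model.matrix(~ fc_return - 1, data = data_c)) # One indicator column per class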
When there are only two classes (e.g., buy vs sell), it is possible to extend the concept of regression analysis to categorical data. This classical tool is referred to as logistic regression because it is linked to the logistic function \((1+e^{-x})^{-1}\). This function takes values in [0,1], which is convenient because the output can be interpreted as the probability of belonging to one class. If we use the simple model \[\frac{p}{1-p}=e^{\beta_0+\sum_{i=1}^N\beta_ix_i},\]
where \(p\) is the probability to buy, then \(p=\left(1+e^{-\beta_0-\sum_{i=1}^N\beta_ix_i}\right)^{-1}\). The \(\beta_j\) are usually estimated via maximum likelihood (assuming i.i.d. Bernoulli samples).
Below, we show how this kind of model can be very simply implemented in R.
data_c$fb_Return <- (data_c$f_return>0) # Binary variable: 1 (true) if positive return, 0 if not
fit <- glm(fb_Return ~ mkt_cap + p2b + vol_1y + d2e + revenue, # Generalised linear model
family = "binomial", # Binary variable
data = data_c) # Data source
summary(fit) # Output of regression
##
## Call:
## glm(formula = fb_Return ~ mkt_cap + p2b + vol_1y + d2e + revenue,
## family = "binomial", data = data_c)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.27768 0.10108 2.747 0.00601 **
## mkt_cap -0.13271 0.11431 -1.161 0.24568
## p2b 0.10731 0.09672 1.109 0.26725
## vol_1y -0.01910 0.08370 -0.228 0.81946
## d2e -0.12762 0.09738 -1.310 0.19004
## revenue -0.15802 0.10174 -1.553 0.12039
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10223 on 7388 degrees of freedom
## Residual deviance: 10212 on 7383 degrees of freedom
## AIC: 10224
##
## Number of Fisher Scoring iterations: 3
Each unit increase in mkt_cap multiplies the odds of a positive return by a factor \(e^{-0.133}\approx 0.876\), i.e., decreases them (though the coefficient is not statistically significant here). Finally, we show the syntax for prediction.
predict(fit, data_c[1:3,]) # Predicts the log-odds
## 1 2 3
## 0.22332137 0.06257379 0.19842360
predict(fit, data_c[1:3,], type = "response") # Gives the probability
## 1 2 3
## 0.5555995 0.5156383 0.5494438
1/(1+exp(-predict(fit, data_c[1:3,]))) # Check via logistic function
## 1 2 3
## 0.5555995 0.5156383 0.5494438
Several output formats are possible. The log-odds are \(\log\left(\frac{p}{1-p}\right)\). The probability of buying is obtained via the logistic function.
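Relatedly (a one-liner, as a sketch), exponentiating the coefficients yields the multiplicative impact of each feature on the odds of a positive return.
exp(coef(fit)) # Odds ratios; the mkt_cap entry matches the 0.876 factor mentioned above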
Refinements in labelling include:
(subtitle: ANOTHER STEP TOWARDS ML-BASED FACTOR INVESTING)
The core idea of the course is to translate data about firms into investment decisions. We seek to infer correlation (‘causal’ would be too strong) patterns between corporations’ characteristics and their subsequent returns. The base model is \[\mathbf{y}=f(\mathbf{X})+\mathbf{\epsilon},\]
where y is some proxy for future performance while X embeds the current values of firm attributes. More precisely, for one given date, t, we will try to estimate \(f_t\) \[y_{s+1,j}=f_t\left(x_{s,j}^{(1)}, \dots, x_{s,j}^{(K)}\right)+\mathbf{\epsilon}_{s,j}, \quad s= t-1, \ t-2,\dots\]
where j is the index of the firm and s the index of time. In the above equation, there are K features or firm attributes. To make things simple, we will consider that y is the return of the asset. This (future) return will thus be a function of the current values of stocks’ characteristics.
If our sample comprises only one date (s = t-1), then we will try to build a model that maps the features of t-1 to the returns computed from t-1 to t (i.e., after the features are observed). This stage, where we determine the optimal (in some sense) model \(f_t\) according to past observations, is called training: the parameters of the model are adjusted so that the predicted values are as close as possible to those that actually occurred.
Using only one date will probably be insufficient and the model will lack robustness. Hence, it will be preferable to train the model on many dates so that the model learns from different states of the market.
Before we move to a simple application of this principle using a linear \(f_t\), we illustrate the concept of learning on increasingly large samples.
Below, we use the gganimate package to demonstrate the simple idea behind training.
if(!require(gganimate)){install.packages("gganimate")}
library(gganimate)
Iteration <- 16 # Number of steps in the animation
reg_data <- data.frame( # Create dataframe with:
x = sin(1:Iteration), # x-axis (independent/explanatory variable)
iteration = Iteration, # Index of the step
status = c(rep("Past",Iteration-1),"Current") # Which point is new versus old
)
reg_data <- reg_data %>% mutate(y = x + rnorm(Iteration)/3) # Linear Data Generator
tmp <- reg_data # Copy of the data with fewer points to pass to animation
for (j in 1:(Iteration-2)) { # Index of animation step
tmp <- tmp[-nrow(tmp),] # Take out one point
tmp$iteration <- Iteration - j # Change animation step number
tmp$status[nrow(tmp)] <- "Current" # Change label of current point
reg_data <- rbind(reg_data, tmp) # Aggregate
}
p <- ggplot(reg_data) + theme_light() +
geom_point(aes(x = x, y = y, color = status), size = 4) + # Points
stat_smooth(aes(x = x, y = y), color = "black", method = "lm", se = FALSE) + # Linear model
transition_manual(iteration) # Animation variable
#anim <- animate(p, duration = 20) # 20 second animation
animate(p, duration = 5, fps = 20, renderer = gifski_renderer())
anim_save(filename = "reg.gif") # Saving for export/include
Each time a new (red) point is added to the training data, the linear model is adjusted: the black line comes slightly closer to the new point. As more data accumulates, the impact of each new point becomes less and less pronounced because the parameter values converge. If the data generating process exhibits some stationarity properties (samples follow a recurrent pattern) and if the model type is well chosen, learning will be effective.
In this section, we will consider a linear model: \[f_t\left(x_{s,j}^{(1)}, \dots, x_{s,j}^{(K)}\right)=\alpha_t+\sum_{k=1}^K\beta_{t}^{(k)}x_{s,j}^{(k)},\]
where the parameters \(\alpha_t\) and \(\beta_t^{(k)}\) are estimated via simple Ordinary Least Squares. In the above expression, we do not refer to the error term, but do not forget that it exists!
First, we have a look at the full sample (using data_u).
lm(f_return ~ ., data = data_u) %>% summary() # Regression output
##
## Call:
## lm(formula = f_return ~ ., data = data_u)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.59001 -0.03975 -0.00182 0.04003 1.26167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0077726 0.0037887 2.052 0.0403 *
## vol_1y 0.0063118 0.0031379 2.011 0.0443 *
## mkt_cap -0.0088907 0.0042861 -2.074 0.0381 *
## p2b 0.0058436 0.0036268 1.611 0.1072
## d2e -0.0047801 0.0036521 -1.309 0.1906
## revenue 0.0004112 0.0038159 0.108 0.9142
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07514 on 7383 degrees of freedom
## Multiple R-squared: 0.001867, Adjusted R-squared: 0.001191
## F-statistic: 2.762 on 5 and 7383 DF, p-value: 0.01695
As a confirmation of our previous exercise, mkt_cap stands out (negatively): the presence of the other predictors does not alter its link with future returns.
We now turn to portfolio choice.
First, we create the weights function. The strategy forecasts future returns and invests equally in the assets whose forecasts lie above a cross-sectional quantile of the predictions (the threshold is set to 0.8 in the backtest below, i.e., roughly the top 20%). This is in fact a particular case of the penalised predictive regression seen in the previous (LASSO) session, except that here the regression is not penalised.
We first rebuild the original data with data imputation.
load("data_new.RData") # Loading the data
data <- data %>% arrange(date, symbol) # Just making sure all is in order
data <- data %>%
select(-scope_1) %>% # Remove ESG: too many missing points
group_by(symbol) %>% # Group asset by asset
mutate(p_return = close / lag(close) - 1) %>% # Adding past returns
mutate(f_return = lead(p_return)) %>% # Adding forward returns: one could do better
ungroup() # Ungroup
impute_ <- function(v){
v[is.na(v)] <- median(v, na.rm = T)
v
}
data <- data |>
group_by(date) |>
mutate(across(where(is.numeric), impute_)) |>
ungroup() |>
na.omit()
data_u <- data %>% # Data with normalised values
group_by(date) %>% # Normalisation takes place for each given date
mutate_if(is.numeric, norm_unif) %>% # Apply normalisation on numerical columns
ungroup() %>% # Ungroup
select(-symbol, -date, -close) # Take out superfluous columns
## `mutate_if()` ignored the following grouping variables:
## • Column `date`
data_u$f_return <- data$f_return # We recover the original returns
weights_multi <- function(past_data, curr_data, j, thresh){
if(j == 1){ # j = 1 => regression-based
fit <- lm(f_return ~ ., data = past_data) # Training on past data
pred <- predict(fit, curr_data) # Prediction
w <- pred > quantile(pred, thresh) # Investment decision: TRUE = 1
return(w / sum(w))
}
if(j == 2){ # j = 2 => EW
N <- nrow(curr_data)
return(rep(1/N,N))
}
}
We then initiate the backtesting variables.
t_all <- unique(data$date) # Set of dates
sep_date <- as.Date("2010-01-01") # This date separates in-sample vs out-of-sample
t_oos <- t_all[t_all > sep_date] # Out-of-sample dates (i.e., testing set)
nb_port <- 2 # Nb of portfolios
Tt <- length(t_oos) # Nb dates
portf_weights <- array(0, dim = c(Tt, nb_port, length(tick))) # Store weights
portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Store returns
Finally, we are ready for the backtesting loop.
thresh <- 0.8
data_u <- data.frame(data_u, date = data$date) # We recover the dates from original file
for(t in 1:length(t_oos)){
past_data <- data_u %>% # The source is data_u!
filter(date < t_oos[t]) %>% # Past dates: expanding window!
select(-date) # Take out dates
curr_data <- data_u %>% # The source is data_u!
filter(date == t_oos[t]) %>% # Current date
select(-date) # Take out dates
for(j in 1:nb_port){ # The loop on the strategies
portf_weights[t,j,] <- weights_multi(past_data, # Portfolio weights...
curr_data,
j, # The weights' function is indexed by j
thresh) # The threshold
realised_returns <- data_u %>% # Realised returns: take data_u
filter(date == t_oos[t]) %>% # Filter the right date
select(f_return) # Take the returns only
portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns) # Portfolio returns
}
}
12*portf_returns %>% apply(2, mean) # Average return
## [1] 0.1075491 0.1109401
sqrt(12)*portf_returns %>% apply(2, sd) # Volatility
## [1] 0.1861675 0.1360854
sqrt(12)*portf_returns %>% apply(2, mean) / portf_returns %>% apply(2, sd)
## [1] 0.5777008 0.8152241
The regression-based allocation is slightly more profitable, but also more volatile: in the end, both Sharpe ratios are close. In order to statistically assess whether a strategy is significantly preferable, it is possible to perform a t-test on the difference of returns.
t_test <- t.test(portf_returns[,1]-portf_returns[,2])
t_test
##
## One Sample t-test
##
## data: portf_returns[, 1] - portf_returns[, 2]
## t = -0.1341, df = 180, p-value = 0.8935
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.004440608 0.003875441
## sample estimates:
## mean of x
## -0.0002825835
The difference is not statistically significant: a t-stat of -0.134 (p-value of 0.89) is far from any usual rejection threshold, so we cannot conclude that one strategy dominates the other.
Inputs in financial backtesting can all be considered as time-series. There are many ways these series can be characterised, but we underline one important property: persistence. One simple way to assess persistence is to compute autocorrelations.
acf_2 <- function(v){ # We build a function that keeps the first order correlation
return(acf(v, lag.max = 1, plot = FALSE)$acf[2])
}
data %>%
group_by(symbol) %>%
summarise_if(is.numeric, acf_2) # We apply this function to features
## # A tibble: 30 × 9
## symbol close vol_1y mkt_cap p2b d2e revenue p_return f_return
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.979 0.956 0.978 0.973 0.991 0.991 0.0445 0.0634
## 2 BA 0.983 0.959 0.984 0.988 0.986 0.985 0.0538 0.0515
## 3 BAC 0.980 0.977 0.969 0.985 0.985 0.991 0.108 0.102
## 4 C 0.989 0.978 0.968 0.977 0.985 0.991 0.0554 0.0548
## 5 CSCO 0.972 0.970 0.909 0.906 0.940 0.990 -0.0661 -0.0691
## 6 CVS 0.988 0.941 0.987 0.946 0.982 0.987 -0.116 -0.0900
## 7 CVX 0.975 0.949 0.974 0.955 0.982 0.970 -0.109 -0.107
## 8 D 0.986 0.942 0.988 0.973 0.964 0.967 -0.108 -0.0971
## 9 DIS 0.986 0.950 0.985 0.970 0.968 0.985 -0.00278 -0.00258
## 10 F 0.940 0.962 0.962 0.956 0.975 0.964 -0.0120 -0.0130
## # ℹ 20 more rows
Almost all attributes are highly autocorrelated at the monthly frequency. This is obviously true for mkt_cap and vol_1y, which are indeed expected to show some memory over time.
The main issue is that the last column, which is the predicted variable, is not persistent at all! This is a major problem: it is unlikely that much good will come out of a model, even a very sophisticated one. If the returns oscillate very quickly while the predictors have very smooth trajectories, then there is a framing issue. What will probably happen is that the model/algorithm will find spurious relationships/arbitrages between variables that will fail out-of-sample.
Two solutions (at least) exist: either make the features less persistent, for instance by working with their variations, or smooth the dependent variable, for instance by using longer-horizon returns.
Below, we provide examples of implementation for each alternative. We start with the second one.
data_1 <- data %>%
group_by(symbol) %>% # Grouping
mutate(f_return = lead(close) / close - 1, # Adding 1M future returns
fl_return = lead(close, 12) / close - 1) %>% # Adding 12M (Long-term) future returns
na.omit() %>% # Take out missing data
ungroup() # Ungrouping
data_1 %>% group_by(symbol) %>%
summarise(autocor = acf_2(fl_return)) # Showing the results
## # A tibble: 30 × 2
## symbol autocor
## <chr> <dbl>
## 1 AAPL 0.925
## 2 BA 0.918
## 3 BAC 0.848
## 4 C 0.881
## 5 CSCO 0.865
## 6 CVS 0.901
## 7 CVX 0.889
## 8 D 0.881
## 9 DIS 0.903
## 10 F 0.896
## # ℹ 20 more rows
By looking 12 months ahead, we are smoothing the dependent variable, but we are also losing some data points. The result (in terms of autocorrelation patterns) is nonetheless satisfactory.
Finally, we turn to feature differences, which is the first option.
data_2 <- data %>%
group_by(symbol) %>% # Group by stock
mutate(D_mkt_cap = mkt_cap / lag(mkt_cap) - 1, # Relative variation: scale issue
D_p2b = p2b - lag(p2b), # Absolute variation for the rest
D_vol_1y = vol_1y - lag(vol_1y),
D_d2e = d2e - lag(d2e),
D_rev = revenue - lag(revenue)) %>%
na.omit() %>% # Take out missing data
ungroup() # Ungroup!
data_2 %>% group_by(symbol) %>%
summarise_if(is.numeric, acf_2) %>%
mutate_if(is.numeric, round, digits = 3) # Show the results
## # A tibble: 30 × 14
## symbol close vol_1y mkt_cap p2b d2e revenue p_return f_return D_mkt_cap
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AAPL 0.979 0.944 0.978 0.973 0.991 0.99 0.063 0.096 -0.029
## 2 BA 0.982 0.959 0.985 0.988 0.986 0.985 0.051 0.054 -0.006
## 3 BAC 0.981 0.977 0.968 0.985 0.985 0.991 0.102 0.106 -0.009
## 4 C 0.987 0.978 0.968 0.977 0.985 0.991 0.055 0.058 -0.002
## 5 CSCO 0.972 0.97 0.902 0.896 0.94 0.99 -0.07 -0.073 0
## 6 CVS 0.987 0.943 0.987 0.943 0.982 0.987 -0.116 -0.09 -0.005
## 7 CVX 0.975 0.949 0.973 0.954 0.981 0.97 -0.108 -0.107 -0.01
## 8 D 0.985 0.938 0.988 0.973 0.964 0.966 -0.098 -0.091 -0.009
## 9 DIS 0.986 0.954 0.985 0.97 0.968 0.985 -0.003 -0.003 -0.005
## 10 F 0.928 0.962 0.962 0.956 0.975 0.964 -0.013 -0.013 -0.003
## # ℹ 20 more rows
## # ℹ 4 more variables: D_p2b <dbl>, D_vol_1y <dbl>, D_d2e <dbl>, D_rev <dbl>
The difference operator did wipe out the autocorrelation of the transformed features.
Finally, do not hesitate to work with relative performance with respect to a benchmark. For instance, removing the cross-sectional average return at each point in time will single out the stocks with above-average performance. Moreover, the cross-sectional average of this indicator is constant through time (equal to zero) and hence not subject to macro-economic fluctuations.
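A minimal sketch of such benchmark-relative labels (here the benchmark is simply the equally-weighted cross-sectional average, an assumption made for illustration):
data %>%
  group_by(date) %>%
  mutate(rel_return = f_return - mean(f_return, na.rm = TRUE)) %>% # Return in excess of the cross-sectional average
  ungroup() %>%
  summarise(grand_mean = mean(rel_return, na.rm = TRUE)) # Zero by construction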
Feature engineering is not all: labelling is also very important. Recycle the code above to change the dependent variable (future returns) into:
Create a weighting scheme (function) that is based on the categorical output (-1, 0, 1). The weights are initialised uniformly, then:
* if the output is 1, the prior weight is one
* if the output is -1, the prior weight is zero
* if the output is 0, the prior weight is the sign of the previous weight
* finally, normalise the prior weights so that they sum to one (a possible sketch follows).
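Here is one possible reading of this scheme (the function name and arguments are illustrative, not part of the course material):
weights_cat <- function(signal, prev_weights){ # signal takes values in {-1, 0, 1}
  w <- rep(1 / length(signal), length(signal)) # Uniform initialisation
  w[signal == 1] <- 1 # Buy signal: prior weight of one
  w[signal == -1] <- 0 # Sell signal: prior weight of zero
  w[signal == 0] <- sign(prev_weights[signal == 0]) # Hold: sign of the previous weight
  if(sum(w) == 0) w <- rep(1 / length(w), length(w)) # Guard against an all-zero allocation
  return(w / sum(w)) # Normalise so that the weights sum to one
}
weights_cat(signal = c(1, -1, 0, 0), prev_weights = c(0.4, 0.2, 0.3, 0.1)) # Toy call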
Verify if the persistence of attributes such as market capitalisation or volatility remains after the rescaling of features.
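A possible sketch for this check (it reuses norm_unif and acf_2 from above; the choice of columns is ours):
data %>%
  group_by(date) %>%
  mutate(across(c(mkt_cap, vol_1y), norm_unif)) %>% # Uniformise the features at each date
  group_by(symbol) %>% # Then look at each stock's time-series
  summarise(across(c(mkt_cap, vol_1y), acf_2)) # First-order autocorrelation of the rescaled features
Independently, the final chunk below revisits the backtest and maps the average returns of both strategies over a grid of threshold values.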
f_thresh <- function(thresh){
for(t in 1:length(t_oos)){
past_data <- data_u %>% # The source is data_u!
filter(date < t_oos[t]) %>% # Past dates: expanding window!
select(-date) # Take out dates
curr_data <- data_u %>% # The source is data_u!
filter(date == t_oos[t]) %>% # Current date
select(-date) # Take out dates
for(j in 1:nb_port){ # The loop on the strategies
portf_weights[t,j,] <- weights_multi(past_data, # Portfolio weights...
curr_data,
j, # The weights' function is indexed by j
thresh) # The threshold
realised_returns <- data_u %>% # Realised returns: take data_u
filter(date == t_oos[t]) %>% # Filter the right date
select(f_return) # Take the returns only
portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns) # Porfolio returns
}
}
return(portf_returns %>% apply(2, mean))
}
map(seq(0.1,0.9, by = 0.1),
f_thresh)
## [[1]]
## [1] 0.009024008 0.009245011
##
## [[2]]
## [1] 0.009085861 0.009245011
##
## [[3]]
## [1] 0.008910866 0.009245011
##
## [[4]]
## [1] 0.009602975 0.009245011
##
## [[5]]
## [1] 0.009707873 0.009245011
##
## [[6]]
## [1] 0.010056054 0.009245011
##
## [[7]]
## [1] 0.009161366 0.009245011
##
## [[8]]
## [1] 0.008962428 0.009245011
##
## [[9]]
## [1] 0.009414032 0.009245011