This is the companion notebook to the course on empirical
foundations of machine learning in asset
management.
The topic is strongly tilted towards asset pricing
anomalies, which serve as the economic justification of ML-based
factor portfolios. Indeed, since the 1990s, a large body of empirical
work has shown that firms’ characteristics have an impact on future
returns, and that this impact is possibly time-varying.
The first chunks of the code are inspired by the book
Reproducible Finance by Jonathan Regenstein (http://www.reproduciblefinance.com).
The code is commented and the results are discussed as much as
possible.
\(\rightarrow\) ABOUT THE NOTEBOOK:
First, we collect the data for our baseline assets. We work with 6 large American companies, each identified by its ticker symbol: Apple (AAPL, technology), Boeing (BA, industry), Citigroup (C, banking), Pfizer (PFE, healthcare), Walmart (WMT, retail) and Exxon (XOM, oil/commodities). The six firms were chosen from very diverse industry sectors.
NOTE: The analysis below should be carried out on a much larger number of assets. We restrict the width of the analysis for simplicity and for computational purposes.
if(!require(tidyverse)){install.packages("tidyverse")}
if(!require(quantmod)){install.packages("quantmod")}
library(tidyverse) # Package for data manipulation
library(quantmod) # Package for data extraction
tickers <- c("AAPL", "BA", "C", "PFE", "WMT", "XOM")
# Tickers: Apple, Boeing, Citigroup, Pfizer, Walmart, Exxon
# Others: "F", "DIS", "GE", "CVX", "MSFT", "GS", "AMZN", "NVDA"
min_date <- "2000-01-01" # Starting date
max_date <- "2025-09-30" # Ending date
prices <- getSymbols(tickers, src = 'yahoo', # The data comes from Yahoo Finance
from = min_date, # Start date
to = max_date, # End date
auto.assign = TRUE,
warnings = FALSE) %>%
map(~Ad(get(.))) |> # Extract the adjusted prices
reduce(merge) |> # Merge in one dataframe
`colnames<-`(tickers) # Set the column names
head(prices[,1:5],7) # Have a look at the result!
## AAPL BA C PFE WMT
## 2000-01-03 0.8400941 25.94028 209.0070 11.85604 14.23962
## 2000-01-04 0.7692658 25.89995 196.1906 11.41434 13.70681
## 2000-01-05 0.7805229 27.51363 204.0776 11.60033 13.42708
## 2000-01-06 0.7129774 27.79603 213.9364 12.01877 13.57360
## 2000-01-07 0.7467505 28.60290 212.9505 12.83242 14.59927
## 2000-01-10 0.7336167 28.19946 212.2111 12.80917 14.33288
## 2000-01-11 0.6960915 27.67502 209.4998 12.64644 14.11974
rm(list = tickers) # Remove unnecessary variables
The data is stored in the ‘prices’ variable. Let’s have a look at the series, to make sure they make sense.
prices %>% # OLD PIPE here
data.frame(Date = index(.)) |>
remove_rownames() |>
pivot_longer(-Date, names_to = "Asset",
values_to = "Price") |>
ggplot(aes(x = Date, y = Price, color = Asset)) + geom_line() +
facet_wrap(vars(Asset), scales = "free") + theme_light()
This seems alright, though Citigroup suffered massively in the subprime crisis.
Before we move towards the factors, we transform (via mutate) the daily prices into monthly returns. Returns are indeed the raw material of portfolio management. In the code, we resort to a lot of piping with the native operator |>, falling back on the older magrittr pipe %>% when the dot placeholder is needed.
returns <- prices |> # Start from prices
to.monthly(indexAt = "lastof", OHLC = FALSE) %>% # Convert to monthly - OLD PIPE!
data.frame(Date = index(.)) |> # Convert the index to a date
remove_rownames() |> # Remove index => converted into row names
pivot_longer(-Date, names_to = "Asset",
values_to = "Price") |> # Put the data in 'tidy' format
group_by(Asset) |> # Group the rows by asset to compute returns
mutate(returns = Price / dplyr::lag(Price) - 1) |> # Compute the returns
select(-Price) |> # Remove price column
pivot_wider(names_from = "Asset",
values_from = "returns") |> # Put the data back in matrix form
select(all_of(c("Date", tickers))) |> # Keep only the relevant columns
na.omit() # Discard rows with missing/NA data
tail(returns[,1:5]) # Again, show the result!
Date | AAPL | BA | C | PFE |
---|---|---|---|---|
2025-04-30 | -0.0433530 | 0.0744063 | -0.0367658 | -0.0367008 |
2025-05-31 | -0.0535841 | 0.1314124 | 0.1102998 | -0.0193313 |
2025-06-30 | 0.0215085 | 0.0106598 | 0.1301115 | 0.0319285 |
2025-07-31 | 0.0116977 | 0.0587505 | 0.1007988 | -0.0226124 |
2025-08-31 | 0.1196389 | 0.0578795 | 0.0374080 | 0.0631172 |
2025-09-30 | 0.0960196 | -0.0749957 | 0.0682407 | -0.0367528 |
We now turn to the importation of factor data. It’s extracted from the amazing database maintained by Kenneth French: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html
temp <- tempfile()
link <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip"
download.file(link, temp, quiet = TRUE) # Download!
nb_factors <- 5 # Number of factors
FF_factors <- read_csv(unz(temp,
"F-F_Research_Data_5_Factors_2x3.csv"), # The file name
skip = 3) |> # Check the number of lines to skip (could be 2!)
rename(Date = `...1`, MKT_RF = `Mkt-RF`) |> # Rename the first column & Mkt-RF
mutate_at(vars(-Date), as.numeric) |> # Convert to number
mutate(Date = ymd(parse_date_time(Date, "%Y%m"))) |> # Date in right format
mutate(Date = rollback(Date + months(1))) # End of month date
FF_factors <- FF_factors |> mutate(MKT_RF = MKT_RF / 100, # Scaling returns
SMB = SMB / 100,
HML = HML / 100,
RMW = RMW / 100,
CMA = CMA / 100,
RF = RF / 100) |> # Risk-free rate
filter(Date >= min_date) # Finally, we keep the recent points
tail(FF_factors) # And we have a look!
Date | MKT_RF | SMB | HML | RMW | CMA | RF |
---|---|---|---|---|---|---|
2025-03-31 | -0.0639 | -0.0149 | 0.0290 | 0.0211 | -0.0047 | 0.0034 |
2025-04-30 | -0.0084 | -0.0186 | -0.0340 | -0.0285 | -0.0267 | 0.0035 |
2025-05-31 | 0.0606 | -0.0072 | -0.0288 | 0.0126 | 0.0251 | 0.0038 |
2025-06-30 | 0.0486 | -0.0002 | -0.0160 | -0.0319 | 0.0145 | 0.0034 |
2025-07-31 | 0.0198 | -0.0015 | -0.0127 | -0.0029 | -0.0207 | 0.0034 |
2025-08-31 | 0.0185 | 0.0488 | 0.0441 | -0.0069 | 0.0207 | 0.0038 |
Finally, we merge the two sources of data into one single variable.
data <- left_join(returns, FF_factors, by = "Date")
# left_join() is an incredible function that merges two sets through one common column
tail(data[,1:6]) # See the outcome!
Date | AAPL | BA | C | PFE | WMT |
---|---|---|---|---|---|
2025-04-30 | -0.0433530 | 0.0744063 | -0.0367658 | -0.0367008 | 0.1077570 |
2025-05-31 | -0.0535841 | 0.1314124 | 0.1102998 | -0.0193313 | 0.0175701 |
2025-06-30 | 0.0215085 | 0.0106598 | 0.1301115 | 0.0319285 | -0.0095219 |
2025-07-31 | 0.0116977 | 0.0587505 | 0.1007988 | -0.0226124 | 0.0020455 |
2025-08-31 | 0.1196389 | 0.0578795 | 0.0374080 | 0.0631172 | -0.0078944 |
2025-09-30 | 0.0960196 | -0.0749957 | 0.0682407 | -0.0367528 | 0.0628995 |
With our full dataset at hand, we are ready to begin the analysis.
Fama-MacBeth regressions can be used to assess the premia associated with factors. They are performed in two steps: first, each asset’s returns are regressed on the factor returns (time series) to estimate the loadings; second, at each date, asset returns are regressed on these loadings (cross-section) to estimate the premia.
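With \(r_{t,n}\) the return of asset \(n\) at time \(t\) and \(f^k_t\) the return of factor \(k\) (notation introduced here for convenience), the two stages read
\[r_{t,n} = \alpha_n + \sum_{k=1}^K\beta^k_n f^k_t + \epsilon_{t,n} \quad \text{(one time-series regression per asset)},\]
\[r_{t,n} = \gamma^0_t + \sum_{k=1}^K\gamma^k_t \hat{\beta}^k_n + u_{t,n} \quad \text{(one cross-sectional regression per date)}.\]
The time series of the estimated \(\gamma^k_t\) is then averaged to gauge the premium earned through exposure to factor \(k\).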
A natural way to proceed would be to resort to a for() loop to perform the regressions iteratively. We take another route and instead make use of the ‘apply’ family of functions: we create all models via their character (formula) syntax in R and finally gather the results in the ‘betas’ variable.
models <- lapply(paste(tickers, ' ~ MKT_RF + SMB + HML + RMW + CMA'), # Specification of models
function(f){ lm(as.formula(f), data = data) |> # Calling lm(.)
summary() %>% # Gathering the output
"$"(coef) |> # Keeping only the coefs
data.frame() |> # Convert to dataframe
select(Estimate)} # Keep the coefficients
)
betas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) |> # Extracting the betas
data.frame(row.names = tickers) # Formatting: row names
colnames(betas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA") # Formatting: col names
betas # Showing the values
Constant | MKT_RF | SMB | HML | RMW | CMA | |
---|---|---|---|---|---|---|
AAPL | 0.0186491 | 1.2341543 | 0.2402203 | -0.5152830 | 0.2492055 | -0.7717273 |
BA | 0.0010666 | 1.1884310 | 0.1894390 | 0.4894745 | 0.4182510 | -0.0278159 |
C | -0.0017137 | 1.5380004 | -0.5803671 | 1.9945939 | -0.9240327 | -1.1158045 |
PFE | -0.0030847 | 0.7504517 | -0.2909267 | -0.0668321 | 0.3743308 | 0.5207637 |
WMT | 0.0040476 | 0.5382472 | -0.3638913 | -0.0816542 | 0.3399109 | 0.0980102 |
XOM | 0.0007407 | 0.7342551 | -0.0151136 | 0.4107610 | 0.2312412 | 0.3629603 |
rm(models) # Remove models to save space
The market beta is shown in the second column. When there are no other factors, a beta above 1 (resp. below 1) means that the stock is more volatile (resp. less volatile) than the market. The two stocks with the highest market betas (Apple and Citigroup) also load heavily, and negatively, on CMA. Citigroup also has a strong positive exposure to HML.
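As a side note, in a univariate (CAPM-style) regression, the market beta is simply the covariance between the stock’s and the market’s returns, divided by the market variance. A quick sketch checks this on Apple, reusing the ‘data’ variable built above (the figure differs from the one in the table, which controls for the other factors).
cov(data$AAPL, data$MKT_RF, use = "complete.obs") / var(data$MKT_RF, na.rm = TRUE) # CAPM beta
lm(AAPL ~ MKT_RF, data = data) |> coef() # Same slope, obtained via lm()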
We then turn to the second step of the procedure. This time, we regress asset returns against the betas, one date at a time. We start by formatting the data.
loadings <- betas |> # Start from the betas
select(-Constant) |> # Keep only the factor loadings
data.frame() # Convert to dataframe
ret <- returns |> # Start from returns
select(-Date) |> # Keep the returns only
data.frame(row.names = returns$Date) |> # Set row names
t() # Transpose
FM_data <- cbind(loadings, ret) # Aggregate both
FM_data[,1:7] # Quick look at the first 7 columns
MKT_RF | SMB | HML | RMW | CMA | 2000-02-29 | 2000-03-31 | |
---|---|---|---|---|---|---|---|
AAPL | 1.2341543 | 0.2402203 | -0.5152830 | 0.2492055 | -0.7717273 | 0.1048204 | 0.1848414 |
BA | 1.1884310 | 0.1894390 | 0.4894745 | 0.4182510 | -0.0278159 | -0.1670994 | 0.0236888 |
C | 1.5380004 | -0.5803671 | 1.9945939 | -0.9240327 | -1.1158045 | -0.0895824 | 0.1570050 |
PFE | 0.7504517 | -0.2909267 | -0.0668321 | 0.3743308 | 0.5207637 | -0.1100105 | 0.1381321 |
WMT | 0.5382472 | -0.3638913 | -0.0816542 | 0.3399109 | 0.0980102 | -0.1095889 | 0.1604324 |
XOM | 0.7342551 | -0.0151136 | 0.4107610 | 0.2312412 | 0.3629603 | -0.0861578 | 0.0356851 |
These will be the inputs of the second stage of regressions: for each stock, we have both the returns and the associated loadings.
models <- lapply(paste("`", returns$Date, "`", ' ~ MKT_RF + SMB + HML + RMW + CMA', sep = ""),
function(f){ lm(as.formula(f), data = FM_data) |> # Calling lm(.)
summary() %>% # Gathering the output
"$"(coef) |> # Keeping only the coefs
data.frame() |> # Convert to dataframe
select(Estimate)} # Keep the coefficients
)
gammas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) |> # Switch to dataframe
data.frame(row.names = returns$Date) # & set row names
colnames(gammas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA") # Set col names
gammas[1:10,] # Show the first 10 lines
Constant | MKT_RF | SMB | HML | RMW | CMA | |
---|---|---|---|---|---|---|
2000-02-29 | 0.1239586 | -0.0394559 | 0.2214864 | -0.2042830 | -0.4434440 | 0.0238477 |
2000-03-31 | 0.0581506 | 0.0759263 | -0.2102380 | -0.1237718 | -0.0610725 | -0.0452642 |
2000-04-30 | -0.4059478 | 0.4274252 | -0.2842407 | -0.0475548 | 0.0748032 | 0.2370561 |
2000-05-31 | 0.0186219 | -0.1241037 | -0.1407553 | 0.1927900 | 0.1246562 | 0.1118167 |
2000-06-30 | -0.2753429 | 0.3787721 | -0.1288425 | -0.1841318 | 0.0255560 | 0.0199481 |
2000-07-31 | -0.0758135 | -0.0434212 | 0.0762947 | 0.2561665 | 0.3599206 | -0.1600895 |
2000-08-31 | -0.1446250 | 0.3299020 | 0.2465834 | -0.1209519 | -0.3083543 | 0.1399167 |
2000-09-30 | -0.3315829 | -0.0648002 | -0.2781077 | 0.5613907 | 0.9335740 | 0.0550061 |
2000-10-31 | -0.1640260 | -0.0078580 | -0.0565691 | 0.2164587 | 0.3112272 | 0.0246202 |
2000-11-30 | 0.0017934 | -0.2095724 | -0.3062634 | 0.1946519 | 0.5393848 | -0.1810960 |
rm(models) # Remove models, save space
Note that the estimates are not going to be very reliable because the number of observations (number of stocks, i.e., 6) is barely larger than the number of variables (number of factors, 5). Accuracy would require a much larger number of assets (hundreds, or, preferably, thousands).
gammas |> colMeans() |> round(4) # Means (over all periods)
## Constant MKT_RF SMB HML RMW CMA
## 0.0198 -0.0067 0.0161 -0.0052 -0.0033 -0.0105
apply(gammas, 2, t.test) |> unlist() |> # t-stats
matrix(ncol = ncol(gammas), dimnames = list(c(), colnames(betas))) |>
data.frame() |> slice(1) |> as.numeric() |> round(4) # Take only the first line of the test
## [1] 1.6744 -0.6259 1.4781 -0.7743 -0.2025 -1.5627
Surprisingly, some averages are quite negative, but the t-statistics are hardly significant. The results depend heavily on the initial assets: more accurate estimates would require a much larger set of them.
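For reference, the statistic computed by t.test above is simply the time-series mean of each gamma divided by its standard error; it is easy to reproduce by hand:
T_obs <- nrow(gammas) # Number of periods
round(colMeans(gammas) / (apply(gammas, 2, sd) / sqrt(T_obs)), 4) # Same t-stats as above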
Let’s nonetheless visually assess the stability of premia across time.
gammas |> # Start from gamma (estimates)
select(-Constant) |> # Take out the constant
rownames_to_column("Date") |> # Convert row names to the first column => dates
mutate(Date = as.Date(Date)) |> # Convert first date column to true date format
pivot_longer(-Date, names_to = "Factor", # Put in column/tidy format for ggplot
values_to = "Value") |>
ggplot(aes(x = Date, y = Value, color = Factor)) + # Plot
geom_line() + facet_grid(Factor~.) + theme_light() +
scale_color_viridis_d()
CMA and HML appear more stable, while RMW and SMB are more volatile.
Finally, what are the annual averages of these series? We build a simple pivot table below.
gammas |> # To create a pivot table, start with the gammas
select(-Constant) |> # Take out the constant
rownames_to_column("Date") |> # Convert row names to the first column
mutate(Date = year(as.Date(Date))) |> # Convert this new first column to year
pivot_longer(-Date, names_to = "Factor", # Put in column/tidy format for ggplot
values_to = "Value") |>
group_by(Date, Factor) |> # Prepare pivot-table
summarise(avg_premium = mean(Value), .groups = "drop") |> # Pivot table!
ggplot(aes(x = Date, y = avg_premium, color = Factor)) + # Plot
geom_line() + geom_point() + theme_light() +
scale_color_brewer(palette = "Set2")
The final graph shows the average premia computed over each year of the sample. It unveils cycles for most factors. The size factor is known to be very sensitive to macroeconomic conditions: smaller firms thrive in good economic times but are hit hard during downturns. In both 2008 and 2009, SMB returns were negative, though this is not necessarily reflected here, again because the original universe is too small.
Below, we regress each factor against all its ‘competitors’. If the alpha (the coefficient on the constant/intercept) in the regression is not significantly different from zero, the factor is somewhat redundant: it can be ‘explained’ by exposures to the other factors. If not, the factor cannot be explained by the other factors and thus adds value to the analysis.
We can thus work simply with the Fama-French factors; no external data is required.
factors <- c("MKT_RF", "SMB", "HML", "RMW", "CMA")
models <- lapply(paste(factors, '~ MKT_RF + SMB + HML + RMW + CMA -', factors), # Each factor vs. all others
function(f){ lm(as.formula(f), data = FF_factors) |> # Calling lm(.)
summary() %>% # Gathering the output
"$"(coef) |> # Keeping only the coefs
data.frame() %>% # Convert to dataframe
filter(rownames(.) == "(Intercept)") |> # Keep only the Intercept
select(Estimate,`Pr...t..`)} # Keep the coef & p-value
)
alphas <- matrix(unlist(models), ncol = 2, byrow = T) |> # Switch from list to dataframe
data.frame(row.names = factors) # Set row names
colnames(alphas) <- c("coef", "p-value") # Set col names
alphas # Show
coef | p-value | |
---|---|---|
MKT_RF | 0.0086840 | 0.0003485 |
SMB | 0.0032523 | 0.0378363 |
HML | -0.0030639 | 0.0388970 |
RMW | 0.0052793 | 0.0000503 |
CMA | 0.0023372 | 0.0190311 |
rm(models) # Remove models
The p-values are rather small, suggesting that each factor retains some explanatory power beyond the others. The results are only partly in line with the conclusions of Fama & French (2015), Section 7, in which the authors find HML to be redundant (a conclusion that does not always hold, as is the case here).
In R, panel models are handled via the great plm package, so we need it! Let’s also load a small dataset and compute forward returns.
if(!require(plm)){install.packages("plm")} # Package for panels
library(plm)
load("data_new.RData") # Load data
data <- data |>
group_by(symbol) |>
mutate(f_return = dplyr::lead(close)/close - 1, # Add return column
mkt_cap = mkt_cap / 10^9,
revenue = revenue / 10^9) |>
ungroup()
data[1:6,1:6]
date | symbol | close | vol_1y | mkt_cap | p2b |
---|---|---|---|---|---|
2000-11-30 | AAPL | 0.2946426 | 0.8388373 | 8.367954 | 2.144287 |
2000-12-31 | AAPL | 0.2656247 | 0.8322865 | 8.367954 | 2.144287 |
2001-01-31 | AAPL | 0.3861603 | 0.9804490 | 8.367954 | 2.144287 |
2001-02-28 | AAPL | 0.3258925 | 0.9769291 | 8.367954 | 2.144287 |
2001-03-31 | AAPL | 0.3941067 | 0.9835925 | 8.367954 | 2.144287 |
2001-04-30 | AAPL | 0.4551781 | 1.0023544 | 8.367954 | 2.144287 |
In the equations below:
- \(t\) indexes time;
- \(n\) indexes stocks;
- \(k\) indexes characteristics.
The predictive relationship is
\[r_{t+1,n} = a + \sum_{k=1}^Kb^kx^k_{t,n} + u_{t+1,n}.\]
If we estimate this as one stacked regression, the estimator is called “pooled”.
if(!require(broom)){install.packages("broom")} # Neat package for regression output
## Loading required package: broom
library(broom)
est_pooled <- plm(f_return ~ vol_1y + mkt_cap + p2b + d2e, # Model
data = data, # Data source
model = "pooling") # Estimation method
summary(est_pooled) # Summary of output
## Pooling Model
##
## Call:
## plm(formula = f_return ~ vol_1y + mkt_cap + p2b + d2e, data = data,
## model = "pooling")
##
## Unbalanced Panel: n = 282, T = 28-30, N = 8283
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.6094527 -0.0391810 -0.0010802 0.0409443 1.2312728
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -2.4915e-03 1.9091e-03 -1.3050 0.1919
## vol_1y 3.5907e-02 6.6090e-03 5.4331 5.696e-08 ***
## mkt_cap 2.4081e-06 3.5133e-06 0.6854 0.4931
## p2b -2.8181e-05 2.0621e-05 -1.3666 0.1718
## d2e 2.0204e-06 1.7178e-06 1.1762 0.2395
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 51.381
## Residual Sum of Squares: 51.157
## R-Squared: 0.0043514
## Adj. R-Squared: 0.0038703
## F-statistic: 9.04464 on 4 and 8278 DF, p-value: 2.7545e-07
coef_pooled <- tidy(est_pooled) # Output to dataframe
rm(est_pooled)
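As a sanity check, the “pooling” model is nothing more than plain OLS on the stacked panel, so lm() should return exactly the same coefficients as above:
lm(f_return ~ vol_1y + mkt_cap + p2b + d2e, data = data) |> coef() # Matches the plm estimates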
If we allow the error term to be decomposed so that
\[r_{t+1,n} = \sum_{k=1}^Kb^kx^k_{t,n} + \mu_n + e_{t+1,n},\]
then the estimator is called the “within” (fixed-effects) estimator. The \(\mu_n\) terms allow the model to capture stock-specific idiosyncrasies.
est_within <- plm(f_return ~ vol_1y + mkt_cap + p2b + d2e,
data = data,
index = c("symbol", "date"),
model = "within")
coef_within <- tidy(est_within)
coef_within
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
vol_1y | 0.0350638 | 0.0072393 | 4.843527 | 0.0000013 |
mkt_cap | -0.0000066 | 0.0000044 | -1.496378 | 0.1345934 |
p2b | -0.0000553 | 0.0000246 | -2.250270 | 0.0244581 |
d2e | 0.0000048 | 0.0000020 | 2.355716 | 0.0185103 |
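Equivalently, the within estimator amounts to plain OLS after demeaning every variable stock by stock, which sweeps out the fixed effects \(\mu_n\). A minimal sketch of this equivalence (restricted to complete cases so that the means match those used by plm; ‘data_w’ is just a temporary variable introduced here):
data_w <- data |>
  drop_na(f_return, vol_1y, mkt_cap, p2b, d2e) |> # Complete cases, as used by plm
  group_by(symbol) |> # Demean each variable stock by stock
  mutate(across(c(f_return, vol_1y, mkt_cap, p2b, d2e), ~ .x - mean(.x))) |>
  ungroup()
lm(f_return ~ vol_1y + mkt_cap + p2b + d2e - 1, data = data_w) |> coef() # Same point estimates as the within model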
Let’s plot the differences: there are some discrepancies! We plot t-statistics to normalize the scales.
bind_rows(
bind_cols(coef_within, type = "within"), # Create new type column
bind_cols(coef_pooled, type = "pooled") # Create new type column
) |>
ggplot(aes(x = term, y = statistic, fill = type)) + # Plot
theme_light() + # White background
geom_col(position = "dodge", alpha = 0.8) +
scale_fill_manual(values = c("#77ABD9", "#FFAF43")) # Custom colors
The differences seem to matter! For instance, the sign of the mkt_cap coefficient flips between the two estimations, while vol_1y is strongly significant in both.
If you want to extract the fixed effects, you can use the fixef() function. Those are the \(\mu_n\).
fixef(est_within)
## AAPL BA BAC C CSCO CVS
## 2.0475e-02 1.4810e-03 -2.4782e-03 -1.2243e-02 -1.1469e-03 -2.8890e-03
## CVX D DIS F FDX GE
## 4.7774e-04 -2.9751e-03 -5.2309e-04 -1.4253e-02 1.6062e-04 -7.0500e-03
## HD IBM INTC JNJ JPM K
## 2.6030e-03 -2.9687e-03 -5.8184e-03 1.2309e-03 9.6266e-05 -1.8847e-03
## LLY MCK MRK MSFT ORCL PFE
## 2.9676e-03 1.5160e-03 -3.0948e-03 8.0064e-03 -7.9531e-04 -5.2093e-03
## PG T VZ WFC WMT XOM
## 2.0477e-03 -7.5625e-03 -3.5452e-03 -1.5579e-03 1.3831e-03 8.1789e-04