This is the companion notebook to the course on empirical foundations of machine learning in asset management.
The topic is strongly tilted towards asset pricing anomalies and serves as the economic justification of ML-based factor portfolios. Indeed, since the 1990s, a large body of empirical work has shown that firms’ characteristics have an impact on future returns. This impact is possibly time-varying.
The first chunks of the code are inspired from the book Reproducible Finance by Jonathan Regenstein (http://www.reproduciblefinance.com).
The code is commented and the results are discussed as much as possible.

\(\rightarrow\) ABOUT THE NOTEBOOK:

  • the code chunks are sequential. They must be executed in the correct order.
  • text between stars appears in bold in html format.

1 Data collection

First, we collect the data for our baseline assets. We work with 6 large American companies which are associated to their ticker symbol: Apple (AAPL, technologies), Boeing (BA, industry), Citigroup (C, banking), Pfizer (PFE, healthcare), Walmart (WMT, retail) and Exxon (XOM, oil/commodities). The six firms have been chosen in very diverse industry sectors.

NOTE: The analysis below should be carried out on a much larger number of assets. We restrict the width of the analysis for simplicity and for computational purposes.

if(!require(tidyverse)){install.packages("tidyverse")}
if(!require(quantmod)){install.packages("quantmod")}
library(tidyverse)                            # Package for data manipulation
library(quantmod)                             # Package for data extraction
tickers <- c("AAPL", "BA", "C", "PFE", "WMT", "XOM")  
# Tickers: Apple, Boeing, Citigroup, Pfizer, Walmart, Exxon
# Others: , "F", "DIS", "GE", "CVX", "MSFT", "GS", "AMZN", "NVDA")
min_date <- "2000-01-01"                      # Starting date
max_date <- "2025-09-30"                      # Ending date
prices <- getSymbols(tickers, src = 'yahoo',  # The data comes from Yahoo Finance
                     from = min_date,         # Start date
                     to = max_date,           # End date
                     auto.assign = TRUE, 
                     warnings = FALSE) %>% 
  map(~Ad(get(.))) |>                         # Retrieving the data
  reduce(merge) |>                            # Merge in one dataframe
  `colnames<-`(tickers)                       # Set the column names
head(prices[,1:5],7)                          # Have a look at the result!
##                 AAPL       BA        C      PFE      WMT
## 2000-01-03 0.8400941 25.94028 209.0070 11.85604 14.23962
## 2000-01-04 0.7692658 25.89995 196.1906 11.41434 13.70681
## 2000-01-05 0.7805229 27.51363 204.0776 11.60033 13.42708
## 2000-01-06 0.7129774 27.79603 213.9364 12.01877 13.57360
## 2000-01-07 0.7467505 28.60290 212.9505 12.83242 14.59927
## 2000-01-10 0.7336167 28.19946 212.2111 12.80917 14.33288
## 2000-01-11 0.6960915 27.67502 209.4998 12.64644 14.11974
rm(list = tickers)                            # Remove unnecessary variables

The data is stored in the ‘prices’ variable. Let’s have a look at the series, to make sure they make sense.

prices %>%                         # OLD PIPE here
  data.frame(Date = index(.)) |>
  remove_rownames() |>
  pivot_longer(-Date, names_to = "Asset", 
               values_to = "Price") |>
  ggplot(aes(x = Date, y = Price, color = Asset)) + geom_line() + 
  facet_wrap(vars(Asset), scales = "free") + theme_light()

This seems alright, though Citigroup suffered massively in the subprime crisis.

Before we move towards the factors, we transform (via mutate) the daily prices into monthly returns. Returns are indeed the raw material in portfolio management. In the code, we resort to a lot of piping using the pipe operator |>.

returns <- prices |>                                 # Start from prices
  to.monthly(indexAt = "lastof", OHLC = FALSE) %>%   # Convert to monthly - OLD PIPE!
  data.frame(Date = index(.)) |>                     # Convert the index to a date
  remove_rownames() |>                               # Remove index => converted into row names
  pivot_longer(-Date, names_to = "Asset", 
               values_to = "Price") |>               # Put the data in 'tidy' format
  group_by(Asset) |>                                 # Group the rows by asset to compute returns
  mutate(returns = Price / dplyr::lag(Price) - 1) |> # Compute the returns
  select(-Price) |>                                  # Remove price column
  pivot_wider(names_from = "Asset", 
              values_from = "returns") |>            # Put the data back in matrix form
  select(all_of(c("Date", tickers))) |>              # Keep only the relevant columns
  na.omit()                                          # Discard rows with missing/NA data
tail(returns[,1:5])                                  # Again, show the result!
Date AAPL BA C PFE
2025-04-30 -0.0433530 0.0744063 -0.0367658 -0.0367008
2025-05-31 -0.0535841 0.1314124 0.1102998 -0.0193313
2025-06-30 0.0215085 0.0106598 0.1301115 0.0319285
2025-07-31 0.0116977 0.0587505 0.1007988 -0.0226124
2025-08-31 0.1196389 0.0578795 0.0374080 0.0631172
2025-09-30 0.0960196 -0.0749957 0.0682407 -0.0367528

We now turn to the importation of factor data. It’s extracted from the amazing database maintained by Kenneth French: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

temp <- tempfile()
link <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip"
download.file(link, temp, quiet = TRUE)         # Download!
nb_factors <- 5                                 # Number of factors
FF_factors <- read_csv(unz(temp, 
                           "F-F_Research_Data_5_Factors_2x3.csv"),  # The file name
                       skip = 3) |>                # Check the nb of lines skip (could be 2!)
  rename(Date = `...1`, MKT_RF = `Mkt-RF`) |>      # Change name of the first column
  mutate_at(vars(-Date), as.numeric) |>                             # Convert to number
  mutate(Date = ymd(parse_date_time(Date, "%Y%m"))) |>              # Date in right format
  mutate(Date = rollback(Date + months(1)))                         # End of month date
FF_factors <- FF_factors |> mutate(MKT_RF = MKT_RF / 100,           # Scaling returns
                                    SMB = SMB / 100,
                                    HML = HML / 100,
                                    RMW = RMW / 100,
                                    CMA = CMA / 100,
                                    RF = RF / 100) |>               # Risk-free rate
  filter(Date >= min_date)                      # Finally, we keep the recent points
tail(FF_factors)                                # And we have a look!
Date MKT_RF SMB HML RMW CMA RF
2025-03-31 -0.0639 -0.0149 0.0290 0.0211 -0.0047 0.0034
2025-04-30 -0.0084 -0.0186 -0.0340 -0.0285 -0.0267 0.0035
2025-05-31 0.0606 -0.0072 -0.0288 0.0126 0.0251 0.0038
2025-06-30 0.0486 -0.0002 -0.0160 -0.0319 0.0145 0.0034
2025-07-31 0.0198 -0.0015 -0.0127 -0.0029 -0.0207 0.0034
2025-08-31 0.0185 0.0488 0.0441 -0.0069 0.0207 0.0038

Finally, we merge the two sources of data into one single variable.

data <- left_join(returns, FF_factors, by = "Date") 
# left_join() is an incredible function that merges two sets through one common column
tail(data[,1:6]) # See the outcome!
Date AAPL BA C PFE WMT
2025-04-30 -0.0433530 0.0744063 -0.0367658 -0.0367008 0.1077570
2025-05-31 -0.0535841 0.1314124 0.1102998 -0.0193313 0.0175701
2025-06-30 0.0215085 0.0106598 0.1301115 0.0319285 -0.0095219
2025-07-31 0.0116977 0.0587505 0.1007988 -0.0226124 0.0020455
2025-08-31 0.1196389 0.0578795 0.0374080 0.0631172 -0.0078944
2025-09-30 0.0960196 -0.0749957 0.0682407 -0.0367528 0.0628995

With our full dataset at hand, we are ready to begin the analysis.

2 Fama-Macbeth regressions

2.1 Principle

The Fama-Macbeth regressions can be used to assess the premium associated to factors. They are performed in two steps.

  1. First, all assets are regressed against the chosen factors (which gives the factor loadings, i.e., the estimated coefficients in the regression).
  2. Then, for each date, the (cross-sectional) returns of the assets are regressed against the loadings obtained in the first step.
  • This yields time-series of premia, which can then be averaged into a proxy for the risk premium associated with each factor.

2.2 First pass

A natural way to proceed would be to resort to a for() loop to perform the regressions iteratively. We take another route and instead make us of the ‘apply’ family of functions. We use a procedure that creates all models via their character syntax in R. And finally, we gather the results in the ‘betas’ variable.

models <- lapply(paste(tickers, ' ~  MKT_RF + SMB + HML + RMW + CMA'), # Specification of models
                 function(f){ lm(as.formula(f), data = data) |>        # Calling lm(.)
                     summary() %>%                                     # Gathering the output
                     "$"(coef) |>                                      # Keeping only the coefs
                     data.frame() |>                                   # Convert to dataframe
                     select(Estimate)}                                 # Keep the coefficients
)

betas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) |>    # Extracting the betas
  data.frame(row.names = tickers)                                       # Formatting: row names
colnames(betas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA")  # Formatting: col names
betas                                                                   # Showing the values
Constant MKT_RF SMB HML RMW CMA
AAPL 0.0186491 1.2341543 0.2402203 -0.5152830 0.2492055 -0.7717273
BA 0.0010666 1.1884310 0.1894390 0.4894745 0.4182510 -0.0278159
C -0.0017137 1.5380004 -0.5803671 1.9945939 -0.9240327 -1.1158045
PFE -0.0030847 0.7504517 -0.2909267 -0.0668321 0.3743308 0.5207637
WMT 0.0040476 0.5382472 -0.3638913 -0.0816542 0.3399109 0.0980102
XOM 0.0007407 0.7342551 -0.0151136 0.4107610 0.2312412 0.3629603
rm(models)                                                              # Remove models to save space

The market beta is the second column. When there are no other factors, a beta above 1 (resp. below 1) means that the stock is more volatile than the market (resp. less volatile). The two stocks with high market betas also load heavily (and negatively on CMA). Citigroup also has a strong positive exposure to HML.

2.3 Second pass

We then turn to the second step of the procedure. This time, we regress asset returns against the betas, one date at a time. We start by formatting the data.

loadings <- betas |>                          # Start from loadings
  select(-Constant) |>                        # Keep only the loadings
  data.frame()                                # Convert to dataframe             
ret <- returns |>                             # Start from returns
  select(-Date) |>                            # Keep the returns only
  data.frame(row.names = returns$Date) |>     # Set row names
  t()                                         # Transpose

FM_data <- cbind(loadings, ret)               # Aggregate both
FM_data[,1:7]                                 # Quick look at the first 8 columns
MKT_RF SMB HML RMW CMA 2000-02-29 2000-03-31
AAPL 1.2341543 0.2402203 -0.5152830 0.2492055 -0.7717273 0.1048204 0.1848414
BA 1.1884310 0.1894390 0.4894745 0.4182510 -0.0278159 -0.1670994 0.0236888
C 1.5380004 -0.5803671 1.9945939 -0.9240327 -1.1158045 -0.0895824 0.1570050
PFE 0.7504517 -0.2909267 -0.0668321 0.3743308 0.5207637 -0.1100105 0.1381321
WMT 0.5382472 -0.3638913 -0.0816542 0.3399109 0.0980102 -0.1095889 0.1604324
XOM 0.7342551 -0.0151136 0.4107610 0.2312412 0.3629603 -0.0861578 0.0356851

This will be the inputs of the second stage of regressions: for each stock, we have both the returns and the associated loading values.

models <- lapply(paste("`", returns$Date, "`", ' ~  MKT_RF + SMB + HML + RMW + CMA', sep = ""),
                 function(f){ lm(as.formula(f), data = FM_data) |>      # Calling lm(.)
                     summary() %>%                                      # Gathering the output
                     "$"(coef) |>                                       # Keeping only the coefs
                     data.frame() |>                                    # Convert to dataframe
                     select(Estimate)}                                  # Keep the coefficients
)
gammas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) |>   # Switch to dataframe
  data.frame(row.names = returns$Date)                                  # & set row names
colnames(gammas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA") # Set col names
gammas[1:10,]                                                           # Show first 20 lines
Constant MKT_RF SMB HML RMW CMA
2000-02-29 0.1239586 -0.0394559 0.2214864 -0.2042830 -0.4434440 0.0238477
2000-03-31 0.0581506 0.0759263 -0.2102380 -0.1237718 -0.0610725 -0.0452642
2000-04-30 -0.4059478 0.4274252 -0.2842407 -0.0475548 0.0748032 0.2370561
2000-05-31 0.0186219 -0.1241037 -0.1407553 0.1927900 0.1246562 0.1118167
2000-06-30 -0.2753429 0.3787721 -0.1288425 -0.1841318 0.0255560 0.0199481
2000-07-31 -0.0758135 -0.0434212 0.0762947 0.2561665 0.3599206 -0.1600895
2000-08-31 -0.1446250 0.3299020 0.2465834 -0.1209519 -0.3083543 0.1399167
2000-09-30 -0.3315829 -0.0648002 -0.2781077 0.5613907 0.9335740 0.0550061
2000-10-31 -0.1640260 -0.0078580 -0.0565691 0.2164587 0.3112272 0.0246202
2000-11-30 0.0017934 -0.2095724 -0.3062634 0.1946519 0.5393848 -0.1810960
rm(models)                                                              # Remove models, save space

Note that the estimates are not going to be very reliable because the number of observations (number of stocks, i.e., 6) is barely larger than the number of variables (number of factors, 5). Accuracy would require a much larger number of assets (hundreds, or, preferably, thousands).

gammas |> colMeans() |> round(4)                                       # Means (over all periods)
## Constant   MKT_RF      SMB      HML      RMW      CMA 
##   0.0198  -0.0067   0.0161  -0.0052  -0.0033  -0.0105
apply(gammas, 2, t.test) |> unlist() |>                                # t-stats
  matrix(ncol = ncol(gammas), dimnames = list(c(), colnames(betas))) |>
  data.frame() |> slice(1) |> as.numeric() |> round(4)    # Take only the first line of the test
## [1]  1.6744 -0.6259  1.4781 -0.7743 -0.2025 -1.5627

Surprisingly, some averages are quite negative. But the t-stats are hardly significant.
The results depend on the initial assets. More accurate results require a much larger set of assets.

2.4 Graphical interpretation

Let’s nonetheless visually assess the stability of premia across time.

gammas |>                                            # Start from gamma (estimates)
  select(-Constant) |>                               # Take out the constant
  rownames_to_column("Date") |>                      # Convert row names to the first column => dates
  mutate(Date = as.Date(Date)) |>                    # Convert first date column to true date format
  pivot_longer(-Date, names_to = "Factor",           # Put in column/tidy format for ggplot
               values_to = "Value") |>     
  ggplot(aes(x = Date, y = Value, color = Factor)) + # Plot
  geom_line() + facet_grid(Factor~.) + theme_light() + 
    scale_color_viridis_d()

CMA and HML appears more stable, while RMW and SMB are more volatile.
Finally, what are the annual average of these series? We build a simple pivot-table below.

gammas |>                                        # To create a pivot table, start with the gammas
  select(-Constant) |>                           # Take out the constant
  rownames_to_column("Date") |>                  # Convert row names to the first column
  mutate(Date = year(as.Date(Date))) |>          # Convert this new first column to year
  pivot_longer(-Date, names_to = "Factor",       # Put in column/tidy format for ggplot
               values_to = "Value") |> 
  group_by(Date, Factor) |>                      # Prepare pivot-table
  summarise(avg_premium = mean(Value)) |>        # Pivot table!
  ggplot(aes(x = Date, y = avg_premium, color = Factor)) + # Plot
  geom_line() + geom_point() + theme_light() +
  scale_color_brewer(palette = "Set2")
## `summarise()` has grouped output by 'Date'. You can override using the
## `.groups` argument.

The final graph shows the average of premias computed over each year in the sample. It unveils some cycles for most factors. The Size factor is known to be very sensitive to macro-economic conditions: indeed smaller firms thrive in good economic times, but are hit hardly during economic downturns. Both in 2008 and 2009, the SMB returns were negative but it’s not necessarily reflected here again because the original universe is too small.

3 Factor competition

Below, we regress each factor against all its ‘competitors’. If the alpha (coefficient of the constant/Integer vector) in the regression is not significantly different from zero, it means that the factor is somewhat redundant because it can be ‘explained’ by exposures to other factors. If not, the factor cannot be explained by the other factors and is thus adding value to the analysis.

We can thus work simply with the Fama-French factors with no external data.

factors <- c("MKT_RF", "SMB", "HML", "RMW", "CMA")
models <- lapply(paste(factors, ' ~  MKT_RF + SMB + HML + RMW + CMA-',factors),
                 function(f){ lm(as.formula(f), data = FF_factors) |> # Calling lm(.)
                     summary() %>%                                    # Gathering the output
                     "$"(coef) |>                                     # Keeping only the coefs
                     data.frame() %>%                                 # Convert to dataframe
                     filter(rownames(.) == "(Intercept)") |>          # Keep only the Intercept
                     select(Estimate,`Pr...t..`)}                     # Keep the coef & p-value
)
alphas <- matrix(unlist(models), ncol = 2, byrow = T) |>              # Switch from list to dataframe
  data.frame(row.names = factors)                                     # Set row names
colnames(alphas) <- c("coef", "p-value")                              # Set col names
alphas                                                                # Show
coef p-value
MKT_RF 0.0086840 0.0003485
SMB 0.0032523 0.0378363
HML -0.0030639 0.0388970
RMW 0.0052793 0.0000503
CMA 0.0023372 0.0190311
rm(models)                                                            # Remove models

The p-values are rather small. The results are somewhat in line with the conclusions of Fama & French (2015) - Section 7 in which the authors conclude that HML is redundant (not always true!).

4 Predictive panels

In R, panel models are handled via the great plm package, so we need it! Also: let’s load a small dataset and compute forward returns.

if(!require(plm)){install.packages("plm")}           # Package for panels
library(plm)
load("data_new.RData")                               # Load data
data <- data |>  
  group_by(symbol) |>
  mutate(f_return = dplyr::lead(close)/close - 1,    # Add return column
         mkt_cap = mkt_cap / 10^9,
         revenue = revenue / 10^9) |>  
  ungroup() 
data[1:6,1:6]
date symbol close vol_1y mkt_cap p2b
2000-11-30 AAPL 0.2946426 0.8388373 8.367954 2.144287
2000-12-31 AAPL 0.2656247 0.8322865 8.367954 2.144287
2001-01-31 AAPL 0.3861603 0.9804490 8.367954 2.144287
2001-02-28 AAPL 0.3258925 0.9769291 8.367954 2.144287
2001-03-31 AAPL 0.3941067 0.9835925 8.367954 2.144287
2001-04-30 AAPL 0.4551781 1.0023544 8.367954 2.144287

In equations below: - \(t\) is time;
- \(n\) is stock;
- \(k\) is characteristic.

The predictive relationship is

\[r_{t+1,n} = a + \sum_{k=1}^Kb^kx^k_{t,n} + u_{t+1,n}\]

If we estimate this like a stacked regression, the esimtator is called “pooled”.

if(!require(broom)){install.packages("broom")}             # Neat package for regression output
## Loading required package: broom
library(broom)
est_pooled <- plm(f_return ~ vol_1y + mkt_cap + p2b + d2e, # Model
                  data = data,                             # Data source
                  model = "pooling")                       # Estimation method
summary(est_pooled)                                        # Summary of output
## Pooling Model
## 
## Call:
## plm(formula = f_return ~ vol_1y + mkt_cap + p2b + d2e, data = data, 
##     model = "pooling")
## 
## Unbalanced Panel: n = 282, T = 28-30, N = 8283
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -0.6094527 -0.0391810 -0.0010802  0.0409443  1.2312728 
## 
## Coefficients:
##                Estimate  Std. Error t-value  Pr(>|t|)    
## (Intercept) -2.4915e-03  1.9091e-03 -1.3050    0.1919    
## vol_1y       3.5907e-02  6.6090e-03  5.4331 5.696e-08 ***
## mkt_cap      2.4081e-06  3.5133e-06  0.6854    0.4931    
## p2b         -2.8181e-05  2.0621e-05 -1.3666    0.1718    
## d2e          2.0204e-06  1.7178e-06  1.1762    0.2395    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    51.381
## Residual Sum of Squares: 51.157
## R-Squared:      0.0043514
## Adj. R-Squared: 0.0038703
## F-statistic: 9.04464 on 4 and 8278 DF, p-value: 2.7545e-07
coef_pooled <- tidy(est_pooled)                            # Output to dataframe
rm(est_pooled)

If we allow the error term to be decomposed so that

\[r_{t+1,n} = \sum_{k=1}^Kb^kx^k_{t,n} + \mu_n+ e_{t+1,n},\]

then estimators are called within (fixed-effect estimation). This means that the model allows for stock-specific idiosyncrasies.

est_within <- plm(f_return ~ vol_1y + mkt_cap + p2b + d2e, 
                  data = data, 
                  index = c("symbol", "date"),
                  model = "within") 
coef_within <- tidy(est_within)
coef_within
term estimate std.error statistic p.value
vol_1y 0.0350638 0.0072393 4.843527 0.0000013
mkt_cap -0.0000066 0.0000044 -1.496378 0.1345934
p2b -0.0000553 0.0000246 -2.250270 0.0244581
d2e 0.0000048 0.0000020 2.355716 0.0185103

Let’s plot the differences. There are some discrepancies!
We plot t-statistics to normalize scales.

bind_rows(
  bind_cols(coef_within, type = "within"),              # Create new type column      
  bind_cols(coef_pooled, type = "pooled")               # Create new type column
) |>
  ggplot(aes(x = term, y = statistic, fill = type)) +   # Plot
  theme_light() +                                       # White background
  geom_col(position = "dodge", alpha = 0.8) +
  scale_fill_manual(values = c("#77ABD9", "#FFAF43"))   # Custom colors

The differences seem to matter! Except for Mkt_Cap which seem to be a driver in both types of estimations

If you want to extract the fixed effects, you can use the fixef() function. Those are the \(\mu_n\).

fixef(est_within)
##        AAPL          BA         BAC           C        CSCO         CVS 
##  2.0475e-02  1.4810e-03 -2.4782e-03 -1.2243e-02 -1.1469e-03 -2.8890e-03 
##         CVX           D         DIS           F         FDX          GE 
##  4.7774e-04 -2.9751e-03 -5.2309e-04 -1.4253e-02  1.6062e-04 -7.0500e-03 
##          HD         IBM        INTC         JNJ         JPM           K 
##  2.6030e-03 -2.9687e-03 -5.8184e-03  1.2309e-03  9.6266e-05 -1.8847e-03 
##         LLY         MCK         MRK        MSFT        ORCL         PFE 
##  2.9676e-03  1.5160e-03 -3.0948e-03  8.0064e-03 -7.9531e-04 -5.2093e-03 
##          PG           T          VZ         WFC         WMT         XOM 
##  2.0477e-03 -7.5625e-03 -3.5452e-03 -1.5579e-03  1.3831e-03  8.1789e-04

5 Exercises

  1. Perform Fama-MacBeth regressions using a few more stocks (say, ten: find some more on Yahoo) and see if that changes the results.
  2. Try factor competition on another dataset from Ken French’s library (e.g., add the momentum factor).