This is the companion notebook to the course on empirical foundations of machine learning in asset management.
The topic is strongly tilted towards asset pricing anomalies and serves as the economic justification of ML-based factor portfolios. Indeed, since the 1990s, a large body of empirical work has shown that firms’ characteristics have an impact on future returns. This impact is possibly time-varying.
The first chunks of the code are inspired from the book Reproducible Finance by Jonathan Regenstein (http://www.reproduciblefinance.com).
The code is commented and the results are discussed as much as possible.

\(\rightarrow\) ABOUT THE NOTEBOOK:

the code chunks are sequential. They must be executed in the correct order.
text between stars appears in bold in html format.

1 Data collection

First, we collect the data for our baseline assets. We work with 6 large American companies which are associated to their ticker symbol: Apple (AAPL, technologies), Boeing (BA, industry), Citigroup (C, banking), Pfizer (PFE, healthcare), Walmart (WMT, retail) and Exxon (XOM, oil/commodities). The six firms have been chosen in very diverse industry sectors.

NOTE: The analysis below should be carried out on a much larger number of assets. We restrict the width of the analysis for simplicity and for computational purposes.

if(!require(tidyverse)){install.packages("tidyverse")}
if(!require(quantmod)){install.packages("quantmod")}
library(tidyverse)                            # Package for data manipulation
library(quantmod)                             # Package for data extraction
tickers <- c("AAPL", "BA", "C", "PFE", "WMT", "XOM")  
# Tickers: Apple, Boeing, Citigroup, Pfizer, Walmart, Exxon
# Others: , "F", "DIS", "GE", "CVX", "MSFT", "GS", "AMZN", "NVDA")
min_date <- "2000-01-01"                      # Starting date
max_date <- "2025-09-30"                      # Ending date
prices <- getSymbols(tickers, src = 'yahoo',  # The data comes from Yahoo Finance
                     from = min_date,         # Start date
                     to = max_date,           # End date
                     auto.assign = TRUE, 
                     warnings = FALSE) %>% 
  map(~Ad(get(.))) |>                         # Retrieving the data
  reduce(merge) |>                            # Merge in one dataframe
  `colnames<-`(tickers)                       # Set the column names
head(prices[,1:5],7)                          # Have a look at the result!

##                 AAPL       BA        C      PFE      WMT
## 2000-01-03 0.8400941 25.94028 209.0070 11.85604 14.23962
## 2000-01-04 0.7692659 25.89995 196.1906 11.41435 13.70680
## 2000-01-05 0.7805228 27.51364 204.0776 11.60033 13.42707
## 2000-01-06 0.7129775 27.79603 213.9363 12.01877 13.57360
## 2000-01-07 0.7467507 28.60290 212.9506 12.83243 14.59928
## 2000-01-10 0.7336166 28.19946 212.2112 12.80918 14.33287
## 2000-01-11 0.6960916 27.67503 209.4999 12.64645 14.11974

rm(list = tickers)                            # Remove unnecessary variables

The data is stored in the ‘prices’ variable. Let’s have a look at the series, to make sure they make sense.

prices %>%                         # OLD PIPE here
  data.frame(Date = index(.)) |>
  remove_rownames() |>
  pivot_longer(-Date, names_to = "Asset", 
               values_to = "Price") |>
  ggplot(aes(x = Date, y = Price, color = Asset)) + geom_line() + 
  facet_wrap(vars(Asset), scales = "free") + theme_light()

This seems alright, though Citigroup suffered massively in the subprime crisis.

Before we move towards the factors, we transform (via mutate) the daily prices into monthly returns. Returns are indeed the raw material in portfolio management. In the code, we resort to a lot of piping using the pipe operator |>.

returns <- prices |>                                 # Start from prices
  to.monthly(indexAt = "lastof", OHLC = FALSE) %>%   # Convert to monthly - OLD PIPE!
  data.frame(Date = index(.)) |>                     # Convert the index to a date
  remove_rownames() |>                               # Remove index => converted into row names
  pivot_longer(-Date, names_to = "Asset", 
               values_to = "Price") |>               # Put the data in 'tidy' format
  group_by(Asset) |>                                 # Group the rows by asset to compute returns
  mutate(returns = Price / dplyr::lag(Price) - 1) |> # Compute the returns
  select(-Price) |>                                  # Remove price column
  pivot_wider(names_from = "Asset", 
              values_from = "returns") |>            # Put the data back in matrix form
  select(all_of(c("Date", tickers))) |>              # Keep only the relevant columns
  na.omit()                                          # Discard rows with missing/NA data
tail(returns[,1:5])                                  # Again, show the result!

Date	AAPL	BA	C	PFE
2025-04-30	-0.0433530	0.0744063	-0.0367658	-0.0367008
2025-05-31	-0.0535841	0.1314124	0.1102998	-0.0193313
2025-06-30	0.0215085	0.0106598	0.1301115	0.0319285
2025-07-31	0.0116977	0.0587505	0.1007988	-0.0226124
2025-08-31	0.1196389	0.0578795	0.0374080	0.0631172
2025-09-30	0.0960196	-0.0749957	0.0682407	-0.0367528

We now turn to the importation of factor data. It’s extracted from the amazing database maintained by Kenneth French: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

temp <- tempfile()
link <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip"
download.file(link, temp, quiet = TRUE)         # Download!
nb_factors <- 5                                 # Number of factors
FF_factors <- read_csv(unz(temp, 
                           "F-F_Research_Data_5_Factors_2x3.csv"),  # The file name
                       skip = 3) |>                # Check the nb of lines skip (could be 2!)
  rename(Date = `...1`, MKT_RF = `Mkt-RF`) |>      # Change name of the first column
  mutate_at(vars(-Date), as.numeric) |>                             # Convert to number
  mutate(Date = ymd(parse_date_time(Date, "%Y%m"))) |>              # Date in right format
  mutate(Date = rollback(Date + months(1)))                         # End of month date
FF_factors <- FF_factors |> mutate(MKT_RF = MKT_RF / 100,           # Scaling returns
                                    SMB = SMB / 100,
                                    HML = HML / 100,
                                    RMW = RMW / 100,
                                    CMA = CMA / 100,
                                    RF = RF / 100) |>               # Risk-free rate
  filter(Date >= min_date)                      # Finally, we keep the recent points
tail(FF_factors)                                # And we have a look!

Date	MKT_RF	SMB	HML	RMW	CMA	RF
2025-03-31	-0.0639	-0.0149	0.0290	0.0211	-0.0047	0.0034
2025-04-30	-0.0084	-0.0186	-0.0340	-0.0285	-0.0267	0.0035
2025-05-31	0.0606	-0.0072	-0.0288	0.0126	0.0251	0.0038
2025-06-30	0.0486	-0.0002	-0.0160	-0.0319	0.0145	0.0034
2025-07-31	0.0198	-0.0015	-0.0127	-0.0029	-0.0207	0.0034
2025-08-31	0.0185	0.0488	0.0441	-0.0069	0.0207	0.0038

Finally, we merge the two sources of data into one single variable.

data <- left_join(returns, FF_factors, by = "Date") 
# left_join() is an incredible function that merges two sets through one common column
tail(data[,1:6]) # See the outcome!

Date	AAPL	BA	C	PFE	WMT
2025-04-30	-0.0433530	0.0744063	-0.0367658	-0.0367008	0.1077570
2025-05-31	-0.0535841	0.1314124	0.1102998	-0.0193313	0.0175701
2025-06-30	0.0215085	0.0106598	0.1301115	0.0319285	-0.0095219
2025-07-31	0.0116977	0.0587505	0.1007988	-0.0226124	0.0020455
2025-08-31	0.1196389	0.0578795	0.0374080	0.0631172	-0.0078944
2025-09-30	0.0960196	-0.0749957	0.0682407	-0.0367528	0.0628995

With our full dataset at hand, we are ready to begin the analysis.

2 Fama-Macbeth regressions

2.1 Principle

The Fama-Macbeth regressions can be used to assess the premium associated to factors. They are performed in two steps.

First, all assets are regressed against the chosen factors (which gives the factor loadings, i.e., the estimated coefficients in the regression).
Then, for each date, the (cross-sectional) returns of the assets are regressed against the loadings obtained in the first step.

This yields time-series of premia, which can then be averaged into a proxy for the risk premium associated with each factor.

2.2 First pass

A natural way to proceed would be to resort to a for() loop to perform the regressions iteratively. We take another route and instead make us of the ‘apply’ family of functions. We use a procedure that creates all models via their character syntax in R. And finally, we gather the results in the ‘betas’ variable.

models <- lapply(paste(tickers, ' ~  MKT_RF + SMB + HML + RMW + CMA'), # Specification of models
                 function(f){ lm(as.formula(f), data = data) |>        # Calling lm(.)
                     summary() %>%                                     # Gathering the output
                     "$"(coef) |>                                      # Keeping only the coefs
                     data.frame() |>                                   # Convert to dataframe
                     select(Estimate)}                                 # Keep the coefficients
)

betas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) |>    # Extracting the betas
  data.frame(row.names = tickers)                                       # Formatting: row names
colnames(betas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA")  # Formatting: col names
betas                                                                   # Showing the values

	Constant	MKT_RF	SMB	HML	RMW	CMA
AAPL	0.0186491	1.2341543	0.2402198	-0.5152826	0.2492065	-0.7717273
BA	0.0010666	1.1884300	0.1894402	0.4894748	0.4182519	-0.0278164
C	-0.0017137	1.5380006	-0.5803668	1.9945948	-0.9240332	-1.1158051
PFE	-0.0030847	0.7504509	-0.2909254	-0.0668317	0.3743310	0.5207607
WMT	0.0040476	0.5382468	-0.3638918	-0.0816543	0.3399102	0.0980102
XOM	0.0007407	0.7342552	-0.0151137	0.4107608	0.2312420	0.3629598

rm(models)                                                              # Remove models to save space

The market beta is the second column. When there are no other factors, a beta above 1 (resp. below 1) means that the stock is more volatile than the market (resp. less volatile). The two stocks with high market betas also load heavily (and negatively on CMA). Citigroup also has a strong positive exposure to HML.

2.3 Second pass

We then turn to the second step of the procedure. This time, we regress asset returns against the betas, one date at a time. We start by formatting the data.

loadings <- betas |>                          # Start from loadings
  select(-Constant) |>                        # Keep only the loadings
  data.frame()                                # Convert to dataframe             
ret <- returns |>                             # Start from returns
  select(-Date) |>                            # Keep the returns only
  data.frame(row.names = returns$Date) |>     # Set row names
  t()                                         # Transpose

FM_data <- cbind(loadings, ret)               # Aggregate both
FM_data[,1:7]                                 # Quick look at the first 8 columns

	MKT_RF	SMB	HML	RMW	CMA	2000-02-29	2000-03-31
AAPL	1.2341543	0.2402198	-0.5152826	0.2492065	-0.7717273	0.1048199	0.1848417
BA	1.1884300	0.1894402	0.4894748	0.4182519	-0.0278164	-0.1670993	0.0236879
C	1.5380006	-0.5803668	1.9945948	-0.9240332	-1.1158051	-0.0895828	0.1570051
PFE	0.7504509	-0.2909254	-0.0668317	0.3743310	0.5207607	-0.1100100	0.1381317
WMT	0.5382468	-0.3638918	-0.0816543	0.3399102	0.0980102	-0.1095886	0.1604323
XOM	0.7342552	-0.0151137	0.4107608	0.2312420	0.3629598	-0.0861583	0.0356852

This will be the inputs of the second stage of regressions: for each stock, we have both the returns and the associated loading values.

models <- lapply(paste("`", returns$Date, "`", ' ~  MKT_RF + SMB + HML + RMW + CMA', sep = ""),
                 function(f){ lm(as.formula(f), data = FM_data) |>      # Calling lm(.)
                     summary() %>%                                      # Gathering the output
                     "$"(coef) |>                                       # Keeping only the coefs
                     data.frame() |>                                    # Convert to dataframe
                     select(Estimate)}                                  # Keep the coefficients
)
gammas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) |>   # Switch to dataframe
  data.frame(row.names = returns$Date)                                  # & set row names
colnames(gammas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA") # Set col names
gammas[1:10,]                                                           # Show first 20 lines

	Constant	MKT_RF	SMB	HML	RMW	CMA
2000-02-29	0.1239572	-0.0394546	0.2214854	-0.2042833	-0.4434430	0.0238480
2000-03-31	0.0581515	0.0759261	-0.2102370	-0.1237725	-0.0610748	-0.0452637
2000-04-30	-0.4059528	0.4274302	-0.2842439	-0.0475553	0.0748051	0.2370577
2000-05-31	0.0186219	-0.1241036	-0.1407549	0.1927901	0.1246558	0.1118174
2000-06-30	-0.2753457	0.3787751	-0.1288447	-0.1841319	0.0255572	0.0199490
2000-07-31	-0.0758140	-0.0434219	0.0762935	0.2561677	0.3599236	-0.1600908
2000-08-31	-0.1446226	0.3299000	0.2465852	-0.1209522	-0.3083556	0.1399164
2000-09-30	-0.3315864	-0.0647979	-0.2781123	0.5613909	0.9335768	0.0550063
2000-10-31	-0.1640269	-0.0078577	-0.0565697	0.2164590	0.3112282	0.0246200
2000-11-30	0.0017944	-0.2095740	-0.3062641	0.1946525	0.5393854	-0.1810966

rm(models)                                                              # Remove models, save space

Note that the estimates are not going to be very reliable because the number of observations (number of stocks, i.e., 6) is barely larger than the number of variables (number of factors, 5). Accuracy would require a much larger number of assets (hundreds, or, preferably, thousands).

gammas |> colMeans() |> round(4)                                       # Means (over all periods)

## Constant   MKT_RF      SMB      HML      RMW      CMA 
##   0.0198  -0.0067   0.0161  -0.0052  -0.0033  -0.0105

apply(gammas, 2, t.test) |> unlist() |>                                # t-stats
  matrix(ncol = ncol(gammas), dimnames = list(c(), colnames(betas))) |>
  data.frame() |> slice(1) |> as.numeric() |> round(4)    # Take only the first line of the test

## [1]  1.6744 -0.6259  1.4781 -0.7743 -0.2025 -1.5627

Surprisingly, some averages are quite negative. But the t-stats are hardly significant.
The results depend on the initial assets. More accurate results require a much larger set of assets.

2.4 Graphical interpretation

Let’s nonetheless visually assess the stability of premia across time.

gammas |>                                            # Start from gamma (estimates)
  select(-Constant) |>                               # Take out the constant
  rownames_to_column("Date") |>                      # Convert row names to the first column => dates
  mutate(Date = as.Date(Date)) |>                    # Convert first date column to true date format
  pivot_longer(-Date, names_to = "Factor",           # Put in column/tidy format for ggplot
               values_to = "Value") |>     
  ggplot(aes(x = Date, y = Value, color = Factor)) + # Plot
  geom_line() + facet_grid(Factor~.) + theme_light() + 
    scale_color_viridis_d()

CMA and HML appears more stable, while RMW and SMB are more volatile.
Finally, what are the annual average of these series? We build a simple pivot-table below.

gammas |>                                        # To create a pivot table, start with the gammas
  select(-Constant) |>                           # Take out the constant
  rownames_to_column("Date") |>                  # Convert row names to the first column
  mutate(Date = year(as.Date(Date))) |>          # Convert this new first column to year
  pivot_longer(-Date, names_to = "Factor",       # Put in column/tidy format for ggplot
               values_to = "Value") |> 
  group_by(Date, Factor) |>                      # Prepare pivot-table
  summarise(avg_premium = mean(Value)) |>        # Pivot table!
  ggplot(aes(x = Date, y = avg_premium, color = Factor)) + # Plot
  geom_line() + geom_point() + theme_light() +
  scale_color_brewer(palette = "Set2")

## `summarise()` has grouped output by 'Date'. You can override using the
## `.groups` argument.

The final graph shows the average of premias computed over each year in the sample. It unveils some cycles for most factors. The Size factor is known to be very sensitive to macro-economic conditions: indeed smaller firms thrive in good economic times, but are hit hardly during economic downturns. Both in 2008 and 2009, the SMB returns were negative but it’s not necessarily reflected here again because the original universe is too small.

3 Factor competition

Below, we regress each factor against all its ‘competitors’. If the alpha (coefficient of the constant/Integer vector) in the regression is not significantly different from zero, it means that the factor is somewhat redundant because it can be ‘explained’ by exposures to other factors. If not, the factor cannot be explained by the other factors and is thus adding value to the analysis.

We can thus work simply with the Fama-French factors with no external data.

factors <- c("MKT_RF", "SMB", "HML", "RMW", "CMA")
models <- lapply(paste(factors, ' ~  MKT_RF + SMB + HML + RMW + CMA-',factors),
                 function(f){ lm(as.formula(f), data = FF_factors) |> # Calling lm(.)
                     summary() %>%                                    # Gathering the output
                     "$"(coef) |>                                     # Keeping only the coefs
                     data.frame() %>%                                 # Convert to dataframe
                     filter(rownames(.) == "(Intercept)") |>          # Keep only the Intercept
                     select(Estimate,`Pr...t..`)}                     # Keep the coef & p-value
)
alphas <- matrix(unlist(models), ncol = 2, byrow = T) |>              # Switch from list to dataframe
  data.frame(row.names = factors)                                     # Set row names
colnames(alphas) <- c("coef", "p-value")                              # Set col names
alphas                                                                # Show

	coef	p-value
MKT_RF	0.0086840	0.0003485
SMB	0.0032523	0.0378363
HML	-0.0030639	0.0388970
RMW	0.0052793	0.0000503
CMA	0.0023372	0.0190311

rm(models)                                                            # Remove models

The p-values are rather small. The results are somewhat in line with the conclusions of Fama & French (2015) - Section 7 in which the authors conclude that HML is redundant (not always true!).

4 Predictive panels

In R, panel models are handled via the great plm package, so we need it! Also: let’s load a small dataset and compute forward returns.

if(!require(plm)){install.packages("plm")}           # Package for panels
library(plm)
load("data_new.RData")                               # Load data
data <- data |>  
  group_by(symbol) |>
  mutate(f_return = dplyr::lead(close)/close - 1,    # Add return column
         mkt_cap = mkt_cap / 10^9,
         revenue = revenue / 10^9) |>  
  ungroup() 
data[1:6,1:6]

date	symbol	close	vol_1y	mkt_cap	p2b
2000-11-30	AAPL	0.2946426	0.8388373	8.367954	2.144287
2000-12-31	AAPL	0.2656247	0.8322865	8.367954	2.144287
2001-01-31	AAPL	0.3861603	0.9804490	8.367954	2.144287
2001-02-28	AAPL	0.3258925	0.9769291	8.367954	2.144287
2001-03-31	AAPL	0.3941067	0.9835925	8.367954	2.144287
2001-04-30	AAPL	0.4551781	1.0023544	8.367954	2.144287

In equations below: - \(t\) is time;
- \(n\) is stock;
- \(k\) is characteristic.

The predictive relationship is

\[r_{t+1,n} = a + \sum_{k=1}^Kb^kx^k_{t,n} + u_{t+1,n}\]

If we estimate this like a stacked regression, the esimtator is called “pooled”.

if(!require(broom)){install.packages("broom")}             # Neat package for regression output

## Loading required package: broom

library(broom)
est_pooled <- plm(f_return ~ vol_1y + mkt_cap + p2b + d2e, # Model
                  data = data,                             # Data source
                  model = "pooling")                       # Estimation method
summary(est_pooled)                                        # Summary of output

## Pooling Model
## 
## Call:
## plm(formula = f_return ~ vol_1y + mkt_cap + p2b + d2e, data = data, 
##     model = "pooling")
## 
## Unbalanced Panel: n = 291, T = 28-30, N = 8553
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -0.6094810 -0.0394745 -0.0011919  0.0410463  1.2307546 
## 
## Coefficients:
##                Estimate  Std. Error t-value  Pr(>|t|)    
## (Intercept) -2.1596e-03  1.8684e-03 -1.1559    0.2478    
## vol_1y       3.5389e-02  6.5382e-03  5.4126 6.381e-08 ***
## mkt_cap      2.8727e-06  3.0381e-06  0.9456    0.3444    
## p2b         -2.8121e-05  1.9410e-05 -1.4487    0.1474    
## d2e          2.0109e-06  1.6465e-06  1.2213    0.2220    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    52.915
## Residual Sum of Squares: 52.693
## R-Squared:      0.0041949
## Adj. R-Squared: 0.0037289
## F-statistic: 9.0023 on 4 and 8548 DF, p-value: 2.9803e-07

coef_pooled <- tidy(est_pooled)                            # Output to dataframe
rm(est_pooled)

If we allow the error term to be decomposed so that

\[r_{t+1,n} = \sum_{k=1}^Kb^kx^k_{t,n} + \mu_n+ e_{t+1,n},\]

then estimators are called within (fixed-effect estimation). This means that the model allows for stock-specific idiosyncrasies.

est_within <- plm(f_return ~ vol_1y + mkt_cap + p2b + d2e, 
                  data = data, 
                  index = c("symbol", "date"),
                  model = "within") 
coef_within <- tidy(est_within)
coef_within

term	estimate	std.error	statistic	p.value
vol_1y	0.0350733	0.0071549	4.901978	0.0000010
mkt_cap	-0.0000046	0.0000038	-1.218990	0.2228816
p2b	-0.0000526	0.0000233	-2.256637	0.0240560
d2e	0.0000044	0.0000019	2.290262	0.0220305

Let’s plot the differences. There are some discrepancies!
We plot t-statistics to normalize scales.

bind_rows(
  bind_cols(coef_within, type = "within"),              # Create new type column      
  bind_cols(coef_pooled, type = "pooled")               # Create new type column
) |>
  ggplot(aes(x = term, y = statistic, fill = type)) +   # Plot
  theme_light() +                                       # White background
  geom_col(position = "dodge", alpha = 0.8) +
  scale_fill_manual(values = c("#77ABD9", "#FFAF43"))   # Custom colors

The differences seem to matter! Except for Mkt_Cap which seem to be a driver in both types of estimations

If you want to extract the fixed effects, you can use the fixef() function. Those are the \(\mu_n\).

fixef(est_within)

##        AAPL          BA         BAC           C        CSCO         CVS 
##  2.0098e-02  1.0240e-03 -2.1884e-03 -1.1473e-02 -1.7236e-04 -2.5643e-03 
##         CVX           D         DIS           F         FDX          GE 
##  5.3522e-05 -2.9273e-03 -6.2555e-04 -1.4414e-02 -5.1628e-05 -6.3901e-03 
##          HD         IBM        INTC         JNJ         JPM           K 
##  2.7330e-03 -1.8647e-03 -7.0200e-03  1.0354e-03  5.6677e-04 -8.4860e-04 
##         LLY         MCK         MRK        MSFT        ORCL         PFE 
##  3.0349e-03  1.4021e-03 -4.4464e-03  6.6255e-03  6.7220e-05 -5.8698e-03 
##          PG           T          VZ         WFC         WMT         XOM 
##  1.7305e-03 -6.2815e-03 -3.6489e-03 -9.8002e-04  2.1795e-03 -1.8899e-04

5 Exercises

Perform Fama-MacBeth regressions using a few more stocks (say, ten: find some more on Yahoo) and see if that changes the results.
Try factor competition on another dataset from Ken French’s library (e.g., add the momentum factor).

Asset pricing anomalies