This is the companion notebook to the course on empirical foundations of machine learning in asset management.
The topic is strongly tilted towards asset pricing anomalies and serves as the economic justification of ML-based factor portfolios. Indeed, since the 1990s, a large body of empirical work has shown that firms’ characteristics have an impact on future returns. This impact is possibly time-varying.
The first chunks of the code are inspired from the book Reproducible Finance by Jonathan Regenstein (http://www.reproduciblefinance.com).
The code is commented and the results are discussed as much as possible.

\(\rightarrow\) ABOUT THE NOTEBOOK:

  • the code chunks are sequential. They must be executed in the correct order.
  • text between stars appears in bold in html format.

1 Data collection

First, we collect the data for our baseline assets. We work with 6 large American companies which are associated to their ticker symbol: Apple (AAPL, technologies), Boeing (BA, industry), Citigroup (C, banking), Pfizer (PFE, healthcare), Walmart (WMT, retail) and Exxon (XOM, oil/commodities). The six firms have been chosen in very diverse industry sectors.

NOTE: The analysis below should be carried out on a much larger number of assets. We restrict the width of the analysis for simplicity and for computational purposes.

if(!require(tidyverse)){install.packages("tidyverse")}
if(!require(quantmod)){install.packages("quantmod")}
if(!require(lubridate)){install.packages("lubridate")}
library(tidyverse)                            # Package for data manipulation
library(quantmod)                             # Package for data extraction
library(lubridate)                            # Package for date management
tickers <- c("AAPL", "BA", "C", "PFE", "WMT", "XOM")  
# Tickers: Apple, Boeing, Citigroup, Pfizer, Walmart, Exxon
# Others: , "F", "DIS", "GE", "CVX", "MSFT", "GS")
min_date <- "2000-01-01"                      # Starting date
max_date <- "2022-04-27"                      # Ending date
prices <- getSymbols(tickers, src = 'yahoo',  # The data comes from Yahoo Finance
                     from = min_date,         # Start date
                     to = max_date,           # End date
                     auto.assign = TRUE, 
                     warnings = FALSE) %>% 
  map(~Ad(get(.))) %>%                        # Retrieving the data
  reduce(merge) %>%                           # Merge in one dataframe
  `colnames<-`(tickers)                       # Set the column names
head(prices,7)                                # Have a look at the result!
##                AAPL       BA        C      PFE      WMT      XOM
## 2000-01-03 0.855796 25.94028 236.7048 14.20126 44.74233 19.67147
## 2000-01-04 0.783644 25.89994 222.1900 13.67220 43.06817 19.29469
## 2000-01-05 0.795111 27.51364 231.1223 13.89496 42.18923 20.34655
## 2000-01-06 0.726304 27.79605 242.2875 14.39618 42.64964 21.39841
## 2000-01-07 0.760708 28.60290 241.1711 15.37078 45.87244 21.33561
## 2000-01-10 0.747329 28.19947 240.3335 15.34294 45.03532 21.03732
## 2000-01-11 0.709102 27.67502 237.2631 15.14801 44.36568 21.10013
rm(AAPL, BA, C, PFE, WMT, XOM)                # Remove unnecessary variables

The data is stored in the ‘prices’ variable. Let’s have a look at the series, to make sure they make sense.

prices %>%
  data.frame(Date = index(.)) %>%
  remove_rownames() %>%
  pivot_longer(-Date, names_to = "Asset", 
               values_to = "Price") %>%
  ggplot(aes(x = Date, y = Price, color = Asset)) + geom_line() + 
  facet_wrap(vars(Asset), scales = "free") + theme_light()

This seems alright, though Citigroup suffered massively in the subprime crisis.

Before we move towards the factors, we transform (via mutate) the daily prices into monthly returns. Returns are indeed the raw material in portfolio management. In the code, we resort to a lot of piping using the pipe operator %>%.

returns <- prices %>%                                 # Start from prices
  to.monthly(indexAt = "lastof", OHLC = FALSE) %>%    # Convert to monthly 
  data.frame(Date = index(.)) %>%                     # Convert the index to a date
  remove_rownames() %>%                               # Remove index => converted into row names
  pivot_longer(-Date, names_to = "Asset", 
               values_to = "Price") %>%               # Put the data in 'tidy' format
  group_by(Asset) %>%                                 # Group the rows by asset to compute returns
  mutate(returns = Price / dplyr::lag(Price) - 1) %>% # Compute the returns
  select(-Price) %>%                                  # Remove price column
  pivot_wider(names_from = "Asset", 
              values_from = "returns") %>%            # Put the data back in matrix form
  select(all_of(c("Date", tickers))) %>%              # Keep only the relevant columns
  na.omit()                                           # Discard rows with missing/NA data
head(returns)                                         # Again, show the result!
## # A tibble: 6 × 7
##   Date          AAPL      BA       C     PFE      WMT      XOM
##   <date>       <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
## 1 2000-02-29  0.105  -0.167  -0.0896 -0.110  -0.110   -0.0862 
## 2 2000-03-31  0.185   0.0237  0.157   0.138   0.160    0.0357 
## 3 2000-04-30 -0.0865  0.0496 -0.0121  0.152  -0.0199  -0.00401
## 4 2000-05-31 -0.323  -0.0121  0.0540  0.0584  0.0406   0.0782 
## 5 2000-06-30  0.247   0.0704 -0.0312  0.0787  0.00109 -0.0578 
## 6 2000-07-31 -0.0298  0.167   0.170  -0.0964 -0.0412   0.0215

We now turn to the importation of factor data. It’s extracted from the amazing database maintained by Kenneth French: http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

temp <- tempfile()
link <- "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip"
download.file(link, temp, quiet = TRUE)         # Download!
nb_factors <- 5                                 # Number of factors
FF_factors <- read_csv(unz(temp, 
                           "F-F_Research_Data_5_Factors_2x3.csv"),   # The file name
                       skip = 3) %>%            # Check the nb of lines to be skipped (could be 2!)
  rename(Date = `...1`, MKT_RF = `Mkt-RF`) %>%      # Change name of the first column
  mutate_at(vars(-Date), as.numeric) %>%                             # Convert to number
  mutate(Date = ymd(parse_date_time(Date, "%Y%m"))) %>%              # Date in right format
  mutate(Date = rollback(Date + months(1)))                          # End of month date
FF_factors <- FF_factors %>% mutate(MKT_RF = MKT_RF / 100,           # Scaling returns
                                    SMB = SMB / 100,
                                    HML = HML / 100,
                                    RMW = RMW / 100,
                                    CMA = CMA / 100,
                                    RF = RF / 100) %>%        # That's the risk-free rate
  filter(Date >= min_date)                      # Finally, we keep the recent points
head(FF_factors)                                # And we have a look!
## # A tibble: 6 × 7
##   Date        MKT_RF     SMB     HML     RMW     CMA     RF
##   <date>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1 2000-01-31 -0.0474  0.0445 -0.0189 -0.0629  0.0474 0.0041
## 2 2000-02-29  0.0245  0.184  -0.0981 -0.188  -0.0035 0.0043
## 3 2000-03-31  0.052  -0.154   0.0823  0.118  -0.0161 0.0047
## 4 2000-04-30 -0.064  -0.0496  0.0725  0.0767  0.0562 0.0046
## 5 2000-05-31 -0.0442 -0.0387  0.0483  0.0418  0.0132 0.005 
## 6 2000-06-30  0.0464  0.0987 -0.0841 -0.0826 -0.0291 0.004

Finally, we merge the two sources of data into one single variable.

data <- left_join(returns, FF_factors, by = "Date") 
# left_join() is an incredible function that merges two sets through one common column
head(data) # See the outcome!
## # A tibble: 6 × 13
##   Date          AAPL      BA       C     PFE      WMT      XOM  MKT_RF     SMB
##   <date>       <dbl>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>   <dbl>   <dbl>
## 1 2000-02-29  0.105  -0.167  -0.0896 -0.110  -0.110   -0.0862   0.0245  0.184 
## 2 2000-03-31  0.185   0.0237  0.157   0.138   0.160    0.0357   0.052  -0.154 
## 3 2000-04-30 -0.0865  0.0496 -0.0121  0.152  -0.0199  -0.00401 -0.064  -0.0496
## 4 2000-05-31 -0.323  -0.0121  0.0540  0.0584  0.0406   0.0782  -0.0442 -0.0387
## 5 2000-06-30  0.247   0.0704 -0.0312  0.0787  0.00109 -0.0578   0.0464  0.0987
## 6 2000-07-31 -0.0298  0.167   0.170  -0.0964 -0.0412   0.0215  -0.0251 -0.0102
## # … with 4 more variables: HML <dbl>, RMW <dbl>, CMA <dbl>, RF <dbl>

With our full dataset at hand, we are ready to begin the analysis.

2 Fama-Macbeth regressions

2.1 Principle

The Fama-Macbeth regressions can be used to assess the premium associated to factors. They are performed in two steps.

  1. First, all assets are regressed against the chosen factors (which gives the factor loadings, i.e., the estimated coefficients in the regression).
  2. Then, for each date, the (cross-sectional) returns of the assets are regressed against the loadings obtained in the first step.
  • This yields time-series of premia, which can then be averaged into a proxy for the risk premium associated with each factor.

2.2 First pass

A natural way to proceed would be to resort to a for() loop to perform the regressions iteratively. We take another route and instead make us of the ‘apply’ family of functions. We use a procedure that creates all models via their character syntax in R. And finally, we gather the results in the ‘betas’ variable.

models <- lapply(paste(tickers, ' ~  MKT_RF + SMB + HML + RMW + CMA'),  # Specification of models
                 function(f){ lm(as.formula(f), data = data) %>%        # Calling lm(.)
                     summary() %>%                                      # Gathering the output
                     "$"(coef) %>%                                      # Keeping only the coefs
                     data.frame() %>%                                   # Convert to dataframe
                     select(Estimate)}                                  # Keep the coefficients
)

betas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) %>%   # Extracting the betas
  data.frame(row.names = tickers)                                       # Formatting: row names
colnames(betas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA")  # Formatting: col names
betas                                                                   # Showing the values
##           Constant    MKT_RF          SMB         HML        RMW         CMA
## AAPL  2.241459e-02 1.2603179  0.203876790 -0.56038315  0.2337760 -0.97665289
## BA    1.705785e-03 1.1354167  0.099093615  0.62774886  0.3038202  0.02417191
## C    -2.514768e-03 1.5166111 -0.533662630  2.12343223 -0.9939174 -1.17013600
## PFE  -9.473582e-04 0.8444254 -0.406146782 -0.15056910  0.3611860  0.59252631
## WMT   2.539574e-03 0.5297648 -0.396493766 -0.08939238  0.3333755  0.20786897
## XOM  -3.720123e-05 0.7409502 -0.006314125  0.34321241  0.2033925  0.33725592
rm(models)                                                              # Remove models to save space

The market beta is the second column. When there are no other factors, a beta above 1 (resp. below 1) means that the stock is more volatile than the market (resp. less volatile). The two stocks with high market betas also load heavily (and negatively on CMA). Citigroup also has a strong positive exposure to HML.

2.3 Second pass

We then turn to the second step of the procedure. This time, we regress asset returns against the betas, one date at a time. We start by formatting the data.

loadings <- betas %>%                         # Start from loadings
  select(-Constant) %>%                       # Keep only the loadings
  data.frame()                                # Convert to dataframe             
ret <- returns %>%                            # Start from returns
  select(-Date) %>%                           # Keep the returns only
  data.frame(row.names = returns$Date) %>%    # Set row names
  t()                                         # Transpose

FM_data <- cbind(loadings, ret)               # Aggregate both
FM_data[,1:8]                                 # Quick look at the first 8 columns
##         MKT_RF          SMB         HML        RMW         CMA  2000-02-29
## AAPL 1.2603179  0.203876790 -0.56038315  0.2337760 -0.97665289  0.10482098
## BA   1.1354167  0.099093615  0.62774886  0.3038202  0.02417191 -0.16709977
## C    1.5166111 -0.533662630  2.12343223 -0.9939174 -1.17013600 -0.08958239
## PFE  0.8444254 -0.406146782 -0.15056910  0.3611860  0.59252631 -0.11001050
## WMT  0.5297648 -0.396493766 -0.08939238  0.3333755  0.20786897 -0.10958886
## XOM  0.7409502 -0.006314125  0.34321241  0.2033925  0.33725592 -0.08615805
##      2000-03-31   2000-04-30
## AAPL 0.18484066 -0.086516014
## BA   0.02368869  0.049586144
## C    0.15700398 -0.012084238
## PFE  0.13813239  0.152136900
## WMT  0.16043235 -0.019911897
## XOM  0.03568452 -0.004006031

This will be the inputs of the second stage of regressions: for each stock, we have both the returns and the associated loading values.

models <- lapply(paste("`", returns$Date, "`", ' ~  MKT_RF + SMB + HML + RMW + CMA', sep = ""),
                 function(f){ lm(as.formula(f), data = FM_data) %>%     # Calling lm(.)
                     summary() %>%                                      # Gathering the output
                     "$"(coef) %>%                                      # Keeping only the coefs
                     data.frame() %>%                                   # Convert to dataframe
                     select(Estimate)}                                  # Keep the coefficients
)
gammas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) %>%  # Switch to dataframe
  data.frame(row.names = returns$Date)                                  # & set row names
colnames(gammas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA") # Set col names
gammas[1:20,]                                                           # Show first 20 lines
##               Constant        MKT_RF           SMB          HML         RMW
## 2000-02-29  0.06195001 -0.0073364873  0.1581067403 -0.197437047 -0.37044657
## 2000-03-31  0.08084707  0.0148681466 -0.2129338840 -0.078731564 -0.00459732
## 2000-04-30 -0.30384152  0.3201790485 -0.1598278960 -0.016377913  0.03869766
## 2000-05-31  0.09976535 -0.1274611386 -0.0435199251  0.104245250 -0.10700397
## 2000-06-30 -0.27314377  0.3015530083 -0.1325347738 -0.077149801  0.21248284
## 2000-07-31 -0.08904186 -0.0114286039  0.0526508050  0.248621591  0.36545601
## 2000-08-31 -0.17292260  0.3260336572  0.2376255788 -0.090300684 -0.21137606
## 2000-09-30 -0.20529126 -0.0665755163 -0.1159311047  0.469183573  0.64677663
## 2000-10-31 -0.12210987 -0.0028578157 -0.0005302572  0.178998559  0.20686626
## 2000-11-30  0.04074763 -0.2255984526 -0.2889947221  0.192706184  0.48441357
## 2000-12-31 -0.01898855  0.0003900559 -0.1098393697 -0.002510792 -0.04779740
## 2001-01-31  0.25837803 -0.0658709363  0.0462664928 -0.295905078 -0.36659592
## 2001-02-28 -0.31176463  0.2230169625  0.0181719396  0.096327439  0.19657947
## 2001-03-31  0.29737619 -0.2009474362  0.2080627271 -0.169256410 -0.28175765
## 2001-04-30  0.06112848  0.0761933308  0.1346452223 -0.036126678 -0.09761786
## 2001-05-31 -0.04475331 -0.0549063553 -0.0827848288  0.147036924  0.13958393
## 2001-06-30  0.17289993 -0.0361796958  0.1921473429 -0.203174165 -0.45890514
## 2001-07-31 -0.09506371 -0.1665547145 -0.3967360445  0.267139935  0.70087410
## 2001-08-31  0.01877263  0.0630448562  0.2253066808 -0.164458737 -0.44628784
## 2001-09-30  0.33914234 -0.2159149549 -0.0048804592 -0.316727845 -0.80898239
##                     CMA
## 2000-02-29  0.004255256
## 2000-03-31 -0.087668992
## 2000-04-30  0.175949065
## 2000-05-31  0.173798303
## 2000-06-30 -0.076003048
## 2000-07-31 -0.119557651
## 2000-08-31  0.090467520
## 2000-09-30  0.156536012
## 2000-10-31  0.064019811
## 2000-11-30 -0.144057664
## 2000-12-31  0.048969686
## 2001-01-31 -0.193386585
## 2001-02-28  0.123951337
## 2001-03-31 -0.096038572
## 2001-04-30  0.027716795
## 2001-05-31  0.037622030
## 2001-06-30  0.007820043
## 2001-07-31 -0.184186844
## 2001-08-31  0.148226811
## 2001-09-30  0.223495340
rm(models)                                                              # Remove models, save space

Note that the estimates are not going to be very reliable because the number of observations (number of stocks, i.e., 6) is barely larger than the number of variables (number of factors, 5). Accuracy would require a much larger number of assets (hundreds, or, preferably, thousands).

gammas %>% colMeans() %>% round(4)                                       # Means (over all periods)
## Constant   MKT_RF      SMB      HML      RMW      CMA 
##   0.0069   0.0052   0.0084  -0.0061   0.0047  -0.0075
apply(gammas, 2, t.test) %>% unlist() %>%                                # t-stats
  matrix(ncol = ncol(gammas), dimnames = list(c(), colnames(betas))) %>%
  data.frame() %>% slice(1) %>% as.numeric() %>% round(4)    # Take only the first line of the test
## [1]  0.6954  0.5968  0.8468 -0.8422  0.2671 -1.1443

Surprisingly, some averages are quite negative. But the t-stats are hardly significant.

2.4 Graphical interpretation

Let’s nonetheless visually assess the stability of premia across time.

gammas %>%                                           # Start from gamma (estimates)
  select(-Constant) %>%                              # Take out the constant
  rownames_to_column("Date") %>%                     # Convert row names to the first column => dates
  mutate(Date = as.Date(Date)) %>%                   # Convert first date column to true date format
  pivot_longer(-Date, names_to = "Factor",           # Put in column/tidy format for ggplot
               values_to = "Value") %>%     
  ggplot(aes(x = Date, y = Value, color = Factor)) + # Plot
  geom_line() + facet_grid(Factor~.) + theme_light() + 
    scale_color_viridis_d()

CMA and HML appears more stable, while RMW and SMB are more volatile.
Finally, what are the annual average of these series? We build a simple pivot-table below.

gammas %>%                                        # To create a pivot table, start with the gammas
  select(-Constant) %>%                           # Take out the constant
  rownames_to_column("Date") %>%                  # Convert row names to the first column
  mutate(Date = year(as.Date(Date))) %>%          # Convert this new first column to year
  pivot_longer(-Date, names_to = "Factor",        # Put in column/tidy format for ggplot
               values_to = "Value") %>% 
  group_by(Date, Factor) %>%                      # Prepare pivot-table
  summarise(avg_premium = mean(Value)) %>%        # Pivot table!
  ggplot(aes(x = Date, y = avg_premium, color = Factor)) + # Plot
  geom_line() + geom_point() + theme_light() +
  scale_color_brewer(palette = "Set2")
## `summarise()` has grouped output by 'Date'. You can override using the
## `.groups` argument.

The final graph shows the average of premias computed over each year in the sample. It unveils some cycles for most factors. The Size factor is known to be very sensitive to macro-economic conditions: indeed smaller firms thrive in good economic times, but are hit hardly during economic downturns. Both in 2008 and 2009, the SMB returns were negative.

3 Factor competition

Below, we regress each factor against all its ‘competitors’. If the alpha (coefficient of the constant/Integer vector) in the regression is not significantly different from zero, it means that the factor is somewhat redundant because it can be ‘explained’ by exposures to other factors. If not, the factor cannot be explained by the other factors and is thus adding value to the analysis.

We can thus work simply with the Fama-French factors with no external data.

factors <- c("MKT_RF", "SMB", "HML", "RMW", "CMA")
models <- lapply(paste(factors, ' ~  MKT_RF + SMB + HML + RMW + CMA-',factors),
                 function(f){ lm(as.formula(f), data = FF_factors) %>% # Calling lm(.)
                     summary() %>%                                     # Gathering the output
                     "$"(coef) %>%                                     # Keeping only the coefs
                     data.frame() %>%                                  # Convert to dataframe
                     filter(rownames(.) == "(Intercept)") %>%          # Keep only the Intercept
                     select(Estimate,`Pr...t..`)}                      # Keep the coef & p-value
)
alphas <- matrix(unlist(models), ncol = 2, byrow = T) %>%              # Switch from list to dataframe
  data.frame(row.names = factors)                                      # Set row names
colnames(alphas) <- c("coef", "p-value")                               # Set col names
alphas                                                                 # Show
##                coef      p-value
## MKT_RF  0.009359218 2.334365e-04
## SMB     0.003976494 1.978716e-02
## HML    -0.004216841 9.134865e-03
## RMW     0.005920758 3.013396e-05
## CMA     0.003243847 1.685547e-03
rm(models)                                                             # Remove models

The p-values are rather small. The results are somewhat in line with the conclusions of Fama & French (2015) - Section 7 in which the authors conclude that HML is redundant (not always true!).

4 Predictive panels

In R, panel models are handled via the great plm package, so we need it! Also: let’s load a small dataset and compute forward returns.

if(!require(plm)){install.packages("plm")}             # Package for panels
library(plm)
load("data.RData")                                     # Load data
data <- data %>%  
  group_by(Tick) %>%
  mutate(F_return = dplyr::lead(Close)/Close - 1) %>%  # Add return column
  ungroup()
data
## # A tibble: 7,650 × 10
##    Tick  Date       Close Vol_1M Mkt_Cap   P2B   D2E Prof_Marg ESG_rank F_return
##    <chr> <date>     <dbl>  <dbl>   <dbl> <dbl> <dbl>     <dbl>    <dbl>    <dbl>
##  1 AAPL  1999-12-31 0.791   53.6  16532.  3.35  9.66      8.31       NA  0.00885
##  2 AAPL  2000-01-31 0.798   80.8  16683.  3.38  6.80      7.81       NA  0.105  
##  3 AAPL  2000-02-29 0.882   60.0  18480.  3.73  6.80      7.81       NA  0.184  
##  4 AAPL  2000-03-31 1.04    75.7  21896.  4.68  6.80      7.81       NA -0.0862 
##  5 AAPL  2000-04-28 0.954   81.6  20002.  4.28  7.12     12.0        NA -0.323  
##  6 AAPL  2000-05-31 0.646   61.3  13670.  2.90  7.12     12.0        NA  0.248  
##  7 AAPL  2000-06-30 0.806   66.6  17047.  3.64  7.12     12.0        NA -0.0298 
##  8 AAPL  2000-07-31 0.782   76.7  16539.  3.53  7.18     11.0        NA  0.198  
##  9 AAPL  2000-08-31 0.937   56.7  19803.  4.24  7.18     11.0        NA -0.577  
## 10 AAPL  2000-09-29 0.396  231.    8368.  1.85  7.18     11.0        NA -0.240  
## # … with 7,640 more rows

In equations below: - \(t\) is time;
- \(n\) is stock;
- \(k\) is characteristic.

The predictive relationship is

\[r_{t+1,n} = a + \sum_{k=1}^Kb^kx^k_{t,n} + u_{t+1,n}\]

If we estimate this like a stacked regression, the esimtator is called “pooled”.

if(!require(broom)){install.packages("broom")}             # Neat package for regression output
## Loading required package: broom
library(broom)
est_pooled <- plm(F_return ~ Vol_1M + Mkt_Cap + P2B + D2E, # Model
                  data = data,                             # Data source
                  model = "pooling")                       # Estimation method
summary(est_pooled)                                        # Summary of output
## Pooling Model
## 
## Call:
## plm(formula = F_return ~ Vol_1M + Mkt_Cap + P2B + D2E, data = data, 
##     model = "pooling")
## 
## Balanced Panel: n = 30, T = 254, N = 7620
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -0.59128763 -0.04086125  0.00044276  0.04234380  1.26571898 
## 
## Coefficients:
##                Estimate  Std. Error t-value  Pr(>|t|)    
## (Intercept)  9.8501e-03  2.0040e-03  4.9152 9.052e-07 ***
## Vol_1M       2.8787e-05  4.9794e-05  0.5781   0.56320    
## Mkt_Cap     -1.1541e-08  6.1956e-09 -1.8627   0.06254 .  
## P2B          4.2764e-06  1.4138e-05  0.3025   0.76229    
## D2E         -1.6050e-06  2.3055e-06 -0.6962   0.48633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    50.639
## Residual Sum of Squares: 50.611
## R-Squared:      0.000561
## Adj. R-Squared: 3.6017e-05
## F-statistic: 1.06861 on 4 and 7615 DF, p-value: 0.37022
coef_pooled <- tidy(est_pooled)                            # Output to dataframe
rm(est_pooled)

If we allow the error term to be decomposed so that

\[r_{t+1,n} = a + \sum_{k=1}^Kb^kx^k_{t,n} + \mu_n+ e_{t+1,n},\]

then estimators are called within (fixed-effect estimation). This means that the model allows for stock-specific idiosyncrasies.

est_within <- plm(F_return ~ Vol_1M + Mkt_Cap + P2B + D2E, 
                  data = data, 
                  model = "within") 
summary(est_within)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = F_return ~ Vol_1M + Mkt_Cap + P2B + D2E, data = data, 
##     model = "within")
## 
## Balanced Panel: n = 30, T = 254, N = 7620
## 
## Residuals:
##       Min.    1st Qu.     Median    3rd Qu.       Max. 
## -0.6136509 -0.0406315  0.0013567  0.0420280  1.2655448 
## 
## Coefficients:
##            Estimate  Std. Error t-value Pr(>|t|)   
## Vol_1M  -1.3733e-05  5.2289e-05 -0.2626 0.792843   
## Mkt_Cap -2.4726e-08  8.1732e-09 -3.0253 0.002492 **
## P2B     -6.7260e-06  1.6375e-05 -0.4107 0.681274   
## D2E      5.1642e-07  2.8497e-06  0.1812 0.856201   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    50.461
## Residual Sum of Squares: 50.397
## R-Squared:      0.001266
## Adj. R-Squared: -0.0030786
## F-statistic: 2.40406 on 4 and 7586 DF, p-value: 0.047508
coef_within <- tidy(est_within)

Let’s plot the differences. There are some discrepancies!
We plot t-statistics to normalize scales.

bind_rows(
  bind_cols(coef_within, type = "within"),              # Create new type column      
  bind_cols(coef_pooled, type = "pooled")               # Create new type column
) %>%
  ggplot(aes(x = term, y = statistic, fill = type)) +   # Plot
  theme_light() +                                       # White background
  geom_col(position = "dodge", alpha = 0.8) +
  scale_fill_manual(values = c("#77ABD9", "#FFAF43"))   # Custom colors

The differences seem to matter! Except for Mkt_Cap which seem to be a driver in both types of estimations

If you want to extract the fixed effects, you can use the fixef() function.

fixef(est_within)
##      AAPL        BA       BAC         C      CSCO       CVS       CVX         D 
## 0.0375688 0.0157350 0.0144934 0.0067987 0.0095592 0.0105258 0.0127779 0.0105631 
##       DIS         F        GE        HD       IBM      INTC       JNJ       JPM 
## 0.0140952 0.0082480 0.0074514 0.0127411 0.0090035 0.0128892 0.0139873 0.0152703 
##         K       MCK       MRK      MSFT      ORCL       PFE        PG         T 
## 0.0069580 0.0127352 0.0094126 0.0216019 0.0118950 0.0100482 0.0116584 0.0085960 
##       UNH       UPS        VZ       WFC       WMT       XOM 
## 0.0213596 0.0094590 0.0095485 0.0122815 0.0116318 0.0141969

5 Exercises

  1. Perform Fama-MacBeth regressions using a few more stocks (say, ten: find some more on Yahoo) and see if that changes the results.
  2. Try factor competition on another dataset from Ken French’s library (e.g., add the momentum factor).