1 Introduction

This notebook presents the first empirical study of the paper Forking paths in empirical studies, available on SSRN.
The paper has gone through several rounds of review, which is why some of the material below may not appear in every version of the article. We provide everything for the sake of completeness.


The idea of the paper is to decompose any empirical analysis into a series of steps, which are pompously called mappings in the paper.
Each step matters and can have an important impact on the outcome of the study.
Because of that, we argue that the researcher should, if possible, keep track of all the possible modelling options and present the distribution of outcomes (e.g., t-statistics or p-values) across all configurations.

Below, most code chunks are hidden to ease readability.
It is easy to access all code content by clicking on the related buttons.

2 Data

First, we load the libraries and import the data.
The data, downloaded from Amit Goyal’s website, is stored on a GitHub repository.
It was used in the follow-up paper A Comprehensive Look at the Empirical Performance of Equity Premium Prediction II.
Because we do not want to download the data multiple times, we do it only once, for three sheets (monthly, quarterly & yearly data).

library(tidyverse)   # Data wrangling & plotting
library(readxl)      # To read excel files
library(zoo)         # For data imputation
library(DescTools)   # For winsorization
library(tictoc)      # CPU time estimation
library(sandwich)    # HAC estimator
library(lmtest)      # Statistical inference
library(furrr)       # Parallel computing
library(xtable)      # LaTeX exports
library(patchwork)   # Plot combination
library(reshape2)    # List management

loadWorkbook_url <- function(sheet) { # Function that downloads the data from online file
  url = "https://github.com/shokru/coqueret.github.io/blob/master/files/misc/PredictorData2021.xlsx?raw=true"
  temp_file <- tempfile(fileext = ".xlsx")
  download.file(url = url, destfile = temp_file, mode = "wb", quiet = TRUE)
  read_excel(temp_file, sheet = sheet)
}

data_month <- loadWorkbook_url(1)    # Dataframe for monthly data
data_quarter <- loadWorkbook_url(2)  # Dataframe for quarterly data
data_year <- loadWorkbook_url(3)     # Dataframe for annual data

3 Mappings

Next, we code the modules. They correspond to the \(f_j\) mappings in the paper.
The chaining of mappings follows the scheme below:

Because of the strong analogy with neural networks, we approach the coding of mappings/modules similarly to the early implementation of the sequential Keras framework.

3.1 Sheet selection

First module: decide which sheet to work with (i.e., the data frequency: monthly, quarterly or yearly).

prob_sheet <- c(0,1,0)
module_sheet <- function(prob_sheet){
  prob_month <- prob_sheet[1]                                           # Prob. to pick sheet 1 (monthly)
  prob_quarter <- prob_sheet[2]                                         # Prob. to pick sheet 2 (quarterly)
  prob_year <- prob_sheet[3]                                            # Prob. to pick sheet 3 (yearly)
  if((prob_month == 0) + (prob_quarter == 0) + (prob_year == 0) == 2){  # Deterministic case: one prob. equals 1
    if(prob_month == 1){return(data_month)}
    if(prob_quarter == 1){return(data_quarter)}
    if(prob_year == 1){return(data_year)}
  } else {                                                              # Random case (not used below)
    p <- runif(1)
    if(p <= prob_month){return(data_month)}
    else if(p <= prob_month + prob_quarter){return(data_quarter)}
    else {return(data_year)}
  }
}
module_sheet(prob_sheet)[250:260, 1:11] |>                               # A snapshot
  mutate(across(everything(), as.numeric)) |> 
  knitr::kable(font_size = 5)
yyyyq Index D12 E12 b/m tbl AAA BAA lty ntis Rfree
19332 10.91 0.4700 0.4250 0.8335032 0.0007 0.0446 0.0707 0.0306 0.0007609 0.003350
19333 9.83 0.4550 0.4325 0.8679966 0.0004 0.0436 0.0727 0.0308 -0.0004585 0.000175
19334 10.10 0.4400 0.4400 0.8290260 0.0029 0.0450 0.0775 0.0336 0.0061958 0.000100
19341 10.75 0.4425 0.4525 0.8025122 0.0024 0.0413 0.0626 0.0307 0.0096063 0.000725
19342 9.81 0.4450 0.4650 0.8407311 0.0015 0.0393 0.0606 0.0289 0.0059828 0.000600
19343 9.10 0.4475 0.4775 0.8703644 0.0021 0.0396 0.0657 0.0310 0.0194390 0.000375
19344 9.50 0.4500 0.4900 0.7737409 0.0023 0.0381 0.0623 0.0293 0.0208127 0.000525
19351 8.47 0.4500 0.7300 0.8007541 0.0015 0.0367 0.0620 0.0274 0.0120460 0.000575
19352 10.23 0.4400 0.8100 0.6818182 0.0015 0.0361 0.0577 0.0270 0.0193440 0.000375
19353 11.59 0.4400 0.7600 0.6117344 0.0020 0.0359 0.0553 0.0282 0.0080266 0.000375
19354 13.43 0.4700 0.7600 0.5599112 0.0015 0.0344 0.0530 0.0276 0.0086888 0.000500

After the first module, each mapping takes data as first argument. This will make it easy to use pipes (sequences of instructions).
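Schematically, a study is then just a pipe. Below, f_1 and f_2 are hypothetical placeholder modules used only to illustrate the pattern; the actual modules are coded in the next subsections.

f_1 <- function(data, option_1){data}                  # dummy module: data first, returns a dataframe
f_2 <- function(data, option_2){data}                  # dummy module: idem
data_month |> f_1(option_1 = 1) |> f_2(option_2 = 2)   # a full path is a simple pipe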

3.2 Handling missing data

Second module: data cleaning.
NOTE: all columns must be converted to numbers beforehand.

module_cleaning <- function(data, prob_remove){                      # Two inputs: data & choice
  data <- data |> mutate(across(everything(), as.numeric))           # Convert everything to numbers
  if(runif(1) <= prob_remove){                                       # If remove:
    return(data |> na.omit())                                        # REMOVE!
  } else
    return(data |> mutate(across(everything(), na.locf, na.rm = F))) # Impute from previous non-NA (not very useful here)
}

3.3 Winsorizing

Third module: winsorizing.
We compute some values of predictors in this module (from the raw series): payout, dfr and dfy.
The winsorization comes after. The only input is the level for the winsorization.

module_winsorization <- function(data, threshold){
  data <- data |> mutate(payout = log(D12/E12),
                         dfy = BAA - AAA,
                         dfr = corpr - ltr)
  data <- data |> mutate(across(2:ncol(data), Winsorize, probs = c(threshold, 1-threshold), na.rm = T))
  return(data)
}

3.4 Levels vs. differences

Fourth module: levels vs. differences.
This step decides if independent variables are left as levels, or if variations are used instead.

differentiate <- function(v){v - lag(v)}

module_levels <- function(data, prob_levels){
  if(runif(1) < prob_levels){return(data)}
  else return(data |> mutate(across(3:ncol(data), differentiate)) |> na.omit())
}

3.5 Choice of predictor

Fifth module: independent variable.
This stage sets the (unique) predictor. There are 6 in the study.

module_independent <- function(data, variable){
  data |> dplyr::select(all_of(c("Index", variable, "Rfree")))
}

3.6 Horizon for returns

Sixth module: horizon.
This module determines the horizon of the dependent variable (return) and removes missing data.
It is given in periods, so results are contingent on the sheet chosen in the first module (e.g., horizon = 12 corresponds to a one-year return with monthly data, but to a 12-year return with annual data).

module_horizon <- function(data, horizon){
  data[,1] <- lead(data[,1], horizon)/data[,1] - 1 - horizon * data[,3]  # Excess returns on the Index (S&P 500)
  data |> na.omit()
}

3.7 Subsampling: starting point

Seventh module: starting point.
In our study, we allow for simple subsampling.
The idea is that predictability should not be too time-dependent (though in practice it may be).
Thus, we allow for 3 sample options:
- the full sample
- a sample that begins in the middle of the available data (second half)
- a sample that stops at the middle of the data (first half)

This module picks the starting point (either the original first point, or the middle point).

module_start <- function(data, start_quantile){
  L <- nrow(data)
  data[ceiling(L*start_quantile):L, ]
}

3.8 Subsampling: stopping point

Eighth module: stopping point.

This module picks the stopping point (either the original last point, or the middle point).

module_stop <- function(data, stop_quantile){
  L <- nrow(data)
  data[1:ceiling(L*stop_quantile), ]
}

3.9 Estimator type

Ninth module: estimator type.
This last step takes the data and performs estimations.
Four options are possible: the standard iid (OLS) estimator, the HAC estimator of Newey-West, and their augmented counterparts (AUG and AUG-HAC), which add a correction term in the spirit of Amihud & Hurvich (2004); see Section 8.1 below.
The final function yields 10 outputs:
- the coefficient, the standard error, the t-statistic, the p-value, the skewness of errors, which are baseline outputs;
- the two criteria (AIC & BIC), for frequentist model averaging;
- the sample size, the sum of squared residuals and the variance of the dependent variable, for the Bayesian model averaging.

module_estimator <- function(data, estimator){
  y <- data |> pull(1)                                              # Extract the dependent variable
  x <- data |> pull(2)                                              # Extract the (single) predictor
  x <- x / sd(x)                                                    # Scale the predictor
  
  if(nrow(data) < 30){                                              # Test for sample size
    t_stat <- NA                                                    # NA if too small (not enough obs.)
    sd <- NA
    coef <- NA
    p_val <- NA
    aic <- NA
    bic <- NA
    sum_sq_err <- NA
    var_y <- NA
    skew <- NA
  } else {
    if(estimator == "HAC"){
      model <- lm(y ~ x)                                        # Running the regression
      coef <- summary(model)$coef[2,1]                          # OLS coefficient
      sd <- coeftest(model, vcov. = NeweyWest(model))[2,2]      # NW standard error
      t_stat <- coeftest(model, vcov. = NeweyWest(model))[2,3]  # OLS with Newey-West Cov matrix
      p_val <- coeftest(model, vcov. = NeweyWest(model))[2,4]   # p-value
      aic <- AIC(model)                                         # Akaike Info Criterion
      bic <- BIC(model)                                         # Bayesian Info Criterion
      sum_sq_err <- sum(model$residuals^2)                      # Sum of squared residuals
      var_y <- var(y)                                           # Variance of dependent variable
      skew <- mean((model$residuals/sd(model$residuals))^3)     # Skewness of residuals
    } 
    if(estimator == "OLS"){
      model <- lm(y ~ x)                                        # Running the regression
      coef <- summary(model)$coef[2,1]                          # OLS coef
      sd <- summary(model)$coef[2,2]                            # OLS standard error
      t_stat <- summary(model)$coef[2,3]                        # OLS t-statistic
      p_val <- summary(model)$coef[2,4]                         # p-value
      aic <- AIC(model)                                         # Akaike Info Criterion
      bic <- BIC(model)                                         # Bayesian Info Criterion
      sum_sq_err <- sum(model$residuals^2)                      # Sum of squared residuals
      var_y <- var(y)                                           # Variance of dependent variable
      skew <- mean((model$residuals/sd(model$residuals))^3)     # Skewness of residuals
    }
    if(estimator == "AUG"){
      model_x <- lm(x ~ lag(x))                                 # x-regression
      rho <- model_x$coef[2]
      nu <- lead(x) - rho * x                                   # x-innovations            
      model <- lm(y ~ x + nu)
      coef <- summary(model)$coef[2,1]                          # OLS coef
      sd <- summary(model)$coef[2,2]                            # OLS standard error
      t_stat <- summary(model)$coef[2,3]                        # OLS t-statistic
      p_val <- summary(model)$coef[2,4]                         # p-value
      aic <- AIC(model)                                         # Akaike Info Criterion
      bic <- BIC(model)                                         # Bayesian Info Criterion
      sum_sq_err <- sum(model$residuals^2)                      # Sum of squared residuals
      var_y <- var(y)                                           # Variance of dependent variable
      skew <- mean((model$residuals/sd(model$residuals))^3)     # Skewness of residuals
    }    
    if(estimator == "AUG-HAC"){
      model_x <- lm(x ~ lag(x))                                 # x-regression
      rho <- model_x$coef[2]
      nu <- lead(x) - rho * x                                   # x-innovations            
      model <- lm(y ~ x + nu)
      coef <- summary(model)$coef[2,1]                          # OLS coef
      sd <- coeftest(model, vcov. = NeweyWest(model))[2,2]      # NW standard error
      t_stat <- coeftest(model, vcov. = NeweyWest(model))[2,3]  # OLS with Newey-West Cov matrix
      p_val <- coeftest(model, vcov. = NeweyWest(model))[2,4]   # p-value
      aic <- AIC(model)                                         # Akaike Info Criterion
      bic <- BIC(model)                                         # Bayesian Info Criterion
      sum_sq_err <- sum(model$residuals^2)                      # Sum of squared residuals
      var_y <- var(y)                                           # Variance of dependent variable
      skew <- mean((model$residuals/sd(model$residuals))^3)     # Skewness of residuals
    }    
  }
  return(list(coef = coef, sd = sd, t_stat = t_stat, p_val = p_val, 
              sample_size = nrow(data), aic = aic, bic = bic,
              sum_sq_err = sum_sq_err, var_y = var_y, skew = skew))
}

4 Aggregation & spanning

4.1 Module chaining

We run a simple test below.
It illustrates the natural chaining of the mappings, akin to a neural network (without back-propagation!).
All modules are executed successively.

prob_sheet <- c(0,0,1)   # Yearly data
prob_remove <- 1         # Remove NAs
threshold <- 0.01        # Threshold for winsorization
prob_levels <- 0         # Switch to differences
variable <- "payout"     # Which independent variable 
horizon <- 1             # Return horizon
start_quantile <- 0.5    # Starting point
stop_quantile <- 1       # Stopping point
estimator <- "OLS"       # Which estimator

module_sheet(prob_sheet = prob_sheet) |>              # Step 1: select data frequency
  module_cleaning(prob_remove = prob_remove) |>     # Step 2: clean the data
  module_winsorization(threshold = threshold) |>    # Step 3: manage outliers
  module_levels(prob_levels = prob_levels) |>       # Step 4: decide between levels versus differences
  module_independent(variable = variable) |>        # Step 5: choose predictor
  module_horizon(horizon = horizon) |>              # Step 6: pick return horizon (dependent variable)
  module_start(start_quantile = start_quantile) |>  # Step 7: starting point (for subsampling)
  module_stop(stop_quantile = stop_quantile) |>     # Step 8: ending point (for subsampling)
  module_estimator(estimator = estimator)            # Step 9: select estimator type
## $coef
## [1] -0.01681519
## 
## $sd
## [1] 0.02298709
## 
## $t_stat
## [1] -0.7315057
## 
## $p_val
## [1] 0.4682627
## 
## $sample_size
## [1] 47
## 
## $aic
## [1] -37.36275
## 
## $bic
## [1] -31.81231
## 
## $sum_sq_err
## [1] 1.093801
## 
## $var_y
## [1] 0.02406103
## 
## $skew
## [1] -0.6425367

Below, we code a function that takes all parameters and that runs the whole chain of operators/mappings.
Because we consider a uniform distribution for the sheet (i.e., the data frequency), we recode the function slightly so that the corresponding input is a simple string.

module_aggregator <- function(sheet = sheet,
                              prob_remove = prob_remove,
                              threshold = threshold,
                              prob_levels = prob_levels,
                              variable = variable,
                              horizon = horizon,
                              start_quantile = start_quantile,
                              stop_quantile = stop_quantile,
                              estimator = estimator){
  if(sheet == "Monthly"){prob_sheet <- c(1,0,0)}
  if(sheet == "Quarterly"){prob_sheet <- c(0,1,0)}
  if(sheet == "Yearly"){prob_sheet <- c(0,0,1)}
  module_sheet(prob_sheet = prob_sheet) |>
    module_cleaning(prob_remove = prob_remove) |>
    module_winsorization(threshold = threshold) |>
    module_levels(prob_levels = prob_levels) |>
    module_independent(variable = variable) |>
    module_horizon(horizon = horizon) |>
    module_start(start_quantile = start_quantile) |>
    module_stop(stop_quantile = stop_quantile) |>
    module_estimator(estimator = estimator)
}

4.2 Spanning all design options

Next, we span all the combinations we want to test. Implicitly, they are given the same (uniform) probability.

sheet <- c("Monthly", "Quarterly", "Yearly")
prob_remove <- c(0, 1)
threshold <- c(0, 0.01, 0.02, 0.03)
prob_levels <- c(0, 1)
variable <- c("payout", "b/m", "svar", "dfr", "dfy", "ntis")
horizon <- c(1, 3, 6, 12)
start_quantile <- c(0, 0.5)
stop_quantile <- c(0.5, 1)
estimator <- c("OLS", "HAC", "AUG", "AUG-HAC")

pars <- expand_grid(sheet, prob_remove, threshold, prob_levels, variable, horizon, start_quantile, stop_quantile, estimator)
pars <- pars |> filter(!(start_quantile == 0.5 & stop_quantile == 0.5))

# sheet <- pars[,1]
# prob_remove <- pars[,2]
# threshold <- pars[,3]
# prob_levels <- pars[,4]
# variable <- pars[,5]
# horizon <- pars[,6]
# start_quantile <- pars[,7]
# stop_quantile <- pars[,8]
# estimator <- pars[,9]
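
A quick sanity check on the size of the design grid (this is the number of combinations mentioned below):

nrow(pars)   # 13824 paths: 18432 raw combinations, minus those where both subsample bounds equal 0.5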

We then launch the computation of the outputs across all 13824 combinations.
Note that with the functional programming abilities of R, this requires remarkably little code.
To speed things up, we resort to parallelization over 4 cores.

plan(multisession, workers = 4)                     # Set the number of cores
tictoc::tic()
output <- future_pmap_dfr(pars,  module_aggregator) # Launch!
tictoc::toc()
results <- bind_cols(pars, output)                  # Bind parameters to results
save(results, file = "results.RData")

5 Post treatment

Here, we adjust the distribution of t-statistics to account for the possible errors-in-variables (EIV) bias, as in Proposition 2 of Skill, Scale, and Value Creation in the Mutual Fund Industry.

load("results.RData")
bias_terms <- function(x, m, s, t){ # We follow the notations of Proposition 2
  # The first 5 values are the sample moments of the bivariate series
  m_m <- mean(m, na.rm = T)
  s_m <- sd(m, na.rm = T)
  m_s <- mean(s, na.rm = T)
  s_s <- sd(s, na.rm = T)
  rho <- cor(m,s, use = "complete.obs")
  # Next, the optimal h, along with bar(m1) and bar(m2)
  h <- (3/4/s_m^5*(rho^4*s_s^4/12-(rho*s_s)^2+1)*exp(m_s+s_s^2*(1-rho^2/2)/2)*(length(m)/mean(t)))^(-1/3)
  m1 <- (x-m_m)/s_m
  m2 <- (x-m_m-rho*s_m*s_s)/s_m
  bs1 <- h^2/2/s_m^2*(m1^2-1)/s_m*dnorm(m1)
  bs2 <- (1/2/mean(t)*exp(m_s+s_s^2/2)/s_m^2*(m2^2-1))/s_m*dnorm(m2)
  return(bs1 + bs2)
}

adjust_bias <- function(xx, m, s, t){
  integrate(bias_terms, 
            -Inf, xx, 
            m = m, s = s, t = t)$value
}

adjust_cdf_v <- Vectorize(adjust_bias, vectorize.args='xx')

# Fine discretization
grid_01 <- seq(min(results$t_stat, na.rm = T)-0.1, max(results$t_stat, na.rm = T)+0.1, length.out = 2*nrow(results))
# Next step is long because of integration
Phi_adj <- adjust_cdf_v(grid_01,
                        m = results$t_stat,
                        s = results$sd,
                        t = results$sample_size) + ecdf(results$t_stat)(grid_01)
Phi_adj <- Phi_adj[Phi_adj < 1] |> sort()
# Create inverse function by swapping x & y
quant <- stepfun(Phi_adj, grid_01[1:(length(Phi_adj)+1)])

res_adj <- results
res_adj$t_stat <- quant(ecdf(results$t_stat)(results$t_stat))
res_adj <- res_adj |>  mutate(adjusted = "Yes")
res_adj$coef <- NA
res_adj$sd <- NA
res_adj$p_val <- 2*(1-pt(abs(res_adj$t_stat), results$sample_size))

results <- results |> mutate(adjusted = "No")
results <- bind_rows(results, res_adj)

results     # Have a look
## # A tibble: 27,648 × 20
##    sheet   prob_remove threshold prob_levels variable horizon start_quantile
##    <chr>         <dbl>     <dbl>       <dbl> <chr>      <dbl>          <dbl>
##  1 Monthly           0         0           0 payout         1            0  
##  2 Monthly           0         0           0 payout         1            0  
##  3 Monthly           0         0           0 payout         1            0  
##  4 Monthly           0         0           0 payout         1            0  
##  5 Monthly           0         0           0 payout         1            0  
##  6 Monthly           0         0           0 payout         1            0  
##  7 Monthly           0         0           0 payout         1            0  
##  8 Monthly           0         0           0 payout         1            0  
##  9 Monthly           0         0           0 payout         1            0.5
## 10 Monthly           0         0           0 payout         1            0.5
## # ℹ 27,638 more rows
## # ℹ 13 more variables: stop_quantile <dbl>, estimator <chr>, coef <dbl>,
## #   sd <dbl>, t_stat <dbl>, p_val <dbl>, sample_size <int>, aic <dbl>,
## #   bic <dbl>, sum_sq_err <dbl>, var_y <dbl>, skew <dbl>, adjusted <chr>
rm(res_adj) # Remove temporary object

5.1 Impact of mappings

6 Hacking intervals & Lipschitz confirmation

The plot below shows the intervals when fixing exactly one mapping, while leaving all other mappings free.
This generates many combinations, for each of the 31 possible fixed choices.

module_names <- c("sheet", "prob_remove", "threshold", "prob_levels", "variable", 
                  "horizon", "start_quantile", "stop_quantile", "estimator", "adjusted")

results |>
  mutate(across(all_of(module_names), as.character)) |>
  pivot_longer(cols = all_of(module_names), names_to = "module", values_to = "value") |>
  group_by(module, value) |>
  summarise(min = min(t_stat, na.rm = T),
            max = max(t_stat, na.rm = T)) |>
  mutate(variable = paste(module, value)) |>
  ggplot(aes(y = variable)) + geom_vline(xintercept = 0, linetype = 2) +
  geom_errorbarh(aes(xmin = min, xmax = max, color = module)) +
  theme_bw() + ylab("") + xlab("hacking interval (t-statistics)") +
  theme(legend.title = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold")) +
  guides(color = guide_legend(reverse = TRUE)) +
  scale_color_manual(values = c("#3D6499", "#FCA906", "#FC062F", "#9457CF", "#777777", "#000000",  "#84D0B1", "#568751", "#7B6B4A", "#7DD5FE"))

ggsave("intervals.pdf", width = 8, height = 5)    

A new version, below, accounts for differences across predictors.

module_names <- c("sheet", "prob_remove", "threshold", "prob_levels", 
                  "horizon", "start_quantile", "stop_quantile", "estimator", "adjusted")

exp_fun <- function(variable_x){
  module_names <- c("sheet", "prob_remove", "threshold", "prob_levels", 
                    "horizon", "start_quantile", "stop_quantile", "estimator", "adjusted")
  size_full <- vector(mode = "list", length = length(module_names) - 1)
  for(j in 1:(length(module_names) - 1)){
    #print(j)
    combz <- combn(module_names, j) # All possible combinations
    for(k in 1:ncol(combz)){
      temp <- results |>
        filter(variable == variable_x) |>
        group_by_at(combz[,k]) |>
        summarise(min = min(t_stat, na.rm = T),
                  max = max(t_stat, na.rm = T)) |>
        mutate(size = max - min)
      size_full[[j]] <- c(size_full[[j]], temp |> pull(size))
    }
  }
  reshape2::melt(size_full) |> filter(is.finite(value)) |> 
    mutate(L1 = 9 - L1, variable = variable_x)
}

data_combi <- map_dfr(unique(results$variable), exp_fun)

expanding_model <- lm(log(mean) ~ L1, 
                      data = data_combi |> 
                        group_by(L1, variable) |> summarise(mean = mean(value, na.rm = T)))
summary(expanding_model)
## 
## Call:
## lm(formula = log(mean) ~ L1, data = summarise(group_by(data_combi, 
##     L1, variable), mean = mean(value, na.rm = T)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74399 -0.25592 -0.03785  0.24495  0.74424 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.31935    0.12229  -2.611   0.0121 *  
## L1           0.33867    0.02422  13.985   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3844 on 46 degrees of freedom
## Multiple R-squared:  0.8096, Adjusted R-squared:  0.8055 
## F-statistic: 195.6 on 1 and 46 DF,  p-value: < 2.2e-16
# save(data_combi, file = "data_combi.RData")

Finally, the plot that summarizes the findings.
Note that the figures at the top show the numbers of combinations of mappings over which the boxplots are drawn.
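For reference, the number of interval ranges underlying each boxplot can be retrieved directly from data_combi (a quick check, not reported in the paper):

data_combi |> count(L1)   # number of hacking-interval ranges per number of free mappings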

data_combi |>
  mutate(size = abs(L1-3.5)) |>
  mutate(predictor = variable) |>
  ggplot(aes(x = as.factor(L1), y = value, color = predictor)) + 
  geom_function(fun = function(x) exp(expanding_model$coefficients[1]) * exp(expanding_model$coefficients[2])^x, 
                linetype = 2, size = 0.8) +
  geom_boxplot(outlier.size = 1) +
  theme_bw() +
  xlab("Degrees of freedom (number of free mappings)") + 
  ylab("ARI = average range of hacking interval") +
  scale_color_viridis_d(direction = -1) +
  theme(axis.title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        panel.grid.major.x = element_blank()
        #legend.position = "none"
  )  +
  #stat_summary(fun = mean, geom = "point", shape = 20, size = 6, fill = "black")+
  annotate("rect", xmin = 0, xmax = 2.2, ymin = 12, ymax = 15, fill = "white", alpha = 0.5) + 
  annotate("text", x = 1.4, y = 14, label = "many small intervals", fontface = "bold") +
  annotate("text", x = 7.5, y = 1, label = "few large intervals", fontface = "bold")

ggsave("combinations.pdf", width = 8, height = 4)

Next, some analysis.
Notably, we estimate the coefficients of \(y=a\cdot b^n\) (via the log-linear regression above), where \(y\) is the average range of hacking intervals and \(n\) is the number of “free” mappings.
We also examine the growth rate between successive numbers of free mappings.
We see that even after 5-7 free mappings, each additional free mapping extends the range of \(t\)-statistics by roughly 30%.
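
The implied multiplicative factor can be read off the fitted model above:

exp(coef(expanding_model))   # a = exp(-0.319) ~ 0.73 and b = exp(0.339) ~ 1.40
# i.e., each additional free mapping multiplies the average range by roughly 1.4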

data_combi |> 
  group_by(L1, variable) |> 
  summarise(ARI = mean(value, na.rm = T)) |> 
  group_by(variable) |>
  mutate(rate = ARI / lag(ARI) - 1) |>
  rename("J-k" = L1) |>
  pivot_longer(cols = c("ARI", "rate"), names_to = "indicator", values_to = "value") |>
  mutate(value = round(value, 3)) |>
  pivot_wider(names_from = "J-k", values_from = c("value")) |>
  arrange(indicator) |>
  data.frame() |>
  xtable(digits = 3) |>
  print(include.rownames = F)
## % latex table generated in R 4.2.1 by xtable 1.8-4 package
## % Wed Jun 14 09:38:58 2023
## \begin{table}[ht]
## \centering
## \begin{tabular}{llrrrrrrrr}
##   \hline
## variable & indicator & X1 & X2 & X3 & X4 & X5 & X6 & X7 & X8 \\ 
##   \hline
## b/m & ARI & 1.237 & 2.613 & 4.151 & 5.927 & 8.052 & 10.669 & 13.949 & 18.054 \\ 
##   dfr & ARI & 0.484 & 1.030 & 1.645 & 2.352 & 3.173 & 4.121 & 5.182 & 6.324 \\ 
##   dfy & ARI & 0.577 & 1.222 & 1.946 & 2.779 & 3.756 & 4.893 & 6.141 & 7.347 \\ 
##   ntis & ARI & 0.983 & 2.047 & 3.192 & 4.454 & 5.899 & 7.614 & 9.701 & 12.251 \\ 
##   payout & ARI & 0.660 & 1.375 & 2.148 & 2.994 & 3.933 & 4.965 & 6.049 & 7.086 \\ 
##   svar & ARI & 0.628 & 1.319 & 2.072 & 2.906 & 3.836 & 4.861 & 5.943 & 6.980 \\ 
##   b/m & rate &  & 1.112 & 0.589 & 0.428 & 0.358 & 0.325 & 0.307 & 0.294 \\ 
##   dfr & rate &  & 1.126 & 0.597 & 0.429 & 0.349 & 0.299 & 0.258 & 0.220 \\ 
##   dfy & rate &  & 1.119 & 0.592 & 0.428 & 0.352 & 0.303 & 0.255 & 0.196 \\ 
##   ntis & rate &  & 1.083 & 0.559 & 0.395 & 0.324 & 0.291 & 0.274 & 0.263 \\ 
##   payout & rate &  & 1.083 & 0.562 & 0.394 & 0.314 & 0.262 & 0.218 & 0.172 \\ 
##   svar & rate &  & 1.099 & 0.571 & 0.402 & 0.320 & 0.267 & 0.223 & 0.174 \\ 
##    \hline
## \end{tabular}
## \end{table}

7 Model averaging

When several estimates have been obtained to quantify one effect, it is natural to seek to aggregate them.

First, frequentist averaging. There are many ways to proceed, but we choose a very common form: \[\hat{b}_* =\sum_{j=1}^Jw_j\hat{b}_j\] with \[w_j= \frac{e^{-\Delta_j/2}}{\sum_{k=1}^Je^{-\Delta_k/2}}, \quad \Delta_j = AIC_j-\underset{k}{\min}\,AIC_k\]

For the estimation of the variance of the aggregate estimator, we follow Equation (1) in Burnham & Anderson (2004): \[\hat{\sigma}^2_*=\left(\sum_{j=1}^J w_j\sqrt{\hat{\sigma}_j^2 + (\hat{b}_*-\hat{b}_j)^2} \right)^2.\]

Below, the confidence intervals are computed as \([\hat{b}_*-1.96\,\hat{\sigma}_*/\sqrt{P},\hat{b}_*+1.96\,\hat{\sigma}_*/\sqrt{P}]\), where \(P\) is the number of paths.
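
For concreteness, here is a minimal numerical sketch of these formulas; the AIC values, effects and standard errors below are hypothetical and serve only as an illustration.

aic_j  <- c(-40.2, -37.4, -35.1)                       # hypothetical AIC values for three paths
b_j    <- c(-0.021, -0.017, -0.012)                    # hypothetical estimated effects
se_j   <- c(0.010, 0.011, 0.012)                       # hypothetical standard errors
delta  <- aic_j - min(aic_j)                           # Delta_j
w      <- exp(-delta / 2) / sum(exp(-delta / 2))       # model weights
b_star <- sum(w * b_j)                                 # aggregated (frequentist) estimate
s_star <- sum(w * sqrt(se_j^2 + (b_star - b_j)^2))     # Burnham & Anderson (2004) standard error
b_star + c(-1, 1) * 1.96 * s_star / sqrt(length(b_j))  # confidence interval with P = 3 paths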

Next, Bayesian averaging. The quantity of interest is \(b\), with posterior probability given the data \(D\) equal to \[P[b | D]= \sum_{m=1}^MP[b|M_m,D]P[M_m|D],\] where \(\mathbb{M}=\{M_m,m=1,\dots,M\}\) is the set of models under consideration. One model corresponds to one complete path. Notably, the above equation translates to the following conditional mean and variance: \[\begin{align} \mathbb{E}[b|D]&=\sum_{m=1}^M\hat{b}_mP[M_m|D], \\ \mathbb{V}[b|D]&=\sum_{m=1}^M \left(\mathbb{V}(b|M_m,D) + \hat{b}^2_m \right)P[M_m|D] - (\mathbb{E}[b|D])^2, \end{align}\] where \(\hat{b}_m\) is the estimated effect in model \(m\).

The posterior model probabilities are given by \[P[M_m|D] = \left(\sum_{j=1}^M \frac{P[M_j]}{P[M_m]}\frac{l_D(M_j)}{l_D(M_m)} \right)^{-1},\] with \(l_D(M_j)\) being the marginal likelihood of model \(j\). Because we are agnostic with respect to which model is more realistic, we set the prior odds \(\frac{P[M_j]}{P[M_m]}\) equal to one. In this case, the posterior probabilities are simply proportional to the marginal likelihoods.

The remaining Bayes factor is by far the most complex term: \[\frac{l_D(M_j)}{l_D(M_m)}=\left(\frac{n_j}{n_j+1}\right)^{\frac{k_j}{2}}\left(\frac{n_m+1}{n_m}\right)^{\frac{k_m}{2}}\frac{\left(\frac{s_m+n_mv_m}{n_m+1} \right)^{(n_m-1)/2}}{\left(\frac{s_j+n_jv_j}{n_j+1} \right)^{(n_j-1)/2}},\] where \(n_j\) is the number of observations used in model \(j\) and \(k_j\) is the number of predictors in this model, omitting the constant (\(k_j=1\) in our case). Moreover, \(n_jv_j\) is the sample variance of the dependent variable in model \(j\). Finally, \(s_j\) is the sum of squared residuals under model \(j\).
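
A minimal sketch of the resulting weights is given below, with \(n\) taken as the sample size (the code further below additionally dampens the exponents with square roots to avoid numerical overflow).

# Unnormalised marginal likelihood of one path (sketch): n = number of observations,
# k = number of predictors (1 here), s = sum of squared residuals,
# nv = n*v = sample variance of the dependent variable
marg_lik <- function(n, s, nv, k = 1){
  (n / (n + 1))^(k / 2) * ((s + nv) / (n + 1))^(-(n - 1) / 2)
}
# With equal prior odds, P[M_m|D] is proportional to marg_lik for model m, so that
# E[b|D] = sum_m b_m * marg_lik_m / sum_m marg_lik_m.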

Finally, the results.

m_avg <- results |>
  filter(is.finite(coef), is.finite(aic)) |>
  group_by(variable, prob_levels, sheet) |>
  mutate(D = (aic - min(aic)),
         weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
         b_freq = sum(weight_freq * coef, na.rm = T),
         b_ew = mean(coef),
         n = n(),
         # Below we use sqrt(sample_size) in the exponent (and denominator) to avoid numerical overflow
         lik = sqrt(sample_size/(sample_size+1))/((sum_sq_err+var_y)/(sqrt(sample_size)+1))^(sqrt(sample_size-1)),
         weight_bayes = lik/sum(lik, na.rm = T),
         b_bayes = sum(coef*weight_bayes, na.rm = T)) |>
  summarise(s_ew = mean(sqrt(sd^2+(b_ew-coef)^2), na.rm = T),
            s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
            s_Bayesian = sum(weight_bayes * sqrt(sd^2+(b_bayes-coef)^2), na.rm = T),
            b_ew = mean(b_ew),
            b_Frequentist = mean(b_freq),
            b_Bayesian = mean(b_bayes),
            n = mean(n)) |>
  pivot_longer(s_ew:b_Bayesian, names_to = "indicator", values_to = "indicator_values") |>
  separate(indicator, into = c("indicator", "avg_type"), sep = "_") |>
  pivot_wider(names_from = "indicator", values_from = "indicator_values")

g_levels <- m_avg |>
  filter(prob_levels == 1, avg_type != "ew") |>
  ggplot(aes(y = reorder(avg_type, b), x = b, color = sheet)) + 
  geom_point(position = position_dodge(0.9)) + geom_vline(xintercept = 0, linetype = 2) +
  facet_grid(rows = vars(variable), scales = "free") + 
  geom_errorbar(aes(xmin = b - 2.58 * s / sqrt(n), xmax = b + 2.58 * s / sqrt(n)), position = "dodge", width = 0.9) + 
  theme_minimal() + #xlab("coefficient (confidence interval)") +
  theme(legend.position = c(0.8,0.4),
        text = element_text(size = 12),
        title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        strip.background = element_blank(),
        strip.text = element_blank(),
        legend.box.background = element_rect(color = "grey", size = 1),
        panel.grid = element_blank(),
        legend.background = element_rect(fill = "white", color = "white"),
        axis.title = element_blank()) +
  labs(color = "Frequency",
       subtitle = "Level predictors")

g_diff <- m_avg |>
  filter(prob_levels == 0, avg_type != "ew") |>
  ggplot(aes(y = reorder(avg_type, b), x = b, color = sheet)) + 
  geom_point(position = position_dodge(0.9)) + geom_vline(xintercept = 0, linetype = 2) +
  facet_grid(rows = vars(variable), scales = "free") + 
  geom_errorbar(aes(xmin = b - 1.96 * s / sqrt(n), xmax = b + 1.96 * s / sqrt(n)), position = "dodge", width = 0.9) + 
  theme_minimal() + #xlab("coefficient (confidence interval)") +
  theme(legend.position = "none",
        text = element_text(size = 12),
        title = element_text(face = "bold"),
        legend.title = element_text(face = "bold", color = "grey"),
        legend.box.background = element_rect(color = "grey", size = 1),
        strip.background = element_rect(fill = "#EEEEEE", color = "white"),
        strip.text = element_text(face = "bold"),
        panel.grid = element_blank(),
        axis.text.y = element_blank(),
        legend.background = element_rect(fill = "white", color = "white"),
        panel.background = element_rect(fill = "#F6F6F6", color = "white"),
        axis.title = element_blank()) +
  labs(color="Frequency",
       subtitle = "Difference predictors")

g_levels + g_diff

ggsave("model_averaging.pdf", width = 8, height = 4)

8 Testing for differences in means

In this section, we test if different layer choices lead to changes in results. Some plots are illustrative and not reported in the paper. There are many ways to display results, and only the best made it to the article.

8.1 Standard versus augmented model

In the augmented model, we proceed as in Amihud & Hurvich (2004), adding an extra term to correct for possible auto-correlation in the residuals (see the AUG branch of module_estimator above). In contrast with HAC estimators, this changes the value of the coefficient itself, not only its standard error.

results |>
  filter(is.finite(coef), is.finite(aic), estimator %in% c("OLS", "AUG")) |>
  group_by(variable, prob_levels, estimator, sheet) |>
  mutate(D = (aic - min(aic)),
         weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
         #b_freq = sum(weight_freq * coef, na.rm = T),
         n = n()) |>
  select(-t_stat, -p_val, -skew, -sd, -aic, -bic, -sum_sq_err, -D) |>
  pivot_wider(names_from = "estimator", values_from = c("coef", "weight_freq")) |>
  summarise(test_stat = t.test(coef_AUG*weight_freq_AUG*n-coef_OLS*weight_freq_OLS*n)$statistic) |>
  ungroup() |>
  summarise(min = min(abs(test_stat)),
            max = max(abs(test_stat)))
## # A tibble: 1 × 2
##      min   max
##    <dbl> <dbl>
## 1 0.0265  2.31
results |>
  filter(is.finite(coef), is.finite(aic), estimator %in% c("OLS", "AUG")) |>
  mutate(prob_levels = prob_levels |> case_match(
    1 ~ "Level",
    0 ~ "Difference"
  ),
  estimator = estimator |> case_match(
    "OLS" ~ "Standard",
    "AUG" ~ "Augmented"
  )) |>
  group_by(variable, prob_levels, estimator, sheet) |>
  mutate(D = (aic - min(aic)),
         weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
         b_freq = sum(weight_freq * coef, na.rm = T),
         n = n()) |>
  summarise(s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
            b_Frequentist = mean(b_freq),
            n = mean(n)) |>
  #mutate(stat = ) |>
  ggplot(aes(y = variable, x = b_Frequentist, color = estimator, shape = estimator, size = estimator)) + 
  geom_point() + theme_bw() + geom_vline(xintercept = 0, linetype = 2, color = "#666666") + 
  scale_size_manual(values = c(2.4,1.8)) +
  facet_grid(vars(prob_levels), vars(sheet), scales = "free") +
  theme(
    #legend.position = c(0.86,1.14),
    legend.position = c(0.86,1.24),
    legend.margin = unit(x=c(0,0,0,0),units="mm"),
    panel.grid.minor.x = element_blank(),
    axis.title = element_blank(),
    title = element_text(face = "bold"),
    strip.text = element_text(face = "bold"),
    strip.background = element_rect(fill = "#F6F6F6", color = "#EEEEEE"),
    panel.grid.major.x = element_blank()) +
  scale_color_manual(values = c("#FF9922", "#2266FF")) +
  guides(color = guide_legend(nrow = 1),
         size = guide_legend(nrow = 1),
         shape = guide_legend(nrow = 1)) +
  ggtitle("Impact of model specification on average effect") 

Another representation below.

library(ggrepel)
data_graph <- results |>
  filter(is.finite(coef), is.finite(aic), estimator %in% c("OLS", "AUG")) |>
  mutate(prob_levels = prob_levels |> case_match(
    1 ~ "Level",
    0 ~ "Difference"
  ),
  estimator = estimator |> case_match(
    "OLS" ~ "Standard",
    "AUG" ~ "Augmented"
  )) |>
  group_by(variable, prob_levels, estimator, sheet) |>
  mutate(D = (aic - min(aic)),
         weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
         b_freq = sum(weight_freq * coef, na.rm = T),
         n = n()) |>
  summarise(s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
            b_Frequentist = mean(b_freq),
            n = mean(n)) |>
  select(-s_Frequentist) |>
  filter(prob_levels == "Level") |>
  pivot_wider(names_from = "estimator", values_from = "b_Frequentist") 

g1 <- data_graph |>
  ggplot(aes(y = Augmented, x = Standard)) + 
  geom_vline(xintercept = 0, color = "#CCCCCC", size = 0.3) +  geom_hline(yintercept = 0, color = "#CCCCCC", size = 0.3) +
  
  geom_function(fun = function(x) x, linetype = 2, color = "#2288CC") +
  theme_bw() +
  theme(panel.grid = element_blank(),
        legend.position = "none",
        title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold")) +
  geom_text_repel(aes(label = paste(sheet, variable)), 
                  data = data_graph |> filter(abs(Standard) > 0.0025),
                  box.padding = 0.1,
                  min.segment.length = 1, 
                  # segment.curvature = -0.1,
                  # segment.ncp = 3,
                  segment.color = "#CCCCCC",
                  nudge_x = -.004,
                  nudge_y = .008,  
                  # segment.angle = 20,
                  color = "#000000",
                  max.overlaps = 3
  ) +
  geom_point(size = 1.8) +
  #scale_color_manual(values = c("#999999", "#000000"), name = "Predictor in:") +
  ylab("Augmented model (average effect)") + xlab("Standard model (average effect)") +
  ggtitle("Impact of model specification on average effect") 
g1

Hence: the augmented specification changes almost nothing.

8.2 Coefficient consistency through time

Now on to periods. As before, the first plot is illustrative and not reported in the paper.

table_periods <- results |>
  mutate(period = case_when(
    start_quantile == 0 & stop_quantile == 1/2 ~ "first_period",
    start_quantile == 1/2 & stop_quantile == 1 ~ "second_period",
  )) |>  
  filter(is.finite(coef), is.finite(aic), !is.na(period)) |>
  group_by(variable, prob_levels, period, sheet) |>
  mutate(D = (aic - min(aic)),
         weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
         #b_freq = sum(weight_freq * coef, na.rm = T),
         n = n()) |>
  select(-t_stat, -p_val, -skew, -sd, -aic, -bic, -sum_sq_err, -var_y, -sample_size, -D, -start_quantile, -stop_quantile) |>
  pivot_wider(names_from = "period", values_from = c("coef", "weight_freq")) |>
  summarise(test_stat = t.test(coef_first_period*weight_freq_first_period*n-
                                 coef_second_period*weight_freq_second_period*n)$statistic,
            n = mean(n),
            sd = t.test(coef_first_period*weight_freq_first_period*n-
                          coef_second_period*weight_freq_second_period*n)$stderr,
            diff = sum(coef_first_period*weight_freq_first_period-coef_second_period*weight_freq_second_period)) |>
  ungroup() |>
  mutate(test_stat_2 = case_when(
    abs(test_stat) > 1.96 ~ "significant",
    .default = "insignificant"
  )) #|>
# summarise(min = min(abs(test_stat)),
#           max = max(abs(test_stat)))


results |>
  mutate(period = case_when(
    start_quantile == 0 & stop_quantile == 1/2 ~ "first_period",
    start_quantile == 1/2 & stop_quantile == 1 ~ "second_period",
  )) |>
  filter(!is.na(period), is.finite(aic)) |>
  mutate(prob_levels = prob_levels |> case_match(
    1 ~ "Level",
    0 ~ "Difference"
  )) |>
  group_by(variable, prob_levels, period, sheet) |>
  mutate(D = (aic - min(aic)),# |> min2(),
         weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
         b_freq = sum(weight_freq * coef, na.rm = T),
         n = n()) |>
  summarise(s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
            b_Frequentist = mean(b_freq),
            n = mean(n)) |>
  ggplot(aes(y = variable, x = b_Frequentist, color = period, shape = period, size = period, fill = period)) + 
  geom_point() + theme_bw() + geom_vline(xintercept = 0, linetype = 2, color = "#666666") + 
  scale_size_manual(values = c(2.4,1.8)) +
  facet_grid(vars(prob_levels), vars(sheet), scales = "free") +
  theme(legend.position = c(0.8,1),
        panel.grid.minor.x = element_blank(),
        axis.title = element_blank(),
        strip.text = element_text(face = "bold"),
        strip.background = element_rect(fill = "#F6F6F6", color = "#EEEEEE"),
        panel.grid.major.x = element_blank()) +
  scale_color_manual(values = c("#FF9922", "#2266FF")) +
  ggtitle("Impact of periods on average effect")

Now to the more visually appealing representation…

library(ggrepel)
g2 <- results |>
  mutate(period = case_when(
    start_quantile == 0 & stop_quantile == 1/2 ~ "first_period",
    start_quantile == 1/2 & stop_quantile == 1~ "second_period",
  )) |>
  filter(is.finite(coef), is.finite(aic), !is.na(period)) |>
  group_by(variable, prob_levels, sheet, period) |>
  mutate(D = (aic - min(aic)),
         weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
         b_freq = sum(weight_freq * coef, na.rm = T),
         n = n(),
         # Below we use sqrt(sample_size) in the exponent (and denominator) to avoid numerical overflow
         lik = sqrt(sample_size/(sample_size+1))/((sum_sq_err+var_y)/(sqrt(sample_size)+1))^(sqrt(sample_size-1)),
         weight_bayes = lik/sum(lik, na.rm = T),
         b_bayes = sum(coef*weight_bayes, na.rm = T)) |>
  summarise(b_Frequentist = mean(b_freq),
            b_Bayesian = mean(b_bayes),
            n = mean(n)) |>
  ungroup() |>
  pivot_longer(b_Frequentist:b_Bayesian, names_to = "indicator", values_to = "indicator_values") |>
  separate(indicator, into = c("indicator", "avg_type"), sep = "_") |>
  pivot_wider(names_from = "indicator", values_from = "indicator_values") |> 
  pivot_wider(names_from = "period", values_from = "b") |>
  filter(avg_type == "Frequentist", prob_levels == 1) |>
  bind_cols(table_periods |> filter(prob_levels == 1) |> select(test_stat_2)) |> # For color / significance coding
  ggplot(aes(x = first_period, y = second_period, color = test_stat_2)) + 
  geom_function(fun = function(x) x, linetype = 2, color = "#2288CC", xlim = c(-0.04,0.04)) +
  geom_point(size = 2) + xlab("First period average effect") + ylab("Second period average effect") + 
  theme_bw() + #geom_smooth(method = "lm", se = F) +
  geom_hline(yintercept = 0, color = "#999999", size = 0.3) +
  geom_vline(xintercept = 0, color = "#999999", size = 0.3) +
  theme(axis.title = element_text(face = "bold"),
        legend.position = c(0.9,0.8),
        panel.grid = element_blank(),
        legend.title = element_text(face = "bold"),
        title = element_text(face = "bold"),
        panel.grid.minor = element_blank()) + 
  geom_text_repel(aes(label = paste(sheet, variable), 
                      color = test_stat_2), 
                  box.padding = 0.5,
                  max.overlaps = 20,
                  min.segment.length = 1) +
  scale_color_manual(values = c("#1177DD", "#FF8822"), name = "test statistic") +
  ggtitle("Impact of time period on average effect") 
g1/g2

# ggsave("cond_mean.pdf", width = 8, height = 7)

Bottom line: in contrast, the time period can matter a lot, especially for annual data, which is scarcer.

9 Model standard deviations and t-statistics

Below, we dive into the importance of the standard errors of estimators and of resorting to HAC corrections. IID estimators are on the x-axis, and their HAC counterparts on the y-axis.

data_graph <- results |>
  filter(is.finite(coef), is.finite(aic)) |>
  mutate(prob_levels = prob_levels |> case_match(
    1 ~ "Level",
    0 ~ "Difference"
  ),
  estimator = estimator |> case_match(
    "OLS" ~ "Standard",
    "HAC" ~ "HAC",
    "AUG" ~ "Standard",
    "AUG-HAC" ~ "HAC"
  )) |>
  group_by(variable, prob_levels, estimator, sheet) |>
  mutate(D = (aic - min(aic)),
         weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
         b_freq = sum(weight_freq * abs(t_stat), na.rm = T),
         n = n()) |>
  summarise(s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-abs(t_stat))^2), na.rm = T),
            b_Frequentist = mean(b_freq),
            n = mean(n)) |>
  select(-s_Frequentist) |>
  filter(prob_levels == "Level") |>
  pivot_wider(names_from = "estimator", values_from = "b_Frequentist") 

data_graph |>
  ggplot(aes(y = HAC, x = Standard)) + 
  geom_vline(xintercept = 0, color = "#CCCCCC", size = 0.3) +  geom_hline(yintercept = 0, color = "#CCCCCC", size = 0.3) +
  geom_function(fun = function(x) x, linetype = 2, color = "#2288CC", size = 1.05, xlim = c(0,3)) +
  theme_bw() +
  theme(panel.grid = element_blank(),
        legend.position = "none",
        title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"),
        axis.title = element_text(face = "bold")) +
  geom_text_repel(aes(label = paste(sheet, variable)), 
                  data = data_graph |> filter(Standard > 1.3),
                  box.padding = 0.2,
                  min.segment.length = 1, 
                  # segment.curvature = -0.1,
                  # segment.ncp = 3,
                  segment.color = "#CCCCCC",
                  nudge_x = .004,
                  nudge_y = -.008,  
                  # segment.angle = 20,
                  color = "#000000",
                  max.overlaps = 2
  ) +
  geom_point(size = 1.8) +
  geom_vline(xintercept = 1.96, linetype = 3) + geom_hline(yintercept = 1.96, linetype = 3) +
  geom_vline(xintercept = 2.56, linetype = 5, color = "#CCCCCC") + 
  geom_hline(yintercept = 2.56, linetype = 5, color = "#CCCCCC") +
  #scale_color_manual(values = c("#999999", "#000000"), name = "Predictor in:") +
  ylab("HAC estimator (absolute average t-statistic)") + xlab("Standard (iid) estimator (absolute average t-statistic)") +
  annotate("rect", xmin = 1.8, xmax = 2.2, ymin = 2.9, ymax = 3.1, fill = "white") +
  annotate("rect", xmin = 2.3, xmax = 2.7, ymin = 2.9, ymax = 3.1, fill = "white") +
  annotate("text", label = "5% level", x = 2, y = 3) + 
  annotate("text", label = "1% level", x = 2.6, y = 3, color = "grey") + 
  ggtitle("Impact of standard deviation correction on absolute average t-statistic") 

ggsave("hac.pdf", width = 8, heigh = 4)

As expected, the HAC standard errors are always above their iid counterparts. From an inferential point of view, they are more conservative, as they imply lower t-statistics, which are less likely to reject the null of zero effect.

10 Case study on weights

Below, we focus on the payout predictor.

base <- results |> 
  filter(variable == "payout", prob_levels == 1, is.finite(coef), is.finite(aic)) |>
  group_by(sheet) |>
  mutate(D = (aic - min(aic)),
         weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
         b_freq = sum(weight_freq * coef, na.rm = T),
         b_ew = mean(coef),
         n = n(),
         # Below we use sqrt(sample_size) in the exponent (and denominator) to avoid numerical overflow
         lik = sqrt(sample_size/(sample_size+1))/((sum_sq_err+var_y)/(sqrt(sample_size)+1))^(sqrt(sample_size-1)),
         weight_bayes = lik/sum(lik, na.rm = T),
         b_bayes = sum(coef*weight_bayes, na.rm = T)) |>
  summarise(s_ew = mean(sqrt(sd^2+(b_ew-coef)^2), na.rm = T),
            s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
            s_Bayesian = sum(weight_bayes * sqrt(sd^2+(b_bayes-coef)^2), na.rm = T),
            b_ew = mean(b_ew),
            b_Frequentist = mean(b_freq),
            b_Bayesian = mean(b_bayes),
            n = mean(n)) |>
  pivot_longer(-c("sheet", "n"), names_to = "indicator", values_to = "value") |>
  separate(indicator, sep = "_", into = c("indicator", "type"))


generate <- function(j){
  results |> 
    filter(variable == "payout", prob_levels == 1, is.finite(coef), is.finite(aic)) |>
    group_by(sheet) |>
    mutate(n = n(),
           weight_random = runif(n),
           weight_random = weight_random/sum(weight_random),
           b_random = sum(coef*weight_random, na.rm = T)) |>
    summarise(s_random = sum(weight_random * sqrt(sd^2+(b_random-coef)^2), na.rm = T),
              b_random = mean(b_random),
              n = mean(n)) |>
    pivot_longer(-c("sheet", "n"), names_to = "indicator", values_to = "value") |>
    separate(indicator, sep = "_", into = c("indicator", "type")) |>
    mutate(type = paste(type, j))
}

generate(1)
## # A tibble: 6 × 5
##   sheet         n indicator type        value
##   <chr>     <dbl> <chr>     <chr>       <dbl>
## 1 Monthly     384 s         random 1  0.0113 
## 2 Monthly     384 b         random 1 -0.00157
## 3 Quarterly   384 s         random 1  0.0297 
## 4 Quarterly   384 b         random 1  0.00123
## 5 Yearly      384 s         random 1  0.117  
## 6 Yearly      384 b         random 1  0.0161
for(j in 1:200){
  base <- base |> bind_rows(generate(j))
}

base |>
  pivot_wider(names_from = "indicator", values_from = "value") |>
  ggplot(aes(x = b, y = n)) + 
  geom_jitter(size = 0.8, 
              data = base |> filter(str_starts(type, "r")) |> 
                pivot_wider(names_from = "indicator", values_from = "value") ) + 
  theme_bw() + 
  xlab("Average effect of payout") + 
  theme(axis.title.y = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        strip.background = element_blank(),
        strip.text = element_text(face = "bold"),
        axis.title.x = element_text(face = "bold")) +
  geom_point(data = base |> filter(type == "ew") |> pivot_wider(names_from = "indicator", values_from = "value") , 
             color = "red", size = 4, shape = 17) +
  geom_point(data = base |> filter(type == "Frequentist") |> pivot_wider(names_from = "indicator", values_from = "value") , 
             color = "#2288DD", size = 4, shape = 17) +
  geom_point(data = base |> filter(type == "Bayesian") |> pivot_wider(names_from = "indicator", values_from = "value") , 
             color = "#DD8822", size = 4, shape = 17) +
  facet_grid(cols = vars(sheet), scales = "free") + 
  geom_text(data = base |> filter(sheet == "Quarterly") |> pivot_wider(names_from = "indicator", values_from = "value"),
           label = "frequentist\n average", x = -0.0019 , y = 384.25, color =  "#2288DD", size = 3) +
  geom_text(data = base |> filter(sheet == "Yearly") |> pivot_wider(names_from = "indicator", values_from = "value"),
           label = "Bayesian\n average", x = 0.0005, y = 383.8, color =  "#DD8822", size = 3)

ggsave("payout.pdf", width = 8, height = 2)

This illustrates the effect of averaging: when using particular weights, the meta model can shift heavily towards particular cases.

Lastly: the impact of AIC on effect sizes. In absolute value, |AIC| increases with |effects|.

results |> 
    filter(variable == "payout", prob_levels == 1, is.finite(coef), is.finite(aic)) |>
    ggplot(aes(x = aic, y = abs(coef))) + geom_point() + theme_bw() +
  facet_wrap(vars(sheet), scales = "free") + xlab("Akaike Information Criterion") +
  geom_smooth(method = "lm", se = F) + ylab("absolute effect") +
  theme(strip.background = element_rect(fill = "#DFDFEF"),
        axis.title = element_text(face = "bold"))

ggsave("aic.pdf", width = 8, height = 3)

11 Comparison with previous work

We extract a few numbers from A Comprehensive 2021 Look at the Empirical Performance of Equity Premium Prediction II and provide the ease-to-replicate indicator in the rounded rectangles (the indicator equals one when the previously published estimate falls below the 90% quantile of a normal approximation of our coefficient distribution, and decreases linearly to zero beyond that point).

prior <- data.frame(
    variable = c("ntis", "svar", "b/m", "payout", "dfr", "dfy"),
    effect = c(-0.36, -0.08, 0.29, -0.11, 0.23, 0.06)/100,
    first = c(-1.4, 0.01, 1.16, -0.72, 0.08, 0.05)/100,
    second = c(-0.19, -0.42, -0.16, 0.35, 1.28, 0.44)/100
)

quant <- 0.9
results |>
    filter(prob_levels == 1, sheet == "Monthly", horizon == 1) |>
    left_join(prior, by = "variable") |>
    group_by(variable) |>
    mutate(m = mean(coef, na.rm = T),
           sd = sd(coef, na.rm = T),
           q = pnorm(abs(effect), mean = abs(m), sd = sd),
           q_first = pnorm(abs(first), mean = abs(m), sd = sd),
           q_2 = pnorm(abs(second), mean = abs(m), sd = sd),
           EtR = 1-if_else(q>quant, (q-quant)/(1-quant), 0),
           EtR_first = 1-if_else(q_first>quant, (q_first-quant)/(1-quant), 0),
           EtR_2 = 1-if_else(q_2>quant, (q_2-quant)/(1-quant), 0)) |>
    ggplot(aes(x = coef)) + 
    geom_vline(aes(xintercept = effect), color = "black") + 
    geom_vline(aes(xintercept = first), color = "red", linetype = 2) + 
    geom_vline(aes(xintercept = second), color = "red") + 
    geom_histogram(alpha = 0.6) +
    facet_wrap(vars(variable), scales = "free") +
    theme_classic() +
    theme(axis.title = element_blank()) +
    geom_label(aes(x = effect, y = 9, label = paste0(round(100*EtR, 1), "%")), size = 2.4) +
    geom_label(aes(x = first, y = 11.5, label = paste0(round(100*EtR_first, 1), "%")), size = 2.4, color = "red") +
    geom_label(aes(x = second, y = 6.5, label = paste0(round(100*EtR_2, 1), "%")), size = 2.4, color = "red") 

ggsave("premium_prior.pdf", width = 9, height = 5)