This notebook presents the first empirical study of the paper
Forking paths in empirical studies available on SSRN.
The paper has had several rounds of review, which is why some of the
material may not be present in some versions of the article. We provide
everything for the sake of completeness.
The idea of the paper is to decompose any empirical analysis into a
series of steps, which are pompously called
mapppings in the paper.
Each step matters and can have an important impact on the outcome of the
study.
Because of that, we argue that the researcher should, if possible,
keep track of all the possible modelling options and
present the distribution of outcomes (e.g., t-statistics or p-values)
across all configurations.
Below, most code chunks are hidden to ease readability.
It is easy to access all code content by clicking on the related
buttons.
First, we test the data importation & load the libraries.
The data, downloaded from Amit Goyal’s website
is stored on a Github repo.
It was used in the follow-up paper A
Comprehensive Look at the Empirical Performance of Equity Premium
Prediction II.
Because we do not want to download the data multiple times, we do it
only once, for three sheets (monthly, quarterly & yearly data).
library(tidyverse) # Data wrangling & plotting
library(readxl) # To read excel files
library(zoo) # For data imputation
library(DescTools) # For winsorization
library(tictoc) # CPU time estimation
library(sandwich) # HAC estimator
library(lmtest) # Statistical inference
library(furrr) # Parallel computing
library(xtable) # LaTeX exports
library(patchwork) # Plot combination
library(reshape2) # List management
loadWorkbook_url <- function(sheet) { # Function that downloads the data from online file
url = "https://github.com/shokru/coqueret.github.io/blob/master/files/misc/PredictorData2021.xlsx?raw=true"
temp_file <- tempfile(fileext = ".xlsx")
download.file(url = url, destfile = temp_file, mode = "wb", quiet = TRUE)
read_excel(temp_file, sheet = sheet)
}
data_month <- loadWorkbook_url(1) # Dataframe for monthly data
data_quarter <- loadWorkbook_url(2) # Dataframe for quarterly data
data_year <- loadWorkbook_url(3) # Dataframe for annual data
Next, we code the modules. They correspond to the
\(f_j\) mappings in the paper.
The chaining of mappings follows the scheme below:
Because of the strong analogy with neural networks, we approach the coding of mappings/modules similarly to the early implementation of the sequential Keras framework.
First module: decide which sheet to work with (i.e., horizon: monthly, quarterly or annually).
prob_sheet <- c(0,1,0)
module_sheet <- function(prob_sheet){
prob_month <- prob_sheet[1] # Prob. to pick sheet 1 (monthly)
prob_quarter <- prob_sheet[2] # Prob. to pick sheet 2 (quarterly)
prob_year <- prob_sheet[3] # Prob. to pick sheet 3 (yearly)
if((prob_month == 0) + (prob_quarter == 0) + (prob_year == 0) == 2){
if(prob_month == 1){return(data_month)}
if(prob_quarter == 1){return(data_quarter)}
if(prob_year == 1){return(data_year)}
} else
p = runif(1) # This is the random option (not used below)
if(p <= prob_month){return(data_month)}
else if(p <= prob_month+prob_quarter){return(data_quarter)}
else {return(data_year)}
}
module_sheet(prob_sheet)[250:260, 1:11] |> # A snapshot
mutate(across(everything(), as.numeric)) |>
knitr::kable(font_size = 5)
yyyyq | Index | D12 | E12 | b/m | tbl | AAA | BAA | lty | ntis | Rfree |
---|---|---|---|---|---|---|---|---|---|---|
19332 | 10.91 | 0.4700 | 0.4250 | 0.8335032 | 0.0007 | 0.0446 | 0.0707 | 0.0306 | 0.0007609 | 0.003350 |
19333 | 9.83 | 0.4550 | 0.4325 | 0.8679966 | 0.0004 | 0.0436 | 0.0727 | 0.0308 | -0.0004585 | 0.000175 |
19334 | 10.10 | 0.4400 | 0.4400 | 0.8290260 | 0.0029 | 0.0450 | 0.0775 | 0.0336 | 0.0061958 | 0.000100 |
19341 | 10.75 | 0.4425 | 0.4525 | 0.8025122 | 0.0024 | 0.0413 | 0.0626 | 0.0307 | 0.0096063 | 0.000725 |
19342 | 9.81 | 0.4450 | 0.4650 | 0.8407311 | 0.0015 | 0.0393 | 0.0606 | 0.0289 | 0.0059828 | 0.000600 |
19343 | 9.10 | 0.4475 | 0.4775 | 0.8703644 | 0.0021 | 0.0396 | 0.0657 | 0.0310 | 0.0194390 | 0.000375 |
19344 | 9.50 | 0.4500 | 0.4900 | 0.7737409 | 0.0023 | 0.0381 | 0.0623 | 0.0293 | 0.0208127 | 0.000525 |
19351 | 8.47 | 0.4500 | 0.7300 | 0.8007541 | 0.0015 | 0.0367 | 0.0620 | 0.0274 | 0.0120460 | 0.000575 |
19352 | 10.23 | 0.4400 | 0.8100 | 0.6818182 | 0.0015 | 0.0361 | 0.0577 | 0.0270 | 0.0193440 | 0.000375 |
19353 | 11.59 | 0.4400 | 0.7600 | 0.6117344 | 0.0020 | 0.0359 | 0.0553 | 0.0282 | 0.0080266 | 0.000375 |
19354 | 13.43 | 0.4700 | 0.7600 | 0.5599112 | 0.0015 | 0.0344 | 0.0530 | 0.0276 | 0.0086888 | 0.000500 |
After the first module, each mapping takes data as first argument. This will make it easy to use pipes (sequences of instructions).
Second module: data cleaning.
NOTE: all columns must be converted to numbers beforehand.
module_cleaning <- function(data, prob_remove){ # Two inputs: data & choice
data <- data |> mutate(across(everything(), as.numeric)) # Convert everything to numbers
if(runif(1) <= prob_remove){ # If remove:
return(data |> na.omit()) # REMOVE!
} else
return(data |> mutate(across(everything(), na.locf, na.rm = F))) # Impute from previous non-NA (not very useful here)
}
Third module: winsorizing.
We compute some values of predictors in this module (from the raw
series): payout, dfr and
dfy.
The winsorization comes after. The only input is the level for the
winsorization.
module_winsorization <- function(data, threshold){
data <- data |> mutate(payout = log(D12/E12),
dfy = BAA - AAA,
dfr = corpr - ltr)
data <- data |> mutate(across(2:ncol(data), Winsorize, probs = c(threshold, 1-threshold), na.rm = T))
return(data)
}
Fourth module: levels vs. differences.
This step decides if independent variables are left as levels, or if
variations are used instead.
differentiate <- function(v){v - lag(v)}
module_levels <- function(data, prob_levels){
if(runif(1) < prob_levels){return(data)}
else return(data |> mutate(across(3:ncol(data), differentiate)) |> na.omit())
}
Fifth module: independent variable.
This stage sets the (unique) predictor. There are 6 in the study.
module_independent <- function(data, variable){
data |> dplyr::select(all_of(c("Index", variable, "Rfree")))
}
Sixth module: horizon.
This module determines the horizon of the dependent variable (return)
and removes missing data.
It is given in periods (thus, results are contingent on the initial
choice of the sheet (first module)).
module_horizon <- function(data, horizon){
data[,1] <- lead(data[,1], horizon)/data[,1] - 1 - horizon * data[,3] # Excess returns on the Index (S&P 500)
data |> na.omit()
}
Seventh module: starting point.
In our study, we allow for simple subsampling.
The idea is that predictability should not be (too much) time-dependent
(though this may
be the case).
Thus, we allow for 3 sample size options:
- full sample
- sample that begins in the middle of the available data (second
half)
- sample that stop at the the middle of the data (first
half)
This module picks the starting point (either the original first point, or the middle point).
module_start <- function(data, start_quantile){
L <- nrow(data)
data[ceiling(L*start_quantile):L, ]
}
Eighth module: stopping point.
This module picks the stopping point (either the original last point, or the middle point).
module_stop <- function(data, stop_quantile){
L <- nrow(data)
data[1:ceiling(L*stop_quantile), ]
}
Ninth module: estimator type.
This last step takes the data and performs estimations.
Two options are possible: the standard iid estimator, or the HAC
estimator from Newey-West.
The final function yields 10 outputs:
- the coefficient, the standard error, the t-statistic, the
p-value, the skewness of errors, which are baseline
outputs;
- the two criteria (AIC & BIC), for frequentist model
averaging;
- the sample size, the sum of squared residuals and the
variance of the dependent variable, for the Bayesian model
averaging.
module_estimator <- function(data, estimator){
y <- data |> pull(1) # Naming the dependent variable
x <- data |> pull(2) # Naming the independent variable
x <- x / sd(x) # Scaling the predictors
if(nrow(data) < 30){ # Test for sample size
t_stat <- NA # NA if too small (not enough obs.)
sd <- NA
coef <- NA
p_val <- NA
aic <- NA
bic <- NA
sum_sq_err <- NA
var_y <- NA
skew <- NA
} else {
if(estimator == "HAC"){
model <- lm(y ~ x) # Running the regression
coef <- summary(model)$coef[2,1] # OLS coefficient
sd <- coeftest(model, vcov. = NeweyWest(model))[2,2] # NW standard error
t_stat <- coeftest(model, vcov. = NeweyWest(model))[2,3] # OLS with Newey-West Cov matrix
p_val <- coeftest(model, vcov. = NeweyWest(model))[2,4] # p-value
aic <- AIC(model) # Akaike Info Criterion
bic <- BIC(model) # Bayesian Info Criterion
sum_sq_err <- sum(model$residuals^2) # Sum of squared residuals
var_y <- var(y) # Variance of dependent variable
skew <- mean((model$residuals/sd(model$residuals))^3) # Skewness of residuals
}
if(estimator == "OLS"){
model <- lm(y ~ x) # Running the regression
coef <- summary(model)$coef[2,1] # OLS coef
sd <- summary(model)$coef[2,2] # OLS coef
t_stat <- summary(model)$coef[2,3] # OLS t-statistic
p_val <- summary(model)$coef[2,4] # p-value
aic <- AIC(model) # Akaike Info Criterion
bic <- BIC(model) # Bayesian Info Criterion
sum_sq_err <- sum(model$residuals^2) # Sum of squared residuals
var_y <- var(y) # Variance of dependent variable
skew <- mean((model$residuals/sd(model$residuals))^3) # Skewness of residuals
}
if(estimator == "AUG"){
model_x <- lm(x ~ lag(x)) # x-regression
rho <- model_x$coef[2]
nu <- lead(x) - rho * x # x-innovations
model <- lm(y ~ x + nu)
coef <- summary(model)$coef[2,1] # OLS coef
sd <- summary(model)$coef[2,2] # OLS coef
t_stat <- summary(model)$coef[2,3] # OLS t-statistic
p_val <- summary(model)$coef[2,4] # p-value
aic <- AIC(model) # Akaike Info Criterion
bic <- BIC(model) # Bayesian Info Criterion
sum_sq_err <- sum(model$residuals^2) # Sum of squared residuals
var_y <- var(y) # Variance of dependent variable
skew <- mean((model$residuals/sd(model$residuals))^3) # Skewness of residuals
}
if(estimator == "AUG-HAC"){
model_x <- lm(x ~ lag(x)) # x-regression
rho <- model_x$coef[2]
nu <- lead(x) - rho * x # x-innovations
model <- lm(y ~ x + nu)
coef <- summary(model)$coef[2,1] # OLS coef
sd <- coeftest(model, vcov. = NeweyWest(model))[2,2] # NW standard error
t_stat <- coeftest(model, vcov. = NeweyWest(model))[2,3] # OLS with Newey-West Cov matrix
p_val <- coeftest(model, vcov. = NeweyWest(model))[2,4] # p-value
aic <- AIC(model) # Akaike Info Criterion
bic <- BIC(model) # Bayesian Info Criterion
sum_sq_err <- sum(model$residuals^2) # Sum of squared residuals
var_y <- var(y) # Variance of dependent variable
skew <- mean((model$residuals/sd(model$residuals))^3) # Skewness of residuals
}
}
return(list(coef = coef, sd = sd, t_stat = t_stat, p_val = p_val,
sample_size = nrow(data), aic = aic, bic = bic,
sum_sq_err = sum_sq_err, var_y = var_y, skew = skew))
}
We show a simple test below.
It shows the natural chaining of the mappings, akin to
a neural network (without back-propagation!).
All modules are executed successively.
prob_sheet <- c(0,0,1) # Yearly data
prob_remove <- 1 # Remove NAs
threshold <- 0.01 # Threshold for winsorization
prob_levels <- 0 # Switch to differences
variable <- "payout" # Which independent variable
horizon <- 1 # Return horizon
start_quantile <- 0.5 # Starting point
stop_quantile <- 1 # Stopping point
estimator <- "OLS" # Which estimator
module_sheet(prob_sheet = prob_sheet) |> # Step 1: select data frequency
module_cleaning(prob_remove = prob_remove) |> # Step 2: clean the data
module_winsorization(threshold = threshold) |> # Step 3: manage outliers
module_levels(prob_levels = prob_levels) |> # Step 4: decide between levels versus differences
module_independent(variable = variable) |> # Step 5: choose predictor
module_horizon(horizon = horizon) |> # Step 6: pick return horizon (dependent variable)
module_start(start_quantile = start_quantile) |> # Step 7: starting point (for subsampling)
module_stop(stop_quantile = stop_quantile) |> # Step 8: ending point (for subsampling)
module_estimator(estimator = estimator) # Step 9: select estimator type
## $coef
## [1] -0.01681519
##
## $sd
## [1] 0.02298709
##
## $t_stat
## [1] -0.7315057
##
## $p_val
## [1] 0.4682627
##
## $sample_size
## [1] 47
##
## $aic
## [1] -37.36275
##
## $bic
## [1] -31.81231
##
## $sum_sq_err
## [1] 1.093801
##
## $var_y
## [1] 0.02406103
##
## $skew
## [1] -0.6425367
Below, we code a function that takes all parameters
and that runs the whole chain of operators/mappings.
Because we consider a uniform distribution for the sheet (horizons), we
recode the function slightly so that the corresponding input is a simple
string.
module_aggregator <- function(sheet = sheet,
prob_remove = prob_remove,
threshold = threshold,
prob_levels = prob_levels,
variable = variable,
horizon = horizon,
start_quantile = start_quantile,
stop_quantile = stop_quantile,
estimator = estimator){
if(sheet == "Monthly"){prob_sheet <- c(1,0,0)}
if(sheet == "Quarterly"){prob_sheet <- c(0,1,0)}
if(sheet == "Yearly"){prob_sheet <- c(0,0,1)}
module_sheet(prob_sheet = prob_sheet) |>
module_cleaning(prob_remove = prob_remove) |>
module_winsorization(threshold = threshold) |>
module_levels(prob_levels = prob_levels) |>
module_independent(variable = variable) |>
module_horizon(horizon = horizon) |>
module_start(start_quantile = start_quantile) |>
module_stop(stop_quantile = stop_quantile) |>
module_estimator(estimator = estimator)
}
Next, we span all the combinations we want to test. Implicitly, they are given the same (uniform) probability.
sheet <- c("Monthly", "Quarterly", "Yearly")
prob_remove <- c(0, 1)
threshold <- c(0, 0.01, 0.02, 0.03)
prob_levels <- c(0, 1)
variable <- c("payout", "b/m", "svar", "dfr", "dfy", "ntis")
horizon <- c(1, 3, 6, 12)
start_quantile <- c(0, 0.5)
stop_quantile <- c(0.5, 1)
estimator <- c("OLS", "HAC", "AUG", "AUG-HAC")
pars <- expand_grid(sheet, prob_remove, threshold, prob_levels, variable, horizon, start_quantile, stop_quantile, estimator)
pars <- pars |> filter(!(start_quantile == 0.5 & stop_quantile == 0.5))
# sheet <- pars[,1]
# prob_remove <- pars[,2]
# threshold <- pars[,3]
# prob_levels <- pars[,4]
# variable <- pars[,5]
# horizon <- pars[,6]
# start_quantile <- pars[,7]
# stop_quantile <- pars[,8]
# estimator <- pars[,9]
We then launch the computation of the outputs across all 13824
combinations.
Note that with the functional programming abilities of
R, this is incredibly code efficient.
To speed up things, we resort to parallelization over 4
cores.
plan(multisession, workers = 4) # Set the number of cores
tictoc::tic()
output <- future_pmap_dfr(pars, module_aggregator) # Launch!
tictoc::toc()
results <- bind_cols(pars, output) # Bind parameters to results
save(results, file = "results.RData")
Here, we adjust the distribution of t-statistics to account for the possible error-in-variable (EIV) bias, as in Proposition 2 of Skill, Scale, and Value Creation in the Mutual Fund Industry.
load("results.RData")
bias_terms <- function(x, m, s, t){ # We follow the notations of Proposition 2
# The first 5 values are the sample moments of the bivariate series
m_m <- mean(m, na.rm = T)
s_m <- sd(m, na.rm = T)
m_s <- mean(s, na.rm = T)
s_s <- sd(s, na.rm = T)
rho <- cor(m,s, use = "complete.obs")
# Next, the optimal h, along with bar(m1) and bar(m2)
h <- (3/4/s_m^5*(rho^4*s_s^4/12-(rho*s_s)^2+1)*exp(m_s+s_s^2*(1-rho^2/2)/2)*(length(m)/mean(t)))^(-1/3)
m1 <- (x-m_m)/s_m
m2 <- (x-m_m-rho*s_m*s_s)/s_m
bs1 <- h^2/2/s_m^2*(m1^2-1)/s_m*dnorm(m1)
bs2 <- (1/2/mean(t)*exp(m_s+s_s^2/2)/s_m^2*(m2^2-1))/s_m*dnorm(m2)
return(bs1 + bs2)
}
adjust_bias <- function(xx, m, s, t){
integrate(bias_terms,
-Inf, xx,
m = m, s = s, t = t)$value
}
adjust_cdf_v <- Vectorize(adjust_bias, vectorize.args='xx')
# Fine discretization
grid_01 <- seq(min(results$t_stat, na.rm = T)-0.1, max(results$t_stat, na.rm = T)+0.1, length.out = 2*nrow(results))
# Next step is long because of integration
Phi_adj <- adjust_cdf_v(grid_01,
m = results$t_stat,
s = results$sd,
t = results$sample_size) + ecdf(results$t_stat)(grid_01)
Phi_adj <- Phi_adj[Phi_adj < 1] |> sort()
# Create inverse function by swapping x & y
quant <- stepfun(Phi_adj, grid_01[1:(length(Phi_adj)+1)])
res_adj <- results
res_adj$t_stat <- quant(ecdf(results$t_stat)(results$t_stat))
res_adj <- res_adj |> mutate(adjusted = "Yes")
res_adj$coef <- NA
res_adj$sd <- NA
res_adj$p_val <- 2*(1-pt(abs(res_adj$t_stat), results$sample_size))
results <- results |> mutate(adjusted = "No")
results <- bind_rows(results, res_adj)
results # Have a look
## # A tibble: 27,648 × 20
## sheet prob_remove threshold prob_levels variable horizon start_quantile
## <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 Monthly 0 0 0 payout 1 0
## 2 Monthly 0 0 0 payout 1 0
## 3 Monthly 0 0 0 payout 1 0
## 4 Monthly 0 0 0 payout 1 0
## 5 Monthly 0 0 0 payout 1 0
## 6 Monthly 0 0 0 payout 1 0
## 7 Monthly 0 0 0 payout 1 0
## 8 Monthly 0 0 0 payout 1 0
## 9 Monthly 0 0 0 payout 1 0.5
## 10 Monthly 0 0 0 payout 1 0.5
## # ℹ 27,638 more rows
## # ℹ 13 more variables: stop_quantile <dbl>, estimator <chr>, coef <dbl>,
## # sd <dbl>, t_stat <dbl>, p_val <dbl>, sample_size <int>, aic <dbl>,
## # bic <dbl>, sum_sq_err <dbl>, var_y <dbl>, skew <dbl>, adjusted <chr>
rm(res_adj) # Remove temp file
The plot below shows the intervals when fixing exactly one mapping,
while leaving all other mappings free.
This generates many combinations, for each of the 31 possible fixed
choices.
module_names <- c("sheet", "prob_remove", "threshold", "prob_levels", "variable",
"horizon", "start_quantile", "stop_quantile", "estimator", "adjusted")
results |>
mutate(across(all_of(module_names), as.character)) |>
pivot_longer(cols = all_of(module_names), names_to = "module", values_to = "value") |>
group_by(module, value) |>
summarise(min = min(t_stat, na.rm = T),
max = max(t_stat, na.rm = T)) |>
mutate(variable = paste(module, value)) |>
ggplot(aes(y = variable)) + geom_vline(xintercept = 0, linetype = 2) +
geom_errorbarh(aes(xmin = min, xmax = max, color = module)) +
theme_bw() + ylab("") + xlab("hacking interval (t-statistics)") +
theme(legend.title = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")) +
guides(color = guide_legend(reverse = TRUE)) +
scale_color_manual(values = c("#3D6499", "#FCA906", "#FC062F", "#9457CF", "#777777", "#000000", "#84D0B1", "#568751", "#7B6B4A", "#7DD5FE"))
ggsave("intervals.pdf", width = 8, height = 5)
New version below that accounts for differences in predictors.
module_names <- c("sheet", "prob_remove", "threshold", "prob_levels",
"horizon", "start_quantile", "stop_quantile", "estimator", "adjusted")
exp_fun <- function(variable_x){
module_names <- c("sheet", "prob_remove", "threshold", "prob_levels",
"horizon", "start_quantile", "stop_quantile", "estimator", "adjusted")
size_full <- vector(mode = "list", length = length(module_names) - 1)
for(j in 1:(length(module_names) - 1)){
#print(j)
combz <- combn(module_names, j) # All possible combinations
for(k in 1:ncol(combz)){
temp <- results |>
filter(variable == variable_x) |>
group_by_at(combz[,k]) |>
summarise(min = min(t_stat, na.rm = T),
max = max(t_stat, na.rm = T)) |>
mutate(size = max - min)
size_full[[j]] <- c(size_full[[j]], temp |> pull(size))
}
}
reshape2::melt(size_full) |> filter(is.finite(value)) |>
mutate(L1 = 9 - L1, variable = variable_x)
}
data_combi <- map_dfr(unique(results$variable), exp_fun)
expanding_model <- lm(log(mean) ~ L1,
data = data_combi |>
group_by(L1, variable) |> summarise(mean = mean(value, na.rm = T)))
summary(expanding_model)
##
## Call:
## lm(formula = log(mean) ~ L1, data = summarise(group_by(data_combi,
## L1, variable), mean = mean(value, na.rm = T)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74399 -0.25592 -0.03785 0.24495 0.74424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.31935 0.12229 -2.611 0.0121 *
## L1 0.33867 0.02422 13.985 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3844 on 46 degrees of freedom
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.8055
## F-statistic: 195.6 on 1 and 46 DF, p-value: < 2.2e-16
# save(data_combi, file = "data_combi.RData")
Finally, the plot that summarizes the findings.
Note that the figures at the top show the numbers of combinations of
mappings over which the boxplots are drawn.
data_combi |>
mutate(size = abs(L1-3.5)) |>
mutate(predictor = variable) |>
ggplot(aes(x = as.factor(L1), y = value, color = predictor)) +
geom_function(fun = function(x) exp(expanding_model$coefficients[1]) * exp(expanding_model$coefficients[2])^x,
linetype = 2, size = 0.8) +
geom_boxplot(outlier.size = 1) +
theme_bw() +
xlab("Degrees of freedom (number of free mappings)") +
ylab("ARI = average range of hacking interval") +
scale_color_viridis_d(direction = -1) +
theme(axis.title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
panel.grid.major.x = element_blank()
#legend.position = "none"
) +
#stat_summary(fun = mean, geom = "point", shape = 20, size = 6, fill = "black")+
annotate("rect", xmin = 0, xmax = 2.2, ymin = 12, ymax = 15, fill = "white", alpha = 0.5) +
annotate("text", x = 1.4, y = 14, label = "many small intervals", fontface = "bold") +
annotate("text", x = 7.5, y = 1, label = "few large intervals", fontface = "bold")
ggsave("combinations.pdf", width = 8, height = 4)
Next, some analysis.
Notably, we estimate the coefficients from \(y=a+b^n\), where \(y\) is the average range of hacking
intervals and \(n\) is the
number of “free” mappings.
We also examine the ratio between successive numbers of mappings.
We see that even after 5-7 free mappings, each new mapping extends the
range of \(t\)-statistics by more than
30%.
data_combi |>
group_by(L1, variable) |>
summarise(ARI = mean(value, na.rm = T)) |>
group_by(variable) |>
mutate(rate = ARI / lag(ARI) - 1) |>
rename("J-k" = L1) |>
pivot_longer(cols = c("ARI", "rate"), names_to = "indicator", values_to = "value") |>
mutate(value = round(value, 3)) |>
pivot_wider(names_from = "J-k", values_from = c("value")) |>
arrange(indicator) |>
data.frame() |>
xtable(digits = 3) |>
print(include.rownames = F)
## % latex table generated in R 4.2.1 by xtable 1.8-4 package
## % Wed Jun 14 09:38:58 2023
## \begin{table}[ht]
## \centering
## \begin{tabular}{llrrrrrrrr}
## \hline
## variable & indicator & X1 & X2 & X3 & X4 & X5 & X6 & X7 & X8 \\
## \hline
## b/m & ARI & 1.237 & 2.613 & 4.151 & 5.927 & 8.052 & 10.669 & 13.949 & 18.054 \\
## dfr & ARI & 0.484 & 1.030 & 1.645 & 2.352 & 3.173 & 4.121 & 5.182 & 6.324 \\
## dfy & ARI & 0.577 & 1.222 & 1.946 & 2.779 & 3.756 & 4.893 & 6.141 & 7.347 \\
## ntis & ARI & 0.983 & 2.047 & 3.192 & 4.454 & 5.899 & 7.614 & 9.701 & 12.251 \\
## payout & ARI & 0.660 & 1.375 & 2.148 & 2.994 & 3.933 & 4.965 & 6.049 & 7.086 \\
## svar & ARI & 0.628 & 1.319 & 2.072 & 2.906 & 3.836 & 4.861 & 5.943 & 6.980 \\
## b/m & rate & & 1.112 & 0.589 & 0.428 & 0.358 & 0.325 & 0.307 & 0.294 \\
## dfr & rate & & 1.126 & 0.597 & 0.429 & 0.349 & 0.299 & 0.258 & 0.220 \\
## dfy & rate & & 1.119 & 0.592 & 0.428 & 0.352 & 0.303 & 0.255 & 0.196 \\
## ntis & rate & & 1.083 & 0.559 & 0.395 & 0.324 & 0.291 & 0.274 & 0.263 \\
## payout & rate & & 1.083 & 0.562 & 0.394 & 0.314 & 0.262 & 0.218 & 0.172 \\
## svar & rate & & 1.099 & 0.571 & 0.402 & 0.320 & 0.267 & 0.223 & 0.174 \\
## \hline
## \end{tabular}
## \end{table}
When several estimates have been obtained to quantify one effect, it is natural to seek to aggregate them.
First, frequentist averaging. There are many ways to proceed, but we choose a very common form: \[\hat{b}_* =\sum_{j=1}^Jw_j\hat{b}_j\] with \[w_j= \frac{e^{-\Delta_j/2}}{\sum_{k=1}^Je^{-\Delta_k/2}}, \quad \Delta_j = AIC_j-\underset{j}{\min}AIC_j\]
For the estimation of the variance of the aggregate estimator, we follow Equation (1) in Burnham & Anderson (2004): \[\hat{\sigma}^2_*=\left(\sum_{j=1}^J w_j\sqrt{\hat{\sigma}_j + (\hat{b}_*-\hat{b}_j)^2} \right)^2.\]
Below, the confidence intervals are computed as \([\hat{b}_*-1.96\sigma_*^2/\sqrt{P},\hat{b}_*+1.96\sigma_*^2/\sqrt{P}]\), where \(P\) is the number of paths.
Next, Bayesian averaging. The quantity of interest is \(b\), with posterior probability given the data \(D\) equal to \[P[b | D]= \sum_{m=1}^MP[b|M_m,D]P[M_m|D],\] where \(\mathbb{M}=\{M_m,m=1,\dots,M\}\) is the set of models under consideration. One model corresponds to one complete path. Notably, the above equation translates to the following conditional average and variance: \[\begin{align} \mathbb{E}[b|D]&=\sum_{m=1}^M\hat{b}_mP[M_m|D] \label{eq:mean} \\ \mathbb{V}[b|D]&=\sum_{m=1}^M \left(\mathbb{V}(b|M_m,D) + \hat{b}^2_m \right)P[M_m|D] - (\mathbb{E}[b|D])^2 \end{align}\] where \(\hat{b}_m\) is the estimated effect in model \(m\). The posterior model probabilities are given by \[\begin{equation} P[M_m|D] = \left(\sum_{j=1}^M \frac{P[M_j]}{P[M_m]}\frac{l_D(M_j)}{l_D(M_m)} \right)^{-1}, \end{equation}\] with \(l_D(M_j)\) being the marginal likelihood of model \(j\). Because we are agnostic with respect to which model is more realistic, we will set the prior odds \(\frac{P[M_j]}{P[M_m]}\) equal to one. Note that in this case, the posterior probabilities are then simply proportional to the likelihoods. The remaining Bayes factor is by far the most complex: \[\begin{equation} \frac{l_D(M_j)}{l_D(M_m)}=\left(\frac{n_j}{n_j+1}\right)^{\frac{k_j}{2}}\left(\frac{n_m+1}{n_m}\right)^{\frac{k_m}{2}}\frac{\left(\frac{s_m+n_mv_m}{n_m+1} \right)^{(n_m-1)/2}}{\left(\frac{s_j+n_jv_j}{n_j+1} \right)^{(n_j-1)/2}}, \label{eq:odds} \end{equation}\] where \(n_j\) is the inverse of the number of observations used in model \(j\) and \(k_j\) is the number of predictors in this model, omitting the constant (\(k_j=1\) in our case). Moreover, \(n_jv_j\) is the sample variance of the dependent variable in model \(j\). Finally, \(s_j\) is the sum of squared residuals under model \(j\).
Finally, the results.
m_avg <- results |>
filter(is.finite(coef), is.finite(aic)) |>
group_by(variable, prob_levels, sheet) |>
mutate(D = (aic - min(aic)),
weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
b_freq = sum(weight_freq * coef, na.rm = T),
b_ew = mean(coef),
n = n(),
# Below we use sart(sample_size in the end to avoid overflow)
lik = sqrt(sample_size/(sample_size+1))/((sum_sq_err+var_y)/(sqrt(sample_size)+1))^(sqrt(sample_size-1)),
weight_bayes = lik/sum(lik, na.rm = T),
b_bayes = sum(coef*weight_bayes, na.rm = T)) |>
summarise(s_ew = mean(sqrt(sd^2+(b_ew-coef)^2), na.rm = T),
s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
s_Bayesian = sum(weight_bayes * sqrt(sd^2+(b_bayes-coef)^2), na.rm = T),
b_ew = mean(b_ew),
b_Frequentist = mean(b_freq),
b_Bayesian = mean(b_bayes),
n = mean(n)) |>
pivot_longer(s_ew:b_Bayesian, names_to = "indicator", values_to = "indicator_values") |>
separate(indicator, into = c("indicator", "avg_type"), sep = "_") |>
pivot_wider(names_from = "indicator", values_from = "indicator_values")
g_levels <- m_avg |>
filter(prob_levels == 1, avg_type != "ew") |>
ggplot(aes(y = reorder(avg_type, b), x = b, color = sheet)) +
geom_point(position = position_dodge(0.9)) + geom_vline(xintercept = 0, linetype = 2) +
facet_grid(rows = vars(variable), scales = "free") +
geom_errorbar(aes(xmin = b - 2.58 * s / sqrt(n), xmax = b + 2.58 * s / sqrt(n)), position = "dodge", width = 0.9) +
theme_minimal() + #xlab("coefficient (confidence interval)") +
theme(legend.position = c(0.8,0.4),
text = element_text(size = 12),
title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
strip.background = element_blank(),
strip.text = element_blank(),
legend.box.background = element_rect(color = "grey", size = 1),
panel.grid = element_blank(),
legend.background = element_rect(fill = "white", color = "white"),
axis.title = element_blank()) +
labs(color = "Frequency",
subtitle = "Level predictors")
g_diff <- m_avg |>
filter(prob_levels == 0, avg_type != "ew") |>
ggplot(aes(y = reorder(avg_type, b), x = b, color = sheet)) +
geom_point(position = position_dodge(0.9)) + geom_vline(xintercept = 0, linetype = 2) +
facet_grid(rows = vars(variable), scales = "free") +
geom_errorbar(aes(xmin = b - 1.96 * s / sqrt(n), xmax = b + 1.96 * s / sqrt(n)), position = "dodge", width = 0.9) +
theme_minimal() + #xlab("coefficient (confidence interval)") +
theme(legend.position = "none",
text = element_text(size = 12),
title = element_text(face = "bold"),
legend.title = element_text(face = "bold", color = "grey"),
legend.box.background = element_rect(color = "grey", size = 1),
strip.background = element_rect(fill = "#EEEEEE", color = "white"),
strip.text = element_text(face = "bold"),
panel.grid = element_blank(),
axis.text.y = element_blank(),
legend.background = element_rect(fill = "white", color = "white"),
panel.background = element_rect(fill = "#F6F6F6", color = "white"),
axis.title = element_blank()) +
labs(color="Frequency",
subtitle = "Difference predictors")
g_levels + g_diff
ggsave("model_averaging.pdf", width = 8, height = 4)
In this section, we test if different layer choices lead to changes in results. Some plots are illustrative and not reported in the paper. There are many way to display results and the best only made it to the article.
In the augmented model, we proceed as in Amihud & Hurvich 2004 by adding an extra term to correct for possible auto-correlation in the residuals. In contrast with HAC estimators, this changes the value of the coefficient, not only the standard deviation.
results |>
filter(is.finite(coef), is.finite(aic), estimator %in% c("OLS", "AUG")) |>
group_by(variable, prob_levels, estimator, sheet) |>
mutate(D = (aic - min(aic)),
weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
#b_freq = sum(weight_freq * coef, na.rm = T),
n = n()) |>
select(-t_stat, -p_val, -skew, -sd, -aic, -bic, -sum_sq_err, -D) |>
pivot_wider(names_from = "estimator", values_from = c("coef", "weight_freq")) |>
summarise(test_stat = t.test(coef_AUG*weight_freq_AUG*n-coef_OLS*weight_freq_OLS*n)$statistic) |>
ungroup() |>
summarise(min = min(abs(test_stat)),
max = max(abs(test_stat)))
## # A tibble: 1 × 2
## min max
## <dbl> <dbl>
## 1 0.0265 2.31
results |>
filter(is.finite(coef), is.finite(aic), estimator %in% c("OLS", "AUG")) |>
mutate(prob_levels = prob_levels |> case_match(
1 ~ "Level",
0 ~ "Difference"
),
estimator = estimator |> case_match(
"OLS" ~ "Standard",
"AUG" ~ "Augmented"
)) |>
group_by(variable, prob_levels, estimator, sheet) |>
mutate(D = (aic - min(aic)),
weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
b_freq = sum(weight_freq * coef, na.rm = T),
n = n()) |>
summarise(s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
b_Frequentist = mean(b_freq),
n = mean(n)) |>
#mutate(stat = ) |>
ggplot(aes(y = variable, x = b_Frequentist, color = estimator, shape = estimator, size = estimator)) +
geom_point() + theme_bw() + geom_vline(xintercept = 0, linetype = 2, color = "#666666") +
scale_size_manual(values = c(2.4,1.8)) +
facet_grid(vars(prob_levels), vars(sheet), scales = "free") +
theme(
#legend.position = c(0.86,1.14),
legend.position = c(0.86,1.24),
legend.margin = unit(x=c(0,0,0,0),units="mm"),
panel.grid.minor.x = element_blank(),
axis.title = element_blank(),
title = element_text(face = "bold"),
strip.text = element_text(face = "bold"),
strip.background = element_rect(fill = "#F6F6F6", color = "#EEEEEE"),
panel.grid.major.x = element_blank()) +
scale_color_manual(values = c("#FF9922", "#2266FF")) +
guides(color = guide_legend(nrow = 1),
size = guide_legend(nrow = 1),
shape = guide_legend(nrow = 1)) +
ggtitle("Impact of model specification on average effect")
Another representation below.
library(ggrepel)
data_graph <- results |>
filter(is.finite(coef), is.finite(aic), estimator %in% c("OLS", "AUG")) |>
mutate(prob_levels = prob_levels |> case_match(
1 ~ "Level",
0 ~ "Difference"
),
estimator = estimator |> case_match(
"OLS" ~ "Standard",
"AUG" ~ "Augmented"
)) |>
group_by(variable, prob_levels, estimator, sheet) |>
mutate(D = (aic - min(aic)),
weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
b_freq = sum(weight_freq * coef, na.rm = T),
n = n()) |>
summarise(s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
b_Frequentist = mean(b_freq),
n = mean(n)) |>
select(-s_Frequentist) |>
filter(prob_levels == "Level") |>
pivot_wider(names_from = "estimator", values_from = "b_Frequentist")
g1 <- data_graph |>
ggplot(aes(y = Augmented, x = Standard)) +
geom_vline(xintercept = 0, color = "#CCCCCC", size = 0.3) + geom_hline(yintercept = 0, color = "#CCCCCC", size = 0.3) +
geom_function(fun = function(x) x, linetype = 2, color = "#2288CC") +
theme_bw() +
theme(panel.grid = element_blank(),
legend.position = "none",
title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
axis.title = element_text(face = "bold")) +
geom_text_repel(aes(label = paste(sheet, variable)),
data = data_graph |> filter(abs(Standard) > 0.0025),
box.padding = 0.1,
min.segment.length = 1,
# segment.curvature = -0.1,
# segment.ncp = 3,
segment.color = "#CCCCCC",
nudge_x = -.004,
nudge_y = .008,
# segment.angle = 20,
color = "#000000",
max.overlaps = 3
) +
geom_point(size = 1.8) +
#scale_color_manual(values = c("#999999", "#000000"), name = "Predictor in:") +
ylab("Augmented model (average effect)") + xlab("Standard model (average effect)") +
ggtitle("Impact of model specification on average effect")
g1
Hence: the augmented specification changes almost nothing.
Now onto periods. Same with first plot: not reported in the paper.
table_periods <- results |>
mutate(period = case_when(
start_quantile == 0 & stop_quantile == 1/2 ~ "first_period",
start_quantile == 1/2 & stop_quantile == 1 ~ "second_period",
)) |>
filter(is.finite(coef), is.finite(aic), !is.na(period)) |>
group_by(variable, prob_levels, period, sheet) |>
mutate(D = (aic - min(aic)),
weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
#b_freq = sum(weight_freq * coef, na.rm = T),
n = n()) |>
select(-t_stat, -p_val, -skew, -sd, -aic, -bic, -sum_sq_err, -var_y, -sample_size, -D, -start_quantile, -stop_quantile) |>
pivot_wider(names_from = "period", values_from = c("coef", "weight_freq")) |>
summarise(test_stat = t.test(coef_first_period*weight_freq_first_period*n-
coef_second_period*weight_freq_second_period*n)$statistic,
n = mean(n),
sd = t.test(coef_first_period*weight_freq_first_period*n-
coef_second_period*weight_freq_second_period*n)$stderr,
diff = sum(coef_first_period*weight_freq_first_period-coef_second_period*weight_freq_second_period)) |>
ungroup() |>
mutate(test_stat_2 = case_when(
abs(test_stat) > 1.96 ~ "significant",
.default = "insignificant"
)) #|>
# summarise(min = min(abs(test_stat)),
# max = max(abs(test_stat)))
results |>
mutate(period = case_when(
start_quantile == 0 & stop_quantile == 1/2 ~ "first_period",
start_quantile == 1/2 & stop_quantile == 1 ~ "second_period",
)) |>
filter(!is.na(period), is.finite(aic)) |>
mutate(prob_levels = prob_levels |> case_match(
1 ~ "Level",
0 ~ "Difference"
)) |>
group_by(variable, prob_levels, period, sheet) |>
mutate(D = (aic - min(aic)),# |> min2(),
weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
b_freq = sum(weight_freq * coef, na.rm = T),
n = n()) |>
summarise(s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
b_Frequentist = mean(b_freq),
n = mean(n)) |>
ggplot(aes(y = variable, x = b_Frequentist, color = period, shape = period, size = period, fill = period)) +
geom_point() + theme_bw() + geom_vline(xintercept = 0, linetype = 2, color = "#666666") +
scale_size_manual(values = c(2.4,1.8)) +
facet_grid(vars(prob_levels), vars(sheet), scales = "free") +
theme(legend.position = c(0.8,1),
panel.grid.minor.x = element_blank(),
axis.title = element_blank(),
strip.text = element_text(face = "bold"),
strip.background = element_rect(fill = "#F6F6F6", color = "#EEEEEE"),
panel.grid.major.x = element_blank()) +
scale_color_manual(values = c("#FF9922", "#2266FF")) +
ggtitle("Impact of periods on average effect")
Now to the more visually appealing representation…
library(ggrepel)
g2 <- results |>
mutate(period = case_when(
start_quantile == 0 & stop_quantile == 1/2 ~ "first_period",
start_quantile == 1/2 & stop_quantile == 1~ "second_period",
)) |>
filter(is.finite(coef), is.finite(aic), !is.na(period)) |>
group_by(variable, prob_levels, sheet, period) |>
mutate(D = (aic - min(aic)),
weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
b_freq = sum(weight_freq * coef, na.rm = T),
n = n(),
# Below we use sart(sample_size in the end to avoid overflow)
lik = sqrt(sample_size/(sample_size+1))/((sum_sq_err+var_y)/(sqrt(sample_size)+1))^(sqrt(sample_size-1)),
weight_bayes = lik/sum(lik, na.rm = T),
b_bayes = sum(coef*weight_bayes, na.rm = T)) |>
summarise(b_Frequentist = mean(b_freq),
b_Bayesian = mean(b_bayes),
n = mean(n)) |>
ungroup() |>
pivot_longer(b_Frequentist:b_Bayesian, names_to = "indicator", values_to = "indicator_values") |>
separate(indicator, into = c("indicator", "avg_type"), sep = "_") |>
pivot_wider(names_from = "indicator", values_from = "indicator_values") |>
pivot_wider(names_from = "period", values_from = "b") |>
filter(avg_type == "Frequentist", prob_levels == 1) |>
bind_cols(table_periods |> filter(prob_levels == 1) |> select(test_stat_2)) |> # For color / significance coding
ggplot(aes(x = first_period, y = second_period, color = test_stat_2)) +
geom_function(fun = function(x) x, linetype = 2, color = "#2288CC", xlim = c(-0.04,0.04)) +
geom_point(size = 2) + xlab("First period average effect") + ylab("Second period average effect") +
theme_bw() + #geom_smooth(method = "lm", se = F) +
geom_hline(yintercept = 0, color = "#999999", size = 0.3) +
geom_vline(xintercept = 0, color = "#999999", size = 0.3) +
theme(axis.title = element_text(face = "bold"),
legend.position = c(0.9,0.8),
panel.grid = element_blank(),
legend.title = element_text(face = "bold"),
title = element_text(face = "bold"),
panel.grid.minor = element_blank()) +
geom_text_repel(aes(label = paste(sheet, variable),
color = test_stat_2),
box.padding = 0.5,
max.overlaps = 20,
min.segment.length = 1) +
scale_color_manual(values = c("#1177DD", "#FF8822"), name = "test statistic") +
ggtitle("Impact of time period on average effect")
g1/g2
# ggsave("cond_mean.pdf", width = 8, height = 7)
Bottomline: however, time period can matter a lot, especially for annual data, which is more scarce.
Below, we dive into the topic of the importance of standard errors of estimators and the importance to resort to HAC corrections. IID estimators are on the x-axis, and their HAC counterparts on the y-axis.
data_graph <- results |>
filter(is.finite(coef), is.finite(aic)) |>
mutate(prob_levels = prob_levels |> case_match(
1 ~ "Level",
0 ~ "Difference"
),
estimator = estimator |> case_match(
"OLS" ~ "Standard",
"HAC" ~ "HAC",
"AUG" ~ "Standard",
"AUG-HAC" ~ "HAC"
)) |>
group_by(variable, prob_levels, estimator, sheet) |>
mutate(D = (aic - min(aic)),
weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
b_freq = sum(weight_freq * abs(t_stat), na.rm = T),
n = n()) |>
summarise(s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-abs(t_stat))^2), na.rm = T),
b_Frequentist = mean(b_freq),
n = mean(n)) |>
select(-s_Frequentist) |>
filter(prob_levels == "Level") |>
pivot_wider(names_from = "estimator", values_from = "b_Frequentist")
data_graph |>
ggplot(aes(y = HAC, x = Standard)) +
geom_vline(xintercept = 0, color = "#CCCCCC", size = 0.3) + geom_hline(yintercept = 0, color = "#CCCCCC", size = 0.3) +
geom_function(fun = function(x) x, linetype = 2, color = "#2288CC", size = 1.05, xlim = c(0,3)) +
theme_bw() +
theme(panel.grid = element_blank(),
legend.position = "none",
title = element_text(face = "bold"),
legend.title = element_text(face = "bold"),
axis.title = element_text(face = "bold")) +
geom_text_repel(aes(label = paste(sheet, variable)),
data = data_graph |> filter(Standard > 1.3),
box.padding = 0.2,
min.segment.length = 1,
# segment.curvature = -0.1,
# segment.ncp = 3,
segment.color = "#CCCCCC",
nudge_x = .004,
nudge_y = -.008,
# segment.angle = 20,
color = "#000000",
max.overlaps = 2
) +
geom_point(size = 1.8) +
geom_vline(xintercept = 1.96, linetype = 3) + geom_hline(yintercept = 1.96, linetype = 3) +
geom_vline(xintercept = 2.56, linetype = 5, color = "#CCCCCC") +
geom_hline(yintercept = 2.56, linetype = 5, color = "#CCCCCC") +
#scale_color_manual(values = c("#999999", "#000000"), name = "Predictor in:") +
ylab("HAC estimator (absolute average t-statistic)") + xlab("Standard (iid) estimator (absolute average t-statistic)") +
annotate("rect", xmin = 1.8, xmax = 2.2, ymin = 2.9, ymax = 3.1, fill = "white") +
annotate("rect", xmin = 2.3, xmax = 2.7, ymin = 2.9, ymax = 3.1, fill = "white") +
annotate("text", label = "5% level", x = 2, y = 3) +
annotate("text", label = "1% level", x = 2.6, y = 3, color = "grey") +
ggtitle("Impact of standard deviation correction on absolute average t-statistic")
ggsave("hac.pdf", width = 8, heigh = 4)
As expected, HAC values are always above. From an inferential point of view, they are more conservative, as they will imply lower t-statistics which will be less likely to reject the null of zero effect.
Below, we focus on the payout predictor.
base <- results |>
filter(variable == "payout", prob_levels == 1, is.finite(coef), is.finite(aic)) |>
group_by(sheet) |>
mutate(D = (aic - min(aic)),
weight_freq = if_else(is.na(D), 0, exp(-D/2) / sum(exp(-D/2), na.rm = T)),
b_freq = sum(weight_freq * coef, na.rm = T),
b_ew = mean(coef),
n = n(),
# Below we use sart(sample_size in the end to avoid overflow)
lik = sqrt(sample_size/(sample_size+1))/((sum_sq_err+var_y)/(sqrt(sample_size)+1))^(sqrt(sample_size-1)),
weight_bayes = lik/sum(lik, na.rm = T),
b_bayes = sum(coef*weight_bayes, na.rm = T)) |>
summarise(s_ew = mean(sqrt(sd^2+(b_ew-coef)^2), na.rm = T),
s_Frequentist = sum(weight_freq * sqrt(sd^2+(b_freq-coef)^2), na.rm = T),
s_Bayesian = sum(weight_bayes * sqrt(sd^2+(b_bayes-coef)^2), na.rm = T),
b_ew = mean(b_ew),
b_Frequentist = mean(b_freq),
b_Bayesian = mean(b_bayes),
n = mean(n)) |>
pivot_longer(-c("sheet", "n"), names_to = "indicator", values_to = "value") |>
separate(indicator, sep = "_", into = c("indicator", "type"))
generate <- function(j){
results |>
filter(variable == "payout", prob_levels == 1, is.finite(coef), is.finite(aic)) |>
group_by(sheet) |>
mutate(n = n(),
weight_random = runif(n),
weight_random = weight_random/sum(weight_random),
b_random = sum(coef*weight_random, na.rm = T)) |>
summarise(s_random = sum(weight_random * sqrt(sd^2+(b_random-coef)^2), na.rm = T),
b_random = mean(b_random),
n = mean(n)) |>
pivot_longer(-c("sheet", "n"), names_to = "indicator", values_to = "value") |>
separate(indicator, sep = "_", into = c("indicator", "type")) |>
mutate(type = paste(type, j))
}
generate(1)
## # A tibble: 6 × 5
## sheet n indicator type value
## <chr> <dbl> <chr> <chr> <dbl>
## 1 Monthly 384 s random 1 0.0113
## 2 Monthly 384 b random 1 -0.00157
## 3 Quarterly 384 s random 1 0.0297
## 4 Quarterly 384 b random 1 0.00123
## 5 Yearly 384 s random 1 0.117
## 6 Yearly 384 b random 1 0.0161
for(j in 1:200){
base <- base |> bind_rows(generate(j))
}
base |>
pivot_wider(names_from = "indicator", values_from = "value") |>
ggplot(aes(x = b, y = n)) +
geom_jitter(size = 0.8,
data = base |> filter(str_starts(type, "r")) |>
pivot_wider(names_from = "indicator", values_from = "value") ) +
theme_bw() +
xlab("Average effect of payout") +
theme(axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
strip.background = element_blank(),
strip.text = element_text(face = "bold"),
axis.title.x = element_text(face = "bold")) +
geom_point(data = base |> filter(type == "ew") |> pivot_wider(names_from = "indicator", values_from = "value") ,
color = "red", size = 4, shape = 17) +
geom_point(data = base |> filter(type == "Frequentist") |> pivot_wider(names_from = "indicator", values_from = "value") ,
color = "#2288DD", size = 4, shape = 17) +
geom_point(data = base |> filter(type == "Bayesian") |> pivot_wider(names_from = "indicator", values_from = "value") ,
color = "#DD8822", size = 4, shape = 17) +
facet_grid(cols = vars(sheet), scales = "free") +
geom_text(data = base |> filter(sheet == "Quarterly") |> pivot_wider(names_from = "indicator", values_from = "value"),
label = "frequentist\n average", x = -0.0019 , y = 384.25, color = "#2288DD", size = 3) +
geom_text(data = base |> filter(sheet == "Yearly") |> pivot_wider(names_from = "indicator", values_from = "value"),
label = "Bayesian\n average", x = 0.0005, y = 383.8, color = "#DD8822", size = 3)
ggsave("payout.pdf", width = 8, height = 2)
This illustrates the effect of averaging: when using particular weights, the meta model can shift heavily towards particular cases.
Lastly: the impact of AIC on effect sizes. In absolute value, |AIC| increases with |effects|.
results |>
filter(variable == "payout", prob_levels == 1, is.finite(coef), is.finite(aic)) |>
ggplot(aes(x = aic, y = abs(coef))) + geom_point() + theme_bw() +
facet_wrap(vars(sheet), scales = "free") + xlab("Akaike Information Criterion") +
geom_smooth(method = "lm", se = F) + ylab("absolute effect") +
theme(strip.background = element_rect(fill = "#DFDFEF"),
axis.title = element_text(face = "bold"))
ggsave("aic.pdf", width = 8, height = 3)
We extract a few numbers from A Comprehensive 2021 Look at the Empirical Performance of Equity Premium Prediction II and provide the ease-to-replicate indicator in the rounded rectangles.
prior <- data.frame(
variable = c("ntis", "svar", "b/m", "payout", "dfr", "dfy"),
effect = c(-0.36, -0.08, 0.29, -0.11, 0.23, 0.06)/100,
first = c(-1.4, 0.01, 1.16, -0.72, 0.08, 0.05)/100,
second = c(-0.19, -0.42, -0.16, 0.35, 1.28, 0.44)/100
)
quant <- 0.9
results |>
filter(prob_levels == 1, sheet == "Monthly", horizon == 1) |>
left_join(prior, by = "variable") |>
group_by(variable) |>
mutate(m = mean(coef, na.rm = T),
sd = sd(coef, na.rm = T),
q = pnorm(abs(effect), mean = abs(m), sd = sd),
q_first = pnorm(abs(first), mean = abs(m), sd = sd),
q_2 = pnorm(abs(second), mean = abs(m), sd = sd),
EtR = 1-if_else(q>quant, (q-quant)/(1-quant), 0),
EtR_first = 1-if_else(q_first>quant, (q_first-quant)/(1-quant), 0),
EtR_2 = 1-if_else(q_2>quant, (q_2-quant)/(1-quant), 0)) |>
ggplot(aes(x = coef)) +
geom_vline(aes(xintercept = effect), color = "black") +
geom_vline(aes(xintercept = first), color = "red", linetype = 2) +
geom_vline(aes(xintercept = second), color = "red") +
geom_histogram(alpha = 0.6) +
facet_wrap(vars(variable), scales = "free") +
theme_classic() +
theme(axis.title = element_blank()) +
geom_label(aes(x = effect, y = 9, label = paste0(round(100*EtR, 1), "%")), size = 2.4) +
geom_label(aes(x = first, y = 11.5, label = paste0(round(100*EtR_first, 1), "%")), size = 2.4, color = "red") +
geom_label(aes(x = second, y = 6.5, label = paste0(round(100*EtR_2, 1), "%")), size = 2.4, color = "red")
ggsave("premium_prior.pdf", width = 9, height = 5)