Introduction

Being honest: ML & money management

  • ML in Finance, the marketing paradox:
    • ML is often a good selling point for funds/strategies;
    • but surprisingly, ML-related returns are seldom impressive.
  • One reason: ML has become ‘easy’ because data is cheap and software free.
    • it’s great: everybody can do ML (NNs can be trained in 10 lines of code)!
    • it’s bad: ‘doing’ ML is not enough.
  • Another reason: pressure leads to backtest overfitting.
  • Yet another reason: the communication bias towards positive results is deceitful:
    • people only talk about success (which is often fragile);
    • solid research/results require transparency, especially on failures.

Possible solutions

  • Successful strategies require a deep understanding of the implementation:
    • at the micro-level (what does the algorithm do exactly?)
    • at the macro-level (do my models/variables make sense?)
    • this is why AutoML (‘one size fits all’) is probably a bad idea.
  • Transparency:
    • sharing code and data (like in the ‘general ML’ community)
    • sharing negative results and acknowledging that seemingly good ideas often don’t work: it’s ok to fail, and it’s even better to admit it.

THE ML challenge

  • Most ML applications in Finance rely on supervised learning (SL), which tries to find patterns in data.
  • The main challenge in supervised learning is to extract the juice from this data.
  • Paraphrasing: it’s very hard to separate noise from signal in financial (and alternative) datasets.
  • What I mean by signal is: the patterns that will hold out-of-sample.
  • A model that captures those patterns is said to generalize well.

The idea & first results

Tony’s idea

SL is mostly about predictions. Assume we’re trying to predict returns (= labels).
\(\rightarrow\) Filter the training sample: keep only the instances with extreme returns.
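
In code, the filter is a one-liner. A minimal R sketch, assuming a training data frame training_sample whose (hypothetical) column R1M holds the future returns used as labels, with a 15% cut on each side:

```r
# Keep only the instances with extreme labels (bottom 15% and top 15% of returns)
q <- quantile(training_sample$R1M, probs = c(0.15, 0.85))
training_extreme <- subset(training_sample, R1M <= q[1] | R1M >= q[2])
```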

About this idea

  • Features versus instances: people often focus on explanatory variables (e.g., through feature selection & processing). I argue that working at the instance level could add value. I look at rows and not columns!
  • Why? Since datasets are full of noise, filtering instances may help find a shorter path towards the signal. Also, this shrinks training times!
  • Intuition: investors care most about high returns (they want them) and low returns (they seek to avoid them). Why bother with the rest?
  • Reproducible finance: these slides were created in RMarkdown (all the material is available at www.gcoqueret.com/tot.html).

The backtest: empirical protocol

  • 1200 US stocks, ~100 features (accounting, risk, return, etc.)
  • 2000-2017 available, 2008-2017 for backtesting (monthly rebalancing)
  • ML engine: boosted trees (100 trees, \(\eta=0.3\))
  1. Every month, I predict the future return of the 1200 stocks.
  2. I group them into 6 groups of 200 stocks: the best, the good, the average, …, the worst.
  3. I form an equally weighted portfolio within each group of 200 stocks.
  4. I store the one-month-ahead return of each portfolio.
  5. I compare the results for each of the 6 groups.

The analysis is performed on all data, on extreme returns only, and on ‘bulk’ (middle-of-the-distribution) returns.
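
For concreteness, a minimal R sketch of one step of this protocol (not the original code: the data frame data_ml, the column names date and R1M, the xgboost package and the keep argument are assumptions; only the 100 trees, \(\eta=0.3\) and the 6 groups of 200 stocks come from the slides):

```r
library(xgboost)
library(dplyr)

# data_ml: one row per (stock, month), ~100 feature columns, a 'date' column
# and the one-month-ahead return 'R1M' used as label (hypothetical names).
backtest_one_date <- function(t, data, features, keep = 1) {
  train <- filter(data, date < t)                          # past observations only
  if (keep < 1) {                                          # optional filter: extreme returns only
    q <- quantile(train$R1M, probs = c(keep / 2, 1 - keep / 2))
    train <- filter(train, R1M <= q[1] | R1M >= q[2])
  }
  test <- filter(data, date == t)                          # current cross-section (1200 stocks)
  fit <- xgboost(data = as.matrix(train[, features]),      # boosted trees: 100 trees, eta = 0.3
                 label = train$R1M,
                 nrounds = 100, eta = 0.3,
                 objective = "reg:squarederror", verbose = 0)
  pred <- predict(fit, as.matrix(test[, features]))
  test %>%
    mutate(group = ntile(-pred, 6)) %>%                    # 6 groups of 200 stocks, best (1) to worst (6)
    group_by(group) %>%
    summarise(port_return = mean(R1M), .groups = "drop")   # equally weighted 1-month returns
}
```

Looping this function over all rebalancing dates from 2008 to 2017 and averaging the realized returns per group gives the comparison described above.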

Results

First, all ML-based portfolios yield average returns that are monotonic w.r.t. their ranking. Second, the average return of the best minus the worst portfolio is shown below:

[Figure: difference between the best and worst portfolios]

How does it work?

Simple trees

[Figure: scheme of a regression tree]

For each split, two questions:
1. Which feature?
2. Which level (of the feature)?
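
For reference, a standard regression tree answers both questions jointly: at each node, it scans all features \(j\) and candidate levels \(c\) and keeps the pair that minimizes the dispersion of the label within the two resulting clusters,

\[
\min_{j,\,c}\ \sum_{i:\, x_{i,j}\le c}\left(y_i-\bar{y}_{-}\right)^2+\sum_{i:\, x_{i,j}> c}\left(y_i-\bar{y}_{+}\right)^2,
\]

where \(\bar{y}_{-}\) and \(\bar{y}_{+}\) are the average labels below and above the split.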

About these 2 questions

  1. Which feature? Is it ok to let the algorithm choose that blindly?
  2. Which level for the feature? Is any level acceptable? Said differently, do micro-clusters generalize well out-of-sample?
  3. What’s the impact of a filter that keeps only the extreme points?

It’s impossible to answer these questions theoretically.

Thus, we resort to a flexible tool: simulations.

One simulation

\[y=x_1/6+\sin(10x_2)/7+\epsilon/20, \quad \epsilon \sim N(0,1)\]

Trees: explaining \(y\) with \(x_1\) and \(x_2\).

The left tree is trained on the raw data; the right tree is trained on the truncated data (top & bottom 15%).
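
A minimal R sketch of this experiment (the rpart package, the uniform design for \(x_1\) and \(x_2\), the sample size and the tree depth are assumptions; the slides only specify the data-generating equation and the 15% truncation):

```r
library(rpart)

set.seed(42)
n  <- 10000
x1 <- runif(n)
x2 <- runif(n)
y  <- x1 / 6 + sin(10 * x2) / 7 + rnorm(n) / 20
sim <- data.frame(y = y, x1 = x1, x2 = x2)

# Left tree: trained on the raw data
tree_raw <- rpart(y ~ x1 + x2, data = sim, maxdepth = 2)

# Right tree: trained on the truncated data (top & bottom 15% of y only)
q <- quantile(sim$y, probs = c(0.15, 0.85))
tree_ext <- rpart(y ~ x1 + x2, data = subset(sim, y <= q[1] | y >= q[2]), maxdepth = 2)

tree_raw   # compare the chosen features and splitting points
tree_ext
```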

Visual explanation

In the first (oscillatory) case, there is one relatively homogeneous cluster and one very inhomogeneous one. The average value in the big cluster is far from all points.

In the linear case, the two clusters contribute equally to dispersion. One cluster is clearly for higher values of y and the other for the lower values.

Takeaways from the simulations

Filtering the training set and retaining only extreme instances:

  • favors variables that have a monotonic impact on the label (barring a few pathological cases);
  • can create more homogeneous clusters by choosing splitting points closer to the center.

Is this favorable? Yes, because:

  • monotonicity is a token of robustness;
  • larger clusters are more reliable than sample idiosyncrasies.

About gains in training times

For trees, the computational cost lies between \(N\log(N)\) and \(N^2\), where \(N\) is the number of instances. Hence, filtering divides training times by two to three.
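
As a back-of-the-envelope illustration (the retained fraction is an assumption, since the exact filter used in the backtest is not spelled out here): if a fraction \(q<1\) of the instances is kept, the cost shrinks roughly by a factor between

\[
\frac{qN\log(qN)}{N\log N}\approx q \quad \text{and} \quad \frac{(qN)^2}{N^2}=q^2,
\]

so keeping half of the instances (\(q=1/2\)) divides training times by a factor of about 2 to 4.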

This is not insignificant because the multiplication of dimensions is truly an issue:

  • many strategies;
  • hundreds or thousands of assets;
  • dozens or hundreds of features;
  • dozens or hundreds of rebalancing dates in a dynamic backtest;
  • \(+\) hyperparameter tuning!

This can become overwhelming. Any gain is welcome.

Conclusion

It’s all about structure

  • Blind machine learning algorithms all tend to overfit on past patterns.
  • The problem is that, unlike in many ML problems (for which invariance is key), these patterns change through time (macroeconomic conditions, investor beliefs, etc.). Solutions include:
    • ‘adding’ macro-variables to the set of predictors;
    • including time-varying views / preferences.
  • One natural safeguard against overfitting is to add some structure to the models. Often, this requires a more complex optimization: sometimes a little more (constraints), sometimes a lot more (GANs for asset pricing).
  • Surprisingly, altering the training data can lead the model to be more structured around solid (lasting?) relationships between the features and the label.
  • \(+\) Monotonicity constraints in trees (see the sketch below).
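
A minimal sketch of the last point, assuming the monotone_constraints parameter of the xgboost R package (the data-generating process reuses the simulation above and is purely illustrative):

```r
library(xgboost)

set.seed(0)
x <- matrix(runif(2000), ncol = 2)                      # two features, 1000 instances
y <- x[, 1] / 6 + sin(10 * x[, 2]) / 7 + rnorm(1000) / 20

# Force the prediction to be non-decreasing in the first feature,
# leave the second feature unconstrained: "(1,0)"
fit <- xgboost(data = x, label = y,
               nrounds = 100, eta = 0.3,
               monotone_constraints = "(1,0)",
               objective = "reg:squarederror", verbose = 0)
```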

Thank you for your attention






Any questions?