SL is mostly about predictions. Assume we’re trying to predict returns (= labels).
\(\rightarrow\) Filter the training sample: keep only the instances with extreme returns.
The analysis is performed on all data, on extreme returns, and on ‘bulk’ returns.
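A toy sketch of this three-way split, on purely synthetic data (the 20%/80% quantile cutoffs below are an illustrative assumption, not values from the slides):

```python
# Hypothetical illustration: split a training sample into "extreme" and
# "bulk" instances according to the quantiles of the return labels.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1_000, 5))          # made-up feature matrix
y = rng.normal(scale=0.05, size=1_000)   # made-up returns (= labels)

lo, hi = np.quantile(y, [0.20, 0.80])    # illustrative cutoffs for "extreme" returns
extreme = (y <= lo) | (y >= hi)          # bottom & top quintiles

X_all,     y_all     = X, y                       # full sample
X_extreme, y_extreme = X[extreme], y[extreme]     # extreme returns only
X_bulk,    y_bulk    = X[~extreme], y[~extreme]   # 'bulk' returns only
```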
First, all ML-based portfolios yield returns that are monotonic with respect to their rankings. Second, the average return of the best portfolio minus that of the worst one is:
[Figure: Difference between best and worst portfolios]
[Figure: Scheme of a regression tree]
For each split, two questions (sketched below):
1. Which feature?
2. Which level (of the feature)?
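Mechanically, a CART-style regression tree answers both by an exhaustive search over candidate splits, keeping the one that minimizes total within-node dispersion. A minimal, naive sketch of that criterion (written here for illustration, not the slides' code):

```python
# Naive split search: for every feature and every observed level, compute the
# total sum of squared errors (SSE) of the two resulting children and keep
# the best (feature, level) pair.
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)            # (feature index, level, total SSE)
    for j in range(X.shape[1]):            # question 1: which feature?
        for level in np.unique(X[:, j]):   # question 2: which level?
            left, right = y[X[:, j] <= level], y[X[:, j] > level]
            if len(left) == 0 or len(right) == 0:
                continue                   # skip degenerate splits
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, level, sse)
    return best
```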
Which feature and which level end up being selected is impossible to determine theoretically.
Thus, we resort to a flexible tool: simulations.
\[y=x_1/6+\sin(10x_2)/7+\epsilon/20, \quad \epsilon \sim N(0,1)\]
The left tree is trained with the raw data; the right tree is trained with the truncated data (top & bottom 15%).
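A minimal reproduction of this setup, under assumptions not stated above: \(x_1, x_2\) drawn uniformly on \([0,1]\), depth-one trees, and 'truncated' read as keeping only the top and bottom 15% of the labels:

```python
# Simulate y = x1/6 + sin(10*x2)/7 + eps/20 and compare the first split of a
# tree fitted on the raw sample vs. one fitted on the filtered (extreme) sample.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 10_000
X = rng.uniform(size=(n, 2))                       # x1, x2 ~ U(0, 1) (assumed)
y = X[:, 0] / 6 + np.sin(10 * X[:, 1]) / 7 + rng.normal(size=n) / 20

tree_raw = DecisionTreeRegressor(max_depth=1).fit(X, y)

lo, hi = np.quantile(y, [0.15, 0.85])              # keep top & bottom 15% only
keep = (y <= lo) | (y >= hi)
tree_filtered = DecisionTreeRegressor(max_depth=1).fit(X[keep], y[keep])

# Which feature (0 = x1, 1 = x2) and which threshold does each tree pick first?
print(tree_raw.tree_.feature[0], tree_raw.tree_.threshold[0])
print(tree_filtered.tree_.feature[0], tree_filtered.tree_.threshold[0])
```

The point is simply to see whether filtering changes which feature and which level the first split selects.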
In the first (oscillatory) case, there is one relatively homogeneous cluster and one very inhomogeneous one. The average value in the big cluster is far from all points.
In the linear case, the two clusters contribute equally to the dispersion: one cluster clearly gathers the higher values of y and the other the lower values.
Filtering the training set and retaining only extreme instances:
Is this favorable? Yes, because:
For trees, the computational cost lies between \(N\log(N)\) and \(N^2\), where \(N\) is the number of instances. Hence, filtering divides training times by a factor of two to three.
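As a rough back-of-the-envelope check, write \(q\) for the fraction of instances retained after filtering (a symbol introduced here for the argument, not on the original slide):

\[
\frac{\text{cost}_{\text{filtered}}}{\text{cost}_{\text{full}}} \approx \frac{qN\log(qN)}{N\log(N)} \lesssim q
\qquad \text{or} \qquad
\frac{(qN)^2}{N^2} = q^2,
\]

so the speed-up is at least proportional to \(1/q\) under the \(N\log(N)\) bound, and quadratic in \(1/q\) under the \(N^2\) bound.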
This gain is not insignificant, because the multiplication of dimensions is a real issue.
This can become overwhelming. Any gain is welcome.
Any questions?