# install.packages("tidyverse")
# install.packages(c("tidymodels", "randomForest", "doParallel", "torch", "caTools", "tabnet", "lime", "luz"))
Credit risk with machine learning
WARNING: this notebook relies on several R packages. Whenever one is called (for instance via a call such as “library(torch)”), this implies that the package has been installed, e.g., via “install.packages(”torch”)”. Hence, make sure the following packages are installed!
NOTE: in data science, one of the most crucial competence is domain knowledge.
Data is key (more than any algorithm) and domain knowledge helps understand how it is built and which features are likely to matter most.
Introduction
Principles
Credit risk pertains to situations when a lender (e.g., a bank) face uncertainty so as to whether it will receive debt payments. Basically, there are two types of customers: individuals (e.g., people wanting to buy a car or a house), and firms (willing to finance projects, acquisitions, etc.).
There are several quantities that are important to lenders:
- the probability of default (PD), which is self-explanatory.
- the loss given default (LGD): when the default occurs, the amount of money that will be lost (1-recovery rate).
- the exposure at default (EAD): how much money was still owed when the default occurs.
- the expected loss (ED): average amount of money lost, equal to ED = PD * LGD * EAD.
The EAD is easy to compute and the recovery rate is often approximated accurately. The most complicated part is the probability of default….
Basically, in order to be able to understand the drivers of these quantities, we need information on the customers, i.e., DATA!
- For individuals (credit scoring), this usually encompasses: age, gender, income, wealth, job type, payment history (prior defaults), etc.
- For corporations (credit rating), lenders consult credit risk agencies (Standard & Poors, Moodys, Fitch Ratings) that ave developed complex models that usually rely on accounting data linked to revenue, debt, cash-flows, financial ratios, prior credit history, macro-economic variables, etc.
A few free references
Below, we list a few interesting resources on the topic of credit risk and credit scoring:
- Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring
- An Overview on the Landscape of R Packages for Open Source Scorecard Modelling -arXiv version here
- Credit scoring methods: Latest trends and points to consider
- Credit Scoring Approaches Guidelines from the World Bank
- this github repo lists free credit risk related datasets
Data
In the course part, we dig into only one of these examples.
There are lots of datasets online, but to find neat ones (good size, format, clear column names).
The data we rely one below comes from this Kaggle competition.
First, let’s load the packages & dataset.
IMPORTANT: the data (.csv file) is also available on this drive link.
\(\rightarrow\) Please download it and place it in the same folder as the notebook.
library(tidyverse) # Data wrangling
library(tidymodels) # ML workflow
<- read_csv("application_train.csv") # Load data data
The data is super large for a “normal” computer. We don’t want training times that are too long.
So let’s shrink its size (you can skip this step if you want the “full” experience!).
To do so, have a look at missing data first.
We locate the columns with more that 25% of missing data.
Then, remove these columns.
<- colMeans(is.na(data)) |>
missing_names data.frame() |> # Convert to dataframe
rownames_to_column() |> # Shift row names to a column
`colnames<-` (c("name", "missing")) |> # Change column names
filter(missing > 0.25) # Show features with lots of missing data
|> head(12) # First 12 names missing_names
name | missing |
---|---|
OWN_CAR_AGE | 0.6599081 |
OCCUPATION_TYPE | 0.3134555 |
EXT_SOURCE_1 | 0.5638107 |
APARTMENTS_AVG | 0.5074973 |
BASEMENTAREA_AVG | 0.5851596 |
YEARS_BEGINEXPLUATATION_AVG | 0.4878102 |
YEARS_BUILD_AVG | 0.6649778 |
COMMONAREA_AVG | 0.6987230 |
ELEVATORS_AVG | 0.5329598 |
ENTRANCES_AVG | 0.5034877 |
FLOORSMAX_AVG | 0.4976082 |
FLOORSMIN_AVG | 0.6784863 |
<- data |> # Remove these features
data select(-(missing_names |> pull(name)))
Let’s have a look at column names…
colnames(data)
[1] "SK_ID_CURR" "TARGET"
[3] "NAME_CONTRACT_TYPE" "CODE_GENDER"
[5] "FLAG_OWN_CAR" "FLAG_OWN_REALTY"
[7] "CNT_CHILDREN" "AMT_INCOME_TOTAL"
[9] "AMT_CREDIT" "AMT_ANNUITY"
[11] "AMT_GOODS_PRICE" "NAME_TYPE_SUITE"
[13] "NAME_INCOME_TYPE" "NAME_EDUCATION_TYPE"
[15] "NAME_FAMILY_STATUS" "NAME_HOUSING_TYPE"
[17] "REGION_POPULATION_RELATIVE" "DAYS_BIRTH"
[19] "DAYS_EMPLOYED" "DAYS_REGISTRATION"
[21] "DAYS_ID_PUBLISH" "FLAG_MOBIL"
[23] "FLAG_EMP_PHONE" "FLAG_WORK_PHONE"
[25] "FLAG_CONT_MOBILE" "FLAG_PHONE"
[27] "FLAG_EMAIL" "CNT_FAM_MEMBERS"
[29] "REGION_RATING_CLIENT" "REGION_RATING_CLIENT_W_CITY"
[31] "WEEKDAY_APPR_PROCESS_START" "HOUR_APPR_PROCESS_START"
[33] "REG_REGION_NOT_LIVE_REGION" "REG_REGION_NOT_WORK_REGION"
[35] "LIVE_REGION_NOT_WORK_REGION" "REG_CITY_NOT_LIVE_CITY"
[37] "REG_CITY_NOT_WORK_CITY" "LIVE_CITY_NOT_WORK_CITY"
[39] "ORGANIZATION_TYPE" "EXT_SOURCE_2"
[41] "EXT_SOURCE_3" "OBS_30_CNT_SOCIAL_CIRCLE"
[43] "DEF_30_CNT_SOCIAL_CIRCLE" "OBS_60_CNT_SOCIAL_CIRCLE"
[45] "DEF_60_CNT_SOCIAL_CIRCLE" "DAYS_LAST_PHONE_CHANGE"
[47] "FLAG_DOCUMENT_2" "FLAG_DOCUMENT_3"
[49] "FLAG_DOCUMENT_4" "FLAG_DOCUMENT_5"
[51] "FLAG_DOCUMENT_6" "FLAG_DOCUMENT_7"
[53] "FLAG_DOCUMENT_8" "FLAG_DOCUMENT_9"
[55] "FLAG_DOCUMENT_10" "FLAG_DOCUMENT_11"
[57] "FLAG_DOCUMENT_12" "FLAG_DOCUMENT_13"
[59] "FLAG_DOCUMENT_14" "FLAG_DOCUMENT_15"
[61] "FLAG_DOCUMENT_16" "FLAG_DOCUMENT_17"
[63] "FLAG_DOCUMENT_18" "FLAG_DOCUMENT_19"
[65] "FLAG_DOCUMENT_20" "FLAG_DOCUMENT_21"
[67] "AMT_REQ_CREDIT_BUREAU_HOUR" "AMT_REQ_CREDIT_BUREAU_DAY"
[69] "AMT_REQ_CREDIT_BUREAU_WEEK" "AMT_REQ_CREDIT_BUREAU_MON"
[71] "AMT_REQ_CREDIT_BUREAU_QRT" "AMT_REQ_CREDIT_BUREAU_YEAR"
The “FLAG_” variables don’t seem super clear. Let’s remove them too.
<- data |>
data select(!starts_with("FLAG_")) |> # Remove columns that start with FLAG_
select(-SK_ID_CURR) # This column won't be useful as well
Finally, because the dataset is still very large, let’s remove all observations with missing data.
Since we will perform a classification task, the target should be a factor, i.e., a categorical variable.
This is required by some packages (e.g., randomForest).
<- data |>
data na.omit() |> # Remove rows with missing points
mutate(TARGET = as.factor(TARGET), # Categorize TARGET (for classification)
across(where(is.character), as.factor))
|> head() data
TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Cash loans | M | 0 | 202500 | 406597.5 | 24700.5 | 351000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648 | -2120 | 1 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.2629486 | 0.1393758 | 2 | 2 | 2 | 2 | -1134 | 0 | 0 | 0 | 0 | 0 | 1 |
0 | Revolving loans | M | 0 | 67500 | 135000.0 | 6750.0 | 135000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260 | -2531 | 1 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.5559121 | 0.7295667 | 0 | 0 | 0 | 0 | -815 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | Cash loans | M | 0 | 99000 | 490495.5 | 27517.5 | 454500 | Spouse, partner | State servant | Secondary / secondary special | Married | House / apartment | 0.035792 | -16941 | -1588 | -4970 | -477 | 2 | 2 | 2 | WEDNESDAY | 16 | 0 | 0 | 0 | 0 | 0 | 0 | Other | 0.3542247 | 0.6212263 | 0 | 0 | 0 | 0 | -2536 | 0 | 0 | 0 | 0 | 1 | 1 |
0 | Cash loans | F | 1 | 171000 | 1560726.0 | 41301.0 | 1395000 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.035792 | -13778 | -3130 | -1213 | -619 | 3 | 2 | 2 | SUNDAY | 16 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.7239999 | 0.4920601 | 1 | 0 | 1 | 0 | -1562 | 0 | 0 | 0 | 1 | 1 | 2 |
0 | Cash loans | M | 0 | 360000 | 1530000.0 | 42075.0 | 1530000 | Unaccompanied | State servant | Higher education | Married | House / apartment | 0.003122 | -18850 | -449 | -4597 | -2379 | 2 | 3 | 3 | MONDAY | 16 | 0 | 0 | 0 | 0 | 1 | 1 | Other | 0.7142793 | 0.5406545 | 2 | 0 | 2 | 0 | -1070 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | Cash loans | F | 0 | 112500 | 1019610.0 | 33826.5 | 913500 | Children | Pensioner | Secondary / secondary special | Married | House / apartment | 0.018634 | -20099 | 365243 | -7427 | -3514 | 2 | 2 | 2 | WEDNESDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.2057473 | 0.7517237 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
There, now we have a clean sample to work with! But it still may not be optimal. This remark comes a posteriori after fitting a few models. This requires to dive more deeply into the characteristics of the data.
One potential source of problems is categorical columns. Some have too many categories which will create too much trouble later on. For instance, ORGANIZATION_TYPE has 57 possible values… Also, the weekday of the beginning of the process does not seem super important. Moreover, to enhance the quality of the data, we transform textual columns into R factors. This is only useful because the randomForest package which we use first handles categorical data well. For many other models, like boosted trees of neural networks. This will not help unfortunately.
<- data |>
data select(-ORGANIZATION_TYPE, -WEEKDAY_APPR_PROCESS_START) |>
mutate(across(where(is.character), as.factor)) # All text columns to categories
Lastly, after playing with some models, it turns our that some variables do not have enough variations to be relevant (post sub-sampling). Hence, we remove them too. This is done manually ex-post.
<- data |> select(-LIVE_REGION_NOT_WORK_REGION,
data -LIVE_CITY_NOT_WORK_CITY,
-REG_CITY_NOT_LIVE_CITY,
-REG_REGION_NOT_LIVE_REGION,
-REG_CITY_NOT_WORK_CITY,
-REG_REGION_NOT_WORK_REGION,
-DEF_30_CNT_SOCIAL_CIRCLE,
-DEF_60_CNT_SOCIAL_CIRCLE,
-AMT_REQ_CREDIT_BUREAU_HOUR,
#-AMT_REQ_CREDIT_BUREAU_DAY # Already removed
-AMT_REQ_CREDIT_BUREAU_WEEK,
-AMT_REQ_CREDIT_BUREAU_MON,
-AMT_REQ_CREDIT_BUREAU_QRT)
For training time purposes (again!), we will reduce the data to 40k rows.
But the following step can be easily omitted to train on the full set.
<- data[(1:40000)*5,] # Keeps only 40k observations data
Baseline learning
Preparation
As is usual in machine learning, we will split the data in two (we omit validation for simplicity).
In the first exercise, we will use cross-validation. For more on that, we refer to the tidymodels in general and to rsample in particular. Below, we show a scheme of the packages within the tidymodels and the stage at which they are used.
First: data preparation with training/testing splitting.
<- initial_split(data, prop = 0.5) # 0.5 to save time on training...
data_split <- training(data_split) |> vfold_cv(v = 5) # Creates the folds
data_train <- testing(data_split) data_test
First model with random forests
For this course, we will first use random forests. To find available models in the tidymodels/parsnip. The corresponding model is rand_forest().
The advantage of using tidymodels is that it gathers many different ML packages into a unique framework. One important choice (excluding ensembles) is which family of algorithm you want to work with. Now, there are several implementations of Random Forests in R, and we pick randomForest.
The code below is inspired from the tutorial.
Basically, we first select a model family (random forest), then the package/library that implements it, and then finally the supervising task (classification versus regression)
library(randomForest) # Just in case...
<- # Here we define the model
rf_model rand_forest(
mtry = tune(), # The hyperparameters will be tuned!
trees = tune()
|>
) set_engine("randomForest") |> # Which package?
set_mode("classification") # Classification or regression
Next, we move on to the definition of the grid (for the search).
In terms of parameters:
- mtry is the number of variables randomly chosen as predictors
- trees is the number of trees trained in the forest
- sampsize is the number of observations randomly sampled to train one given tree
- nodesize is the minimum number of observations required in the terminal leaves of all trees
- maxnodes is the maximum number of leaves for all trees
Below, we consider only the first two.
<- grid_regular(mtry(range = c(10, 20)), # There are dedicated function for parameters!
rf_grid trees(range = c(5, 20)), # Small numbers to reduce CPU time
levels = c(2,3)) # 2 & 3 choices for 1st and 2nd parameters
rf_grid
mtry | trees |
---|---|
10 | 5 |
20 | 5 |
10 | 12 |
20 | 12 |
10 | 20 |
20 | 20 |
set.seed(42)
<- workflow() |>
rf_wf add_model(rf_model) |>
add_formula(TARGET ~ .) # This means we predict TARGET with all other variables
Below, we launch the training process across the grid values.
Because training takes some time, we resort to parallelization.
Whenever you have 4+ cores in your computer, this can help you divide training times by 2+.
library(doParallel)
::registerDoParallel(6) # Specifies the number of cores
doParallel<- rf_wf |>
rf_res tune_grid( # This will train on all params
resamples = data_train,
grid = rf_grid
)show_best(rf_res, metric = "accuracy") # Shows the best params
mtry | trees | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|
10 | 20 | accuracy | binary | 0.92145 | 5 | 0.0017489 | Preprocessor1_Model5 |
20 | 20 | accuracy | binary | 0.92100 | 5 | 0.0012068 | Preprocessor1_Model6 |
10 | 12 | accuracy | binary | 0.92025 | 5 | 0.0015062 | Preprocessor1_Model3 |
20 | 12 | accuracy | binary | 0.91890 | 5 | 0.0018125 | Preprocessor1_Model4 |
10 | 5 | accuracy | binary | 0.91295 | 5 | 0.0013309 | Preprocessor1_Model1 |
Average accuracy seems high…
But at the same time, there are few instances of strictly positive TARGET values.
This is a well known issue with highly unbalanced datasets.
mean(data$TARGET |> as.character() |> as.numeric() == 0)
[1] 0.9218
Hence, if we predict 0 each time, we would be correct 92% of the time. This level of accuracy for our algorithm is therefore not impressive… more on that later.
A neural network with torch
Introduction
Next we move to another family of predictive tools (supervised learning algorithms): neural networks!
There are two main frameworks, mostly developed in Python: pytorch (developed by Meta) and keras/tensorflow (developed by Google). Here we want to test the first, but without using Python, hence we will rely on the foundational library, which is torch.
library(torch)
Below we prepare the data in torch fashion (not super user friendly…).
torch is an ML framework coded in Lua/C. In Python => Pytorch, in R => torch for R.
First, we need to re-prepare the data because NNs can only take numerical inputs.
Moreover, the scales are all over the place… and this is also bad for NNs.
<- data |> select(where(is.numeric))
data_nn
<- function(v){(v-min(v))/(max(v)-min(v))}
scale_0_1 <- data_nn |>
data_nn mutate(across(everything(), scale_0_1))
<- bind_cols(TARGET = data$TARGET, data_nn)
data_nn $TARGET <- data_nn$TARGET |> as.character() |> as.numeric() data_nn
Data preparation
The idea is to start preparing everything, especially for (random) batches.
The code below is inspired from this post.
<- dataset(
scoring_dataset name = "scoring_dataset",
initialize = function(index) {
$data <- self$prepare_scoring_data(index)
self
},
.getitem = function(index) {
<- self$data[index, 2:-1] # Predictors
x <- self$data[index, 1]$to(torch_long()) # Target
y list(x, y)
},
.length = function() {
$data$size()[[1]]
self
},
prepare_scoring_data = function(index) {
<- na.omit(data_nn |> select(where(is.numeric)))
input <- input[index, ] # Here we slice the data (too long)
input <- data.matrix(input)
input torch_tensor(input)
} )
Let’s have a look.
<- scoring_dataset(1:5000)
score_data $.length() score_data
[1] 5000
$.getitem(1) score_data
[[1]]
torch_tensor
0.0000
0.1498
0.3708
0.1790
0.3708
0.0360
0.3623
0.0446
0.7975
0.6368
0.0667
1.0000
1.0000
0.6957
0.8354
0.6045
0.0057
0.0058
0.7424
0.0000
0.0000
[ CPUFloatType{21} ]
[[2]]
torch_tensor
0
[ CPULongType{} ]
<- score_data |> dataloader(batch_size = 32) dl
Model architecture
Below, we specify the network’s architecture… There are a couple of important functions:
- nn_linear() is a dense layer, the most common used ones
- nnf_relu() and nnf_sigmoid() are activation functions
- nn_dropout() is a layer that randomly removes units
The full reference is here.
<- nn_module(
net "ScoringNet",
initialize = function() {
$fc1 <- nn_linear(21, 64) # There are 21 predictors in the data (1st number)
self$fc2 <- nn_linear(64, 8)
self$fc3 <- nn_linear(8,1)
self
},# Below, we code the feed-forward structure
forward = function(x) {
|>
x $fc1() |>
selfnnf_relu() |>
$fc2() |>
selfnnf_relu() |>
$fc3() |>
selfnnf_sigmoid()
}
)<- net() model
The cheat sheet gives more functions/options.
Next, the optimizer (gradient descent)… with the learning rate.
<- optim_sgd(model$parameters, lr = 0.01) optimizer
Training
And off we go! The syntax is quite puzzling… (more on that below)
for(epoch in 1:10) { # Loop on epochs
<- c()
l ::loop(for (b in dl) { # Loop on batchs
coro$zero_grad() # Initialize gradients to zero
optimizer<- model(b[[1]])$to(torch_float()) # Forward pass
output <- nnf_binary_cross_entropy(output,
loss 2]]$to(torch_float())) # Computes loss
b[[$backward() # Computes dloss/dx for every parameter x
loss$step() # Parameter update
optimizer<- c(l, loss$item()) # Stores loss of batch
l
})cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))
}
Loss at epoch 1: 0.563368
Loss at epoch 2: 0.296720
Loss at epoch 3: 0.269932
Loss at epoch 4: 0.268180
Loss at epoch 5: 0.267451
Loss at epoch 6: 0.266772
Loss at epoch 7: 0.266096
Loss at epoch 8: 0.265426
Loss at epoch 9: 0.264764
Loss at epoch 10: 0.264111
Then, onto predictions.
REMEMBER: neural networks are initialized randomly, hence predictions are random too!
<- scoring_dataset(1001:1500)
score_data <- score_data %>% dataloader(batch_size = 1)
dl_test
<- c()
preds ::loop(for (b in dl_test) {
coro<- model(b[[1]])
output <- c(preds, output %>% as.numeric())
preds
})# Accuracy below: we round predictions...
mean(round(preds) == as_array(score_data$.getitem(1:500)[[2]]) )
[1] 0.934
Evaluation
Again, the problem of defaults (and frauds) is that datasets are highly imbalanced. In this case, it is common to compute the AUC: area under the ROC curve.
library(caTools)
colAUC(X = preds,
y = as_array(score_data$.getitem(1:500)[[2]]),
plotROC = TRUE)
[,1]
0 vs. 1 0.6949581
Extension: use embeddings to include categorical variables as in this example.
Simpler syntax
NOTE torch is known to have a pythonic (& arid) syntax, especially compared to TF-Keras.
This is why initiative to ease this have blossomed. In R, one such attempts is the luz package, which makes fitting a lot easier to code.
library(luz)
<- score_data |> dataloader(batch_size = 32)
dl_train <- score_data |> dataloader(batch_size = 32)
dl_test
<- net %>%
fit_luz setup(
metrics = list(luz_metric_accuracy()),
optimizer = optim_adam,
loss = function(input, target) {
nnf_binary_cross_entropy(input, target$float()$unsqueeze(-1))
}%>%
) fit(dl_train, epochs = 10, valid_data = dl_test)
Tabnets
Standard neural networks often perform less well than tree ensembles on tabular data.
Tabular networks were introduced to improve their competitiveness.
In R, tabnet is embedded in tidymodels.
library(tabnet)
<- initial_split(data_nn |> mutate(TARGET = as.factor(TARGET)),
data_split_nn prop = 0.5) # 0.5 to save time on training...
<- training(data_split_nn)
data_train_nn <- testing(data_split_nn) data_test_nn
Then, the model. We refer to the documentation (and the original paper!) for the meaning of the parameters. We reproduce the description below:
- penalty: the extra sparsity loss coefficient as proposed in the original paper. The bigger this coefficient is, the sparser your model will be in terms of feature selection. Depending on the difficulty of your problem, reducing this value could help.
- batch_size: (int) Number of examples per batch, large batch sizes are recommended. (default: 1024^2).
- learn_rate: initial learning rate for the optimizer. (<<1)
- decision_width: (int) Width of the decision prediction layer. Bigger values gives more capacity to the model with the risk of overfitting. Values typically range from 8 to 64.
- attention_width: (int) Width of the attention embedding for each mask. According to the paper n_d=n_a is usually a good choice. (default=8).
- num_steps: (int) Number of steps in the architecture (usually between 3 and 10).
- feature_reusage: (float) This is the coefficient for feature reusage in the masks. A value close to 1 will make mask selection least correlated between layers. Values range from 1.0 to 2.0.
- virtual_batch_size: (int) Size of the mini batches used for “Ghost Batch Normalization” (default=256^2).
- num_independent: Number of independent Gated Linear Units layers at each step. Usual values range from 1 to 5.
- num_shared: Number of shared Gated Linear Units at each step Usual values range from 1 to 5.
- momentum: Momentum for batch normalization, typically ranges from 0.01 to 0.4 (default=0.02).
<- tabnet(epochs = 5,
tab_mod batch_size = 512,
decision_width = 8,
attention_width = 8,
num_steps = 4,
penalty = 0.0001,
virtual_batch_size = 512,
momentum = 0.3,
feature_reusage = 1.5,
learn_rate = 0.001) %>%
set_engine("torch", verbose = TRUE) %>%
set_mode("classification")
<- workflow() |>
tab_wf add_model(tab_mod) |>
add_formula(TARGET ~ .)
<- tab_wf %>% fit(data_train_nn) fitted_tabnet
[Epoch 001] Loss: 0.494779
[Epoch 002] Loss: 0.371300
[Epoch 003] Loss: 0.320907
[Epoch 004] Loss: 0.305884
[Epoch 005] Loss: 0.291652
To access training data/performance:
# fitted_tabnet$fit$fit$fit$fit$metrics
To predict:
<- predict(fitted_tabnet, data_test_nn)
pred_nn mean(pred_nn$.pred_class == data_test_nn$TARGET)
[1] 0.92215
Hard to beat the 92% threshold!
Interpretability
Finance is often a critical discipline because it involves money.
Hence, when people rely on a black-box algorithm for decision-making, they usually prefer to understand why the algorithm comes to particular conclusions.
To do so, we need to “white-box” the outcome of the predictions.
Of course, there are many ways to do that, but there is one (among others) important dichotomy:
- global interpretability: in this case, we seek to understand how the model works on a large set of observations, for instance, on the whole training set.
- local interpretability: in this case, the focus is set on one (or a few) observations and the aim is to understand how the model behaves for this particular point.
One popular approach for the latter is Local Interpretable Model-agnostic Explanation: LIME.
The idea is sketched below.
The original (complex) model is \(f\) and it is approximated at instance \(x\) by the interpretable model \(g\) that belongs to a large class \(G\). The vicinity of \(x\) is denoted \(\pi_x\) and the complexity of \(g\) is written \(\Omega(g)\). LIME seeks an interpretation of the form \[\xi(x)=\underset{g \in G}{\text{argmin}} \, \mathcal{L}(f,g,\pi_x)+\Omega(g),\] where \(\mathcal{L}(f,g,\pi_x)\) is the loss function (error/imprecision) induced by \(g\) in the vicinity \(\pi_x\) of \(x\).
The penalisation \(\Omega(g)\) is for instance the number of leaves or depth of a tree or the number of predictors in a linear regression.
The vicinity of \(x\) is defined by \(\pi_x(z)=e^{-D(x,z)^2/\sigma^2},\) where \(D\) is some distance measure. Note: this function decreases when \(z\) shifts away from \(x\).
The tricky part is the loss function. In order to minimise it, LIME generates artificial samples close to \(x\) and averages/sums the error on the label that the simple representation makes. The formulation is the following: \[\mathcal{L}(f,g,\pi_x)=\sum_z \pi_x(z)(f(z)-g(z))^2\] the errors are weighted according to their distance from the initial instance \(x\).
Visually, one representation below from the original paper.
::include_graphics("images/lime.png") knitr
Here, the task is classification: blue circles versus red crosses. The model is complex and clearly not linear.
Hence, the points and crosses are sampled in the vicinity of the bold red cross. This gives an idea of how the model behaves locally around the target point. Then, a linear model is fitted on these points a yields the dotted line. Locally, i.e., close to the bold cross, it works well: to the left are the crosses and to the right the blue circles.
Let’s try this! NOTE the code below is adapted from: http://www.mlfactor.com/interp.html#lime For this test, we briefly re-train a Random Forest model like the ones used above.
library(lime)
<- randomForest( # Training the RF model
rf_model_2 x = data |> select(-TARGET),
y = data$TARGET,
mtry = 18, # 18 predictors for each tree
trees = 15, # 15 trees in the forest
sampsize = 25000, # 20k observations used for each tree
nodesize = 30, # At least 30 obs in each leaf
maxnodes = 32, # No more than 32 leaves in total for each tree
)
Once we have the model, LIME proceed in two steps.
We seek explanations for the first 2 observations in the sample.
<- lime(data |> select(-TARGET),
out_lime model = as_classifier(rf_model_2))
Warning: AMT_REQ_CREDIT_BUREAU_DAY does not contain enough variance to use
quantile binning. Using standard binning instead.
<- explain(x = data |>
explanation select(-TARGET) |>
slice(1:2), # First two instances in train_sample
explainer = out_lime, # Explainer variable created above
labels = TRUE,
n_permutations = 900, # Nb samples for loss function
n_features = 7 # Nb of features shown (important ones)
)plot_features(explanation, ncol = 1) # Visual display
The blue bars go in the direction of the final predictions, the red ones go against.
NOTE: unfortunately, the providers of the data would not specify what the “EXT_SOURCE” variables refer to…