Credit risk with machine learning

WARNING: this notebook relies on several R packages. Whenever one is loaded (for instance via a call such as library(torch)), it must first have been installed, e.g., via install.packages("torch"). Hence, make sure the following packages are installed!

# install.packages("tidyverse") 
# install.packages(c("tidymodels", "randomForest", "doParallel", "torch", "caTools", "tabnet", "lime", "luz"))

NOTE: in data science, one of the most crucial competences is domain knowledge.
Data is key (more than any algorithm) and domain knowledge helps understand how it is built and which features are likely to matter most.

Introduction

Principles

Credit risk pertains to situations in which a lender (e.g., a bank) faces uncertainty as to whether it will receive debt payments. Basically, there are two types of customers: individuals (e.g., people wanting to buy a car or a house) and firms (seeking to finance projects, acquisitions, etc.).

There are several quantities that are important to lenders:

  • the probability of default (PD), which is self-explanatory.
  • the loss given default (LGD): the share of the exposure that is lost when default occurs (1 - recovery rate).
  • the exposure at default (EAD): how much money is still owed when the default occurs.
  • the expected loss (EL): the average amount of money lost, equal to EL = PD * LGD * EAD (see the small numeric example below).
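
To fix ideas, here is a tiny numeric illustration of the formula above (the figures are purely hypothetical):

PD  <- 0.02      # 2% probability of default (hypothetical)
LGD <- 0.60      # 60% of the exposure is lost upon default (1 - recovery rate)
EAD <- 100000    # 100,000 still owed at the time of default
PD * LGD * EAD   # Expected loss: 1,200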

The EAD is easy to compute and the recovery rate is often approximated accurately. The most complicated part is the probability of default…
Basically, in order to be able to understand the drivers of these quantities, we need information on the customers, i.e., DATA!

  • For individuals (credit scoring), this usually encompasses: age, gender, income, wealth, job type, payment history (prior defaults), etc.
  • For corporations (credit rating), lenders consult credit rating agencies (Standard & Poor's, Moody's, Fitch Ratings) that have developed complex models that usually rely on accounting data linked to revenue, debt, cash-flows, financial ratios, prior credit history, macro-economic variables, etc.

A few free references

Below, we list a few interesting resources on the topic of credit risk and credit scoring:

Data

In this part of the course, we dig into only one of these examples.
There are lots of datasets online, but it is not always easy to find neat ones (good size, format, clear column names).
The data we rely on below comes from this Kaggle competition.

First, let’s load the packages & dataset.
IMPORTANT: the data (.csv file) is also available on this drive link.
\(\rightarrow\) Please download it and place it in the same folder as the notebook.

library(tidyverse)                           # Data wrangling
library(tidymodels)                          # ML workflow
data <- read_csv("application_train.csv")    # Load data

The data is super large for a “normal” computer. We don’t want training times that are too long.
So let’s shrink its size (you can skip this step if you want the “full” experience!).
To do so, have a look at missing data first.
We locate the columns with more than 25% of missing data.
Then, remove these columns.

missing_names <- colMeans(is.na(data)) |>
  data.frame() |>                          # Convert to dataframe
  rownames_to_column() |>                  # Shift row names to a column
  `colnames<-` (c("name", "missing")) |>   # Change column names  
  filter(missing > 0.25)                   # Show features with lots of missing data
missing_names |> head(12)                    # First 12 names
name                          missing
OWN_CAR_AGE                 0.6599081
OCCUPATION_TYPE             0.3134555
EXT_SOURCE_1                0.5638107
APARTMENTS_AVG              0.5074973
BASEMENTAREA_AVG            0.5851596
YEARS_BEGINEXPLUATATION_AVG 0.4878102
YEARS_BUILD_AVG             0.6649778
COMMONAREA_AVG              0.6987230
ELEVATORS_AVG               0.5329598
ENTRANCES_AVG               0.5034877
FLOORSMAX_AVG               0.4976082
FLOORSMIN_AVG               0.6784863
data <- data |>                              # Remove these features
  select(-(missing_names |> pull(name)))

Let’s have a look at the column names.

colnames(data)
 [1] "SK_ID_CURR"                  "TARGET"                     
 [3] "NAME_CONTRACT_TYPE"          "CODE_GENDER"                
 [5] "FLAG_OWN_CAR"                "FLAG_OWN_REALTY"            
 [7] "CNT_CHILDREN"                "AMT_INCOME_TOTAL"           
 [9] "AMT_CREDIT"                  "AMT_ANNUITY"                
[11] "AMT_GOODS_PRICE"             "NAME_TYPE_SUITE"            
[13] "NAME_INCOME_TYPE"            "NAME_EDUCATION_TYPE"        
[15] "NAME_FAMILY_STATUS"          "NAME_HOUSING_TYPE"          
[17] "REGION_POPULATION_RELATIVE"  "DAYS_BIRTH"                 
[19] "DAYS_EMPLOYED"               "DAYS_REGISTRATION"          
[21] "DAYS_ID_PUBLISH"             "FLAG_MOBIL"                 
[23] "FLAG_EMP_PHONE"              "FLAG_WORK_PHONE"            
[25] "FLAG_CONT_MOBILE"            "FLAG_PHONE"                 
[27] "FLAG_EMAIL"                  "CNT_FAM_MEMBERS"            
[29] "REGION_RATING_CLIENT"        "REGION_RATING_CLIENT_W_CITY"
[31] "WEEKDAY_APPR_PROCESS_START"  "HOUR_APPR_PROCESS_START"    
[33] "REG_REGION_NOT_LIVE_REGION"  "REG_REGION_NOT_WORK_REGION" 
[35] "LIVE_REGION_NOT_WORK_REGION" "REG_CITY_NOT_LIVE_CITY"     
[37] "REG_CITY_NOT_WORK_CITY"      "LIVE_CITY_NOT_WORK_CITY"    
[39] "ORGANIZATION_TYPE"           "EXT_SOURCE_2"               
[41] "EXT_SOURCE_3"                "OBS_30_CNT_SOCIAL_CIRCLE"   
[43] "DEF_30_CNT_SOCIAL_CIRCLE"    "OBS_60_CNT_SOCIAL_CIRCLE"   
[45] "DEF_60_CNT_SOCIAL_CIRCLE"    "DAYS_LAST_PHONE_CHANGE"     
[47] "FLAG_DOCUMENT_2"             "FLAG_DOCUMENT_3"            
[49] "FLAG_DOCUMENT_4"             "FLAG_DOCUMENT_5"            
[51] "FLAG_DOCUMENT_6"             "FLAG_DOCUMENT_7"            
[53] "FLAG_DOCUMENT_8"             "FLAG_DOCUMENT_9"            
[55] "FLAG_DOCUMENT_10"            "FLAG_DOCUMENT_11"           
[57] "FLAG_DOCUMENT_12"            "FLAG_DOCUMENT_13"           
[59] "FLAG_DOCUMENT_14"            "FLAG_DOCUMENT_15"           
[61] "FLAG_DOCUMENT_16"            "FLAG_DOCUMENT_17"           
[63] "FLAG_DOCUMENT_18"            "FLAG_DOCUMENT_19"           
[65] "FLAG_DOCUMENT_20"            "FLAG_DOCUMENT_21"           
[67] "AMT_REQ_CREDIT_BUREAU_HOUR"  "AMT_REQ_CREDIT_BUREAU_DAY"  
[69] "AMT_REQ_CREDIT_BUREAU_WEEK"  "AMT_REQ_CREDIT_BUREAU_MON"  
[71] "AMT_REQ_CREDIT_BUREAU_QRT"   "AMT_REQ_CREDIT_BUREAU_YEAR" 

The “FLAG_” variables don’t seem super clear. Let’s remove them too.

data <- data |> 
  select(!starts_with("FLAG_")) |> # Remove columns that start with FLAG_
  select(-SK_ID_CURR)              # This column won't be useful either

Finally, because the dataset is still very large, let’s remove all observations with missing data.
Since we will perform a classification task, the target should be a factor, i.e., a categorical variable.
This is required by some packages (e.g., randomForest).

data <- data |>                            
  na.omit() |>                                    # Remove rows with missing points
  mutate(TARGET = as.factor(TARGET),              # Categorize TARGET (for classification)
         across(where(is.character), as.factor))
data |> head()
TARGET NAME_CONTRACT_TYPE CODE_GENDER CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_2 EXT_SOURCE_3 OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
1 Cash loans M 0 202500 406597.5 24700.5 351000 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648 -2120 1 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.2629486 0.1393758 2 2 2 2 -1134 0 0 0 0 0 1
0 Revolving loans M 0 67500 135000.0 6750.0 135000 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260 -2531 1 2 2 MONDAY 9 0 0 0 0 0 0 Government 0.5559121 0.7295667 0 0 0 0 -815 0 0 0 0 0 0
0 Cash loans M 0 99000 490495.5 27517.5 454500 Spouse, partner State servant Secondary / secondary special Married House / apartment 0.035792 -16941 -1588 -4970 -477 2 2 2 WEDNESDAY 16 0 0 0 0 0 0 Other 0.3542247 0.6212263 0 0 0 0 -2536 0 0 0 0 1 1
0 Cash loans F 1 171000 1560726.0 41301.0 1395000 Unaccompanied Commercial associate Higher education Married House / apartment 0.035792 -13778 -3130 -1213 -619 3 2 2 SUNDAY 16 0 0 0 0 0 0 Business Entity Type 3 0.7239999 0.4920601 1 0 1 0 -1562 0 0 0 1 1 2
0 Cash loans M 0 360000 1530000.0 42075.0 1530000 Unaccompanied State servant Higher education Married House / apartment 0.003122 -18850 -449 -4597 -2379 2 3 3 MONDAY 16 0 0 0 0 1 1 Other 0.7142793 0.5406545 2 0 2 0 -1070 0 0 0 0 0 0
0 Cash loans F 0 112500 1019610.0 33826.5 913500 Children Pensioner Secondary / secondary special Married House / apartment 0.018634 -20099 365243 -7427 -3514 2 2 2 WEDNESDAY 14 0 0 0 0 0 0 XNA 0.2057473 0.7517237 1 0 1 0 0 0 0 0 0 0 1

There, now we have a clean sample to work with! But it may still not be optimal. This remark comes a posteriori, after fitting a few models; it requires diving more deeply into the characteristics of the data.

One potential source of problems is categorical columns. Some have too many categories, which will create trouble later on. For instance, ORGANIZATION_TYPE has 57 possible values… Also, the weekday on which the application process started does not seem super important. Moreover, to enhance the quality of the data, we transform textual columns into R factors. This is only useful because the randomForest package, which we use first, handles categorical data well. For many other models, like boosted trees or neural networks, this will unfortunately not help.

data <- data |> 
  select(-ORGANIZATION_TYPE, -WEEKDAY_APPR_PROCESS_START) |>
  mutate(across(where(is.character), as.factor))  # All text columns to categories

Lastly, after playing with some models, it turns out that some variables do not have enough variation to be relevant (post sub-sampling). Hence, we remove them too. This is done manually, ex post.

data <- data |> select(-LIVE_REGION_NOT_WORK_REGION,
                       -LIVE_CITY_NOT_WORK_CITY,
                       -REG_CITY_NOT_LIVE_CITY,
                       -REG_REGION_NOT_LIVE_REGION,
                       -REG_CITY_NOT_WORK_CITY,
                       -REG_REGION_NOT_WORK_REGION,
                       -DEF_30_CNT_SOCIAL_CIRCLE,
                       -DEF_60_CNT_SOCIAL_CIRCLE,
                       -AMT_REQ_CREDIT_BUREAU_HOUR,
                       #-AMT_REQ_CREDIT_BUREAU_DAY # Already removed
                       -AMT_REQ_CREDIT_BUREAU_WEEK,
                       -AMT_REQ_CREDIT_BUREAU_MON,
                       -AMT_REQ_CREDIT_BUREAU_QRT)
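
As an aside, this kind of filtering can also be automated. Below is a minimal sketch using a recipes step (recipes ships with tidymodels); relying on the default near-zero-variance thresholds is an assumption on our part.

rec_nzv <- recipe(TARGET ~ ., data = data) |>
  step_nzv(all_predictors()) |>   # Flag near-zero-variance predictors
  prep()
tidy(rec_nzv, number = 1)         # Lists the columns the step would drop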

For training time purposes (again!), we will reduce the data to 40k rows.
But the following step can be easily omitted to train on the full set.

data <- data[(1:40000)*5,]  # Keep every fifth row, i.e., 40k observations

Baseline learning

Preparation

As is usual in machine learning, we will split the data in two (we omit a separate validation set for simplicity).
In this first exercise, we will use cross-validation. For more on that, we refer to the tidymodels documentation in general and to rsample in particular. Below, we show a scheme of the packages within tidymodels and the stage at which they are used.

First: data preparation with training/testing splitting.

data_split <- initial_split(data, prop = 0.5) # 0.5 to save time on training...
data_train <- training(data_split) |> vfold_cv(v = 5) # Creates the folds
data_test <- testing(data_split)
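
Given how imbalanced the target is, a stratified split is often safer, so that both halves keep the same default rate. A possible refinement of the code above (sketch):

data_split <- initial_split(data, prop = 0.5, strata = TARGET)
data_train <- training(data_split) |> vfold_cv(v = 5, strata = TARGET)
data_test  <- testing(data_split)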

First model with random forests

For this course, we will first use random forests. The models available in tidymodels/parsnip are listed in its documentation; the corresponding model is rand_forest().

The advantage of using tidymodels is that it gathers many different ML packages into a unique framework. One important choice (excluding ensembles) is which family of algorithms you want to work with. There are several implementations of random forests in R, and we pick randomForest.
The code below is inspired by the tutorial.
Basically, we first select a model family (random forest), then the package/library that implements it, and finally the supervised task (classification versus regression).

library(randomForest)             # Just in case...
rf_model <-                       # Here we define the model
  rand_forest(
    mtry = tune(),            # The hyperparameters will be tuned!
    trees = tune()
  ) |>
  set_engine("randomForest") |> # Which package?
  set_mode("classification")    # Classification or regression

Next, we move on to the definition of the grid (for the search).
In terms of parameters:
- mtry is the number of variables randomly chosen as predictors
- trees is the number of trees trained in the forest
- sampsize is the number of observations randomly sampled to train one given tree
- nodesize is the minimum number of observations required in the terminal leaves of all trees
- maxnodes is the maximum number of leaves for all trees

Below, we consider only the first two.

rf_grid <- grid_regular(mtry(range = c(10, 20)),  # There are dedicated functions for parameters!
                        trees(range = c(5, 20)),  # Small numbers to reduce CPU time
                        levels = c(2,3))          # 2 & 3 choices for 1st and 2nd parameters
rf_grid
  mtry trees
    10     5
    20     5
    10    12
    20    12
    10    20
    20    20
set.seed(42)
rf_wf <- workflow() |>
  add_model(rf_model) |>
  add_formula(TARGET ~ .)  # This means we predict TARGET with all other variables

Below, we launch the training process across the grid values.
Because training takes some time, we resort to parallelization.
Whenever you have 4+ cores in your computer, this can help you divide training times by 2+.

library(doParallel)
doParallel::registerDoParallel(6)        # Specifies the number of cores
rf_res <- rf_wf |>
  tune_grid(                             # This will train on all params
    resamples = data_train,
    grid = rf_grid
  )
show_best(rf_res, metric = "accuracy")   # Shows the best params
  mtry trees .metric  .estimator    mean n   std_err .config
    10    20 accuracy binary     0.92145 5 0.0017489 Preprocessor1_Model5
    20    20 accuracy binary     0.92100 5 0.0012068 Preprocessor1_Model6
    10    12 accuracy binary     0.92025 5 0.0015062 Preprocessor1_Model3
    20    12 accuracy binary     0.91890 5 0.0018125 Preprocessor1_Model4
    10     5 accuracy binary     0.91295 5 0.0013309 Preprocessor1_Model1

Average accuracy seems high…
But at the same time, there are few instances of strictly positive TARGET values.
This is a well-known issue with highly imbalanced datasets.

mean(data$TARGET |> as.character() |> as.numeric() == 0)
[1] 0.9218

Hence, if we predict 0 each time, we would be correct 92% of the time. This level of accuracy for our algorithm is therefore not impressive… more on that later.
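
One way to look past raw accuracy (sketch, assuming the default tuning metrics, which include the AUC): rank the grid by roc_auc and evaluate the finalized workflow on the held-out test set.

show_best(rf_res, metric = "roc_auc")            # Rank hyperparameters by AUC instead
best_rf  <- select_best(rf_res, metric = "roc_auc")
final_rf <- finalize_workflow(rf_wf, best_rf) |>
  fit(training(data_split))                      # Refit on the full training half
predict(final_rf, data_test, type = "prob") |>   # Class probabilities on the test set
  bind_cols(data_test |> select(TARGET)) |>
  roc_auc(TARGET, .pred_0)                       # AUC (the "0" class is the first level)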

A neural network with torch

Introduction

Next we move to another family of predictive tools (supervised learning algorithms): neural networks!
There are two main frameworks, mostly developed in Python: PyTorch (developed by Meta) and Keras/TensorFlow (developed by Google). Here we want to test the first, but without using Python; hence we will rely on the foundational library, which is torch.

library(torch)

Below we prepare the data in torch fashion (not super user friendly…).

torch is an ML framework whose computational core is written in C++ (the original Torch was written in Lua/C). In Python => PyTorch; in R => torch for R.

First, we need to re-prepare the data because NNs can only take numerical inputs.
Moreover, the scales are all over the place… and this is also bad for NNs.

data_nn <- data |> select(where(is.numeric))

scale_0_1 <- function(v){(v-min(v))/(max(v)-min(v))}
data_nn <- data_nn |>
  mutate(across(everything(), scale_0_1))

data_nn <- bind_cols(TARGET = data$TARGET, data_nn)
data_nn$TARGET <- data_nn$TARGET |> as.character() |> as.numeric()
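
A quick sanity check (sketch): min-max scaling divides by max - min, so a column without any variation would produce NaNs. We verify that none slipped through and that predictors now lie in [0, 1].

sum(is.nan(as.matrix(data_nn)))             # Should be 0
apply(data_nn, 2, range) |> t() |> head()   # Min/max of the first few columns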

Data preparation

The idea is to prepare everything upfront, especially the (random) batches.
The code below is inspired by this post.

scoring_dataset <- dataset(
  name = "scoring_dataset",
  initialize = function(index) {
    self$data <- self$prepare_scoring_data(index)
  },
  
  .getitem = function(index) {
    x <- self$data[index, 2:-1]               # Predictors
    y <- self$data[index, 1]$to(torch_long()) # Target
    list(x, y)
  },
  
  .length = function() {
    self$data$size()[[1]]
  },
  
  prepare_scoring_data = function(index) {
    input <- na.omit(data_nn |> select(where(is.numeric))) 
    input <- input[index, ]             # Slice the data (the full set would be too long)
    input <- data.matrix(input)
    torch_tensor(input)
  }
)

Let’s have a look.

score_data <- scoring_dataset(1:5000)
score_data$.length()
[1] 5000
score_data$.getitem(1)
[[1]]
torch_tensor
 0.0000
 0.1498
 0.3708
 0.1790
 0.3708
 0.0360
 0.3623
 0.0446
 0.7975
 0.6368
 0.0667
 1.0000
 1.0000
 0.6957
 0.8354
 0.6045
 0.0057
 0.0058
 0.7424
 0.0000
 0.0000
[ CPUFloatType{21} ]

[[2]]
torch_tensor
0
[ CPULongType{} ]
dl <- score_data |> dataloader(batch_size = 32)
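
A quick check (sketch): pull one batch from the dataloader and inspect the tensor shapes.

b <- dl |> dataloader_make_iter() |> dataloader_next()  # First batch of 32 observations
b[[1]]$shape   # Predictors: 32 x 21
b[[2]]$shape   # Targets: 32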

Model architecture

Below, we specify the network’s architecture… There are a couple of important functions:

  • nn_linear() is a dense layer, the most commonly used one
  • nnf_relu() and nnf_sigmoid() are activation functions
  • nn_dropout() is a layer that randomly removes units

The full reference is here.

net <- nn_module(
  "ScoringNet",
  initialize = function() {
    self$fc1 <- nn_linear(21, 64) # There are 21 predictors in the data (1st number)
    self$fc2 <- nn_linear(64, 8)
    self$fc3 <- nn_linear(8,1)
  },
  # Below, we code the feed-forward structure
  forward = function(x) {
    x |> 
      self$fc1() |> 
      nnf_relu() |>
      self$fc2() |> 
      nnf_relu() |>
      self$fc3() |>
      nnf_sigmoid() 
  }
)
model <- net()

The cheat sheet gives more functions/options.
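
For instance, the nn_dropout() layer mentioned above can be slotted between dense layers to regularize the network. A possible variant (sketch, not used below; the 25% rate is an arbitrary choice):

net_dropout <- nn_module(
  "ScoringNetDropout",
  initialize = function() {
    self$fc1  <- nn_linear(21, 64)
    self$drop <- nn_dropout(p = 0.25)   # Randomly zeroes 25% of units during training
    self$fc2  <- nn_linear(64, 8)
    self$fc3  <- nn_linear(8, 1)
  },
  forward = function(x) {
    x |>
      self$fc1() |> nnf_relu() |> self$drop() |>
      self$fc2() |> nnf_relu() |>
      self$fc3() |> nnf_sigmoid()
  }
)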

Next, the optimizer (gradient descent)… with the learning rate.

optimizer <- optim_sgd(model$parameters, lr = 0.01)
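
Adam is a common alternative to plain SGD and often converges faster; swapping it in is a one-liner (kept commented so that the results below remain those of SGD):

# optimizer <- optim_adam(model$parameters, lr = 0.001)   # Drop-in replacement for optim_sgd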

Training

And off we go! The syntax is quite puzzling… (more on that below)

for(epoch in 1:10) {                        # Loop on epochs
  l <- c()
  coro::loop(for (b in dl) {                  # Loop on batches
    optimizer$zero_grad()                     # Initialize gradients to zero
    output <- model(b[[1]])$to(torch_float()) # Forward pass
    loss <- nnf_binary_cross_entropy(output, 
                                     b[[2]]$to(torch_float()))    # Computes loss
    loss$backward()                           # Computes dloss/dx for every parameter x
    optimizer$step()                          # Parameter update
    l <- c(l, loss$item())                    # Stores loss of batch
  })
  cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(l)))
}
Loss at epoch 1: 0.563368
Loss at epoch 2: 0.296720
Loss at epoch 3: 0.269932
Loss at epoch 4: 0.268180
Loss at epoch 5: 0.267451
Loss at epoch 6: 0.266772
Loss at epoch 7: 0.266096
Loss at epoch 8: 0.265426
Loss at epoch 9: 0.264764
Loss at epoch 10: 0.264111

Then, onto predictions.
REMEMBER: neural networks are initialized randomly, hence predictions will differ from one run to the next!

score_data <- scoring_dataset(1001:1500)
dl_test <- score_data %>% dataloader(batch_size = 1)

preds <- c()
coro::loop(for (b in dl_test) {
  output <- model(b[[1]])
  preds <- c(preds, output %>% as.numeric())
})
# Accuracy below: we round predictions...
mean(round(preds) == as_array(score_data$.getitem(1:500)[[2]]) )
[1] 0.934

Evaluation

Again, the problem of defaults (and frauds) is that datasets are highly imbalanced. In this case, it is common to compute the AUC: area under the ROC curve.

library(caTools)
colAUC(X = preds, 
       y = as_array(score_data$.getitem(1:500)[[2]]), 
       plotROC = TRUE) 

             [,1]
0 vs. 1 0.6949581

Extension: use embeddings to include categorical variables as in this example.
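
To give the flavor (a minimal sketch, not wired into our data; the sizes are purely illustrative): an nn_embedding() layer maps integer-coded categories to dense vectors that can then be concatenated with the numeric predictors.

emb <- nn_embedding(num_embeddings = 57, embedding_dim = 4)  # e.g., 57 organization types
idx <- torch_tensor(c(3, 17, 42), dtype = torch_long())      # Integer codes for 3 customers
emb(idx)$shape                                               # 3 x 4 dense representation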

Simpler syntax

NOTE: torch is known to have a pythonic (& arid) syntax, especially compared to TF-Keras.
This is why initiatives to ease this have blossomed. In R, one such attempt is the luz package, which makes fitting a lot easier to code.

library(luz)

dl_train <- score_data |> dataloader(batch_size = 32) # For illustration, both loaders are
dl_test <- score_data |> dataloader(batch_size = 32)  # built from the same small sample

fit_luz <- net %>%
  setup(
    metrics = list(luz_metric_accuracy()),
    optimizer = optim_adam,
    loss = function(input, target) {
      nnf_binary_cross_entropy(input, target$float()$unsqueeze(-1))
    }
  ) %>%
  fit(dl_train, epochs = 10, valid_data = dl_test)
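
Once fitted, a luz model can predict directly on a dataloader (sketch):

preds_luz <- predict(fit_luz, dl_test)   # Torch tensor of predicted probabilities
mean(round(as_array(preds_luz)))         # Share of observations predicted as defaults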

Tabnets

Standard neural networks often perform less well than tree ensembles on tabular data.
Tabular networks were introduced to improve their competitiveness.
In R, tabnet is embedded in tidymodels.

library(tabnet)
data_split_nn <- initial_split(data_nn |> mutate(TARGET = as.factor(TARGET)), 
                               prop = 0.5) # 0.5 to save time on training...
data_train_nn <- training(data_split_nn) 
data_test_nn <- testing(data_split_nn)

Then, the model. We refer to the documentation (and the original paper!) for the meaning of the parameters. We reproduce the description below:

  • penalty: the extra sparsity loss coefficient as proposed in the original paper. The bigger this coefficient is, the sparser your model will be in terms of feature selection. Depending on the difficulty of your problem, reducing this value could help.
  • batch_size: (int) Number of examples per batch, large batch sizes are recommended. (default: 1024^2).
  • learn_rate: initial learning rate for the optimizer. (<<1)
  • decision_width: (int) Width of the decision prediction layer. Bigger values gives more capacity to the model with the risk of overfitting. Values typically range from 8 to 64.
  • attention_width: (int) Width of the attention embedding for each mask. According to the paper n_d=n_a is usually a good choice. (default=8).
  • num_steps: (int) Number of steps in the architecture (usually between 3 and 10).
  • feature_reusage: (float) This is the coefficient for feature reusage in the masks. A value close to 1 will make mask selection least correlated between layers. Values range from 1.0 to 2.0.
  • virtual_batch_size: (int) Size of the mini batches used for “Ghost Batch Normalization” (default=256^2).
  • num_independent: Number of independent Gated Linear Units layers at each step. Usual values range from 1 to 5.
  • num_shared: Number of shared Gated Linear Units at each step Usual values range from 1 to 5.
  • momentum: Momentum for batch normalization, typically ranges from 0.01 to 0.4 (default=0.02).

tab_mod <- tabnet(epochs = 5, 
                  batch_size = 512, 
                  decision_width = 8, 
                  attention_width = 8,
                  num_steps = 4, 
                  penalty = 0.0001, 
                  virtual_batch_size = 512, 
                  momentum = 0.3,
                  feature_reusage = 1.5, 
                  learn_rate = 0.001) %>%
  set_engine("torch", verbose = TRUE) %>%
  set_mode("classification")
tab_wf <- workflow() |>
  add_model(tab_mod) |>
  add_formula(TARGET ~ .) 
fitted_tabnet <- tab_wf %>% fit(data_train_nn)
[Epoch 001] Loss: 0.494779
[Epoch 002] Loss: 0.371300
[Epoch 003] Loss: 0.320907
[Epoch 004] Loss: 0.305884
[Epoch 005] Loss: 0.291652

To access training data/performance:

# fitted_tabnet$fit$fit$fit$fit$metrics

To predict:

pred_nn <- predict(fitted_tabnet, data_test_nn)
mean(pred_nn$.pred_class == data_test_nn$TARGET)
[1] 0.92215

Hard to beat the 92% threshold!
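
As before, class probabilities give a more informative picture than raw accuracy; a sketch with yardstick's roc_auc:

pred_prob <- predict(fitted_tabnet, data_test_nn, type = "prob")  # .pred_0 / .pred_1
bind_cols(pred_prob, data_test_nn |> select(TARGET)) |>
  roc_auc(truth = TARGET, .pred_0)                                # AUC on the test half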

Interpretability

Finance is often a critical discipline because it involves money.
Hence, when people rely on a black-box algorithm for decision-making, they usually prefer to understand why the algorithm comes to particular conclusions.
To do so, we need to “white-box” the outcome of the predictions.
Of course, there are many ways to do that, but there is one (among others) important dichotomy:
- global interpretability: in this case, we seek to understand how the model works on a large set of observations, for instance, on the whole training set.
- local interpretability: in this case, the focus is set on one (or a few) observations and the aim is to understand how the model behaves for this particular point.

One popular approach for the latter is Local Interpretable Model-agnostic Explanations (LIME).
The idea is sketched below.

The original (complex) model is \(f\) and it is approximated at instance \(x\) by the interpretable model \(g\) that belongs to a large class \(G\). The vicinity of \(x\) is denoted \(\pi_x\) and the complexity of \(g\) is written \(\Omega(g)\). LIME seeks an interpretation of the form \[\xi(x)=\underset{g \in G}{\text{argmin}} \, \mathcal{L}(f,g,\pi_x)+\Omega(g),\] where \(\mathcal{L}(f,g,\pi_x)\) is the loss function (error/imprecision) induced by \(g\) in the vicinity \(\pi_x\) of \(x\).

The penalisation \(\Omega(g)\) is for instance the number of leaves or depth of a tree or the number of predictors in a linear regression.

The vicinity of \(x\) is defined by \(\pi_x(z)=e^{-D(x,z)^2/\sigma^2},\) where \(D\) is some distance measure. Note: this function decreases when \(z\) shifts away from \(x\).

The tricky part is the loss function. In order to minimise it, LIME generates artificial samples close to \(x\) and sums the (weighted) errors that the simple model \(g\) makes relative to the predictions of \(f\). The formulation is the following: \[\mathcal{L}(f,g,\pi_x)=\sum_z \pi_x(z)(f(z)-g(z))^2,\] where the errors are weighted according to the distance of the samples \(z\) from the initial instance \(x\).

Visually, we reproduce below a representation from the original paper.

knitr::include_graphics("images/lime.png")

Here, the task is classification: blue circles versus red crosses. The model is complex and clearly not linear.
Points and crosses are therefore sampled in the vicinity of the bold red cross. This gives an idea of how the model behaves locally around the target point. Then, a linear model is fitted on these points and yields the dotted line. Locally, i.e., close to the bold cross, it works well: the crosses are to the left and the blue circles to the right.

Let’s try this! NOTE: the code below is adapted from http://www.mlfactor.com/interp.html#lime. For this test, we briefly re-train a Random Forest model like the ones used above.

library(lime) 
rf_model_2 <- randomForest(      # Training the RF model 
  x = data |> select(-TARGET), 
  y = data$TARGET,
  mtry = 18,                     # 18 candidate predictors at each split
  ntree = 15,                    # 15 trees in the forest (randomForest's argument is ntree)
  sampsize = 25000,              # 25k observations used for each tree
  nodesize = 30,                 # At least 30 obs in each leaf
  maxnodes = 32                  # No more than 32 leaves in total for each tree
)

Once we have the model, LIME proceeds in two steps.
We seek explanations for the first 2 observations in the sample.

out_lime <- lime(data |> select(-TARGET), 
                 model = as_classifier(rf_model_2))
Warning: AMT_REQ_CREDIT_BUREAU_DAY does not contain enough variance to use
quantile binning. Using standard binning instead.
explanation <- explain(x = data |> 
                         select(-TARGET) |>
                         slice(1:2),            # First two instances in train_sample 
                       explainer = out_lime,    # Explainer variable created above 
                       labels = TRUE,
                       n_permutations = 900,    # Nb samples for loss function
                       n_features = 7           # Nb of features shown (important ones)
)
plot_features(explanation, ncol = 1)            # Visual display

The blue bars go in the direction of the final predictions, the red ones go against.
NOTE: unfortunately, the providers of the data would not specify what the “EXT_SOURCE” variables refer to…