# install.packages("tidyverse")
# install.packages(c("data.table", "fastDummies", "lightgbm", "reticulate", "keras"))
Fraud detection
WARNING: this notebook relies on several R packages. Whenever one is called (for instance via library(torch)), it is assumed that the package has already been installed, e.g., via install.packages("torch"). Hence, make sure the required packages are installed!
NOTE: in data science, one of the most crucial competences is domain knowledge.
Data is key (more than any algorithm) and domain knowledge helps understand how it is built and which features are likely to matter most.
Introduction
Principles
Quite obviously, the purpose of fraud detection is to identify illicit transactions.
This is usually done via the characteristics of the transaction, which can be split into several categories: the nature of the transaction (e.g., type (credit card versus online), location, timing or amount), and the properties of the parties: history of transactions, location, gender of emitter, trustworthiness of receiver, etc.
Hence, like for credit risk, the main task will be essentially classification (supervised learning), where we use the characteristics of the transactions to predict if they are fraudulent or not. However, with fraud data, the imbalance is even more pronounced, because usually less than 1% of transactions are tagged as fraudulent. This means that most of the interesting information comes from a very small portion of the data!
Note that fraud is widespread and affects many domains, insurance in particular (see Section 6 of this paper for examples). The tools to detect fraud (or anomalies) in other contexts are quite similar (supervised learning).
Short list of free references
There are several books available on the topic, but the following one, with code snippets in Python, is probably the most applied (with PyTorch examples to illustrate neural networks):
- Reproducible Machine Learning for Credit Card Fraud detection - Practical handbook
Two recent survey articles:
Data wrangling
First, let us load the data & the packages.
The data originates from a Kaggle competition and we only keep the training data for which the target/label is available.
The file is also available on this drive link.
\(\rightarrow\) Please download it and place it in the same folder as the notebook.
library(tidyverse)
library(data.table) # To import csv sheets rapidly
library(fastDummies) # For dummies
fraud <- fread("fraud.csv")
head(fraud)
V1 | trans_date_trans_time | cc_num | merchant | category | amt | first | last | gender | street | city | state | zip | lat | long | city_pop | job | dob | trans_num | unix_time | merch_lat | merch_long | is_fraud |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019-01-01 00:00:18 | 2703186189652095 | fraud_Rippin, Kub and Mann | misc_net | 4.97 | Jennifer | Banks | F | 561 Perry Cove | Moravian Falls | NC | 28654 | 36.0788 | -81.1781 | 3495 | Psychologist, counselling | 1988-03-09 | 0b242abb623afc578575680df30655b9 | 1325376018 | 36.01129 | -82.04832 | 0 |
1 | 2019-01-01 00:00:44 | 630423337322 | fraud_Heller, Gutmann and Zieme | grocery_pos | 107.23 | Stephanie | Gill | F | 43039 Riley Greens Suite 393 | Orient | WA | 99160 | 48.8878 | -118.2105 | 149 | Special educational needs teacher | 1978-06-21 | 1f76529f8574734946361c461b024d99 | 1325376044 | 49.15905 | -118.18646 | 0 |
2 | 2019-01-01 00:00:51 | 38859492057661 | fraud_Lind-Buckridge | entertainment | 220.11 | Edward | Sanchez | M | 594 White Dale Suite 530 | Malad City | ID | 83252 | 42.1808 | -112.2620 | 4154 | Nature conservation officer | 1962-01-19 | a1a22d70485983eac12b5b88dad1cf95 | 1325376051 | 43.15070 | -112.15448 | 0 |
3 | 2019-01-01 00:01:16 | 3534093764340240 | fraud_Kutch, Hermiston and Farrell | gas_transport | 45.00 | Jeremy | White | M | 9443 Cynthia Court Apt. 038 | Boulder | MT | 59632 | 46.2306 | -112.1138 | 1939 | Patent attorney | 1967-01-12 | 6b849c168bdad6f867558c3793159a81 | 1325376076 | 47.03433 | -112.56107 | 0 |
4 | 2019-01-01 00:03:06 | 375534208663984 | fraud_Keeling-Crist | misc_pos | 41.96 | Tyler | Garcia | M | 408 Bradley Rest | Doe Hill | VA | 24433 | 38.4207 | -79.4629 | 99 | Dance movement psychotherapist | 1986-03-28 | a41d7549acf90789359a9aa5346dcb46 | 1325376186 | 38.67500 | -78.63246 | 0 |
5 | 2019-01-01 00:04:08 | 4767265376804500 | fraud_Stroman, Hudson and Erdman | gas_transport | 94.63 | Jennifer | Conner | F | 4655 David Island | Dublin | PA | 18917 | 40.3750 | -75.2045 | 2158 | Transport planner | 1961-06-19 | 189a841a0a8ba03058526bcfe566aab5 | 1325376248 | 40.65338 | -76.15267 | 0 |
The first three columns probably won't be very useful (this is a shortcut though: time can be predictive, but it would need to be transformed, e.g., into days and hours; see the short sketch below).
The merchant name can also be cleaned up.
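As an illustration of this point (not used in the rest of the notebook), here is a minimal sketch of how the timestamp could be turned into hour-of-day and day-of-week features. It assumes the lubridate package is installed and that trans_date_trans_time parses as a date-time; fraud_time is a throwaway name introduced for the sketch.
library(lubridate)                                         # Assumption: lubridate is installed
fraud_time <- fraud |>
  mutate(trans_time = as_datetime(trans_date_trans_time),  # Parse the timestamp
         trans_hour = hour(trans_time),                    # Hour of the day (0-23)
         trans_wday = wday(trans_time, label = TRUE))      # Day of the week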
fraud <- fraud[, -(1:3)]
fraud <- fraud |>
  select(-trans_num, -first, -last) |>
  mutate(merchant = merchant |> str_remove("fraud_"))
fraud |> head(4)
merchant | category | amt | gender | street | city | state | zip | lat | long | city_pop | job | dob | unix_time | merch_lat | merch_long | is_fraud |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rippin, Kub and Mann | misc_net | 4.97 | F | 561 Perry Cove | Moravian Falls | NC | 28654 | 36.0788 | -81.1781 | 3495 | Psychologist, counselling | 1988-03-09 | 1325376018 | 36.01129 | -82.04832 | 0 |
Heller, Gutmann and Zieme | grocery_pos | 107.23 | F | 43039 Riley Greens Suite 393 | Orient | WA | 99160 | 48.8878 | -118.2105 | 149 | Special educational needs teacher | 1978-06-21 | 1325376044 | 49.15905 | -118.18646 | 0 |
Lind-Buckridge | entertainment | 220.11 | M | 594 White Dale Suite 530 | Malad City | ID | 83252 | 42.1808 | -112.2620 | 4154 | Nature conservation officer | 1962-01-19 | 1325376051 | 43.15070 | -112.15448 | 0 |
Kutch, Hermiston and Farrell | gas_transport | 45.00 | M | 9443 Cynthia Court Apt. 038 | Boulder | MT | 59632 | 46.2306 | -112.1138 | 1939 | Patent attorney | 1967-01-12 | 1325376076 | 47.03433 | -112.56107 | 0 |
There remains a lot to do… There are too many merchants and cities, so for simplicity we choose to remove them. Instead, we keep the category column, which we transform via one-hot encoding: each categorical value is coded into a new binary column (see below).
In terms of locations, zip is a number but can be misleading, so we’ll only keep latitude & longitude.
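To make the one-hot encoding concrete, here is a toy sketch (hypothetical data) of what dummy_cols() produces: one 0/1 column per distinct value of the categorical variable.
toy <- data.frame(category = c("grocery_pos", "misc_net", "grocery_pos"))  # Toy data for illustration
dummy_cols(toy)  # Adds binary columns category_grocery_pos and category_misc_net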
NOTE: the one-hot encoding below can take some time…
fraud <- fraud |>
  select(-merchant, -street, -city, -state, -zip, -dob, -unix_time, -job) |>
  mutate(gender = gender |> as.factor() |> as.numeric())
dummy_cat <- dummy_cols(fraud$category)   # One-hot encoding
fraud <- fraud |>
  select(-category) |>                    # Removes the "category" column
  bind_cols(dummy_cat |> select(-1))      # Adds the equivalent dummy columns
fraud |> head(3)
amt | gender | lat | long | city_pop | merch_lat | merch_long | is_fraud | .data_entertainment | .data_food_dining | .data_gas_transport | .data_grocery_net | .data_grocery_pos | .data_health_fitness | .data_home | .data_kids_pets | .data_misc_net | .data_misc_pos | .data_personal_care | .data_shopping_net | .data_shopping_pos | .data_travel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4.97 | 1 | 36.0788 | -81.1781 | 3495 | 36.01129 | -82.04832 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
107.23 | 1 | 48.8878 | -118.2105 | 149 | 49.15905 | -118.18646 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
220.11 | 2 | 42.1808 | -112.2620 | 4154 | 43.15070 | -112.15448 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Before the models, let’s evaluate the imbalance of the data…
mean(fraud$is_fraud)
[1] 0.005788652
0.6% of positive cases: this is very low. Again: a large chunk of the information seems to be contained in a very small fraction of observations…
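For a quick sanity check, the raw class counts can also be displayed:
table(fraud$is_fraud)  # Counts of non-fraud (0) vs fraud (1) transactions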
A first lightGBM model
Data splitting
In this section, we fit a boosted tree from the lightGBM package.
We split the data in two, for simplicity: we won’t use validation.
library(lightgbm)
fraud_train <- fraud[1:500000, ]
fraud_test  <- fraud[(1 + 500000):nrow(fraud), ]
The model parameters. The full list is gathered here.
Good tips with regard to parameter tuning are also provided on this page.
train_params <- list(
  num_leaves = 15,         # Max nb leaves in tree
  learning_rate = 0.1,     # Learning rate
  objective = "binary",    # Loss function
  max_depth = 4,           # Max depth of trees
  min_data_in_leaf = 50,   # Nb points in leaf
  bagging_fraction = 0.5,  # % of observations
  feature_fraction = 0.7,  # % of features
  nthread = 2
)
Training
Onto training! The syntax is very simple.
bst <- lightgbm(
  data = fraud_train |> select(-is_fraud) |> as.matrix(),
  label = fraud_train$is_fraud,  # Target / label
  params = train_params,         # Passing parameter values
  nrounds = 20                   # Number of trees in the model
)
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.016566 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]: train's binary_logloss:0.0367439"
[1] "[2]: train's binary_logloss:0.0217458"
[1] "[3]: train's binary_logloss:0.0208932"
[1] "[4]: train's binary_logloss:0.0175729"
[1] "[5]: train's binary_logloss:0.0171861"
[1] "[6]: train's binary_logloss:0.0166766"
[1] "[7]: train's binary_logloss:0.0162691"
[1] "[8]: train's binary_logloss:0.015946"
[1] "[9]: train's binary_logloss:0.0157294"
[1] "[10]: train's binary_logloss:0.0152117"
[1] "[11]: train's binary_logloss:0.0148635"
[1] "[12]: train's binary_logloss:0.0147105"
[1] "[13]: train's binary_logloss:0.0144223"
[1] "[14]: train's binary_logloss:0.0142859"
[1] "[15]: train's binary_logloss:0.0141833"
[1] "[16]: train's binary_logloss:0.0137635"
[1] "[17]: train's binary_logloss:0.0135653"
[1] "[18]: train's binary_logloss:0.0133968"
[1] "[19]: train's binary_logloss:0.0131941"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[20]: train's binary_logloss:0.0130456"
Predicting
Next, we ask for the predictions of the first 700 rows of the test set.
predict(bst, fraud_test[1:700,] |> select(-is_fraud) |> as.matrix()) |> round()
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
[149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[630] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[667] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
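Rounding the probabilities at 0.5 is a strong choice for such imbalanced data. Below is a small sketch (not part of the original analysis; probs is a new name) that compares predicted probabilities to the true labels over the whole test set, first with the default 0.5 cutoff and then with an arbitrary lower one:
probs <- predict(bst, fraud_test |> select(-is_fraud) |> as.matrix())  # Predicted fraud probabilities
table(predicted = probs > 0.5, actual = fraud_test$is_fraud)           # Confusion table, 0.5 cutoff
table(predicted = probs > 0.1, actual = fraud_test$is_fraud)           # Same with an arbitrary 0.1 cutoff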
Other parameter options
Modern libraries in machine learning allow for many options (again, see the parameter list).
Let’s have a look at several extensions:
- dropout: adapted from neural nets, the idea is to drop some trees, more or less randomly.
- monotonicity: it may be the case that for some features, the modeller wishes to impose a directional link with the target (e.g., increasing or decreasing).
- regularization: the idea of penalizing output values to reduce the variance of predictions (and overfitting).
- early stopping is also now implemented in most libraries: it can save time when the algorithm stops improving after a few rounds.
We adapt the set of parameters below:
nb_feat <- ncol(fraud_train |> select(-is_fraud))  # Nb features
mono_const <- rep(0, nb_feat)
mono_const[1] <- 1  # First feature is amount: fraud increases with amount
# If you have a prior on gender, it's also possible to enforce a constraint
train_params <- list(
  num_leaves = 15,              # Max nb leaves in tree
  learning_rate = 0.1,          # Learning rate
  objective = "binary",         # Loss function
  max_depth = 4,                # Max depth of trees
  min_data_in_leaf = 50,        # Nb points in leaf
  bagging_fraction = 0.5,       # % of observations
  feature_fraction = 0.7,       # % of features
  nthread = 4,                  # Parallelization
  boosting = "dart",            # DART = dropping
  drop_rate = 0.1,              # Dropping rate
  lambda_l1 = 0.3,              # Penalizing leaf norms
  seed = 42,                    # For reproducibility
  # early stopping is not available with DART
  # early_stopping_round = 10,  # Early stopping after X rounds if no improvement
  monotone_constraints = mono_const,
  force_row_wise = T
)
New round of training.
bst <- lightgbm(
  data = fraud_train |> select(-is_fraud) |> as.matrix(),
  label = fraud_train$is_fraud,  # Target / label
  params = train_params,         # Passing parameter values
  nrounds = 20                   # Number of trees in the model
)
Warning in (function (params = list(), data, nrounds = 100L, valids = list(), :
Early stopping is not available in 'dart' mode.
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[1]: train's binary_logloss:0.0269181"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[2]: train's binary_logloss:0.0250706"
[1] "[3]: train's binary_logloss:0.0238056"
[1] "[4]: train's binary_logloss:0.0229725"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[5]: train's binary_logloss:0.022269"
[1] "[6]: train's binary_logloss:0.0213811"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[7]: train's binary_logloss:0.0215642"
[1] "[8]: train's binary_logloss:0.0207618"
[1] "[9]: train's binary_logloss:0.0204714"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[10]: train's binary_logloss:0.0207381"
[1] "[11]: train's binary_logloss:0.02036"
[1] "[12]: train's binary_logloss:0.0205059"
[1] "[13]: train's binary_logloss:0.0203041"
[1] "[14]: train's binary_logloss:0.0201755"
[1] "[15]: train's binary_logloss:0.0198653"
[1] "[16]: train's binary_logloss:0.0196283"
[1] "[17]: train's binary_logloss:0.0196991"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[18]: train's binary_logloss:0.0193787"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[19]: train's binary_logloss:0.0192463"
[1] "[20]: train's binary_logloss:0.0192584"
The loss is higher, which makes sense, given the additional constraints we have imposed.
Cross-validation
What if we want to test results on alternative subsets of the data?
cv_model <- lgb.cv(
  params = train_params,
  data = fraud_train |> select(-is_fraud) |> as.matrix(),
  label = fraud_train$is_fraud,  # Target / label
  eval_freq = 80,
  nrounds = 3,                   # Still the number of trees
  nfold = 5
)
[LightGBM] [Info] Number of positive: 2404, number of negative: 397596
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 400000, number of used features: 21
[LightGBM] [Info] Number of positive: 2473, number of negative: 397528
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 400001, number of used features: 21
[LightGBM] [Info] Number of positive: 2427, number of negative: 397573
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 400000, number of used features: 21
[LightGBM] [Info] Number of positive: 2446, number of negative: 397554
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 400000, number of used features: 21
[LightGBM] [Info] Number of positive: 2414, number of negative: 397585
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 399999, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006010 -> initscore=-5.108302
[LightGBM] [Info] Start training from score -5.108302
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006182 -> initscore=-5.079833
[LightGBM] [Info] Start training from score -5.079833
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006067 -> initscore=-5.098723
[LightGBM] [Info] Start training from score -5.098723
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006115 -> initscore=-5.090877
[LightGBM] [Info] Start training from score -5.090877
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006035 -> initscore=-5.104124
[LightGBM] [Info] Start training from score -5.104124
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[1]: valid's binary_logloss:0.0269293+0.000709862"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[3]: valid's binary_logloss:0.0238533+0.000725518"
cv_model$record_evals
$start_iter
[1] 1
$valid
$valid$binary_logloss
$valid$binary_logloss$eval
$valid$binary_logloss$eval[[1]]
[1] 0.02692933
$valid$binary_logloss$eval[[2]]
[1] 0.02509091
$valid$binary_logloss$eval[[3]]
[1] 0.02385325
$valid$binary_logloss$eval_err
$valid$binary_logloss$eval_err[[1]]
[1] 0.0007098624
$valid$binary_logloss$eval_err[[2]]
[1] 0.0007349295
$valid$binary_logloss$eval_err[[3]]
[1] 0.0007255183
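The nested list above is not very convenient to read. Here is a small sketch that gathers the per-round validation loss into a tibble (names follow the structure printed above):
tibble(
  round      = seq_along(cv_model$record_evals$valid$binary_logloss$eval),
  logloss    = unlist(cv_model$record_evals$valid$binary_logloss$eval),     # Mean CV loss per round
  logloss_sd = unlist(cv_model$record_evals$valid$binary_logloss$eval_err)  # Dispersion across folds
)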
Testing several parameter values
Ok, but what if we want to test several parameter combinations?
With the tidymodels ecosystem, this is possible via dedicated functions from the tune package, but let’s see if we can code it in R with functional programming.
First, let’s create the combinations.
num_leaves <- c(5, 30)
learning_rate <- c(0.01, 0.05, 0.2)
pars <- expand.grid(num_leaves, learning_rate)
num_leaves <- pars[, 1]
learning_rate <- pars[, 2]
Next, the training function. Note that some parameters are flexible, others are fixed.
train_func <- function(num_leaves, learning_rate, fraud_train){
  train_params <- list(             # First, the list of params
    num_leaves = num_leaves,        # Max nb leaves in tree
    learning_rate = learning_rate,  # Learning rate
    objective = "binary",           # Loss function
    max_depth = 3,                  # Max depth of trees
    min_data_in_leaf = 50,          # Nb points in leaf
    bagging_fraction = 0.5,         # % of observations
    feature_fraction = 0.7,         # % of features
    nthread = 4,                    # Parallelization
    force_row_wise = T
  )
  # Next we train
  bst <- lightgbm(
    data = fraud_train |> select(-is_fraud) |> as.matrix(),
    label = fraud_train$is_fraud,   # Target / label
    params = train_params,          # Passing parameter values
    eval_freq = 50,
    nrounds = 10                    # Number of trees in the model
  )
  # Next, we record the final loss (depends on the model/loss defined above)
  return(loss = bst$record_evals$train$binary_logloss$eval[[10]])
}
train_func(10, 0.1, fraud_train)    # Testing
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]: train's binary_logloss:0.0368203"
[1] "[10]: train's binary_logloss:0.0155729"
[1] 0.01557292
Finally, we can launch a function that is going to span all free parameters.
grd <- pmap(list(num_leaves, learning_rate),  # Parameters for the grid search
            train_func,                       # Function on which to apply the grid search
            fraud_train = fraud_train         # Non-changing argument (data is fixed)
)
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]: train's binary_logloss:0.0370693"
[1] "[10]: train's binary_logloss:0.0285605"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]: train's binary_logloss:0.0370674"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[10]: train's binary_logloss:0.0271148"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]: train's binary_logloss:0.0369637"
[1] "[10]: train's binary_logloss:0.0172955"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]: train's binary_logloss:0.0369543"
[1] "[10]: train's binary_logloss:0.0168124"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]: train's binary_logloss:0.0366141"
[1] "[10]: train's binary_logloss:0.0672663"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]: train's binary_logloss:0.0365813"
[1] "[10]: train's binary_logloss:0.0732902"
grd <- bind_cols(pars, tibble(loss = grd))
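Since pmap() returns a list, the loss column of grd is a list-column. A small sketch to flatten it and rank the parameter combinations (Var1 and Var2 are the default names produced by expand.grid):
grd |>
  rename(num_leaves = Var1, learning_rate = Var2) |>  # Default expand.grid column names
  mutate(loss = unlist(loss)) |>                       # Flatten the list-column of losses
  arrange(loss)                                        # Best (lowest training loss) first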
Manual random oversampling
The problem with fraud datasets is that they are very imbalanced: in our sample fewer than 1% of observations are actual frauds… This complicates the learning phase. Hence, it can make sense to resample the training data to reduce imbalance.
Let’s see how we can proceed.
We will use the function below.
sample(10, 5, replace = T) # If true, the same number can be chosen several times
[1] 9 4 7 5 2
We first sample the fraud rows.
# First, we locate the fraudulent observations
fraud_ind <- which(fraud_train$is_fraud == 1)
sample_size <- 10000
ratio <- 0.05
# Then, select the fraud instances
fraud_ind <- fraud_ind[sample(1:length(fraud_ind),
                              sample_size * ratio,
                              replace = F)]
fraud_sub <- fraud_train[fraud_ind, ]
fraud_sub |> head(3)
amt | gender | lat | long | city_pop | merch_lat | merch_long | is_fraud | .data_entertainment | .data_food_dining | .data_gas_transport | .data_grocery_net | .data_grocery_pos | .data_health_fitness | .data_home | .data_kids_pets | .data_misc_net | .data_misc_pos | .data_personal_care | .data_shopping_net | .data_shopping_pos | .data_travel |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1016.88 | 1 | 35.1506 | -106.5690 | 641349 | 35.55577 | -105.85035 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
19.02 | 1 | 38.9311 | -89.2463 | 1810 | 38.64232 | -89.02502 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
965.64 | 1 | 41.2639 | -80.8164 | 75903 | 42.25914 | -80.96991 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Then, the non-fraud ones and aggregation.
non_fraud_ind <- which(fraud_train$is_fraud == 0)
non_fraud_ind <- non_fraud_ind[sample(1:length(non_fraud_ind),
                                      sample_size * (1 - ratio),
                                      replace = F)]
non_fraud_sub <- fraud_train[non_fraud_ind, ]
fraud_train_sub <- bind_rows(fraud_sub, non_fraud_sub)
And just checking:
mean(fraud_train_sub$is_fraud)
[1] 0.05
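A natural follow-up is to refit the boosted tree on this rebalanced subsample and look at its predictions on the (untouched) test set. The sketch below reuses the train_params defined earlier; bst_sub and probs_sub are new names introduced here.
bst_sub <- lightgbm(
  data = fraud_train_sub |> select(-is_fraud) |> as.matrix(),
  label = fraud_train_sub$is_fraud,
  params = train_params,  # Same parameters as before
  nrounds = 20
)
probs_sub <- predict(bst_sub, fraud_test |> select(-is_fraud) |> as.matrix())
mean(probs_sub > 0.5)     # Share of test transactions now flagged as fraudulent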
Oversampling with SMOTE
Below, we resort to one implementation of Synthetic Minority Oversampling Technique (SMOTE).
library(SMOTEWB)
train_smote <- SMOTE(x = fraud_train |> select(-is_fraud),
                     y = fraud_train$is_fraud |> as.factor(),
                     k = 5)
head(train_smote$x_new[,1:9])
amt gender lat long city_pop merch_lat merch_long
[1,] 283.5551 2.000000 35.97082 -82.74856 886.6038 36.41789 -82.23208
[2,] 323.5162 2.000000 35.58993 -99.11577 912.2892 36.22188 -99.09008
[3,] 320.0837 2.000000 35.62265 -97.70989 910.0829 36.23872 -97.64204
[4,] 309.6495 2.000000 35.72210 -93.43626 903.3762 36.28990 -93.24026
[5,] 293.0703 1.819313 37.16070 -84.15612 885.0000 37.33711 -83.84679
[6,] 288.7725 1.870660 36.89042 -82.93488 884.0946 37.25846 -82.45607
.data_entertainment .data_food_dining
[1,] 0 0
[2,] 0 0
[3,] 0 0
[4,] 0 0
[5,] 0 0
[6,] 0 0
Let’s check the proportion!
summary(train_smote$y_new)
0 1
496959 496959
In this case, the data is perfectly balanced, which may be too strong a requirement…
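To train on the SMOTE output, the resampled features and labels can be reassembled into a new training set. A minimal sketch based on the x_new and y_new fields shown above (fraud_train_smote is a new name introduced here):
fraud_train_smote <- train_smote$x_new |>
  as.data.frame() |>
  mutate(is_fraud = as.numeric(as.character(train_smote$y_new)))  # Factor levels "0"/"1" back to numeric
mean(fraud_train_smote$is_fraud)  # Should now be 0.5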
Neural networks with Keras
NOTE: TensorFlow and M1 Macs don't go too well together. Read the following post for more on the matter.
The chunk below follows the above procedure. If you do not work on an M1 Mac, please skip it.
library(reticulate)
use_condaenv("r-tensorflow") # This is for M1 Macs with special conda environment
Next, we move to neural networks. In this case, categorical variables must be re-coded.
Below, the target will be split in two binary columns (fraud versus no-fraud).
This is a bit heavy though…
library(keras)
#install_keras() # You need to run this the first time
nn_train_target <- dummy_cols(fraud_train$is_fraud, remove_selected_columns = T) |> as.matrix()
nn_test_target  <- dummy_cols(fraud_test$is_fraud, remove_selected_columns = T) |> as.matrix()
In Keras, there are also many options available. Below, we resort to:
- initializers both for weights and biases;
- penalizations (via regularizers)
- dropout: when some units are removed to mitigate overfitting
model <- keras_model_sequential()
model %>%  # This defines the structure of the network, i.e. how layers are organized
  layer_dense(units = 32,
              activation = 'relu',
              input_shape = ncol(fraud_train) - 1) %>%        # Size of input (compulsory for 1st layer)
  layer_dense(units = 16, activation = 'tanh',                # Nb units & activation
              kernel_initializer = "random_normal",           # Initialization of weights
              kernel_constraint = constraint_nonneg()) %>%    # Weights should be nonneg
  layer_dropout(rate = 0.25) %>%                              # Dropping out 25% of units
  layer_dense(units = 8, activation = 'elu',                  # Nb units & activation
              bias_initializer = initializer_constant(0.2),   # Initialization of biases
              kernel_regularizer = regularizer_l2(0.01)) %>%  # Penalization of weights
  layer_dense(units = 2, activation = 'softmax')              # Softmax for categorical output
Below, we use Adam optimization for learning (gradient descent).
model %>% compile(                     # Model specification
  loss = 'binary_crossentropy',        # Loss function
  optimizer = optimizer_adam(learning_rate = 0.005,  # Optimisation method (weight updating)
                             beta_1 = 0.9,           # Parameter for Adam
                             beta_2 = 0.95),         # Parameter for Adam
  metrics = c('categorical_accuracy')  # Output metric
)
summary(model)  # Model structure
Model: "sequential"
________________________________________________________________________________
Layer (type) Output Shape Param #
================================================================================
dense_3 (Dense) (None, 32) 704
dense_2 (Dense) (None, 16) 528
dropout (Dropout) (None, 16) 0
dense_1 (Dense) (None, 8) 136
dense (Dense) (None, 2) 18
================================================================================
Total params: 1,386
Trainable params: 1,386
Non-trainable params: 0
________________________________________________________________________________
We also resort to early stopping here: this can be very useful for large models that require long training times.
fit_NN <- model %>%
  fit(fraud_train |> select(-is_fraud) |> data.matrix(),  # Training features
      nn_train_target,                                     # Training labels
      epochs = 20, batch_size = 2048,                      # Training parameters
      validation_data = list(fraud_test |> select(-is_fraud) |> data.matrix(),
                             nn_test_target),              # Test data
      verbose = 0,                                         # No comments from algo
      callbacks = list(
        callback_early_stopping(monitor = "val_loss",      # Early stopping:
                                min_delta = 0.001,         # Improvement threshold
                                patience = 4,              # Nb epochs with no improvement
                                verbose = 2                # Warning policy while training
        )
      )
  )
plot(fit_NN) + theme_light()
In order to save time, learning has stopped early…
Let’s fetch some predictions below.
preds <- predict(model, fraud_test[1:700,] |> select(-is_fraud) |> data.matrix())
And show them…
round(preds[1:30,])
[,1] [,2]
[1,] 1 0
[2,] 1 0
[3,] 1 0
[4,] 1 0
[5,] 1 0
[6,] 1 0
[7,] 1 0
[8,] 1 0
[9,] 1 0
[10,] 1 0
[11,] 1 0
[12,] 1 0
[13,] 1 0
[14,] 1 0
[15,] 1 0
[16,] 1 0
[17,] 1 0
[18,] 1 0
[19,] 1 0
[20,] 1 0
[21,] 1 0
[22,] 1 0
[23,] 1 0
[24,] 1 0
[25,] 1 0
[26,] 1 0
[27,] 1 0
[28,] 1 0
[29,] 1 0
[30,] 1 0
How many frauds are detected in the test set?
sum(round(preds[, 2]) == 1)
[1] 0
sum(fraud_test$is_fraud)
[1] 4465
Ouch… the algorithm makes the simplest bet: that there is no fraud at all.
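One common remedy is to re-weight the loss so that the rare fraud class counts for much more, via the class_weight argument of fit(). The sketch below is illustrative only: the weight of 100 is arbitrary and fit_NN_w is a new name introduced here.
fit_NN_w <- model %>%
  fit(fraud_train |> select(-is_fraud) |> data.matrix(),  # Training features
      nn_train_target,                                     # Training labels
      epochs = 20, batch_size = 2048,
      class_weight = list("0" = 1, "1" = 100),             # Arbitrary: errors on frauds cost 100x more
      validation_data = list(fraud_test |> select(-is_fraud) |> data.matrix(),
                             nn_test_target),
      verbose = 0)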
Interpretability
Finance is often a critical discipline because it involves money.
Hence, when people rely on a black-box algorithm for decision-making, they usually prefer to understand why the algorithm comes to particular conclusions.
To do so, we need to “white-box” the outcome of the predictions.
Of course, there are many ways to do that, but there is one (among others) important dichotomy:
- global interpretability: in this case, we seek to understand how the model works on a large set of observations, for instance, on the whole training set.
- local interpretability: in this case, the focus is set on one (or a few) observations and the aim is to understand how the model behaves for this particular point.
Global interpretability can also be either model-specific or model-independent.
For instance, for tree models, variable importance is one way to understand when a model gives more (decision) weight to particular features. The process is the following: each time a feature is selected as a splitting variable, the resulting gain (in loss reduction) is allocated to this feature. Once a tree is grown, each feature thus has a total importance, i.e., the sum of all the gains it has permitted. With random forests or boosted trees, it then suffices to aggregate these values across trees (possibly discounted by a learning rate).
All tree-based packages propose these kinds of metrics. Below, we show the result from the lightGBM model that we trained earlier.
lgb.importance(bst) |>
  ggplot(aes(x = Gain, y = reorder(Feature, Gain))) +
  geom_col(fill = "#22AABB", alpha = 0.7) +
  theme_bw() +
  theme(axis.title.y = element_blank())
The amount seems to be an important variable in the model.
Clearly importance is concentrated in a handful of features…
In lightGBM, a form of local interpretation is also available!
LGB_intepret <- lgb.interprete(bst, fraud_test[1:700,] |> select(-is_fraud) |> data.matrix(), 1:2)
LGB_intepret
[[1]]
Feature Contribution
1: .data_misc_net -1.152586742
2: .data_grocery_pos -0.312105320
3: .data_shopping_net -0.199847327
4: amt -0.147336306
5: .data_gas_transport -0.074064349
6: gender 0.031763639
7: city_pop 0.010097187
8: .data_shopping_pos -0.009884630
9: lat -0.008081910
10: merch_lat -0.007441658
11: .data_kids_pets -0.006353981
12: merch_long -0.004323646
13: .data_misc_pos -0.003758473
14: long 0.002902084
15: .data_travel -0.002307757
[[2]]
Feature Contribution
1: .data_gas_transport 0.490558257
2: .data_grocery_pos -0.330355672
3: .data_shopping_net -0.199847327
4: amt -0.168985618
5: gender -0.067639874
6: lat -0.050716066
7: merch_lat -0.026985780
8: city_pop -0.022459477
9: .data_misc_net 0.021629711
10: .data_home 0.005141175
11: .data_grocery_net -0.003246968
12: .data_shopping_pos 0.002371183
lgb.plot.interpretation(
  tree_interpretation_dt = LGB_intepret[[1L]],
  top_n = 8
)
Let’s try to make sense of this with actual numbers.
The average prediction over the testing sample is 0.0074, whereas the prediction for this first observation is about \(9.3 \times 10^{-4}\). The bars above show which features contribute most to the discrepancy between these two values. This interpretation is in the spirit of Shapley values.
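The two numbers quoted above can be recomputed directly (probs_test is a new name introduced for this check):
probs_test <- predict(bst, fraud_test |> select(-is_fraud) |> as.matrix())
mean(probs_test)  # Average prediction over the test sample (about 0.0074)
probs_test[1]     # Prediction for the first test observation (about 9.3e-04)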