Fraud detection

WARNING: this notebook relies on several R packages. Whenever one is called (for instance via a call such as library(torch)), it is assumed that the package has already been installed, e.g., via install.packages("torch"). Hence, make sure the following packages are installed!

# install.packages("tidyverse") 
# install.packages(c("data.table", "fastDummies", "lightgbm", "reticulate", "keras"))

NOTE: in data science, one of the most crucial competences is domain knowledge.
Data is key (more than any algorithm), and domain knowledge helps understand how it is built and which features are likely to matter most.

Introduction

Principles

Quite obviously, the purpose of fraud detection is to identify illicit transactions.
This is usually done via the characteristics of the transaction, which can be split into several categories: the nature of the transaction (e.g., type (credit card versus online), location, timing or amount), and the properties of the parties: history of transactions, location, gender of emitter, trustworthiness of receiver, etc.

Hence, like for credit risk, the main task will be essentially classification (supervised learning), where we use the characteristics of the transactions to predict if they are fraudulent or not. However, with fraud data, the imbalance is even more pronounced, because usually less than 1% of transactions are tagged as fraudulent. This means that most of the interesting information comes from a very small portion of the data!

Note that fraud is widespread and affects many domains, insurance in particular (see Section 6 of this paper for examples). The tools used to detect fraud (or anomalies) in other contexts are quite similar (supervised learning).

Short list of free references

There are several books available on the topic, but the following one, with code snippets in Python, is probably the most applied (with PyTorch examples to illustrate neural networks):
- Reproducible Machine Learning for Credit Card Fraud detection - Practical handbook

Two recent survey articles:

Data wrangling

First, let us load the data & the packages.
The data originates from a Kaggle competition and we only keep the training data for which the target/label is available.
The file is also available on this drive link.
→ Please download it and place it in the same folder as the notebook.

library(tidyverse)
library(data.table)         # To import csv sheets rapidly
library(fastDummies)        # For dummies
fraud <- fread("fraud.csv")
head(fraud)
V1 trans_date_trans_time cc_num merchant category amt first last gender street city state zip lat long city_pop job dob trans_num unix_time merch_lat merch_long is_fraud
0 2019-01-01 00:00:18 2703186189652095 fraud_Rippin, Kub and Mann misc_net 4.97 Jennifer Banks F 561 Perry Cove Moravian Falls NC 28654 36.0788 -81.1781 3495 Psychologist, counselling 1988-03-09 0b242abb623afc578575680df30655b9 1325376018 36.01129 -82.04832 0
1 2019-01-01 00:00:44 630423337322 fraud_Heller, Gutmann and Zieme grocery_pos 107.23 Stephanie Gill F 43039 Riley Greens Suite 393 Orient WA 99160 48.8878 -118.2105 149 Special educational needs teacher 1978-06-21 1f76529f8574734946361c461b024d99 1325376044 49.15905 -118.18646 0
2 2019-01-01 00:00:51 38859492057661 fraud_Lind-Buckridge entertainment 220.11 Edward Sanchez M 594 White Dale Suite 530 Malad City ID 83252 42.1808 -112.2620 4154 Nature conservation officer 1962-01-19 a1a22d70485983eac12b5b88dad1cf95 1325376051 43.15070 -112.15448 0
3 2019-01-01 00:01:16 3534093764340240 fraud_Kutch, Hermiston and Farrell gas_transport 45.00 Jeremy White M 9443 Cynthia Court Apt. 038 Boulder MT 59632 46.2306 -112.1138 1939 Patent attorney 1967-01-12 6b849c168bdad6f867558c3793159a81 1325376076 47.03433 -112.56107 0
4 2019-01-01 00:03:06 375534208663984 fraud_Keeling-Crist misc_pos 41.96 Tyler Garcia M 408 Bradley Rest Doe Hill VA 24433 38.4207 -79.4629 99 Dance movement psychotherapist 1986-03-28 a41d7549acf90789359a9aa5346dcb46 1325376186 38.67500 -78.63246 0
5 2019-01-01 00:04:08 4767265376804500 fraud_Stroman, Hudson and Erdman gas_transport 94.63 Jennifer Conner F 4655 David Island Dublin PA 18917 40.3750 -75.2045 2158 Transport planner 1961-06-19 189a841a0a8ba03058526bcfe566aab5 1325376248 40.65338 -76.15267 0

The first three columns probably won’t be very useful (this is a shortcut though: time can be predictive, but it would need to be transformed, e.g., into days and hours).
The merchant name can also be cleaned: it carries a “fraud_” prefix that we remove below.

fraud <- fraud[, -(1:3)]
fraud <- fraud |>
  select(-trans_num, -first, -last) |>
  mutate(merchant = merchant |> str_remove("fraud_"))
fraud |> head(4)
merchant category amt gender street city state zip lat long city_pop job dob unix_time merch_lat merch_long is_fraud
Rippin, Kub and Mann misc_net 4.97 F 561 Perry Cove Moravian Falls NC 28654 36.0788 -81.1781 3495 Psychologist, counselling 1988-03-09 1325376018 36.01129 -82.04832 0
Heller, Gutmann and Zieme grocery_pos 107.23 F 43039 Riley Greens Suite 393 Orient WA 99160 48.8878 -118.2105 149 Special educational needs teacher 1978-06-21 1325376044 49.15905 -118.18646 0
Lind-Buckridge entertainment 220.11 M 594 White Dale Suite 530 Malad City ID 83252 42.1808 -112.2620 4154 Nature conservation officer 1962-01-19 1325376051 43.15070 -112.15448 0
Kutch, Hermiston and Farrell gas_transport 45.00 M 9443 Cynthia Court Apt. 038 Boulder MT 59632 46.2306 -112.1138 1939 Patent attorney 1967-01-12 1325376076 47.03433 -112.56107 0

There remains a lot to do… There are too many distinct merchants and cities, so for simplicity we choose to remove them. Instead, we keep the category column, which we transform via one-hot encoding: each categorical value is coded into a new binary column (see below).
In terms of location, zip is a number but can be misleading (numeric proximity does not reliably reflect geographic proximity), so we only keep latitude & longitude.
NOTE: the one-hot encoding below can take some time

fraud <- fraud |> 
  select(-merchant, -street, -city, -state, -zip, -dob, -unix_time, -job) |>
  mutate(gender = gender |> as.factor() |> as.numeric())
dummy_cat <- dummy_cols(fraud$category)  # One-hot encoding
fraud <- fraud |>
  select(-category) |>                   # Removes the "category" column
  bind_cols(dummy_cat |> select(-1))     # Adds the equivalent dummy columns
fraud |> head(3)
amt gender lat long city_pop merch_lat merch_long is_fraud .data_entertainment .data_food_dining .data_gas_transport .data_grocery_net .data_grocery_pos .data_health_fitness .data_home .data_kids_pets .data_misc_net .data_misc_pos .data_personal_care .data_shopping_net .data_shopping_pos .data_travel
4.97 1 36.0788 -81.1781 3495 36.01129 -82.04832 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
107.23 1 48.8878 -118.2105 149 49.15905 -118.18646 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
220.11 2 42.1808 -112.2620 4154 43.15070 -112.15448 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

Before the models, let’s evaluate the imbalance of the data…

mean(fraud$is_fraud)
[1] 0.005788652

0.6% of positive cases: this is very low. Again: a large chunk of the information seems to be contained in a very small fraction of observations…

A first lightGBM model

Data splitting

In this section, we fit a boosted tree from the lightGBM package.
We split the data in two (train and test); for simplicity, we won't use a separate validation set. Since the rows are ordered by transaction time, this is effectively a chronological split.

library(lightgbm)
fraud_train <- fraud[1:500000,]               
fraud_test <- fraud[(1+500000):nrow(fraud),]
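
As a quick sanity check (a small sketch, relying only on the objects defined above), we can verify that the fraud rate is similarly low in both subsets:

c(train = mean(fraud_train$is_fraud),  # Fraud rate in the training set
  test  = mean(fraud_test$is_fraud))   # Fraud rate in the test set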

Next, we define the model parameters. The full list is gathered here.
Good tips regarding parameter tuning are also provided on this page.

train_params <- list(
  num_leaves = 15,        # Max nb leaves in tree
  learning_rate = 0.1,    # Learning rate
  objective = "binary",   # Loss function
  max_depth = 4,          # Max depth of trees
  min_data_in_leaf = 50,  # Nb points in leaf
  bagging_fraction = 0.5, # % of observations
  feature_fraction = 0.7, # % of features
  nthread = 2
)

Training

Onto training! The syntax is very simple.

bst <- lightgbm(
  data = fraud_train |> select(-is_fraud) |> as.matrix(),
  label = fraud_train$is_fraud, # Target / label
  params = train_params,        # Passing parameter values
  nrounds = 20                  # Number of trees in the model
)
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.016566 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]:  train's binary_logloss:0.0367439"
[1] "[2]:  train's binary_logloss:0.0217458"
[1] "[3]:  train's binary_logloss:0.0208932"
[1] "[4]:  train's binary_logloss:0.0175729"
[1] "[5]:  train's binary_logloss:0.0171861"
[1] "[6]:  train's binary_logloss:0.0166766"
[1] "[7]:  train's binary_logloss:0.0162691"
[1] "[8]:  train's binary_logloss:0.015946"
[1] "[9]:  train's binary_logloss:0.0157294"
[1] "[10]:  train's binary_logloss:0.0152117"
[1] "[11]:  train's binary_logloss:0.0148635"
[1] "[12]:  train's binary_logloss:0.0147105"
[1] "[13]:  train's binary_logloss:0.0144223"
[1] "[14]:  train's binary_logloss:0.0142859"
[1] "[15]:  train's binary_logloss:0.0141833"
[1] "[16]:  train's binary_logloss:0.0137635"
[1] "[17]:  train's binary_logloss:0.0135653"
[1] "[18]:  train's binary_logloss:0.0133968"
[1] "[19]:  train's binary_logloss:0.0131941"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[20]:  train's binary_logloss:0.0130456"

Predicting

Next, we ask for the predictions (fraud probabilities) of the first 700 rows of the test set, rounded at the 0.5 threshold.

predict(bst, fraud_test[1:700,] |> select(-is_fraud) |> as.matrix()) |> round()
  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
[149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[297] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[408] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[630] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
[667] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
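
The rounded predictions are almost all zero, so they are hard to read in this form. A minimal sketch of a confusion matrix over the whole test set (assuming bst and fraud_test as defined above; the 0.5 threshold is an arbitrary choice):

probs <- predict(bst, fraud_test |> select(-is_fraud) |> as.matrix())
table(predicted = as.integer(probs > 0.5),  # Predicted class from a 0.5 cutoff
      actual    = fraud_test$is_fraud)      # True labels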

Other parameter options

Modern libraries in machine learning allow for many options (again, see the parameter list).

Let’s have a look at several extensions:

  • dropout: adapted from neural networks, the idea is to drop some trees, more or less randomly, during the boosting iterations.
  • monotonicity: for some features, the modeller may wish to impose a directional link with the target (e.g., increasing or decreasing).
  • regularization: penalizing leaf values to reduce the variance of predictions (and overfitting).
  • early stopping: now implemented in most libraries, it can save time by halting training when the algorithm stops improving after a few rounds.

We adapt the set of parameters below:

nb_feat <- ncol(fraud_train |> select(-is_fraud)) # Nb features
mono_const <- rep(0, nb_feat)
mono_const[1] <- 1             # First feature is amt: predicted fraud risk must not decrease with amount
# If you have a prior on gender, it's also possible to enforce a constraint

train_params <- list(
  num_leaves = 15,           # Max nb leaves in tree
  learning_rate = 0.1,       # Learning rate
  objective = "binary",      # Loss function
  max_depth = 4,             # Max depth of trees
  min_data_in_leaf = 50,     # Nb points in leaf
  bagging_fraction = 0.5,    # % of observations
  feature_fraction = 0.7,    # % of features
  nthread = 4,               # Parallelization
  boosting = "dart",         # DART = dropping
  drop_rate = 0.1,           # Dropping rate
  lambda_l1 = 0.3,           # L1 penalization of leaf values
  seed = 42,                 # For reproducibility
  # early stopping is not available with DART
  #early_stopping_round = 10, # Early stopping after X round if no improvement
  monotone_constraints = mono_const,
  force_row_wise = T
)

New round of training.

bst <- lightgbm(
  data = fraud_train |> select(-is_fraud) |> as.matrix(),
  label = fraud_train$is_fraud, # Target / label
  params = train_params,        # Passing parameter values
  nrounds = 20                  # Number of trees in the model
)
Warning in (function (params = list(), data, nrounds = 100L, valids = list(), :
Early stopping is not available in 'dart' mode.
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[1]:  train's binary_logloss:0.0269181"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[2]:  train's binary_logloss:0.0250706"
[1] "[3]:  train's binary_logloss:0.0238056"
[1] "[4]:  train's binary_logloss:0.0229725"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[5]:  train's binary_logloss:0.022269"
[1] "[6]:  train's binary_logloss:0.0213811"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[7]:  train's binary_logloss:0.0215642"
[1] "[8]:  train's binary_logloss:0.0207618"
[1] "[9]:  train's binary_logloss:0.0204714"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[10]:  train's binary_logloss:0.0207381"
[1] "[11]:  train's binary_logloss:0.02036"
[1] "[12]:  train's binary_logloss:0.0205059"
[1] "[13]:  train's binary_logloss:0.0203041"
[1] "[14]:  train's binary_logloss:0.0201755"
[1] "[15]:  train's binary_logloss:0.0198653"
[1] "[16]:  train's binary_logloss:0.0196283"
[1] "[17]:  train's binary_logloss:0.0196991"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[18]:  train's binary_logloss:0.0193787"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[19]:  train's binary_logloss:0.0192463"
[1] "[20]:  train's binary_logloss:0.0192584"

The training loss is higher, which makes sense given the additional constraints (dropout, monotonicity, regularization) we have imposed.

Cross-validation

What if we want to evaluate results on alternative subsets of the data? This is what k-fold cross-validation does (here, with 5 folds).

cv_model <- lgb.cv(
  params = train_params,
  data = fraud_train |> select(-is_fraud) |> as.matrix(),
  label = fraud_train$is_fraud, # Target / label
  eval_freq = 80,
  nrounds = 3,                  # Still number of trees
  nfold = 5
)
[LightGBM] [Info] Number of positive: 2404, number of negative: 397596
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 400000, number of used features: 21
[LightGBM] [Info] Number of positive: 2473, number of negative: 397528
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 400001, number of used features: 21
[LightGBM] [Info] Number of positive: 2427, number of negative: 397573
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 400000, number of used features: 21
[LightGBM] [Info] Number of positive: 2446, number of negative: 397554
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 400000, number of used features: 21
[LightGBM] [Info] Number of positive: 2414, number of negative: 397585
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 399999, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006010 -> initscore=-5.108302
[LightGBM] [Info] Start training from score -5.108302
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006182 -> initscore=-5.079833
[LightGBM] [Info] Start training from score -5.079833
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006067 -> initscore=-5.098723
[LightGBM] [Info] Start training from score -5.098723
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006115 -> initscore=-5.090877
[LightGBM] [Info] Start training from score -5.090877
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006035 -> initscore=-5.104124
[LightGBM] [Info] Start training from score -5.104124
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[1]:  valid's binary_logloss:0.0269293+0.000709862"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[3]:  valid's binary_logloss:0.0238533+0.000725518"
cv_model$record_evals
$start_iter
[1] 1

$valid
$valid$binary_logloss
$valid$binary_logloss$eval
$valid$binary_logloss$eval[[1]]
[1] 0.02692933

$valid$binary_logloss$eval[[2]]
[1] 0.02509091

$valid$binary_logloss$eval[[3]]
[1] 0.02385325


$valid$binary_logloss$eval_err
$valid$binary_logloss$eval_err[[1]]
[1] 0.0007098624

$valid$binary_logloss$eval_err[[2]]
[1] 0.0007349295

$valid$binary_logloss$eval_err[[3]]
[1] 0.0007255183
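
The nested record_evals list is not very handy. A small sketch (assuming cv_model from above) to gather the per-round validation loss into a tibble:

tibble(
  round      = seq_along(cv_model$record_evals$valid$binary_logloss$eval),
  logloss    = unlist(cv_model$record_evals$valid$binary_logloss$eval),     # Mean loss across the 5 folds
  logloss_sd = unlist(cv_model$record_evals$valid$binary_logloss$eval_err)  # Dispersion across folds
)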

Testing several parameter values

Ok, but what if we want to test several parameter combinations?
With the tidymodels ecosystem, this is possible via dedicated functions from the tune package, but let’s see if we can code it ourselves with functional programming (the purrr package, loaded with the tidyverse).

First, let’s create the combinations.

num_leaves <- c(5, 30)                          # Candidate values
learning_rate <- c(0.01, 0.05, 0.2)             # Candidate values
pars <- expand.grid(num_leaves, learning_rate)  # All 6 combinations
num_leaves <- pars[, 1]                         # Column of num_leaves values
learning_rate <- pars[, 2]                      # Column of learning_rate values
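
Note that pars now gathers the 2 × 3 = 6 combinations, and the vectors num_leaves and learning_rate have been overwritten with its columns (six values each). A quick look, just to be sure:

pars   # All (num_leaves, learning_rate) combinations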

Next, the training function. Note that some parameters (num_leaves, learning_rate) are left free, while the others are fixed.

train_func <- function(num_leaves, learning_rate, fraud_train){
  train_params <- list(             # First, the list of params
    num_leaves = num_leaves,        # Max nb leaves in tree
    learning_rate = learning_rate,  # Learning rate
    objective = "binary",           # Loss function
    max_depth = 3,                  # Max depth of trees
    min_data_in_leaf = 50,          # Nb points in leaf
    bagging_fraction = 0.5,         # % of observations
    feature_fraction = 0.7,         # % of features
    nthread = 4,                    # Parallelization
    force_row_wise = T
  )
  # Next we train
  bst <- lightgbm(
    data = fraud_train |> select(-is_fraud) |> as.matrix(),
    label = fraud_train$is_fraud, # Target / label
    params = train_params,        # Passing parameter values
    eval_freq = 50,
    nrounds = 10                  # Number of trees in the model
  )
  # Next, we record the final loss (depends on the model/loss defined above)
  return(loss = bst$record_evals$train$binary_logloss$eval[[10]]) 
}
train_func(10, 0.1, fraud_train) # Testing
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]:  train's binary_logloss:0.0368203"
[1] "[10]:  train's binary_logloss:0.0155729"
[1] 0.01557292

Finally, we can launch the training function over all parameter combinations with pmap().

grd <- pmap(list(num_leaves, learning_rate),     # Parameters for the grid search
            train_func,                          # Function on which to apply the grid search
            fraud_train = fraud_train            # Non-changing argument (data is fixed)
)
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]:  train's binary_logloss:0.0370693"
[1] "[10]:  train's binary_logloss:0.0285605"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]:  train's binary_logloss:0.0370674"
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] "[10]:  train's binary_logloss:0.0271148"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]:  train's binary_logloss:0.0369637"
[1] "[10]:  train's binary_logloss:0.0172955"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]:  train's binary_logloss:0.0369543"
[1] "[10]:  train's binary_logloss:0.0168124"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]:  train's binary_logloss:0.0366141"
[1] "[10]:  train's binary_logloss:0.0672663"
[LightGBM] [Info] Number of positive: 3041, number of negative: 496959
[LightGBM] [Info] Total Bins 1551
[LightGBM] [Info] Number of data points in the train set: 500000, number of used features: 21
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.006082 -> initscore=-5.096321
[LightGBM] [Info] Start training from score -5.096321
[1] "[1]:  train's binary_logloss:0.0365813"
[1] "[10]:  train's binary_logloss:0.0732902"
grd <- bind_cols(pars, tibble(loss = grd))
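
To single out the best combination, one possibility (a sketch; Var1 and Var2 are the default column names produced by expand.grid) is to rename the columns and sort by the final loss:

grd |>
  rename(num_leaves = Var1, learning_rate = Var2) |>
  mutate(loss = unlist(loss)) |>   # The losses were stored in a list-column
  arrange(loss)                    # Lowest training loss first

Keep in mind that ranking on the training loss favours the most flexible models; using the cross-validation loss instead would be more meaningful.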

Manual random resampling

The problem with fraud datasets is that they are very imbalanced: in our sample, fewer than 1% of observations are actual frauds… This complicates the learning phase. Hence, it can make sense to resample the training data to reduce the imbalance: below, we build a smaller training sample in which frauds represent 5% of observations.
Let’s see how we can proceed.

We will rely on the sample() function, illustrated below.

sample(10, 5, replace = T) # If true, the same number can be chosen several times
[1] 9 4 7 5 2

We first sample the fraud rows.

# First, we locate the fraudulent observations
fraud_ind <- which(fraud_train$is_fraud == 1)
sample_size <- 10000
ratio <- 0.05
# Then, draw a subsample of fraud instances (5% of the new sample)
fraud_ind <- fraud_ind[sample(1:length(fraud_ind), 
                              sample_size*ratio, 
                              replace = F)]
fraud_sub <- fraud_train[fraud_ind,]
fraud_sub |> head(3)
amt gender lat long city_pop merch_lat merch_long is_fraud .data_entertainment .data_food_dining .data_gas_transport .data_grocery_net .data_grocery_pos .data_health_fitness .data_home .data_kids_pets .data_misc_net .data_misc_pos .data_personal_care .data_shopping_net .data_shopping_pos .data_travel
1016.88 1 35.1506 -106.5690 641349 35.55577 -105.85035 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
19.02 1 38.9311 -89.2463 1810 38.64232 -89.02502 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
965.64 1 41.2639 -80.8164 75903 42.25914 -80.96991 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0

Then, we sample the non-fraud observations and aggregate the two subsets.

non_fraud_ind <- which(fraud_train$is_fraud == 0)
non_fraud_ind <- non_fraud_ind[sample(1:length(non_fraud_ind), 
                              sample_size*(1-ratio), 
                              replace = F)]
non_fraud_sub <- fraud_train[non_fraud_ind,]
fraud_train_sub <- bind_rows(fraud_sub, non_fraud_sub)

And just checking:

mean(fraud_train_sub$is_fraud)
[1] 0.05
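
The rebalanced subsample can then be used for training exactly as before. A minimal sketch (assuming fraud_train_sub, train_params and fraud_test as defined above); whether this actually improves test-set detection remains to be checked:

bst_sub <- lightgbm(
  data = fraud_train_sub |> select(-is_fraud) |> as.matrix(),
  label = fraud_train_sub$is_fraud,  # Target / label
  params = train_params,             # Same parameters as above
  nrounds = 20
)
probs_sub <- predict(bst_sub, fraud_test |> select(-is_fraud) |> as.matrix())
table(predicted = as.integer(probs_sub > 0.5), actual = fraud_test$is_fraud)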

Oversampling with SMOTE

Below, we resort to one implementation of the Synthetic Minority Oversampling Technique (SMOTE): synthetic fraud observations are created by interpolating between each fraud case and its k nearest fraudulent neighbours.

library(SMOTEWB)
train_smote <- SMOTE(x = fraud_train |> select(-is_fraud), 
      y = fraud_train$is_fraud |> as.factor(), 
      k = 5)
head(train_smote$x_new[,1:9])
          amt   gender      lat      long city_pop merch_lat merch_long
[1,] 283.5551 2.000000 35.97082 -82.74856 886.6038  36.41789  -82.23208
[2,] 323.5162 2.000000 35.58993 -99.11577 912.2892  36.22188  -99.09008
[3,] 320.0837 2.000000 35.62265 -97.70989 910.0829  36.23872  -97.64204
[4,] 309.6495 2.000000 35.72210 -93.43626 903.3762  36.28990  -93.24026
[5,] 293.0703 1.819313 37.16070 -84.15612 885.0000  37.33711  -83.84679
[6,] 288.7725 1.870660 36.89042 -82.93488 884.0946  37.25846  -82.45607
     .data_entertainment .data_food_dining
[1,]                   0                 0
[2,]                   0                 0
[3,]                   0                 0
[4,]                   0                 0
[5,]                   0                 0
[6,]                   0                 0

Let’s check the proportion!

summary(train_smote$y_new)
     0      1 
496959 496959 

In this case, the data is perfectly balanced, which may be too strong a requirement…
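
To re-use this output with lightGBM, the new features and labels must be put back into a matrix and a numeric vector. A small sketch (assuming train_smote as above; x_new is already numeric and y_new is a factor with levels "0"/"1"):

smote_x <- train_smote$x_new |> as.matrix()                     # Features (original + synthetic)
smote_y <- train_smote$y_new |> as.character() |> as.numeric()  # Factor "0"/"1" -> numeric 0/1
c(nrow(smote_x), mean(smote_y))                                 # Sample size and fraud proportion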

Neural networks with Keras

NOTE: TensorFlow and M1 (Apple-silicon) Macs don’t always play well together. Read the following post for more on the matter.
The chunk below follows the procedure from that post. If you do not work on an M1 Mac, please skip it.

library(reticulate)
use_condaenv("r-tensorflow") # This is for M1 Macs with special conda environment

Next, we move to neural networks. In this case, the target must be re-coded: below, it is split into two binary columns (fraud versus no-fraud).
This is a bit heavy, though…

library(keras)
#install_keras()  # You need to run this the first time
nn_train_target <- dummy_cols(fraud_train$is_fraud, remove_selected_columns = T) |> as.matrix()
nn_test_target <- dummy_cols(fraud_test$is_fraud, remove_selected_columns = T) |> as.matrix()

In Keras, there are also many options available. Below, we resort to:

  • initializers, both for weights and biases;
  • penalizations (via regularizers);
  • dropout: some units are randomly dropped during training to mitigate overfitting.

model <- keras_model_sequential()
model %>%   # This defines the structure of the network, i.e. how layers are organized
  layer_dense(units = 32, 
              activation = 'relu', 
              input_shape = ncol(fraud_train) - 1) %>%       # Size of input (compulsory for 1st layer)
  layer_dense(units = 16, activation = 'tanh',               # Nb units & activation
              kernel_initializer = "random_normal",          # Initialization of weights
              kernel_constraint = constraint_nonneg()) %>%   # Weights should be nonneg
  layer_dropout(rate = 0.25) %>%                             # Dropping out 25% units
  layer_dense(units = 8, activation = 'elu',                 # Nb units & activation
              bias_initializer = initializer_constant(0.2),  # Initialization of biases
              kernel_regularizer = regularizer_l2(0.01)) %>% # Penalization of weights 
  layer_dense(units = 2, activation = 'softmax')             # Softmax for categorical output

Below, we use the Adam optimizer for learning (a variant of stochastic gradient descent).

model %>% compile(                                  # Model specification
  loss = 'binary_crossentropy',                     # Loss function
  optimizer = optimizer_adam(learning_rate = 0.005, # Optimisation method (weight updating)
                             beta_1 = 0.9,          # Parameter for Adam
                             beta_2 = 0.95),        # Parameter for Adam 
  metrics = c('categorical_accuracy')               # Output metric
)
summary(model)                                      # Model structure
Model: "sequential"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_3 (Dense)                    (None, 32)                      704         
 dense_2 (Dense)                    (None, 16)                      528         
 dropout (Dropout)                  (None, 16)                      0           
 dense_1 (Dense)                    (None, 8)                       136         
 dense (Dense)                      (None, 2)                       18          
================================================================================
Total params: 1,386
Trainable params: 1,386
Non-trainable params: 0
________________________________________________________________________________

We also resort to early stopping here: this can be very useful for large models that require long training times.

fit_NN <- model %>% 
  fit(fraud_train |> select(-is_fraud) |> data.matrix(), # Training features
      nn_train_target,                                   # Training labels
      epochs = 20, batch_size = 2048,                    # Training parameters
      validation_data = list(fraud_test |> select(-is_fraud) |> data.matrix(), 
                             nn_test_target),            # Test data
      verbose = 0,                                       # No comments from algo
      callbacks = list(
        callback_early_stopping(monitor = "val_loss",    # Early stopping:
                                min_delta = 0.001,       # Improvement threshold
                                patience = 4,            # Nb epochs with no improvement 
                                verbose = 2              # Warning policy while training
        )
      )
  )
plot(fit_NN) + theme_light()

Thanks to the early-stopping callback, learning stopped before the 20 epochs were completed, which saves time…

Let’s fetch some predictions below.

preds <- predict(model, fraud_test[1:700,] |> select(-is_fraud) |> data.matrix())

And show them…

round(preds[1:30,])
      [,1] [,2]
 [1,]    1    0
 [2,]    1    0
 [3,]    1    0
 [4,]    1    0
 [5,]    1    0
 [6,]    1    0
 [7,]    1    0
 [8,]    1    0
 [9,]    1    0
[10,]    1    0
[11,]    1    0
[12,]    1    0
[13,]    1    0
[14,]    1    0
[15,]    1    0
[16,]    1    0
[17,]    1    0
[18,]    1    0
[19,]    1    0
[20,]    1    0
[21,]    1    0
[22,]    1    0
[23,]    1    0
[24,]    1    0
[25,]    1    0
[26,]    1    0
[27,]    1    0
[28,]    1    0
[29,]    1    0
[30,]    1    0

How many frauds does the model detect among these first 700 test observations, and how many frauds does the test set actually contain?

sum(round(preds[,2]) == 1)   # Predicted frauds among the first 700 test rows
[1] 0
sum(fraud_test$is_fraud)
[1] 4465

Ouch… the algorithm makes the simplest bet: there is no fraud at all.
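
One common counter-measure, besides the resampling techniques seen earlier, is to re-weight the minority class in the loss via the class_weight argument of fit(). Below is a sketch (not run here); the weight of 170, roughly one over the fraud rate, is an illustrative value to be tuned, and for a fair comparison the model should be re-built (re-initialized) before this second fit.

fit_NN_weighted <- model %>%
  fit(fraud_train |> select(-is_fraud) |> data.matrix(),  # Training features
      nn_train_target,                                    # Training labels
      epochs = 20, batch_size = 2048,
      class_weight = list("0" = 1, "1" = 170),            # Errors on frauds cost ~170x more
      verbose = 0)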

Interpretability

Finance is a critical discipline, if only because money is at stake.
Hence, when people rely on a black-box algorithm for decision-making, they usually prefer to understand why the algorithm reaches particular conclusions.
To do so, we need to “white-box” the predictions.
Of course, there are many ways to do that, but one important dichotomy (among others) is the following:

  • global interpretability: in this case, we seek to understand how the model works on a large set of observations, for instance, on the whole training set.
  • local interpretability: in this case, the focus is set on one (or a few) observations and the aim is to understand how the model behaves for this particular point.

Global interpretability can also be either model-specific or model-agnostic.
For instance, for tree models, variable importance is one way to understand which features receive more (decision) weight in the model. The process is the following: each time a feature is selected as a splitting variable, the resulting gain (in loss reduction) is credited to that feature. Once a tree is grown, each feature thus has a total importance, i.e., the sum of all the gains it has generated. With random forests or boosted trees, it suffices to aggregate these values across trees (possibly discounted by a learning rate).

All tree-based packages offer this kind of metric. Below, we show the result for the lightGBM model that we trained earlier.

lgb.importance(bst) |>
  ggplot(aes(x = Gain, y = reorder(Feature, Gain))) + geom_col(fill = "#22AABB", alpha = 0.7) +
  theme_bw() + theme(axis.title.y = element_blank())

The amount seems to be an important variable in the model.
Clearly, importance is concentrated in a handful of features…
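
As an example of a model-agnostic (global) alternative, here is a sketch of permutation importance: shuffle one feature at a time and measure how much the test-set log loss deteriorates. The helper below is ad hoc (not taken from a package) and the three features are arbitrary examples.

perm_loss <- function(feature, model, test_data){
  shuffled <- as.data.frame(test_data)
  shuffled[[feature]] <- sample(shuffled[[feature]])   # Break the link between this feature and the target
  p <- predict(model, shuffled |> select(-is_fraud) |> as.matrix())
  p <- pmin(pmax(p, 1e-15), 1 - 1e-15)                 # Avoid log(0)
  -mean(shuffled$is_fraud * log(p) + (1 - shuffled$is_fraud) * log(1 - p))
}
sapply(c("amt", "city_pop", "gender"), perm_loss,
       model = bst, test_data = fraud_test)            # Higher loss => more important feature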

In lightGBM, a form of local interpretation is also available, via the lgb.interprete() function!

LGB_interpret <- lgb.interprete(bst, fraud_test[1:700,] |> select(-is_fraud) |> data.matrix(), 1:2)
LGB_interpret
[[1]]
                Feature Contribution
 1:      .data_misc_net -1.152586742
 2:   .data_grocery_pos -0.312105320
 3:  .data_shopping_net -0.199847327
 4:                 amt -0.147336306
 5: .data_gas_transport -0.074064349
 6:              gender  0.031763639
 7:            city_pop  0.010097187
 8:  .data_shopping_pos -0.009884630
 9:                 lat -0.008081910
10:           merch_lat -0.007441658
11:     .data_kids_pets -0.006353981
12:          merch_long -0.004323646
13:      .data_misc_pos -0.003758473
14:                long  0.002902084
15:        .data_travel -0.002307757

[[2]]
                Feature Contribution
 1: .data_gas_transport  0.490558257
 2:   .data_grocery_pos -0.330355672
 3:  .data_shopping_net -0.199847327
 4:                 amt -0.168985618
 5:              gender -0.067639874
 6:                 lat -0.050716066
 7:           merch_lat -0.026985780
 8:            city_pop -0.022459477
 9:      .data_misc_net  0.021629711
10:          .data_home  0.005141175
11:   .data_grocery_net -0.003246968
12:  .data_shopping_pos  0.002371183
lgb.plot.interpretation(
  tree_interpretation_dt = LGB_interpret[[1L]]
  , top_n = 8
) 

Let’s try to make sense of this with actual numbers.
The average prediction over the testing sample is 0.0074, whereas the prediction for the first test observation is roughly 9.2976331 × 10⁻⁴ (≈ 0.00093). The bars above show which features contribute the most to the discrepancy between these two values. This interpretation is in the spirit of Shapley values.
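
These two quantities can be recomputed directly (a sketch, assuming bst and fraud_test as above):

probs_test <- predict(bst, fraud_test |> select(-is_fraud) |> as.matrix())
mean(probs_test)   # Average prediction over the test sample
probs_test[1]      # Prediction for the first test observation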