Anonymized credit card transactions labeled as fraudulent or genuine.
Fraud is a growing concern for companies all over the globe. While there are many ways to fight and identify fraud, one method that is gaining increased attention is the use of unsupervised learning methods to detect anomalies within customer or transaction data. By analyzing customers or transactions relative to each other, we’re able to spot unusual observations.
These methods are referred to as unsupervised because no historical information about fraudulent cases is used to train the model. Instead, unsupervised methods find anomalies by locating observations that are separated from the heavily populated areas of the data set.
The assumption behind this is that fraudulent behavior can often appear as anomalous within a data set. It should be noted that just because an observation is anomalous, it doesn’t mean it is fraudulent or of interest to the user. Similarly, fraudulent behavior can be disguised to be hidden within more regular types of behavior. However, without labeled training data, unsupervised learning is a good method to use to begin to identify deviant accounts or transactions.
Unsupervised methods are also useful when trying to find new types of fraud that may not have been captured in the historical data. Fraud patterns evolve, so it is important to keep searching for ways to identify new patterns as early as possible. Supervised models built purely on historical data can miss these new patterns. Unsupervised methods are not limited by the patterns present in the historical data, so they can potentially identify new patterns when these show up as unusual or anomalous behavior.
Three fundamental approaches to detect anomalies are based on:
The real challenge in anomaly detection is to construct the right data model to separate outliers from noise and normal data.
Anomaly == Outlier == Deviant or Unusual Data Point
Before we dive into anomaly detection, let’s load the data that we will later pass to the H2O cluster. We will be using the credit card data set, which contains information on various properties of credit card transactions. Of the 284,807 transactions, 492 are fraudulent and 284,315 are genuine, which makes the target class highly imbalanced. We will not use the label during the anomaly detection modeling, but we will use it to evaluate the results.
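To make the imbalance concrete, here is a small arithmetic sketch in base R. It assumes 284,807 is the total row count (the figure reported for the prediction frame later in this post) and that 492 of those rows are fraud; the variable names are illustrative only.

```r
# Class balance from the counts quoted in the text (assumed: 284,807 rows in total)
n_total   <- 284807
n_fraud   <- 492
n_genuine <- n_total - n_fraud            # 284315

# Share of fraudulent transactions, in percent
fraud_share_pct <- round(100 * n_fraud / n_total, 3)
fraud_share_pct                           # about 0.173
```

With fewer than two fraud cases per thousand transactions, plain accuracy is a poor yardstick: a model that flags nothing would still be right more than 99.8% of the time.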
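# Packages used throughout this post
library(tidyverse); library(vroom); library(ggthemes); library(cowplot)
library(plotly); library(tidyquant); library(yardstick); library(h2o)
credit_card_tbl <- vroom("data/creditcard.csv")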
Exploratory Data Analysis is an initial process of analysis, in which you can summarize characteristics of data such as pattern, trends, outliers, and hypothesis testing using descriptive statistics and visualization.
fraud_class <- credit_card_tbl %>%
  # the label column is `Class` (0 = genuine, 1 = fraud); use a factor copy for plotting
  mutate(Classes = as.factor(Class)) %>%
  group_by(Classes) %>%
  summarize(Count = n()) %>%
  ggplot(aes(x = Classes, y = Count, fill = Classes)) +
  geom_col() +
  theme_tufte() +
  scale_fill_manual(values = c("#377EB8", "#E41A1C")) +
  geom_text(aes(label = Count), size = 3, vjust = 1.2, color = "#FFFFFF") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
  labs(title = "Credit card transactions", x = "Classes", y = "Count")
fraud_class_percentage <- credit_card_tbl %>%
  mutate(Classes = as.factor(Class)) %>%
  group_by(Classes) %>%
  summarise(Count = n()) %>%
  # round after scaling so the small fraud share does not collapse to 0%
  mutate(percent = round(prop.table(Count) * 100, 2)) %>%
  # the slice size must come from `percent`, not from the class label
  ggplot(aes(x = "", y = percent, fill = Classes)) +
  geom_bar(width = 1, stat = "identity", color = "white") +
  theme_tufte() +
  scale_fill_manual(values = c("#377EB8", "#E41A1C")) +
  coord_polar("y", start = 0) +
  ggtitle("Credit card transactions(%)") +
  theme(plot.title = element_text(face = "bold", hjust = 0.5)) +
  geom_text(aes(label = paste0(round(percent, 1), "%")), position = position_stack(vjust = 0.5), color = "white")
plot_grid(fraud_class, fraud_class_percentage, align = "h", ncol = 2)
g <- credit_card_tbl %>%
select(Amount, Class) %>%
ggplot(aes(Amount, fill = as.factor(Class))) +
# geom_histogram() +
geom_density(alpha = 0.3) +
facet_wrap(~ Class, scales = "free_y", ncol = 1) +
scale_x_log10(label = scales::dollar_format()) +
scale_fill_tq() +
theme_tq() +
labs(title = "Fraud by Amount Spent",
fill = "Fraud")
ggplotly(g)
Let’s understand in detail what isolation forest is and how it can be helpful in identifying the anomaly.
The term isolation means separating an instance from the rest of the instances. Since anomalies are “few and different”, they are more susceptible to isolation: it takes fewer random splits to separate them from the rest of the data.
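The idea can be illustrated with a minimal one-dimensional sketch in base R. This is not H2O's implementation: `isolation_depth()` is a hypothetical helper written for this example, and the data are simulated. It counts how many random splits are needed before a chosen point is alone in its partition.

```r
# Illustrative only: count the random splits needed to isolate `target` within `x`
isolation_depth <- function(x, target, max_depth = 50) {
  depth <- 0
  while (length(x) > 1 && depth < max_depth) {
    lo <- min(x)
    hi <- max(x)
    if (lo == hi) break                     # identical values cannot be split
    split <- runif(1, lo, hi)               # random split point in the current range
    # keep only the side of the split that contains the target
    x <- if (target < split) x[x < split] else x[x >= split]
    depth <- depth + 1
  }
  depth
}

set.seed(42)
values  <- c(rnorm(200), 10)                # 200 "normal" points plus one obvious outlier
inlier  <- values[1]
outlier <- 10

# Average depth over many random "trees"
mean_depth_inlier  <- mean(replicate(200, isolation_depth(values, inlier)))
mean_depth_outlier <- mean(replicate(200, isolation_depth(values, outlier)))
c(inlier = mean_depth_inlier, outlier = mean_depth_outlier)
```

The outlier should need noticeably fewer splits on average than the inlier, which is the signal an isolation forest turns into an anomaly score.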
Advantages of using Isolation Forest:
First, we need to initialize the Java Virtual Machine (JVM) that H2O uses locally.
h2o.init()
Next, we change our data to an h2o object that the package can interpret.
credit_card_h2o <- as.h2o(credit_card_tbl)
target <- "Class"
predictors <- setdiff(names(credit_card_h2o), target)
# Let’s train isolation forest.
isoforest <- h2o.isolationForest(
training_frame = credit_card_h2o,
x = predictors,
ntrees = 100,
seed = 1234
)
isoforest
## Model Details:
## ==============
##
## H2OAnomalyDetectionModel: isolationforest
## Model ID: IsolationForest_model_R_1615735708480_1
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 100 100 70553 8
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 8 8.00000 15 94 51.48000
##
##
## H2OAnomalyDetectionMetrics: isolationforest
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
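The prediction H2O frame contains two columns: predict, the normalized anomaly score (higher means more anomalous), and mean_length, the average number of splits needed to isolate the observation across all trees.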
predictions <- predict(isoforest, newdata = credit_card_h2o)
predictions
## predict mean_length
## 1 0.04724409 6.82
## 2 0.04986877 6.81
## 3 0.16010499 6.39
## 4 0.11023622 6.58
## 5 0.06561680 6.75
## 6 0.04986877 6.81
##
## [284807 rows x 2 columns]
Predicting Anomalies using Quantile
How do we go from the average number of splits / anomaly score to the actual predictions? By using a threshold. If we have an idea about the relative number of outliers in our dataset, we can find the corresponding quantile value of the score and use it as a threshold for our predictions.
Most observations have a low anomaly score, but a few score high; those are the candidate anomalies.
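As a minimal sketch of the thresholding idea, the base R snippet below uses simulated scores rather than the H2O output. The beta-distributed `scores` vector and the 1% outlier share are assumptions chosen only to mimic a score that is mostly close to zero.

```r
# Simulated anomaly scores (assumption: skewed toward 0, like an anomaly score column)
set.seed(1)
scores <- rbeta(10000, shape1 = 1, shape2 = 12)

expected_outlier_share <- 0.01              # assumed share of outliers in the data
threshold <- stats::quantile(scores, probs = 1 - expected_outlier_share, names = FALSE)

flagged <- as.integer(scores > threshold)   # 1 = flagged as anomaly, 0 = not flagged
mean(flagged)                               # share of flagged rows, close to 0.01
```

The threshold is only as good as the assumed outlier share: a larger share flags more transactions and typically trades precision for recall.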
h2o.hist(predictions[,"predict"])
Most observations need close to 7 splits on average before they are separated from the other data points.
h2o.hist(predictions[,"mean_length"])
quantile <- h2o.quantile(predictions, probs = 0.99)
quantile
## predictQuantiles mean_lengthQuantiles
## 0.3412073 7.0000000
thresh <- quantile["predictQuantiles"]
predictions$outlier <- predictions$predict > thresh %>% as.numeric()
predictions$class <- credit_card_h2o$Class
predictions
## predict mean_length outlier class
## 1 0.04724409 6.82 0 0
## 2 0.04986877 6.81 0 0
## 3 0.16010499 6.39 0 0
## 4 0.11023622 6.58 0 0
## 5 0.06561680 6.75 0 0
## 6 0.04986877 6.81 0 0
##
## [284807 rows x 4 columns]
predictions_tbl <- as_tibble(predictions) %>%
mutate(class = factor(class, levels = c("1","0"))) %>%
mutate(outlier = factor(outlier,levels = c("1","0")))
predictions_tbl
## # A tibble: 284,807 x 4
## predict mean_length outlier class
## <dbl> <dbl> <fct> <fct>
## 1 0.0472 6.82 0 0
## 2 0.0499 6.81 0 0
## 3 0.160 6.39 0 0
## 4 0.110 6.58 0 0
## 5 0.0656 6.75 0 0
## 6 0.0499 6.81 0 0
## 7 0.0551 6.79 0 0
## 8 0.168 6.36 0 0
## 9 0.0971 6.63 0 0
## 10 0.0525 6.8 0 0
## # ... with 284,797 more rows
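At the 99th-percentile threshold the model flags 2,811 transactions as anomalies. Of these, 300 are actual fraud, while 192 of the 492 fraud cases are missed.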
predictions_tbl %>% conf_mat(class, outlier)
## Truth
## Prediction 1 0
## 1 300 2511
## 0 192 281804
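The counts in the confusion matrix above can be turned into the usual summary rates. This is a small base R sketch; the variable names are illustrative and the four counts are copied from the output.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = truth)
tp <- 300      # flagged and fraudulent
fp <- 2511     # flagged but genuine
fn <- 192      # fraudulent but not flagged
tn <- 281804   # genuine and not flagged

precision <- tp / (tp + fp)                 # share of flagged transactions that are fraud
recall    <- tp / (tp + fn)                 # share of fraud cases that were flagged
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 3)
# roughly 0.107, 0.610, 0.182
```

So roughly six in ten fraud cases are caught without using any labels, at the cost of about nine false alarms for every true hit.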
auc <- predictions_tbl %>%
roc_auc(class, predict) %>%
pull(.estimate) %>%
round(3)
predictions_tbl %>%
roc_curve(class, predict) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity)) +
geom_path(color = palette_light()[1], size = 2) +
geom_abline(lty = 3, size = 1) +
theme_tq() +
labs(title = str_glue("ROC AUC: {auc}"),
subtitle = "Using H2O Isolation Forest")
Stabilize predictions to increase anomaly detection performance
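Why averaging helps can be sketched with simulated data. The snippet assumes each run produces the same underlying score plus independent noise; that is a simplification of what happens when isolation forests are trained with different seeds, and none of the names below come from H2O.

```r
# Simulated setup: one "true" score per row, observed through three noisy runs
set.seed(7)
true_score <- runif(1000)
runs <- sapply(1:3, function(i) true_score + rnorm(1000, sd = 0.1))  # one column per run

averaged <- rowMeans(runs)                  # stabilized score: mean across runs

single_run_error <- sd(runs[, 1] - true_score)
averaged_error   <- sd(averaged - true_score)
c(single = single_run_error, averaged = averaged_error)
```

Under these assumptions the averaged score sits closer to the underlying value than any single run, which is the motivation for combining several seeds in the code that follows.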
# Repeatable Prediction Function
iso_forest <- function(seed) {
target <- "Class"
predictors <- setdiff(names(credit_card_h2o), target)
isoforest <- h2o.isolationForest(
training_frame = credit_card_h2o,
x = predictors,
ntrees = 100,
seed = seed
)
predictions <- predict(isoforest, newdata = credit_card_h2o)
quantile <- h2o.quantile(predictions, probs = 0.99)
thresh <- quantile["predictQuantiles"]
# predictions$outlier <- predictions$predict > thresh %>% as.numeric()
# predictions$class <- credit_card_h2o$Class
predictions_tbl <- as_tibble(predictions) %>%
# mutate(class = as.factor(class)) %>%
mutate(row = row_number())
predictions_tbl
}
iso_forest(123)
## # A tibble: 284,807 x 3
## predict mean_length row
## <dbl> <dbl> <int>
## 1 0.0431 6.83 1
## 2 0.0152 6.94 2
## 3 0.157 6.38 3
## 4 0.0533 6.79 4
## 5 0.0330 6.87 5
## 6 0.0203 6.92 6
## 7 0.00761 6.97 7
## 8 0.190 6.25 8
## 9 0.0457 6.82 9
## 10 0.0279 6.89 10
## # ... with 284,797 more rows
multiple_predictions_tbl <- tibble(seed = c(158, 8546, 4593)) %>%
mutate(predictions = map(seed, iso_forest))
multiple_predictions_tbl
## # A tibble: 3 x 2
## seed predictions
## <dbl> <list>
## 1 158 <tibble [284,807 x 3]>
## 2 8546 <tibble [284,807 x 3]>
## 3 4593 <tibble [284,807 x 3]>
# Calculate average predictions
stabilized_predictions_tbl <- multiple_predictions_tbl %>%
unnest(predictions) %>%
select(row, seed, predict) %>%
# Calculate stabilized predictions
group_by(row) %>%
summarize(mean_predict = mean(predict)) %>%
ungroup() %>%
# Combine with original data & important columns
bind_cols(
credit_card_tbl
) %>%
select(row, mean_predict, Time, V12, V15, Amount, Class) %>%
# Detect Outliers
mutate(outlier = ifelse(mean_predict > quantile(mean_predict, probs = 0.99), 1, 0)) %>%
mutate(Class = as.factor(Class))
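# Note: yardstick treats the first factor level as the event. After as.factor() that level
# is "0" (genuine), so the PR AUC below describes the majority class rather than fraud.
stabilized_predictions_tbl %>% pr_auc(Class, mean_predict)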
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 pr_auc binary 0.991
stabilized_predictions_tbl %>%
ggplot(aes(V12, V15, color = as.factor(outlier))) +
geom_point(alpha = 0.2) +
theme_tq() +
scale_color_tq() +
labs(title = "Anomaly Detected?", color = "Is Outlier?")
stabilized_predictions_tbl %>%
  # color by the true label here so this plot can be compared with the outlier plot
  ggplot(aes(V12, V15, color = Class)) +
  geom_point(alpha = 0.2) +
  theme_tq() +
  scale_color_tq() +
  labs(title = "Fraud Present?", color = "Is Fraud?")