Imbalance, Overlap and Classification in Credit Card Fraud Datasets.
Credit card fraud is a growing problem, especially as online purchases and payments become more widespread, so there is high demand for reliable and robust fraud detection systems. Fraud detection can be treated as a classification problem. However, multiple authors report that classifiers are difficult to train on fraud data sets, which can exhibit high levels of class imbalance and class overlap. Sampling techniques, such as over-sampling and under-sampling, are frequently used in the preprocessing phase to address the imbalance problem. In addition, measures such as \emph{R-Value} and \emph{Augmented R-Value} were introduced in recent years to quantify the level of overlap between classes in a data set. This work analyzes different classifiers when sampling techniques are applied to credit card fraud data sets, objectively gauging the effect of sampling on these overlap measures and on classification performance. Consistent with many previous works, \emph{Augmented R-Value} was found to be more appropriate than \emph{R-Value} for imbalanced scenarios. However, for the studied data set, the differences in classification results obtained with and without the selected sampling techniques were insignificant. Furthermore, decision-tree-based algorithms performed well on this data set despite its highly skewed and significantly overlapped classes.