Weighted XGBoost Based Active Learning Framework for Fraud Detection Using a Small Number of Samples from an Imbalanced Dataset


Karaca A. C.

3rd International Conference on Advanced Engineering, Technology and Applications, 25 May 2024, pp. 674-686

  • Publication Type: Conference Paper / Full Text Paper
  • DOI Number: 10.1007/978-3-031-70924-1_51
  • Page Numbers: pp. 674-686
  • Affiliated with Yıldız Teknik Üniversitesi: Yes

Abstract

Fraud cases occur rarely, yet they are very costly when they do occur. Fraud also has a continuously changing nature. This is why fraud detection is a widely studied problem that calls for machine learning techniques. On the other hand, labeled data for fraud cases are scarce, and labeling generally requires expert opinion. Active learning offers a way to select the right items to label, presenting the most uncertain and most distinguishable samples to experts, and it is combined with machine learning algorithms to enhance fraud detection. In our study, we propose a method that uses a hybrid active learning strategy, combining Least Confidence uncertainty sampling with Cluster-based diversity sampling via the K-means algorithm, on top of weighted XGBoost as the machine learning method. Because fraud cases are rare, we focus on a model that can work with a minimal training set. Our experiments show that we can reach the same F1 score while using only 1/90 of the training data required by a traditional hold-out split of 70% train and 30% test. We compared our approach against combinations of Isolation Forest, Support Vector Machine, Neural Network, and XGBoost with the uncertainty-based active learning strategies Margin Confidence, Ratio Confidence, and Entropy-Based sampling, used both as pure uncertainty sampling and as hybrid approaches with Cluster-based diversity sampling. Our proposed approach outperforms all combinations of the mentioned methods and strategies, and its F1 scores are nearly identical to those obtained with 90 times more training data. Using less data thus carries no disadvantage in model evaluation: it can yield the same evaluation metrics with less complexity and potentially less training time.
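
To make the described query loop concrete, here is a minimal sketch (not the authors' code) of one iteration of the hybrid strategy, assuming the public xgboost and scikit-learn APIs; the function name hybrid_query and parameter values such as batch_size=20, candidate_pool=200, and the scale_pos_weight weighting scheme are illustrative assumptions, not the paper's settings. A weighted XGBoost model is fit on the current labeled set, Least Confidence scores rank the unlabeled pool, and K-means over the most uncertain candidates enforces diversity in the queried batch.

```python
import numpy as np
from sklearn.cluster import KMeans
from xgboost import XGBClassifier


def hybrid_query(X_labeled, y_labeled, X_pool, batch_size=20, candidate_pool=200):
    """One iteration: Least Confidence uncertainty sampling followed by
    K-means cluster-based diversity sampling over a weighted XGBoost model."""
    # Weight the rare fraud class via scale_pos_weight (negatives/positives);
    # this particular weighting is an assumption, not the paper's tuned value.
    pos_weight = (y_labeled == 0).sum() / max((y_labeled == 1).sum(), 1)
    model = XGBClassifier(scale_pos_weight=pos_weight, eval_metric="logloss")
    model.fit(X_labeled, y_labeled)

    # Least Confidence score: 1 - max predicted class probability,
    # so higher values mean the model is less sure about the sample.
    proba = model.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)

    # Restrict to the most uncertain candidates, cluster them, and pick
    # the sample nearest each centroid so the queried batch stays diverse.
    candidates = np.argsort(uncertainty)[-candidate_pool:]
    km = KMeans(n_clusters=batch_size, n_init=10).fit(X_pool[candidates])
    queried = []
    for c in range(batch_size):
        members = candidates[km.labels_ == c]
        dists = np.linalg.norm(X_pool[members] - km.cluster_centers_[c], axis=1)
        queried.append(members[np.argmin(dists)])
    return np.array(queried)  # pool indices to hand to the labeling expert
```

In a full active learning loop, the queried samples would be labeled by an expert, moved from the unlabeled pool into the labeled set, and the procedure repeated until the labeling budget is exhausted.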