FEATURE SELECTION METHODS IN MACHINE LEARNING


Creative Commons License

Uğurlu E., Şaylı A.

INTERNATIONAL CONGRESS ON NATURAL SCIENCES AND APPLIED MATHEMATICS, İstanbul, Türkiye, 7-8 June 2024, pp. 38-50

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI Number: 10.5281/zenodo.12678585
  • City of Publication: İstanbul
  • Country of Publication: Türkiye
  • Page Numbers: pp. 38-50
  • Affiliated with Yıldız Technical University: Yes

ABSTRACT

Introduction and Purpose: Feature Selection (FS) is one of the most important pre-processing steps in machine learning. The purpose of FS is to obtain a feature subset that supports fast, high-performing models by removing unnecessary features and keeping the most appropriate ones. In this study, the Arrhythmia dataset was used; it has 279 features consisting of age, gender, height, weight and 275 results obtained from 12-lead ECG. The aim is to determine the best feature subset for distinguishing the presence and absence of arrhythmia and for distinguishing between types of arrhythmia.
Materials and Methods: Five FS algorithms are examined in this study: 153 features are selected by L1 Regularized Logistic Regression (LR-L1), 163 by Random Forest Classifier-Boruta (RF-Boruta), 45 by Extreme Gradient Boosting-Boruta (XGB-Boruta), 198 by Recursive Feature Elimination by Random Forest with Cross Validation (RF-RFECV) and 154 by Recursive Feature Elimination by Extreme Gradient Boosting with Cross Validation (XGB-RFECV). These algorithms determine the number of selected features automatically. The pre-processed dataset without FS has 262 features, and five further datasets were obtained by applying the five FS algorithms (six datasets in total). To assess the benefit of each method, the Decision Tree (DT), Random Forest (RF), K-Nearest Neighbor (KNN), Gaussian Naive Bayes (GNB), Logistic Regression (LR), Support Vector Machine (SVM), Extreme Gradient Boosting (XGB) and CatBoost (CatB) classification algorithms are applied to the datasets using stratified 10-fold cross validation, and the accuracy, recall, precision, F1-Score and AUC-ROC metrics are evaluated.
Results: The RF-RFECV dataset with 198 features gives the highest accuracy score, 98.41%, with the Random Forest classifier; with the XGB-Boruta dataset it is 97.32%. However, compared to RF-RFECV, XGB-Boruta takes 46% less execution time and uses 153 fewer features (45 instead of 198). Its feature count is also 217 lower than that of the dataset processed without any FS (45 instead of 262).
Discussion and Conclusion: Consequently, it can be said that automatic FS methods improve performance, provide more reliable results and save time.

Keywords: Arrhythmia Dataset; Feature Selection Algorithm; Classification; Machine Learning
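
The workflow summarized in the abstract (select a feature subset automatically, then compare a classifier on each resulting dataset with stratified 10-fold cross validation) can be illustrated with a minimal sketch. It is not the authors' code. It assumes scikit-learn, uses a small synthetic binary dataset in place of the Arrhythmia data, and shows only two of the five selectors: an L1-penalized logistic regression wrapped in SelectFromModel, and RFECV with a random forest. All parameter values (C=0.1, 50 trees, step=5, 3 inner folds, the 1e-5 threshold) are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the pre-processed data (NOT the Arrhythmia dataset).
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=8, n_redundant=4,
    random_state=0,
)

# LR-L1 style selection: keep features whose L1-penalized coefficient
# is (numerically) non-zero. C=0.1 is an assumed regularization strength.
lr_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1,
                           random_state=0)
lr_selector = SelectFromModel(lr_l1, threshold=1e-5).fit(X, y)
X_lr = lr_selector.transform(X)

# RF-RFECV style selection: recursive feature elimination where the number
# of kept features is chosen by cross-validated accuracy.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
rfecv = RFECV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    step=5, cv=inner_cv, scoring="accuracy",
).fit(X, y)
X_rf = rfecv.transform(X)

# Evaluate one classifier on each dataset with stratified 10-fold CV,
# reporting the same five metrics named in the abstract.
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
datasets = {"no FS": X, "LR-L1": X_lr, "RF-RFECV": X_rf}

results = {}
for name, X_sub in datasets.items():
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    scores = cross_validate(clf, X_sub, y, cv=outer_cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    summary = ", ".join(f"{m}={v:.3f}" for m, v in results[name].items())
    print(f"{name:9s} ({X_sub.shape[1]:2d} features): {summary}")
```

For brevity, the selectors here are fitted on the full dataset before evaluation. Wrapping the selector and the classifier in a scikit-learn Pipeline would refit the selection inside each fold, which generally gives a less optimistic estimate. The study also covers multi-class arrhythmia types, which this binary sketch does not attempt to reproduce.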