INTERNATIONAL CONGRESS ON NATURAL SCIENCES AND APPLIED MATHEMATICS, İstanbul, Türkiye, 7-8 June 2024, pp. 38-50
ABSTRACT
Introduction and Purpose: Feature selection (FS) is one of the most
important pre-processing steps in machine learning. The purpose of FS is to
obtain a feature subset that is fast to process and performs well, by removing
unnecessary features and keeping the most relevant ones. In this study, the
Arrhythmia dataset was used; it has 279 features consisting of age, gender,
height, weight and 275 measurements obtained from 12-lead ECG. The aim is to
determine the best feature subset that can distinguish the presence and absence
of arrhythmia and
distinguish types of arrhythmias. Materials and Methods: Five FS
algorithms are the focus of this study: 153 features are selected by L1
Regularized Logistic Regression (LR-L1), 163 by Random Forest Classifier-Boruta
(RF-Boruta), 45 by Extreme Gradient Boosting-Boruta (XGB-Boruta), 198 by
Recursive Feature Elimination by Random Forest with Cross Validation (RF-RFECV)
and 154 by Recursive Feature Elimination by Extreme Gradient Boosting with
Cross Validation (XGB-RFECV). These algorithms determine the number of features
automatically. The pre-processed dataset without FS has 262 features, and five
further datasets were obtained by applying the five FS algorithms (six datasets
in total). To show the benefits of the methods, Decision Tree (DT), Random
Forest (RF), K-Nearest Neighbor (KNN), Gaussian Naive Bayes (GNB), Logistic
Regression (LR), Support Vector Machine (SVM), Extreme Gradient Boosting (XGB)
and CatBoost (CatB) classification algorithms are applied to the datasets using
stratified 10-fold cross validation, and the accuracy, recall, precision,
F1-score and AUC-ROC
metrics are evaluated. Results: The RF-RFECV dataset with 198
features gives the highest accuracy, 98.41% with the Random Forest classifier;
with the XGB-Boruta dataset it is 97.32%. However, compared to RF-RFECV,
XGB-Boruta takes 46% less execution time and uses 153 fewer features (45
instead of 198). Its feature count is also 217 lower than that of the dataset
processed without any FS (262). Discussion and Conclusion: Consequently, it can
be said that automatic FS methods improve performance, provide more reliable
results and save time.
Keywords: Arrhythmia Dataset; Feature Selection
Algorithm; Classification; Machine Learning
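
The workflow summarized in the abstract (an automatic FS step followed by a classifier scored with stratified 10-fold cross validation) can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: it assumes scikit-learn is available, uses a synthetic binary dataset in place of the Arrhythmia data, shows only an LR-L1 style selector feeding a Random Forest, and every hyperparameter (C=0.5, 50 trees, the random seeds) is an arbitrary choice made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 300 samples,
# 40 features, binary target (e.g. arrhythmia present / absent).
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=8, n_redundant=4,
    random_state=0,
)


def make_selector():
    """L1-regularized logistic regression used as a feature selector.

    Features whose coefficient is shrunk to (near) zero are dropped, so the
    number of kept features is determined by the model, not set by hand.
    C=0.5 is an illustrative value, not one taken from the study.
    """
    l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    return SelectFromModel(l1_model)


# How many features the selector keeps when fitted on the whole dataset.
X_scaled = StandardScaler().fit_transform(X)
n_selected = int(make_selector().fit(X_scaled, y).get_support().sum())

# Scaling, selection and classification are chained in one pipeline, so the
# selector is re-fitted on the training part of every fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", make_selector()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Stratified 10-fold CV keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
metrics = ["accuracy", "recall", "precision", "f1", "roc_auc"]
scores = cross_validate(pipeline, X, y, cv=cv, scoring=metrics)

mean_scores = {m: float(scores[f"test_{m}"].mean()) for m in metrics}
print(f"features kept: {n_selected} of {X.shape[1]}")
for name, value in mean_scores.items():
    print(f"{name}: {value:.4f}")
```

Placing the selector inside the pipeline is a design choice of this sketch that avoids leaking information from the held-out fold into the selection step; the abstract instead describes building each reduced dataset first and then classifying it. Swapping `make_selector()` for scikit-learn's `RFECV` would give a sketch closer to the RF-RFECV variant.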