Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem

Catal, Cagatay; Diri, Banu

doi:10.1016/j.ins.2008.12.001

INFORMATION SCIENCES, vol. 179, pp. 1040-1058, 2009 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 179
Publication Date: 2009
DOI: 10.1016/j.ins.2008.12.001
Journal Name: INFORMATION SCIENCES
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Pages: pp. 1040-1058
Affiliated with Yıldız Teknik Üniversitesi: Yes

Abstract

Software quality engineering comprises several quality assurance activities such as testing, formal verification, inspection, fault tolerance, and software fault prediction. Many researchers have developed and validated fault prediction models using machine learning and statistical techniques, applying different kinds of software metrics and diverse feature reduction techniques to improve the models' performance. However, these studies did not investigate the effect of dataset size, metrics set, and feature selection techniques on software fault prediction. This study focuses on high-performance fault predictors based on machine learning, such as Random Forests, and on algorithms based on a new computational intelligence approach called Artificial Immune Systems. We used public NASA datasets from the PROMISE repository to make our predictive models repeatable, refutable, and verifiable. The research questions concerned the effects of dataset size, metrics set, and feature selection techniques. To answer these questions, seven test groups were defined. Additionally, nine classifiers were examined for each of the five public NASA datasets. According to this study, Random Forests provides the best prediction performance for large datasets, and Naive Bayes is the best prediction algorithm for small datasets, in terms of the Area Under the Receiver Operating Characteristic Curve (AUC) evaluation parameter. The parallel implementation of the Artificial Immune Recognition Systems algorithm (AIRS2Parallel) is the best Artificial Immune Systems paradigm-based algorithm when method-level metrics are used. © 2008 Elsevier Inc. All rights reserved.
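The abstract ranks classifiers by AUC. As a rough illustration of that measure only, and not the authors' experimental code, the sketch below computes AUC as the fraction of (faulty, non-faulty) module pairs in which the faulty module gets the higher predicted score, counting ties as half. The function name `auc`, the labels, and the scores are all hypothetical and are not taken from the NASA datasets.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney view).

    labels: 1 for a faulty module, 0 for a non-faulty one (assumed encoding).
    scores: predicted fault scores, higher meaning "more likely faulty".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one faulty and one non-faulty module")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # faulty module ranked above non-faulty one
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores from some classifier on five modules.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.2, 0.1]
example_auc = auc(labels, scores)
```

An AUC of 1.0 means every faulty module is ranked above every non-faulty one, while 0.5 corresponds to random ranking. The quadratic pairwise loop is chosen for readability; a rank-based computation would be preferable on larger datasets.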