Investigating the effect of dataset size, metrics sets, and feature selection techniques on software fault prediction problem


Catal C., Diri B.

INFORMATION SCIENCES, vol.179, pp.1040-1058, 2009 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 179
  • Publication Date: 2009
  • Doi Number: 10.1016/j.ins.2008.12.001
  • Journal Name: INFORMATION SCIENCES
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Page Numbers: pp.1040-1058
  • Yıldız Technical University Affiliated: Yes

Abstract

Software quality engineering comprises several quality assurance activities such as testing, formal verification, inspection, fault tolerance, and software fault prediction. To date, many researchers have developed and validated fault prediction models by using machine learning and statistical techniques. Different kinds of software metrics and diverse feature reduction techniques have been used in order to improve the models' performance. However, these studies did not investigate the effect of dataset size, metrics set, and feature selection techniques on software fault prediction. This study focuses on high-performance fault predictors based on machine learning, such as Random Forests, and on algorithms based on a new computational intelligence approach called Artificial Immune Systems. We used public NASA datasets from the PROMISE repository to make our predictive models repeatable, refutable, and verifiable. The research questions were based on the effects of dataset size, metrics set, and feature selection techniques. In order to answer these questions, seven test groups were defined. Additionally, nine classifiers were examined for each of the five public NASA datasets. According to this study, Random Forests provides the best prediction performance for large datasets and Naive Bayes is the best prediction algorithm for small datasets in terms of the Area Under Receiver Operating Characteristics Curve (AUC) evaluation parameter. The parallel implementation of the Artificial Immune Recognition Systems (AIRS2Parallel) algorithm is the best Artificial Immune Systems paradigm-based algorithm when method-level metrics are used. (C) 2008 Elsevier Inc. All rights reserved.