Unlabelled extra data do not always mean extra performance for semi-supervised fault prediction

Catal C., Diri B.

EXPERT SYSTEMS, vol.26, pp.458-471, 2009 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 26
  • Publication Date: 2009
  • Doi Number: 10.1111/j.1468-0394.2009.00509.x
  • Journal Name: EXPERT SYSTEMS
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Page Numbers: pp.458-471
  • Yıldız Technical University Affiliated: Yes


This research investigated and benchmarked several high-performance classifiers (J48, random forests, naive Bayes, KStar and artificial immune recognition systems) for software fault prediction with limited fault data. We also studied a recent semi-supervised classification algorithm called YATSI (Yet Another Two Stage Idea), a meta algorithm that allows different classifiers to be applied in its first stage, and we used each classifier in that stage. Furthermore, we proposed a semi-supervised classification algorithm that applies the artificial immune systems paradigm. Experimental results showed that YATSI does not always improve the performance of naive Bayes when unlabelled data are used together with labelled data. According to our experiments, the naive Bayes algorithm is the best choice for building a semi-supervised fault prediction model for small data sets, and YATSI may improve the performance of naive Bayes for large data sets. In addition, the YATSI algorithm improved the performance of all the classifiers except naive Bayes on all the data sets.