APPLIED SOFT COMPUTING, vol. 41, pp. 420-427, 2016
Although next-generation sequencing applications are becoming dominant in molecular genetics, many institutions still want to use their legacy sequencers as much as possible. An important concern in sequencing services is the quality of the trace files delivered to customers. Trace files should therefore be screened for quality, and low-quality files should be handled differently before they reach customers. The quality scores already present in the trace files provide some useful information; however, incorporating auxiliary information can improve the reliability of these scores. To this end, we used a feature-based supervised classification strategy, which requires a set of training and testing trace files whose quality has been determined manually. We tested several machine learning algorithms, namely k-nearest neighbors, Naive Bayes, Support Vector Machines, and Random Forest (RF), on a public DNA trace repository. Our results indicate that the RF method with only 4 simple features provides a classification accuracy of 94.68% with a high level of agreement (Kappa = 0.8679). (C) 2016 Elsevier B.V. All rights reserved.
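
The two evaluation measures quoted above, classification accuracy and Cohen's kappa, can be illustrated with a minimal sketch. It assumes a two-class labeling ("high" and "low" quality) and uses invented toy labels, not data or results from the study; the function names `accuracy` and `cohen_kappa` are hypothetical helpers written for this illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of trace files whose predicted label matches the manual label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of the product of marginal frequencies.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Degenerate case (a single class everywhere); returning 1.0 is a
        # convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy labels (invented for illustration only): manual vs. predicted quality.
manual = ["high"] * 6 + ["low"] * 4
predicted = ["high"] * 5 + ["low"] + ["low"] * 3 + ["high"]

acc = accuracy(manual, predicted)
kappa = cohen_kappa(manual, predicted)
print(f"accuracy = {acc:.4f}, kappa = {kappa:.4f}")
```

Because kappa subtracts the agreement expected by chance, it is lower than raw accuracy whenever the class distribution alone would produce some matches, which is why the two figures are usually reported together.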