DNA Chromatogram Classification Using Entropy-Based Features and Supervised Dimension Reduction Based on Global and Local Pattern Information


International Journal of Pattern Recognition and Artificial Intelligence, vol.37, no.12, 2023 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 37 Issue: 12
  • Publication Date: 2023
  • DOI Number: 10.1142/s0218001423560190
  • Journal Name: International Journal of Pattern Recognition and Artificial Intelligence
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Aerospace Database, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Compendex, Computer & Applied Sciences, Metadex, Civil Engineering Abstracts
  • Keywords: dimension reduction, DNA chromatogram, DNAC Finder, Entropy features, gene sequencing classification
  • Yıldız Technical University Affiliated: Yes


Gene sequence classification is a challenging task because sequential data are nonstationary, noisy, and nonlinear. The primary goal of this research is to develop a general approach to supervised DNA chromatogram (DNAC) classification when sufficient training data are not available. Deep learning has come to the fore with its achievements, but it requires large amounts of training data. Finding enough training data can be exceedingly difficult, particularly in the medical domain and for rare disorders. In this paper, a novel supervised DNAC classification method is proposed that combines three techniques to classify hepatitis virus DNA trace files as HBV or HCV. Features capable of reflecting the complex structure of the sequential data are extracted using both embedding and spectral entropies. After the supervised dimension reduction step, both the global and the local behavior of the entropy features are taken into account for classification. A memory-based learner, which by its nature retains all of the information in the training data, is used as the classifier. Experimental results show that the proposed method performs well: using only 19% training data, it obtains a performance of 92%.
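
To make the pipeline in the abstract more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It uses permutation entropy as one example of an embedding entropy, the normalized Shannon entropy of the power spectrum as a spectral entropy, and a 1-nearest-neighbor rule as a stand-in for the memory-based learner. The abstract does not name the specific entropies, the dimension reduction technique, or the classifier, so all function names and parameters here (`permutation_entropy`, `spectral_entropy`, `order`, `nearest_neighbor`) are hypothetical, and the supervised dimension reduction step is omitted entirely.

```python
import cmath
import math
from collections import Counter


def permutation_entropy(signal, order=3):
    """Normalized permutation entropy (assumed example of an embedding entropy)."""
    # Count the ordinal pattern of every window of `order` consecutive samples.
    patterns = Counter(
        tuple(sorted(range(order), key=lambda k: signal[i + k]))
        for i in range(len(signal) - order + 1)
    )
    total = sum(patterns.values())
    h = -sum((c / total) * math.log(c / total) for c in patterns.values())
    return h / math.log(math.factorial(order))


def spectral_entropy(signal):
    """Normalized Shannon entropy of the power spectrum (naive O(n^2) DFT)."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    power = []
    for f in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * f * t / n)
            for t, x in enumerate(centered)
        )
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0 or len(power) < 2:
        return 0.0
    probs = [p / total for p in power if p > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(power))


def features(signal):
    """Two-dimensional entropy feature vector for one trace."""
    return (permutation_entropy(signal), spectral_entropy(signal))


def nearest_neighbor(train, query):
    """1-NN: `train` is a list of (feature_vector, label) pairs kept in memory."""
    return min(train, key=lambda item: math.dist(item[0], query))[1]


# Toy signals only; these are not chromatogram data.
sine = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]
irregular = [(14 * t) % 31 for t in range(32)]
train = [(features(sine), "regular"), (features(irregular), "irregular")]
```

A memory-based classifier of this kind stores every training example and defers all computation to query time, which is why no training information is discarded. In practice a reduced feature space would be computed from labeled data before the neighbor search.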