Markovian encoding models in human splice site recognition using SVM



Pashaei E., Aydın N.

Computational Biology and Chemistry, vol.73, pp.159-170, 2018 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 73
  • Publication Date: 2018
  • Doi Number: 10.1016/j.compbiolchem.2018.02.005
  • Journal Name: Computational Biology and Chemistry
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Page Numbers: pp.159-170
  • Keywords: Markovian model, Splice sites, Machine learning, DNA encoding method, MMSVM, PREDICTION, RNA, IDENTIFICATION, GENERATION, MUTATIONS, VARIANTS, FEATURES
  • Yıldız Technical University Affiliated: Yes

Abstract

Splice site recognition is among the most significant and challenging tasks in bioinformatics because of its key role in gene annotation. Effective prediction of splice sites requires nucleotide encoding methods that expose the characteristics of DNA sequences and supply suitable features as input to machine learning classifiers. Markovian models are the most influential encoding methods and are widely used for pattern recognition in biological data. However, the performance of these methods has not yet been compared directly in the splice site domain. This study addresses that gap by comparing various Markovian encoding models for splice site prediction using a support vector machine (SVM), the most prominent learning method in the domain. Moreover, a novel sequence encoding approach based on a third-order Markov model (MM3) is proposed. The experimental results show that the proposed method, MM3-SVM, performs significantly better than thirteen of the best-known state-of-the-art algorithms when tested on the HS3D dataset under several performance criteria. Further, it achieved higher prediction accuracy than several well-known tools, such as NNsplice, MEM, MM1, WMM, and GeneID, on an independent test set of 50 genes. We also developed MMSVM, a web tool that predicts splice sites in any human sequence using the proposed approach.
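
To make the idea of a Markovian encoding concrete, the following is a minimal sketch of a generic position-specific Markov encoding and should not be read as the MM3 encoding proposed in the paper, whose details the abstract does not give. It assumes fixed-length, aligned sequences over the alphabet A, C, G, T. The function names `fit_markov` and `encode`, the pseudocount, and the toy sequences are all hypothetical. Each nucleotide is replaced by its estimated conditional probability given the preceding `order` bases, and the resulting numeric vector could then serve as input to an SVM.

```python
from collections import defaultdict

ALPHABET = "ACGT"  # assumption: sequences contain only these four symbols


def fit_markov(seqs, order=1, pseudocount=1.0):
    """Count, per position, how often each base follows each context.

    Hypothetical helper: `seqs` are equal-length aligned strings; the
    context is the (up to) `order` bases preceding the position.
    """
    length = len(seqs[0])
    counts = [defaultdict(dict) for _ in range(length)]
    for s in seqs:
        for i in range(length):
            ctx = s[max(0, i - order):i]
            counts[i][ctx][s[i]] = counts[i][ctx].get(s[i], 0.0) + 1.0
    return {"order": order, "length": length,
            "counts": counts, "pseudocount": pseudocount}


def encode(model, seq):
    """Map a sequence to a vector of smoothed conditional probabilities."""
    order, pc = model["order"], model["pseudocount"]
    feats = []
    for i in range(model["length"]):
        ctx = seq[max(0, i - order):i]
        ctx_counts = model["counts"][i].get(ctx, {})
        total = sum(ctx_counts.values()) + pc * len(ALPHABET)
        feats.append((ctx_counts.get(seq[i], 0.0) + pc) / total)
    return feats


# Toy usage with made-up sequences (not data from the paper).
train = ["AAGT", "AAGT", "ACGT"]
model = fit_markov(train, order=1)
features = encode(model, "AAGT")
print(features)
```

Setting `order=3` gives a third-order variant of the same scheme. In practice one would typically fit separate models on true and false splice sites and combine their features, but that choice is an assumption here rather than something stated in the abstract.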