Combat Mobile Evasive Malware via Skip-Gram-Based Malware Detection

Creative Commons License

EĞİTMEN A., Bulut I., Aygun R. C., Gunduz A. B., Seyrekbasan O., YAVUZ A. G.

SECURITY AND COMMUNICATION NETWORKS, vol.2020, 2020 (SCI-Expanded) identifier identifier

  • Publication Type: Article / Article
  • Volume: 2020
  • Publication Date: 2020
  • Doi Number: 10.1155/2020/6726147
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Aerospace Database, Applied Science & Technology Source, Communication Abstracts, Compendex, INSPEC, Metadex, Directory of Open Access Journals, Civil Engineering Abstracts
  • Yıldız Technical University Affiliated: Yes


Android malware detection is an important research topic in the security area. There are a variety of existing malware detection models based on static and dynamic malware analysis. However, most of these models are not very successful when it comes to evasive malware detection. In this study, we aimed to create a malware detection model based on a natural language model called skip-gram to detect evasive malware with the highest accuracy rate possible. In order to train and test our proposed model, we used an up-to-date malware dataset called Argus Android Malware Dataset (AMD) since the AMD contains various evasive malware families and detailed information about them. Meanwhile, for the benign samples, we used Comodo Android Benign Dataset. Our proposed model starts with extracting skip-gram-based features from instruction sequences of Android applications. Then it applies several machine learning algorithms to classify samples as benign or malware. We tested our proposed model with two different scenarios. In the first scenario, the random forest-based classifier performed with 95.64% detection accuracy on the entire dataset and 95% detection accuracy against evasive only samples. In the second scenario, we created a test dataset that contained zero-day malware samples only. For the training set, we did not use any sample that belongs to the malware families in the test set. The random forest-based model performed with 37.36% accuracy rate against zero-day malware. In addition, we compared our proposed model's malware detection performance against several commercial antimalware applications using VirusTotal API. Our model outperformed 7 out of 10 antimalware applications and tied with one of them on the same test scenario.