Deep Learning-Based Detection of Phishing Attacks via URL Analysis (Derin Öğrenme Tabanlı Oltalama Saldırılarının URL Analizi ile Tespiti)


Thesis Type: Doctorate

Institution Where the Thesis Was Conducted: Yıldız Technical University, Graduate School of Natural and Applied Sciences, Computer Engineering, Türkiye

Thesis Approval Date: 2023

Thesis Language: Turkish

Student: Mehmet Korkmaz

Primary Advisor (for Co-Advised Theses): Banu Diri

Co-Advisor: Özgür Koray Şahingöz

Abstract:

Technological devices and the internet make daily life much easier. With this ease, the cyber world has expanded and, through widespread internet use, now reaches a very large audience. Attackers targeting this audience manipulate users through phishing: messages that appear important and urgent push the user, who is the weakest link, into making a wrong decision. The aim is to steal sensitive information such as user identities, passwords, and bank account and credit card numbers. Informing users about these attacks and developing artificial intelligence-supported phishing detection systems are effective ways to protect them. Today, effective defense systems are being built on research in machine learning and deep learning. These systems detect phishing websites by analyzing their URLs, their content, or both. Most of them focus on URL analysis; only a limited number analyze content, and only a limited number combine URL and content analysis. Systems that analyze URL and content together typically rely on feature concatenation. For the URL analysis-based approaches, a dataset of 14,782,355 URLs, comprising 11,720,749 legitimate and 3,061,606 phishing URLs, was created in this thesis so that the models could learn better. This dataset contains two sub-datasets. The first is the high-risk URL dataset, with 113,189 legitimate and 113,189 phishing URLs. The second is the high-risk content dataset, which contains 45,631 legitimate and 36,123 phishing URLs together with their content. We thus contribute to the literature by creating, for the first time, a dataset of this size and with these characteristics.

A series of experiments was conducted on these datasets using deep learning models and five-fold cross-validation, covering the URL-based, content-based, and combined URL and content-based approaches. For URL analysis, the GDNN-CNN model, which combines a CNN with a model we named GDNN, merged automatically extracted features with 73 manually engineered features and achieved an accuracy of 99.05%. For content analysis, a DNN model trained with 57 manually engineered features on the high-risk content sub-dataset achieved an accuracy of 94.07%. One of the main objectives of this thesis is to develop a hybrid system that analyzes both URLs and content. This approach, which we call a two-stage hybrid phishing detection system, is designed as follows: a suspected phishing attack is first evaluated with the URL analysis approach and then, if the website is active, with the content analysis approach. The decision mechanism developed for this system works as follows. If URL analysis predicts phishing, the result is phishing. If URL analysis predicts legitimate, the content is looked up; if no content is found, the URL analysis result stands. If content is found, the result is determined from the ratio of the URL analysis and content analysis prediction results. Based on a series of experiments, we propose a system that uses the GDNN-CNN model in the first stage and the DNN model in the second stage. In experiments on the high-risk content sub-dataset, the accuracy of the proposed system was 2.34% higher than that of URL analysis alone and 4.33% higher than that of content analysis alone.
In experiments on the full dataset we created, which is at big-data scale, the hybrid approach raised the 99.05% accuracy obtained with URL analysis by 0.01%, to 99.06%. The false negative rate and the error rate each improved by 0.01% as well. Although this improvement appears small, a closer look shows that the two-stage hybrid phishing detection system is 70.23% better at detecting URLs that URL analysis fails to detect but for which content is available and analyzed. Consequently, the proposed approach is expected to work as an effective phishing detection system, since it will encounter more phishing websites with live content when deployed in real-world use.
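
The two-stage decision mechanism described in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the thesis implementation: the function name `decide`, the probability inputs, and the 0.5 threshold are hypothetical. In particular, the abstract does not specify how the "ratio" of the two prediction results is computed, so the sketch assumes a simple average of the two phishing probabilities as a stand-in.

```python
from typing import Optional

PHISHING = "phishing"
LEGITIMATE = "legitimate"


def decide(url_phish_prob: float,
           content_phish_prob: Optional[float],
           threshold: float = 0.5) -> str:
    """Hypothetical two-stage decision rule.

    url_phish_prob: phishing probability from the stage-1 URL model.
    content_phish_prob: phishing probability from the stage-2 content
        model, or None when no content could be retrieved.
    threshold: assumed decision threshold (not given in the abstract).
    """
    # Stage 1: a phishing verdict from URL analysis is final.
    if url_phish_prob >= threshold:
        return PHISHING

    # URL analysis says legitimate and no content is available:
    # the URL analysis result stands.
    if content_phish_prob is None:
        return LEGITIMATE

    # Stage 2: content is available. ASSUMPTION: combine the two
    # predictions by averaging; the thesis's actual ratio-based rule
    # is not described in the abstract.
    combined = (url_phish_prob + content_phish_prob) / 2
    return PHISHING if combined >= threshold else LEGITIMATE
```

For example, `decide(0.4, 0.9)` lets a confident content-based prediction overturn a borderline legitimate URL verdict, which is the case where the hybrid system is reported to help most.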