Thesis Type: Doctorate
Institution: Yıldız Technical University, Graduate School of Natural and Applied Sciences, Computer Engineering, Turkey
Approval Date: 2023
Thesis Language: Turkish
Student: Mehmet Korkmaz
Principal Supervisor (for Co-supervised Theses): Banu Diri
Co-supervisor: Özgür Koray Şahingöz
Abstract:
The use of technological devices and the internet makes our daily lives
much easier. With this ease, the cyber world has expanded and, through
widespread internet use, now reaches a very large audience. Attackers
targeting this audience use a technique called phishing: they present content
that appears important and urgent and manipulate users, the weakest link in
the security chain, into making the wrong decision. The aim is to steal
sensitive information such as user identities, passwords, and bank account
and credit card numbers. Informing users
against these attacks and developing artificial intelligence-supported phishing
detection systems are effective solutions to protect users. Today, effective
defense systems are being built on research in machine learning and deep
learning. These systems detect phishing websites by analyzing their URLs,
their content, or both. Most of them focus on URL analysis; only a limited
number analyze content, and only a limited number combine URL and content
analysis. Systems that analyze URL and content together rely on feature
concatenation. To let URL analysis-based models learn better, a dataset of
14,782,355 URLs, comprising 11,720,749 legitimate and 3,061,606 phishing
URLs, was created in this thesis.
This dataset contains two sub-datasets. The first is the high-risk URL
dataset, with 113,189 legitimate and 113,189 phishing URLs. The second is the
high-risk content dataset, which contains 45,631 legitimate and 36,123
phishing URLs together with their content. We have thus contributed to the
literature by being the first to create a dataset of this size and with these
characteristics.
A series of experiments was conducted on these datasets using deep
learning models and five-fold cross-validation, covering URL-based,
content-based, and combined URL-and-content approaches. For URL analysis, the
GDNN-CNN model, which combines a CNN with a model we named GDNN, merged
automatically extracted features with 73 manually generated features and
achieved an accuracy of 99.05%. For content analysis, the DNN model trained
with 57 manually generated features on the high-risk content dataset, one of
the sub-datasets, achieved an accuracy of 94.07%. One of the main objectives
of this thesis is to develop a hybrid system
that analyzes both URLs and content. This approach, which we call a two-stage
hybrid phishing detection system, works as follows: a suspected phishing
attack is first evaluated with the URL analysis approach and then, if the
website is active, with the content analysis approach. The decision-making
mechanism of the system works as follows: if URL analysis predicts phishing,
the final result is phishing. If URL analysis predicts legitimate, the system
looks for the page content; if no content is found, the URL analysis result
stands. If content is found, the final result is determined by the ratio
between the URL analysis and content analysis predictions. Based on a series
of experiments, we propose a system that uses the GDNN-CNN model in the first
stage and the DNN model in the second stage. In the experiments conducted on
the high-risk content
dataset, one of the sub-datasets, the proposed system improved accuracy by
2.34% relative to URL analysis alone and by 4.33% relative to content
analysis alone. In the experiments on the large-scale (big data) dataset we
created, the hybrid approach raised the 99.05% accuracy of URL analysis by
0.01%, to 99.06%. The false negative rate and the error rate also improved by
0.01%. Examined in detail, this seemingly small improvement shows that the
two-stage hybrid phishing attack detection system is 70.23% better at
detecting phishing URLs that URL analysis misses but whose content, where
available, is analyzed. The proposed approach is therefore expected to work
as an effective phishing attack detection system, since it will encounter
more phishing website content when deployed in real-world use.
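
The two-stage decision mechanism described in the abstract can be sketched as follows. This is a minimal illustration, not the thesis implementation. The function name `decide`, the 0.5 threshold, and the use of phishing probabilities as model outputs are all assumptions. The abstract does not specify how the "ratio" of the two predictions is computed, so the final comparison (the content model's phishing probability against the URL model's confidence in "legitimate") is a hypothetical stand-in for that rule.

```python
from typing import Optional

PHISHING = "phishing"
LEGITIMATE = "legitimate"


def decide(url_phish_prob: float,
           content_phish_prob: Optional[float] = None,
           threshold: float = 0.5) -> str:
    """Hypothetical two-stage decision rule.

    url_phish_prob:     phishing probability from the stage-1 URL model.
    content_phish_prob: phishing probability from the stage-2 content model,
                        or None when the site is inactive / has no content.
    threshold:          assumed cut-off for the URL model (not from the thesis).
    """
    # Stage 1: a phishing verdict from URL analysis is final.
    if url_phish_prob >= threshold:
        return PHISHING

    # URL analysis says legitimate and no content is available:
    # the URL analysis result stands.
    if content_phish_prob is None:
        return LEGITIMATE

    # Stage 2 (assumed rule): weigh the content model's phishing probability
    # against the URL model's confidence that the site is legitimate.
    url_legit_conf = 1.0 - url_phish_prob
    if content_phish_prob > url_legit_conf:
        return PHISHING
    return LEGITIMATE


if __name__ == "__main__":
    print(decide(0.9))        # URL model alone flags phishing
    print(decide(0.2))        # no content, URL result stands
    print(decide(0.2, 0.9))   # content model overrides a weak "legitimate"
    print(decide(0.1, 0.6))   # URL model's confidence outweighs content model
```

Under these assumptions the content model can only turn a "legitimate" URL verdict into "phishing", never the reverse, which matches the abstract's statement that a phishing prediction from URL analysis is final.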