Türkçe Duruş Tespit Analizi (Turkish Stance Detection Analysis)

Polat, Kaan; Güler Bayazıt, Nilgün; Yıldız, Olcay

doi:10.31590/ejosat.851584

Polat K. K., Güler Bayazıt N., Yıldız O. T.

Avrupa Bilim ve Teknoloji Dergisi (European Journal of Science and Technology), no. 23, pp. 99-107, 2021 (TRDizin)

Publication Type: Article / Full Article
Publication Date: 2021
DOI: 10.31590/ejosat.851584
Journal Name: Avrupa Bilim ve Teknoloji Dergisi
Indexes Covering the Journal: TR DİZİN (ULAKBİM)
Pages: pp. 99-107
Affiliated with Yıldız Technical University: Yes

With the spread of internet use, people have begun sharing their thoughts and current moods through social media and online forums, which has led to a large increase in the amount of text data. Social media data, especially data obtained from Twitter, are used in many kinds of studies, such as sentiment analysis, text classification, topic modelling, irony detection and opinion mining. Stance detection is one of these. For a given target-comment pair, stance detection is the task of automatically extracting the comment author's stance toward the target from the text of the comment. The target can be a person, an event, a situation or a product. The aim is to classify the stance of the comment's author toward a specific target as “Favor”, “Against” or “Neither”. To the best of our knowledge, no comprehensive dataset exists for stance detection studies in Turkish. As its first contribution, this work creates a Turkish Stance Dataset consisting of comments collected for 6 targets by web scraping an online forum. The dataset contains 5031 target-comment pairs in total. Each pair was labeled by graduates of university language departments. Bag of Words, Term Frequency-Inverse Document Frequency and word embedding methods were used for text representation. Stance detection was performed on the dataset with the Naive Bayes, Support Vector Machine, AdaBoost, XGBoost, Random Forest and Convolutional Neural Network methods, and the results are reported. The Matthews Correlation Coefficient was used for performance evaluation. In the experiments, the best results were obtained with the XGBoost and Convolutional Neural Network methods.
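As an illustration of the evaluation metric named above, the following is a minimal sketch of the multiclass Matthews Correlation Coefficient for the three stance labels. The function name `multiclass_mcc` and the sample labels are assumptions made for this example; they are not taken from the paper, and the paper's own evaluation code is not shown here.

```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews Correlation Coefficient.

    Uses the standard multiclass generalisation:
        (c*s - sum_k t_k*p_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2))
    where s is the number of samples, c the number of correct predictions,
    t_k the number of samples whose true label is k, and p_k the number of
    samples predicted as k. Returns 0.0 when the denominator is zero.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = Counter(y_true)  # true-label counts (missing keys count as 0)
    p_k = Counter(y_pred)  # predicted-label counts
    labels = set(t_k) | set(p_k)

    numerator = c * s - sum(t_k[k] * p_k[k] for k in labels)
    spread_true = s * s - sum(t_k[k] ** 2 for k in labels)
    spread_pred = s * s - sum(p_k[k] ** 2 for k in labels)
    denominator = sqrt(spread_true * spread_pred)
    return numerator / denominator if denominator else 0.0


# Hypothetical gold labels and predictions, for illustration only.
gold = ["Favor", "Against", "Neither", "Favor", "Against", "Neither"]
pred = ["Favor", "Against", "Favor", "Favor", "Neither", "Neither"]
print(round(multiclass_mcc(gold, pred), 3))
```

Unlike plain accuracy, this score is 0 for a classifier that always predicts the same label, which makes it a reasonable choice when the stance classes are unevenly represented.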
The integrated gradients method was applied to the features extracted by the Convolutional Neural Network model to examine how the input features contribute to the model's prediction; the contribution of each word in a comment to the prediction is visualized and presented with examples.
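To make the attribution idea concrete, here is a minimal sketch of integrated gradients on a toy logistic model over bag-of-words features. It is not the Convolutional Neural Network from the paper: the vocabulary, weights, bias and all-zero baseline are hypothetical values chosen only for illustration, and the path integral is approximated with a midpoint Riemann sum.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def model(x, w, bias):
    """Toy stand-in model: probability of the 'Favor' class."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + bias)


def model_grad(x, w, bias):
    """Analytic gradient of the toy model with respect to its input x."""
    p = model(x, w, bias)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, bias, steps=200):
    """Approximate IG_i = (x_i - x'_i) * integral over alpha in [0, 1] of
    dF/dx_i evaluated at x' + alpha * (x - x'), using a midpoint sum."""
    grad_sum = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        for i, g in enumerate(model_grad(point, w, bias)):
            grad_sum[i] += g
    return [(xi - b0) * g / steps for xi, b0, g in zip(x, baseline, grad_sum)]


# Hypothetical vocabulary and weights, for illustration only.
vocab = ["support", "oppose", "the", "ban"]
w = [2.0, -2.5, 0.0, 0.3]
bias = -0.1
x = [1.0, 0.0, 1.0, 1.0]       # bag-of-words vector for "support the ban"
baseline = [0.0] * len(x)      # assumed baseline: an empty comment

attributions = integrated_gradients(x, baseline, w, bias)
for word, score in zip(vocab, attributions):
    print(f"{word:>8}: {score:+.4f}")
```

A useful sanity check is the completeness property: the attributions should sum, up to approximation error, to the difference between the model output for the input and for the baseline. Per-word scores of this kind are what a visualization of word contributions would typically display.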