Hate Speech Dataset from Turkish Tweets


Creative Commons License

Mayda İ., Demir Y. E., Dalyan T., Diri B.

2021 Innovations in Intelligent Systems and Applications Conference (ASYU), Elazığ, Türkiye, 6-8 October 2021, pp. 1-4

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI: 10.1109/asyu52992.2021.9599042
  • City of Publication: Elazığ
  • Country of Publication: Türkiye
  • Pages: pp. 1-4
  • Affiliated with Yıldız Technical University: Yes

Abstract

As internet access spreads, the volume of user-generated content on online platforms is growing rapidly, and hate speech on these platforms is growing along with it. Social media platforms with millions of users are among the places where hate speech is shared most frequently. Popular social media companies set their own policies to combat hate speech, but the sheer volume of data on the internet makes manual moderation almost impossible. Consequently, many studies on the automatic detection of hate speech have been conducted, especially in recent years. Although most studies in the literature focus on English, there are published studies on hate speech detection in many other languages, such as German, French, Arabic, Indonesian, and Portuguese. One of the main reasons there are fewer studies in languages other than English is that publicly shared hate speech datasets in those languages are fewer and smaller. The situation is similar for Turkish. Therefore, in this study, a hate speech dataset comprising 10,224 Turkish tweets was created and shared publicly. Tweets were labeled as hate, offensive, or none, and tweets labeled as hate were also assigned a subclass label, such as ethnic, religious, sexist, or political, indicating the type of hate. In the first step of the labeling process, two annotators labeled all tweets independently. A comparison of their labels showed an agreement rate of 92.5%. The two annotators then discussed the tweets they had labeled differently, which raised the agreement rate to 98.4%. For the remaining tweets, the opinion of a third evaluator was sought. After the labeling process, the proportion of hate speech in the dataset was 22.8%.
This publicly available dataset, a first for Turkish in terms of scope and size, is expected to be an important resource for studies on automatic hate speech detection in Turkish.
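
The agreement figures in the abstract can be illustrated with a minimal Python sketch. It assumes the reported rate is simple percentage agreement (the share of tweets that received the same label from both annotators), which the abstract does not state explicitly. The function names and the five-item toy label lists are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def agreement_rate(labels_a, labels_b):
    """Share of items that received the same label from both annotators.

    Assumption: simple percentage agreement; the abstract does not say
    whether a chance-corrected measure was used instead.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both annotators must label the same items.")
    if not labels_a:
        raise ValueError("At least one labeled item is required.")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def disagreements(labels_a, labels_b):
    """Indices of items to discuss or pass to a third evaluator."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


def label_share(labels, target):
    """Proportion of items carrying the label `target`."""
    return Counter(labels)[target] / len(labels)


# Toy labels, invented for illustration only (not from the dataset).
annotator_1 = ["hate", "none", "offensive", "none", "hate"]
annotator_2 = ["hate", "none", "none", "none", "hate"]

print(agreement_rate(annotator_1, annotator_2))
print(disagreements(annotator_1, annotator_2))
print(label_share(annotator_1, "hate"))
```

In this sketch, the indices returned by `disagreements` correspond to the tweets that would go to the discussion step and, if still unresolved, to the third evaluator.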