2022 7th International Conference on Computer Science and Engineering (UBMK), Diyarbakır, Türkiye, 14-16 September 2022, pp. 1-6
The massive amount of data generated by social media contains a great deal of toxic content, which leads to serious content-filtering problems involving hate speech, cyberbullying, and insults. Offensive content, even without profanity, may cause psychological and physical harm, particularly to children and sensitive people. As of 2022, Turkey hosts the 7th largest Twitter community among all countries, with more than 16 million active users, a user base that represents a highly diverse group of people given the country's population. Accordingly, there is a growing need for a comprehensive, high-quality Turkish dataset that can be used to develop NLP models for robust detection of offensive language on social media. Related studies in the literature have mostly focused on small, synthetic, and label-imbalanced datasets. Machine learning models trained on such datasets tend to favor the majority class to maximize accuracy and therefore generalize poorly. However, it is challenging to create an unbiased dataset containing hate speech without offensive words, and to build an accurate detection model that identifies actual hate speech tweets, because the models may lack sufficient context in the absence of swear words. Therefore, we propose a data augmentation approach based on data mining methods that utilizes the linguistic features of Turkish and can help enhance the generalizability of machine learning models without further human interaction. Furthermore, we evaluate the impact of our comprehensive dataset on the detection of offensive language in social media. NLP models trained on the augmented dataset improved the macro-average detection accuracy by 7.60% compared with the baseline approach.
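To illustrate why a label-imbalanced dataset can reward a model that favors the majority class, and why a macro-averaged metric exposes the problem, the following is a minimal sketch on hypothetical toy labels. The data, the function names, and the choice of macro-averaged recall as the metric are assumptions made for illustration only; they are not taken from the paper's dataset or evaluation code.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so every class counts equally."""
    per_class = []
    for label in sorted(set(y_true)):
        # Indices of the samples that truly belong to this class.
        idx = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(y_pred[i] == label for i in idx)
        per_class.append(correct / len(idx))
    return sum(per_class) / len(per_class)


# Hypothetical toy data: 9 non-offensive (0) tweets and 1 offensive (1) tweet.
y_true = [0] * 9 + [1]
# A degenerate model that always predicts the majority class.
y_pred = [0] * 10

plain = accuracy(y_true, y_pred)      # high, despite missing every offensive tweet
macro = macro_recall(y_true, y_pred)  # penalized for the missed minority class

print(f"accuracy={plain:.2f}, macro recall={macro:.2f}")
```

In this toy case the always-majority model reaches 0.90 plain accuracy but only 0.50 macro-averaged recall, which is why macro-averaged scores are a common choice when comparing models on imbalanced offensive-language data.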