Computer Science, vol.23, no.3, pp.377-396, 2022 (SCI-Expanded)
Data analysis becomes more difficult as the volume of data grows. In particular, extracting meaningful insights from vast amounts of text and grouping documents by their shared features without human intervention requires advanced methodologies. Topic modeling methods address this problem in text analysis for downstream tasks such as sentiment analysis, spam detection, and news classification. In this research, we benchmark several classifiers, namely Random Forest, AdaBoost, Naive Bayes, and Logistic Regression, using the classical LDA and the n-stage LDA topic modeling methods for feature extraction in headline classification. We run our experiments on publicly available Turkish and English datasets with 3 and 5 classes. We demonstrate that n-stage LDA as a feature extractor obtains state-of-the-art performance for every downstream classifier. It should also be noted that Random Forest was the most successful algorithm on both datasets.
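The pipeline described above (topic-model features feeding a classifier) can be sketched as follows. This is a minimal illustration using scikit-learn's standard `LatentDirichletAllocation` and `RandomForestClassifier`; the toy headlines, labels, and parameter values are illustrative assumptions, and the paper's n-stage LDA variant is not reproduced here.

```python
# Hedged sketch: LDA topic distributions as features for headline
# classification, in the spirit of the setup described in the abstract.
# The headlines/labels below are invented toy data, not from the paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

headlines = [
    "stocks rally after strong earnings report",
    "home team wins the championship game",
    "new smartphone model released today",
    "market falls on interest rate fears",
]
labels = ["economy", "sports", "tech", "economy"]

pipeline = Pipeline([
    # Bag-of-words counts feed the topic model.
    ("bow", CountVectorizer()),
    # Each headline becomes a vector of topic proportions.
    ("lda", LatentDirichletAllocation(n_components=3, random_state=0)),
    # The classifier is trained on those topic-proportion features.
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

pipeline.fit(headlines, labels)
preds = pipeline.predict(headlines)
```

Swapping `RandomForestClassifier` for AdaBoost, Naive Bayes, or Logistic Regression reproduces the rest of the benchmarked classifier set over the same LDA features.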