Implementation of Data Preprocessing Techniques on Big Data Platforms

ÇELİK O., HASANBAŞOĞLU M., AKTAŞ M. S., KALIPSIZ O., KANLI A., TURGUT U.

International Conference on Computer Science and Engineering (UBMK 2019), 11-15 September 2019

Publication Type: Conference Paper / Full-Text Paper
DOI Number: 10.1109/ubmk.2019.8907230
Keywords: Big Data, Distributed Computing, Outlier Analysis, Missing Value Imputation
Yıldız Technical University Affiliated: Yes

Abstract

We are now in the era of Big Data, and the need for tools that can process and analyze such data has yet to be met. Big data mining aims to extract meaningful and valuable information from voluminous data that traditional data mining tools cannot handle. One of the most vital steps of any data mining process is data preprocessing. Our aim was to provide distributed implementations of several algorithms for two data preprocessing steps: outlier analysis and missing value imputation. The algorithms were implemented on Spark, and this paper focuses on their details and their performance on different distributed system setups.
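
The abstract does not say which algorithms were implemented, so the following is only an illustrative sketch of how the two preprocessing steps can be expressed in a distributed, map-and-reduce style. It assumes mean imputation and a simple standard-deviation rule for outliers; neither is confirmed by the source. Plain Python lists stand in for data partitions, and the data, function names, and threshold are all hypothetical.

```python
from functools import reduce
from math import sqrt

# Hypothetical data: each inner list stands in for one partition held by a worker.
# None marks a missing value.
partitions = [
    [10.0, None, 12.0],
    [11.0, 13.0, None],
    [9.0, 100.0, 11.0],
]

def partial_stats(part):
    """Map step: per-partition (count, sum, sum of squares), skipping missing values."""
    vals = [v for v in part if v is not None]
    return (len(vals), sum(vals), sum(v * v for v in vals))

def combine(a, b):
    """Reduce step: merge two partial summaries element-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

# Only the small summaries are combined; raw rows never leave their partition.
count, total, total_sq = reduce(combine, map(partial_stats, partitions))
mean = total / count
std = sqrt(total_sq / count - mean * mean)  # population standard deviation

def impute(part):
    """Replace missing values with the global mean (assumed imputation rule)."""
    return [mean if v is None else v for v in part]

def flag_outliers(part, threshold=2.0):
    """Return values farther than threshold * std from the mean (assumed outlier rule)."""
    return [v for v in part if v is not None and abs(v - mean) > threshold * std]

imputed = [impute(p) for p in partitions]
outliers = [v for p in partitions for v in flag_outliers(p)]
```

The global statistics are assembled from small per-partition summaries, which is the pattern that lets this kind of preprocessing scale out. In this sketch the mean used for imputation is computed before outliers are removed, so an extreme value can skew the imputed values; a real pipeline would need to decide the order of the two steps.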