Implementation of Data Preprocessing Techniques on Big Data Platforms


ÇELİK O., HASANBAŞOĞLU M., AKTAŞ M. S., KALIPSIZ O., KANLI A., TURGUT U.

International Conference on Computer Science and Engineering (UBMK 2019), 11 - 15 September 2019

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.1109/ubmk.2019.8907230
  • Keywords: Big Data, Distributed Computing, Outlier Analysis, Missing Value Imputation

Abstract

We are now in the era of Big Data, and the need for tools that can process and analyze such data has yet to be met. Big data mining aims to extract meaningful and valuable information from voluminous data that traditional data mining tools cannot handle. One of the most vital steps of any data mining process is the preprocessing of the data. Our aim was to provide distributed implementations of some algorithms for two of the data preprocessing steps: outlier analysis and missing value imputation. The algorithms were implemented on Spark, and this paper focuses on the details and performance of these algorithms on different distributed system setups.
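
The abstract does not say which outlier analysis or imputation algorithms the paper implements, so the following is only an illustrative sketch of the general pattern such distributed preprocessing tends to rely on: each partition computes small mergeable aggregates, and the merged result drives a per-partition transformation. It uses plain Python lists in place of Spark partitions, and mean imputation and z-score flagging are stand-in techniques chosen for brevity. All function names are hypothetical.

```python
import math
from typing import List, Optional, Tuple

# Assumption: a "partition" is modelled as a plain list; None marks a missing value.
Partition = List[Optional[float]]
Stats = Tuple[int, float, float]  # (count, sum, sum of squares)


def partial_stats(part: Partition) -> Stats:
    """Compute per-partition aggregates, skipping missing values."""
    vals = [v for v in part if v is not None]
    return len(vals), sum(vals), sum(v * v for v in vals)


def merge_stats(a: Stats, b: Stats) -> Stats:
    """Combine two partial aggregates; associative, so merge order is free."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def global_mean_std(partitions: List[Partition]) -> Tuple[float, float]:
    """Merge all partial aggregates into a global mean and population std."""
    total: Stats = (0, 0.0, 0.0)
    for part in partitions:
        total = merge_stats(total, partial_stats(part))
    n, s, ss = total
    if n == 0:
        raise ValueError("no observed values to aggregate")
    mean = s / n
    variance = max(ss / n - mean * mean, 0.0)  # clamp rounding noise
    return mean, math.sqrt(variance)


def impute_mean(partitions: List[Partition], mean: float) -> List[List[float]]:
    """Replace missing values with the global mean, partition by partition."""
    return [[mean if v is None else v for v in part] for part in partitions]


def flag_outliers(partitions: List[List[float]], mean: float, std: float,
                  threshold: float = 3.0) -> List[List[bool]]:
    """Flag values whose absolute z-score exceeds the threshold."""
    if std == 0.0:
        return [[False for _ in part] for part in partitions]
    return [[abs(v - mean) / std > threshold for v in part]
            for part in partitions]


if __name__ == "__main__":
    data = [[1.0, None, 3.0], [2.0, 2.0]]
    mean, std = global_mean_std(data)
    filled = impute_mean(data, mean)
    print(filled)
    print(flag_outliers(filled, mean, std, threshold=1.0))
```

Because the aggregates are merged with an associative operation, the same logic maps onto a reduce step over partitions in a distributed engine; only the small (count, sum, sum of squares) tuples need to cross the network.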