International Conference on Computer Science and Engineering (UBMK 2019), 11-15 September 2019
We are now in the era of big data, and the need for tools that can process and analyze such data has yet to be met. Big data mining aims to extract meaningful and valuable information from data too voluminous for traditional data mining tools to handle. One of the most vital steps of any data mining process is data preprocessing. Our aim was to provide distributed implementations of several algorithms for two data preprocessing steps: outlier analysis and missing value imputation. The algorithms were implemented on Spark, and this paper focuses on their details and their performance on different distributed system setups.