On the Use of Data Parallelism Technologies for Implementing Statistical Analysis Functions


Oskooei A. R.

14th International Workshop on Computer Science and Engineering, WCSE 2024, Phuket Island, Thailand, 19-21 June 2024, pp. 94-102 (Full Text)

  • Publication Type: Conference Paper / Full Text
  • DOI Number: 10.18178/wcse.2024.06.015
  • City: Phuket Island
  • Country: Thailand
  • Page Numbers: pp.94-102
  • Keywords: Apache Spark, Big Data Analytics, Data Parallelism, MapReduce, Parallel Processing, Statistical Functions
  • Yıldız Technical University Affiliated: No

Abstract

This study presents a comparative analysis of data parallelism technologies for implementing statistical analysis functions, centered on the Apache Spark big data processing framework. As data volume and complexity continue to grow exponentially, selecting the right parallel processing framework is crucial for efficient big data analysis. We evaluate the performance and suitability of Spark's data parallelism capabilities for implementing descriptive, exploratory, and inferential statistical functions. A comparison of Apache Spark with Hadoop MapReduce highlights Spark's superior performance, especially on complex and iterative analytical tasks. The results show significant performance gains with Spark, positioning it as the preferred framework for a variety of statistical analysis needs in the big data era. These findings offer insights for researchers and practitioners looking to optimize their data analysis workflows and make full use of big data technologies.
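
To illustrate what implementing a descriptive statistical function in a data-parallel style can look like, the following is a minimal sketch in plain Python. It is not the paper's implementation and does not use Spark. It assumes the data is already split into partitions, computes a (count, mean, M2) summary per partition with Welford's one-pass update, and merges the summaries pairwise with the standard parallel variance combination formula. The names `summarize`, `merge`, and `parallel_mean_variance` are hypothetical and chosen only for this example.

```python
from functools import reduce
from typing import Iterable, List, Tuple

# A partition summary: (count, mean, M2), where M2 is the sum of
# squared deviations from the partition mean.
Summary = Tuple[int, float, float]


def summarize(partition: Iterable[float]) -> Summary:
    """Map step: one-pass (Welford) summary of a single partition."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in partition:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2


def merge(a: Summary, b: Summary) -> Summary:
    """Reduce step: combine two partition summaries into one."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


def parallel_mean_variance(partitions: List[List[float]]) -> Tuple[float, float]:
    """Return (mean, sample variance) computed from per-partition summaries."""
    n, mean, m2 = reduce(merge, map(summarize, partitions), (0, 0.0, 0.0))
    variance = m2 / (n - 1) if n > 1 else float("nan")
    return mean, variance


if __name__ == "__main__":
    # Illustrative data only, split into three partitions.
    partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
    mean, variance = parallel_mean_variance(partitions)
    print(f"mean={mean:.4f} sample_variance={variance:.4f}")
```

Because `merge` is associative, the per-partition summaries can be combined in any grouping, which is the property a data-parallel framework relies on when each partition is processed on a different worker. Here the map step runs sequentially, purely for illustration.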