An experimental and comparative benchmark study examining resource utilization in managed Hadoop context


Creative Commons License

Ozdil U. E. , Ayvaz S.

CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS, 2022 (Peer-Reviewed Journal)

  • Publication Type: Article
  • Publication Date: 2022
  • DOI Number: 10.1007/s10586-022-03728-7
  • Journal Name: CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS
  • Journal Indexes: Science Citation Index Expanded, Scopus, Academic Search Premier, PASCAL, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC
  • Keywords: Big Data, Managed Hadoop, Hadoop-on-PaaS, HiBench, Performance evaluation

Abstract

Transitioning cloud-based Hadoop frameworks from IaaS to PaaS, which are commercially conceptualized as pay-as-you-go or pay-per-use, often reduces the associated system costs. However, managed Hadoop systems obscure the inner performance dynamics of the platform and present black-box behavior to end users. The aim of this study was to investigate the resource utilization of current managed Hadoop platforms. To this end, we explored three prominent Hadoop-on-PaaS offerings as they come out of the box and ran Hadoop-specific workloads using the HiBench Benchmark Suite. During the benchmark executions, system resource utilization data were collected from the worker nodes and analyzed. The results indicated that the same property specifications among cloud services neither guarantee similar performance outputs nor produce consistent results across different workloads within the same service. We anticipate that the managed systems' architectures and pre-configurations play a crucial role in the performance outcomes.