On the big data processing algorithms for finding frequent sequences


Can A. B., Zaval M., Uzun-Per M., Aktaş M. S.

Concurrency and Computation: Practice and Experience, vol.35, no.24, 2023 (SCI-Expanded)

  • Publication Type: Article / Abstract
  • Volume: 35 Issue: 24
  • Publication Date: 2023
  • Doi Number: 10.1002/cpe.7660
  • Journal Name: Concurrency and Computation: Practice and Experience
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Aerospace Database, Applied Science & Technology Source, Communication Abstracts, Compendex, Computer & Applied Sciences, INSPEC, Metadex, zbMATH, Civil Engineering Abstracts
  • Keywords: Apache Spark, big data, distributed systems, DLA, GSP, PrefixSpan, sequential pattern mining
  • Yıldız Technical University Affiliated: Yes

Abstract

Sequential pattern mining algorithms extract frequently occurring sequences from ordered transactional datasets, such as market basket datasets. There is a lack of research employing big data processing techniques to locate frequent sequences in large-scale datasets. Furthermore, there is a need for optimized sequential pattern mining algorithms that run on ordered one-dimensional sequences. We also observe a lack of sequential pattern search studies in the literature, where the focus is on multi-dimensional data sequences. Existing approaches that deal with ordered one-dimensional datasets suffer from scalability issues because the amount of data to be analyzed is enormous. This research investigates big data processing techniques used to find frequent sequences in large-scale datasets. It also proposes a scalable sequential pattern mining algorithm called Sequential Pattern Acquisition by Reducing Search Space (SPARSS), designed for distributed data processing systems that efficiently handle large datasets containing sequential one-element data. It introduces a prototype implementation of SPARSS and reports SPARSS's memory and time requirements, which were calculated as part of experimental studies on a real-world dataset. The results confirm our expectations and demonstrate SPARSS's superior scalability and run-time efficiency compared to other distributed algorithms.
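
To make the notion of a frequent sequence in ordered one-element data concrete, the following is a minimal single-machine sketch. It is not SPARSS, whose internals are not described in this abstract; it assumes a simple prefix-projection strategy in the spirit of PrefixSpan (one of the listed keywords). The function name `frequent_sequences`, its parameters, and the toy dataset are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def frequent_sequences(db, min_support, max_len=3):
    """Return {pattern: support} for every pattern (a tuple of items) that
    occurs as a subsequence, not necessarily contiguous, in at least
    `min_support` sequences of `db`. Hypothetical helper, not SPARSS."""
    results = {}

    def grow(prefix, projected):
        # `projected` holds one (sequence index, start position) pair per
        # sequence that contains `prefix`; db[i][start:] is what follows
        # the leftmost match of the prefix in that sequence.
        if len(prefix) >= max_len:
            return
        occurrences = defaultdict(list)
        for seq_idx, start in projected:
            seq = db[seq_idx]
            seen = set()
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    # Keep only the earliest occurrence so each sequence
                    # contributes at most once to an item's support.
                    seen.add(item)
                    occurrences[item].append((seq_idx, pos + 1))
        for item, next_projected in occurrences.items():
            if len(next_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(next_projected)
                grow(pattern, next_projected)

    grow((), [(i, 0) for i in range(len(db))])
    return results


# Toy dataset (invented for this example): four ordered item sequences.
db = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "a", "c"],
    ["a", "b"],
]
patterns = frequent_sequences(db, min_support=3)
# patterns == {("a",): 4, ("b",): 3, ("c",): 3, ("a", "c"): 3}
```

Extending only prefixes that are already frequent is what keeps the search space small in this sketch. A distributed variant would additionally need to partition the projected data across workers, which is outside the scope of this illustration.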