Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method


Luleci H. B., Yuka S. A., YILMAZ A.

Interdisciplinary Sciences - Computational Life Sciences, 2024 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Publication Date: 2024
  • DOI Number: 10.1007/s12539-024-00659-2
  • Journal Name: Interdisciplinary Sciences - Computational Life Sciences
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Agricultural & Environmental Science Database, BIOSIS, Biotechnology Research Abstracts, EMBASE, MEDLINE
  • Keywords: Alignment-free sequence comparison, Chaos game representation, Data compression, k-mer
  • Affiliated with Yıldız Teknik Üniversitesi: Yes

Abstract

k-mer frequencies are crucial for understanding DNA sequence patterns and structure, with applications in motif discovery, genome classification, and short read assembly. However, the exponential increase in the dimension of frequency tables with increasing k-mer length poses storage challenges. In this study, we present a novel method for compressing k-mer data without information loss, aiming to optimize storage and analysis processes. We employed Chaos Game Representation (CGR) to map k-mers to coordinates and used these components to generate raster images of k-mers. The CGR maps were partitioned and labeled based on substrings, with each substring mapped to a subframe, creating a fractal-like structure. The entire k-mer frequency set of each genomic sequence was represented as a single image, with each pixel corresponding to a specific k-mer and its occurrence. This approach reduced file size by up to 16-fold compared to plain text and 3-fold compared to binary format. Furthermore, we demonstrated the feasibility of performing alignment-free similarity analyses on images derived from k-mer frequencies of whole genome sequences from 14 plant species. Our results highlight the potential of this method as a fast and efficient tool for accessing, processing, and analyzing large biological sequence datasets, enabling the retrieval of k-mer frequencies and image reconstruction.
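
The idea of giving every k-mer its own pixel can be illustrated with a minimal Python sketch. It is not the authors' implementation: the corner assignment in `CORNERS`, the prefix-based quadrant subdivision, and the function names `kmer_to_pixel`, `pixel_to_kmer`, and `kmer_image` are all assumptions made for illustration, since the abstract does not specify them. Under these assumptions, each base selects one quadrant of the current frame, so k-mers that share a prefix land in the same subframe, and the 4^k possible k-mers map one-to-one onto a 2^k × 2^k grid.

```python
from collections import Counter

# Assumed corner assignment (one common CGR convention; the abstract
# does not state which corners the authors use).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
INVERSE = {bits: base for base, bits in CORNERS.items()}


def kmer_to_pixel(kmer):
    """Map a k-mer to (row, col) in a 2^k x 2^k grid.

    Each base contributes one bit to the column and one bit to the row,
    most significant first, so the first base picks the largest quadrant.
    """
    x = y = 0
    for base in kmer:
        bx, by = CORNERS[base]
        x = (x << 1) | bx
        y = (y << 1) | by
    return y, x


def pixel_to_kmer(row, col, k):
    """Recover the k-mer stored at (row, col); inverse of kmer_to_pixel."""
    bases = []
    for shift in range(k - 1, -1, -1):
        bx = (col >> shift) & 1
        by = (row >> shift) & 1
        bases.append(INVERSE[(bx, by)])
    return "".join(bases)


def kmer_image(sequence, k):
    """Build a 2^k x 2^k grid whose pixel values are k-mer counts.

    k-mers containing symbols other than A, C, G, T are skipped.
    """
    size = 1 << k
    image = [[0] * size for _ in range(size)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    for kmer, n in counts.items():
        if all(base in CORNERS for base in kmer):
            row, col = kmer_to_pixel(kmer)
            image[row][col] = n
    return image
```

Because the mapping is a bijection, the count of any k-mer can be read back directly from its pixel, and the full frequency table can be rebuilt from the grid by calling `pixel_to_kmer` on every position. Writing the grid to an actual raster file format is left out of the sketch.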