Efficient Storage and Analysis of Genomic Data: A k-mer Frequency Mapping and Image Representation Method


Luleci H. B., Yuka S., Yılmaz A.

Interdisciplinary Sciences - Computational Life Sciences, vol.17, no.3, pp.691-697, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Volume: 17 Issue: 3
  • Publication Date: 2025
  • Doi Number: 10.1007/s12539-024-00659-2
  • Journal Name: Interdisciplinary Sciences - Computational Life Sciences
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Agricultural & Environmental Science Database, BIOSIS, Biotechnology Research Abstracts, EMBASE, MEDLINE
  • Page Numbers: pp.691-697
  • Keywords: Alignment-free sequence comparison, Chaos game representation, Data compression, k-mer
  • Yıldız Technical University Affiliated: Yes

Abstract

k-mer frequencies are crucial for understanding DNA sequence patterns and structure, with applications in motif discovery, genome classification, and short read assembly. However, the exponential increase in the dimension of frequency tables with increasing k-mer length poses storage challenges. In this study, we present a novel method for compressing k-mer data without information loss, aiming to optimize storage and analysis processes. We employed Chaos Game Representation (CGR) to map k-mers to coordinates and used these components to generate raster images of k-mers. The CGR maps were partitioned and labeled based on substrings, with each substring mapped to a subframe, creating a fractal-like structure. The entire k-mer frequency set of each genomic sequence was represented as a single image, with each pixel corresponding to a specific k-mer and its occurrence. This approach reduced file size by up to 16-fold compared to plain text and 3-fold compared to binary format. Furthermore, we demonstrated the feasibility of performing alignment-free similarity analyses on images derived from k-mer frequencies of whole genome sequences from 14 plant species. Our results highlight the potential of this method as a fast and efficient tool for accessing, processing, and analyzing large biological sequence datasets, enabling the retrieval of k-mer frequencies and image reconstruction.
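
The CGR-to-pixel mapping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the corner assignment (A, C, G, T at the four corners of the unit square), the function names, and the choice to store counts in a plain nested list are all assumptions made for illustration. The sketch follows the classic CGR rule, in which each base moves the current point halfway toward its corner, so a k-mer lands in exactly one cell of a 2^k x 2^k grid and the cell value holds its occurrence count. The mapping is invertible, which is what allows k-mer frequencies to be read back from the image.

```python
# Assumed corner layout (a common CGR convention, not taken from the paper).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
BASES = {corner: base for base, corner in CORNERS.items()}


def kmer_to_pixel(kmer):
    """Map a k-mer to its (x, y) cell in a 2^k x 2^k CGR grid.

    Equivalent to iterating the classic CGR midpoint rule and flooring
    the final point: base i contributes bit i of each coordinate, so the
    last base selects the largest quadrant.
    """
    x = y = 0
    for i, base in enumerate(kmer):
        bx, by = CORNERS[base]
        x |= bx << i
        y |= by << i
    return x, y


def pixel_to_kmer(x, y, k):
    """Invert kmer_to_pixel: recover the k-mer stored at cell (x, y)."""
    return "".join(BASES[((x >> i) & 1, (y >> i) & 1)] for i in range(k))


def kmer_image(seq, k):
    """Count all k-mers of seq into a 2^k x 2^k grid (rows = y, columns = x).

    Windows containing characters other than A, C, G, T are skipped.
    """
    size = 1 << k
    image = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if all(base in CORNERS for base in kmer):
            x, y = kmer_to_pixel(kmer)
            image[y][x] += 1
    return image


if __name__ == "__main__":
    img = kmer_image("ACGTAC", 2)
    x, y = kmer_to_pixel("AC")
    print("AC ->", (x, y), "count:", img[y][x])
    print("cell", (x, y), "->", pixel_to_kmer(x, y, 2))
```

Because the k-mer is implied by the pixel position, only the counts need to be stored; writing the grid out as a raster image (for example with an imaging library) is left out of the sketch.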