44th National Congress on Operations Research and Industrial Engineering (44. Yöneylem Araştırması ve Endüstri Mühendisliği Ulusal Kongresi, YA/EM 2025), Ankara, Türkiye, 25-27 June 2025, p. 272 (Abstract)
Biomedical data used in bioinformatics can be categorized as omics or non-omics. Non-omics data comprises clinical information such as patient records, epidemiological data, and physician assessments; it is obtained from reports based on personal health information or from physician evaluations derived from patient health reports. Omics data, by contrast, consists of biological attributes that serve as potential biomarkers and help explain the connections between different molecules and organisms. Omics data, including genomics, transcriptomics, and proteomics, supports gene expression analysis as well as functional and structural genomics analyses. High-throughput sequencing is the primary method of generating omics data, especially for genomic (DNA sequencing), transcriptomic (RNA sequencing), and, in part, epigenomic (ChIP-sequencing, bisulfite sequencing, etc.) studies. It has dramatically advanced the study of the human genome, especially in cancer research. Initially, high-throughput sequencing was performed on RNA extracted from a tissue sample containing multiple cell types; this is called bulk sequencing. Sequencing technology has since advanced to the point where individual cells can be sequenced, enabling more granular insight; this is called single-cell sequencing, or single-cell RNA sequencing (scRNA-seq). Meanwhile, significant advances in artificial intelligence, especially deep learning and foundation models such as BERT and GPT-3, have expanded into bioinformatics. These models, initially developed for natural language processing, are increasingly applied to large-scale biological data. Notably, scGPT represents a pioneering effort to apply such models to scRNA-seq data; its authors sought to show that foundation models can advance cellular biology and genetic research. Other foundation models built on scRNA-seq data include scFoundation, scBERT, and xTrimoGene.
At the same time, a wealth of knowledge has accumulated from the many studies conducted with bulk sequencing data, and this knowledge can advance research on scRNA-seq data. To our knowledge, however, no research connects these two research streams. Our study investigates how well bulk sequencing data suits models built on scRNA-seq data. For this purpose, data from The Cancer Genome Atlas Program (TCGA), covering 33 cancer types and approximately eleven thousand cases (the model's observations), was fed into the scGPT model, which embedded 19,318 protein-coding genes (the variables) into a 512-dimensional space. A random forest (RF) algorithm, a machine learning technique, was then run on the resulting embeddings and achieved an accuracy of 73.57%. In the next stage, a pre-selection step was applied to select the highly variable genes (HVG), those exhibiting the most significant expression variability across cells. With the embeddings produced after HVG pre-selection, the accuracy of the RF algorithm increased to 86.33%.
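The HVG pre-selection step can be illustrated with a minimal sketch. The abstract does not say which HVG criterion was used, so the code below assumes the simplest one: ranking genes by the plain variance of their expression across samples. Single-cell toolkits typically use dispersion-based criteria instead. The function name `select_highly_variable_genes`, the gene labels, and the toy values are all hypothetical and are not taken from the study.

```python
from statistics import pvariance


def select_highly_variable_genes(expression, gene_names, n_top):
    """Return the names of the n_top genes with the largest variance.

    expression: list of samples, each a list of expression values,
                one value per gene (same order as gene_names).
    Assumption: plain population variance stands in for the HVG
    criterion, which the abstract does not specify.
    """
    if not expression:
        raise ValueError("expression matrix is empty")
    if any(len(row) != len(gene_names) for row in expression):
        raise ValueError("each sample needs one value per gene")

    # Variance of each gene (column) across all samples (rows).
    variances = [pvariance(column) for column in zip(*expression)]

    # Rank gene indices by variance, highest first; ties keep original order.
    ranked = sorted(range(len(gene_names)),
                    key=lambda i: variances[i], reverse=True)

    # Keep the top n_top genes, restored to their original order.
    keep = sorted(ranked[:n_top])
    return [gene_names[i] for i in keep]


# Toy data: 4 samples x 4 hypothetical genes (values are made up).
toy_expression = [
    [1.0, 5.0, 0.0, 2.0],
    [1.0, 9.0, 4.0, 2.1],
    [1.0, 1.0, 8.0, 1.9],
    [1.0, 5.0, 0.0, 2.0],
]
toy_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

selected = select_highly_variable_genes(toy_expression, toy_genes, n_top=2)
print(selected)  # ['GENE_B', 'GENE_C']
```

In a pipeline like the one described, only the selected genes would be passed to the embedding model, and the resulting embeddings would then be given to the classifier. The sketch covers the selection step alone.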