Investigating Tabular Generative Models for Synthetic Data Generation in PDAC Bulk Gene Expression Data

TURGUT ÖGME S. S., Kurt Z., Aydın N.

7th International Conference on Statistics: Theory and Applications, ICSTA 2025, Paris, France, 17 - 19 August 2025 (Full Text Paper)

Publication Type: Conference Paper / Full Text Paper
DOI Number: 10.11159/icsta25.174
City of Publication: Paris
Country of Publication: France
Keywords: ensemble, gene expression, generative models, pancreatic cancer
Affiliated with Yıldız Teknik Üniversitesi: Yes

Abstract

Pancreatic Ductal Adenocarcinoma (PDAC) is among the deadliest cancer types, and early detection is critical to improving survival rates. However, developing effective detection models is challenging because they require high-quality, class-balanced datasets. Generative models have recently gained attention as a way to address this issue. In this study, we compare three generative models for tabular data, Conditional Tabular Generative Adversarial Network (CTGAN), Tabular Variational Autoencoder (TVAE), and Gaussian Copula (GC), on PDAC gene expression data. We first constructed an integrated dataset by curating six PDAC studies and applied an ensemble-based feature selection approach combining Differential Expression (DEG) analysis, ANOVA, Lasso, and Mutual Information. The synthetic data were evaluated statistically (using the Correlation Discrepancy (CD), Kolmogorov-Smirnov (KS), and Statistical Similarity (SS) metrics), biologically (via PDAC marker genes), and visually in 2D PCA space. The GC model produced the most realistic synthetic data, with metric values of 0.1482 CD, 0.8120 KS, and 0.9529 SS, expression levels of PDAC marker genes similar to those in the real data, and a distribution consistent with that of the real data. TVAE ranked second. Based on these findings, we propose an ensemble model that combines GC- and TVAE-generated samples. Classification experiments using Random Forest (RF) and Support Vector Machine (SVM) classifiers showed that the ensemble generative model did not achieve the highest performance with SVM (0.8541 precision, 0.8570 recall, 0.8533 F1-measure, and 0.9236 AUC) but did achieve it with RF (0.8549 precision, 0.8623 recall, 0.8568 F1-measure, and 0.9246 AUC), making it a promising model for future applications.
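
The statistical evaluation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact metric definitions, so the following are assumptions: the KS score is taken as the mean over genes of one minus the two-sample Kolmogorov-Smirnov statistic (higher is better), and CD as the mean absolute difference between the real and synthetic gene-gene Pearson correlation matrices (lower is better). The function names and the toy matrices are hypothetical and are not the PDAC data or the authors' implementation.

```python
import numpy as np


def ks_statistic(real_col, synth_col):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    real_sorted = np.sort(real_col)
    synth_sorted = np.sort(synth_col)
    pooled = np.concatenate([real_sorted, synth_sorted])
    # Empirical CDF of each sample, evaluated at every pooled observation.
    cdf_real = np.searchsorted(real_sorted, pooled, side="right") / real_sorted.size
    cdf_synth = np.searchsorted(synth_sorted, pooled, side="right") / synth_sorted.size
    return float(np.max(np.abs(cdf_real - cdf_synth)))


def ks_similarity(real, synth):
    """Assumed KS score: mean over genes of (1 - KS statistic)."""
    scores = [1.0 - ks_statistic(real[:, j], synth[:, j]) for j in range(real.shape[1])]
    return float(np.mean(scores))


def correlation_discrepancy(real, synth):
    """Assumed CD: mean absolute difference of pairwise Pearson correlations."""
    corr_real = np.corrcoef(real, rowvar=False)
    corr_synth = np.corrcoef(synth, rowvar=False)
    upper = np.triu_indices_from(corr_real, k=1)  # count each gene pair once
    return float(np.mean(np.abs(corr_real[upper] - corr_synth[upper])))


# Toy stand-ins for expression matrices (samples x genes); not PDAC data.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6, 0.2], [0.6, 1.0, 0.4], [0.2, 0.4, 1.0]])
real = rng.multivariate_normal(np.zeros(3), cov, size=500)
good_synth = rng.multivariate_normal(np.zeros(3), cov, size=500)
bad_synth = rng.normal(loc=2.0, size=(500, 3))  # shifted and uncorrelated

print("KS similarity:", ks_similarity(real, good_synth), ks_similarity(real, bad_synth))
print("Correlation discrepancy:",
      correlation_discrepancy(real, good_synth),
      correlation_discrepancy(real, bad_synth))
```

Under these assumed definitions, a synthetic set drawn from the same distribution as the real data should score a higher KS similarity and a lower correlation discrepancy than a shifted, uncorrelated one, which is the direction in which the reported GC values would be read.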