Imbalanced generative sampling of training data for improving quality of machine learning model

Coskun U. C., Doğan K. M., Günpınar E.

Advanced Engineering Informatics, vol.62, 2024 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 62
  • Publication Date: 2024
  • DOI Number: 10.1016/j.aei.2024.102631
  • Journal Name: Advanced Engineering Informatics
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Aerospace Database, Applied Science & Technology Source, Communication Abstracts, Compendex, Computer & Applied Sciences, INSPEC, Metadex, Civil Engineering Abstracts
  • Keywords: Computational fluid dynamics, Computer-aided design, Design exploration, Imbalanced sampling, Machine learning, Training data
  • Yıldız Technical University Affiliated: Yes


Design exploration in engineering applications often requires a meticulous experimental or numerical study to evaluate the performance (Y) of each design, which may demand great effort, time, or resources. Reducing the number of these tests needed to find a good design is of paramount importance in all engineering fields. This study aims to build a machine learning (ML) model using fewer designs as training data. Uniform sampling (US) in the design space (based on predefined design parameters) to obtain training data is a promising approach. We further extend this sampling concept to obtain designs in the design space by also employing the ML model. The designs are selected via two non-uniform (imbalanced) sampling methods (namely, height-based sampling - HBS and gradient-based sampling - GBS) while considering their Y and gradient, dY, values. These values are divided into uniform intervals, and we aim to equalize, as much as possible, the number of training designs falling into each interval. This can force the selection of designs with minimum or maximum Y or dY values, which generally lie in a small portion of the design space. Designs can therefore be captured from all portions of the design space. Results of the proposed methods are compared against US along with two well-studied non-uniform sampling strategies, Stratified Over Sampling (SOS) and Gaussian-Process Based Sampling (GPBS). To reliably investigate the quality of ML models obtained using designs sampled via US, SOS, GPBS, HBS, and GBS, we use standard test functions with known analytical forms (such as Easom and Beale) as substitutes for engineering problems. According to the results presented, ML models using HBS and GBS have either better prediction accuracy or wider applicability compared to all other tested sampling methods.
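The binning idea behind height-based sampling can be sketched as follows. This is a minimal illustration, not the authors' implementation (which selects designs iteratively with help from the ML model itself): performance values Y of candidate designs are divided into uniform intervals, and designs are drawn round-robin from the intervals so that the counts per interval are as equal as the candidates allow. The function names and the choice of the Beale function as the stand-in problem are illustrative assumptions.

```python
import numpy as np

def beale(x, y):
    """Beale test function, one of the known functions the study uses
    as a substitute for an engineering performance evaluation."""
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

def height_based_sample(Y, n_samples, n_bins=10, seed=None):
    """Hypothetical sketch of height-based sampling (HBS): pick indices so
    that the selected Y values are spread as evenly as possible across
    uniform intervals of the Y range."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(Y.min(), Y.max(), n_bins + 1)
    # Assign each candidate design to the uniform interval of its Y value.
    bins = np.clip(np.digitize(Y, edges) - 1, 0, n_bins - 1)
    # Shuffle the candidates within each interval.
    per_bin = [list(rng.permutation(np.flatnonzero(bins == b)))
               for b in range(n_bins)]
    chosen = []
    # Round-robin: take one design per non-empty interval until the quota
    # is met, equalizing per-interval counts as far as the data allow.
    while len(chosen) < n_samples and any(per_bin):
        for b in range(n_bins):
            if per_bin[b] and len(chosen) < n_samples:
                chosen.append(per_bin[b].pop())
    return np.array(chosen)

# Usage: select 40 training designs from 2000 random candidates in the
# usual Beale domain [-4.5, 4.5]^2.
rng = np.random.default_rng(0)
cand = rng.uniform(-4.5, 4.5, size=(2000, 2))
Y = beale(cand[:, 0], cand[:, 1])
idx = height_based_sample(Y, n_samples=40, n_bins=8, seed=0)
```

Because extreme Y values occupy only a small region of the design space, uniform sampling rarely lands there; equalizing per-interval counts as above over-represents those rare regions in the training data, which is the imbalanced-sampling effect the abstract describes.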