A Narrative Review of GAN-Based Synthetic Data Generation in Disease Prediction

Authors

  • Swathi Ganesan, Department of Computer and Data Science, York St John University, UK
  • Nalinda Somasiri, Department of Computer and Data Science, York St John University, UK

DOI:

https://doi.org/10.47852/bonviewJDSIS62028542

Keywords:

generative adversarial networks (GANs), synthetic data generation, electronic health records (EHR), AI in healthcare, disease prediction

Abstract

Synthetic data generation has emerged as an important approach in healthcare for addressing data scarcity, disease class imbalance, and privacy restrictions that limit access to patient data. Among generative approaches, generative adversarial networks (GANs) have gained increasing attention because of their ability to generate realistic data from complex distributions across modalities such as medical imaging, electronic health records (EHRs), laboratory data, and phenotype codes. This narrative review focuses on the evolution of major GAN architectures and their application in disease prediction. The original GAN introduced the adversarial paradigm, while the Deep Convolutional GAN (DCGAN) advanced image generation and became widely used in MRI, CT, and histopathology tasks. Wasserstein GAN variants (WGAN and WGAN-GP) improved training stability and proved more suitable for tabular and structured healthcare data such as EHRs. More specialized architectures, including Conditional Tabular GAN (CTGAN) and Medical GAN (MedGAN), further extended synthetic data generation to mixed-type datasets and sparse diagnostic records. The review also examines evaluation practices based on data fidelity, downstream utility, and privacy preservation, including differential privacy and resistance to membership inference attacks. Overall, the literature shows that GAN-generated synthetic data can support disease prediction research, but important challenges remain in benchmarking, reproducibility, interpretability, and ethical deployment. Emerging directions include hybrid GAN-diffusion models, federated training strategies, and standardized evaluation frameworks to support clinically reliable and privacy-preserving adoption.

 

Received: 29 November 2025 | Revised: 10 April 2026 | Accepted: 13 May 2026

 

Conflicts of Interest

The authors declare that they have no conflicts of interest related to this work.

 

Data Availability Statement

No new primary datasets were generated or analyzed in this study. This article is a narrative literature review based on previously published studies.

 

The datasets discussed in the manuscript are available from the original publications and associated repositories cited below. The structured EHR datasets reviewed in this article are available at https://doi.org/10.1145/3636424, reference number [6], and https://doi.org/10.1016/j.compbiomed.2023.107188, reference number [7]. The multicenter diabetic EHR dataset is available at https://doi.org/10.1016/j.compbiomed.2023.107188, reference number [7]. Time-series and longitudinal healthcare datasets, including ECG-related datasets, are available at https://doi.org/10.1038/s41746-024-01409-w, reference number [8], and https://doi.org/10.1109/icoa62581.2024.10753742, reference number [9]. Medical imaging datasets used in GAN-based synthetic image generation, augmentation, and reconstruction studies are available at https://doi.org/10.1016/j.compbiomed.2025.109834, reference number [10]; https://doi.org/10.1148/radiol.232471, reference number [11]; https://doi.org/10.1016/j.compbiomed.2022.105382, reference number [20]; https://doi.org/10.1109/tiptekno63488.2024.10755233, reference number [26]; https://doi.org/10.1007/978-981-96-1185-0_38, reference number [29]; and https://doi.org/10.1016/j.compbiomed.2025.110094, reference number [37].

 

The Breast Cancer Wisconsin (Diagnostic), Lung Cancer Patient, and Fetal Cardiotocography (CTG) datasets are available at https://doi.org/10.1007/s41060-025-00816-w, reference number [14], and https://github.com/Halal-Abdulrahman-Ahmed/MedSynth_GANVariants. The Breast Cancer Wisconsin (Diagnostic) dataset is also available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic, and the Fetal Cardiotocography dataset is available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/193/cardiotocography. The cardiovascular dataset was obtained from King Faisal Specialist Hospital and Research Centre and is subject to access restrictions as stated in the original article at https://doi.org/10.3390/s24237673, reference number [33].

 

The brain MRI dataset is available at https://brain-development.org/ixi-dataset/. Additional synthetic EHR and longitudinal health record datasets discussed in relation to MedGAN and mixed-type EHR generation are available at https://proceedings.mlr.press/v68/choi17a.html and https://doi.org/10.1038/s41746-023-00834-7, reference number [36]. Where datasets are not separately hosted in an open repository, access conditions and availability are governed by the original cited publications.

 

Author Contribution Statement

Swathi Ganesan: Conceptualization, Methodology, Formal analysis, Investigation, Resources, Data curation, Writing – original draft, Writing – review & editing, Visualization, Funding acquisition. Nalinda Somasiri: Visualization, Supervision, Project administration, Funding acquisition.

Published

2026-06-11

Section

Review

How to Cite

Ganesan, S., & Somasiri, N. (2026). A Narrative Review of GAN-Based Synthetic Data Generation in Disease Prediction. Journal of Data Science and Intelligent Systems. https://doi.org/10.47852/bonviewJDSIS62028542