An Extreme Gradient Boosting Feature Selection–Based GAN-ELM for Classification of Imbalanced Big Data
DOI:
https://doi.org/10.47852/bonviewJCCE52025973
Keywords:
Big Data, oversampling, feature selection, classification, vanishing gradients
Abstract
The high volume, velocity, and variety of Big Data make accurate classification difficult, especially under class imbalance. Traditional computational methods often fail to handle imbalanced datasets, which can bias predictions toward the majority classes and reduce model accuracy. This work presents a novel classification framework that combines several state-of-the-art methods to address class imbalance in Big Data environments. The proposed method uses a Generative Adversarial Network with an Extreme Learning Machine for classification, Extreme Gradient Boosting with Bayesian hyperparameter optimization for feature selection, the Coati Optimization Algorithm for gradient optimization, and Fuzzy Adaptive SMOTE for oversampling. In addition, a Physics-Informed Policy Gradient is incorporated to make the model and its classification decisions interpretable with respect to domain classification rules. The framework achieves better accuracy, robustness, and scalability than other approaches on various imbalanced medical imaging datasets, such as histopathological images. By jointly addressing common challenges such as noisy data, redundant samples, and overfitting, the combined use of these algorithms improves classification and provides a feasible solution for Big Data problems.
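To make the oversampling step concrete, the sketch below shows plain SMOTE interpolation, not the Fuzzy Adaptive SMOTE variant named in the abstract, whose details are not given here. The function name `smote`, the neighbour count `k=3`, the seed, and the toy data are all assumptions chosen for illustration; the code uses only the Python standard library.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by plain SMOTE interpolation.

    minority: list of feature vectors (lists of floats) from the minority class.
    k: number of nearest minority neighbours considered per sample (assumed value).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with four two-dimensional samples (illustrative data only).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6)
```

Each synthetic point lies on a line segment between two existing minority samples, so the new data stay inside the region the minority class already occupies. An adaptive variant would presumably change how many points are generated around each sample, but that is not specified in this abstract.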
Received: 21 April 2025 | Revised: 30 June 2025 | Accepted: 5 August 2025
Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
The Breast Cancer Histopathology image data that support the findings of this study are openly available at https://www.kaggle.com/code/paultimothymooney/predict-idc-in-breast-cancer-histology-images/notebook. The CelebA data that support the findings of this study are openly available at https://www.kaggle.com/datasets/jessicali9530/celeba-dataset. The ImageNet-LT data that support the findings of this study are openly available at https://www.tensorflow.org/datasets/catalog/imagenet_lt.
Author Contribution Statement
Rithani Mohan: Conceptualization, Software, Validation, Resources, Data curation, Writing – original draft. Prasanna Kumar Rangarajan: Methodology, Formal analysis, Investigation, Writing – review & editing, Visualization, Supervision, Project administration.
License
Copyright (c) 2025 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.