An Extreme Gradient Boosting Feature Selection–Based GAN-ELM for Classification of Imbalanced Big Data

Rithani Mohan; Prasanna Kumar Rangarajan

doi:10.47852/bonviewJCCE52025973

Authors

Rithani Mohan, Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, India, https://orcid.org/0009-0008-8096-2790
Prasanna Kumar Rangarajan, Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham, India, https://orcid.org/0000-0001-6103-259X

Keywords:

Big Data, oversampling, feature selection, classification, vanishing gradients

Abstract

The high volume, velocity, and variety of Big Data make accurate classification difficult, especially under class imbalance. Traditional computational methods often fail to handle imbalanced datasets, which biases predictions toward the majority classes and degrades model accuracy. This work presents a novel classification framework that combines state-of-the-art methods to address class imbalance in Big Data environments. The proposed method uses a Generative Adversarial Network with an Extreme Learning Machine (GAN-ELM) for classification, Extreme Gradient Boosting with Bayesian hyperparameter optimization for feature selection, the Coati Optimization Algorithm for gradient optimization, and Fuzzy Adaptive SMOTE for oversampling. In addition, a Physics-Informed Policy Gradient is incorporated to make the model and its classification decisions interpretable with respect to domain classification rules. The framework achieves better accuracy, robustness, and scalability than other approaches on several imbalanced medical imaging datasets, such as histopathological images. By jointly addressing common challenges such as noisy data, redundant samples, and overfitting, the combination of these algorithms improves classification and provides a feasible solution for Big Data problems.
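To make the Extreme Learning Machine component named in the abstract concrete, the following is a minimal sketch of a generic ELM classifier: a hidden layer with fixed random weights, followed by output weights solved in closed form by least squares. It is not the authors' GAN-ELM. The function names `train_elm` and `predict_elm`, the hidden-layer size, the tanh activation, and the synthetic imbalanced toy data are all assumptions made for illustration, and the GAN, feature selection, and oversampling stages are omitted.

```python
import numpy as np


def train_elm(X, y, n_hidden=50, n_classes=2, seed=0):
    """Fit a basic ELM (hypothetical helper, not the paper's code)."""
    rng = np.random.default_rng(seed)
    # Hidden-layer weights and biases are drawn once and never trained.
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = np.tanh(X @ W + b)            # hidden activations
    T = np.eye(n_classes)[y]          # one-hot targets
    # Output weights: least-squares solution via the pseudoinverse.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta


def predict_elm(model, X):
    """Return the class with the largest output score for each row."""
    W, b, beta = model
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)


# Synthetic imbalanced toy data (assumption): 200 majority, 20 minority points.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=4.0, scale=1.0, size=(20, 2)),
])
y = np.array([0] * 200 + [1] * 20)

model = train_elm(X, y)
preds = predict_elm(model, X)
accuracy = float(np.mean(preds == y))
minority_recall = float(np.mean(preds[y == 1] == 1))
```

Reporting minority-class recall alongside accuracy matters here: with a 10:1 imbalance, a classifier that always predicts the majority class already reaches about 91% accuracy, which is the bias toward majority classes that the abstract describes.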

Received: 21 April 2025 | Revised: 30 June 2025 | Accepted: 5 August 2025

Conflicts of Interest

The authors declare that they have no conflicts of interest to this work.

Data Availability Statement

The Breast Cancer Histopathology image data that support the findings of this study are openly available at https://www.kaggle.com/code/paultimothymooney/predict-idc-in-breast-cancer-histology-images/notebook. The CelebA data that support the findings of this study are openly available at https://www.kaggle.com/datasets/jessicali9530/celeba-dataset. The ImageNet-LT data that support the findings of this study are openly available at https://www.tensorflow.org/datasets/catalog/imagenet_lt.

Author Contribution Statement

Rithani Mohan: Conceptualization, Software, Validation, Resources, Data curation, Writing – original draft. Prasanna Kumar Rangarajan: Methodology, Formal analysis, Investigation, Writing – review & editing, Visualization, Supervision, Project administration.