Adaptive Gaussian-Based Kernel K-Means: Scalable Adaptive Kernel-Based Clustering for Big Data

Bakhshali Bakhtiyarov; Aynur Jabiyeva; Anakhanim Mutallimova; Rukhsara Novruzova; Mahabbat Khudaverdiyeva

doi:10.47852/bonviewJCCE52026511

Authors

Bakhshali Bakhtiyarov, Department of Instrumentation Engineering, Azerbaijan State Oil and Industry University, Azerbaijan, https://orcid.org/0009-0006-2172-4632
Aynur Jabiyeva, Department of Instrumentation Engineering, Azerbaijan State Oil and Industry University, Azerbaijan, https://orcid.org/0000-0002-0336-8586
Anakhanim Mutallimova, Department of Instrumentation Engineering, Azerbaijan State Oil and Industry University, Azerbaijan, https://orcid.org/0000-0001-8327-192X
Rukhsara Novruzova, Department of Instrumentation Engineering, Azerbaijan State Oil and Industry University, Azerbaijan, https://orcid.org/0000-0002-0959-6520
Mahabbat Khudaverdiyeva, Department of Instrumentation Engineering, Azerbaijan State Oil and Industry University, Azerbaijan, https://orcid.org/0000-0002-5090-4628


Keywords:

big data clustering, Apache Spark, K-Means, DBSCAN, Fuzzy C-Means, GBK-Means, industrial sensor data

Abstract

In big data settings, clustering remains a fundamental component of data mining and knowledge discovery, particularly in high-volume, heterogeneous environments. Although traditional clustering algorithms are widely used, they often scale poorly, lack robustness, and are sensitive to noise when applied to large industrial-scale datasets. To address these limitations, this paper proposes a Spark-based clustering framework built around an Adaptive Gaussian-Based Kernel K-Means (A-GBK-Means) algorithm, designed to mitigate performance degradation on noisy and nonhomogeneous data. A-GBK-Means combines fuzzy logic with kernel widths that are determined adaptively from local density estimates, which improves both clustering accuracy and flexibility. Extensive experiments were conducted on real-world industrial sensor data and on synthetic benchmarks, and performance was evaluated using multiple metrics, including execution time, Silhouette Score, Davies–Bouldin Index, and noise resilience. The comparative analysis shows that A-GBK-Means consistently outperforms classical approaches, namely K-Means, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Fuzzy C-Means, and standard GBK-Means, in both clustering quality and computational efficiency. The proposed framework also scales better, owing to its distributed architecture built on Apache Spark. This study provides practical insights into scalable and interpretable clustering methods for intelligent manufacturing and sensor-based systems and highlights the effectiveness of adaptive kernel learning in modern big data applications.
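
The abstract names the ingredients of A-GBK-Means (fuzzy memberships, Gaussian kernels, and kernel widths adapted from local density) but this page does not give the update rules. The sketch below is therefore only one plausible reading, not the authors' algorithm: it assumes the width of each point is its distance to the k-th nearest neighbour, that memberships are normalised Gaussian similarities to the centers, and that centers are membership-weighted means. All function names and parameters are hypothetical, and the code runs on a single machine in plain Python rather than on Apache Spark.

```python
import math


def local_widths(points, k=2):
    """Assumed density proxy: width = distance to the k-th nearest neighbour.

    Requires k <= len(points) - 1. Sparse regions get wide kernels,
    dense regions get narrow ones.
    """
    widths = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        widths.append(max(dists[k - 1], 1e-12))  # guard against zero width
    return widths


def memberships_for(points, centers, widths):
    """Fuzzy memberships: Gaussian similarities normalised to sum to 1 per point."""
    rows = []
    for p, sigma in zip(points, widths):
        sims = [math.exp(-math.dist(p, c) ** 2 / (2.0 * sigma ** 2)) for c in centers]
        total = sum(sims)
        if total == 0.0:
            # Every similarity underflowed: fall back to uniform membership.
            rows.append([1.0 / len(centers)] * len(centers))
        else:
            rows.append([s / total for s in sims])
    return rows


def adaptive_fuzzy_kernel_kmeans(points, init_centers, k=2, iters=10):
    """Alternate membership and center updates for a fixed number of iterations."""
    widths = local_widths(points, k)
    centers = [tuple(c) for c in init_centers]
    dim = len(points[0])
    for _ in range(iters):
        u = memberships_for(points, centers, widths)
        new_centers = []
        for c in range(len(centers)):
            weight = sum(row[c] for row in u)
            if weight == 0.0:
                new_centers.append(centers[c])  # keep a center that attracts no mass
                continue
            new_centers.append(tuple(
                sum(row[c] * p[d] for p, row in zip(points, u)) / weight
                for d in range(dim)
            ))
        centers = new_centers
    return centers, memberships_for(points, centers, widths)


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centers, u = adaptive_fuzzy_kernel_kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
    print(centers)
    print([max(range(len(row)), key=row.__getitem__) for row in u])
```

The widths are computed once, before the iterations start, because they depend only on the data and not on the centers. A distributed version would need an approximate neighbour search in place of the quadratic loop in `local_widths`.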

Received: 18 June 2025 | Revised: 8 August 2025 | Accepted: 16 September 2025

Conflicts of Interest

The authors declare that they have no conflicts of interest related to this work.

Data Availability Statement

The data that support the findings of this study are openly available in Kaggle at https://www.kaggle.com/datasets/dnkumars/industrial-equipment-monitoring-dataset and https://www.kaggle.com/datasets/mohammedarfathr/sensor-data-from-an-industrial-machine.

Author Contribution Statement

Bakhshali Bakhtiyarov: Investigation, Data curation, Writing – original draft, Project administration. Aynur Jabiyeva: Conceptualization, Validation, Formal analysis. Anakhanim Mutallimova: Software, Resources. Rukhsara Novruzova: Methodology, Writing – review & editing, Supervision. Mahabbat Khudaverdiyeva: Visualization.
