Embedded Sparsity and Ensemble Learning for Real-Time Textual Anomaly Detection: A SCADA-Inspired Benchmark on Lecture Transcript Data
DOI:
https://doi.org/10.47852/bonviewJCCE62027629
Keywords:
feature selection, intrusion detection, SCADA systems, text classification, gradient boosting
Abstract
This study presents a comprehensive benchmark of feature selection methods for real-time text classification, evaluating the computational trade-offs between embedded sparsity techniques (Lasso regularization), filter methods (ANOVA, mutual information), and wrapper approaches (recursive feature elimination). Using a dataset of 15,746 educational comments with TF-IDF representations, we systematically compare five feature selection methods paired with both classical machine learning and shallow neural network classifiers. Our experimental framework evaluates both classification performance and computational latency for binary and multiclass sentiment classification. Results show that Lasso-based feature selection combined with XGBoost achieves F1-scores of 0.859 for binary classification and 0.699 for multiclass classification, with inference times of less than 1 s. By comparison, recursive feature elimination requires 284 s to reach similar performance. Shallow neural networks achieve higher accuracy (F1 = 0.910 for binary, 0.841 for multiclass) at the cost of 8 s of training time. In contrast, convolutional neural networks (CNNs) applied directly to TF-IDF vectors perform poorly (F1 = 0.646) with excessive training overhead (126 s), indicating that standard CNN architectures are unsuitable for sparse text representations. Our findings offer practical guidance for selecting feature reduction and classification methods based on latency requirements: Lasso with ensemble methods for real-time applications that require sub-second response, and shallow neural networks for batch processing, where higher accuracy justifies the additional computational cost. Although this work uses educational text data as a testbed, the methodology and findings are applicable to any high-dimensional text classification scenario requiring efficient feature selection.
Received: 10 September 2025 | Revised: 3 December 2025 | Accepted: 6 January 2026
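As a rough illustration of the embedded-sparsity idea summarized in the abstract, the following minimal sketch builds TF-IDF vectors for a handful of comments and uses an L1-penalized (Lasso) linear fit to keep only the terms with nonzero weight. It is not the authors' pipeline: the toy comments, the labels, the `alpha` value, and the helper names (`tfidf_matrix`, `lasso_coordinate_descent`) are all hypothetical, the solver is a plain coordinate-descent loop without an intercept, and no downstream classifier such as XGBoost is included.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows) with smoothed, L2-normalised TF-IDF rows."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; exact zero inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic updates."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(d)]
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that excludes j.
            rho = sum(
                X[i][j] * (residual[i] + w[j] * X[i][j]) for i in range(n)
            ) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


# Hypothetical toy comments; +1 = positive sentiment, -1 = negative.
docs = [
    "great lecture clear explanation",
    "clear and helpful lecture",
    "great helpful examples",
    "confusing lecture bad audio",
    "bad explanation confusing examples",
    "boring and confusing audio",
]
y = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

vocab, X = tfidf_matrix(docs)
weights = lasso_coordinate_descent(X, y, alpha=0.05)  # alpha is an assumption
selected = [term for term, w in zip(vocab, weights) if w != 0.0]
print("kept", len(selected), "of", len(vocab), "terms:", selected)
```

Raising `alpha` drives more weights to exactly zero, which is what makes the selection step cheap at inference time: a downstream classifier only sees the surviving columns.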
Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
The data that support the findings of this study are openly available in GitHub at https://github.com/rosewang2008/sight.
Author Contribution Statement
Ayoub Alsarhan: Conceptualization, Resources, Writing – original draft, Visualization. Saja Jaradat: Conceptualization, Resources, Writing – original draft, Visualization. Suhaila Abuowaida: Validation, Writing – review & editing, Funding acquisition. Sami Aziz Alshammari: Methodology, Investigation, Data curation, Supervision. Nayef Hmoud Alshammari: Software, Project administration. Khalid Hamad Alnafisah: Software, Formal analysis, Project administration.
Published
2026-02-24
Section
Research Articles
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.
How to Cite
Alsarhan, A., Jaradat, S., Abuowaida, S., Alshammari, S. A., Alshammari, N. H., & Alnafisah, K. H. (2026). Embedded Sparsity and Ensemble Learning for Real-Time Textual Anomaly Detection: A SCADA-Inspired Benchmark on Lecture Transcript Data. Journal of Computational and Cognitive Engineering. https://doi.org/10.47852/bonviewJCCE62027629
Funding data
Northern Borders University
Grant numbers NBU-FFR-2026-2119-01