Embedded Sparsity and Ensemble Learning for Real-Time Textual Anomaly Detection: A SCADA-Inspired Benchmark on Lecture Transcript Data

Authors

  • Ayoub Alsarhan Department of Data Science and Artificial Intelligence, Al-Ahliyya Amman University, Jordan and Department of Information Technology, The Hashemite University, Jordan https://orcid.org/0000-0001-9075-2828
  • Saja Jaradat Department of Information Technology, The Hashemite University, Jordan
  • Suhaila Abuowaida Department of Data Science and Artificial Intelligence, Al al-Bayt University, Jordan
  • Sami Aziz Alshammari Department of Information Technology, Northern Border University, Saudi Arabia
  • Nayef Hmoud Alshammari Department of Computer Science, University of Tabuk, Saudi Arabia https://orcid.org/0000-0001-5739-0589
  • Khalid Hamad Alnafisah Department of Computer Sciences, Northern Border University, Saudi Arabia

DOI:

https://doi.org/10.47852/bonviewJCCE62027629

Keywords:

feature selection, intrusion detection, SCADA systems, text classification, gradient boosting

Abstract

This study presents a comprehensive benchmark of feature selection methods for real-time text classification, evaluating the computational trade-offs between embedded sparsity techniques (Lasso regularization), filter methods (ANOVA, mutual information), and wrapper approaches (recursive feature elimination). Using a dataset of 15,746 educational comments with TF-IDF representations, we systematically compare five feature selection methods paired with both classical machine learning and shallow neural network classifiers. Our experimental design measures both classification performance and computational latency for binary and multiclass sentiment classification. Results show that Lasso-based feature selection combined with XGBoost achieves F1-scores of 0.859 for binary classification and 0.699 for multiclass classification, with inference times of less than 1 s. Recursive feature elimination takes 284 s to reach similar performance. Shallow neural networks achieve higher accuracy (F1 = 0.910 for binary, 0.841 for multiclass) at the cost of 8 s of training time. In contrast, convolutional neural networks (CNNs) applied directly to TF-IDF vectors perform poorly (F1 = 0.646) with excessive training overhead (126 s), indicating that standard CNN architectures are unsuitable for sparse text representations. Our findings offer practical guidance for selecting feature reduction and classification methods based on latency requirements: Lasso with ensemble methods for real-time applications that require sub-second response, and shallow neural networks for batch processing, where higher accuracy justifies the additional computational cost. Although this work uses educational text data as a testbed, the methodology and findings are applicable to any high-dimensional text classification scenario requiring efficient feature selection.
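The latency-based guidance above can be sketched as a small lookup over the figures reported in the abstract. This is only an illustration, not the authors' code: the names Option, REPORTED, and pick are hypothetical, "less than 1 s" is stored as the upper bound 1.0, and treating the reported inference time (Lasso with XGBoost) and training times (neural networks) as one comparable cost is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Option:
    """One method from the abstract (hypothetical container, not from the paper)."""
    name: str
    binary_f1: float     # binary-classification F1 reported in the abstract
    cost_seconds: float  # assumption: a single comparable cost per method


# Figures copied from the abstract. "Less than 1 s" is stored as 1.0.
# Recursive feature elimination is omitted because no F1 value is reported for it.
REPORTED: List[Option] = [
    Option("Lasso + XGBoost", 0.859, 1.0),
    Option("Shallow neural network", 0.910, 8.0),
    Option("CNN on TF-IDF", 0.646, 126.0),
]


def pick(budget_seconds: float, options: List[Option] = REPORTED) -> Optional[Option]:
    """Return the highest-F1 option whose cost fits the budget, or None."""
    feasible = [o for o in options if o.cost_seconds <= budget_seconds]
    if not feasible:
        return None
    return max(feasible, key=lambda o: o.binary_f1)
```

Under these assumptions, a budget of 1 s selects Lasso with XGBoost, while any budget of 8 s or more selects the shallow neural network, which mirrors the real-time versus batch recommendation.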



Received: 10 September 2025 | Revised: 3 December 2025 | Accepted: 6 January 2026



Conflicts of Interest

The authors declare that they have no conflicts of interest to this work.



Data Availability Statement

The data that support the findings of this study are openly available in GitHub at https://github.com/rosewang2008/sight.



Author Contribution Statement

Ayoub Alsarhan: Conceptualization, Resources, Writing – original draft, Visualization. Saja Jaradat: Conceptualization, Resources, Writing – original draft, Visualization. Suhaila Abuowaida: Validation, Writing – review & editing, Funding acquisition. Sami Aziz Alshammari: Methodology, Investigation, Data curation, Supervision. Nayef Hmoud Alshammari: Software, Project administration. Khalid Hamad Alnafisah: Software, Formal analysis, Project administration.

Published

2026-02-24

Section

Research Articles

How to Cite

Alsarhan, A., Jaradat, S., Abuowaida, S., Alshammari, S. A., Alshammari, N. H., & Alnafisah, K. H. (2026). Embedded Sparsity and Ensemble Learning for Real-Time Textual Anomaly Detection: A SCADA-Inspired Benchmark on Lecture Transcript Data. Journal of Computational and Cognitive Engineering. https://doi.org/10.47852/bonviewJCCE62027629
