Establishing an Optimal Online Phishing Detection Method: Evaluating Topological NLP Transformers on Text Message Data
Keywords: dependency parsing, phishing, topological transformer processing, transfer learning
This research establishes an optimal classification model for online SMS spam detection by utilizing topological sentence transformer methodologies. The study responds to the increasingly sophisticated and disruptive activities of malicious actors. We present a viable, lightweight integration of pre-trained NLP repository models with sklearn functionality. The study design mirrors the spaCy pipeline component architecture in a downstream sklearn pipeline implementation and introduces a user-extensible SMS spam solution. We leverage large-text data models from HuggingFace (RoBERTa-base) via spaCy and apply linguistic NLP transformer methods to short-sentence NLP datasets. We compare the F1-scores of the models and iteratively retest them using a standard sklearn pipeline architecture. Applying spaCy transformer modelling achieves an optimal F1-score of 0.938, a result comparable to existing research output from contemporary BERT/SBERT/‘black box’ predictive models. This research introduces a lightweight, user-interpretable, standardized, predictive SMS spam detection model that utilizes semantically similar paraphrase/sentence transformer methodologies and generates optimal F1-scores for an SMS dataset. Significant F1-scores are also generated for a Twitter evaluation set, indicating potential real-world suitability.
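The general pattern described in the abstract, a text-embedding step wrapped as an sklearn pipeline component and scored by F1, can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the `SentenceEmbedder` class, the `embed` parameter, the logistic-regression classifier, and the toy messages are all hypothetical. A hashing vectorizer stands in for the transformer embeddings so that the sketch runs without downloading any model.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


class SentenceEmbedder(TransformerMixin, BaseEstimator):
    """Hypothetical sklearn component mapping raw texts to fixed-size vectors.

    `embed` is an assumed callable (text -> 1-D array). When it is None, a
    hashing vectorizer is used as a stand-in for transformer embeddings.
    """

    def __init__(self, embed=None):
        self.embed = embed

    def fit(self, X, y=None):
        # Nothing is learned here: the embedding step is treated as pre-trained.
        return self

    def transform(self, X):
        if self.embed is None:
            stand_in = HashingVectorizer(n_features=256, alternate_sign=False)
            return stand_in.transform(X).toarray()
        return np.vstack([self.embed(text) for text in X])


def build_pipeline(embed=None):
    # Embedding step followed by a classifier (the classifier is an assumption).
    return Pipeline([
        ("embed", SentenceEmbedder(embed)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy messages invented for illustration; they are not from the study's data.
texts = [
    "WIN a free prize now, click this link",
    "URGENT: your account is locked, verify here",
    "Claim your cash reward today, reply YES",
    "Are we still meeting for lunch tomorrow?",
    "Running late, see you in ten minutes",
    "Can you send me the notes from class?",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

pipe = build_pipeline().fit(texts, labels)
pred = pipe.predict(texts)
score = f1_score(labels, pred)  # training-set F1, shown only to exercise the API
print(f"F1 on the toy training set: {score:.3f}")
```

To try a different representation, pass any callable that returns a fixed-length vector per message as `embed`; the rest of the pipeline and the F1 comparison stay unchanged, which is what makes iterative retesting of alternative models straightforward.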
Received: 27 May 2023 | Revised: 27 June 2023 | Accepted: 12 July 2023
Conflicts of Interest:
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
The data that support the findings of this study are openly available on Kaggle at https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset and https://www.kaggle.com/competitions/utkmlstwitter-spam-detection-competition/overview
Copyright (c) 2023 Authors
This work is licensed under a Creative Commons Attribution 4.0 International License.