Establishing an Optimal Online Phishing Detection Method: Evaluating Topological NLP Transformers on Text Message Data

Authors

H. Milner, M. Baron

DOI:

https://doi.org/10.47852/bonviewJDSIS32021131

Keywords:

dependency parsing, phishing, topological transformer processing, transfer learning

Abstract

This research establishes an optimal classification model for online SMS spam detection using topological sentence transformer methodologies. The study responds to the increasingly sophisticated and disruptive activities of malicious actors. We present a viable, lightweight integration of pre-trained NLP repository models with sklearn functionality. The study design mirrors the spaCy pipeline component architecture in a downstream sklearn pipeline implementation and introduces a user-extensible spam SMS solution. We leverage large-text data models from HuggingFace (RoBERTa-base) via spaCy and apply linguistic NLP transformer methods to short-sentence NLP datasets. We compare the F1-scores of the models and iteratively retest them using a standard sklearn pipeline architecture. Applying spaCy transformer modelling achieves an optimal F1-score of 0.938, a result comparable to existing research output from contemporary BERT/SBERT/‘black box’ predictive models. This research introduces a lightweight, user-interpretable, standardized, predictive SMS spam detection model that utilizes semantically similar paraphrase/sentence transformer methodologies and generates optimal F1-scores for an SMS dataset. Significant F1-scores are also generated for a Twitter evaluation set, indicating potential real-world suitability.
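Because the abstract compares models by F1-score, a minimal sketch of how that metric is computed for a binary spam/ham task may be useful. The function name `f1_score_binary` and the toy labels are hypothetical, invented for illustration. They are not taken from the study's data or code, which the abstract says relies on sklearn.

```python
def f1_score_binary(y_true, y_pred, positive="spam"):
    """Return the F1-score for the `positive` class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, assumed for illustration only.
y_true = ["spam", "ham", "spam", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "ham", "spam", "spam"]
score = f1_score_binary(y_true, y_pred)
print(round(score, 3))
```

In this toy case precision and recall are both 2/3, so the F1-score is also 2/3. In practice, `sklearn.metrics.f1_score` provides the same metric and would fit naturally into an sklearn pipeline evaluation.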

 

Received: 27 May 2023 | Revised: 27 June 2023 | Accepted: 12 July 2023

 

Conflicts of Interest:

The authors declare that they have no conflicts of interest to this work.

 

Data Availability Statement

The data that support the findings of this study are openly available on Kaggle at https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset and https://www.kaggle.com/competitions/utkmlstwitter-spam-detection-competition/overview

Published

2023-07-12

How to Cite

Milner, H., & Baron, M. (2023). Establishing an Optimal Online Phishing Detection Method: Evaluating Topological NLP Transformers on Text Message Data. Journal of Data Science and Intelligent Systems, 2(1). https://doi.org/10.47852/bonviewJDSIS32021131

Section

Research Articles