Establishing an Optimal Online Phishing Detection Method: Evaluating Topological NLP Transformers on Text Message Data

Helen Milner; Michael Baron

doi:10.47852/bonviewJDSIS32021131

Authors

Helen Milner, University of Adelaide, Australia, https://orcid.org/0009-0003-9029-5921
Michael Baron, CME Department, Charles Sturt University, Australia, https://orcid.org/0000-0002-5211-0274

DOI:

https://doi.org/10.47852/bonviewJDSIS32021131

Keywords:

dependency parsing, phishing, topological transformer processing, transfer learning

Abstract

This research establishes an optimal classification model for online SMS spam detection by utilizing topological sentence transformer methodologies. The study is a response to the increasingly sophisticated and disruptive activities of malicious actors. We present a viable lightweight integration of pre-trained NLP repository models with sklearn functionality. The study design mirrors the spaCy pipeline component architecture in a downstream sklearn pipeline implementation and introduces a user-extensible spam SMS solution. We leverage large-text data models from HuggingFace (RoBERTa-base) via spaCy and apply linguistic NLP transformer methods to short-sentence NLP datasets. We compare the F1-scores of models and iteratively retest models using a standard sklearn pipeline architecture. Applying spaCy transformer modelling achieves an optimal F1-score of 0.938, a result comparable to existing research output from contemporary BERT/SBERT/‘black box’ predictive models. This research introduces a lightweight, user-interpretable, standardized, predictive SMS spam detection model that utilizes semantically similar paraphrase/sentence transformer methodologies and generates optimal F1-scores for an SMS dataset. Significant F1-scores are also generated for a Twitter evaluation set, indicating potential real-world suitability.
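
The architecture outlined in the abstract, a text-embedding step feeding a standard sklearn pipeline that is scored by F1, might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the `TextEmbedder` class, the `hashed_bow` stand-in embedding, and the toy messages are all hypothetical. The stand-in is used so the example runs without downloading a model; in the setting described above, the `embed` callable would instead wrap a spaCy transformer pipeline that returns a fixed-length vector per message.

```python
import zlib

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


def hashed_bow(text, dim=64):
    """Hypothetical stand-in embedding: a hashed bag-of-words vector.

    Assumption: used here only so the sketch runs offline. It is not the
    transformer embedding evaluated in the paper.
    """
    vec = np.zeros(dim)
    for token in text.lower().split():
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


class TextEmbedder(BaseEstimator, TransformerMixin):
    """Hypothetical sklearn step that maps raw messages to fixed-length vectors."""

    def __init__(self, embed=hashed_bow):
        # `embed` is any callable str -> 1-D array, e.g. a wrapper around
        # a pre-trained language pipeline.
        self.embed = embed

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned in this step

    def transform(self, X):
        return np.vstack([self.embed(text) for text in X])


# Toy data (invented for illustration): 1 = spam, 0 = ham.
texts = [
    "WINNER claim your free prize now",
    "free entry text WIN to claim cash",
    "urgent your account is suspended click link",
    "are we still meeting for lunch today",
    "can you send me the report tomorrow",
    "see you at the station at six",
]
labels = [1, 1, 1, 0, 0, 0]

model = Pipeline([
    ("embed", TextEmbedder()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(texts, labels)

# Scored on the training messages purely to show the mechanics;
# a real evaluation would use a held-out split.
pred = model.predict(texts)
score = f1_score(labels, pred)
print(f"F1 on toy data: {score:.3f}")
```

Because the embedding step is an ordinary sklearn transformer, swapping in a different embedding function or classifier only changes one pipeline entry, which is the kind of user-extensibility the abstract refers to.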

Received: 27 May 2023 | Revised: 27 June 2023 | Accepted: 12 July 2023

Conflicts of Interest

The authors declare that they have no conflicts of interest to this work.

Data Availability Statement

The data that support the findings of this study are openly available on Kaggle at https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset and https://www.kaggle.com/competitions/utkmls-twitter-spam-detection-competition/data.
