Establishing An Optimal Online Phishing Detection Method: Evaluating Topological NLP Transformers on Text Message Data

This research establishes an optimal classification model for online SMS spam detection by utilizing topological sentence transformer methodologies. The study is a response to the increasingly sophisticated and disruptive activities of malicious actors. We present a viable lightweight integration of pre-trained NLP repository models with sklearn functionality. The study design mirrors the spaCy pipeline component architecture in a downstream sklearn pipeline implementation and introduces a user-extensible spam SMS solution. We leverage large-text data models from HuggingFace (roberta-base) via spaCy and apply linguistic NLP transformer methods to short-sentence NLP datasets. We compare the F1-scores of models and iteratively retest models using a standard sklearn pipeline architecture. Applying spaCy transformer modelling achieves an optimal F1-score of 0.938, a result comparable to existing research output from contemporary BERT/SBERT/‘black box’ predictive models. This research introduces a lightweight, user-interpretable, standardized, predictive SMS-spam detection model that utilizes semantically similar paraphrase/sentence transformer methodologies and generates optimal F1-scores for an SMS dataset. Significant F1-scores are also generated for a Twitter evaluation set, indicating potential real-world suitability.


Introduction
NLP linguistic machine learning frameworks have fundamentally altered text-based dataset modelling since 2016 (Brown et al., 2020). The research presented here extends foundational contemporary approaches to predictive linguistic modelling and demonstrates methodological relevance for resource-constrained, short-text spam detection tasks.
SMS spam detection remains a necessary task (Haynes et al., 2021) and threat vectors have become highly automated. Short-text message model predictions are typically serviced by older technological instruments, making them vulnerable. SMS spam data classification tasks using new transformer methods do not appear to be well-researched (Roy et al., 2020) or methodologically aligned to transformer architectures. This study presents exploratory research from The University of Adelaide examining the effect of both sophisticated topological transformer and large vector methods on short-text SMS spam classification.
The current iteration of transformers can embed dense tensors into topological frameworks from sentence-based inputs, making these architectures a good fit for short-text data. High-rank embeddings and pre-trained libraries have proved crucial for modelling NLP tasks (Yang et al., 2017), e.g., RoBERTa (Liu et al., 2019). The self-supervised, sentence-based processing of roberta-base is embedded as the statistical model for the spaCy transformer pipeline. Roberta-base generates encoded weights for use in downstream standard classifiers, making this model a suitable choice for a task-based transparent solution. SpaCy provides a tagger tokenizer, a (dependency) parser and an ner (named entity recognizer) to listen to the transformer component (output). The generated output is used by our study to classify spam text data within an sklearn pipeline.
Spam messaging content relevance is typically short-lived, and this research focuses on identifying an optimal state-of-the-art design, based on typical classification modelling metrics. A major issue for all models is the choice of language sample sizes and the specific vocabulary included in datasets (Conneau et al., 2020). Haynes et al. (2021) highlight the need to avoid visiting dangerous sites and advocate using publicly available datasets in phishing detection research. This methodological constraint immediately limits the sample size of available data and presents a persistent problem for SMS researchers. The issue for spam detection systems is that evolving illicit SMS message generation techniques can result in redundant training datasets, largely unrepresentative of current trends. We find that this issue is not significant when using spaCy component-modelling pipelines.
This research applies NLP-based classification methods in a time-constrained environment. The study responds to the short-lived nature of SMS messaging and develops a suitable solution for time-critical and resource-constrained environments. Addressing constraints encourages an iterative, agile design implementation.

Our Approach
A basic NLTK Regex model is initially implemented and tested to replicate prior base level research. This model produces highly accurate predictions using pattern-matching techniques; however, subsequent testing reveals overfitted results considered inadequate for benchmarking purposes. Pattern matching spam filters are redundant technological instruments in the contemporary literature. These methods are not considered future-facing or fit for purpose when designing spam identification solutions (Shirazi et al., 2023).
Our design provides a suitable template for developing enhanced, future-facing models. The study iteratively fits a classic SMS dataset (Kaggle, 2017) to a predictive classification model, using open-source component pipeline architectures. Constant checking of the model outputs enables the development of two lightweight statistical NLP models, both leveraging pre-trained neural network (NN) embeddings and completing in under 5 minutes. The first comparison model uses a large language collection of unique vectors from the web; the second uses a transformer pipeline insertion. Modelling is conducted using package defaults. This work compares word-vector similarity modelling with sentence-based transformer spam classification methods and finds that spaCy-transformer statistical modelling generates superior F1 scores.
The en_core_web_trf spaCy transformer model is chosen for transformer modelling as it seamlessly incorporates pre-trained CommonCrawl data from the roberta-base model (Liu et al., 2019), hosted on the HuggingFace (2023) repository. The transformer pipeline architecture generates embedded weights using dependency parsers and entity detection technologies (Gormley et al., 2015). The pipeline architecture uses standard components (e.g. sentencizer vs senter) to generate outputs for downstream classification tasks. Downstream implementation uses sklearn pipeline functionality to create a custom predictor and fine-tunes the spaCy statistical models on a CPU. The pipeline incorporates a custom SMOTE oversampling method to balance the SMS dataset and prevent overfitting (Abid et al., 2022).
We assess the generated F1-score for each modelling cycle and an iterative implementation approach provides confidence that the ultimate result is reproducible and optimal. State-of-the-art short-text binary transformer modelling identifies inferences and creates contextual topological embeddings to feed into a pipeline and generate predictions. Open-source semantic similarity detection techniques on sentences are implemented and we validate this work against a Twitter dataset, as per previous research (Liu et al., 2021).
This work achieves superior classification results (accuracy and F1-scores) using spaCy-transformer pipeline modelling, compared to previous research that implements transformer architectures from scratch. Open-source, extensible architectures provided superior alternatives to new-build NN research and our lightweight CPU-based implementation achieved comparable accuracies to GPU processing (0.9845 with an SVC classifier) when pre-trained, default spaCy modelling pipelines were utilised.
A significant contribution of our work is the successful generation of a user-extensible, highly accurate, topological transformer-based spam detection method for SMS data. The study demonstrates that state-of-the-art transformer solutions provide a clear direction for the future of SMS classifiers.

Literature Review
SMS spam detection has been relatively neglected within the security research domain (Roy et al., 2020) and prevalent spam detection studies have prioritised fraudulent email modelling (Chiew et al., 2019; Tan et al., 2020).
The SMS dataset has been employed to predict spam by utilizing complex, new-build NN implementations (Liu et al., 2021; Roy et al., 2020). New NN builds are memory intensive (Haynes et al., 2021) and do not appropriately compartmentalize the necessary classification methods with user-extensible components (e.g. pipelines).
Older research does not employ sentence-transformers or generally consider solution time complexities. Previous research using Hidden Markov Models (HMM) on the SMS dataset reported favourable time complexity processing overheads (Xia & Chen, 2020); however, HMM models are word-centric and utilize methods intrinsically linked with forward-fed deep-learning, considering only one prior state. Contemporary post-2019 research has largely discarded word embedding modelling for sophisticated sentence/paraphrasing methods. Naively employed vectorizers, for example GloVe and word2vec word embedding methods, fail bias assessment tests and cannot appropriately capture linguistic meaning embedded within paraphrases (May et al., 2019). The HuggingFace site explicitly highlights this weakness of RoBERTa models. HuggingFace states that bias is an inherent limitation of pre-trained models (HuggingFace, 2023); therefore, identifying and actioning bias in models is a live problem. Implementing transformer-based solutions over word embedding methods is gaining acceptance as a solution to the problem of algorithmic bias (Islam et al., 2020).
Tensor topologies capture spatial/relational information (Tumas et al., 2023) and low-rank topological vectorization has been used to enable task-specific approaches to classification (Brown et al., 2020; Shwartz-Ziv & Armon, 2021). Gormley et al. (2015) examined lexical feature embeddings via low-rank tensors and focused on training efficient dependency parsers for text. Reimers and Gurevych (2019) utilised dependency parsing techniques in their model SBERT (SentenceBERT) to investigate topological semantic similarities for sentence-pairs, utilizing similarly large training data. These authors examined sentence paraphrasing mining techniques to compare short blocks of text, an important option for classifying SMS spam (Reimers & Gurevych, 2019).
Spam SMS identification in the wild is recognized as a non-trivial task (Shirazi et al., 2023) and back-propagation transformer modelling is required to succinctly capture embedded latent complexities (Xu et al., 2022). Transformer models use back propagation to predict output from a large pre-trained corpus and applying pre-trained transformer models to a targeted topology has achieved high accuracies on unlabeled data/unsupervised modelling (Jain, 2022).
De Kok and Hinrichs (2019) demonstrated the importance of topological field analysis for discriminating between German paraphrases. Interrogating a topological rendering of sentence sentiment is particularly applicable to SMS spam detection. SMS data cannot be adequately analysed if latent sentence relationships are not fully captured (Gormley et al., 2015). Components of NNs identified as important for short-textual modelling, for example sentencizers and dependency parsers, are implemented within spaCy (Hu et al., 2022). A RoBERTa derived model is adapted for spaCy transformer implementation and is an extension of SBERT concepts. Backpropagation for NN sentence assessment processing and sentence encoding is implemented in the spaCy en_core_web_trf model (Honnibal & Johnson, 2014) and embedded in the dependency parser pipeline component. This adapts the vanilla RoBERTa transformer model and introduces a SoftMax change to enable high-rank processing (Yang et al., 2017). This novel adaption generates lexical embeddings to accommodate sentences and support paraphrase mining, providing a nuanced representation of raw data when assessing vector cosine similarities. SpaCy utilizes the masking techniques of RoBERTa (Liu et al., 2019) and provides excellent inputs to the downstream modelling, enabling high output accuracies. SpaCy pipeline tools enable easy visualisation of results and enhance user understanding of the internal modelling. Pipelines are prioritized as tools for developing user-extensible models and are intrinsic to the spaCy platform. These adaptations and developments ensure that the task-specific process of selecting an optimal model is viable.
Contextualised sentence processing/dependency parsing has enabled the realisation of constrained runtimes, via approximation-aware methods (Gormley et al., 2015). Inherent edge approximations during topological inference generation must be implemented to avoid exceeding O(n^3 T) runtime.
Implementations based on comparing specific topological sentence components (Hu et al., 2022) effectively limit exponential processing overheads. Honnibal and Johnson (2014)
A major objective of our research has been to implement lightweight processing methodologies and use best-practice topological developments to identify and produce an optimal, implementable model. The research presented in this paper focuses on identifying an optimal SMS spam detection model from key evaluation criteria, including high F1 scores, lightweight implementation capabilities and user interpretability/extensibility features.

Theoretical Framework
Topic identification models are suitable for classifying large text blocks via pre-trained models. Exploratory data analysis (EDA) techniques, for example rudimentary word cloud generation, can provide a visual understanding of basic word counts.
Inter-topic visualizations demonstrated that initial EDA could orient the dataset for an analyst. The inter-topic visualisation generated for the SMS Kaggle dataset was informative (Kaggle, 2017), but topic clustering proved incapable of enhancing spam/2-dimensional dataset classification tasks. The lack of available nuance generated from discrete word analysis implied that methods such as sentencizer processing are more likely to yield important NLP modelling results.
Modern advances in NLP processing have been achieved by creating architectural frameworks based on theoretical linguistic modelling and utilization of the entire suite of linguistic theories.The linguistic domain has become more important for NLP model construction than topic collation (Sartran et al., 2022).Approaching NLP modelling holistically provides a valid foundational approach to linguistic-based transfer learning analysis (Sasikala et al., 2022) and presents a theoretical justification for our research.
Prior research has demonstrated that whole-of-linguistic approaches in machine learning contribute to improved modelling capability (Güngör et al., 2020). Incorporating the major divisions of linguistic theory into a model involves an examination of sentence structure mechanics (syntax), morphology (structure) and semantics, or meaning derivation (Harvard, 2023). Predictive linguistic classifiers built with transformer models are now recognized as necessary architectures for sequence-based learning at scale. Contemporary modelling approaches use ideas inherent within the linguistic domain to enhance machine learning capabilities, e.g. implementation of approximation-aware methods (Gormley et al., 2015).
The machine learning literature does not address relationships between linguistic phonology (tonal inflections) and predictive modelling to the same extent as morphological modelling. This paper is limited to establishing the relevance of morphological or dependency parsing transformer techniques to spam detection prediction models.

Research Design
The study involved identifying an optimal processing model for SMS spam using sentence transformers. We used the open-source, topological NLP methods inherent within the spaCy models for transformer and non-transformer modelling.
Three discrete methods have been tested on two publicly available datasets, as hosted by Kaggle and used in prior research. These Kaggle datasets are considered valid and legitimate to use for this study (Kaggle, 2017; Kaggle, 2019). The SMS and Twitter datasets are both two dimensional after dropping irrelevant features. The text-based (domain) dataframe column contains English language sentences of varying length. These sentences are classified as 'spam' or 'ham'. The SMS dataset is of length 5572 before deleting duplicates and 5169 once initial cleaning has been undertaken. The metrics of the Twitter dataset are 11968 and 11787 respectively. The following research design was implemented:
1. Run standard predictive modelling without spaCy objects: This modelling used NLTK classification tools from sklearn and primitive processing.
We generated visualisations from the highest occurring words as a 'Word-Cloud'. Other EDA included production of word frequency histograms to demonstrate inherent properties of the spam dataset. We ultimately discarded the results from this model as the overfitting excluded any baseline use. We also considered the processing overheads to be excessive for CPU implementations. The validation data (Twitter) was deemed relevant to use as an evaluation dataset as it contained substantially more records (11968) than the SMS data (5572). These records were similar in both datasets, consisting of short sentences, slang and excessive punctuation. The initial EDA provided information on the binary class numbers and revealed that the Twitter dataset was balanced but the SMS dataset was highly imbalanced, especially when duplicates were removed. Data cleaning of the text involved applying imported, inbuilt spaCy functions to remove stop words. Lemmatisation, stemming, 'split' and 'lower' methods were applied to the datasets to ensure easy tokenization and processing within spaCy (Dataquest, 2019).
An initial processing tactic was to implement a Bag-of-Words (BoW) vectorizer and compare it to a pipeline containing a TF-IDF vectorizer. A design decision was made to only process with TF-IDF vectorizers when a BoW vectorizer could not converge on the Twitter dataset in _trf model tests. BoW was verified as only appropriate for topic modelling and discarded as a vectorizer method.
Topic modelling visualisations were generated to understand topic clustering. This pre-processing step identified the variability of word-based methods and resulted in a design decision to process the dataset using superior sentencizer methods. The spaCy 'DisplaCy' tool was used to visualise renderings of the same SMS message and demonstrated that topological sentence processes were superior to word-based methods. Sentence similarity pre-processing methods were used to demonstrate the effectiveness of cosine similarity comparisons utilised for generating sentence similarity metrics (L2 norm dot products). Pre-trained transformers from the Hugging Face (n.d.) repository were imported and _trf models for transfer learning/dependency parsing were generated. Each message was fed into the statistical model as an input sentence string, to enable spaCy sentencizing (transfer learning between discrete paraphrases). spaCy passed 'remembered' sentence embeddings between pipeline components to retain training contexts. Sklearn pipeline components were subsequently implemented to action the sklearn imbal SMOTE method, call tokenized spaCy embeddings, call cleaning operations and initiate a classifier prediction component.
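The difference between the two vectorizers can be illustrated with a minimal, dependency-free sketch. It is not the study's implementation: the three messages are hypothetical, tokenisation is a plain whitespace split, and the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1 followed by L2 row normalisation, is assumed to follow the default documented for sklearn's TF-IDF vectorizer.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return the sorted vocabulary and one raw term-count row per document."""
    tokenised = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    counts = [Counter(toks) for toks in tokenised]
    return vocab, [[c[term] for term in vocab] for c in counts]


def tf_idf(docs):
    """Return the vocabulary and L2-normalised TF-IDF rows (smoothed IDF)."""
    vocab, bow = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    rows = []
    for row in bow:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        rows.append([x / norm if norm else 0.0 for x in weighted])
    return vocab, rows


# Hypothetical messages, for illustration only.
docs = ["win a free prize now", "call me now", "free entry win cash"]
vocab, bow = bag_of_words(docs)
_, weighted = tf_idf(docs)
i_now, i_prize = vocab.index("now"), vocab.index("prize")
print(bow[0][i_now], bow[0][i_prize])  # both terms occur once in message 0
print(weighted[0][i_prize] > weighted[0][i_now])  # rarer term weighs more
```

BoW treats "now" and "prize" identically in the first message, whereas TF-IDF down-weights "now" because it appears in two of the three messages.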
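The cosine similarity comparison described above (a dot product of L2-normalised vectors) can be sketched as follows. The four-dimensional vectors are hypothetical stand-ins for sentence embeddings; real spaCy embeddings have far more dimensions and are not reproduced here.

```python
import math


def l2_normalise(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity as the dot product of two L2-normalised vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    unit_a, unit_b = l2_normalise(a), l2_normalise(b)
    return sum(x * y for x, y in zip(unit_a, unit_b))


# Hypothetical embeddings for two spam-like messages and one ham message.
spam_a = [0.9, 0.1, 0.8, 0.0]
spam_b = [0.8, 0.2, 0.7, 0.1]
ham = [0.0, 0.9, 0.1, 0.8]

print(round(cosine_similarity(spam_a, spam_b), 3))
print(round(cosine_similarity(spam_a, ham), 3))
```

Under these assumed vectors the two spam-like messages score close to 1, while the spam/ham pair scores close to 0.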

Figure 3
Components of spaCy architecture (spaCy, n.d.)
The tok2vec pipeline component is used by the _lg model and the transformer component is used by the _trf (roberta-base) model. We generated higher F1 scores for the SMS dataset using the _trf model. Final modelling was conducted without hyperparameter tuning of the sklearn downstream models (Peters et al., 2019), and all spaCy pipeline components were instantiated.
We used a standard train_test_split method, test size 0.2 and seed=42 to train and test the SMS NLP data. An sklearn transformer class was implemented to utilise tensor processing/inherent topological functionality and avoid additional deep-learning product (e.g. Keras) dependencies. The _lg pre-trained model was chosen to generate comparison metrics because it produced similar results to the _trf model (e.g. entity recognition). The _lg model is designed to compute similarities via tensors shared with the pipeline (spaCy, 2023) and both models can operate using a CPU.
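To make the split sizes concrete, the sketch below imitates a seeded 80/20 hold-out split using only the standard library. It assumes the split is applied to the 5169 de-duplicated SMS messages and that the test fraction is rounded up; the function `seeded_split` is hypothetical and does not reproduce the exact shuffle order of sklearn's train_test_split.

```python
import math
import random


def seeded_split(items, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and hold out a test fraction."""
    n = len(items)
    n_test = math.ceil(test_size * n)  # assumption: test count is rounded up
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    test_idx = set(indices[:n_test])
    train = [items[i] for i in range(n) if i not in test_idx]
    test = [items[i] for i in range(n) if i in test_idx]
    return train, test


# Placeholder strings standing in for the de-duplicated SMS messages.
messages = [f"message {i}" for i in range(5169)]
train, test = seeded_split(messages)
print(len(train), len(test))  # 4135 1034
```

Fixing the seed means repeated runs hold out the same messages, which is what allows the iterative retesting described above to compare like with like.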
The following modelling constructions were used: 1

Ethics Statement
The two datasets used for the research are publicly available and widely used. The SMS dataset contains SMS messages from approx. 2012 and was released with the consent of the participants. The dataset was originally used in a PhD thesis (Tagg, 2012) and the ethics statement on Kaggle (2017) references this study. The Twitter dataset was not generated for a specific paper, although it has been used in prior research (Liu et al., 2021) and a Kaggle competition, hosted by the University of Tennessee (Kaggle, 2019). The mitigation strategy used by the data owners of the Twitter dataset is not clearly specified on Kaggle. We mitigated personal identification risks by only using two columns from the dataset, namely the Tweet content column and the Type identifier. This strategy replicates the approach taken in the Liu et al. (2021) paper and adheres to methodologies used by other researchers in this field. The datasets are a collection of English-language sentences and therefore the study cannot be extrapolated to make assertions on other languages.

Original Baseline
The basic NLTK analysis easily identified key words and generated a word cloud but modelling with this method proved to be inadequate. Training on the Twitter dataset was cancelled after 24 hours of processing and adjustments on the NLTK implementation were infeasible, defeating the object of developing a sustainable and reusable model. Processing issues were also observed when constructing modified NLTK models for the SMS dataset. The degree of overfitting from modelling a small dataset rendered the NLTK results meaningless. Gormley et al. (2015) commented that training on a small dataset is not appropriate because of the ease with which extremely high accuracies can be achieved. This provided a good justification for including data augmentation (SMOTE) and exploring transformer-based models that do not rely on pattern-matching techniques (Clark et al., 2019; Vaswani et al., 2017).
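The idea behind SMOTE-style augmentation can be sketched without any library: each synthetic minority sample is placed on the line segment between an existing minority sample and one of its nearest minority neighbours. This is a simplified illustration, not the oversampler used in the study; the function name, the two-dimensional points and the choice of k are all hypothetical.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D feature vectors for the minority (spam) class.
spam = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3]]
new_points = smote_like_oversample(spam, n_new=5)
print(len(new_points))  # 5
```

Because every synthetic point is an interpolation, it stays inside the region already occupied by the minority class rather than duplicating existing rows.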

Topological modelling
It was noted that NN Learning Rates for the _trf model did not need to be adjusted and that the default settings achieved extremely good results on this dataset. Minimal accuracies gained from fine-tuning classifier hyperparameters generated exponentially large processing time-load costs to the system. This resulted in a design decision to train without hyperparameter tuning (Peters et al., 2019) and use untuned sklearn classifiers.
This research identified model variance when testing entity recognition techniques. The smaller (_md) statistical spaCy model could not leverage inbuilt topological functionality to successfully categorize or characterize the SMS data. Comparisons were only made between _lg and _trf implementations, as they produced similar results in most scenarios. The _lg and _trf models were generated on a CPU with excellent processing times (all under 5 minutes) and implemented as lightweight systems. These strategic implementation decisions ensured that model deployment and output was both accurate and efficient. An iterative approach to acceptance testing was used throughout this work to support the evaluation of topological embeddings. SpaCy was used in an evaluation capacity to generate optimal lexical results with a transparent pipeline technology (Spring & Johnson, 2022). SpaCy pipeline processing enabled a better understanding of the transformer output than typical 'black-box' models (Honnibal & Montani, 2019).
RF was generally identified by other research as a superior modelling choice; however, we found that F1-scores were highest for the SVC models. The 1.0 precision achieved on RF (see Table 3) appears anomalous. Results from RF _lg models demonstrate extreme variability. A topic/word-modelling approach was not implemented in this research, diverging from the current consensus approach for SMS data analysis. Correct ner recognition and dependency parsing could only be generated on the _trf and _lg models using TF-IDF vectorizers.
SpaCy visualisations also confirmed the superior processing capabilities of the _trf sentencizer model over the _lg model. This choice of base architecture resulted in improved predictive accuracy (Table 3). The SVC model with SMOTE augmentation produced optimal results. The optimal model was evaluated using the accuracy, precision, recall and F1 metrics with predictions generated by standard sklearn classifiers. Due to the effectiveness of the transformer (tensor) topological pre-processing methods on the data, models were generated efficiently on a CPU. F1 was used as a comparison metric over accuracy due to the class imbalance and high cost of misclassification when predicting spam (Statology, 2021).
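The preference for F1 over accuracy on imbalanced data can be shown with a short worked example. The confusion-matrix counts below are hypothetical (a test set of 1000 messages with 130 spam) and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and their harmonic mean (F1) for the spam class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    """Fraction of all messages classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical classifier A: labels every message as ham.
acc_a = accuracy(tp=0, fp=0, fn=130, tn=870)
_, _, f1_a = precision_recall_f1(tp=0, fp=0, fn=130)

# Hypothetical classifier B: finds 100 of the 130 spam messages.
acc_b = accuracy(tp=100, fp=10, fn=30, tn=860)
_, _, f1_b = precision_recall_f1(tp=100, fp=10, fn=30)

print(acc_a, f1_a)
print(acc_b, round(f1_b, 3))
```

Classifier A reaches 0.87 accuracy without detecting a single spam message, yet its F1 is 0; classifier B's F1 of roughly 0.833 reflects its actual spam detection ability.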

Discussion of Key Findings
This work has identified an optimal classification model for short-text SMS data. The model achieves an SVC F1-score of 0.938 and consistently low processing times. This solution is user-extensible and interpretable, due to transparent implementation methods. Optimal topological rendering is achieved with sentence-encoders inherent within the spaCy models. spaCy leverages pipeline methods to embed dense vectors/tensors into a topological architecture and successfully fits the tested SMS modelling data. A spaCy pipeline with oversampling achieved the reported F1-scores on the SMS data and the evaluation dataset.
Short-text modelling methods do not appear to have been investigated or tested to a sufficient extent. This study demonstrates that the application of topological sentence-transformer methods is an optimal design choice for analyzing SMS data. Approximation-aware fast dependency parsing enables topological transformers to achieve high accuracies. Applying spaCy transformers enables edge processing to resolve as approximate (Gormley et al., 2015). This research has verified that runtimes can consistently present as O(n^3) (Gormley et al., 2015; Honnibal & Johnson, 2014), even with CPU use. Our research verified that transformer application extensions must be correctly implemented to leverage optimal runtime complexities on a CPU. The incorrect use of these methods has a devastating effect on runtime performance.
Untuned statistical spaCy transformer models achieved an excellent F1-score compared to _lg non-transformer methods. Fitting SMS data on embedded topological clustering optimized the SVC classifier and leveraged inherent topologies from dependency parsing methods (De Kok & Hinrichs, 2016). High-rank data renderings and the implementation of SoftMax extension applications (Yang et al., 2017) enabled access to previously unutilized topological data expressions. Previous research examining the effects of tuning on sequential inductive transfer learning (Peters et al., 2019) appears to be relevant for short-text and large-text NLP tasks. We have tested our model from a perspective of minimal user intervention, at both the transformer level and the downstream classifier level. The SMS dataset is comprised of short-sentence components and the application of untuned pipelines to classification modelling is, to the best of our knowledge, a new approach.
Dependency parsing is incorporated as a native constituent of the spaCy pipelines and used to mine similar paraphrases within sentences. Pipeline processing provides a fundamental approach for utilising spaCy models. The study implements a pipeline design to ensure extensible production functionality. Sklearn pipeline modelling mirrored the spaCy architecture and enabled seamless generation of F1-scores. The inherent sentence transformer topology was illustrated via part-of-speech tagging and entity recognition visualisations. Classification of semantically similar sentences initiated contextual 'learning' within the pre-trained model and enabled the identification of latent relationships (Gormley, Dredze & Eisner, 2015). SpaCy open-source technologies ensured access to optimal processing methods and enabled nuanced learning of short-message data. Effective topological renderings of tensors (Moliner et al., 2020) were achieved by implementing concise sentence dependency-parsing methods. Correct system design choices effectively captured semantically similar latent expressions embedded within SMS messages. RoBERTa is subject to bias as it uses BERT pre-trained models, sourced from internet-scraped data (Hugging Face, n.d.; Jain, 2022). Precision and Recall can be used to inform bias and influence iterative reprocessing adjustments, via transparent modelling (Bartička et al., 2022). Prioritising precision outcomes on imbalanced SMS data supports minority class predictions (Brownlee, 2020). SMS data is extremely variable and interrogating topological transformer methodologies via adversarial sampling, precision and F1-scoring could improve algorithmic fairness (Zhang et al., 2018). Contemporary work on unbiased transformer modelling prioritises transparent, user-centric evaluation methods (Modarressi et al., 2022).
This work uses open-source methods to generate a lightweight, user-extensible, state-of-the-art solution to the SMS spam-detection problem. SpaCy was chosen for implementation tasks because it proved straightforward to process and could be integrated with sklearn methods. Inherent implementation risks of open-source technological reliance include the inability of a system to respond to specific SMS data requirements, without significant model deconstruction. SoftMax developments have not been fully realised for discrete data, and work on Bernoulli latent variables requires monitoring (Yang et al., 2017). System modifications may be required to effectively process evolved versions of SMS text spam.

Conclusion
The results from this study demonstrate that modern NLP processing methods are suitable for use with SMS data. The tested models produced a variety of results for the SMS and Twitter datasets, based on a combination of classifiers and sampling techniques. The work identified that spaCy sentence transformers and sklearn pipeline implementations generated a maximum F1-score of 0.938, optimally utilising topological data. The modelling can be considered optimal because a lightweight, transparent, user-extensible architecture was leveraged to produce excellent F1-scores. These attributes were considered appropriate evaluation mechanisms to objectively assess production suitability. The research demonstrated that SMS text-based datasets of short sentences could be treated as documents and optimally classified. SpaCy is a constantly evolving product and this study presents a design approach requiring minimal end-user intervention.

Implications for Further Research
There are various avenues to use this work as a baseline for future research, including adversarial data augmentation strategies (Shirazi et al., 2023). Adversarial sampling incorporates synthetic data generation and could benefit projects working with dangerous/hard to retrieve data.
The opaque/'black box' nature of NNs does not generally afford users the opportunity to investigate statistical models and assess efficacy or bias. Textual meanings are interpretable and a proven ability to demonstrate generated inferences is of paramount importance to a system (Spring & Johnson, 2022). Speech is notoriously difficult to categorise, and assessment of bias must be an ongoing maintenance task in a production environment. These mitigations do not guarantee a model free from bias but do allow user transparency (May et al., 2019; Sartran et al., 2022).
Dense embedding is used by RoBERTa, but Multiple Negative Ranking (MNR) loss sentence embedding has surpassed recorded accuracies since 2019. MNR manipulates the cosine similarity metric by comparing opposing vectors (Nguyen et al., 2022). Spam datasets could be tested using this technique once the open-source implementation version becomes available.
Additional hyperparameter tuning was not enacted on classifiers, and no complementary sampling strategies were applied to the Twitter dataset, excluding the inbuilt _trf generation sampling. Application of additional sampling could improve F1-scores (Peters et al., 2019) for Twitter data. SpaCy is RoBERTa-based and provides scope to adjust parameters of pre-trained weights in the pre-processing stage, circumventing sklearn hyperparameter tuning. Intra-model hyperparameter tuning degrades the performance of BERT, requiring careful manipulation, but could be achieved with new attribution techniques (Modarressi et al., 2022; Xu et al., 2022).

Figure 1
Inter-topic Modelling using BERT Topic methods: SMS data