Automated Classification of Self-Admitted Technical Debt Using Advanced Word Embedding Techniques
DOI:
https://doi.org/10.47852/bonviewJCCE52025976
Keywords:
self-admitted technical debt, n-gram IDF, word embeddings, feature extraction, natural language processing
Abstract
This research uses advanced word embedding techniques to improve the automatic classification of Self-Admitted Technical Debt (SATD). We evaluate how effectively n-gram Inverse Document Frequency (IDF) produces feature sets suited to machine learning classifiers. SATD classification methods were assessed on a publicly available dataset of Java source code comments from 10 open-source projects. The classifiers evaluated were Random Forest, SVM, Logistic Regression, and XGBoost. We used instance hardness undersampling to handle the class imbalance in the SATD dataset. We evaluated the classifiers using accuracy, recall, F1-score, and Macro-Averaged Mean Cost-Error (MMCE). The Random Forest classifier with n-gram IDF features achieved an average accuracy of 87%. On average, it performed similarly to the traditional TF-IDF and Bag-of-Words methods, and it performed better on certain projects and MMCE values. In rare circumstances, n-gram IDF may reveal contextual phrase patterns and improve SATD recognition, especially when combined with ensemble learning. To enhance generalisability, future research will expand the dataset, investigate hybrid and deep learning models, and increase applicability across various programming languages and project areas.
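The abstract does not give the exact n-gram IDF formulation, so the following is only a simplified sketch of the general idea: it applies the plain IDF weight, log(N / df), to word n-grams extracted from code comments. The function names `word_ngrams` and `ngram_idf`, the whitespace tokenisation, and the toy comments are all assumptions made for illustration and are not taken from the paper or its dataset.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_max):
    """Yield every contiguous word n-gram of length 1..n_max as a tuple."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def ngram_idf(comments, n_max=3):
    """Map each n-gram to log(N / df), where df counts comments containing it."""
    doc_freq = Counter()
    for comment in comments:
        # Assumption: lowercase + whitespace split stands in for real tokenisation.
        tokens = comment.lower().split()
        # A set ensures each n-gram is counted at most once per comment.
        doc_freq.update(set(word_ngrams(tokens, n_max)))
    n_docs = len(comments)
    return {gram: math.log(n_docs / df) for gram, df in doc_freq.items()}


# Hypothetical toy comments, not drawn from the paper's dataset.
comments = [
    "TODO fix this hack later",
    "this is a temporary hack",
    "returns the parsed value",
    "TODO remove temporary workaround",
]
idf = ngram_idf(comments, n_max=2)
```

In this sketch, n-grams that occur in fewer comments receive a higher weight, so a phrase such as `("temporary", "hack")` scores above the single word `("hack",)`. The resulting weights could then serve as features for a classifier such as Random Forest, although how the paper selects and vectorises n-grams is not specified in the abstract.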
Received: 21 April 2025 | Revised: 12 September 2025 | Accepted: 7 October 2025
Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
Data are available from the corresponding author upon reasonable request.
Author Contribution Statement
Satya Mohan Chowdary Gorripati: Methodology, Software, Formal analysis, Investigation, Resources, Data curation, Writing – original draft. Ali Altalbe: Writing – review & editing. Prasanna Kumar Rangarajan: Conceptualization, Validation, Visualization, Supervision, Project administration.
License
Copyright (c) 2025 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.