Automated Classification of Self-Admitted Technical Debt Using Advanced Word Embedding Techniques

Authors

  • Satya Mohan Chowdary Gorripati, Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham-Chennai, India. ORCID: https://orcid.org/0000-0002-1952-6403
  • Ali Altalbe, Faculty of Computing and Information Technology, King Abdulaziz University, Saudi Arabia
  • Prasanna Kumar Rangarajan, Department of Computer Science and Engineering, Amrita Vishwa Vidyapeetham-Chennai, India. ORCID: https://orcid.org/0000-0001-6103-259X

DOI:

https://doi.org/10.47852/bonviewJCCE52025976

Keywords:

self-admitted technical debt, n-gram IDF, word embeddings, feature extraction, natural language processing

Abstract

This research uses advanced word embedding techniques to improve the automatic classification of Self-Admitted Technical Debt (SATD). We evaluate how well n-gram Inverse Document Frequency (IDF) produces feature sets suited to machine learning classifiers. SATD classification methods were assessed on a publicly available dataset of Java source code comments from 10 open-source projects. The classifiers evaluated were Random Forest, Support Vector Machine (SVM), Logistic Regression, and XGBoost. We used instance hardness undersampling to handle the class imbalance in the SATD dataset. We evaluated the classifiers using accuracy, recall, F1-score, and Macro-Averaged Mean Cost-Error (MMCE). The Random Forest classifier with n-gram IDF features achieved an average accuracy of 87%. On average it performed comparably to the traditional TF-IDF and Bag-of-Words methods, and it performed better on certain projects and MMCE values. In a small number of cases, n-gram IDF may reveal contextual phrase patterns and improve SATD recognition, especially when combined with ensemble learning. To improve generalisability, future research will expand the dataset, investigate hybrid and deep learning models, and extend applicability to other programming languages and project domains.
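
As a rough illustration of the ideas named in the abstract, the sketch below computes a plain IDF weight, log(N / df), for word n-grams in a handful of source code comments, and derives accuracy, recall, and F1-score from predicted labels. It is a simplified stand-in, not the authors' pipeline: the n-gram IDF scheme studied in the paper may weight phrases differently, and the example comments, the function names (`word_ngrams`, `ngram_idf`, `binary_metrics`), and the whitespace tokenisation are all assumptions made for illustration.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return the set of word n-grams (n = 1..n_max) found in one comment."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenisation
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def ngram_idf(comments, n_max=2):
    """Plain IDF, log(N / df), per n-gram (a simplification, see lead-in)."""
    doc_freq = Counter()
    for comment in comments:
        # Each n-gram counts once per comment, giving document frequency.
        doc_freq.update(word_ngrams(comment, n_max))
    n_docs = len(comments)
    return {gram: math.log(n_docs / df) for gram, df in doc_freq.items()}


def binary_metrics(y_true, y_pred):
    """Accuracy, recall, and F1 for the positive (SATD = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


# Hypothetical comments, invented for this example only.
comments = [
    "TODO fix this hack later",
    "this is a temporary hack",
    "returns the user name",
    "TODO remove temporary workaround",
]
idf = ngram_idf(comments)

# Hypothetical true and predicted labels (1 = SATD, 0 = not SATD).
scores = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In this toy corpus, a phrase that occurs in only one comment (such as "todo fix") receives a higher weight than a word shared by several comments (such as "hack"), which is the intuition behind using IDF-weighted n-grams as classifier features.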

 

Received: 21 April 2025 | Revised: 12 September 2025 | Accepted: 7 October 2025

 

Conflicts of Interest

The authors declare that they have no conflicts of interest related to this work.

 

Data Availability Statement

Data are available from the corresponding author upon reasonable request.

 

Author Contribution Statement

Satya Mohan Chowdary Gorripati: Methodology, Software, Formal analysis, Investigation, Resources, Data curation, Writing – original draft. Ali Altalbe: Writing – review & editing. Prasanna Kumar Rangarajan: Conceptualization, Validation, Visualization, Supervision, Project administration.


Published

2025-11-21

Section

Research Articles

How to Cite

Gorripati, S. M. C., Altalbe, A., & Rangarajan, P. K. (2025). Automated Classification of Self-Admitted Technical Debt Using Advanced Word Embedding Techniques. Journal of Computational and Cognitive Engineering. https://doi.org/10.47852/bonviewJCCE52025976