Chinese Toxic Comment Detection: A Comparative Study of Traditional ML, Deep Learning, Encoder-Based and Decoder-Based Models
DOI:
https://doi.org/10.47852/bonviewJCCE52026117
Keywords:
Chinese toxic comment detection, machine learning, deep learning, encoder-based model, decoder-based model, binary classification, multilingual NLP
Abstract
In recent years, the rapid growth of the internet and the continued expansion of applications such as social media have made the health and safety of the online environment a matter requiring serious attention. Toxic language detection in Chinese faces unique challenges because of the complexity and diversity of Chinese syntactic expression. This study evaluates traditional machine learning models, deep learning models, encoder-based transformer models, and decoder-based transformer models (LLMs) on the identification of toxic comments in Chinese and compares their performance characteristics. Two main datasets, COLD and TOCAB, are combined into a single binary classification task, and all tested models are assessed using accuracy, F1-score, precision, and recall. Among the tested models, the decoder-based Qwen1.5-7B (8-bit quantization) achieves the highest accuracy (94.71%). The traditional machine learning models and encoder-based transformer models perform moderately, while the deep learning models reach lower accuracy (77%–80%), which the study attributes to limited context understanding. These results indicate that decoder-based large language models have advantages in the detection of toxic comments in Chinese.
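The four evaluation metrics named in the abstract can be illustrated with a minimal sketch for a binary toxic/non-toxic task. This is not the authors' evaluation code: the function name `binary_metrics`, the label convention (1 = toxic, 0 = non-toxic), the choice to score the toxic class as the positive class, and the toy labels are all assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with the toxic class (1) as positive.

    Hypothetical helper for illustration; label convention is an assumption.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels, invented for this example (not drawn from COLD or TOCAB).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Whether the study reports F1 for the toxic class only or averages it across both classes is not stated in the abstract, so the single-class scoring here is only one possible reading.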
Received: 8 May 2025 | Revised: 24 September 2025 | Accepted: 15 October 2025
Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.
Data Availability Statement
The data that support the findings of this study are openly available in GitHub at https://github.com/TanYouxi/Chinese-Toxic-Comment-Detection.
Author Contribution Statement
Youxi Tan: Conceptualization, Software, Formal analysis, Data curation, Writing – original draft, Writing – review & editing, Visualization. Mingjie Fang: Software, Validation, Data curation, Writing – original draft, Writing – review & editing, Visualization. Da Shen: Validation, Formal analysis, Writing – original draft, Writing – review & editing. Jiayi Xu: Writing – review & editing. Baha Ihnaini: Methodology, Investigation, Resources, Supervision, Project administration, Funding acquisition.
License
Copyright (c) 2025 Authors

This work is licensed under a Creative Commons Attribution 4.0 International License.
Funding data
Wenzhou-Kean University, Grant number IRSPK2023005