RESEARCH ON MACHINE LEARNING MODELS FOR TOXIC CONTENT DETECTION

Authors

Kolchenko, V., Sabodashko, D., Piataiev, K., Horodnyk, V., Shchudlo, I., & Khoma, Y.

DOI:

https://doi.org/10.28925/2663-4023.2026.32.1189

Keywords:

toxicity detection; hate speech; deep learning; natural language processing; large language models.

Abstract

The rapid growth of digital communication and the proliferation of online platforms have intensified the need to detect toxic content, including hate speech, cyberbullying, threats, and discriminatory statements. Such content harms both individual users and the broader information environment, undermining trust in digital services and contributing to the spread of social bias. Given the impracticality of large-scale manual moderation, automated methods based on Deep Learning and Natural Language Processing (NLP) have become increasingly important.

This study presents a comparative analysis of modern transformer-based models for toxicity detection in text, including Toxic BERT, Toxic Comment Model, RoBERTa Toxicity Classifier, and DeHateBERT Mono English, as well as the general-purpose large language model Phi-3-mini-4k applied without task-specific fine-tuning. The evaluation was conducted on two publicly available datasets – Measuring Hate Speech and the Jigsaw Toxic Comment Dataset – using Accuracy, Precision, Recall, and F1-score as metrics. The primary focus was placed on optimizing the F1-score as a balanced measure of precision and recall. Additionally, a threshold-based filtering mechanism was applied to enable objective comparison of models with different levels of sensitivity to toxic content.
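
For clarity, a minimal Python sketch of such a threshold-based evaluation is given below. It is illustrative only and is not the authors' published code: the score and label arrays are hypothetical, the threshold grid is an assumption, and the metric functions come from scikit-learn.

import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate_at_threshold(scores, labels, threshold):
    # Binarize the model's toxicity scores at the given threshold,
    # then compute Accuracy, Precision, Recall, and F1-score.
    preds = (np.asarray(scores) >= threshold).astype(int)
    accuracy = accuracy_score(labels, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

def best_f1_threshold(scores, labels):
    # Sweep a grid of candidate thresholds and keep the one that
    # maximizes F1-score, so that models with different sensitivity
    # levels can be compared on an equal footing.
    grid = np.linspace(0.05, 0.95, 19)
    results = [(t, evaluate_at_threshold(scores, labels, t)) for t in grid]
    return max(results, key=lambda item: item[1]["f1"])

# Hypothetical usage with made-up toxicity scores and gold labels:
scores = [0.91, 0.12, 0.55, 0.80, 0.05, 0.63]
labels = [1, 0, 1, 1, 0, 0]
threshold, metrics = best_f1_threshold(scores, labels)
print(threshold, metrics)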

The experimental results demonstrate that specialized transformer-based models, particularly the RoBERTa Toxicity Classifier, achieve the highest effectiveness in detecting explicit forms of toxic language, reaching accuracy above 90% for categories such as severe hate speech, threats, and calls for violence. However, performance decreases when addressing context-dependent and indirect manifestations of toxicity, including subtle aggression, disrespect, and offensive humor. It was also shown that class imbalance within the datasets significantly affects classification quality, resulting in lower performance for underrepresented categories. Furthermore, model bias toward specific identity subgroups and sensitivity to profanity were identified.

The study additionally demonstrates that applying a general-purpose LLM without task-specific fine-tuning is both ineffective and computationally expensive at inference time. The obtained results confirm the appropriateness of using specialized models for content moderation tasks and highlight the promise of combining general-purpose large language models with domain-specific classifiers to improve context-aware toxicity detection. Future research should focus on developing ensemble approaches and reducing model bias to ensure fairer and more reliable automated moderation systems.
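
As an illustration of the zero-shot setup found to be ineffective, the sketch below prompts a general-purpose instruction-tuned LLM for a binary toxicity verdict via the Hugging Face transformers library. The prompt wording and the parsing rule are assumptions for illustration, not the protocol used in the study; the model identifier "microsoft/Phi-3-mini-4k-instruct" refers to the publicly released Phi-3-mini-4k checkpoint.

from transformers import pipeline

# Zero-shot prompting of a general-purpose LLM with no task-specific
# fine-tuning. Loading with device_map="auto" requires the `accelerate`
# package, and Phi-3 support requires a recent transformers version.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-4k-instruct",
    device_map="auto",
)

def classify_zero_shot(text):
    # Ask the model for a one-word verdict and parse its completion.
    # Both the prompt template and the parsing rule are illustrative
    # assumptions.
    prompt = ("Classify the following comment as TOXIC or NON-TOXIC. "
              "Answer with a single word.\n"
              f"Comment: {text}\nAnswer:")
    output = generator(prompt, max_new_tokens=5, do_sample=False)
    completion = output[0]["generated_text"][len(prompt):]
    return "NON-TOXIC" if completion.strip().upper().startswith("NON") else "TOXIC"

print(classify_zero_shot("Have a nice day!"))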



Published

2026-03-26

How to Cite

Kolchenko, V., Sabodashko, D., Piataiev, K., Horodnyk, V., Shchudlo, I., & Khoma, Y. (2026). RESEARCH ON MACHINE LEARNING MODELS FOR TOXIC CONTENT DETECTION. Electronic Professional Scientific Journal «Cybersecurity: Education, Science, Technique», 4(32), 242–258. https://doi.org/10.28925/2663-4023.2026.32.1189