SECURITY OF LARGE LANGUAGE MODELS: RISKS, THREATS, AND SECURITY APPROACHES
DOI:
https://doi.org/10.28925/2663-4023.2025.29.918Keywords:
large language models; Generative AI; cybersecurity; AI firewall; prompt injection; guardrails; watermarking; LLM vulnerability.Abstract
The article provides a comprehensive analysis of current security challenges related to Large Language Models (LLMs), which have become a key element of digital transformation across multiple sectors. It examines typical threats arising both from targeted attacks on models and from their malicious use in cybercrime. The main risk vectors are identified, including prompt injection - embedding hidden instructions in user queries to alter model logic, and jailbreaking - crafting prompts that bypass built-in restrictions and trigger undesirable behavior. Special attention is given to the risks of confidential data leakage from training datasets, generation of vulnerable or malicious code that can enter production environments, and the dissemination of disinformation, including multimedia deepfakes. Based on the analysis, a conceptual LLM security model is proposed, combining technical, architectural, and regulatory elements of protection. Particular emphasis is placed on assessing and applying mechanisms such as AI firewalls - intermediary systems that filter model inputs and outputs; built-in security modules integrated into model architectures; and guardrails - restrictions on outputs without altering parameters. Additionally, watermarking methods for synthetic content identification and AI-generated content detection tools are considered. Regulatory measures are highlighted as an essential component to establish usage frameworks for powerful models and mitigate misuse risks. It is concluded that traditional cybersecurity measures, focused on static or signature-based detection, are insufficient for generative systems operating in dynamic natural language environments. Enhancing security requires multilayered strategies covering all stages of the LLM lifecycle - from design and training to deployment and regulatory oversight.
Downloads
References
OpenAI. (2025). GPT-5 system card. OpenAI. https://openai.com/index/gpt-5-system-card/
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., … & Song, D. (2021). Extracting training data from large language models. In Proceedings of the 30th USENIX Security Symposium (pp. 2633–2650). USENIX Association.
Weidinger, L., Uesato, J., Rauh, M., Griffin, C., Huang, P. S., Mellor, J., … & Gabriel, I. (2021). Ethical and social risks of harm from language models. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT) (pp. 214–229). ACM.
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT) (pp. 610–623). ACM.
European Commission. (2024). Proposal for a regulation of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). EUR-Lex. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206
Microsoft. (2023). Responsible AI standard v2. Microsoft. https://www.microsoft.com/ai/responsible-ai
Anthropic. (2023). Constitutional AI: Harmlessness from AI feedback. arXiv Preprint. arXiv:2212.08073. https://arxiv.org/abs/2212.08073
National Institute of Standards and Technology. (2023). AI risk management framework (AI RMF 1.0). U.S. Department of Commerce. https://www.nist.gov/itl/ai-risk-management-framework
Anthropic. (2023). Model card and evaluations for Claude models. Anthropic. https://www-cdn.anthropic.com/bd2a28d2535bfb0494cc8e2a3bf135d2e7523226/Model-Card-Claude-2.pdf
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., … & Metzler, D. (2023). Emergent abilities of large language models. arXiv Preprint. arXiv:2307.02483. https://arxiv.org/pdf/2307.02483
Azaria, A., & Mitchell, T. (2023). The internal state of an LLM knows when it’s lying. arXiv Preprint. arXiv:2305.07243. https://arxiv.org/pdf/2305.07243
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. arXiv Preprint. arXiv:1909.08593. https://arxiv.org/pdf/1909.08593
Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. (2020). Learning to summarize with human feedback. arXiv Preprint. arXiv:2104.05218. https://arxiv.org/pdf/2104.05218
Liu, A., Pang, R. Y., Zeng, K., Wang, A., Xie, T., Chen, X., & Zhou, D. (2023). Trustworthy AI: A computational perspective. arXiv Preprint. arXiv:2301.10226. https://arxiv.org/pdf/2301.10226
Glukhov, D., Wiggers, K., & Young, T. (2023). The malicious use of generative AI for cyberattacks: A survey. Journal of Information Security and Applications, 75, 103666. Elsevier. https://doi.org/10.1016/j.jisa.2023.103666
Published
How to Cite
Issue
Section
License
Copyright (c) 2025 Галина Гайдур; Вадим Власенко; Олександра Петрова

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.