EVALUATION OF THE SCALABILITY OF VOICE EMBEDDING MODELS IN BIOMETRIC SPEAKER VERIFICATION SYSTEMS
DOI:
https://doi.org/10.28925/2663-4023.2025.31.1042
Keywords:
voice biometrics; scalability; speaker verification; embeddings; authentication; ECAPA-TDNN; Pyannote; WavLM.
Abstract
The rapid expansion of digital platforms in the financial sector, public administration, e-commerce, and service systems has created a growing demand for highly reliable and scalable user authentication technologies. In this context, biometric methods — particularly voice-based authentication systems — demonstrate significant potential due to their natural ease of interaction, minimal hardware requirements, and seamless integration into voice-driven interfaces. However, the increasing number of users and the diversity of usage scenarios introduce new challenges for researchers and developers. Modern systems must ensure high accuracy in real time, maintain stable performance as data volumes grow, and provide resilience against cyberattacks, including those involving synthetic or manipulated speech. A critical requirement is the ability of models to generate compact, invariant, and robust voice embeddings that enable efficient comparison and classification within large-scale databases. This paper presents a comparative analysis of the scalability of contemporary neural architectures for speaker verification, with emphasis on their performance, computational complexity, and behavior as the number of enrolled users increases. The study examines model optimization techniques, indexed embedding-based search methods, and the role of representative multilingual corpora in enhancing accuracy under conditions of acoustic and linguistic variability. Particular attention is given to protection against spoofing attacks and the use of specialized synthetic speech detection methods as an essential component of scalable voice biometric systems. The results highlight the need for a comprehensive approach to designing modern voice authentication systems, in which architectural engineering decisions are combined with requirements for information security, high performance, and adaptability in the rapidly evolving landscape of digital services.
License
Copyright (c) 2025 Христина Руда, Дмитро Сабодашко, Ігор Кос, Аліна Ахмедова

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.