ADAPTIVE PHONEME-SPECTRAL ENVELOPE RECONSTRUCTION UNDER SEVERE ACOUSTIC MASKING
DOI:
https://doi.org/10.28925/2663-4023.2025.28.794805
Keywords:
spectral envelope; formants; wavelet denoising; energy centroid μ; σ; SNR; STI; active acoustic jamming.
Abstract
This article presents a phoneme-spectral methodology for restoring severely noise-corrupted speech while preserving parameter interpretability and a physically consistent spectral envelope under active acoustic masking. The approach unifies local peak parameters (amplitudes Aᵢ, frequencies Fᵢ, widths σᵢ) with the global energy centroid μ, which stabilizes formant-contour reconstruction at low SNR and maintains the energy balance between low- and high-frequency regions. The signal is first transformed into the spectral domain using the FFT, the amplitudes are normalized, and the operating frequency range is defined. A Gaussian-mixture envelope is initialized from local energy maxima and iteratively refined by minimizing the discrepancy between the spectral envelope of the /ž/ phoneme and its physical amplitude spectrum. Noise suppression is performed using a discrete wavelet transform (sym8 wavelet, decomposition level L = 3) with a threshold λ = κσₙ√(2 ln N), which preserves peak shapes and minimizes reconstruction artifacts. Spline smoothness is governed by a frequency-dependent logarithmic law s(f) = a·log₁₀(1 + b·f), which increases node density at high frequencies and enables accurate approximation of rapid spectral variations without overfitting noise. The combined σᵢ×f(μ) model adapts local widths using the global energy centroid, improving the consistency between low- and high-frequency regions of the envelope. The procedure is compatible with psychoacoustic frequency scales (Bark/Mel) and objective intelligibility metrics (STI), enabling analytical interpretation of parameter contributions to perceived quality. Experiments on the Ukrainian fricative phoneme /ž/ demonstrate RMSE ≈ 4% at SNR ≈ −6 dB in the physical amplitude domain, with stable μ values under threshold variation. The proposed methodology is applicable to room-acoustics assessment under active masking conditions and to security-oriented analyses of speech-channel informativeness. Transparent validation stages (reference-formant comparison, μ-balance inspection, and evaluation of σᵢ-, μ-, and combined models) support its transferability to other phoneme classes without loss of interpretability.
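The formulas named in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation, and several choices are assumptions: μ is taken as the energy-weighted mean frequency, the thresholding rule is taken to be soft shrinkage, and all numeric values (peak parameters, σₙ, N, a, b) are hypothetical. The sym8 wavelet decomposition itself and the iterative envelope refinement are not shown.

```python
import math

def gaussian_envelope(f, peaks):
    """Gaussian-mixture envelope at frequency f.
    peaks is a list of (A_i, F_i, sigma_i) tuples."""
    return sum(a * math.exp(-((f - fc) ** 2) / (2.0 * s ** 2))
               for a, fc, s in peaks)

def energy_centroid(freqs, amps):
    """Energy-weighted mean frequency (ASSUMED definition of mu;
    the abstract does not give the formula)."""
    energy = [a * a for a in amps]
    total = sum(energy)
    if total == 0.0:
        raise ValueError("spectrum has no energy")
    return sum(f * e for f, e in zip(freqs, energy)) / total

def universal_threshold(sigma_n, n, kappa=1.0):
    """lambda = kappa * sigma_n * sqrt(2 ln N), as stated in the abstract."""
    return kappa * sigma_n * math.sqrt(2.0 * math.log(n))

def soft_threshold(coeffs, lam):
    """Shrink wavelet coefficients toward zero by lam.
    Soft shrinkage is an ASSUMPTION; the abstract does not name the rule."""
    return [math.copysign(max(abs(c) - lam, 0.0), c) for c in coeffs]

def smoothing_law(f, a, b):
    """Frequency-dependent spline smoothness s(f) = a * log10(1 + b f)."""
    return a * math.log10(1.0 + b * f)

# Hypothetical two-peak envelope sampled every 100 Hz up to 8 kHz.
peaks = [(1.0, 2500.0, 300.0), (0.6, 4500.0, 500.0)]
freqs = [100.0 * k for k in range(1, 81)]
amps = [gaussian_envelope(f, peaks) for f in freqs]

mu = energy_centroid(freqs, amps)
lam = universal_threshold(sigma_n=0.05, n=1024, kappa=1.0)
print(f"mu = {mu:.1f} Hz, lambda = {lam:.4f}")
```

In this sketch κ scales the classical universal threshold, so values below 1 would retain more detail coefficients and values above 1 would suppress more of them; how κ and σₙ are chosen in the article is not stated in the abstract.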
License
Copyright (c) 2025 Сергій Нужний

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.