A non-linear source-filter based vocoder with prosody control

P. Giridhar; G. Ramesh; Sri Rama Murty Kodukula

doi:10.1109/NCC56989.2023.10067968

Reconstructing a speech signal from its compact acoustic representation is a challenging task. Even when the acoustic representations obtained from speech processing systems (such as text-to-speech synthesis and speech enhancement) are highly accurate, the performance of the vocoder affects the naturalness of the synthesized speech. Conventional vocoders are based on the linear source-filter model of the human speech production mechanism. However, they cannot be incorporated into the training of an end-to-end model, and they are sensitive to errors in the estimated acoustic representations. Neural vocoders like WaveNet can be incorporated into end-to-end training, but their complexity and inference time are high, and they have no provision for controlling prosody. In this paper, we propose a compact, neural network based non-linear vocoder with prosody control, built on the source-filter model of the human speech production mechanism. The prosody of the synthesized speech can be controlled effectively by adjusting prosodic parameters such as the fundamental frequency (f_0), without affecting the naturalness of the speech. The model achieves better performance, with a mean opinion score (MOS) of 4.09 and a much lower real-time factor and model complexity. © 2023 IEEE.
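
For readers unfamiliar with the linear source-filter model that conventional vocoders build on, the following is a minimal illustrative sketch and not the method proposed in the paper. It assumes a voiced excitation modelled as an impulse train at f_0 and a vocal tract modelled as a single two-pole resonator. The function names, the 700 Hz formant, the 100 Hz bandwidth, and the 16 kHz sample rate are all hypothetical choices made only for illustration.

```python
import math

def impulse_train(f0_hz, duration_s, sample_rate=16000):
    """Source: voiced excitation with roughly one unit impulse per pitch period."""
    n_samples = int(duration_s * sample_rate)
    excitation = [0.0] * n_samples
    phase = 0.0
    for n in range(n_samples):
        phase += f0_hz / sample_rate  # fraction of a pitch period elapsed
        if phase >= 1.0:
            phase -= 1.0
            excitation[n] = 1.0
    return excitation

def all_pole_resonator(signal, formant_hz, bandwidth_hz, sample_rate=16000):
    """Filter: y[n] = x[n] + a1*y[n-1] + a2*y[n-2], a single-formant vocal tract."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius (< 1, stable)
    theta = 2.0 * math.pi * formant_hz / sample_rate     # pole angle
    a1 = 2.0 * r * math.cos(theta)
    a2 = -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize(f0_hz, duration_s=0.5, sample_rate=16000):
    """Linear source-filter synthesis: excitation passed through the filter."""
    excitation = impulse_train(f0_hz, duration_s, sample_rate)
    # Assumed (hypothetical) formant at 700 Hz with a 100 Hz bandwidth.
    return all_pole_resonator(excitation, 700.0, 100.0, sample_rate)

low_pitch = synthesize(100.0)   # f_0 = 100 Hz
high_pitch = synthesize(200.0)  # same filter, f_0 doubled
```

Because the source and the filter are separate, changing `f0_hz` alters the pitch of the output while the resonance stays fixed. This separation is what makes prosody control natural in source-filter designs.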

Journal	2023 National Conference on Communications, NCC 2023
Publisher	Institute of Electrical and Electronics Engineers Inc.