Virtual phone discovery for speech synthesis without text
S. Nayak, C.S. Kumar, G. Ramesh, S. Bhati
Published by the Institute of Electrical and Electronics Engineers (IEEE)
2019
Abstract
The objective of this work is to re-synthesize speech directly from speech signals, without using any text, in a different speaker's voice. The speech signals are transformed into a sequence of acoustic subword units, or virtual phones, which are discovered automatically from the given speech signals in an unsupervised manner. The speech signal is first segmented into acoustically homogeneous segments through kernel-Gram segmentation using MFCC and autoencoder bottleneck features. These segments are then clustered using different clustering techniques, and the resulting cluster labels are treated as virtual phone units with which the speech signals are transcribed. The virtual phones for the utterances to be re-synthesized are encoded as one-hot vector sequences, on which a deep neural network based duration model and acoustic model are trained for synthesis. A vocoder then synthesizes speech in the target speaker's voice from the features estimated by the acoustic model. Performance is evaluated on the ZeroSpeech 2019 challenge for English and Indonesian. The bitrate and speaker similarity were found to be better than the challenge baseline, with slightly lower intelligibility due to the compact encoding. © 2019 IEEE.
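The one-hot encoding step described in the abstract can be sketched as follows. This is an illustrative example, not the authors' code: the function name, the 8-unit virtual-phone inventory, and the label values are assumptions made for the sketch.

```python
# Sketch (assumed, not from the paper): encoding a sequence of discovered
# virtual-phone labels as one-hot vectors, as input to a duration or
# acoustic model.
import numpy as np

def one_hot_encode(labels, num_units):
    """Map a sequence of virtual-phone cluster labels to a one-hot
    matrix of shape (sequence_length, num_units)."""
    encoded = np.zeros((len(labels), num_units), dtype=np.float32)
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# A hypothetical utterance transcribed into 5 virtual phones drawn from
# an 8-unit inventory discovered by clustering.
utterance = [3, 3, 1, 7, 0]
X = one_hot_encode(utterance, num_units=8)
```

Because each frame segment is reduced to a single discrete label, the representation is very compact, which is consistent with the low bitrate (and slightly lower intelligibility) reported in the abstract.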