Speaker embedding extraction with virtual phonetic information
S. Sreekanth, B. Shaik Mohammad Rafi, S. Bhati
Published by the Institute of Electrical and Electronics Engineers (IEEE)
2019
Abstract
In the recent past, deep neural networks have been successfully employed to extract fixed-dimensional speaker embeddings from the speech signal. The commonly used x-vectors are extracted by projecting the magnitude spectral features of the speech signal onto a speaker-discriminative space. As x-vectors do not explicitly capture speaker-specific phonological pronunciation variability, phonetic vectors extracted from an automatic speech recognition (ASR) engine have been supplied as auxiliary information to improve the performance of the x-vector system. However, developing an ASR engine requires a large amount of manually transcribed speech data. In this paper, we propose to transcribe the speech signal in an unsupervised manner with cluster labels obtained from a mixture of autoencoders (MoA) trained on a large amount of speech data. The unsupervised labels, referred to as virtual phonetic transcriptions, are used to extract the phonetic vectors. The virtual phonetic vectors extracted using MoA are supplied as auxiliary information to the x-vector system. The performance of the proposed system is compared with the state-of-the-art x-vector system on NIST SRE-2010 data. The proposed unsupervised auxiliary information provides a relative improvement of 12.08%, 3.61% and 16.66% over the x-vector system on the core-core, core-10sec and 10sec-10sec conditions, respectively. © 2019 IEEE.
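To illustrate the idea of virtual phonetic transcriptions, the following is a minimal sketch in Python/NumPy. It assumes linear autoencoders (PCA-style subspaces) as a simplified stand-in for the neural autoencoders of the MoA described in the paper, and frames are assigned to whichever autoencoder reconstructs them best; the resulting cluster labels are appended one-hot to the spectral features as auxiliary input. All function names, dimensions and cluster counts here are illustrative assumptions, not the paper's actual configuration.

import numpy as np

def train_mixture_of_autoencoders(frames, n_clusters=8, latent_dim=10, n_iters=10, seed=0):
    """Toy mixture of linear autoencoders: each cluster is a PCA subspace;
    frames are (re)assigned to the subspace with the lowest reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = frames.shape
    labels = rng.integers(0, n_clusters, size=n)          # random initial assignment
    bases = [np.eye(d)[:, :latent_dim] for _ in range(n_clusters)]
    means = [frames.mean(axis=0) for _ in range(n_clusters)]

    for _ in range(n_iters):
        # Refit each autoencoder (here: a PCA basis) on its currently assigned frames
        for k in range(n_clusters):
            assigned = frames[labels == k]
            if len(assigned) < latent_dim:
                continue
            means[k] = assigned.mean(axis=0)
            _, _, vt = np.linalg.svd(assigned - means[k], full_matrices=False)
            bases[k] = vt[:latent_dim].T                  # shape (d, latent_dim)

        # Reassign each frame to the autoencoder that reconstructs it best
        errors = np.stack([
            np.sum((frames - means[k]
                    - (frames - means[k]) @ bases[k] @ bases[k].T) ** 2, axis=1)
            for k in range(n_clusters)
        ], axis=1)                                        # shape (n, n_clusters)
        labels = errors.argmin(axis=1)

    return labels, means, bases

# Virtual phonetic transcription: per-frame cluster labels, appended one-hot
# to the spectral features as auxiliary information for embedding extraction.
frames = np.random.randn(1000, 40)                        # stand-in for 40-dim filterbank features
labels, _, _ = train_mixture_of_autoencoders(frames, n_clusters=8)
one_hot = np.eye(8)[labels]                               # (n_frames, n_clusters)
augmented = np.concatenate([frames, one_hot], axis=1)     # features + virtual phonetic info
print(augmented.shape)                                    # (1000, 48)

In the paper's pipeline, such unsupervised labels replace the ASR-derived phonetic transcriptions, so no manually transcribed speech is needed to produce the auxiliary phonetic vectors.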