CNN-Based Joint Mapping of Short and Long Utterance i-Vectors for Speaker Verification Using Short Utterances

Jinxi Guo, Usha Amrutha Nookala, Abeer Alwan


Text-independent speaker recognition using short utterances is a highly challenging task due to the large variation and content mismatch between short utterances. I-vector and probabilistic linear discriminant analysis (PLDA) based systems have become the standard in speaker verification applications, but they are less effective with short utterances. To address this issue, we propose a novel method that trains a convolutional neural network (CNN) model to map the i-vectors extracted from short utterances to the corresponding long-utterance i-vectors. In order to simultaneously learn the representation of the original short-utterance i-vectors and fit the target long-version i-vectors, we jointly train a supervised-regression model with an autoencoder using CNNs. The trained CNN model is then used to generate the mapped version of short-utterance i-vectors in the evaluation stage. We compare our proposed CNN-based joint mapping method with a GMM-based joint modeling method under matched and mismatched PLDA training conditions. Experimental results using the NIST SRE 2008 dataset show that the proposed technique achieves up to 30% relative improvement under duration-mismatched PLDA-training conditions and outperforms the GMM-based method. The improved systems also perform better than systems trained under the matched-length short-utterance PLDA condition.
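The joint objective described above combines a supervised regression loss (fit the long-utterance i-vector) with an autoencoder reconstruction loss (preserve the short-utterance representation). The sketch below illustrates that objective with a single shared hidden layer standing in for the CNN; all dimensions, weights, and the balancing weight `LAMBDA` are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; real i-vectors are typically a few hundred dimensions.
IVEC_DIM, HID_DIM = 64, 32
LAMBDA = 0.5  # assumed weight balancing regression vs. reconstruction loss

# Dense stand-in for the CNN: one shared encoder and two output heads.
W_enc = rng.standard_normal((HID_DIM, IVEC_DIM)) * 0.1
W_reg = rng.standard_normal((IVEC_DIM, HID_DIM)) * 0.1  # regression head
W_dec = rng.standard_normal((IVEC_DIM, HID_DIM)) * 0.1  # autoencoder head

def forward(x_short):
    h = np.maximum(0.0, W_enc @ x_short)  # shared hidden representation
    y_mapped = W_reg @ h                  # predicted long-utterance i-vector
    x_recon = W_dec @ h                   # reconstruction of the short i-vector
    return y_mapped, x_recon

def joint_loss(x_short, y_long):
    y_mapped, x_recon = forward(x_short)
    reg = np.sum((y_mapped - y_long) ** 2)  # supervised regression term
    ae = np.sum((x_recon - x_short) ** 2)   # autoencoder term
    return reg + LAMBDA * ae

# Toy short/long i-vector pair (random here; real pairs share a speaker).
x_short = rng.standard_normal(IVEC_DIM)
y_long = rng.standard_normal(IVEC_DIM)
loss = joint_loss(x_short, y_long)

# Evaluation stage: only the mapping path is used before PLDA scoring.
y_hat, _ = forward(x_short)
```

At evaluation time only `y_hat`, the mapped i-vector, is passed to the PLDA back-end; the reconstruction head exists solely to regularize training.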


DOI: 10.21437/Interspeech.2017-430

Cite as: Guo, J., Nookala, U.A., Alwan, A. (2017) CNN-Based Joint Mapping of Short and Long Utterance i-Vectors for Speaker Verification Using Short Utterances. Proc. Interspeech 2017, 3712-3716, DOI: 10.21437/Interspeech.2017-430.


@inproceedings{Guo2017,
  author={Jinxi Guo and Usha Amrutha Nookala and Abeer Alwan},
  title={CNN-Based Joint Mapping of Short and Long Utterance i-Vectors for Speaker Verification Using Short Utterances},
  year={2017},
  booktitle={Proc. Interspeech 2017},
  pages={3712--3716},
  doi={10.21437/Interspeech.2017-430},
  url={http://dx.doi.org/10.21437/Interspeech.2017-430}
}