Deep Speaker Feature Learning for Text-Independent Speaker Verification

Lantian Li, Yixiang Chen, Ying Shi, Zhiyuan Tang, Dong Wang


Recently, deep neural networks (DNNs) have been used to learn speaker features. However, the quality of the learned features is not yet sufficient, so a complex back-end model, either neural or probabilistic, has to be used to address the residual uncertainty when the features are applied to speaker verification. This paper presents a convolutional time-delay deep neural network structure (CT-DNN) for speaker feature learning. Our experimental results on the Fisher database demonstrate that this CT-DNN can produce high-quality speaker features: even with a single feature (0.3 seconds including the context), the equal error rate (EER) can be as low as 7.68%. This effectively confirms that the speaker trait is largely a deterministic short-time property rather than a long-time distributional pattern, and can therefore be extracted from just dozens of frames.
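
For readers unfamiliar with the EER metric quoted above, the following is a minimal sketch of how an equal error rate can be estimated from verification scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the threshold sweep, and the toy scores are all illustrative assumptions.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER by sweeping a decision threshold over all observed scores.

    target_scores:   scores of same-speaker trials (higher means more similar)
    impostor_scores: scores of different-speaker trials
    Returns the average of the false acceptance and false rejection rates
    at the threshold where the two rates are closest.
    """
    thresholds = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False acceptance: impostor trials scoring at or above the threshold.
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        # False rejection: target trials scoring below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


# Toy scores (hypothetical, for illustration only).
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(targets, impostors))
```

With these toy scores, one target trial falls below the best impostor trial, so the two error rates meet at one in four. Practical evaluations typically interpolate between thresholds on a much larger trial list.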


DOI: 10.21437/Interspeech.2017-452

Cite as: Li, L., Chen, Y., Shi, Y., Tang, Z., Wang, D. (2017) Deep Speaker Feature Learning for Text-Independent Speaker Verification. Proc. Interspeech 2017, 1542-1546, DOI: 10.21437/Interspeech.2017-452.


@inproceedings{Li2017,
  author={Lantian Li and Yixiang Chen and Ying Shi and Zhiyuan Tang and Dong Wang},
  title={Deep Speaker Feature Learning for Text-Independent Speaker Verification},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1542--1546},
  doi={10.21437/Interspeech.2017-452},
  url={http://dx.doi.org/10.21437/Interspeech.2017-452}
}