Combined Vector Based on Factorized Time-delay Neural Network for Text-Independent Speaker Recognition

Tianyu Liang, Yi Liu, Can Xu, Xianwei Zhang, Liang He


Currently, the most effective text-independent speaker recognition methods extract speaker embeddings from various deep neural networks. Among them, the x-vector extracted from a factorized time-delay neural network (F-TDNN) has been demonstrated to achieve among the best performance in recent NIST SRE evaluations. In our previous work, we proposed the combined vector (c-vector) and showed that performance can be further improved by introducing phonetic information, which is often ignored when extracting x-vectors. By taking advantage of both F-TDNN and the c-vector, we propose an embedding extraction method termed the factorized combined vector (fc-vector). On the NIST SRE18 CTS task, the EER and minDCF18 of the fc-vector are 12.1% and 10.5% relatively lower than those of the x-vector, and 3.4% and 3.9% relatively lower than those of the c-vector, respectively.
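The core idea behind the F-TDNN that the abstract builds on is to factorize each TDNN layer's weight matrix into a product of two low-rank factors, with one factor constrained to be semi-orthogonal. A minimal numeric sketch of that factorization is below; the layer sizes and the SVD-based projection are illustrative assumptions (in practice the semi-orthogonal constraint is maintained with an iterative update during training), not the paper's exact configuration.

```python
import numpy as np

# Hypothetical layer sizes for illustration only (not the paper's exact setup):
# a TDNN layer mapping a 1536-dim spliced input to 512 outputs,
# factored through a 160-dim bottleneck.
in_dim, out_dim, bottleneck = 1536, 512, 160

rng = np.random.default_rng(0)
# Full weight W (out_dim x in_dim) is replaced by W = A @ B, where
# B (bottleneck x in_dim) is constrained to be semi-orthogonal: B @ B.T = I.
A = rng.standard_normal((out_dim, bottleneck))
B = rng.standard_normal((bottleneck, in_dim))

def make_semi_orthogonal(M):
    """Project M (rows <= cols) onto the nearest semi-orthogonal matrix
    via SVD: set all singular values to 1, so M @ M.T = I."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

B = make_semi_orthogonal(B)
assert np.allclose(B @ B.T, np.eye(bottleneck))

# Parameter saving from the factorization:
full_params = out_dim * in_dim                                # 786432
factored_params = out_dim * bottleneck + bottleneck * in_dim  # 327680
print(full_params, factored_params)
```

The factorization keeps the layer's input/output dimensions unchanged while cutting its parameter count by more than half here, which is what lets F-TDNN-based embeddings go deeper at a comparable model size.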


 DOI: 10.21437/Odyssey.2020-61

Cite as: Liang, T., Liu, Y., Xu, C., Zhang, X., He, L. (2020) Combined Vector Based on Factorized Time-delay Neural Network for Text-Independent Speaker Recognition. Proc. Odyssey 2020 The Speaker and Language Recognition Workshop, 428-432, DOI: 10.21437/Odyssey.2020-61.


@inproceedings{Liang2020,
  author={Tianyu Liang and Yi Liu and Can Xu and Xianwei Zhang and Liang He},
  title={{Combined Vector Based on Factorized Time-delay Neural Network for Text-Independent Speaker Recognition}},
  year=2020,
  booktitle={Proc. Odyssey 2020 The Speaker and Language Recognition Workshop},
  pages={428--432},
  doi={10.21437/Odyssey.2020-61},
  url={http://dx.doi.org/10.21437/Odyssey.2020-61}
}