Currently, the most effective approach to text-independent speaker recognition is to extract speaker embeddings from deep neural networks. Among these, the x-vector extracted from a factorized time-delay neural network (F-TDNN) has achieved some of the best performance in recent NIST SRE evaluations. In our previous work, we proposed the combined vector (c-vector) and showed that performance can be further improved by introducing phonetic information, which is often ignored when extracting x-vectors. Taking advantage of both the F-TDNN and the c-vector, we propose an embedding extraction method termed the factorized combined vector (fc-vector). On the NIST SRE18 CTS task, the EER and minDCF18 of the fc-vector are 12.1% and 10.5% relatively lower than those of the x-vector, and 3.4% and 3.9% relatively lower than those of the c-vector, respectively.
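The core idea behind the F-TDNN is to replace each full-rank TDNN weight matrix with a product of two smaller factors, one of which is kept semi-orthogonal during training. The sketch below illustrates that factorization with NumPy; the layer dimensions are hypothetical (not taken from the paper), and the semi-orthogonal constraint is applied via an SVD projection rather than the iterative update used in practice.

```python
import numpy as np

# Hypothetical dimensions (illustrative only): a TDNN layer mapping a
# 1536-dim spliced input to 512 outputs, factorized through a 160-dim
# linear bottleneck, as in F-TDNN architectures.
d_in, d_out, bottleneck = 1536, 512, 160

rng = np.random.default_rng(0)

# Full-rank weight matrix M vs. its two-factor replacement M ~ B @ A,
# where the first factor A is constrained to be semi-orthogonal.
M_full = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((bottleneck, d_in))  # semi-orthogonal factor
B = rng.standard_normal((d_out, bottleneck))

# Enforce the semi-orthogonal constraint A @ A.T = I. Training-time
# implementations apply a small iterative correction; here we simply
# project A onto the nearest semi-orthogonal matrix via SVD.
U, _, Vt = np.linalg.svd(A, full_matrices=False)
A = U @ Vt

# The factorization substantially reduces the parameter count.
params_full = M_full.size          # 512 * 1536
params_fact = A.size + B.size      # 160 * 1536 + 512 * 160
print(params_full, params_fact)
```

With these example dimensions the factorized layer uses fewer than half the parameters of the full-rank layer, which is what allows F-TDNNs to be made deeper at a comparable parameter budget.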
Cite as: Liang, T., Liu, Y., Xu, C., Zhang, X., He, L. (2020) Combined Vector Based on Factorized Time-delay Neural Network for Text-Independent Speaker Recognition. Proc. The Speaker and Language Recognition Workshop (Odyssey 2020), 428-432, doi: 10.21437/Odyssey.2020-61
@inproceedings{liang20_odyssey,
  author={Tianyu Liang and Yi Liu and Can Xu and Xianwei Zhang and Liang He},
  title={{Combined Vector Based on Factorized Time-delay Neural Network for Text-Independent Speaker Recognition}},
  year=2020,
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2020)},
  pages={428--432},
  doi={10.21437/Odyssey.2020-61}
}