Far-Field End-to-End Text-Dependent Speaker Verification Based on Mixed Training Data with Transfer Learning and Enrollment Data Augmentation

Xiaoyi Qin, Danwei Cai, Ming Li


In this paper, we focus on far-field end-to-end text-dependent speaker verification when the training data consist of a small-scale far-field text-dependent dataset and a large-scale close-talking text-independent database. First, we show that simulating far-field text-independent data from the existing large-scale clean database for data augmentation can reduce the mismatch between the clean training data and the far-field test condition. Second, fine-tuning the deep speaker embedding model, pre-trained on both the simulated far-field data and the original clean text-independent data, with a small far-field text-dependent dataset significantly improves system performance. Third, in special applications where close-talking clean utterances are used for enrollment and real far-field noisy utterances are used for testing, adding reverberant noise to the clean enrollment data further improves system performance. We evaluate our methods on the AISHELL ASR0009 and AISHELL 2019B-eval databases and achieve an equal error rate (EER) of 5.75% for far-field text-dependent speaker verification under noisy environments.
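
The far-field simulation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the authors simulate their data, so everything below is an assumption: the function names `simulate_far_field` and `toy_rir`, the synthetic exponentially decaying impulse response, the white-noise stand-ins for speech and background noise, and the 10 dB signal-to-noise ratio are all hypothetical choices. The same operation could in principle be applied to clean enrollment utterances as well as to training data.

```python
import numpy as np


def simulate_far_field(clean, rir, noise, snr_db):
    """Convolve clean audio with a room impulse response, then add noise at snr_db."""
    # Reverberate and trim back to the original length.
    reverberant = np.convolve(clean, rir)[: len(clean)]
    # Tile or trim the noise so it covers the whole signal.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]
    signal_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(signal_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise


def toy_rir(sample_rate=16000, rt60=0.5, seed=0):
    """Hypothetical impulse response: white noise whose amplitude decays 60 dB over rt60."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(rt60 * sample_rate)) / sample_rate
    envelope = 10 ** (-3 * t / rt60)
    rir = rng.standard_normal(len(t)) * envelope
    rir[0] = 1.0  # direct path
    return rir / np.max(np.abs(rir))


# Stand-in signals (assumptions): one second of white noise as "speech",
# half a second of white noise as background noise.
rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(8000)
far = simulate_far_field(clean, toy_rir(), noise, snr_db=10)
```

In practice, measured or simulated room impulse responses and recorded noise would replace the synthetic signals used here.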


DOI: 10.21437/Interspeech.2019-1542

Cite as: Qin, X., Cai, D., Li, M. (2019) Far-Field End-to-End Text-Dependent Speaker Verification Based on Mixed Training Data with Transfer Learning and Enrollment Data Augmentation. Proc. Interspeech 2019, 4045-4049, DOI: 10.21437/Interspeech.2019-1542.


@inproceedings{Qin2019,
  author={Xiaoyi Qin and Danwei Cai and Ming Li},
  title={{Far-Field End-to-End Text-Dependent Speaker Verification Based on Mixed Training Data with Transfer Learning and Enrollment Data Augmentation}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4045--4049},
  doi={10.21437/Interspeech.2019-1542},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1542}
}