End-to-end Text-dependent Speaker Verification Using Novel Distance Measures

Subhadeep Dey, Srikanth Madikeri, Petr Motlicek

This paper explores novel ideas for building an end-to-end deep neural network (DNN) based text-dependent speaker verification (SV) system. The baseline approach maps a variable-length speech segment to a fixed-dimensional speaker vector by estimating the mean of the hidden representations in the DNN structure. The distance between two utterances is obtained by computing the L2 norm between the vectors. This approach performs worse than the conventional Gaussian Mixture Model-Universal Background Model (GMM-UBM) based system on publicly available corpora. We believe that the poor performance is due to the averaging operation, which may not capture the phonetic information of an utterance. Past studies indicate that techniques exploiting phonetic information in addition to speaker information are beneficial for this task. This paper therefore proposes to incorporate the content information of the speech signal by computing a distance function over the linguistic units co-occurring between enrollment and test data. The whole network is optimized end-to-end with a triplet-loss objective to output SV scores. Experiments on the RSR2015 dataset show that the proposed approach outperforms the GMM-UBM system by 48% and 36% relative in equal error rate for the fixed-phrase and random-digit conditions, respectively.
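
The contrast between the baseline distance and a content-aware distance can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the per-frame linguistic-unit labels, the choice to average the per-unit distances, and the margin value are all assumptions made for illustration, and the DNN that would produce the frame-level hidden representations is omitted.

```python
import math
from collections import defaultdict


def mean_vector(frames):
    """Average a list of equal-length frame vectors into a single vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def l2_distance(a, b):
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def baseline_distance(frames_a, frames_b):
    """Baseline as described in the abstract: mean-pool each utterance, then take the L2 distance."""
    return l2_distance(mean_vector(frames_a), mean_vector(frames_b))


def unit_means(frames, units):
    """Mean hidden vector per linguistic unit (unit labels are assumed given, one per frame)."""
    groups = defaultdict(list)
    for frame, unit in zip(frames, units):
        groups[unit].append(frame)
    return {unit: mean_vector(fs) for unit, fs in groups.items()}


def content_matched_distance(frames_a, units_a, frames_b, units_b):
    """Hypothetical content-aware distance: compare only units present in both utterances.

    Averaging the per-unit L2 distances is an assumption of this sketch.
    """
    means_a = unit_means(frames_a, units_a)
    means_b = unit_means(frames_b, units_b)
    shared = sorted(set(means_a) & set(means_b))
    if not shared:
        raise ValueError("no linguistic units co-occur in the two utterances")
    return sum(l2_distance(means_a[u], means_b[u]) for u in shared) / len(shared)


def triplet_loss(d_pos, d_neg, margin=1.0):
    """Standard hinge-style triplet loss on an anchor-positive and an anchor-negative distance."""
    return max(0.0, d_pos - d_neg + margin)
```

In this sketch, plain mean pooling mixes frames from all units into one vector, whereas the content-matched variant compares like with like and ignores units that appear in only one of the two utterances.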

DOI: 10.21437/Interspeech.2018-2300

Cite as: Dey, S., Madikeri, S., Motlicek, P. (2018) End-to-end Text-dependent Speaker Verification Using Novel Distance Measures. Proc. Interspeech 2018, 3598-3602, DOI: 10.21437/Interspeech.2018-2300.

  author={Subhadeep Dey and Srikanth Madikeri and Petr Motlicek},
  title={End-to-end Text-dependent Speaker Verification Using Novel Distance Measures},
  booktitle={Proc. Interspeech 2018},