Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System

Weicheng Cai, Jinkun Chen, Ming Li


In this paper, we explore the encoding/pooling layer and the loss function in an end-to-end speaker and language recognition system. First, we develop a unified and interpretable end-to-end system for both speaker and language recognition. It accepts variable-length input and produces an utterance-level result. In this system, the encoding layer aggregates the variable-length input sequence into an utterance-level representation. Besides the basic temporal average pooling, we introduce a self-attentive pooling layer and a learnable dictionary encoding layer to obtain the utterance-level representation. In terms of the loss function for open-set speaker verification, we introduce center loss and angular softmax loss into the end-to-end system to obtain more discriminative speaker embeddings. Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the performance of the end-to-end learning system can be significantly improved by the proposed encoding layers and loss functions.
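To make the role of the encoding layer concrete, the following is a minimal sketch of how a variable-length sequence of frame-level features can be reduced to a single fixed-size vector. It is not the authors' implementation. The abstract gives no formulas, so the attention form used here (a tanh projection scored against a context vector, then a softmax over time) is an assumption, and the names `weight`, `bias`, and `context` are hypothetical stand-ins for learned parameters. Plain Python lists are used in place of a tensor library to keep the example self-contained.

```python
import math


def temporal_average_pooling(frames):
    """Average T frame vectors (each of size D) into one vector of size D."""
    num_frames = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / num_frames for d in range(dim)]


def self_attentive_pooling(frames, weight, bias, context):
    """Attention-weighted mean over time (assumed form, see lead-in).

    frames:  T x D frame-level features
    weight:  H x D projection matrix (hypothetical learned parameter)
    bias:    H projection bias (hypothetical learned parameter)
    context: H context vector used to score each frame (hypothetical)
    Returns the pooled D-dim vector and the T attention weights.
    """
    scores = []
    for frame in frames:
        # Project the frame and squash it with tanh.
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, frame)) + b)
            for row, b in zip(weight, bias)
        ]
        # Score the frame by its similarity to the context vector.
        scores.append(sum(h * u for h, u in zip(hidden, context)))

    # Numerically stable softmax over the time axis.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    dim = len(frames[0])
    pooled = [
        sum(a * frame[d] for a, frame in zip(alphas, frames)) for d in range(dim)
    ]
    return pooled, alphas


# Two utterances of different lengths map to vectors of the same size.
short_utt = [[1.0, 2.0], [3.0, 4.0]]
long_utt = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
weight = [[0.5, -0.5], [1.0, 1.0]]
bias = [0.0, 0.0]
context = [0.3, 0.7]

avg_short = temporal_average_pooling(short_utt)
avg_long = temporal_average_pooling(long_utt)
att_long, att_weights = self_attentive_pooling(long_utt, weight, bias, context)
```

In this sketch, a zero context vector gives every frame the same score, so the attention weights become uniform and self-attentive pooling reduces to temporal average pooling. A non-zero context vector lets some frames contribute more than others.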


DOI: 10.21437/Odyssey.2018-11

Cite as: Cai, W., Chen, J., Li, M. (2018) Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System. Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 74-81, DOI: 10.21437/Odyssey.2018-11.


@inproceedings{Cai2018,
  author={Weicheng Cai and Jinkun Chen and Ming Li},
  title={Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System},
  year=2018,
  booktitle={Proc. Odyssey 2018 The Speaker and Language Recognition Workshop},
  pages={74--81},
  doi={10.21437/Odyssey.2018-11},
  url={http://dx.doi.org/10.21437/Odyssey.2018-11}
}