Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System

Zhifu Gao, Yan Song, Ian McLoughlin, Pengcheng Li, Yiheng Jiang, Li-Rong Dai


Deep embedding learning-based speaker verification (SV) methods have recently achieved significant performance improvements over traditional i-vector systems, especially for short-duration utterances. Embedding learning commonly consists of three components: frame-level feature processing, utterance-level embedding learning, and a loss function to discriminate between speakers. For the learned embeddings, a back-end model (e.g., Linear Discriminant Analysis followed by Probabilistic Linear Discriminant Analysis (LDA-PLDA)) is generally applied as a similarity measure. In this paper, we propose to further improve the effectiveness of deep embedding learning methods in the following components: (1) A multi-stage aggregation strategy, exploited to hierarchically fuse time-frequency context information for effective frame-level feature processing. (2) A discriminant analysis loss designed for end-to-end training, which aims to explicitly learn discriminative embeddings, i.e. with small intra-speaker and large inter-speaker variances. To evaluate the effectiveness of the proposed improvements, we conduct extensive experiments on the VoxCeleb1 dataset. The results outperform state-of-the-art systems by a significant margin. It is also worth noting that the results are obtained using a simple cosine metric instead of the more complex LDA-PLDA back-end scoring.
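The abstract notes that verification trials are scored with a simple cosine metric between embeddings rather than an LDA-PLDA back end. A minimal sketch of such scoring, assuming each utterance has already been mapped to a fixed-dimensional NumPy embedding vector (the threshold value below is illustrative, not from the paper):

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray,
                 threshold: float = 0.5) -> bool:
    """Accept the trial if the cosine score exceeds a tuned threshold."""
    return cosine_score(emb_a, emb_b) > threshold
```

In practice the threshold is tuned on a development set to the desired operating point (e.g., the equal error rate reported on VoxCeleb1).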


DOI: 10.21437/Interspeech.2019-1489

Cite as: Gao, Z., Song, Y., McLoughlin, I., Li, P., Jiang, Y., Dai, L. (2019) Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System. Proc. Interspeech 2019, 361-365, DOI: 10.21437/Interspeech.2019-1489.


@inproceedings{Gao2019,
  author={Zhifu Gao and Yan Song and Ian McLoughlin and Pengcheng Li and Yiheng Jiang and Li-Rong Dai},
  title={{Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={361--365},
  doi={10.21437/Interspeech.2019-1489},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1489}
}