Mixup Learning Strategies for Text-Independent Speaker Verification

Yingke Zhu, Tom Ko, Brian Mak

Mixup is a learning strategy that constructs additional virtual training samples by linearly interpolating random pairs of existing training samples. It has been shown that mixup can help avoid data memorization and thus improve model generalization. This paper investigates the mixup learning strategy for training speaker-discriminative deep neural networks (DNNs) to improve text-independent speaker verification.
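The interpolation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `mixup_batch`, the default `alpha=0.2`, and the choice to draw a single weight per batch are assumptions. Drawing the weight from a Beta(alpha, alpha) distribution follows the general mixup formulation; the abstract does not say which setting the paper uses.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    """Build virtual samples by interpolating random pairs in a batch.

    x        : array of shape (batch, ...), the input samples
    y_onehot : array of shape (batch, num_speakers), one-hot labels
    alpha    : Beta distribution parameter (assumed value, not from the paper)
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)        # interpolation weight in [0, 1]
    perm = rng.permutation(len(x))      # random partner for each sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam
```

Because the labels are interpolated with the same weight as the inputs, each mixed label is still a valid probability distribution over the training speakers, so it can be used directly as a soft target for a cross-entropy loss.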

In recent speaker verification systems, a DNN is usually trained to classify the speakers in the training set. At the same time, the DNN learns a low-dimensional speaker embedding, so that embeddings can be generated for any speaker during evaluation. We adapted the mixup strategy to the speaker-discriminative DNN training procedure and studied different mixup schemes, such as performing mixup on MFCC features or on raw audio samples. The mixup learning strategy was evaluated on the NIST SRE 2010, NIST SRE 2016, and SITW evaluation sets. Experimental results show consistent relative improvements of up to 13% in both EER and DCF. We further find that mixup training also consistently improves the DNN’s speaker classification accuracy without requiring any additional data sources.
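The two schemes named above, mixing features and mixing raw audio, are not interchangeable, because feature extraction is non-linear. The sketch below illustrates this under loud assumptions: a log-power spectrum stands in for MFCCs (both apply log compression), the helper `log_power_spectrum`, the frame size, the fixed weight `lam = 0.7`, and the random-noise "waveforms" are all hypothetical and are not taken from the paper.

```python
import numpy as np

def log_power_spectrum(wave, n_fft=256):
    """Hypothetical stand-in for an MFCC front end: framed log-power spectrum."""
    usable = len(wave) // n_fft * n_fft
    frames = wave[:usable].reshape(-1, n_fft)           # non-overlapping frames
    spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
    return np.log(spec + 1e-10)                         # log compression

rng = np.random.default_rng(0)
wave_a = rng.standard_normal(4096)   # placeholder for speaker A's audio
wave_b = rng.standard_normal(4096)   # placeholder for speaker B's audio
lam = 0.7                            # assumed interpolation weight

# Scheme 1: mix the raw audio samples, then extract features.
feats_from_mixed_audio = log_power_spectrum(lam * wave_a + (1 - lam) * wave_b)

# Scheme 2: extract features from each waveform, then mix the features.
mixed_feats = (lam * log_power_spectrum(wave_a)
               + (1 - lam) * log_power_spectrum(wave_b))

# The two results differ because the logarithm is not a linear operation.
max_gap = np.max(np.abs(feats_from_mixed_audio - mixed_feats))
```

Both schemes produce arrays of the same shape, so either can feed the same network; they simply present it with different virtual samples.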

DOI: 10.21437/Interspeech.2019-2250

Cite as: Zhu, Y., Ko, T., Mak, B. (2019) Mixup Learning Strategies for Text-Independent Speaker Verification. Proc. Interspeech 2019, 4345-4349, DOI: 10.21437/Interspeech.2019-2250.

  author={Yingke Zhu and Tom Ko and Brian Mak},
  title={{Mixup Learning Strategies for Text-Independent Speaker Verification}},
  booktitle={Proc. Interspeech 2019},