Speaker Augmentation and Bandwidth Extension for Deep Speaker Embedding

Hitoshi Yamamoto, Kong Aik Lee, Koji Okabe, Takafumi Koshinaka


This paper investigates a novel data augmentation approach for training deep neural networks (DNNs) used for speaker embedding, i.e., for extracting representations that allow speaker voices to be compared with a simple geometric operation. Data augmentation creates new examples from an existing training set, increasing the quantity of training data and thereby improving the robustness of the model. We attempt to increase the number of speakers in the training set by generating new speakers via voice conversion. This speaker augmentation expands the coverage of speakers in the embedding space, in contrast to conventional audio augmentation methods, which focus on within-speaker variability. With an increased number of speakers in the training set, the DNN is trained to produce a better speaker-discriminative embedding. We also advocate using bandwidth extension to augment narrowband speech for a wideband application. Text-independent speaker recognition experiments on the Speakers in the Wild (SITW) corpus demonstrate a 17.9% reduction in minimum detection cost with speaker augmentation. The combined use of the two techniques provides further improvement.
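
The sketch below is not the authors' implementation; it only illustrates, under stated assumptions, the two augmentation ideas named in the abstract: speaker augmentation, where a voice conversion model maps an utterance to a new synthetic speaker identity, and bandwidth extension, approximated here by plain upsampling of narrowband (8 kHz) speech to the wideband (16 kHz) rate. The convert_voice function is a hypothetical stand-in for the paper's (unspecified) conversion model, and a real bandwidth-extension model would also regenerate the missing 4-8 kHz band rather than merely resample.

import numpy as np
from scipy.signal import resample_poly


def bandwidth_extend(narrowband: np.ndarray, orig_sr: int = 8000,
                     target_sr: int = 16000) -> np.ndarray:
    """Upsample narrowband speech to the wideband sampling rate.

    Plain polyphase resampling only changes the rate; an actual
    bandwidth-extension model would also synthesize the upper band.
    """
    return resample_poly(narrowband, target_sr, orig_sr)


def convert_voice(waveform: np.ndarray, speaker_factor: float) -> np.ndarray:
    """Hypothetical stand-in for a voice conversion model.

    A resampling-based warp (shifting pitch and formants together) acts
    as a crude proxy for generating a "new" speaker from an existing one.
    """
    warped = resample_poly(waveform, int(1000 * speaker_factor), 1000)
    return warped.astype(waveform.dtype)


def augment_training_set(utterances, speaker_factors=(0.9, 1.1)):
    """Yield each original utterance plus speaker-augmented copies,
    each copy paired with a new synthetic speaker label."""
    for spk_id, wav in utterances:
        yield spk_id, wav
        for f in speaker_factors:
            yield f"{spk_id}_vc{f}", convert_voice(wav, f)


if __name__ == "__main__":
    # Toy 1-second narrowband "utterance" (a 200 Hz tone) for demonstration.
    sr = 8000
    t = np.linspace(0, 1.0, sr, endpoint=False)
    toy_utterance = np.sin(2 * np.pi * 200 * t).astype(np.float32)

    wideband = bandwidth_extend(toy_utterance, orig_sr=sr, target_sr=16000)
    augmented = list(augment_training_set([("spk0001", wideband)]))
    print(f"wideband samples: {len(wideband)}, augmented examples: {len(augmented)}")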


 DOI: 10.21437/Interspeech.2019-1508

Cite as: Yamamoto, H., Lee, K.A., Okabe, K., Koshinaka, T. (2019) Speaker Augmentation and Bandwidth Extension for Deep Speaker Embedding. Proc. Interspeech 2019, 406-410, DOI: 10.21437/Interspeech.2019-1508.


@inproceedings{Yamamoto2019,
  author={Hitoshi Yamamoto and Kong Aik Lee and Koji Okabe and Takafumi Koshinaka},
  title={{Speaker Augmentation and Bandwidth Extension for Deep Speaker Embedding}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={406--410},
  doi={10.21437/Interspeech.2019-1508},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1508}
}