Combining Speaker Recognition and Metric Learning for Speaker-Dependent Representation Learning

João Monteiro, Jahangir Alam, Tiago H. Falk


In this paper, we tackle automatic speaker verification in a text-independent setting. Speaker modelling is performed by a deep convolutional neural network operating on time-frequency speech representations. Convolutions over the time dimension allow the model to capture both short-term dependencies, since the learned filters operate over short windows, and long-term dependencies, since depth in a convolutional stack implies that outputs depend on large portions of the input. Additionally, several pooling strategies across the time dimension are compared so as to map variable-length recordings into fixed-dimensional representations while providing the network with an extra mechanism for modelling long-term dependencies. Finally, we propose a training scheme in which a well-known metric learning approach, namely triplet loss minimization, is performed jointly with speaker recognition posed as multi-class classification. Evaluation on well-known datasets and comparisons with state-of-the-art benchmarks show that the proposed setting is effective in yielding speaker-dependent representations and is thus well-suited for downstream voice biometrics tasks.
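The joint objective described above combines a triplet loss over embeddings with a cross-entropy speaker-identification loss. Below is a minimal NumPy sketch of such a combined objective on toy data; the embedding dimension, the weighting factor `alpha`, and the margin value are illustrative assumptions, not values from the paper, and the embeddings are random stand-ins for encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Project embeddings onto the unit hypersphere
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge on squared Euclidean distances: pull anchor-positive
    # pairs together, push anchor-negative pairs apart by a margin
    d_ap = np.sum((anchor - positive) ** 2, axis=1)
    d_an = np.sum((anchor - negative) ** 2, axis=1)
    return np.mean(np.maximum(0.0, d_ap - d_an + margin))

def cross_entropy(logits, labels):
    # Softmax cross-entropy for the multi-class speaker-ID head
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

# Toy batch: random embeddings standing in for encoder outputs
emb_dim, n_spk, batch = 16, 10, 8
anchor = l2_normalize(rng.normal(size=(batch, emb_dim)))
positive = l2_normalize(anchor + 0.1 * rng.normal(size=(batch, emb_dim)))
negative = l2_normalize(rng.normal(size=(batch, emb_dim)))
logits = rng.normal(size=(batch, n_spk))
labels = rng.integers(0, n_spk, size=batch)

# Joint objective: metric-learning term plus speaker-ID cross-entropy
alpha = 1.0  # hypothetical weighting between the two terms
loss = triplet_loss(anchor, positive, negative) + alpha * cross_entropy(logits, labels)
print(float(loss))
```

In a real training loop both terms would be backpropagated through a shared encoder, so the embedding space is shaped simultaneously by the discriminative speaker-ID signal and the relative-distance constraints of the triplet term.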


DOI: 10.21437/Interspeech.2019-2974

Cite as: Monteiro, J., Alam, J., Falk, T.H. (2019) Combining Speaker Recognition and Metric Learning for Speaker-Dependent Representation Learning. Proc. Interspeech 2019, 4015-4019, DOI: 10.21437/Interspeech.2019-2974.


@inproceedings{Monteiro2019,
  author={João Monteiro and Jahangir Alam and Tiago H. Falk},
  title={{Combining Speaker Recognition and Metric Learning for Speaker-Dependent Representation Learning}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4015--4019},
  doi={10.21437/Interspeech.2019-2974},
  url={http://dx.doi.org/10.21437/Interspeech.2019-2974}
}