This paper presents a novel approach, which we term Speaker2Vec, for deriving a speaker-characteristics manifold learned in an unsupervised manner. The proposed representation can be employed in different applications such as diarization, speaker identification or, as in our evaluation test case, speaker segmentation. Speaker2Vec exploits large amounts of unlabeled training data and the assumption of short-term active-speaker stationarity to derive a speaker embedding using Deep Neural Networks (DNNs). We assume that temporally near speech segments belong to the same speaker, so a joint representation connecting these nearby segments can encode their common information. This bottleneck representation therefore captures mainly speaker-specific information, and training can take place in a completely unsupervised manner. For testing, our trained model generates embeddings for the test audio, and a simple distance metric is applied to them to detect speaker-change points. The paper also proposes a strategy for unsupervised adaptation of the DNN models to the application domain. The proposed method outperforms state-of-the-art speaker segmentation algorithms and MFCC-based baseline methods on four evaluation datasets, and it allows for further improvements by employing this embedding in supervised training methods.
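The test-time step described in the abstract, comparing embeddings of the test audio with a simple distance metric to find speaker-change points, can be illustrated with a minimal sketch. The abstract does not name the metric, the threshold, or the peak-picking rule, so everything here is an assumption made for illustration: cosine distance is used as the metric, `detect_change_points` and `threshold` are hypothetical names, and the embeddings are taken to be one vector per consecutive analysis window already produced by some trained model.

```python
import numpy as np


def cosine_distance(a, b):
    """Cosine distance between two 1-D vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return 1.0 - float(np.dot(a, b) / denom)


def detect_change_points(embeddings, threshold=0.5):
    """Return window indices i at which a speaker change is hypothesized.

    A change is flagged at window i when the distance between window i-1 and
    window i exceeds `threshold` and is a local peak of the distance curve.
    Both the metric and the threshold value are illustrative assumptions.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Distance between each pair of adjacent windows: dists[k] is (k, k+1).
    dists = [cosine_distance(emb[i - 1], emb[i]) for i in range(1, len(emb))]

    changes = []
    for k, d in enumerate(dists):
        left = dists[k - 1] if k > 0 else -np.inf
        right = dists[k + 1] if k + 1 < len(dists) else -np.inf
        # Strict on the left, non-strict on the right, so a flat peak is
        # reported once rather than at every tied position.
        if d > threshold and d > left and d >= right:
            changes.append(k + 1)
    return changes


# Toy input: three windows near one direction, then three near another.
toy = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
print(detect_change_points(toy))
```

In this toy input the only large adjacent-window distance falls between the third and fourth windows, which is where the sketch places the change. On real audio the threshold would need tuning on held-out data.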
Cite as: Jati, A., Georgiou, P. (2017) Speaker2Vec: Unsupervised Learning and Adaptation of a Speaker Manifold Using Deep Neural Networks with an Evaluation on Speaker Segmentation. Proc. Interspeech 2017, 3567-3571, doi: 10.21437/Interspeech.2017-1650
@inproceedings{jati17_interspeech,
  author={Arindam Jati and Panayiotis Georgiou},
  title={{Speaker2Vec: Unsupervised Learning and Adaptation of a Speaker Manifold Using Deep Neural Networks with an Evaluation on Speaker Segmentation}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={3567--3571},
  doi={10.21437/Interspeech.2017-1650}
}