The performance of a state-of-the-art speaker verification system is severely degraded when it is presented with trial recordings of short duration. In this work we propose to use deep neural networks to learn short-duration speaker embeddings. We focus on the 5s-5s condition, wherein both sides of a verification trial are 5 seconds long. In our previous work we established that learning a non-linear mapping from i-vectors to speaker labels is beneficial for speaker verification [1]. In this work we take the idea of learning a speaker classifier one step further: we apply deep neural networks directly to time-frequency speech representations. We propose two feed-forward network architectures for this task. Our best model is based on a deep convolutional architecture wherein recordings are treated as images. Based on our experimental findings, we advocate treating utterances as images or ‘speaker snapshots’, much as in face recognition. Our convolutional speaker embeddings perform significantly better than i-vectors when scoring is done with cosine distance, yielding a relative improvement of 23.5%. Combined with cosine distance, the proposed deep embeddings also outperform a state-of-the-art i-vector verification system by 1%, providing further empirical evidence in favor of our learned speaker features.
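To make the idea of treating an utterance as a ‘speaker snapshot’ concrete, the sketch below shows one plausible realization: a small convolutional network that maps a single-channel time-frequency “image” (e.g. a log-mel spectrogram of a 5-second recording) to a fixed-dimensional embedding, trained with a speaker-classification head, with verification trials scored by cosine distance between embeddings. This is a minimal illustration only, not the architecture of the paper; the layer sizes, 40 mel bands, 256-dimensional embedding, and PyTorch framework are all assumptions introduced here.

```python
# Illustrative sketch, NOT the authors' exact model: a CNN over a spectrogram
# "image" producing a speaker embedding, scored with cosine distance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSpeakerEmbedder(nn.Module):
    """Treats a short utterance's spectrogram as a single-channel image."""
    def __init__(self, emb_dim=256, num_speakers=1000):  # sizes are assumptions
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),               # pool over frequency and time
        )
        self.embedding = nn.Linear(128, emb_dim)   # speaker embedding layer
        self.classifier = nn.Linear(emb_dim, num_speakers)  # training head only

    def embed(self, spec):
        # spec: (batch, 1, n_mels, n_frames)
        x = self.features(spec).flatten(1)
        return self.embedding(x)

    def forward(self, spec):
        # Speaker-ID logits used only during training with a classification loss.
        return self.classifier(self.embed(spec))

# Scoring a 5s-5s verification trial with cosine distance between embeddings.
model = ConvSpeakerEmbedder()
enrol = torch.randn(1, 1, 40, 500)   # ~5 s of 10 ms frames, 40 mel bands (assumed)
test  = torch.randn(1, 1, 40, 500)
score = F.cosine_similarity(model.embed(enrol), model.embed(test))
```

At test time only the `embed` branch is needed; the classification head exists solely to supervise embedding learning, mirroring the speaker-classifier training described above.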
Cite as: Bhattacharya, G., Alam, J., Kenny, P. (2017) Deep Speaker Embeddings for Short-Duration Speaker Verification. Proc. Interspeech 2017, 1517-1521, doi: 10.21437/Interspeech.2017-1575
@inproceedings{bhattacharya17_interspeech,
  author={Gautam Bhattacharya and Jahangir Alam and Patrick Kenny},
  title={{Deep Speaker Embeddings for Short-Duration Speaker Verification}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={1517--1521},
  doi={10.21437/Interspeech.2017-1575}
}