Auto-Encoding Nearest Neighbor i-Vectors for Speaker Verification

Umair Khan, Miquel India, Javier Hernando


In recent years, i-vectors followed by cosine or PLDA scoring have been the state-of-the-art approach in speaker verification. PLDA requires labeled background data, and there is a significant performance gap between the two scoring techniques. In this work, we propose to reduce this gap by using an autoencoder to transform i-vectors into a new speaker vector representation, which we refer to as the ae-vector. Instead of reconstructing the training i-vectors themselves, as is usual, the autoencoder is trained to reconstruct their nearest neighbor i-vectors. These neighbors are selected in an unsupervised manner, according to the highest cosine scores against the training i-vectors. The evaluation is performed on the speaker verification trials of the VoxCeleb-1 database. The experiments show that the proposed ae-vectors achieve a relative improvement of 42% in terms of EER over conventional i-vectors with cosine scoring, closing 92% of the performance gap between cosine and PLDA scoring, but without using speaker labels.
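The abstract's key unsupervised step is selecting, for each training i-vector, the neighbor i-vectors with the highest cosine scores to serve as the autoencoder's reconstruction targets. A minimal sketch of that selection in NumPy is below; the function name and the top-k formulation are illustrative assumptions, not the authors' code:

```python
import numpy as np

def select_neighbor_targets(ivectors, k=1):
    """For each i-vector (one per row), return the indices of its k
    nearest neighbors by cosine score, excluding itself. These neighbor
    i-vectors would serve as the autoencoder's reconstruction targets
    (hypothetical sketch of the paper's unsupervised selection step)."""
    # length-normalize rows so the dot product equals the cosine score
    X = ivectors / np.linalg.norm(ivectors, axis=1, keepdims=True)
    scores = X @ X.T                    # pairwise cosine score matrix
    np.fill_diagonal(scores, -np.inf)   # never pick the vector itself
    # sort each row descending and keep the top-k neighbor indices
    idx = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    return idx

# toy example: four 3-D "i-vectors" forming two tight pairs
ivecs = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.9, 0.1]])
targets = select_neighbor_targets(ivecs, k=1)
# each vector's nearest neighbor is its pair partner:
# targets[0] -> 1, targets[1] -> 0, targets[2] -> 3, targets[3] -> 2
```

The autoencoder would then be trained with `ivecs[i]` as input and `ivecs[targets[i]]` as the reconstruction target, rather than the usual identity mapping, so the learned bottleneck must capture what a vector shares with its neighbors (presumably speaker identity) rather than its individual details.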


DOI: 10.21437/Interspeech.2019-1444

Cite as: Khan, U., India, M., Hernando, J. (2019) Auto-Encoding Nearest Neighbor i-Vectors for Speaker Verification. Proc. Interspeech 2019, 4060-4064, DOI: 10.21437/Interspeech.2019-1444.


@inproceedings{Khan2019,
  author={Umair Khan and Miquel India and Javier Hernando},
  title={{Auto-Encoding Nearest Neighbor i-Vectors for Speaker Verification}},
  year=2019,
  booktitle={Proc. Interspeech 2019},
  pages={4060--4064},
  doi={10.21437/Interspeech.2019-1444},
  url={http://dx.doi.org/10.21437/Interspeech.2019-1444}
}