Employing Phonetic Information in DNN Speaker Embeddings to Improve Speaker Recognition Performance

Md Hafizur Rahman, Ivan Himawan, Mitchell McLaren, Clinton Fookes, Sridha Sridharan


The recent speaker embeddings framework has been shown to provide excellent performance on the task of text-independent speaker recognition. The framework is based on a deep neural network (DNN) trained to directly discriminate between speakers from traditional acoustic features such as Mel-frequency cepstral coefficients (MFCCs). Prior studies on speaker recognition have found that phonetic information is valuable for speaker identification, with systems based on either bottleneck features (BFs) or tied-triphone state posteriors from a DNN trained for speech recognition. In this paper, we analyze the role of phonetic BFs for DNN embeddings and explore methods to enhance the BFs further. Experimental results show that exploiting phonetic information encoded in BFs is very valuable for DNN speaker embeddings. Enriching the BFs using a cascaded DNN multi-task architecture is also shown to provide further improvements to the speaker embedding system.
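As a rough illustration of how phonetic BFs might be supplied to an embedding system, the sketch below appends a BF vector to the acoustic feature vector of each frame and then pools the frames into one fixed-size vector with per-dimension mean and standard deviation. This is a sketch under stated assumptions, not the authors' pipeline: the abstract does not say how the BFs are combined with the acoustic features or how frames are aggregated, so the frame-level concatenation, the statistics pooling, the function names `concat_features` and `stats_pool`, and all dimensions are hypothetical. The DNN layers themselves are omitted, and random numbers stand in for real features.

```python
import math
import random


def concat_features(mfcc, bf):
    """Append each frame's bottleneck features to its acoustic features."""
    if len(mfcc) != len(bf):
        raise ValueError("MFCC and BF streams must have the same number of frames")
    return [m + b for m, b in zip(mfcc, bf)]


def stats_pool(frames):
    """Reduce a variable-length frame sequence to per-dimension mean and std."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds


# Hypothetical sizes: 200 frames, 20-dim MFCCs, 80-dim BFs (not from the paper).
random.seed(0)
num_frames, mfcc_dim, bf_dim = 200, 20, 80
mfcc = [[random.gauss(0.0, 1.0) for _ in range(mfcc_dim)] for _ in range(num_frames)]
bf = [[random.gauss(0.0, 1.0) for _ in range(bf_dim)] for _ in range(num_frames)]

pooled = stats_pool(concat_features(mfcc, bf))
print(len(pooled))  # 2 * (mfcc_dim + bf_dim)
```

The pooled vector has the same length regardless of utterance duration. In an embedding network of this general kind, frame-level layers would typically transform the features before pooling, and the embedding would be read from a layer after it.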


DOI: 10.21437/Interspeech.2018-1804

Cite as: Rahman, M.H., Himawan, I., McLaren, M., Fookes, C., Sridharan, S. (2018) Employing Phonetic Information in DNN Speaker Embeddings to Improve Speaker Recognition Performance. Proc. Interspeech 2018, 3593-3597, DOI: 10.21437/Interspeech.2018-1804.


@inproceedings{Rahman2018,
  author={Md Hafizur Rahman and Ivan Himawan and Mitchell McLaren and Clinton Fookes and Sridha Sridharan},
  title={Employing Phonetic Information in DNN Speaker Embeddings to Improve Speaker Recognition Performance},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={3593--3597},
  doi={10.21437/Interspeech.2018-1804},
  url={http://dx.doi.org/10.21437/Interspeech.2018-1804}
}