In this work we assess the recently proposed hybrid Deep Neural Network/Gaussian Mixture Model (DNN/GMM) approach for speaker recognition, considering the effects of the granularity of the phonetic DNN model and of the precision of the corresponding GMM models, which we refer to as the phonetic GMMs. The aim of this work is to better understand the contribution of the phonetic information provided by the DNN model relative to the accuracy of the acoustic GMMs in fitting the distribution of the features associated with a given context-dependent phone state. The testbed for this work was the text-independent speaker recognition task defined by NIST for the 2012 Speaker Recognition Evaluation. Our experiments confirm that the acoustic and the phonetic GMMs are complementary. Thus, their score combination yields very good results, provided that the DNN is trained on data collected in an environment similar to the one used for testing. We show, however, that using a single Gaussian per DNN state is not the best choice: the best single system was obtained by balancing the phonetic and acoustic precision of a DNN/GMM system.
Bibliographic reference. Cumani, Sandro / Laface, Pietro / Kulsoom, Farzana (2015): "Speaker recognition by means of acoustic and phonetically informed GMMs", in INTERSPEECH-2015, 200-204.