Traditional i-vector speaker recognition systems use a Gaussian mixture model (GMM) to collect sufficient statistics. Recently, replacing this GMM with a deep neural network (DNN) has shown promising results. In this paper, we study a number of open issues that relate to the performance, computational complexity, and applicability of DNNs as part of the full speaker recognition pipeline. The experimental validation is performed on the female part of the SRE12 telephone condition 2, where our DNN-based system produces the best published results. Our study indicates that, for the purpose of speaker recognition, omitting fMLLR speaker adaptation and stopping DNN training early both allow a significant reduction in computation without sacrificing performance. Also, using a full-covariance universal background model (UBM) and a large set of senones produces important performance gains. Finally, the DNN-based approach does not exhibit a strong language dependence: a DNN trained on Spanish data outperforms the conventional GMM-based system on our English task.
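As a minimal sketch of the statistics-collection step mentioned above, the snippet below computes zeroth- and first-order sufficient statistics from frame-level class posteriors. The function name `collect_sufficient_stats`, the array shapes, and the toy data are assumptions made for illustration and are not taken from the paper. The posteriors may come from GMM components or, in the DNN-based approach, from the senone outputs of a DNN; the accumulation itself is the same in both cases.

```python
import numpy as np


def collect_sufficient_stats(posteriors, features):
    """Accumulate zeroth- and first-order statistics for one utterance.

    posteriors: (T, C) array, one row of class posteriors per frame
                (GMM component or DNN senone posteriors; hypothetical input).
    features:   (T, D) array of acoustic feature vectors.
    Returns (zeroth, first) with shapes (C,) and (C, D).
    """
    posteriors = np.asarray(posteriors, dtype=float)
    features = np.asarray(features, dtype=float)
    if posteriors.shape[0] != features.shape[0]:
        raise ValueError("posteriors and features must have the same number of frames")

    # Zeroth order: soft count of frames assigned to each class.
    zeroth = posteriors.sum(axis=0)
    # First order: posterior-weighted sum of feature vectors per class.
    first = posteriors.T @ features
    return zeroth, first


if __name__ == "__main__":
    # Toy data only: random features and softmax-normalized random posteriors.
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(100, 20))
    logits = rng.normal(size=(100, 8))
    post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    n, f = collect_sufficient_stats(post, feats)
    print(n.shape, f.shape)
```

Because each row of posteriors sums to one, the zeroth-order statistics sum to the number of frames, which gives a quick sanity check regardless of which model supplies the posteriors.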
Bibliographic reference. Garcia-Romero, Daniel / McCree, Alan (2015): "Insights into deep neural networks for speaker recognition", In INTERSPEECH-2015, 1141-1145.