This study combines a Gaussian mixture model support vector machine (GMM-SVM) system with a nonlinear feature transformation, discriminatively trained to extract speaker specific features from MFCCs. Separation of the speaker information component and non-speaker related information in the speech signal is accomplished using a regularized siamese deep network (RSDN). RSDN learns a hidden representation that well characterizes speaker information by training a subset of the hidden units using pairs of speech segments. MFCC features are input to a trained RSDN and a subset of hidden layer outputs are used as new input features in a GMM-SVM system. We demonstrate the potential of this approach for text-independent speaker verification by applying it to a subset of the NIST SRE 2006 1conv4w-1conv4w task. The hybrid RSDN GMM-SVM system achieves about 5% relative improvement over the baseline GMM-SVM system.
Bibliographic reference. Price, Ryan / Biswas, Sangeeta / Shinoda, Koichi (2013): "Combining deep speaker specific representations with GMM-SVM for speaker verification", In INTERSPEECH-2013, 2788-2792.