In most speaker recognition systems, speech utterances are not constrained in content or language. In a text-dependent speaker recognition system, by contrast, the lexical content and the language of the speech are known in advance. The goal of this paper is to show that a segmental features (SF) approach can use this information to improve a standard Gaussian mixture model with MFCC features (GMM-MFCC). Speech features such as mean energy, delta energy, pitch, delta pitch, the formants F1–F4 with their bandwidths B1–B4, and the difference between F2 and F1 are calculated on segments and associated with phonemes and phoneme groups for each speaker. The SF and GMM-MFCC approaches are combined by multiplying the outputs of the two classifiers. All experiments are performed on the two versions of TEVOID: TEVOID16, with 16 speakers, and the upgraded TEVOID50, with 50 speakers. On TEVOID16, SF achieves a recognition rate of 84.23%, GMM-MFCC 91.75%, and the combined approach 95.12%. On TEVOID50, SF achieves 68.69%, while GMM-MFCC and the combined model both achieve 95.84%. On both databases, the number of male/female confusions decreases with the combined model. These results suggest that segmental features are a promising way to improve the recognition rate of text-dependent systems.
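The combination step, multiplying the outputs of the two classifiers, can be illustrated with a minimal sketch. The abstract does not specify the form of the classifier outputs, so the code assumes each classifier returns one non-negative, posterior-like score per enrolled speaker. The function names, speaker labels, and score values are hypothetical and are not taken from the paper.

```python
def fuse_by_product(scores_a, scores_b):
    """Multiply per-speaker scores from two classifiers and renormalize.

    Assumption: both inputs map the same speaker IDs to non-negative,
    posterior-like scores. The renormalization is a convenience for
    readability and does not change which speaker ranks first.
    """
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both classifiers must score the same speakers")
    product = {spk: scores_a[spk] * scores_b[spk] for spk in scores_a}
    total = sum(product.values())
    if total == 0:
        raise ValueError("all fused scores are zero")
    return {spk: value / total for spk, value in product.items()}


def identify(scores):
    """Return the speaker ID with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical scores for one test utterance (illustrative values only).
gmm_scores = {"spk_a": 0.5, "spk_b": 0.4, "spk_c": 0.1}  # GMM-MFCC output
sf_scores = {"spk_a": 0.2, "spk_b": 0.6, "spk_c": 0.2}   # SF output

fused = fuse_by_product(gmm_scores, sf_scores)
print(identify(gmm_scores), identify(sf_scores), identify(fused))
```

In this toy case the GMM-MFCC scores alone favor `spk_a`, while the fused scores favor `spk_b`, because the product rule rewards speakers that both classifiers rate reasonably highly. If the classifiers produced log-likelihoods instead, the same rule would amount to adding the two scores.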
Cite as: Milošević, M., Glavitsch, U. (2017) Combining Gaussian Mixture Models and Segmental Feature Models for Speaker Recognition. Proc. Interspeech 2017, 2042-2043
@inproceedings{milosevic17_interspeech,
  author={Milana Milošević and Ulrike Glavitsch},
  title={{Combining Gaussian Mixture Models and Segmental Feature Models for Speaker Recognition}},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2042--2043}
}