Standard Gaussian mixture modelling (GMM) does not capture time sequence information (TSI) other than that which might be embedded in the acoustic features. Dynamic time warping (DTW) relates directly to TSI, time-warping two sequences of features into alignment. Here, a hybrid system embedding DTW into a GMM is presented, and improved automatic speaker verification performance is demonstrated. Testing 1000 speakers in a fully text-independent, world-model-adapted mode shows the equal error rate improving from 4.1% for a standard GMM to 3.8%.
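To illustrate the alignment step mentioned above, the following is a minimal sketch of textbook dynamic time warping between two feature sequences. It is not the hybrid segmental mixture model of the paper, whose details are not given in this abstract. The function name `dtw_align`, the Euclidean local distance, and the symmetric step pattern (diagonal, vertical, horizontal) are assumptions chosen for illustration.

```python
import numpy as np


def dtw_align(a, b):
    """Align two feature sequences with plain DTW (illustrative sketch).

    a, b: arrays of shape (n, d) and (m, d), or 1-D sequences of scalars.
    Returns the accumulated alignment cost and the warping path as a list
    of (index_in_a, index_in_b) pairs.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Treat 1-D input as a sequence of one-dimensional feature frames.
    if a.ndim == 1:
        a = a[:, None]
    if b.ndim == 1:
        b = b[:, None]
    n, m = len(a), len(b)

    # local[i, j]: Euclidean distance between frame i of a and frame j of b
    # (assumed local distance; other measures could be substituted).
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

    # acc[i, j]: minimum accumulated cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance in both sequences
                acc[i - 1, j],      # advance in a only
                acc[i, j - 1],      # advance in b only
            )

    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return float(acc[n, m]), path


if __name__ == "__main__":
    cost, path = dtw_align([0, 1, 2], [0, 1, 1, 2])
    print(cost, path)
```

In this sketch the second sequence repeats one frame, so the warping path maps a single frame of the first sequence onto two frames of the second. This is the kind of time sequence information that a frame-independent GMM score does not use.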
Cite as: Stapert, R.P., Mason, J.S. (2001) A segmental mixture model for speaker recognition. Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001), 2509-2512, doi: 10.21437/Eurospeech.2001-414
@inproceedings{stapert01_eurospeech,
  author={Robert P. Stapert and John S. Mason},
  title={{A segmental mixture model for speaker recognition}},
  year=2001,
  booktitle={Proc. 7th European Conference on Speech Communication and Technology (Eurospeech 2001)},
  pages={2509--2512},
  doi={10.21437/Eurospeech.2001-414}
}