This paper describes a multi-template unsupervised speaker adaptation method based on HMM-Sufficient Statistics. Multiple class-dependent models based on gender and age are used to improve adaptation performance while keeping adaptation time within a few seconds, using just one arbitrary utterance. Adaptation begins by estimating the speaker's class from the N-best neighbor speakers, which are found using Gaussian Mixture Models (GMM) as part of speaker selection. The corresponding template model is adopted as the base model, and the adapted model is then rapidly constructed from the selected HMM-Sufficient Statistics. Experiments are performed under noisy conditions with office, crowd, booth, and car noise at 20 dB SNR. The proposed multi-template method achieves a word correct rate of 89.5%, compared with 88.0% for the conventional single-template method and a baseline recognition rate of 85.7% without adaptation. Results using Vocal Tract Length Normalization (VTLN) and supervised Maximum Likelihood Linear Regression (MLLR) are also compared.
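The selection-and-combination steps summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the data layout, the majority vote used to pick the class, and the restriction to Gaussian mean re-estimation are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np
from collections import Counter


def diag_gmm_loglik(frames, weights, means, variances):
    """Total log-likelihood of frames (T, D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D).
    """
    diff = frames[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)  # (T, M)
    log_comp = log_comp + np.log(weights)
    # Numerically stable log-sum-exp over the mixture components.
    peak = log_comp.max(axis=1, keepdims=True)
    per_frame = peak[:, 0] + np.log(np.exp(log_comp - peak).sum(axis=1))
    return float(per_frame.sum())


def select_n_best(frames, speaker_gmms, n):
    """Return the n database speakers whose GMMs score the utterance highest."""
    scores = {spk: diag_gmm_loglik(frames, *gmm) for spk, gmm in speaker_gmms.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n]


def estimate_class(selected, speaker_class):
    """Assumption: the class is chosen by majority vote over the N-best speakers."""
    return Counter(speaker_class[s] for s in selected).most_common(1)[0][0]


def combine_means(selected, stats):
    """Re-estimate Gaussian means from pooled sufficient statistics.

    stats[spk] = (occupancy (G,), first_order_sum (G, D)), precomputed offline.
    """
    occupancy = sum(stats[s][0] for s in selected)
    first_order = sum(stats[s][1] for s in selected)
    return first_order / occupancy[:, None]


# Toy database: four speakers, 1-D features, one Gaussian each (hypothetical values).
speaker_gmms = {
    "f1": (np.array([1.0]), np.array([[0.0]]), np.array([[1.0]])),
    "f2": (np.array([1.0]), np.array([[1.0]]), np.array([[1.0]])),
    "m1": (np.array([1.0]), np.array([[10.0]]), np.array([[1.0]])),
    "m2": (np.array([1.0]), np.array([[11.0]]), np.array([[1.0]])),
}
speaker_class = {"f1": "female", "f2": "female", "m1": "male", "m2": "male"}
stats = {
    "f1": (np.array([10.0]), np.array([[0.0]])),
    "f2": (np.array([30.0]), np.array([[30.0]])),
    "m1": (np.array([20.0]), np.array([[200.0]])),
    "m2": (np.array([20.0]), np.array([[220.0]])),
}

utterance = np.array([[0.5], [0.4]])                  # one short "utterance"
selected = select_n_best(utterance, speaker_gmms, n=2)
template = estimate_class(selected, speaker_class)    # which template model to start from
new_means = combine_means(selected, stats)
```

Because the per-speaker statistics are accumulated offline, the only work at adaptation time is scoring one utterance against the speaker GMMs and summing a handful of precomputed arrays, which is consistent with the few-second adaptation time reported above.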
Cite as: Gomez, R., Lee, A., Saruwatari, H., Shikano, K. (2005) Rapid unsupervised speaker adaptation based on multi-template HMM sufficient statistics in noisy environments. Proc. Interspeech 2005, 293-296, doi: 10.21437/Interspeech.2005-161
@inproceedings{gomez05_interspeech,
  author={Randy Gomez and Akinobu Lee and Hiroshi Saruwatari and Kiyohiro Shikano},
  title={{Rapid unsupervised speaker adaptation based on multi-template HMM sufficient statistics in noisy environments}},
  year=2005,
  booktitle={Proc. Interspeech 2005},
  pages={293--296},
  doi={10.21437/Interspeech.2005-161}
}