Conversational speech exhibits considerable pronunciation variability, which has been shown to have a detrimental effect on the accuracy of automatic speech recognition. There have been many attempts to model pronunciation variation, including the use of decision trees to generate alternate word pronunciations from phonemic baseforms. Use of such pronunciation models during recognition is known to improve accuracy. This paper describes the use of such pronunciation models during acoustic model training. Subtle difficulties in the straightforward use of alternatives to canonical pronunciations are first illustrated: it is shown that simply improving the accuracy of the phonetic transcription used for acoustic model training is of little benefit. Analysis of this paradox leads to a new method of accommodating nonstandard pronunciations: rather than allowing a phoneme in the canonical pronunciation to be realized as one of a few distinct alternate phones predicted by the pronunciation model, the HMM states of the phoneme's model are instead allowed to share Gaussian mixture components with the HMM states of the model of the alternate realization. Qualitatively, this amounts to making a soft decision about which surface form is realized. Quantitative experiments on the Switchboard corpus show that this method improves accuracy by 1.7% (absolute).
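The core idea can be illustrated with a minimal sketch: two HMM states draw their mixture components from a shared pool, so a state for a canonical phone can place some weight on a component from the alternate realization's model. All means, variances, weights, and phone labels below are hypothetical, chosen only to show the mechanism; they are not from the paper.

```python
import math

def gauss_pdf(x, mean, var):
    # 1-D Gaussian density (stand-in for one dimension of a diagonal-covariance GMM)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical shared pool of Gaussian components (illustrative parameters)
pool = {
    "g_t":  (1.0, 0.5),   # component originally from the /t/ model
    "g_dx": (3.0, 0.8),   # component originally from the /dx/ (flap) model
}

# Each HMM state keeps its own mixture weights, but may reference
# components belonging to another phone's model.
state_t_weights  = {"g_t": 0.7, "g_dx": 0.3}  # /t/ state borrows the flap component
state_dx_weights = {"g_t": 0.1, "g_dx": 0.9}

def state_likelihood(x, weights):
    # Mixture likelihood over the shared component pool
    return sum(w * gauss_pdf(x, *pool[name]) for name, w in weights.items())

# An observation near the flap's mean still receives credit under the /t/ state,
# which is the "soft decision" between surface forms:
x = 2.8
p_t  = state_likelihood(x, state_t_weights)
p_dx = state_likelihood(x, state_dx_weights)
```

Because the /t/ state shares the flap's component, its likelihood for a flapped realization exceeds what its own Gaussian alone would assign, without committing to a single discrete surface form.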
Cite as: Saraclar, M., Nock, H., Khudanpur, S. (1999) Pronunciation modeling by sharing gaussian densities across phonetic models. Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999), 515-518, doi: 10.21437/Eurospeech.1999-132
@inproceedings{saraclar99_eurospeech,
  author    = {Murat Saraclar and Harriet Nock and Sanjeev Khudanpur},
  title     = {{Pronunciation modeling by sharing gaussian densities across phonetic models}},
  year      = {1999},
  booktitle = {Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999)},
  pages     = {515--518},
  doi       = {10.21437/Eurospeech.1999-132}
}