A statistical coarticulatory model is presented for spontaneous speech recognition, in which knowledge of the dynamic, target-directed behavior of the vocal tract resonances responsible for the production of highly coarticulated speech is incorporated into recognizer design, training, and likelihood computation. The principal advantage of the new speech model over the conventional HMM is its compact internal structure, which parsimoniously represents long-span context dependence in the observable domain of speech acoustics without using additional, context-dependent model parameters. The new model is formulated mathematically as a constrained, nonstationary, and nonlinear dynamic system, for which a version of the generalized EM algorithm is developed and implemented to learn the compact set of model parameters automatically. Speech recognition experiments using spontaneous speech data from the SWITCHBOARD corpus are reported.
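The claim that target-directed hidden dynamics can yield context dependence without context-dependent parameters can be illustrated with a minimal sketch. The abstract does not give the model equations, so everything below is an assumption made for illustration: a noise-free first-order recursion `z[t+1] = phi * z[t] + (1 - phi) * target`, the function names `simulate_vtr` and `observe`, the value `phi = 0.8`, the target frequencies, and the logarithmic observation mapping are all hypothetical and are not taken from the paper.

```python
import math


def simulate_vtr(segments, phi=0.8, z0=500.0):
    """Noise-free, target-directed trajectory (illustrative assumption).

    segments: list of (target, duration_in_frames) pairs.
    Each frame moves the hidden state a fraction (1 - phi) toward the target.
    """
    z = z0
    trajectory = []
    for target, duration in segments:
        for _ in range(duration):
            z = phi * z + (1.0 - phi) * target
            trajectory.append(z)
    return trajectory


def observe(z):
    """Hypothetical smooth nonlinear map from hidden state to an acoustic feature."""
    return math.log(z)


# The same short segment (target 1500, 3 frames) after two different left contexts.
after_low = simulate_vtr([(300.0, 10), (1500.0, 3)])
after_high = simulate_vtr([(2500.0, 10), (1500.0, 3)])

# The hidden state undershoots or overshoots the shared target depending on
# where it started, so the observations differ although the target is the same.
print(after_low[-1], after_high[-1])
print(observe(after_low[-1]), observe(after_high[-1]))
```

In this sketch the short segment never reaches its target, and the direction of the miss depends on the preceding segment, so the context effect comes from the continuity of the hidden state rather than from extra per-context parameters. A sufficiently long segment converges to its target regardless of context.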
Cite as: Deng, L., Ma, J. (1999) A statistical coarticulatory model for the hidden vocal-tract-resonance dynamics. Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999), 1499-1502, doi: 10.21437/Eurospeech.1999-260
@inproceedings{deng99_eurospeech,
  author={Li Deng and Jeff Ma},
  title={{A statistical coarticulatory model for the hidden vocal-tract-resonance dynamics}},
  year=1999,
  booktitle={Proc. 6th European Conference on Speech Communication and Technology (Eurospeech 1999)},
  pages={1499--1502},
  doi={10.21437/Eurospeech.1999-260}
}