This paper addresses a problem often encountered in visual speech synthesis: estimating control points when the available video data are limited. First, Hidden Markov Models (HMMs) are estimated for each viseme present in the stored video data. Second, models are generated for each triseme (a viseme in the context of its preceding and following visemes) in the training set. Next, a decision tree is used to cluster and relate states of the HMMs that are similar in a contextual and statistical sense. The tree is also used to estimate HMMs for trisemes absent from the stored video data whenever control points for such trisemes are required to synthesize the lip motion for a sentence. Finally, these HMMs are used to generate sequences of visual speech control points for the trisemes not occurring in the stored data. Comparisons of mouth shapes produced from the synthesized control points with control points estimated from video withheld from HMM training indicate that the process yields accurate control points for the trisemes tested. The paper thus establishes a useful method for synthesizing realistic, audio-synchronized video of facial features.
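To make the pipeline concrete, below is a minimal sketch (not the authors' implementation) of the per-viseme and per-triseme HMM idea, assuming the hmmlearn library, with random arrays standing in for control-point trajectories. The paper's decision-tree state clustering is simplified here to a plain back-off from an unseen triseme to its center viseme's model, and all names (train_hmm, synthesize, N_DIM) are hypothetical.

# Sketch only: viseme/triseme HMMs with a back-off for unseen trisemes.
# Assumes hmmlearn is installed; real control-point trajectories would
# replace the random data used here.
import numpy as np
from hmmlearn.hmm import GaussianHMM

N_DIM = 4  # hypothetical number of lip control-point coordinates per frame

def train_hmm(sequences, n_states=3):
    # Fit a Gaussian HMM to a list of (T_i, N_DIM) trajectory arrays.
    X = np.concatenate(sequences)
    lengths = [len(s) for s in sequences]
    model = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

rng = np.random.default_rng(0)
def fake_trajectories(n=5):
    # Stand-in training data: n short trajectories of random control points.
    return [rng.normal(size=(int(rng.integers(8, 15)), N_DIM)) for _ in range(n)]

viseme_models  = {v: train_hmm(fake_trajectories()) for v in ["a", "m", "o"]}
triseme_models = {("m", "a", "o"): train_hmm(fake_trajectories())}

def synthesize(prev, center, nxt, n_frames=12):
    # Generate a control-point sequence for a triseme; if the triseme was
    # never seen, back off to the center viseme's HMM (the paper instead
    # uses decision-tree clustered states to build the missing model).
    model = triseme_models.get((prev, center, nxt), viseme_models[center])
    frames, _ = model.sample(n_frames)
    return frames

print(synthesize("a", "m", "o").shape)  # unseen triseme -> falls back to viseme "m"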
Cite as: Arb, A., Gustafson, S., Anderson, T., Slyh, R. (2001) Hidden Markov models for visual speech synthesis with limited data. Proc. Auditory-Visual Speech Processing, 84-89
@inproceedings{arb01_avsp,
  author={Allan Arb and Steven Gustafson and Timothy Anderson and Raymond Slyh},
  title={{Hidden Markov models for visual speech synthesis with limited data}},
  year={2001},
  booktitle={Proc. Auditory-Visual Speech Processing},
  pages={84--89}
}