5th International Conference on Spoken Language Processing
This paper proposes a method for re-estimating output visual parameters in speech-to-lip movement synthesis using audio-visual Hidden Markov Models (HMMs) under the Expectation-Maximization (EM) algorithm. Among the conventional methods for speech-to-lip movement synthesis is one that estimates a visual parameter sequence through Viterbi alignment of the input acoustic speech signal using audio HMMs. However, this HMM-Viterbi method has a substantial problem: an incorrect HMM state alignment may produce incorrect visual parameters. The problem stems from the deterministic synthesis process, which assigns a single HMM state to each input audio frame. The proposed method avoids this deterministic process by re-estimating the visual parameters non-deterministically while maximizing the likelihood of the audio-visual observation sequence under the EM algorithm.
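The contrast between the two synthesis strategies can be illustrated with a toy sketch. It is not the authors' implementation. It assumes scalar audio and visual features, one Gaussian per HMM state, and hypothetical function names (`viterbi`, `posteriors`, `synthesize`). The update used here, a posterior-weighted and variance-normalized average of the per-state visual means, is an assumed form of the re-estimation step, chosen only to show how soft state occupancies replace the single Viterbi state.

```python
import math

def gauss(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def viterbi(lik, trans, init):
    """Most likely state path given per-frame state likelihoods lik[t][j]."""
    T, N = len(lik), len(init)
    delta = [[0.0] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    for j in range(N):
        delta[0][j] = init[j] * lik[0][j]
    for t in range(1, T):
        for j in range(N):
            best = max(range(N), key=lambda i: delta[t - 1][i] * trans[i][j])
            back[t][j] = best
            delta[t][j] = delta[t - 1][best] * trans[best][j] * lik[t][j]
    path = [max(range(N), key=lambda j: delta[T - 1][j])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

def posteriors(lik, trans, init):
    """State occupancy probabilities gamma[t][j] via scaled forward-backward."""
    T, N = len(lik), len(init)
    norm = lambda v: [x / sum(v) for x in v]
    alpha = [norm([init[j] * lik[0][j] for j in range(N)])]
    for t in range(1, T):
        alpha.append(norm([
            lik[t][j] * sum(alpha[t - 1][i] * trans[i][j] for i in range(N))
            for j in range(N)
        ]))
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = norm([
            sum(trans[i][j] * lik[t + 1][j] * beta[t + 1][j] for j in range(N))
            for i in range(N)
        ])
    return [norm([alpha[t][j] * beta[t][j] for j in range(N)]) for t in range(T)]

def synthesize(audio, a_mean, a_var, v_mean, v_var, trans, init, n_iter=5):
    """Return (hard, soft) visual sequences for a scalar audio sequence."""
    N = len(init)
    a_lik = [[gauss(a, a_mean[j], a_var[j]) for j in range(N)] for a in audio]
    # Deterministic baseline: one state per frame, output that state's visual mean.
    hard = [v_mean[j] for j in viterbi(a_lik, trans, init)]
    soft = list(hard)  # initialize the iteration from the Viterbi output
    for _ in range(n_iter):
        # E-step (assumed form): occupancies from audio plus current visual estimate.
        joint = [[a_lik[t][j] * gauss(soft[t], v_mean[j], v_var[j])
                  for j in range(N)] for t in range(len(audio))]
        gamma = posteriors(joint, trans, init)
        # M-step (assumed form): variance-normalized, posterior-weighted visual mean.
        soft = [sum(g[j] * v_mean[j] / v_var[j] for j in range(N)) /
                sum(g[j] / v_var[j] for j in range(N)) for g in gamma]
    return hard, soft
```

Calling `synthesize([0.0, 0.4, 0.6, 1.0], [0.0, 1.0], [0.25, 0.25], [0.1, 0.9], [0.04, 0.04], [[0.7, 0.3], [0.3, 0.7]], [0.5, 0.5])` returns both sequences for a two-state toy model. Each hard value is exactly one state's visual mean, while each soft value is a blend of the state means, so a frame with an ambiguous alignment is not forced onto a single state.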
Bibliographic reference. Yamamoto, Eli / Nakamura, Satoshi / Shikano, Kiyohiro (1998): "Speech-to-lip movement synthesis based on the EM algorithm using audio-visual HMMs", In ICSLP-1998, paper 0756.