5th International Conference on Spoken Language Processing

Sydney, Australia
November 30 - December 4, 1998

Speech-to-Lip Movement Synthesis Based on the EM Algorithm Using Audio-Visual HMMs

Eli Yamamoto, Satoshi Nakamura, Kiyohiro Shikano

Nara Institute of Science and Technology, Japan

This paper proposes a method to re-estimate output visual parameters for speech-to-lip movement synthesis using audio-visual Hidden Markov Models (HMMs) under the Expectation-Maximization (EM) algorithm. Among conventional methods for speech-to-lip movement synthesis, one approach estimates a visual parameter sequence through the Viterbi alignment of an input acoustic speech signal against audio HMMs. However, this HMM-Viterbi method suffers from a substantial problem: an incorrect HMM state alignment may produce incorrect visual parameters. The problem stems from the deterministic synthesis process, which assigns a single HMM state to each input audio frame. The proposed method avoids this deterministic assignment by re-estimating the visual parameters non-deterministically, maximizing the likelihood of the audio-visual observation sequence under the EM algorithm.
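To illustrate the contrast between the two synthesis styles, the following minimal Python/NumPy sketch produces each visual frame as the posterior-weighted average of per-state visual means, with the posteriors computed by the forward-backward algorithm, instead of committing to a single Viterbi state per frame. All names and the model layout (forward_backward, a_means, v_means, the toy HMM) are illustrative assumptions, not the authors' implementation; in particular, the paper re-estimates the visual parameters iteratively under EM, whereas this sketch shows only the soft-assignment idea behind it.

    import numpy as np
    from scipy.special import logsumexp

    def log_gauss(x, mean, var):
        # Per-frame log-density of a diagonal-covariance Gaussian.
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var, axis=-1)

    def forward_backward(log_b, log_A, log_pi):
        # Standard HMM forward-backward recursion in the log domain.
        # log_b: (T, N) frame log-likelihoods; log_A: (N, N); log_pi: (N,).
        # Returns gamma: (T, N) posterior probability of each state per frame.
        T, N = log_b.shape
        log_alpha = np.zeros((T, N))
        log_beta = np.zeros((T, N))
        log_alpha[0] = log_pi + log_b[0]
        for t in range(1, T):
            log_alpha[t] = log_b[t] + logsumexp(log_alpha[t - 1][:, None] + log_A, axis=0)
        for t in range(T - 2, -1, -1):
            log_beta[t] = logsumexp(log_A + (log_b[t + 1] + log_beta[t + 1])[None, :], axis=1)
        log_gamma = log_alpha + log_beta
        log_gamma -= logsumexp(log_gamma, axis=1, keepdims=True)
        return np.exp(log_gamma)

    def synthesize_lips(audio, hmm):
        # Soft synthesis: each output frame is the posterior-weighted mixture of
        # the states' visual means. A Viterbi-style synthesizer would instead
        # commit each frame to the visual mean of one best-path state, so a
        # single alignment error directly yields a wrong visual parameter.
        log_b = np.stack(
            [log_gauss(audio, m, v) for m, v in zip(hmm["a_means"], hmm["a_vars"])],
            axis=1,
        )
        gamma = forward_backward(log_b, np.log(hmm["A"]), np.log(hmm["pi"]))
        return gamma @ hmm["v_means"]  # (T, visual_dim)

    # Toy usage: a hypothetical 3-state audio-visual HMM with 2-D audio
    # features and a 1-D visual parameter (e.g. degree of lip opening).
    rng = np.random.default_rng(0)
    hmm = {
        "pi": np.full(3, 1.0 / 3.0),
        "A": np.full((3, 3), 0.1) + 0.7 * np.eye(3),
        "a_means": rng.normal(size=(3, 2)),
        "a_vars": np.ones((3, 2)),
        "v_means": np.linspace(0.0, 1.0, 3)[:, None],
    }
    audio = rng.normal(size=(50, 2))
    print(synthesize_lips(audio, hmm).shape)  # -> (50, 1)

Because the output mixes the visual means of all plausible states rather than trusting one alignment, a frame whose state identity is ambiguous yields an intermediate lip shape instead of a possibly wrong hard choice.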


Bibliographic reference. Yamamoto, Eli / Nakamura, Satoshi / Shikano, Kiyohiro (1998): "Speech-to-lip movement synthesis based on the EM algorithm using audio-visual HMMs", In ICSLP-1998, paper 0756.