Auditory-Visual Speech Processing (AVSP) 2009

University of East Anglia, Norwich, UK
September 10-13, 2009

HMM-based Motion Trajectory Generation for Speech Animation Synthesis

Lijuan Wang (1), Wei Han (2), Xiaojun Qian (3), Frank Soong (1)

(1) Microsoft Research Asia, Beijing, China
(2) Department of Computer Science & Engineering, Shanghai Jiao Tong University, China
(3) Department of Systems Engineering and Engineering Management, Chinese University of Hong Kong, Hong Kong, China

Synthesis of realistic facial animation for arbitrary speech is an important but difficult problem. The difficulties lie in the synchronization between lip motion and speech, articulation variation under different phonetic contexts, and expression variation across different speaking styles. To address these problems, we propose a visual speech synthesis system based on a five-state, multi-stream HMM, which generates synchronized motion trajectories for the given text and speech input. Since the motion and the speech are modeled as distinct but coherent streams, synchronization at each state is guaranteed. By considering phonetic context and suprasegmental information, context-dependent phone models are constructed and clustered with classification and regression trees, which capture the variation in phonetic context and speaking style. Experimental results show that the HMM-based method can generate realistic lip animation while preserving detailed articulation and transitions. Moreover, it can present articulation variation under different phonetic contexts and express various speaking styles, such as emphasized speech.
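The trajectory-generation step described here follows the standard maximum-likelihood parameter generation (MLPG) approach used in HMM-based synthesis: per-frame Gaussian statistics for static and dynamic (delta) features, taken from the selected HMM states, are stacked into a linear system o = Wc, and the smooth static trajectory c is recovered in closed form. The NumPy sketch below illustrates that closed-form solve for a one-dimensional trajectory; it is not the authors' implementation, and the function name `mlpg`, the delta window, and the diagonal-covariance assumption are illustrative choices.

```python
import numpy as np

def mlpg(means, variances, win=(-0.5, 0.0, 0.5)):
    """Illustrative maximum-likelihood parameter generation (MLPG).

    means, variances: (T, 2) arrays of per-frame Gaussian means/variances
    for the static and delta components of a 1-D motion trajectory.
    Returns the static trajectory c (length T) maximizing the likelihood
    under the stacked linear constraint o = W c.
    """
    T = means.shape[0]
    # Build W: the top T rows select statics; the bottom T rows apply
    # the delta window, e.g. delta_c[t] = 0.5*c[t+1] - 0.5*c[t-1].
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)
    for t in range(T):
        for k, w in zip((-1, 0, 1), win):
            if 0 <= t + k < T:
                W[T + t, t + k] += w
    mu = np.concatenate([means[:, 0], means[:, 1]])
    prec = 1.0 / np.concatenate([variances[:, 0], variances[:, 1]])
    # Weighted least squares: c = (W' P W)^{-1} W' P mu, with P diagonal.
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

A full system would apply this per dimension of the facial motion feature vector, typically with a delta-delta window as well; the closed-form solve is what yields the smooth state-to-state transitions the abstract refers to.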


Bibliographic reference.  Wang, Lijuan / Han, Wei / Qian, Xiaojun / Soong, Frank (2009): "HMM-based motion trajectory generation for speech animation synthesis", In AVSP-2009, 170.