Auditory-Visual Speech Processing (AVSP) 2009
University of East Anglia, Norwich, UK
Synthesis of realistic facial animation for arbitrary speech is an important but difficult problem. The difficulties lie in the synchronization between lip motion and speech, articulation variation under different phonetic contexts, and expression variation across different speaking styles. To address these problems, we propose a visual speech synthesis system based on a five-state, multi-stream HMM, which generates synchronized motion trajectories for the given text and speech input. Since the motion and the speech are modeled as different but coherent streams, synchronization at each state is guaranteed. By considering phonetic context and suprasegmental information, context-dependent phone models are constructed and clustered using classification and regression trees, which capture variable phonetic context and speaking style. Experimental results show that the HMM-based method can generate realistic lip animation while preserving detailed articulation and transitions. Moreover, it is capable of presenting articulation variation under different phonetic contexts and expressing various speaking styles, such as emphasized speech.
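The trajectory generation step described above is commonly realized with the maximum-likelihood parameter generation algorithm used in HMM-based synthesis: the state sequence supplies per-frame Gaussian means and variances for static and delta features, and the smooth static trajectory is obtained by solving a weighted least-squares system under the delta constraint. The sketch below illustrates this for a single one-dimensional feature; the function name, the 0.5*(c[t+1]-c[t-1]) delta window, and the input layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def generate_trajectory(means, variances):
    """Illustrative maximum-likelihood trajectory generation.

    means, variances: (T, 2) arrays giving per-frame Gaussian means and
    variances for a 1-D static feature and its delta, as emitted by the
    HMM state sequence. Returns the static trajectory c (length T) that
    maximises likelihood under the assumed delta window
    delta_t = 0.5 * (c[t+1] - c[t-1]).
    """
    T = means.shape[0]
    # Window matrix W maps the static trajectory c to the stacked
    # observation vector o = [s_0, d_0, s_1, d_1, ...].
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                # static row: s_t = c_t
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5   # delta row, left neighbour
        if t < T - 1:
            W[2 * t + 1, t + 1] = 0.5    # delta row, right neighbour
    mu = means.reshape(-1)               # interleaved static/delta means
    prec = 1.0 / variances.reshape(-1)   # diagonal precisions
    # Solve W' P W c = W' P mu  (normal equations of the weighted LS fit).
    A = W.T @ (prec[:, None] * W)
    b = W.T @ (prec * mu)
    return np.linalg.solve(A, b)
```

In a full system this is solved per utterance over all visual feature dimensions (and the banded structure of W is exploited for efficiency); small delta variances pull the trajectory toward smooth transitions, which is what yields the coarticulated lip motion rather than a piecewise-constant readout of state means.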
Bibliographic reference. Wang, Lijuan / Han, Wei / Qian, Xiaojun / Soong, Frank (2009): "HMM-based motion trajectory generation for speech animation synthesis", In AVSP-2009, 170.