Auditory-Visual Speech Processing (AVSP) 2011

Volterra, Italy
September 1-2, 2011

Photo-Realistic Visual Speech Synthesis Based on AAM Features and an Articulatory DBN Model with Constrained Asynchrony

Peng Wu (1,2), Dongmei Jiang (1,2), He Zhang (1,2), Hichem Sahli (3,4)

VUB-NPU Joint Research Group on AVSP
(1) Northwestern Polytechnical University, Xi’an 710072, China
(2) Shaanxi Provincial Key Laboratory on Speech, Image and Information Processing
(3) Vrije Universiteit Brussel (VUB) - AVSP, Department ETRO,
(4) Interuniversity Microelectronics Centre – IMEC, VUB-ETRO, Brussels, Belgium

This paper presents a photo-realistic visual speech synthesis method based on an audio-visual articulatory dynamic Bayesian network model (AF_AVDBN), in which the maximum asynchronies between articulatory features, such as lips, tongue, and glottis/velum, can be controlled. Perceptual linear prediction (PLP) features from the audio speech and active appearance model (AAM) features from mouth images of the visual speech are adopted to train the AF_AVDBN model for continuous speech. An EM-based optimal visual feature learning algorithm is derived given the input auditory speech and the trained AF_AVDBN parameters. Finally, photo-realistic mouth images are synthesized from the learned AAM features. In the experiments, mouth animations are synthesized for 30 connected-digit audio speech sentences. Objective evaluation results show that the visual features learned with the AF_AVDBN track the real parameters much more closely than those from the audio-visual state-synchronous DBN model (SS_DBN, the DBN implementation of the multi-stream hidden Markov model), as well as those from the state-asynchronous DBN model (SA_DBN). Subjective evaluation results show that by considering the asynchronies between articulatory features in the AF_AVDBN (and between audio and visual states in the SA_DBN), good synchronization between the audio speech and the mouth animations is obtained. Moreover, since the AF_AVDBN captures the dynamic movements of the articulatory features and models the pronunciation process more precisely, the accuracy of the mouth animations from the AF_AVDBN is much higher than that from the SA_DBN and SS_DBN models: very accurate, clear, and natural mouth animations can be obtained with the AF_AVDBN model and AAM features.
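The final step described above, synthesizing mouth images from learned AAM features, follows the standard AAM linear generative model: an appearance vector is reconstructed as the mean appearance plus a linear combination of appearance modes. A minimal sketch, assuming precomputed mean and modes (all names and the toy data are hypothetical, not from the paper):

```python
import numpy as np

def synthesize_mouth_image(aam_params, mean_appearance, appearance_modes, image_shape):
    """Reconstruct a mouth image from AAM appearance parameters.

    Standard AAM generative model: appearance = mean + modes @ params.
    (Shape normalization / piecewise-affine warping is omitted for brevity.)
    """
    appearance = mean_appearance + appearance_modes @ aam_params
    # Clip to the valid pixel range and reshape the vector to an image grid.
    return np.clip(appearance, 0.0, 255.0).reshape(image_shape)

# Illustrative toy example with random modes (not real AAM data).
rng = np.random.default_rng(0)
h, w = 32, 48                                   # hypothetical mouth-region size
mean_app = np.full(h * w, 128.0)                # mean appearance vector
modes = rng.standard_normal((h * w, 10))        # 10 appearance modes
params = rng.standard_normal(10)                # learned AAM parameters
img = synthesize_mouth_image(params, mean_app, modes, (h, w))
```

In the paper's pipeline, the `params` vector would be the per-frame visual feature inferred by the EM algorithm from the AF_AVDBN, and consecutive reconstructed frames form the mouth animation.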

Index Terms. visual speech synthesis, AF_AVDBN, asynchrony, AAM features

Full Paper
Video 3 (avi)    Video 7 (avi)    Video 15 (avi)    Video 17 (avi)   

Bibliographic reference.  Wu, Peng / Jiang, Dongmei / Zhang, He / Sahli, Hichem (2011): "Photo-realistic visual speech synthesis based on AAM features and an articulatory DBN model with constrained asynchrony", In AVSP-2011, 61-66.