ISCA Archive SSW 2010

Photo-real lips synthesis with trajectory-guided sample selection

Lijuan Wang, Xiaojun Qian, Wei Han, Frank K. Soong

In this paper, we propose an HMM trajectory-guided, real image sample concatenation approach to photo-real talking head synthesis. It renders a smooth and natural video of the articulators in sync with given speech signals. An audio-visual database is first used to train a statistical Hidden Markov Model (HMM) of lip movement, and the trained model is then used to generate a visual parameter trajectory of lip movement for given speech signals, all in the maximum likelihood sense. The HMM-generated trajectory then serves as a guide to select, from the original training database, an optimal sequence of mouth images, which are stitched back onto a background head video. The whole procedure is fully automatic and data-driven. With as little as 20 minutes of audio/video footage from a speaker, the proposed system can synthesize a highly photo-real video in sync with the given speech signals. This system won first place in the Audio-Visual match contest of the LIPS2009 Challenge, which was perceptually evaluated by recruited human subjects.
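The trajectory-guided selection step can be illustrated with a minimal sketch. The abstract does not specify the cost functions or search procedure, so everything below is an assumption made for illustration: the function name `select_samples`, the Euclidean target cost between the generated trajectory and each stored sample, the concatenation cost that is zero for samples adjacent in the original recording, and the Viterbi-style dynamic programming search. It is not the authors' implementation.

```python
import numpy as np

def select_samples(trajectory, library, concat_weight=1.0):
    """Pick one library sample per frame by dynamic programming.

    trajectory: (T, D) visual parameters, e.g. generated by an HMM.
    library:    (N, D) visual parameters of stored mouth images,
                assumed to be in original recording order.
    Both cost definitions are illustrative assumptions.
    """
    trajectory = np.asarray(trajectory, dtype=float)
    library = np.asarray(library, dtype=float)

    # Target cost (T, N): distance from each guide frame to each sample.
    target = np.linalg.norm(
        trajectory[:, None, :] - library[None, :, :], axis=2)

    # Concatenation cost (N, N): distance between two samples, set to
    # zero when sample j directly follows sample i in the recording.
    concat = np.linalg.norm(
        library[:, None, :] - library[None, :, :], axis=2)
    idx = np.arange(len(library) - 1)
    concat[idx, idx + 1] = 0.0
    concat *= concat_weight

    T, N = target.shape
    cost = target[0].copy()              # best cost ending at each sample
    back = np.zeros((T, N), dtype=int)   # back-pointers for the best path
    for t in range(1, T):
        total = cost[:, None] + concat   # total[i, j]: move from i to j
        back[t] = np.argmin(total, axis=0)
        cost = total[back[t], np.arange(N)] + target[t]

    # Trace the lowest-cost path back from the final frame.
    path = [int(np.argmin(cost))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


# Toy example: one-dimensional features, four stored samples.
print(select_samples([[0.0], [1.0], [2.0]], [[0.0], [1.0], [2.0], [3.0]]))
```

In this sketch, the zero cost for consecutive samples favours reusing contiguous runs of recorded frames, one common way to keep concatenated video smooth; `concat_weight` is a hypothetical knob for trading closeness to the guide trajectory against smoothness.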

Index Terms: visual speech synthesis, photo-real, talking head, trajectory-guided

Cite as: Wang, L., Qian, X., Han, W., Soong, F.K. (2010) Photo-real lips synthesis with trajectory-guided sample selection. Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7), 217-222

  author={Lijuan Wang and Xiaojun Qian and Wei Han and Frank K. Soong},
  title={{Photo-real lips synthesis with trajectory-guided sample selection}},
  booktitle={Proc. 7th ISCA Workshop on Speech Synthesis (SSW 7)},