We propose a new photo-realistic talking head driven by voice only, i.e., no linguistic information from the voice input is needed. The core of the new talking head is a context-dependent, multi-layer Deep Neural Network (DNN), discriminatively trained on hundreds of hours of speaker-independent speech data. The trained DNN is then used to map acoustic speech input probabilistically to 9,000 tied "senone" states. For each photo-realistic talking head, an HMM-based lip motion synthesizer is trained on the speaker's audio/visual training data, where states are statistically mapped to the corresponding lip images. At test time, for a given speech input, the DNN predicts the likely states via their posterior probabilities, and photo-realistic lip animation is then rendered through the DNN-predicted state lattice. The DNN, trained on speaker-independent English data, has also been tested with input in other languages, e.g. Mandarin and Spanish, to mimic the lip movements cross-lingually. Subjective experiments show that lip motions thus rendered for 15 non-English languages are highly synchronized with the audio input and perceptually photo-realistic to human eyes.
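The decoding pipeline described above (DNN emits per-frame senone posteriors, which drive a state-to-lip-image rendering) can be sketched minimally as follows. This is an illustrative assumption-laden sketch, not the authors' implementation: the DNN output is faked with random logits, the greedy per-frame argmax stands in for the paper's state-lattice rendering, and `lip_image_for_state` is a hypothetical lookup from a senone state to a lip image learned from the speaker's audio/visual data.

```python
import numpy as np

N_FRAMES = 5        # acoustic frames in a toy input utterance
N_SENONES = 9000    # tied senone states, as in the paper

rng = np.random.default_rng(0)

# Stand-in for the DNN output: per-frame senone posteriors
# (softmax over random logits; each row sums to 1).
logits = rng.normal(size=(N_FRAMES, N_SENONES))
posteriors = np.exp(logits - logits.max(axis=1, keepdims=True))
posteriors /= posteriors.sum(axis=1, keepdims=True)

# Greedy decode: most probable senone per frame.  The paper renders
# through a state *lattice*; argmax is a deliberate simplification.
state_path = posteriors.argmax(axis=1)

def lip_image_for_state(state_id: int) -> np.ndarray:
    """Hypothetical state-to-lip-image lookup: here, a dummy 8x8
    grayscale patch generated deterministically from the state id."""
    patch_rng = np.random.default_rng(state_id)
    return patch_rng.random((8, 8))

# One lip frame per acoustic frame.
animation = np.stack([lip_image_for_state(s) for s in state_path])
print(animation.shape)  # → (5, 8, 8)
```

Because the DNN posteriors are the only interface between the speech front end and the renderer, swapping in audio of another language changes only the `posteriors` array, which is consistent with the cross-lingual behavior the abstract reports.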
Bibliographic reference. Zhang, Xinjian / Wang, Lijuan / Li, Gang / Seide, Frank / Soong, Frank K. (2013): "A new language independent, photo-realistic talking head driven by voice only", In INTERSPEECH-2013, 2743-2747.