12th Annual Conference of the International Speech Communication Association

Florence, Italy
August 27-31, 2011

Text Driven 3D Photo-Realistic Talking Head

Lijuan Wang, Wei Han, Frank K. Soong, Qiang Huo

Microsoft Research Asia, China

We propose a new 3D talking head with a personalized, photo-realistic appearance. Different head motions and facial expressions can be freely controlled and rendered. It extends our prior high-quality 2D photo-realistic talking head to 3D. First, around 20 minutes of audio-visual 2D video are recorded of a speaker reading prompted sentences. We use a 2D-to-3D reconstruction algorithm to automatically adapt a general 3D head mesh model to the speaker. In training, super feature vectors consisting of 3D geometry, texture, and speech features are formed to train a statistical, multi-streamed hidden Markov model (HMM). The HMM is then used to synthesize the trajectories of both geometry animation and dynamic texture. The 3D talking head animation can be controlled by the rendered geometric trajectory, while the facial expressions and articulator movements are rendered with the dynamic 2D image sequences. Head motions and facial expressions can also be controlled separately by manipulating the corresponding parameters. The new 3D talking head has many useful applications, such as voice agents, telepresence, gaming, and social networking.
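As a rough illustration of the "super feature vector" idea mentioned in the abstract, the sketch below concatenates time-aligned per-frame geometry, texture, and speech features into one vector per frame, and records which slice belongs to which stream so that a multi-stream model could treat them separately. This is only a minimal sketch under stated assumptions: the function name `build_super_vectors`, the stream names, and the toy feature dimensions are hypothetical, and the abstract does not specify how the authors actually parameterize or align the three streams.

```python
from typing import Dict, List, Sequence, Tuple

Frame = Sequence[float]


def build_super_vectors(
    streams: Dict[str, List[Frame]],
    order: Sequence[str] = ("geometry", "texture", "speech"),
) -> Tuple[List[List[float]], Dict[str, Tuple[int, int]]]:
    """Concatenate time-aligned per-frame features from several streams.

    Hypothetical helper for illustration; not the authors' implementation.
    Returns the per-frame super vectors and the (start, end) slice of
    each stream inside a super vector.
    """
    n_frames = len(streams[order[0]])
    if n_frames == 0:
        raise ValueError("streams must contain at least one frame")
    for name in order:
        if len(streams[name]) != n_frames:
            raise ValueError(f"stream {name!r} is not frame-aligned")

    # Record which slice of the super vector belongs to which stream.
    bounds: Dict[str, Tuple[int, int]] = {}
    offset = 0
    for name in order:
        dim = len(streams[name][0])
        bounds[name] = (offset, offset + dim)
        offset += dim

    # Build one concatenated vector per frame, in the given stream order.
    vectors: List[List[float]] = []
    for t in range(n_frames):
        vec: List[float] = []
        for name in order:
            vec.extend(streams[name][t])
        vectors.append(vec)
    return vectors, bounds


# Toy data with made-up dimensions: 3 geometry, 2 texture, 1 speech value per frame.
toy = {
    "geometry": [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4]],
    "texture": [[1.0, 2.0], [1.5, 2.5]],
    "speech": [[9.0], [8.0]],
}
vectors, bounds = build_super_vectors(toy)
```

Keeping the stream boundaries alongside the vectors is one simple way to let each stream be modeled or weighted on its own later; the actual stream structure used in the paper may differ.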


Bibliographic reference. Wang, Lijuan / Han, Wei / Soong, Frank K. / Huo, Qiang (2011): "Text driven 3D photo-realistic talking head", in INTERSPEECH-2011, 3307-3308.