ESCA Workshop on Audio-Visual Speech Processing (AVSP'97)

September 26-27, 1997
Rhodes, Greece

Videorealistic Talking Faces: A Morphing Approach

Tony Ezzat, Tomaso Poggio

MIT Center for Biological and Computational Learning, Cambridge, MA, USA

We present a method for the construction of a videorealistic text-to-audiovisual speech synthesizer. A visual corpus of a subject enunciating a set of key words is initially recorded. The key words are chosen so that they collectively contain most of the American English viseme images, which are subsequently identified and extracted from the data by hand. Next, using optical flow methods borrowed from the computer vision literature, we compute realistic transitions from every viseme to every other viseme. The images along these transition paths are generated using a morphing method. Finally, we exploit phoneme and timing information extracted from a text-to-speech synthesizer to determine which viseme transitions to use, and the rate at which the morphing process should occur. In this manner, we are able to synchronize the visual speech stream with the audio speech stream, and hence give the impression of a videorealistic talking face.
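The final step, turning phoneme timings into a per-frame morphing rate, can be illustrated with a minimal sketch. It is not the authors' implementation: the phoneme-to-viseme table, the function name `morph_schedule`, and the rule that the transition into each viseme is spread evenly over that phoneme's duration are all assumptions made for illustration. The optical-flow warping and image blending themselves are omitted; the sketch only computes which viseme pair and which interpolation parameter each video frame would use.

```python
# Hypothetical phoneme-to-viseme table (illustrative only, not from the paper).
PHONEME_TO_VISEME = {
    "m": "closed", "b": "closed", "p": "closed",
    "aa": "open", "iy": "spread", "uw": "rounded",
}


def morph_schedule(phonemes, fps=30):
    """Return one (source_viseme, target_viseme, alpha) tuple per video frame.

    phonemes: list of (phoneme, duration_ms) pairs, as a text-to-speech
    front end might report them.
    Assumption: the transition into each viseme is spread evenly over
    the duration of the phoneme it belongs to.
    """
    visemes = [(PHONEME_TO_VISEME[p], d) for p, d in phonemes]
    frames = []
    # Walk over consecutive viseme pairs: (source, target).
    for (src, _), (dst, duration_ms) in zip(visemes, visemes[1:]):
        # Number of video frames available for this transition (at least one).
        n = max(1, round(duration_ms * fps / 1000))
        for k in range(1, n + 1):
            # alpha runs from 1/n up to 1.0; alpha == 1.0 is the target viseme.
            frames.append((src, dst, k / n))
    return frames


# Example: the phoneme sequence m-aa-p rendered at 10 frames per second.
schedule = morph_schedule([("m", 100), ("aa", 200), ("p", 100)], fps=10)
for src, dst, alpha in schedule:
    print(f"{src} -> {dst}  alpha={alpha:.2f}")
```

A longer phoneme yields more frames for its transition, so the morph proceeds more slowly; a shorter one compresses the same transition into fewer frames. This is one simple way to keep a frame sequence aligned with the audio timing.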

Bibliographic reference. Ezzat, Tony / Poggio, Tomaso (1997): "Videorealistic talking faces: a morphing approach", in AVSP-1997, 141-144.