ESCA Workshop on Audio-Visual Speech Processing (AVSP'97)
September 26-27, 1997
We present a method for constructing a videorealistic text-to-audiovisual speech synthesizer. A visual corpus of a subject enunciating a set of key words is initially recorded. The key words are chosen so that they collectively contain most of the American English viseme images, which are subsequently identified and extracted from the data by hand. Next, using optical flow methods borrowed from the computer vision literature, we compute realistic transitions from every viseme to every other viseme. The images along these transition paths are generated using a morphing method. Finally, we exploit phoneme and timing information extracted from a text-to-speech synthesizer to determine which viseme transitions to use, and the rate at which the morphing should occur. In this manner, we synchronize the visual speech stream with the audio speech stream, and hence give the impression of a videorealistic talking face.
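The core of the approach is a flow-based morph: given a dense optical-flow field between two viseme images, intermediate frames are produced by warping each endpoint image partway along the flow and cross-dissolving the results. The following is a minimal NumPy sketch of that idea, not the authors' implementation; the backward-sampling warp (evaluating the flow at the destination pixel) and nearest-neighbour interpolation are simplifying assumptions made here for brevity.

```python
import numpy as np

def flow_morph(img_a, img_b, flow, alpha):
    """Morph between two viseme images along an optical-flow field.

    img_a, img_b : (H, W) grayscale images (endpoint visemes).
    flow         : (H, W, 2) forward flow mapping img_a onto img_b,
                   flow[..., 0] = x-displacement, flow[..., 1] = y-displacement.
    alpha        : morph parameter in [0, 1]; 0 -> img_a, 1 -> img_b.
    """
    h, w = img_a.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]

    def warp(img, dx, dy):
        # Backward sampling: read each output pixel from (x - dx, y - dy),
        # clipped to the image bounds, with nearest-neighbour rounding.
        sx = np.clip(np.rint(xs - dx), 0, w - 1).astype(int)
        sy = np.clip(np.rint(ys - dy), 0, h - 1).astype(int)
        return img[sy, sx]

    # Warp img_a forward by a fraction alpha of the flow, and img_b
    # backward by the remaining fraction (1 - alpha).
    warped_a = warp(img_a, alpha * flow[..., 0], alpha * flow[..., 1])
    warped_b = warp(img_b, -(1 - alpha) * flow[..., 0],
                           -(1 - alpha) * flow[..., 1])

    # Cross-dissolve the two geometrically aligned images.
    return (1 - alpha) * warped_a + alpha * warped_b
```

Sampling `alpha` at the frame rate over the duration reported by the text-to-speech synthesizer for each phoneme then yields a transition whose speed matches the audio, which is how the visual and audio streams stay synchronized.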
Bibliographic reference. Ezzat, Tony / Poggio, Tomaso (1997): "Videorealistic talking faces: a morphing approach", In AVSP-1997, 141-144.