14th Annual Conference of the International Speech Communication Association

Lyon, France
August 25-29, 2013

Photo-Realistic Expressive Text to Talking Head Synthesis

Vincent Wan (1), Robert Anderson (2), Art Blokland (1), Norbert Braunschweiler (1), Langzhou Chen (1), BalaKrishna Kolluru (1), Javier Latorre (1), Ranniery Maia (1), Björn Stenger (1), Kayoko Yanagisawa (1), Yannis Stylianou (1), Masami Akamine (3), M. J. F. Gales (1), Roberto Cipolla (1)

(1) Toshiba Research Europe Ltd., UK
(2) University of Cambridge, UK
(3) Toshiba, Japan

A controllable computer-animated avatar that could be used as a natural user interface for computers is demonstrated. Driven by text and emotion input, it generates expressive speech with corresponding facial movements. To create the avatar, HMM-based text-to-speech synthesis is combined with active appearance model (AAM)-based facial animation. The novelty is the degree of control achieved over the expressiveness of both the speech and the face while keeping the controls simple. Controllability is achieved by training both the speech and facial parameters within a cluster adaptive training (CAT) framework. CAT creates a continuous, low-dimensional eigenspace of expressions, which allows creating expressions of different intensity (including ones more intense than those in the original recordings) and combining different expressions to create new ones. Results on an emotion-recognition task show that recognition rates given the synthetic output are comparable to those given the original videos of the speaker.
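
As a rough illustration of the expression control described above, the sketch below treats each expression as a point in a small weight space and shows how intensity scaling and blending could work. It is a toy example under stated assumptions, not the authors' implementation: the function names, the two-dimensional weight vectors, and the cluster means are all hypothetical. It assumes the usual CAT formulation in which a model mean is a bias-cluster mean plus a weighted sum of the other cluster means.

```python
from typing import List, Sequence

Vector = List[float]


def scale_expression(neutral: Sequence[float],
                     expression: Sequence[float],
                     intensity: float) -> Vector:
    """Move from the neutral weights toward an expression's weights.

    intensity = 1.0 reproduces the expression; values above 1.0
    extrapolate beyond it (a more intense version of the expression).
    """
    return [n + intensity * (e - n) for n, e in zip(neutral, expression)]


def blend_expressions(neutral: Sequence[float],
                      expressions: Sequence[Sequence[float]],
                      amounts: Sequence[float]) -> Vector:
    """Combine several expressions by summing their scaled offsets
    from the neutral weights."""
    result = list(neutral)
    for expr, amount in zip(expressions, amounts):
        for i, (n, e) in enumerate(zip(neutral, expr)):
            result[i] += amount * (e - n)
    return result


def cat_mean(bias_mean: Sequence[float],
             cluster_means: Sequence[Sequence[float]],
             weights: Sequence[float]) -> Vector:
    """Form a model mean as the bias-cluster mean (weight fixed to 1)
    plus a weighted sum of the remaining cluster means."""
    mean = list(bias_mean)
    for w, cluster in zip(weights, cluster_means):
        for d, value in enumerate(cluster):
            mean[d] += w * value
    return mean


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
neutral = [0.0, 0.0]
happy = [1.0, 0.0]
angry = [0.0, 1.0]

very_happy = scale_expression(neutral, happy, 1.5)
mixed = blend_expressions(neutral, [happy, angry], [0.5, 0.5])

bias_mean = [1.0, 2.0]
cluster_means = [[0.5, 0.0], [0.0, -1.0]]
very_happy_mean = cat_mean(bias_mean, cluster_means, very_happy)
```

Because the weights live in a continuous space, the same weight vector can in principle drive both the speech and the face models, which is what keeps the controls simple: a user only chooses an expression mix and an intensity.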

Bibliographic reference.  Wan, Vincent / Anderson, Robert / Blokland, Art / Braunschweiler, Norbert / Chen, Langzhou / Kolluru, BalaKrishna / Latorre, Javier / Maia, Ranniery / Stenger, Björn / Yanagisawa, Kayoko / Stylianou, Yannis / Akamine, Masami / Gales, M. J. F. / Cipolla, Roberto (2013): "Photo-realistic expressive text to talking head synthesis", In INTERSPEECH-2013, 2667-2669.